1
0
Fork 0
mirror of synced 2025-03-06 20:59:54 +01:00
Commit graph

401 commits

Author SHA1 Message Date
Stanley.Yang
7ec11c2f65 drm/amdgpu: Fix ineffective ras_mask settings
Check amdgpu_ras_mask to fix ineffective ras_mask setting
due to special asic without sram ecc enable but with poison
supported.

Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-02-26 11:14:37 -05:00
Lijo Lazar
1b6ef74b2b drm/amdgpu: Add fatal error detected flag
For a RAS error that needs a full reset to recover, set the fatal error
status. Clear the status once the device is reset.

Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Asad Kamal <asad.kamal@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-02-26 11:14:24 -05:00
Tao Zhou
edfdde9013 drm/amdgpu: disable RAS feature when fini
Send RAS disable feature command in fini.

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-01-31 14:05:18 -05:00
Hawking Zhang
1731ba9b64 drm/amdgpu: Update boot time errors polling sequence
Update boot time errors polling sequence to align with
the latest firmware change.

Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Frank Min <Frank.Min@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-01-31 14:04:55 -05:00
YiPeng Chai
ed1e1e42fd drm/amdgpu: Support passing poison consumption ras block to SRIOV
Support passing poison consumption ras blocks
to SRIOV.

Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-01-25 14:58:03 -05:00
Yang Wang
c0c48f0d61 drm/amdgpu: adjust aca init/fini sequence to match gpu reset
- move aca init/fini function into ras init/fini to adapt gpu reset
  sequence.
- add new function amdgpu_aca_reset()

Signed-off-by: Yang Wang <kevinyang.wang@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-01-25 14:58:02 -05:00
Mukul Joshi
c84a7e21db drm/amdgpu: Fix module unload hang with RAS enabled
The driver unload hangs because the page retirement
kthread cannot be stopped as it is sleeping and waiting
on page retirement event to occur. Add kthread_should_stop()
to the event condition to wake up the kthread when kthread
stop is called during driver unload.

Fixes: 3fdcd0a31d ("drm/amdgpu: Prepare for asynchronous processing of umc page retirement")
Signed-off-by: Mukul Joshi <mukul.joshi@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-01-25 14:57:52 -05:00
Yang Wang
2866a4549c drm/amdgpu: skip call ras_late_init if ras block is not supported
skip call ras_late_init callback if ras block is not supported.

Signed-off-by: Yang Wang <kevinyang.wang@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-01-22 17:13:28 -05:00
YiPeng Chai
0795b5d234 drm/amdgpu:Support retiring multiple MCA error address pages
Support retiring multiple MCA error address pages in
one in-band query for umc v12_0.

Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-01-22 17:13:25 -05:00
YiPeng Chai
6c23f3d12a drm/amdgpu: Use asynchronous polling to handle umc_v12_0 poisoning
Use asynchronous polling to handle umc_v12_0 poisoning.

v2:
  1. Change function name.
  2. Change the debugging information content.

Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-01-22 17:13:25 -05:00
Stanley.Yang
ee9c3031d0 drm/amdgpu: Fix ras features value calltrace
The high three bits of ras features mask indicate socket
id, it should skip to check high three bits of ras features
mask before disable all ras features.

Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-01-22 17:13:25 -05:00
YiPeng Chai
3fdcd0a31d drm/amdgpu: Prepare for asynchronous processing of umc page retirement
Preparing for asynchronous processing of umc page retirement.

Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-01-22 17:13:25 -05:00
Stanley.Yang
2c7a1560e8 drm/amdgpu: Show deferred error count for UMC
Show deferred error count for UMC syfs node

Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-01-18 15:47:07 -05:00
Yang Wang
7ed97155b2 drm/amdgpu: fix UBSAN array-index-out-of-bounds for ras_block_string[]
fix array index out of bounds issue for ras_block_string[] array.

Fixes: 30df05fb74 ("drm/amdgpu: Align ras block enum with firmware")
Signed-off-by: Yang Wang <kevinyang.wang@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-01-18 15:46:07 -05:00
Hawking Zhang
4e2965bd3b drm/amdgpu: Centralize ras cap query to amdgpu_ras_check_supported
Move ras capablity check to amdgpu_ras_check_supported.
Driver will query ras capablity through psp interace, or
vbios interface, or specific ip callbacks.

Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-01-15 18:35:39 -05:00
Candice Li
46e2231ce0 drm/amdgpu: Log deferred error separately
Separate deferred error from UE and CE and log it
individually.

Signed-off-by: Candice Li <candice.li@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-01-15 18:35:37 -05:00
Yang Wang
37973b69ea drm/amdgpu: add aca sysfs support
add aca sysfs node support

Signed-off-by: Yang Wang <kevinyang.wang@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-01-15 18:35:36 -05:00
Yang Wang
04c4fcd263 drm/amdgpu: add amdgpu ras aca query interface
v1:
add ACA error query interface

v2:
Add a new helper function to determine whether to use ACA or MCA.

Signed-off-by: Yang Wang <kevinyang.wang@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-01-15 18:35:36 -05:00
Yang Wang
33dcda51e9 drm/amdgpu: add ACA bank dump debugfs support
add ACA bank dump debugfs support

Signed-off-by: Yang Wang <kevinyang.wang@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-01-15 18:35:35 -05:00
Hawking Zhang
cce4febb27 drm/amdgpu: Add ras helper to query boot errors v2
Add ras helper function to query boot time gpu
errors.
v2: use aqua_vanjaram smn addressing pattern

Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Yang Wang <kevinyang.wang@amd.com>
Reviewed-by: Le Ma <le.ma@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-01-15 18:35:35 -05:00
Hawking Zhang
73cb81dc54 drm/amdgpu: Packed socket_id to ras feature mask
Initialize RAS feature mask bit[31:29] with socket_id.

Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-01-09 15:44:13 -05:00
Candice Li
fb1e917199 drm/amdgpu: Support poison error injection via ras_ctrl debugfs
Support poison error injection.

Signed-off-by: Candice Li <candice.li@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-01-09 15:44:13 -05:00
Candice Li
90bd01471d drm/amdgpu: Drop unnecessary sentences about CE and deferred error.
Remove "no user action is needed" for correctable and deferred error
to avoid confusion.

Signed-off-by: Candice Li <candice.li@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-01-09 15:43:54 -05:00
Hawking Zhang
6697dbf0af Revert "drm/amdgpu: enable mca debug mode on APU by default"
Not needed any more with firmware fixes

Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-01-05 16:04:36 -05:00
Srinivasan Shanmugam
b8d55a90fd drm/amdgpu: Fix possible NULL dereference in amdgpu_ras_query_error_status_helper()
Return invalid error code -EINVAL for invalid block id.

Fixes the below:

drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c:1183 amdgpu_ras_query_error_status_helper() error: we previously assumed 'info' could be null (see line 1176)

Suggested-by: Hawking Zhang <Hawking.Zhang@amd.com>
Cc: Tao Zhou <tao.zhou1@amd.com>
Cc: Hawking Zhang <Hawking.Zhang@amd.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-01-03 11:16:06 -05:00
Srinivasan Shanmugam
091411be7a drm/amdgpu: Use kzalloc instead of kmalloc+__GFP_ZERO in amdgpu_ras.c
Fixes the below smatch warnings:

drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c:2543 amdgpu_ras_recovery_init() warn: Please consider using kzalloc instead of kmalloc
drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c:2830 amdgpu_ras_init() warn: Please consider using kzalloc instead of kmalloc

Cc: Christian König <christian.koenig@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-01-03 11:16:06 -05:00
YiPeng Chai
9f91e983ee drm/amdgpu: MCA supports recording umc address information
MCA supports recording umc address information.

V2:
  Move err_addr variable from struct ras_err_node to
struct ras_err_info.

Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-12-19 14:59:03 -05:00
Dave Airlie
c1ee197d64 Linux 6.7-rc5
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAmV2PMQeHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGhgQIAKSdjtpkv+IeDwph
 umLHxDlCTT8nl+TAhEYuxSdMiHt1i7+zyF0snnHhGthBRnBZBf+PDdkQOmqtjLZj
 e591VGY7tdNCAVUwmRRaI82JFz9Pl1k8DXL3f+CE1+MfbpsirAecoIAL0rsvg+4f
 ICKSAOBZonL7GT7IGrsbxgqGKCLC+2PcgNT4pADBdjtbC1nmVSLK21dtWr6i4xsn
 CT9MxAnej1+EOpreyHhNZUVqfQy0/04x4OY0u/EOXU3PgpfGTrAzkyfwzcd/eit0
 x59TZ0owDVXfvvdXUUAC0zG1th2ek5LKNtL+C4rKOoAbq7GZoQfjN5M/Ug+yykwE
 aqm6CJ8=
 =tZ+R
 -----END PGP SIGNATURE-----

Backmerge tag 'v6.7-rc5' into drm-next

Linux 6.7-rc5

Alex requested this for some amdkfd work relying on the symbols exports.

Signed-off-by: Dave Airlie <airlied@redhat.com>
2023-12-12 11:32:33 +10:00
Yang Wang
dbf3850d12 drm/amdgpu: optimize the printing order of error data
sort error data list to optimize the printing order.

Signed-off-by: Yang Wang <kevinyang.wang@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-12-06 16:05:32 -05:00
Yang Wang
f875f61b1f drm/amdgpu: enable mca debug mode on APU by default
enable MCA debug mode on APU device by default.

Signed-off-by: Yang Wang <kevinyang.wang@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-11-30 18:26:31 -05:00
Lijo Lazar
201761b5eb drm/amdgpu: Move mca debug mode decision to ras
Refactor code such that ras block decides the default mca debug mode,
and not swsmu block.

By default mca debug mode is set to false.

v2: squash in uninitialized value fix (Alex)

Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-11-29 16:49:01 -05:00
Yang Wang
07ee43faeb drm/amdgpu: fix ras err_data null pointer issue in amdgpu_ras.c
fix ras err_data null pointer issue in amdgpu_ras.c

Fixes: 8cc0f5669e ("drm/amdgpu: Support multiple error query modes")
Signed-off-by: Yang Wang <kevinyang.wang@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-11-17 00:51:58 -05:00
Vitaly Prosyak
4638e0c29a drm/amdgpu: fix software pci_unplug on some chips
When software 'pci unplug' using IGT is executed we got a sysfs directory
entry is NULL for differant ras blocks like hdp, umc, etc.
Before call 'sysfs_remove_file_from_group' and 'sysfs_remove_group'
check that 'sd' is  not NULL.

[  +0.000001] RIP: 0010:sysfs_remove_group+0x83/0x90
[  +0.000002] Code: 31 c0 31 d2 31 f6 31 ff e9 9a a8 b4 00 4c 89 e7 e8 f2 a2 ff ff eb c2 49 8b 55 00 48 8b 33 48 c7 c7 80 65 94 82 e8 cd 82 bb ff <0f> 0b eb cc 66 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90 90 90 90
[  +0.000001] RSP: 0018:ffffc90002067c90 EFLAGS: 00010246
[  +0.000002] RAX: 0000000000000000 RBX: ffffffff824ea180 RCX: 0000000000000000
[  +0.000001] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
[  +0.000001] RBP: ffffc90002067ca8 R08: 0000000000000000 R09: 0000000000000000
[  +0.000001] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[  +0.000001] R13: ffff88810a395f48 R14: ffff888101aab0d0 R15: 0000000000000000
[  +0.000001] FS:  00007f5ddaa43a00(0000) GS:ffff88841e800000(0000) knlGS:0000000000000000
[  +0.000002] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  +0.000001] CR2: 00007f8ffa61ba50 CR3: 0000000106432000 CR4: 0000000000350ef0
[  +0.000001] Call Trace:
[  +0.000001]  <TASK>
[  +0.000001]  ? show_regs+0x72/0x90
[  +0.000002]  ? sysfs_remove_group+0x83/0x90
[  +0.000002]  ? __warn+0x8d/0x160
[  +0.000001]  ? sysfs_remove_group+0x83/0x90
[  +0.000001]  ? report_bug+0x1bb/0x1d0
[  +0.000003]  ? handle_bug+0x46/0x90
[  +0.000001]  ? exc_invalid_op+0x19/0x80
[  +0.000002]  ? asm_exc_invalid_op+0x1b/0x20
[  +0.000003]  ? sysfs_remove_group+0x83/0x90
[  +0.000001]  dpm_sysfs_remove+0x61/0x70
[  +0.000002]  device_del+0xa3/0x3d0
[  +0.000002]  ? ktime_get_mono_fast_ns+0x46/0xb0
[  +0.000002]  device_unregister+0x18/0x70
[  +0.000001]  i2c_del_adapter+0x26d/0x330
[  +0.000002]  arcturus_i2c_control_fini+0x25/0x50 [amdgpu]
[  +0.000236]  smu_sw_fini+0x38/0x260 [amdgpu]
[  +0.000241]  amdgpu_device_fini_sw+0x116/0x670 [amdgpu]
[  +0.000186]  ? mutex_lock+0x13/0x50
[  +0.000003]  amdgpu_driver_release_kms+0x16/0x40 [amdgpu]
[  +0.000192]  drm_minor_release+0x4f/0x80 [drm]
[  +0.000025]  drm_release+0xfe/0x150 [drm]
[  +0.000027]  __fput+0x9f/0x290
[  +0.000002]  ____fput+0xe/0x20
[  +0.000002]  task_work_run+0x61/0xa0
[  +0.000002]  exit_to_user_mode_prepare+0x150/0x170
[  +0.000002]  syscall_exit_to_user_mode+0x2a/0x50

Cc: Hawking Zhang <hawking.zhang@amd.com>
Cc: Luben Tuikov <luben.tuikov@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Christian Koenig <christian.koenig@amd.com>
Signed-off-by: Vitaly Prosyak <vitaly.prosyak@amd.com>
Reviewed-by: Luben Tuikov <luben.tuikov@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-11-09 17:02:49 -05:00
Hawking Zhang
8cc0f5669e drm/amdgpu: Support multiple error query modes
Direct error query mode and firmware error query mode
are supported for now.

Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Yang Wang <kevinyang.wang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-11-09 17:01:58 -05:00
Tao Zhou
18eae367cb drm/amdgpu: check recovery status of xgmi hive in ras_reset_error_count
Handle xgmi hive case.

Suggested-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Stanley.Yang <Stanley.Yang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-11-03 12:18:32 -04:00
Tao Zhou
d1d4c0b7b6 drm/amdgpu: check RAS supported first in ras_reset_error_count
Not all platforms support RAS.

Fixes: 73582be11a ("drm/amdgpu: bypass RAS error reset in some conditions")
Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Yang Wang <kevinyang.wang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-10-31 16:40:15 -04:00
Tao Zhou
73582be11a drm/amdgpu: bypass RAS error reset in some conditions
PMFW is responsible for RAS error reset in some conditions, driver can
skip the operation.

v2: add check for ras->in_recovery, it's set earlier than
amdgpu_in_reset.

v3: fix error in gpu reset check.

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-10-26 18:41:22 -04:00
Tao Zhou
f3a3bbf156 drm/amdgpu: enable RAS poison mode for APU
Enable it by default on APU platform.

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Stanley.Yang <Stanley.Yang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-10-26 18:41:22 -04:00
Yang Wang
ec3e0a9167 drm/amdgpu: refine ras error kernel log print
refine ras error kernel log to avoid user-ridden ambiguity.

Signed-off-by: Yang Wang <kevinyang.wang@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-10-26 18:41:21 -04:00
Yang Wang
53d4d77927 drm/amdgpu: fix find ras error node error
the origin function might return the wrong node.

Fixes: 5b1270beb3 ("drm/amdgpu: add ras_err_info to identify RAS error source")
Signed-off-by: Yang Wang <kevinyang.wang@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-10-26 18:41:21 -04:00
Tao Zhou
8096df7664 drm/amdgpu: add set/get mca debug mode operations
Record the debug mode status in RAS.

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-10-20 15:11:28 -04:00
Stanley.Yang
66d64e4e03 drm/amdgpu: Enable RAS feature by default for APU
Enable RAS feature by default for aqua vanjaram on apu
platform.

Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-10-20 15:11:28 -04:00
Yang Wang
49c260bef3 drm/amdgpu: fix typo for amdgpu ras error data print
typo fix.

Fixes: 5b1270beb3 ("drm/amdgpu: add ras_err_info to identify RAS error source")
Signed-off-by: Yang Wang <kevinyang.wang@amd.com>
Reviewed-by: Candice Li <candice.li@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-10-20 15:11:28 -04:00
Stanley.Yang
8a65661114 drm/amdgpu: Fix delete nodes that have been relesed
Fix delete nodes that it has been freed.

Fixes: 5b1270beb3 ("drm/amdgpu: add ras_err_info to identify RAS error source")
Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com>
Reviewed-by: Yang Wang <kevinyang.wang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-10-20 15:11:27 -04:00
Hawking Zhang
9248462d7e drm/amdgpu: Enable software RAS in vcn v4_0_3
Set VCN/JPEG RAS masks to enable software RAS for
VCN and JPEG.

Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-10-20 15:11:26 -04:00
Tao Zhou
472c5fb297 drm/amdgpu: define ras_reset_error_count function
Make the code architecture more simple.

v2: reuse ras_reset_error_count in ras_reset_error_status.

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-10-20 15:11:26 -04:00
Asad Kamal
53dd920c1f drm/amdgpu : Add hive ras recovery check
If one of the devices in the hive detects a
fatal error, need to send ras recovery reset
message to PMFW of all devices in the hive.
For that add a flag in hive to indicate that
it's undergoing ras recovery

Signed-off-by: Asad Kamal <asad.kamal@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-10-19 18:26:51 -04:00
Yang Wang
5b1270beb3 drm/amdgpu: add ras_err_info to identify RAS error source
introduced "ras_err_info" to better identify a RAS ERROR source.

NOTE:
For legacy chips, keep the original RAS error print format.

v1:
RAS errors may come from different dies during a RAS error query,
therefore, need a new data structure to identify the source of RAS ERROR.

v2:
- use new data structure 'amdgpu_smuio_mcm_config_info' instead of
  ras_err_id (in v1 patch)
- refine ras error dump function name
- refine ras error dump log format

Signed-off-by: Yang Wang <kevinyang.wang@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-10-13 11:35:35 -04:00
Asad Kamal
625e5f3851 drm/amdgpu: Expose ras version & schema info
Expose ras table version & schema info to sysfs

v2: Updated schema to get poison support info
from ras context, removed asic specific checks

Signed-off-by: Asad Kamal <asad.kamal@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-10-13 11:02:54 -04:00
Stanley.Yang
a83f2bf1f4 drm/amdgpu: Fix false positive error log
It should first check block ras obj whether be set, it should
return 0 directly if block ras obj or hw_ops is not set.
If block doesn't support RAS just return 0 is fine.

Changed from V1:
	return 0 directly if block ras obj or hw ops is not set

Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-09-20 16:24:09 -04:00