1
0
Fork 0
mirror of synced 2025-03-06 20:59:54 +01:00
Commit graph

35 commits

Author SHA1 Message Date
YiPeng Chai
e023874081 drm/amdgpu: support ACA logging ecc errors
support ACA logging ecc errors.

Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-04-26 17:22:42 -04:00
YiPeng Chai
314c38cde6 drm/amdgpu: retire bad pages for umc v12_0
Retire bad pages for umc v12_0.

Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-04-26 17:22:42 -04:00
YiPeng Chai
f27defca68 drm/amdgpu: umc v12_0 logs ecc errors
1. umc v12_0 logs ecc errors.
2. Reserve newly detected ecc error pages.
3. Add tag for bad pages, so that they can
   be retired later.

Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-04-26 17:22:41 -04:00
YiPeng Chai
b2aa6b108d drm/amdgpu: umc v12_0 converts error address
Umc v12_0 converts error address.

Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-04-26 17:22:41 -04:00
YiPeng Chai
95b4063de4 drm/amdgpu: add interface to update umc v12_0 ecc status
Add interface to update umc v12_0 ecc status.

Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-04-26 17:22:41 -04:00
Tao Zhou
4b0cb230bd drm/amdgpu: retire UMC v12 mca_addr_to_pa
RAS TA will handle it, the function is useless.

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-04-09 22:09:15 -04:00
Tao Zhou
8e4617c25d drm/amdgpu: simplify convert_error_address interface for UMC v12
Replace separate parameters with struct ta_ras_query_address_input.

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Stanley.Yang <Stanley.Yang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-03-22 15:56:18 -04:00
Tao Zhou
8b3495eafb drm/amdgpu: add socket id parameter for psp query address cmd
And set the socket id.

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Stanley.Yang <Stanley.Yang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-03-22 15:54:54 -04:00
Yang Wang
f7bcfb7a56 drm/amdgpu: retrieve umc odecc error count for aca umc v12.0
retrieve umc odecc error count for aca umc v12.0

Signed-off-by: Yang Wang <kevinyang.wang@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-03-22 15:48:03 -04:00
Yang Wang
b93d759f54 drm/amdgpu: add umc v12.0.0 deferred error support
add umc v12.0.0 deferred error support.

Signed-off-by: Yang Wang <kevinyang.wang@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-03-20 13:38:15 -04:00
Yang Wang
e3d4de8d8b drm/amdgpu: retire unused aca_bank_report data structure
retire unused aca_bank_report data structure.

Signed-off-by: Yang Wang <kevinyang.wang@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-03-20 13:38:15 -04:00
Yang Wang
69bf42fbb2 drm/amdgpu: refine aca error cache for umc v12.0
refine aca error cache for umc v12.0

Signed-off-by: Yang Wang <kevinyang.wang@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-03-20 13:38:15 -04:00
Yang Wang
abc3b5d21d drm/amdgpu: add new aca_smu_type support
Add new types to distinguish between ACA error type and smu mca type.

e.g.:
the ACA_ERROR_TYPE_DEFERRED is not matched any smu mca valid bank
channel, so add new type 'aca_smu_type' to distinguish aca error type
and smu mca type.

Signed-off-by: Yang Wang <kevinyang.wang@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-03-20 13:38:14 -04:00
Yang Wang
9dc57c2adf drm/amdgpu: add ras event id support
add amdgpu ras event id support to better distinguish different
error information sources in dmesg logs.

the following log will be identify by event id:
{event_id} interrupt to inform RAS event
{event_id} ACA logs
{event_id} errors statistic since from current injection/error query
{event_id} errors statistic since from gpu load

Signed-off-by: Yang Wang <kevinyang.wang@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-03-20 13:38:13 -04:00
Tao Zhou
2c684b9342 drm/amdgpu: add deferred error check for UMC v12 address query
Both RAS UE and deferred errors need page retirement.

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-02-29 20:35:14 -05:00
Tao Zhou
01087a1974 drm/amdgpu: use PSP address query command
Get UMC physical address from PSP in RAS error address coversion.

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-01-31 14:05:19 -05:00
YiPeng Chai
0795b5d234 drm/amdgpu:Support retiring multiple MCA error address pages
Support retiring multiple MCA error address pages in
one in-band query for umc v12_0.

Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-01-22 17:13:25 -05:00
YiPeng Chai
afb617f38f drm/amdgpu: add interface to check mca umc status
Add interface to check mca umc status.

Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-01-22 17:13:25 -05:00
YiPeng Chai
22f6e3e112 drm/amdgpu: Add log info for umc_v12_0
Add log info for umc_v12_0.

v2:
 Delete redundant logs.

Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-01-22 17:13:25 -05:00
Tao Zhou
a9e4f61df1 drm/amdgpu: update error condition check for umc_v12_0_query_error_address
Deferred error is also taken into account.

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-01-18 15:47:24 -05:00
Candice Li
46e2231ce0 drm/amdgpu: Log deferred error separately
Separate deferred error from UE and CE and log it
individually.

Signed-off-by: Candice Li <candice.li@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-01-15 18:35:37 -05:00
Yang Wang
f38765de83 drm/amdgpu: add umc v12.0 ACA support
add umc v12.0 ACA driver support

Signed-off-by: Yang Wang <kevinyang.wang@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-01-15 18:35:36 -05:00
YiPeng Chai
99cab331a4 drm/amdgpu: Add umc page retirement for umc v12_0
Add umc page retirement for umc v12_0.

V2:
  1. Changed umc page retirement check condition
     to call umc_v12_0_is_uncorrectable_error.
  2. Use memset to clear the contents of the umc
     error address structure.

Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-12-19 14:59:03 -05:00
YiPeng Chai
a8c77a121c drm/amdgpu: Add poison mode check error condition for umc v12_0
Add poison mode check error condition for umc v12_0.

Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-12-19 14:59:03 -05:00
YiPeng Chai
9f91e983ee drm/amdgpu: MCA supports recording umc address information
MCA supports recording umc address information.

V2:
  Move err_addr variable from struct ras_err_node to
struct ras_err_info.

Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-12-19 14:59:03 -05:00
Yang Wang
bf13da6ae1 drm/amdgpu: correct smu v13.0.6 umc ras error check
correct smu v13.0.0 umc ras error check

Signed-off-by: Yang Wang <kevinyang.wang@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-11-09 17:01:20 -05:00
Candice Li
e020d01575 drm/amdgpu: Drop deferred error in uncorrectable error check
Drop checking deferred error which can be handled by poison
consumption.

Signed-off-by: Candice Li <candice.li@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-10-31 16:40:15 -04:00
Candice Li
d59fcfb084 drm/amdgpu: Identify data parity error corrected in replay mode
Use ErrorCodeExt field to identify data parity error in replay mode.

Signed-off-by: Candice Li <candice.li@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Yang Wang <kevinyang.wang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-10-27 14:15:03 -04:00
Candice Li
afcf949cf3 drm/amdgpu: Log UE corrected by replay as correctable error
Support replay mode where UE could be converted to CE.

Signed-off-by: Candice Li <candice.li@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-10-20 15:11:26 -04:00
Yang Wang
3bba4bc6a0 drm/amdgpu: add RAS error info support for umc_v12_0
add RAS error info support for umc_v12_0.

Signed-off-by: Yang Wang <kevinyang.wang@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-10-13 11:36:11 -04:00
Tao Zhou
f8754f58d6 drm/amdgpu: print channel index for UMC bad page
Print channel index for UMC v12.

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-09-20 16:25:17 -04:00
Tao Zhou
ced575203a drm/amdgpu: print more address info of UMC bad page
Print out row, column and bank value of UMC error address for UMC v12.

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-09-11 17:15:15 -04:00
Tao Zhou
3cb9ebc9d6 drm/amdgpu: add channel index table for UMC v12
Get UMC phyical channel index according to node id, umc instance and
channel instance.

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-09-11 17:10:58 -04:00
Tao Zhou
40a08fe890 drm/amdgpu: add address conversion for UMC v12
Convert MCA error address to physical address and find out all pages in
one physical row.

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-09-11 17:10:35 -04:00
Candice Li
7e6ec09974 drm/amdgpu: Add umc v12_0 ras functions
Add umc v12_0 ras error querying.

Signed-off-by: Candice Li <candice.li@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-09-06 14:38:00 -04:00