1
0
Fork 0
mirror of synced 2025-03-06 20:59:54 +01:00
Commit graph

348 commits

Author SHA1 Message Date
Hawking Zhang
9f9d4651f7 drm/amdgpu: fallback to old RAS error message for aqua_vanjaram
So driver doesn't generate incorrect message until
the new format is settled down for aqua_vanjaram

Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Yang Wang <kevinyang.wang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-09-11 17:15:02 -04:00
Hawking Zhang
bf7aa8bea9 drm/amdgpu: Free ras cmd input buffer properly
Do not access the pointer for ras input cmd buffer
if it is even not allocated.

Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Stanley Yang <Stanley.Yang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-08-30 15:51:16 -04:00
Hawking Zhang
ec70578c83 drm/amdgpu: Allow issue disable gfx ras cmd to firmware
Disable gfx ras command is needed in some use cases

Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-08-30 15:25:54 -04:00
YiPeng Chai
80578f1641 drm/amdgpu: Enable ras for mp0 v13_0_6 sriov
Enable ras for mp0 v13_0_6 sriov

Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Stanley.Yang <Stanley.Yang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-08-30 14:58:16 -04:00
YiPeng Chai
1b98a5f8e0 drm/amdgpu: mode1 reset needs to recover mp1 for mp0 v13_0_10
Mode1 reset needs to recover mp1 in fatal error case
for mp0 v13_0_10.

v2:
  Define a macro to wrap psp function calls.

Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-08-15 18:07:41 -04:00
Hawking Zhang
8b3a7a707c drm/amdgpu: Remove unnecessary ras cap check
RAS global isr will only be invoked by hardware
interrupt. Don't need to query ras capability in isr
In addition, amdgpu_ras_interrupt_fatal_error_handler
ensures the isr won't be called from guest linux
side by accident. The RAS cap check in isr that
introduced to fix sriov crash is not needed any more

Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-08-15 18:07:41 -04:00
Candice Li
bc0f80802d drm/amdgpu: Extend poison mode check to SDMA/VCN/JPEG
Treat SDMA/VCN/JPEG as RAS capable IP blocks in poison mode.

Signed-off-by: Candice Li <candice.li@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-08-09 09:46:05 -04:00
Tao Zhou
7692e1ee24 drm/amdgpu: add RAS fatal error handler for NBIO v7.9
Register RAS fatal error interrupt and add handler.

v2: only register NBIO RAS for dGPU platform.
    change nbio_v7_9_set_ras_controller_irq_state and nbio_v7_9_set_ras_err_event_athub_irq_state
    to dummy functions.

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-08-09 09:46:04 -04:00
Hawking Zhang
6fc9d92c3d drm/amdgpu: Issue ras enable_feature for gfx ip only
For non-GFX IP blocks, set up ras obj if ras feature
is allowed. For GFX IP blocks, force issue ras
enable_feature command to firmware and only set up ras
obj if ras feature is allowed

Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-08-07 17:12:49 -04:00
Hawking Zhang
62c4b772bd drm/amdgpu: Apply poison mode check to GFX IP only
For GFX IP that only supports poison consumption, GFX
RAS won't be marked as enabled. i.e., hardware doesn't
support gfx sram ecc. But driver still needs to issue
firmware to enable poison consumption mode for GFX IP.
In such case, check poison mode and treat GFX IP as
RAS capable IP block.

Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-08-07 17:12:49 -04:00
Hawking Zhang
f957138cc3 drm/amdgpu: Only create err_count sysfs when hw_op is supported
Some IP blocks only support partial ras feature and don't
have ras counter and/or ras error status register at all.
Driver should not create err_count sysfs node for those
IP blocks.

Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-08-07 17:12:49 -04:00
Stanley.Yang
fcb7a1849a drm/amdgpu: Check APU flag to disable RAS
Only disable RAS by default for aqua vanjaram on APU platform.

Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-07-25 13:47:26 -04:00
Mario Limonciello
adf64e2142 drm/amd: Avoid reading the VBIOS part number twice
The VBIOS part number is read both in amdgpu_atom_parse() as well
as in atom_get_vbios_pn() and stored twice in the `struct atom_context`
structure. Remove the first unnecessary read and move the `pr_info`
line from that read into the second.

v2: squash in unused variable removal

Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-07-21 16:52:24 -04:00
Stanley.Yang
276f6e8cb7 drm/amdgpu: Disable RAS by default on APU flatform
Disable RAS feature by default for aqua vanjaram on APU platform.

Changed from V1:
	Splite Disable RAS by default on APU platform into a
	separated patch.

Changed from V2:
	Avoid to modify global variable amdgpu_ras_enable.

Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-07-18 11:11:36 -04:00
Stanley.Yang
cb906ce32b drm/amdgpu: Enable aqua vanjaram RAS
Enable RAS for aqua vanjaram.

Changed from V1:
	Split the change in amdgpu_ras_asic_supported into a
	separated patch.

Changed from V2:
	Avoid to modify global variable amdgpu_ras_enable.

Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-07-18 11:11:23 -04:00
Tao Zhou
a80fe1a698 drm/amdgpu: skip address adjustment for GFX RAS injection
The address parameter of GFX RAS injection isn't related to XGMI node
number, keep it unchanged.

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Candice Li <candice.li@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-07-07 13:51:47 -04:00
YiPeng Chai
2c7cd280e5 drm/amdgpu: gpu recovers from fatal error in poison mode
Fatal error occurs in ras poison mode, mode1 reset
is used to recover gpu.

Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-06-30 13:12:15 -04:00
Stanley.Yang
43aedbf4da drm/amdgpu: Add checking mc_vram_size
Do not compare injection address with mc_vram_size
if mc_vram_size is zero.

Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-06-15 11:06:59 -04:00
Stanley.Yang
38298ce6fc drm/amdgpu: Optimize checking ras supported
Using "is_app_apu" to identify device in the native
APU mode or carveout mode.

Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-06-15 11:06:59 -04:00
Luben Tuikov
71344a718a drm/amdgpu: Fix usage of UMC fill record in RAS
The fixed commit listed in the Fixes tag below, introduced a bug in
amdgpu_ras.c::amdgpu_reserve_page_direct(), in that when introducing the new
amdgpu_umc_fill_error_record() and internally in that new function the physical
address (argument "uint64_t retired_page"--wrong name) is right-shifted by
AMDGPU_GPU_PAGE_SHIFT. Thus, in amdgpu_reserve_page_direct() when we pass
"address" to that new function, we should NOT right-shift it, since this
results, erroneously, in the page address to be 0 for first
2^(2*AMDGPU_GPU_PAGE_SHIFT) memory addresses.

This commit fixes this bug.

Cc: Tao Zhou <tao.zhou1@amd.com>
Cc: Hawking Zhang <Hawking.Zhang@amd.com>
Cc: Alex Deucher <Alexander.Deucher@amd.com>
Fixes: 400013b268 ("drm/amdgpu: add umc_fill_error_record to make code more simple")
Signed-off-by: Luben Tuikov <luben.tuikov@amd.com>
Link: https://lore.kernel.org/r/20230610113536.10621-1-luben.tuikov@amd.com
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-06-15 11:06:58 -04:00
Luben Tuikov
740f42a28f drm/amdgpu: Report ras_num_recs in debugfs
Report the number of records stored in the RAS EEPROM table in debugfs.

This can be used by user-space to calculate the capacity of the RAS EEPROM
table since "bad_page_cnt_threshold" is also reported in the same place in
debugfs.

See commit 7fb6407145 ("drm/amdgpu: Add bad_page_cnt_threshold to debugfs").

ras_num_recs can already be inferred by dumping the RAS EEPROM table, also in
the same debugfs location, see commit reference c65b0805e7 (drm/amdgpu:
RAS EEPROM table is now in debugfs, 2021-04-08). This commit makes it an
integer value easily shown in a single file.

Cc: Alex Deucher <Alexander.Deucher@amd.com>
Cc: Hawking Zhang <Hawking.Zhang@amd.com>
Cc: Tao Zhou <tao.zhou1@amd.com>
Cc: Stanley Yang <Stanley.Yang@amd.com>
Cc: John Clements <john.clements@amd.com>
Signed-off-by: Luben Tuikov <luben.tuikov@amd.com>
Link: https://lore.kernel.org/r/20230603051043.211548-1-luben.tuikov@amd.com
Acked-by: Alex Deucher <Alexander.Deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-06-15 11:06:58 -04:00
Stanley.Yang
7f599fed3b drm/amdgpu: Add support EEPROM table v2.1
Add ras info to EEPROM table, app can analyse device ECC
status without GPU driver through EEPROM table ras info.

Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-06-09 12:44:34 -04:00
Stanley.Yang
e3959cb547 drm/amdgpu: support check vcn jpeg block mask
Support VCN/JPEG instance mask checking, pass logical
mask directly except GFX/SDMA/VCN/JPEG blocks.

Changed from V1:
	correct a typo

Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-06-09 12:39:54 -04:00
YiPeng Chai
6c47a79b3b drm/amdgpu: perform mode2 reset for sdma fed error on gfx v11_0_3
perform mode2 reset for sdma fed error on gfx v11_0_3.

Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-06-09 10:38:19 -04:00
Tao Zhou
f464c5dd4d drm/amdgpu: add check for RAS instance mask
The mask is only needed to be set when RAS block instance number is
more than 1 and invalid bits should be also masked out.
We only check valid bits for GFX and SDMA block for now, and will
add check for other RAS blocks in the future.

v2: move the check under injection operation since the mask is only
    used by RAS error inject.
v3: add valid bits handling for SDMA.
v4: print message if the mask is adjusted.

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Stanley.Yang <Stanley.Yang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-06-09 10:37:18 -04:00
Tao Zhou
27c5f29526 drm/amdgpu: reorganize RAS injection flow
So GFX RAS injection could use default function if it doesn't define its
own injection interface.

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Stanley.Yang <Stanley.Yang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-06-09 10:37:13 -04:00
Tao Zhou
2c22ed0bdb drm/amdgpu: add instance mask for RAS inject
User can specify injected instances by the mask. For backward
compatibility, the mask value is incorporated into sub block index
without interface change of RAS TA.
User uses logical mask and driver should convert it to physical value
before sending it to RAS TA.

v2: update parameter name.

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Stanley.Yang <Stanley.Yang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-06-09 10:37:10 -04:00
Hawking Zhang
9b337b7d62 drm/amdgpu: Adjust the sequence to query ras error info
It turns out STATUS_VALID_FLAG needs to be checked
ahead of any other fields. ADDRESS_VALID_FLAG and
ERR_INFO_VALID_FLAG only manages ADDRESS and ERR_INFO
field respectively. driver should continue poll
ERR CNT field even ERR_INFO_VALD_FLAG is not set.

Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-06-09 09:58:32 -04:00
Hawking Zhang
8107e4996f drm/amdgpu: Enable persistent edc harvesting in APP APU
Persistent edc harvesting is supported in APP APU

Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-06-09 09:53:45 -04:00
Hawking Zhang
e53a3250f7 drm/amdgpu: Add common helper to reset ras error
Add common helper to reset ras error status. It
applies to IP blocks that follow the new ras error
logging register design, and need to write 0 to
reset the error status. For IP blocks that don't
support the new design, please still implement ip
specific helper.

Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-06-09 09:52:54 -04:00
Hawking Zhang
322a7e005d drm/amdgpu: Add common helper to query ras error (v2)
Add common helper to query ras error status and
log error information, including memory block id
and erorr count. The helpers are applicable to IP
blocks that follow the new ras error logging design.
For IP blocks that don't support the new design,
please still implement ip specific helper to query
ras error.

v2: optimize struct amdgpu_ras_err_status_reg_entry
and the implementaion in helper (Lijo/Tao)

Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-06-09 09:52:50 -04:00
Candice Li
8eba72053c drm/amdgpu: Drop pcie_bif ras check from fatal error handler
Some ASICs support fatal error event but do not
support pcie_bif ras.

Signed-off-by: Candice Li <candice.li@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-04-21 08:49:37 -04:00
Jane Jian
b805d8d785 Revert "drm/amdgpu: enable ras for mp0 v13_0_10 on SRIOV"
This reverts commit fe120b9f5c.
This patch impacts sriov multi-vf stability

Signed-off-by: Jane Jian <Jane.Jian@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-04-14 13:47:49 -04:00
Stanley.Yang
58bc2a9cbf drm/amdgpu: correct ras enabled flag
XGMI RAS should be according to the gmc xgmi physical nodes number,
XGMI RAS should not be enabled if xgmi num_physical_nodes is zero.

Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-04-11 18:03:45 -04:00
Hawking Zhang
9af357bc3e drm/amdgpu: Add fatal error handling in nbio v4_3
GPU will stop working once fatal error is detected.
it will inform driver to do reset to recover from
the fatal error.

v2: squash in logic fix (Srinivasan)
v3: squash in logic fix (Dan)

Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Candice Li <candice.li@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-03-31 11:18:32 -04:00
YiPeng Chai
fe120b9f5c drm/amdgpu: enable ras for mp0 v13_0_10 on SRIOV
Enable ras for mp0 v13_0_10 on SRIOV.

Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-03-22 00:58:37 -04:00
Hawking Zhang
370808876b drm/amdgpu: drop ras check at asic level for new blocks
amdgpu_ras_register_ras_block should always be invoked
by ras_sw_init, where driver needs to check ras caps
at ip level, instead of asic level.

Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Stanley Yang <Stanley.Yang@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-03-15 18:45:27 -04:00
Hawking Zhang
fdc94d3a8c drm/amdgpu: Rework pcie_bif ras sw_init
pcie_bif ras blocks needs to be initialized as early
as possible to handle fatal error detected in hw_init
phase. also align the pcie_bif ras sw_init with other
ras blocks

Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Stanley Yang <Stanley.Yang@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-03-15 18:45:27 -04:00
Tao Zhou
f3cbe70e21 drm/amdgpu: change default behavior of bad_page_threshold parameter
Ignore ras umc bad page threshold by default, GPU initialization won't
be stopped in this mode.

v2: refine the description of bad_page_threshold.

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Stanley.Yang <Stanley.Yang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-02-23 17:35:59 -05:00
Tao Zhou
4d33e0f134 drm/amdgpu: exclude duplicate pages from UMC RAS UE count
If a UMC bad page is reserved but not freed by an application, the
application may trigger uncorrectable error repeatly by accessing the page.

v2: add specific function to do the check.
v3: remove duplicate pages, calculate new added bad page number.
v4: reuse save_bad_pages to calculate new added bad page number.

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Stanley.Yang <Stanley.Yang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-02-23 17:35:59 -05:00
YiPeng Chai
8f453c51cf drm/amdgpu: Adjust ras support check condition for special asic
[Why]:
     Amdgpu ras uses amdgpu_ras_is_supported to check whether
  the ras block supports the ras function. amdgpu_ras_is_supported
  uses .ras_enabled to determine whether the ras function of the
  block is enabled.
     But for special asic with mem ecc enabled but sram ecc not
  enabled, some ras blocks support poison mode but their ras function
  is not enabled on .ras_enabled, these ras blocks will run abnormally.

[How]:
    If the ras block is not supported on .ras_enabled but the asic
  supports poison mode and the ras block has ras configuration, it
  can be considered that the ras block supports ras function.

Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-01-17 16:11:51 -05:00
YiPeng Chai
8c305a3fdf drm/amdgpu: Remove unnecessary ras block support check
[Why]:
   For special asic with mem ecc enabled but sram ecc
not enabled, some ras blocks can register their ras
configuration to ras list, but these ras blocks are not
enabled on .ras_enabled, so it can not get ras block
object using amdgpu_ras_get_ras_block.

[How]:
   Remove ras block support check. Even if the ras block
checked is not in the ras list, it will return a null
pointer and will have no effect.

Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-01-17 16:11:51 -05:00
YiPeng Chai
ac7b25d92c drm/amdgpu: Perform gpu reset after gfx finishes processing ras poison consumption on gfx_v11_0_3
Perform gpu reset after gfx finishes processing
ras poison consumption on gfx_v11_0_3.

V2:
 Move gfx poison consumption handler from hw_ops to ip
 function level.

V3:
 Adjust the calling position of amdgpu_gfx_poison_consumation_handler.

V4:
   Since gfx v11_0_3 does not have .hw_ops instance, the .hw_ops null
 pointer check in amdgpu_ras_interrupt_poison_consumption_handler
 needs to be adjusted.

Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-01-17 16:11:51 -05:00
Hawking Zhang
4a1c9a444b drm/amdgpu: allow query error counters for specific IP block
amdgpu_ras_block_late_init will be invoked in IP
specific ras_late_init call as a common helper for
all the IP blocks.

However, when amdgpu_ras_block_late_init call
amdgpu_ras_query_error_count to query ras error
counters, amdgpu_ras_query_error_count queries
all the IP blocks that support ras query interface.

This results to wrong error counters cached in
software copies when there are ras errors detected
at time zero or warm reset procedure. i.e., in
sdma_ras_late_init phase, it counts on sdma/mmhub
errors, while, in mmhub_ras_late_init phase, it
still counts on sdma/mmhub errors.

The change updates amdgpu_ras_query_error_count
interface to allow query specific ip error counter.
It introduces a new input parameter: query_info. if
query_info is NULL,  it means query all the IP blocks,
otherwise, only query the ip block specified by
query_info.

Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-01-05 11:42:14 -05:00
Stanley.Yang
c26cd99918 drm/amdgpu: remove enable ras cmd call trace
[Why]
    [   41.285804] RIP: 0010:amdgpu_ras_feature_enable+0x15c/0x310 [amdgpu]
    [   41.285945] Code: 48 89 c1 48 c7 c2 b9 f2 88 c1 48 c7 c0 c0 f2 88 c1 49 8b 3c 24 48 0f 44 d0 48 c7 c6 98 33 80 c1 e8 5f 52 75 d9 e9 fa fe ff ff <0f> 0b e9 66 ff ff ff 48 8b 3d 86 8c 0f da ba 00 04 00 00 be c0 0d
    [   41.285946] RSP: 0018:ffffbccdc72efc90 EFLAGS: 00010246
    [   41.285948] RAX: 0000000000000004 RBX: ffff931897406980 RCX: 0000000000000002
    [   41.285949] RDX: 0000000000000dc0 RSI: 0000000000000002 RDI: ffff931500042b00
    [   41.285950] RBP: ffffbccdc72efcc0 R08: 0000000000000002 R09: ffff931885b87000
    [   41.285951] R10: 0000000000ffff10 R11: 0000000000000001 R12: ffff931893e20000
    [   41.285952] R13: 0000000000000001 R14: ffff931885b87000 R15: 0000000000000000
    [   41.285953] FS:  0000000000000000(0000) GS:ffff931c6f200000(0000) knlGS:0000000000000000
    [   41.285954] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [   41.285955] CR2: 000055dd6f532008 CR3: 000000061b010006 CR4: 00000000003706e0
    [   41.285956] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    [   41.285957] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    [   41.285958] Call Trace:
    [   41.285959]  <TASK>
    [   41.285963]  ? gfx_v11_0_early_init+0x250/0x250 [amdgpu]
    [   41.286117]  gfx_v11_0_late_init+0x8c/0xb0 [amdgpu]
    [   41.286271]  amdgpu_device_ip_late_init+0x8d/0x3c0 [amdgpu]
    [   41.286401]  amdgpu_device_init.cold+0x1677/0x1fda [amdgpu]
    [   41.286616]  ? pci_bus_read_config_word+0x4a/0x70
    [   41.286621]  ? do_pci_enable_device+0xdb/0x110
    [   41.286625]  amdgpu_driver_load_kms+0x1a/0x160 [amdgpu]
    [   41.286762]  amdgpu_pci_probe+0x18d/0x3a0 [amdgpu]
    [   41.286898]  local_pci_probe+0x4b/0x90
    [   41.286901]  work_for_cpu_fn+0x1a/0x30
    [   41.286903]  process_one_work+0x22b/0x3d0
    [   41.286905]  worker_thread+0x223/0x420
    [   41.286907]  ? process_one_work+0x3d0/0x3d0
    [   41.286908]  kthread+0x12a/0x150
    [   41.286911]  ? set_kthread_struct+0x50/0x50
    [   41.286913]  ret_from_fork+0x22/0x30

[How]
    For specific asic, only mem ecc is enabled, sram ecc is not enabled,
    but it still need to send ras enable cmd to gfx block to support
    poison mode, so add check posion mode.

Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-01-03 16:57:58 -05:00
Tao Zhou
2dd9032beb drm/amdgpu: define RAS query poison mode function
1. no need to query poison mode on SRIOV guest side, host can handle it.
2. define the function to simplify code.

v2: rename amdgpu_ras_poison_mode_query to amdgpu_ras_query_poison_mode.

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-12-15 12:18:19 -05:00
Tao Zhou
3189501e6f drm/amdgpu: update VCN/JPEG RAS setting
Support VCN/JPEG RAS in both bare metal and SRIOV environment.

v2: update commit description.

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-12-15 12:18:19 -05:00
Tao Zhou
248c9635b8 drm/amdgpu: skip RAS error injection in SRIOV
Injection on guest is not allowed.

v2: return directly in SRIOV environment.

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-12-15 12:18:19 -05:00
ye xingchen
2cffcb6679 drm/amdgpu: use sysfs_emit() to instead of scnprintf()
Replace the open-code with sysfs_emit() to simplify the code.

Reviewed-by: Luben Tuikov <luben.tuikov@amd.com>
Signed-off-by: ye xingchen <ye.xingchen@zte.com.cn>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-12-01 15:21:34 -05:00
Tao Zhou
07615da1bf drm/amdgpu: enable RAS for VCN/JPEG v4.0
Set support flag for VCN/JPEG 4.0.

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-11-17 18:07:58 -05:00