CP supports unmap queue with reset mode which only destroys specific queue without affecting others.
Replacing whole gpu reset with reset queue mode for RAS poison consumption
saves much time, and we can also fallback to gpu reset solution if reset
queue fails.
v2: Return directly if process is NULL;
Reset queue solution is not applicable to SDMA, fallback to legacy
way;
Call kfd_unref_process after lookup process.
Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Acked-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
The new interface unmaps queues with reset mode for the process consumes
RAS poison, it's only for compute queue.
v2: rename the function to reset_queues.
Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Acked-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
So we can set reset mode for unmap operation, no functional change.
Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Acked-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Add a reset parameter for umc page retirement, let user decide whether
call gpu reset in umc page retirement.
Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Acked-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Expand RLCG interface for new GC read & write commands.
New interface will only be used if the PF enables the flag in pf2vf msg.
v2: Added a description for the scratch registers
Signed-off-by: Victor Skvortsov <victor.skvortsov@amd.com>
Reviewed-by: David Nieto <david.nieto@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Driver needs to call get_xgmi_info() before ip_init
to determine whether it needs to handle a pending hive reset.
Signed-off-by: Victor Skvortsov <victor.skvortsov@amd.com>
Reviewed-by: David Nieto <david.nieto@amd.com>
Reviewed by: shaoyun.liu <Shaoyun.lui@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Modify GC register access from MMIO to RLCG if the indirect
flag is set
Signed-off-by: Victor Skvortsov <victor.skvortsov@amd.com>
Reviewed-by: David Nieto <david.nieto@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Modify GC register access from MMIO to RLCG if the
indirect flag is set
v2: Replaced ternary operator with if-else for better
readability
Signed-off-by: Victor Skvortsov <victor.skvortsov@amd.com>
Reviewed-by: David Nieto <david.nieto@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Add helper macros to change register access
from direct to indirect.
Signed-off-by: Victor Skvortsov <victor.skvortsov@amd.com>
Reviewed-by: David Nieto <david.nieto@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Recently, there is security policy update under SRIOV.
We need to filter the registers that hit the violation
and move the code to the host driver side so that
the guest driver can execute correctly.
Signed-off-by: Bokun Zhang <Bokun.Zhang@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Chips with no display hardware should return false for
DC support.
v2: drop Arcturus and Aldebaran
Fixes: f7f12b2582 ("drm/amdgpu: default to true in amdgpu_device_asic_has_dc_support")
Reviewed-by: Evan Quan <evan.quan@amd.com>
Reviewed-by: Guchun Chen <guchun.chen@amd.com>
Reported-by: Tareque Md.Hanif <tarequemd.hanif@yahoo.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
By setting mp1_state as PP_MP1_STATE_UNLOAD, MP1 will do some proper cleanups and
put itself into a state ready for PNP. That can workaround some random resuming
failure observed on BOCO capable platforms.
Signed-off-by: Evan Quan <evan.quan@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Guchun Chen <guchun.chen@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
If the platform suspend happens to fail and the power rail
is not turned off, the GPU will be in an unknown state on
resume, so reset the asic so that it will be in a known
good state on resume even if the platform suspend failed.
v2: handle s0ix
Acked-by: Luben Tuikov <luben.tuikov@amd.com>
Acked-by: Evan Quan <evan.quan@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
In the s0ix entry need retain gfx in the gfxoff state,so here need't
set gfx cgpg in the S0ix suspend-resume process. Moreover move the S0ix
check into SMU12 can simplify the code condition check.
Bug: https://gitlab.freedesktop.org/drm/amd/-/issues/1712
Signed-off-by: Prike Liang <Prike.Liang@amd.com>
Reviewed-by: Evan Quan <evan.quan@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
It's not only supported by HG/PX laptops. It's supported
by all dGPUs which supports BOCO/BACO functionality (runtime
D3).
BOCO - Bus Off, Chip Off. The entire chip is powered off.
This is controlled by ACPI.
BACO - Bus Active, Chip Off. The chip still shows up
on the PCI bus, but the device itself is powered
down.
v2: fix missed HG/PX reference
Reviewed-by: Evan Quan <evan.quan@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
AMD systems currently lay out MCA bank types such that the type of bank
number "i" is either the same across all CPUs or is Reserved/Read-as-Zero.
For example:
Bank # | CPUx | CPUy
0 LS LS
1 RAZ UMC
2 CS CS
3 SMU RAZ
Future AMD systems will lay out MCA bank types such that the type of
bank number "i" may be different across CPUs.
For example:
Bank # | CPUx | CPUy
0 LS LS
1 RAZ UMC
2 CS NBIO
3 SMU RAZ
Change the structures that cache MCA bank types to be per-CPU and update
smca_get_bank_type() to handle this change.
Move some SMCA-specific structures to amd.c from mce.h, since they no
longer need to be global.
Break out the "count" for bank types from struct smca_hwid, since this
should provide a per-CPU count rather than a system-wide count.
Apply the "const" qualifier to the struct smca_hwid_mcatypes array. The
values in this array should not change at runtime.
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20211216162905.4132657-3-yazen.ghannam@amd.com
Play a video on the raven (or PCO, raven2) platform, and then do the S3
test. When resume, the following error will be reported:
amdgpu 0000:02:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring
vcn_dec test failed (-110)
[drm:amdgpu_device_ip_resume_phase2 [amdgpu]] *ERROR* resume of IP block
<vcn_v1_0> failed -110
amdgpu 0000:02:00.0: amdgpu: amdgpu_device_ip_resume failed (-110).
PM: dpm_run_callback(): pci_pm_resume+0x0/0x90 returns -110
[why]
When playing the video: The power state flag of the vcn block is set to
POWER_STATE_ON.
When doing suspend: There is no change to the power state flag of the
vcn block, it is still POWER_STATE_ON.
When doing resume: Need to open the power gate of the vcn block and set
the power state flag of the VCN block to POWER_STATE_ON.
But at this time, the power state flag of the vcn block is already
POWER_STATE_ON. The power status flag check in the "8f2cdef drm/amd/pm:
avoid duplicate powergate/ungate setting" patch will return the
amdgpu_dpm_set_powergating_by_smu function directly.
As a result, the gate of the power was not opened, causing the
subsequent ring test to fail.
[how]
In the suspend function of the vcn block, explicitly change the power
state flag of the vcn block to POWER_STATE_OFF.
BugLink: https://gitlab.freedesktop.org/drm/amd/-/issues/1828
Signed-off-by: chen gong <curry.gong@amd.com>
Reviewed-by: Evan Quan <evan.quan@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org
Fix the message argument.
0: Allow power down
1: Disallow power down
Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Used on gfx9 based systems. Fixes incorrect CU counts reported
in the kernel log.
Bug: https://gitlab.freedesktop.org/drm/amd/-/issues/1833
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Some old registers leftover from pre-silicon. No longer
relevant on real hardware. Remove.
Reviewed-by: James Zhu <James.Zhu@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
We want to be able to call virt data exchange conditionally
after gmc sw init to reserve bad pages as early as possible.
Since this is a conditional call, we will need
to call it again unconditionally later in the init sequence.
Refactor the data exchange function so it can be
called multiple times without re-initializing the work item.
v2: Cleaned up the code. Kept the original call to init_exchange_data()
inside early init to initialize the work item, afterwards call
exchange_data() when needed.
Signed-off-by: Victor Skvortsov <victor.skvortsov@amd.com>
Reviewed By: Shaoyun.liu <Shaoyun.liu@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Use max() and min() in order to make code cleaner.
Reported-by: Zeal Robot <zealci@zte.com.cn>
Signed-off-by: Changcheng Deng <deng.changcheng@zte.com.cn>
Reviewed-by: Philip Yang <Philip.Yang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Play a video on the raven (or PCO, raven2) platform, and then do the S3
test. When resume, the following error will be reported:
amdgpu 0000:02:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring
vcn_dec test failed (-110)
[drm:amdgpu_device_ip_resume_phase2 [amdgpu]] *ERROR* resume of IP block
<vcn_v1_0> failed -110
amdgpu 0000:02:00.0: amdgpu: amdgpu_device_ip_resume failed (-110).
PM: dpm_run_callback(): pci_pm_resume+0x0/0x90 returns -110
[why]
When playing the video: The power state flag of the vcn block is set to
POWER_STATE_ON.
When doing suspend: There is no change to the power state flag of the
vcn block, it is still POWER_STATE_ON.
When doing resume: Need to open the power gate of the vcn block and set
the power state flag of the VCN block to POWER_STATE_ON.
But at this time, the power state flag of the vcn block is already
POWER_STATE_ON. The power status flag check in the "8f2cdef drm/amd/pm:
avoid duplicate powergate/ungate setting" patch will return the
amdgpu_dpm_set_powergating_by_smu function directly.
As a result, the gate of the power was not opened, causing the
subsequent ring test to fail.
[how]
In the suspend function of the vcn block, explicitly change the power
state flag of the vcn block to POWER_STATE_OFF.
BugLink: https://gitlab.freedesktop.org/drm/amd/-/issues/1828
Signed-off-by: chen gong <curry.gong@amd.com>
Reviewed-by: Evan Quan <evan.quan@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Fix the message argument.
0: Allow power down
1: Disallow power down
Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
This is still needed for thoes in case the firmware fails to load
then the message is the only way to tell what firmware was on them
Suggested-by: Lijo Lazar <lijo.Lazar@amd.com>
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Add svm_range_bo_unref_async to schedule work to wait for svm_bo
eviction work done and then free svm_bo. __do_munmap put_page
is atomic context, call svm_range_bo_unref_async to avoid warning
invalid wait context. Other non atomic context call svm_range_bo_unref.
Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
In the s0ix entry need retain gfx in the gfxoff state,so here need't
set gfx cgpg in the S0ix suspend-resume process. Moreover move the S0ix
check into SMU12 can simplify the code condition check.
Signed-off-by: Prike Liang <Prike.Liang@amd.com>
Reviewed-by: Evan Quan <evan.quan@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Since this variable was made available by the previous commit, use
it to make function access cleaner.
Suggested-by: Guchun Chen <guchun.chen@amd.com>
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Reviewed-by: Guchun Chen <guchun.chen@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Backmerging for v5.16-rc5. Resolves a conflict between drm-misc-next
and drm-misc-fixes in the vc4 driver.
Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de>
Memory is allocated for gpu_metrics_table in renoir_init_smc_tables(),
but not freed in int smu_v12_0_fini_smc_tables(). Free it!
Fixes: 95868b8576 ("drm/amd/powerplay: add Renoir support for gpu metrics export")
Signed-off-by: Lang Yu <lang.yu@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Pair the operations did in GMC ->hw_init and ->hw_fini. That
can help to maintain correct cached state for GMC and avoid
unintention gate operation dropping due to wrong cached state.
BugLink: https://gitlab.freedesktop.org/drm/amd/-/issues/1828
Signed-off-by: Evan Quan <evan.quan@amd.com>
Acked-by: Guchun Chen <guchun.chen@amd.com>
Reviewed-by: Mario Limonciello <mario.limonciello@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
[Why]
If the firmware wasn't reset by PSP or HW and is currently running
then the firmware will hang or perform underfined behavior when we
modify its firmware state underneath it.
[How]
Reset DMCUB before setting up cache windows and performing HW init.
Reviewed-by: Aurabindo Jayamohanan Pillai <Aurabindo.Pillai@amd.com>
Acked-by: Pavle Kotarac <Pavle.Kotarac@amd.com>
Tested-by: Daniel Wheeler <daniel.wheeler@amd.com>
Signed-off-by: Nicholas Kazlauskas <nicholas.kazlauskas@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
[Why]
SMU now respects the PHY refclk disable request from driver.
This causes a hang during hotplug when PHY refclk was disabled
because it's not being re-enabled and the transmitter control
starts on dc_link_detect.
[How]
We normally would re-enable the clk with exit_optimized_pwr_state
but this is only set on DCN21 and DCN301. Set it for dcn31 as well.
This fixes DMCUB timeouts in the PHY.
Fixes: 64b1d0e8d5 ("drm/amd/display: Add DCN3.1 HWSEQ")
Reviewed-by: Eric Yang <Eric.Yang2@amd.com>
Acked-by: Pavle Kotarac <Pavle.Kotarac@amd.com>
Tested-by: Daniel Wheeler <daniel.wheeler@amd.com>
Signed-off-by: Nicholas Kazlauskas <nicholas.kazlauskas@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
This value does not get cached into adev->pm.fw_version during
startup for smu13 like it does for other SMU like smu12.
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org
Leave this bit as hardware default setting
Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org
should count on GC IP base address
Signed-off-by: Le Ma <le.ma@amd.com>
Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org
Those are not today pulled by the sphinx doc, but better be ready.
Signed-off-by: Yann Dirson <ydirson@free.fr>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Memory is allocated for gpu_metrics_table in renoir_init_smc_tables(),
but not freed in int smu_v12_0_fini_smc_tables(). Free it!
Fixes: 95868b8576 ("drm/amd/powerplay: add Renoir support for gpu metrics export")
Signed-off-by: Lang Yu <lang.yu@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
[Why]
gmc bo will be pinned during loading amdgpu and reset in SRIOV while
only unpinned in unload amdgpu
[How]
add amdgpu_in_reset and sriov judgement to skip pin bo
v2: fix wrong judgement
Signed-off-by: Jingwen Chen <Jingwen.Chen2@amd.com>
Reviewed-by: Horace Chen <horace.chen@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
[Why]
psp tmr bo will be pinned during loading amdgpu and reset in SRIOV while
only unpinned in unload amdgpu
[How]
add amdgpu_in_reset and sriov judgement to skip pin bo
v2: fix wrong judgement
Signed-off-by: Jingwen Chen <Jingwen.Chen2@amd.com>
Reviewed-by: Horace Chen <horace.chen@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Power states are not valid for arcturus and aldebaran, no need to
allocate memory.
Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Yang Wang <kevinyang.wang@amd.com>
Reviewed-by: Evan Quan <evan.quan@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Pair the operations did in GMC ->hw_init and ->hw_fini. That
can help to maintain correct cached state for GMC and avoid
unintention gate operation dropping due to wrong cached state.
BugLink: https://gitlab.freedesktop.org/drm/amd/-/issues/1828
Signed-off-by: Evan Quan <evan.quan@amd.com>
Acked-by: Guchun Chen <guchun.chen@amd.com>
Reviewed-by: Mario Limonciello <mario.limonciello@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Updated for consistency when accessing drm_device from amdgpu driver.
Signed-off-by: Guchun Chen <guchun.chen@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>