reset cause is requested by customer as additional
info for gpu reset smi event.
v2: integerate reset sources suggested by Lijo Lazar
Signed-off-by: Eric Huang <jinhuieric.huang@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
flush_gpu_tlb may be called from another thread while
device_gpu_recover is running.
Both of these threads access registers through the VF
RLCG interface during VF Full Access. Add a lock around this interface
to prevent race conditions between these threads.
Signed-off-by: Victor Skvortsov <victor.skvortsov@amd.com>
Reviewed-by: Zhigang Luo <zhigang.luo@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Use current speed/width on devices which don't support
dynamic PCIe switching.
Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3289
Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
add prints before and after ip state is dumped.
It avoids user to think of system being
stuck/hung as dump could take some time after a
gpu hang.
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Sunil Khatri <sunil.khatri@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
the inst passed to amdgpu_virt_rlcg_reg_rw should be physical instance.
Fix the miss matched code.
Signed-off-by: Victor Zhao <Victor.Zhao@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
The special case for VM passthrough doesn't check adev->nbio.funcs
before dereferencing it. If GPUs that don't have an NBIO block are
passed through, this leads to a NULL pointer dereference on startup.
Signed-off-by: Friedrich Vock <friedrich.vock@gmx.de>
Fixes: 1bece222ea ("drm/amdgpu: Clear doorbell interrupt status for Sienna Cichlid")
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Christian König <christian.koenig@amd.com>
Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
check the pointer hive before use.
Signed-off-by: Jesse Zhang <Jesse.Zhang@amd.com>
Reviewed-by: Tim Huang <Tim.Huang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Validate tbo resource pointer, skip if NULL
Signed-off-by: Asad Kamal <asad.kamal@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Add gfx v9_4_4 ip block support
Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Le Ma <le.ma@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
This is part of the reset, move it into the reset function.
Signed-off-by: Yunxiang Li <Yunxiang.Li@amd.com>
Reviewed-by: Emily Deng <Emily.Deng@amd.com>
Reviewed-by: Zhigang Luo <zhigang.luo@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
The retry loop for SRIOV reset have refcount and memory leak issue.
Depending on which function call fails it can potentially call
amdgpu_amdkfd_pre/post_reset different number of times and causes
kfd_locked count to be wrong. This will block all future attempts at
opening /dev/kfd. The retry loop also leakes resources by calling
amdgpu_virt_init_data_exchange multiple times without calling the
corresponding fini function.
Align with the bare-metal reset path which doesn't have these issues.
This means taking the amdgpu_amdkfd_pre/post_reset functions out of the
reset loop and calling amdgpu_device_pre_asic_reset each retry which
properly free the resources from previous try by calling
amdgpu_virt_fini_data_exchange.
Signed-off-by: Yunxiang Li <Yunxiang.Li@amd.com>
Reviewed-by: Emily Deng <Emily.Deng@amd.com>
Reviewed-by: Zhigang Luo <zhigang.luo@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
There are other reset sources that pass NULL as the job pointer, such as
amdgpu_amdkfd_reset_work. Therefore, using the job pointer to check if
the FLR comes from the host does not work.
Add a flag in reset_context to explicitly mark host triggered reset, and
set this flag when we receive host reset notification.
Signed-off-by: Yunxiang Li <Yunxiang.Li@amd.com>
Reviewed-by: Emily Deng <Emily.Deng@amd.com>
Reviewed-by: Zhigang Luo <zhigang.luo@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Some times a hang GPU causes multiple reset sources to schedule resets.
The second source will be able to trigger an unnecessary reset if they
schedule after we call amdgpu_device_stop_pending_resets.
Move amdgpu_device_stop_pending_resets to after the reset is done. Since
at this point the GPU is supposedly in a good state, any reset scheduled
after this point would be a legitimate reset.
Remove unnecessary and incorrect checks for amdgpu_in_reset that was
kinda serving this purpose.
Signed-off-by: Yunxiang Li <Yunxiang.Li@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Do not dump the ip registers during driver reload
in passthrough environment.
Signed-off-by: Sunil Khatri <sunil.khatri@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
As we use wb slots more dynamically, we need to lock
access to avoid racing on allocation or free.
Reviewed-by: Shaoyun.liu <shaoyunl@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Invoke the dump_ip_state function for each ip before
the asic resets and save the register values for
debugging via devcoredump.
Signed-off-by: Sunil Khatri <sunil.khatri@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
In passthrough environment, the driver triggers the mode-1 reset on
reload. The reset causes the core dump collection which is delayed task
and prevents driver from unloading until it is completed. Since we do
not need to collect data on "reset on reload" case, we can skip core
dump collection.
v2: Use the same flag to avoid calling amdgpu_reset_reg_dumps as well.
Signed-off-by: Ahmad Rehman <Ahmad.Rehman@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
refactor the code of runtime pm mode detection to support
amdgpu_runtime_pm =2 and 1 two cases
Signed-off-by: Ma Jun <Jun.Ma2@amd.com>
Reviewed-by: Yang Wang <kevinyang.wang@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
1. change AMDGPU_VF2PF_UPDATE_MAX_RETRY_LIMIT from 30 to 5.
2. set fatel error detected flag.
Signed-off-by: Zhigang Luo <Zhigang.Luo@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Add support for MACO flag checking.
MACO mode only works if BACO is supported.
Signed-off-by: Ma Jun <Jun.Ma2@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
fix the high voltage issue after unload on smu 13.0.10
Signed-off-by: Kenneth Feng <kenneth.feng@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Apply this rule to all newer asics in sriov case.
For asic with VF MMIO access protection avoid using CPU for VM table updates.
CPU pagetable updates have issues with HDP flush as VF MMIO access protection
blocks write to BIF_BX_DEV0_EPF0_VF0_HDP_MEM_COHERENCY_FLUSH_CNTL register
during sriov runtime.
Moved the check to amdgpu_device_init() to ensure it is done after
amdgpu_device_ip_early_init() where the IP versions are discovered.
Signed-off-by: Danijel Slivka <danijel.slivka@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Refractor devcoredump code into new files since its
functionality is expanded further and better to slit
and devcoredump to have its own file.
v2: Fix the build failure caught by arm compiler
of implicit function declaration with #ifdef
v3: squash in fix for implicit declaration error
Cc: Ivan Lipski <ivan.lipski@amd.com>
Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Sunil Khatri <sunil.khatri@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
if reading pf2vf data failed 30 times continuously, it means something is
wrong. Need to trigger flr_work to recover the issue.
also use dev_err to print the error message to get which device has
issue and add warning message if waiting IDH_FLR_NOTIFICATION_CMPL
timeout.
Signed-off-by: Zhigang Luo <Zhigang.Luo@amd.com>
Acked-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Check if the device is present in the bus before trying to recover. It
could be that device itself is lost from the bus in some hang
situations.
Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Asad Kamal <asad.kamal@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
This patch causes the following iounmap erorr and calltrace
iounmap: bad address 00000000d0b3631f
The original patch was unjustified because amdgpu_device_fini_sw() will
always cleanup the rmmio mapping.
This reverts commit eb4f139888.
Signed-off-by: Ma Jun <Jun.Ma2@amd.com>
Suggested-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Using the ring_muxer without preemption adds overhead for no
reason since mcbp cannot be triggered.
Moving back to a single queue in this case also helps when
high priority app are used: in this case the gpu_scheduler
priority handling will work as expected - much better than
ring_muxer with its 2 independant schedulers competing for
the same hardware queue.
This change requires moving amdgpu_device_set_mcbp above
amdgpu_device_ip_early_init because we use adev->gfx.mcbp.
Signed-off-by: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Acked-by: Christian König <christian.koenig@amd.com>
Acked-by: Jiadong Zhu <Jiadong.Zhu@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
This ensures that the memory mapped by ioremap for adev->rmmio, is
properly handled in amdgpu_device_init(). If the function exits early
due to an error, the memory is unmapped. If the function completes
successfully, the memory remains mapped.
Reported by smatch:
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c:4337 amdgpu_device_init() warn: 'adev->rmmio' from ioremap() not released on lines: 4035,4045,4051,4058,4068,4337
Cc: Christian König <christian.koenig@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
For a RAS error that needs a full reset to recover, set the fatal error
status. Clear the status once the device is reset.
Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Asad Kamal <asad.kamal@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
commit ab4750332d ("drm/amdgpu/sdma5.2: add begin/end_use ring
callbacks") caused GFXOFF control to be used more heavily and the
codepath that was removed from commit 0dee726395 ("drm/amd: flush any
delayed gfxoff on suspend entry") now can be exercised at suspend again.
Users report that by using GNOME to suspend the lockscreen trigger will
cause SDMA traffic and the system can deadlock.
This reverts commit 0dee726395.
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Fixes: ab4750332d ("drm/amdgpu/sdma5.2: add begin/end_use ring callbacks")
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
commit 5095d54181 ("drm/amd: Evict resources during PM ops prepare()
callback") intentionally moved the eviction of resources to earlier in
the suspend process, but this introduced a subtle change that it occurs
before adev->in_s0ix or adev->in_s3 are set. This meant that APUs
actually started to evict resources at suspend time as well.
Explicitly set s0ix or s3 in the prepare() stage, and unset them if the
prepare() stage failed.
v2: squash in warning fix from Stephen Rothwell
Reported-by: Jürg Billeter <j@bitron.ch>
Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3132#note_2271038
Fixes: 5095d54181 ("drm/amd: Evict resources during PM ops prepare() callback")
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Need to resume ras during gpu reset for
gfx v9_4_3 sriov
Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Fix the warning info below during mode1 reset.
[ +0.000004] Call Trace:
[ +0.000004] <TASK>
[ +0.000006] ? show_regs+0x6e/0x80
[ +0.000011] ? __flush_work.isra.0+0x2e8/0x390
[ +0.000005] ? __warn+0x91/0x150
[ +0.000009] ? __flush_work.isra.0+0x2e8/0x390
[ +0.000006] ? report_bug+0x19d/0x1b0
[ +0.000013] ? handle_bug+0x46/0x80
[ +0.000012] ? exc_invalid_op+0x1d/0x80
[ +0.000011] ? asm_exc_invalid_op+0x1f/0x30
[ +0.000014] ? __flush_work.isra.0+0x2e8/0x390
[ +0.000007] ? __flush_work.isra.0+0x208/0x390
[ +0.000007] ? _prb_read_valid+0x216/0x290
[ +0.000008] __cancel_work_timer+0x11d/0x1a0
[ +0.000007] ? try_to_grab_pending+0xe8/0x190
[ +0.000012] cancel_work_sync+0x14/0x20
[ +0.000008] amddrm_sched_stop+0x3c/0x1d0 [amd_sched]
[ +0.000032] amdgpu_device_gpu_recover+0x29a/0xe90 [amdgpu]
This warning info was printed after applying the patch
"drm/sched: Convert drm scheduler to use a work queue rather than kthread".
The root cause is that amdgpu driver tries to use the uninitialized
work_struct in the struct drm_gpu_scheduler
v2:
- Rename the function to amdgpu_ring_sched_ready and move it to
amdgpu_ring.c (Alex)
v3:
- Fix a few more checks based on Vitaly's patch (Alex)
v4:
- squash in fix noticed by Bert in
https://gitlab.freedesktop.org/drm/amd/-/issues/3139
Fixes: 11b3b9f461 ("drm/sched: Check scheduler ready before calling timeout handling")
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Vitaly Prosyak <vitaly.prosyak@amd.com>
Signed-off-by: Ma Jun <Jun.Ma2@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
- move aca init/fini function into ras init/fini to adapt gpu reset
sequence.
- add new function amdgpu_aca_reset()
Signed-off-by: Yang Wang <kevinyang.wang@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
This reverts commit 5f38ac54e6.
This causes issues with rebooting and the 7800XT.
Cc: Kenneth Feng <kenneth.feng@amd.com>
Cc: stable@vger.kernel.org
Fixes: 5f38ac54e6 ("drm/amd/pm: fix the high voltage and temperature issue")
Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3062
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Calling amdgpu_device_ip_resume_phase1() during shutdown leaves the
HW in an active state and is an unbalanced use of the IP callbacks.
Using the IP callbacks like this can lead to memory leaks, double
free and imbalanced reference counters.
Leaving the HW in an active state can lead to DMA accesses to memory now
freed by the driver.
Both is a complete no-go for driver unload so completely revert the
workaround for now.
This reverts commit f5c7e77970.
Signed-off-by: Christian König <christian.koenig@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Some customer platforms do not enable mmconfig for various reasons,
such as bios bug, and therefore cannot access the GPU extend configuration
space through mmio.
When the system enters the d3cold state and resumes, the amdgpu driver
fails to resume because the extend configuration space registers of
GPU can't be restored. At this point, Usually we only see some failure
dmesg log printed by amdgpu driver, it is difficult to find the root
cause.
Therefor print a warnning message if the system can't access the
extended configuration space register when using large bar.
Signed-off-by: Ma Jun <Jun.Ma2@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
v1:
implement new RAS ACA driver code framework.
v2:
- rename aca_bank_set to aca_banks.
- rename aca_source_xxx to aca_handle_xxx.
v3:
Optimize some function implementation details. (from Hawking's suggestion)
Signed-off-by: Yang Wang <kevinyang.wang@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
To allow using this helper for indirect access when
nbio funcs is not available. For instance, in ip
discovery phase.
v2: define macro for pcie_index/data/index_hi fallback.
Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Yang Wang <kevinyang.wang@amd.com>
Reviewed-by: Le Ma <le.ma@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Will replace it with new implementation to cover
boot fails in ip discovery phase.
Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Yang Wang <kevinyang.wang@amd.com>
Reviewed-by: Le Ma <le.ma@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
In function 'amdgpu_device_need_post(struct amdgpu_device *adev)' -
'adev->pm.fw' may not be released before return.
Using the function release_firmware() to release adev->pm.fw.
Thus fixing the below:
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c:1571 amdgpu_device_need_post() warn: 'adev->pm.fw' from request_firmware() not released on lines: 1554.
Cc: Monk Liu <Monk.Liu@amd.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com>
Suggested-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Replace direct usage of adev->ip_versions with amdgpu_ip_version.
Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Part of commit c035819862 ("drm/amdgpu: fix buffer funcs setting order on suspend")
got dropped accidently. Add it back.
Fixes: c035819862 ("drm/amdgpu: fix buffer funcs setting order on suspend")
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-----BEGIN PGP SIGNATURE-----
iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAmV2PMQeHHRvcnZhbGRz
QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGhgQIAKSdjtpkv+IeDwph
umLHxDlCTT8nl+TAhEYuxSdMiHt1i7+zyF0snnHhGthBRnBZBf+PDdkQOmqtjLZj
e591VGY7tdNCAVUwmRRaI82JFz9Pl1k8DXL3f+CE1+MfbpsirAecoIAL0rsvg+4f
ICKSAOBZonL7GT7IGrsbxgqGKCLC+2PcgNT4pADBdjtbC1nmVSLK21dtWr6i4xsn
CT9MxAnej1+EOpreyHhNZUVqfQy0/04x4OY0u/EOXU3PgpfGTrAzkyfwzcd/eit0
x59TZ0owDVXfvvdXUUAC0zG1th2ek5LKNtL+C4rKOoAbq7GZoQfjN5M/Ug+yykwE
aqm6CJ8=
=tZ+R
-----END PGP SIGNATURE-----
Backmerge tag 'v6.7-rc5' into drm-next
Linux 6.7-rc5
Alex requested this for some amdkfd work relying on the symbols exports.
Signed-off-by: Dave Airlie <airlied@redhat.com>
We need to disable this after the last eviction
call, but before we disable the SDMA IP.
Fixes: b70438004a ("drm/amdgpu: move buffer funcs setting up a level")
Link: https://lore.kernel.org/r/87edgv4x3i.fsf@vps.thesusis.net
Reviewed-by: Luben Tuikov <ltuikov89@gmail.com>
Tested-by: Phillip Susi <phill@thesusis.net>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: Phillip Susi <phill@thesusis.net>
Cc: Luben Tuikov <ltuikov89@gmail.com>
Disable MCBP(mid command buffer preemption) by default as old Mesa
hangs with it. We shall not enable the feature that breaks old usermode
driver.
Fixes: 50a7c8765c ("drm/amdgpu: enable mcbp by default on gfx9")
Signed-off-by: Jiadong Zhu <Jiadong.Zhu@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org