1
0
Fork 0
mirror of synced 2025-03-06 20:59:54 +01:00
Commit graph

33 commits

Author SHA1 Message Date
Mukul Joshi
cc009e613d drm/amdkfd: Add KFD support for soc21 v3
Add initial support for soc21 in KFD compute
driver (Mukul)
- Add new definition for soc21 device.
- Add new file for amdgpu-kfd interface for GFX11 family.
- Add new file for queue management, interrupt handling,
  mqd management for GFX11 family in KFD driver.
- Related changes/updates for soc21 device in
  KFD driver.
- Repurpose last 2 entries of SDMA MQD for driver use.

v2: Add an optional argument into update queue operation (Mukul)

v3: Switch to ip version check, replace kgd_dev with
    amdgpu_device (Hawking)

Signed-off-by: Mukul Joshi <mukul.joshi@amd.com>
Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Oak Zeng <Oak.Zeng@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-05-04 10:43:54 -04:00
Felix Kuehling
c3eb12dff0 drm/amdkfd: Ignore bogus signals from MEC efficiently
MEC firmware sometimes sends signal interrupts without a valid context ID
on end of pipe events that don't intend to signal any HSA signals.
This triggers the slow path in kfd_signal_event_interrupt that scans the
entire event page for signaled events. Detect these signals in the top
half interrupt handler to stop processing them as early as possible.

Because we now always treat event ID 0 as invalid, reserve that ID during
process initialization.

v2: Update firmware version checks to support more GPUs

Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
Reviewed-by: Philip Yang <Philip.Yang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-04-25 17:05:48 -04:00
Tao Zhou
ed94aca6db drm/amdkfd: print unmap queue status for RAS poison consumption (v3)
Print the status out when it passes, and also tell user gpu reset
is triggered when we fall back to legacy way.

v2: make the message more explicit.
v3: change succeeds to succeeded.
    replace pr_warn with dev_warn.

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-03-25 12:40:26 -04:00
Tao Zhou
1990e29b19 drm/amdkfd: add RAS poison consumption handling for UTCL2 (v2)
Do RAS page retirement and use gpu reset as fallback in UTCL2 fault
handler.

v2: replace vm fault event with posion consumed event in UTCL2
poison consumption.

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-03-25 12:40:26 -04:00
Tao Zhou
9d8a8d78d9 drm/amdkfd: replace source_id with client_id for RAS poison consumption
Client ID is more accruate here and we can deal with more different
cases with client ID.

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-03-25 12:40:25 -04:00
Tao Zhou
eed4197530 drm/amdkfd: refine event_interrupt_poison_consumption
Combine reading and setting poison flag as one atomic operation
and add print message for the function.

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-03-25 12:40:25 -04:00
Tao Zhou
29b440d204 drm/amdkfd: add return value check for queue eviction
Otherwise gpu reset will be triggered unconditionally in poison
consumption.

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-16 17:30:02 -05:00
Rajneesh Bhardwaj
2243f4937a drm/amdkfd: Fix leftover errors and warnings
A bunch of errors and warnings are leftover KFD over the years, attempt
to fix the errors and most warnings reported by checkpatch tool. Still a
few warnings remain which may be false positives so ignore them for now.

Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-14 15:08:40 -05:00
Rajneesh Bhardwaj
d87f36a063 drm/amdkfd: update SPDX license header
Update the SPDX License header for all the KFD files.

Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-14 15:08:40 -05:00
Tao Zhou
b1c87b0874 drm/amdkfd: use unmap all queues for poison consumption
Replace reset queue for specific PASID with unmap all queues, reset
queue could break CP scheduler.

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-09 14:15:07 -05:00
Tao Zhou
03e5b167bd drm/amdkfd: rename kfd_process_vm_fault to kfd_dqm_evict_pasid
As the function is used in more different cases, use a more general
name.

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-09 14:14:53 -05:00
yipechai
5b0ce2d41b drm/amdkfd: enable sdma ecc interrupt event can be handled by event_interrupt_wq_v9
Enable sdma ecc interrupt event can be handled by event_interrupt_wq_v9.

Signed-off-by: yipechai <YiPeng.Chai@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-01-07 17:19:14 -05:00
Tao Zhou
b6485bed40 drm/amdkfd: reset queue which consumes RAS poison (v2)
CP supports unmap queue with reset mode which only destroys specific queue without affecting others.
Replacing whole gpu reset with reset queue mode for RAS poison consumption
saves much time, and we can also fallback to gpu reset solution if reset
queue fails.

v2: Return directly if process is NULL;
    Reset queue solution is not applicable to SDMA, fallback to legacy
way;
    Call kfd_unref_process after lookup process.

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Acked-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-12-28 16:02:59 -05:00
Graham Sider
f0dc99a6f7 drm/amdkfd: add kfd_device_info_init function
Initializes kfd->device_info given either asic_type (enum) if GFX
version is less than GFX9, or GC IP version if greater. Also takes in vf
and the target compiler gfx version. Uses SDMA version to determine
num_sdma_queues_per_engine.

Convert device_info to a non-pointer member of kfd, change references
accordingly.

Change unsupported asic condition to only probe f2g, move device_info
initialization post-switch.

Signed-off-by: Graham Sider <Graham.Sider@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-12-01 16:15:37 -05:00
Graham Sider
6bfc7c7e17 drm/amdkfd: replace kgd_dev in various amgpu_amdkfd funcs
Modified definitions:

- amdgpu_amdkfd_submit_ib
- amdgpu_amdkfd_set_compute_idle
- amdgpu_amdkfd_have_atomics_support
- amdgpu_amdkfd_flush_gpu_tlb_pasid
- amdgpu_amdkfd_flush_gpu_tlb_pasid
- amdgpu_amdkfd_gpu_reset
- amdgpu_amdkfd_alloc_gtt_mem
- amdgpu_amdkfd_free_gtt_mem
- amdgpu_amdkfd_alloc_gws
- amdgpu_amdkfd_free_gws
- amdgpu_amdkfd_ras_poison_consumption_handler

Signed-off-by: Graham Sider <Graham.Sider@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-11-17 16:58:01 -05:00
Tao Zhou
c749094923 amd/amdkfd: add ras page retirement handling for sq/sdma (v3)
In ras poison mode, page retirement will be handled by the irq handler of the
module which consumes corrupted data.

v2: rename ras_process_cb to ras_poison_consumption_handler.
    move the handler's implementation from ASIC specific file to common
file.

v3: call gpu reset for xGMI connected mode.

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-10-04 15:22:57 -04:00
Hawking Zhang
4a1d4b6d38 drm/amdkfd: add sdma poison consumption handling
Follow the same apporach as GFX to handle SDMA
poison consumption. Send SIGBUS to application
when receives SDMA_ECC interrupt and issue gpu
reset either mode 2 or mode 1 to get the engine
back

Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Dennis Li<dennis.li@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-06-07 14:57:24 -04:00
Dennis Li
e2b1f9f52b drm/amdkfd: refine the poison data consumption handling
The user applications maybe register the KFD_EVENT_TYPE_HW_EXCEPTION and
KFD_EVENT_TYPE_MEMORY events, driver could notify them when poison data
consumed. Beside that, some applications maybe register SIGBUS signal
hander. These applications will handle poison data by themselves, exit
or re-create context to re-dispatch works.

Signed-off-by: Dennis Li <Dennis.Li@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-05-19 22:29:44 -04:00
Hawking Zhang
be9064b7bc drm/amdgpu: remove unnecessary header include
amdgpu.h is included in kfd_priv.h

Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: John Clements <John.Clements@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-28 23:35:50 -04:00
Dennis Li
20161e51dc drm/amdkfd: add edc error interrupt handle for poison propogate mode
In poison progogate mode, when driver receive the edc error interrupt
from SQ, driver should kill the process by pasid which is using the
poison data, and then trigger GPU reset.

Signed-off-by: Dennis Li <Dennis.Li@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-20 21:35:13 -04:00
Oak Zeng
6d909c5da0 drm/amdkfd: Add kernel parameter to stop queue eviction on vm fault
This is to keep wavefront context for debug purpose

Signed-off-by: Oak Zeng <Oak.Zeng@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-03-23 22:59:22 -04:00
Tao Zhou
7af103ea87 drm/amdkfd: check more client ids in interrupt handler
Add check for SExSH clients in kfd interrupt handler.

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-01-08 15:18:25 -05:00
Alex Deucher
ae279f693c drm/amdkfd: check both client id and src id in interrupt handlers
We can have the same src ids for different client ids so make sure to
check both the client id and the source id when handling interrupts.

Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-12-23 15:07:40 -05:00
Amber Lin
938a0650aa drm/amdkfd: Provide SMI events watch
When the compute is malfunctioning or performance drops, the system admin
will use SMI (System Management Interface) tool to monitor/diagnostic what
went wrong. This patch provides an event watch interface for the user
space to register devices and subscribe events they are interested. After
registered, the user can use annoymous file descriptor's poll function
with wait-time specified and wait for events to happen. Once an event
happens, the user can use read() to retrieve information related to the
event.

VM fault event is done in this patch.

v2: - remove UNREGISTER and add event ENABLE/DISABLE
    - correct kfifo usage
    - move event message API to kfd_ioctl.h
v3: send the event msg in text than in binary
v4: support multiple clients
v5: move events enablement from ioctl to fd write
v6: sparse fix

Signed-off-by: Amber Lin <Amber.Lin@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-07-15 13:27:34 -04:00
Aishwarya Ramakrishnan
8c8e1f6984 drm/amdkfd: Fix boolreturn.cocci warnings
Return statements in functions returning bool should use
true/false instead of 1/0.

drivers/gpu/drm/amd/amdkfd/kfd_int_process_v9.c:40:9-10:
WARNING: return of 0/1 in function 'event_interrupt_isr_v9' with return type bool

Generated by: scripts/coccinelle/misc/boolreturn.cocci

Signed-off-by: Aishwarya Ramakrishnan <aishwaryarj100@gmail.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-05-21 12:37:19 -04:00
Yong Zhao
3fe023d42e drm/amdkfd: Query vmid pasid mapping through stored info for non HWS
Because we record the mapping under non HWS mode in the software,
we can query pasid through vmid using the stored mapping instead of
reading from ATC registers.

This also prepares for the defeatured ATC block in future ASICs.

Signed-off-by: Yong Zhao <Yong.Zhao@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-10-03 09:11:03 -05:00
Yong Zhao
0ad8c5e296 drm/amdkfd: Support MMHUB1 in kfd interrupt path
Handle interrupts for second mmhub.

Signed-off-by: Yong Zhao <Yong.Zhao@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-07-18 14:18:04 -05:00
Yong Zhao
a53a11a835 drm/amdkfd: Workaround PASID missing in gfx9 interrupt payload under non HWS
This is a known gfx9 HW issue, and this change can perfectly workaround
the issue.

Signed-off-by: Yong Zhao <Yong.Zhao@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2018-11-19 16:38:14 -05:00
Yong Zhao
00557f4131 drm/amdkfd: Adjust the debug message in KFD ISR
This makes debug message get printed even when there is early return.

Signed-off-by: Yong Zhao <Yong.Zhao@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2018-11-19 16:38:14 -05:00
Lan Xiao
58e6988612 drm/amdkfd: fix zero reading of VMID and PASID for Hawaii
Upon VM Fault, the VMID and PASID written by HW are zeros in
Hawaii. Instead of reading from ih_ring_entry, read directly
from the registers. This workaround fix the soft hang issues
caused by mishandled VM Fault in Hawaii.

Signed-off-by: Lan Xiao <Lan.Xiao@amd.com>
Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
2018-07-11 22:32:51 -04:00
shaoyunl
2640c3facb drm/amdkfd: Handle VM faults in KFD
1. Pre-GFX9 the amdgpu ISR saves the vm-fault status and address per
   per-vmid. amdkfd needs to get the information from amdgpu through the
   new get_vm_fault_info interface. On GFX9 and later, all the required
   information is in the IH ring
2. amdkfd unmaps all queues from the faulting process and create new
   run-list without the guilty process
3. amdkfd notifies the runtime of the vm fault trap via EVENT_TYPE_MEMORY

Signed-off-by: shaoyun liu <shaoyun.liu@amd.com>
Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
2018-07-11 22:32:50 -04:00
Felix Kuehling
c129db1206 drm/amdkfd: Add sanity checks in IRQ handlers
Only accept interrupts from KFD VMIDs. Just checking for a PASID may
not be enough because amdgpu started using PASIDs to map VM faults
to processes.

Warn if an IRQ doesn't have a valid PASID (indicating a firmware bug).

Suggested-by: Shaoyun Liu <Shaoyun.Liu@amd.com>
Suggested-by: Oak Zeng <Oak.Zeng@amd.com>
Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
Reviewed-by: Oded Gabbay <oded.gabbay@gmail.com>
Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
2018-05-01 17:56:12 -04:00
Felix Kuehling
ca750681bc drm/amdkfd: Add SOC15 interrupt processing support
Signed-off-by: Shaoyun Liu <Shaoyun.Liu@amd.com>
Signed-off-by: Oak Zeng <Oak.Zeng@amd.com>
Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
Reviewed-by: Oded Gabbay <oded.gabbay@gmail.com>
Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
2018-04-10 17:33:10 -04:00