Add initial support for soc21 in KFD compute
driver (Mukul)
- Add new definition for soc21 device.
- Add new file for amdgpu-kfd interface for GFX11 family.
- Add new file for queue management, interrupt handling,
mqd management for GFX11 family in KFD driver.
- Related changes/updates for soc21 device in
KFD driver.
- Repurpose last 2 entries of SDMA MQD for driver use.
v2: Add an optional argument into update queue operation (Mukul)
v3: Switch to ip version check, replace kgd_dev with
amdgpu_device (Hawking)
Signed-off-by: Mukul Joshi <mukul.joshi@amd.com>
Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Oak Zeng <Oak.Zeng@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Rather than using hardcoded tables, we can use the gfx and
gmc config pulled from the IP discovery table to generate the
cache configuration.
Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
The logic to update the IO links when a KFD device
is removed was not correct as it would miss updating
the proximity domain values for some nodes where the
node_from and node_to both were greater values than the
proximity domain value of the KFD device being removed
from topology.
Fixes: 46d18d510d ("drm/amdkfd: Cleanup IO links during KFD device removal")
Signed-off-by: Mukul Joshi <mukul.joshi@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
All uses of the 'kfd->gtt_sa_bitmap' bitmap are protected with the
'kfd->gtt_sa_lock' mutex.
So:
- prefer the non-atomic '__set_bit()' function
- use the non-atomic 'bitmap_[set|clear]()' functions instead of
equivalent 'for' loops. These functions can work on several bits at a
time
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
'kfd->gtt_sa_bitmap' is a bitmap. So use 'bitmap_zalloc()' to simplify
code, improve the semantic and avoid some open-coded arithmetic in
allocator arguments.
Also change the corresponding 'kfree()' into 'bitmap_free()' to keep
consistency.
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Change SVM range mapping flags or access attributes don't trigger
migration, if range is already mapped on GPUs we should update GPU
mapping and pass flush_tlb flag true to amdgpu vm.
Change SVM range preferred_loc or migration granularity don't need
update GPU mapping, skip the validate_and_map.
Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
To avoid unnecessary unmap SVM range from GPUs if range is not mapped on
GPUs when migrating the range. This flag will also be used to flush TLB
when updating the existing mapping on GPUs.
It is protected by prange->migrate_mutex and mmap read lock in MMU
notifier callback.
Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
MEC firmware sometimes sends signal interrupts without a valid context ID
on end of pipe events that don't intend to signal any HSA signals.
This triggers the slow path in kfd_signal_event_interrupt that scans the
entire event page for signaled events. Detect these signals in the top
half interrupt handler to stop processing them as early as possible.
Because we now always treat event ID 0 as invalid, reserve that ID during
process initialization.
v2: Update firmware version checks to support more GPUs
Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
Reviewed-by: Philip Yang <Philip.Yang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
simplify programming with existing functions.
Signed-off-by: Yang Wang <KevinYang.Wang@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
This reverts commit 36bf93216e.
It causes SVM regressions on Vega10 with XNACK-ON. Just revert it
at the moment.
./kfdtest --gtest_filter=KFDSVMRangeTest.MigratePolicyTest
Signed-off-by: Lang Yu <Lang.Yu@amd.com>
Reviewed-by: Philip Yang<Philip.Yang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Add support to checkpoint/restore GWS (Global Wave Sync) queues.
Signed-off-by: David Yat Sin <david.yatsin@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
dqm->gws_queue_count and pdd->qpd.mapped_gws_queue need to be updated
each time the queue gets evicted.
Fixes: b8020b0304 ("drm/amdkfd: Enable over-subscription with >1 GWS queue")
Signed-off-by: David Yat Sin <david.yatsin@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
The idea is from
commit a50fe70780 ("drm/amdkfd: Only apply heavy-weight TLB flush on Aldebaran")
and
commit f61c40c075 ("drm/amdkfd: enable heavy-weight TLB flush on Arcturus").
At the moment, heavy-weight TLB could cause problems on ASICs except
Aldebaran and Arcturus.
A simple hipMallocManaged/hipFree program could trigger this issue.
[ 97.787657] amdgpu 0000:01:00.0: amdgpu: wait for kiq fence error: 0.
[ 106.868758] amdgpu: qcm fence wait loop timeout expired
[ 106.868966] amdgpu: The cp might be in an unrecoverable state due to an unsuccessful queues preemption
[ 106.869203] amdgpu: Failed to evict process queues
[ 106.869261] amdgpu: Failed to quiesce KFD
Signed-off-by: Lang Yu <Lang.Yu@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
To make kfd_flush_tlb_after_unmap visible in kfd_svm.c,
move it into kfd_priv.h. And change it to an inline function.
Signed-off-by: Lang Yu <Lang.Yu@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Add the waiters to the wait queue during initialization, while holding the
event spinlock. Otherwise the waiter will not get activated if the event
signals before being added to the wait queue.
Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
Reviewed-by: Philip Yang<Philip.Yang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
If lookup_event_by_id() returns a NULL "ev" pointer then the
spin_lock(&ev->lock) will crash. This was detected by Smatch:
drivers/gpu/drm/amd/amdgpu/../amdkfd/kfd_events.c:644 kfd_set_event()
error: we previously assumed 'ev' could be null (see line 639)
Fixes: 5273e82c5f ("drm/amdkfd: Improve concurrency of event handling")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Currently, the IO-links to the device being removed from topology,
are not cleared. As a result, there would be dangling links left in
the KFD topology. This patch aims to fix the following:
1. Cleanup all IO links to the device being removed.
2. Ensure that node numbering in sysfs and nodes proximity domain
values are consistent after the device is removed:
a. Adding a device and removing a GPU device are made mutually
exclusive.
b. The global proximity domain counter is no longer required to be
an atomic counter. A normal 32-bit counter can be used instead.
3. Update generation_count to let user-mode know that topology has
changed due to device removal.
CC: Shuotao Xu <shuotaoxu@microsoft.com>
Reviewed-by: Shuotao Xu <shuotaoxu@microsoft.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Mukul Joshi <mukul.joshi@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
A MAX_GPU_INSTANCE bits bitmap will suffice.
Signed-off-by: Lang Yu <Lang.Yu@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
The synchronize_rcu call in destroy_events can take several ms, which
noticeably slows down applications destroying many events. Use kfree_rcu
to free the event structure asynchronously and eliminate the
synchronize_rcu call in the user thread.
Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
Reviewed-by: Philip Yang <Philip.Yang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Application could change XNACK enabled to disabled while KFD is draining
stale retry fault, therefore the check for whether to drain retry faults
must be before the check for whether xnack_enabled, to avoid report
incorrect vm fault after application changes XNACK mode.
Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Use rcu_read_lock to read p->event_idr concurrently with other readers
and writers. Use p->event_mutex only for creating and destroying events
and in kfd_wait_on_events.
Protect the contents of the kfd_event structure with a per-event
spinlock that can be taken inside the rcu_read_lock critical section.
This eliminates contention of p->event_mutex in set_event, which tends
to be on the critical path for dispatch latency even when busy waiting
is used. It also eliminates lock contention in event interrupt handlers.
Since the p->event_mutex is now used much less, the impact of requiring
it in kfd_wait_on_events should also be much smaller.
This should improve event handling latency for processes using multiple
GPUs concurrently.
v2: Reschedule the worker periodically to avoid soft lockup warnings
Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
Reviewed-by: Sean Keely <Sean.Keely@amd.com> # v1
Tested-by: Sanjay Tripathi <sanjay.tripathi@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
bo_adev is NULL for system memory mapping to GPU.
Fixes: 30671b44aa ("drm/amdgpu: fix TLB flushing during eviction")
Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Testing the valid bit is not enough to figure out if we
need to invalidate the TLB or not.
During eviction it is quite likely that we move a BO from VRAM to GTT and
update the page tables immediately to the new GTT address.
Rework the whole function to get all the necessary parameters directly as
value.
Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Philip Yang <Philip.Yang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
This ensures userspace cannot prematurely clean-up the client before
it is fully initialised which has been proven to cause issues in the
past.
Cc: Felix Kuehling <Felix.Kuehling@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: "Christian König" <christian.koenig@amd.com>
Cc: "Pan, Xinhui" <Xinhui.Pan@amd.com>
Cc: David Airlie <airlied@linux.ie>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: amd-gfx@lists.freedesktop.org
Cc: dri-devel@lists.freedesktop.org
Signed-off-by: Lee Jones <lee.jones@linaro.org>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
To support multi-thread update page table.
Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Better to leave the decision when to flush the VM changes in the TLB to
the VM code.
Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Philip Yang<Philip.Yang@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Instead of hand rolling the table_freed parameter.
v2: add some changes suggested by Philip
Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Philip Yang<Philip.Yang@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Instead of trying to figure out if a TLB flush is necessary or not use
the information provided by the VM subsystem now.
Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Philip Yang<Philip.Yang@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Print the status out when it passes, and also tell user gpu reset
is triggered when we fall back to legacy way.
v2: make the message more explicit.
v3: change succeeds to succeeded.
replace pr_warn with dev_warn.
Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Do RAS page retirement and use gpu reset as fallback in UTCL2 fault
handler.
v2: replace vm fault event with posion consumed event in UTCL2
poison consumption.
Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Client ID is more accruate here and we can deal with more different
cases with client ID.
Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Combine reading and setting poison flag as one atomic operation
and add print message for the function.
Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
As the kmalloc_array() may return null, the 'event_waiters[i].wait' would lead to null-pointer dereference.
Therefore, it is better to check the return value of kmalloc_array() to avoid this confusion.
Signed-off-by: QintaoShen <unSimple1993@163.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Recently introduced commit 158a05a0b8 ("drm/amdgpu: Add
use_xgmi_p2p module parameter") did not update XGMI iolinks
when use_xgmi_p2p is disabled. Add fix to not create XGMI
iolinks in KFD topology when this parameter is disabled.
Fixes: 158a05a0b8 ("drm/amdgpu: Add use_xgmi_p2p module parameter")
Signed-off-by: Divya Shikre <DivyaUday.Shikre@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Compute-only GPUs have more than 8 VMIDs allocated to KFD. Fix
this by passing correct number of VMIDs to HWS
v2: squash in warning fix (Alex)
Signed-off-by: Tushar Patel <tushar.patel@amd.com>
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Migrate vram to ram may fail to find the vma if process is exiting
and vma is removed, evict svm bo worker sets prange->svm_bo to NULL
and warn svm_bo ref count != 1 only if migrating vram to ram
successfully.
Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Export dmabuf handles for GTT BOs so that their contents can be accessed
using SDMA during checkpoint/restore.
v2: Squash in fix from David to set dmabuf handle to invalid for BOs
that cannot be accessed using SDMA during checkpoint/restore.
Signed-off-by: David Yat Sin <david.yatsin@amd.com>
Reviewed-by : Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Refactor CRIU restore BO to reduce identation.
Signed-off-by: David Yat Sin <david.yatsin@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
When the process is getting restored, the queues are not mapped yet, so
there is no VMID assigned for this process and no TLBs to flush.
Signed-off-by: David Yat Sin <david.yatsin@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
it makes no sense to continue with an undefined vmid.
Fixes: c8b0507f40 ("drm/amdkfd: judge get_atc_vmid_pasid_mapping_info before call")
Signed-off-by: Yifan Zhang <yifan1.zhang@amd.com>
Reported-by: Nathan Chancellor <nathan@kernel.org>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
To enable compiler type-checked against the format string in callers.
All warnings (new ones prefixed by >>):
>> warning: function 'kfd_smi_event_add' might be a candidate for
'gnu_printf' format attribute [-Wsuggest-attribute=format]
Fixes: d58b8a99cb ("drm/amdkfd: Add SMI add event helper")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
To remove duplicate code, unify event message format and simplify new
event add in the following patches.
Use KFD_SMI_EVENT_MSG_SIZE to define msg size, the same size will be
used in user space to alloc the msg receive buffer.
Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
sizeof(buf) is 8 bytes because it is defined as unsigned char *buf,
each SMI event read only copy max 8 bytes to user buffer. Correct this
by using the buf allocate size.
Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
This reverts commit 3abfe30d80.
To fix deadlock in kFDSVMEvictTest when xnack off.
Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Print alloc node, peer node and memory domain when peer map fails. This
is more useful
v2: use dev_err instead of pr_err
use bdf for identify peer gpu
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Reviewed-by: Alex Deucher <Alexander.Deucher@amd.com>
Signed-off-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
kfd_chardev() doesn't provide much useful information in dev_... messages
on multi-GPU systems because there is only one KFD device, which doesn't
correspond to any particular GPU. Use the actual GPU device to indicate
the GPU that caused a message.
Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Fix for possible integer overflow when doing addition.
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: David Yat Sin <david.yatsin@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
The driver has a fallback so make the message informational
rather than a warning. The driver has a fallback if the
Component Resource Association Table (CRAT) is missing, so
make this informational now.
Bug: https://gitlab.freedesktop.org/drm/amd/-/issues/1906
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>