1
0
Fork 0
mirror of synced 2025-03-06 20:59:54 +01:00
Commit graph

189 commits

Author SHA1 Message Date
Jonathan Kim
a70a93fa56 drm/amdkfd: add debug suspend and resume process queues operation
In order to inspect waves from the saved context at any point during a
debug session, the debugger must be able to preempt queues to trigger
context save by suspending them.

On queue suspend, the KFD will copy the context save header information
so that the debugger can correctly crawl the appropriate size of the saved
context. The debugger must then also be allowed to resume suspended queues.

A queue that is newly created cannot be suspended because queue ids are
recycled after destruction so the debugger needs to know that this has
occurred.  Query functions will be later added that will clear a given
queue of its new queue status.

A queue cannot be destroyed while it is suspended to preserve its saved
context during debugger inspection.  Have queue destruction block while
a queue is suspended and unblocked when it is resumed.  Likewise, if a
queue is about to be destroyed, it cannot be suspended.

Return the number of queues successfully suspended or resumed along with
a per queue status array where the upper bits per queue status show that
the request was invalid (new/destroyed queue suspend request, missing
queue) or an error occurred (HWS in a fatal state so it can't suspend or
resume queues).

Signed-off-by: Jonathan Kim <jonathan.kim@amd.com>
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-06-09 12:36:43 -04:00
Jonathan Kim
12fb1ad70d drm/amdkfd: update process interrupt handling for debug events
The debugger must be notified by any debugger subscribed exception
that comes from hardware interrupts.

If a debugger session exits, any exceptions it subscribed to may still
have interrupts in the interrupt ring buffer or KGD/KFD pipeline.
To prevent a new session from inheriting stale interrupts, when a new
queue is created, open an interrupt drain and allow the IH ring to drain
from a timestamped checkpoint.  Then inject a custom IV so that once
the custom IV is picked up by the KFD, it's safe to close the drain
and proceed with queue creation.

The drain must also be on debug disable as SW interrupts may still
be processed.  Drain at this time and clear all the exception status.

The debugger may also not be attached nor subscibed to certain
exceptions so forward them directly to the runtime.

GFX10 also requires its own IV processing, hence the creation of
kfd_int_process_v10.c.  This is because the IV from SQ interrupts are
packed into a new continguous format unlike GFX9. To make this clear,
a separate interrupting handling code file was created.

Signed-off-by: Jonathan Kim <jonathan.kim@amd.com>
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-06-09 12:36:17 -04:00
Philip Yang
acf429dcac drm/amdkfd: Align partition memory size to page size
The compute partition memory size calculated from KFD_XCP_MEMORY_SIZE
may not align to page size if xcp_mgr->num_xcp_per_mem_partition is 6.

Change the KFD_XCP_MEMORY_SIZE macro to return page align size, so KFD
node memory size reported in sysfs is page align size, to avoid
application VRAM allocation failure because application may use the size
directly and Thunk requires the memory allocation size is page size
align.

Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-06-09 10:53:03 -04:00
Alex Deucher
45b3a914d4 drm/amdgpu/gmc9: fix 64 bit division in partition code
Rework logic or use do_div() to avoid problems on 32 bit.

v2: add a missing case for XCP macro
v3: fix out of bounds array access
v4: fix xcp handling harder

Acked-by: Guchun Chen <guchun.chen@amd.com> (v1)
Reviewed-by: Mukul Joshi <mukul.joshi@amd.com> (v3)
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-06-09 10:37:52 -04:00
Hawking Zhang
9a3ce1a7a9 drm/amdgpu: Do not access members of xcp w/o check (v2)
Not all the asic needs xcp. ensure check xcp availabity
before accessing its member.

v2: add missing change in kfd_topology.c

Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Le Ma <le.ma@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-06-09 10:37:24 -04:00
Mukul Joshi
315e29eca5 drm/amdkfd: Move local_mem_info to kfd_node
We need to track memory usage on a per partition basis. To do
that, store the local memory information in KFD node instead
of kfd device.

v2: squash in fix ("amdkfd: Use mem_id to access mem_partition info")

Signed-off-by: Mukul Joshi <mukul.joshi@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-06-09 10:36:45 -04:00
Philip Yang
2fa9ff25de drm/amdgpu: KFD graphics interop support compute partition
kfd_ioctl_get_dmabuf use the amdgpu bo xcp_id to get the gpu_id of the
KFD node from the exported dmabuf_adev, and then create kfd bo on the
correct adev and KFD node when importing the amdgpu bo to KFD.

Remove function kfd_device_by_adev, it is not needed as it is the same
result as dmabuf_adev->kfd.dev->nodes[0]->id.

Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-06-09 10:36:41 -04:00
Mukul Joshi
0c7315e7d5 drm/amdkfd: Add device repartition support
GFX9.4.3 will support dynamic repartitioning of the GPU through sysfs.
Add device repartitioning support in KFD to repartition GPU from one
mode to other.

v2: squash in fix ("drm/amdkfd: Fix warning kgd2kfd_unlock_kfd defined but not used")

Signed-off-by: Mukul Joshi <mukul.joshi@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-06-09 09:45:38 -04:00
Mukul Joshi
f87f686482 drm/amdgpu: Add XCC inst to PASID TLB flushing
Add XCC instance to select the correct KIQ ring when
flushing TLBs on a multi-XCC setup.

Signed-off-by: Mukul Joshi <mukul.joshi@amd.com>
Tested-by: Amber Lin <Amber.Lin@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-06-09 09:42:46 -04:00
Le Ma
d9426c3d9b drm/amdgpu: add bitmask to iterate vmhubs
As the layout of VMHUB definition has been changed to cover multiple
XCD/AID case, the original num_vmhubs is not appropriate to do vmhub
iteration any more.

Drop num_vmhubs and introduce vmhubs_mask instead.

v2: switch to the new VMHUB layout
v3: use DECLARE_BITMAP to define vmhubs_mask

Signed-off-by: Le Ma <le.ma@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-06-09 09:40:17 -04:00
Hawking Zhang
f4caf58426 drm/amdgpu: introduce vmhub definition for multi-partition cases (v3)
v1: Each partition has its own gfxhub or mmhub. adjust
the num of MAX_VMHUBS and the GFXHUB/MMHUB layout (Le)

v2: re-design the AMDGPU_GFXHUB/AMDGPU_MMHUB layout (Le)

v3: apply the gfxhub/mmhub layout to new IPs (Hawking)

v4: fix up gmc11 (Alex)

v5: rebase (Alex)

Signed-off-by: Le Ma <le.ma@amd.com>
Acked-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-06-09 09:40:03 -04:00
Mukul Joshi
27fb73a0e3 drm/amdkfd: Update KFD TTM mem limit
Use the helper function in TTM to get TTM memory
limit and set KFD's internal mem limit. This ensures
that KFD's TTM mem limit and actual TTM mem limit are
exactly same.

Signed-off-by: Mukul Joshi <mukul.joshi@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-06-09 09:24:52 -04:00
Le Ma
be697aa3a7 drm/amdgpu: move queue_bitmap to an independent structure (v3)
To allocate independent queue_bitmap for each XCD,
then the old bitmap policy can be continued to use
with a clear logic.

Use mec_bitmap[0] as default for all non-GC 9.4.3 IPs.

v2: squash commits to avoid breaking the build
v3: unify naming style

Signed-off-by: Le Ma <le.ma@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-04-18 16:28:54 -04:00
Shashank Sharma
0512e9ffeb drm/amdgpu: rename num_doorbells
Rename doorbell.num_doorbells to doorbell.num_kernel_doorbells to
make it more readable.

Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Christian Koenig <christian.koenig@amd.com>
Acked-by: Christian Koenig <christian.koenig@amd.com>
Signed-off-by: Shashank Sharma <shashank.sharma@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-04-13 00:19:42 -04:00
Dave Airlie
4e291f2f58 drm-misc-next for 6.2:
UAPI Changes:
 
 Cross-subsystem Changes:
 
 Core Changes:
 - atomic-helper: Add begin_fb_access and end_fb_access hooks
 - fb-helper: Rework to move fb emulation into helpers
 - scheduler: rework entity flush, kill and fini
 - ttm: Optimize pool allocations
 
 Driver Changes:
 - amdgpu: scheduler rework
 - hdlcd: Switch to DRM-managed resources
 - ingenic: Fix registration error path
 - lcdif: FIFO threshold tuning
 - meson: Fix return type of cvbs' mode_valid
 - ofdrm: multiple fixes (kconfig, types, endianness)
 - sun4i: A100 and D1 support
 - panel:
   - New Panel: Jadard JD9365DA-H3
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRcEzekXsqa64kGDp7j7w1vZxhRxQUCY2y3vQAKCRDj7w1vZxhR
 xQu+AP9CzbNI2s12aNS8DskEZggo0lqUyRiEBaRJ2jrWZdGr0gEA1+Lc06HaKmGC
 2WBD4nw2I7ch0NUN6VFjQ9ATofevPwA=
 =AY3a
 -----END PGP SIGNATURE-----

Merge tag 'drm-misc-next-2022-11-10-1' of git://anongit.freedesktop.org/drm/drm-misc into drm-next

drm-misc-next for 6.2:

UAPI Changes:

Cross-subsystem Changes:

Core Changes:
- atomic-helper: Add begin_fb_access and end_fb_access hooks
- fb-helper: Rework to move fb emulation into helpers
- scheduler: rework entity flush, kill and fini
- ttm: Optimize pool allocations

Driver Changes:
- amdgpu: scheduler rework
- hdlcd: Switch to DRM-managed resources
- ingenic: Fix registration error path
- lcdif: FIFO threshold tuning
- meson: Fix return type of cvbs' mode_valid
- ofdrm: multiple fixes (kconfig, types, endianness)
- sun4i: A100 and D1 support
- panel:
  - New Panel: Jadard JD9365DA-H3

Signed-off-by: Dave Airlie <airlied@redhat.com>

From: Maxime Ripard <maxime@cerno.tech>
Link: https://patchwork.freedesktop.org/patch/msgid/20221110083612.g63eaocoaa554soh@houat
2022-11-16 07:17:32 +10:00
Thomas Zimmermann
8e4e4c2f53 Merge drm/drm-next into drm-misc-next
Backmerging drm/drm-next to get the latest changes in the xlnx driver.

Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de>
2022-11-05 16:08:36 +01:00
Christian König
f7d66fb2ea drm/amdgpu: cleanup scheduler job initialization v2
Init the DRM scheduler base class while allocating the job.

This makes the whole handling much more cleaner.

v2: fix coding style

Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Luben Tuikov <luben.tuikov@amd.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20221014084641.128280-7-christian.koenig@amd.com
2022-11-03 12:45:20 +01:00
Mukul Joshi
d69a3b762d drm/amdkfd: Cleanup kfd_dev struct
Cleanup kfd_dev struct by removing ddev and pdev as both
drm_device and pci_dev can be fetched from amdgpu_device.

Signed-off-by: Mukul Joshi <mukul.joshi@amd.com>
Tested-by: Amber Lin <Amber.Lin@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-10-27 15:12:09 -04:00
Graham Sider
087b8542c0 drm/amdgpu: disable GFXOFF during compute for GFX11
Temporary workaround to fix issues observed in some compute applications
when GFXOFF is enabled on GFX11.

Signed-off-by: Graham Sider <Graham.Sider@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-10-27 15:12:09 -04:00
Tao Zhou
1ed0e17690 drm/amdgpu: remove ras_error_status parameter for UMC poison handler
Make the code simpler.

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-10-27 15:12:08 -04:00
Victor Zhao
a340847b02 Revert "drm/amdgpu: let mode2 reset fallback to default when failure"
This reverts commit dac6b80818.

This commit reverted the AMDGPU_SKIP_MODE2_RESET as it conflicts with
the original design of reset handler. Will redesign it.

Fixes: dac6b80818 ("drm/amdgpu: let mode2 reset fallback to default when failure")
Signed-off-by: Victor Zhao <Victor.Zhao@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-10-18 22:08:33 -04:00
Victor Zhao
b98a1648d6 Revert "drm/amdgpu: let mode2 reset fallback to default when failure"
This reverts commit dac6b80818.

This commit reverted the AMDGPU_SKIP_MODE2_RESET as it conflicts with
the original design of reset handler. Will redesign it.

Fixes: dac6b80818 ("drm/amdgpu: let mode2 reset fallback to default when failure")
Signed-off-by: Victor Zhao <Victor.Zhao@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-10-17 17:41:20 -04:00
Philip Yang
2302d50714 drm/amdgpu: Correct amdgpu_amdkfd_total_mem_size calculation
amdkfd_total_mem_size is the size of total GPUs vram plus system memory
to estimate page tables memory usage and leave enough VRAM room for page
tables allocation.

Calculate amdkfd_total_mem_size in amdgpu_amdkfd_device_probe is
incorrect because adev->gmc.real_vram_size is still 0 called from
amdgpu_device_ip_early_init. Move the calculation
to amdgpu_amdkfd_device_init to get the correct VRAM size.

Do reverse calculation in amdgpu_amdkfd_device_fini_sw to support
hot-unplugging GPUs.

Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-10-06 12:08:18 -04:00
Tao Zhou
5e1fdf76cf drm/amdgpu: add page retirement handling for CPU RAS
Do RAS page retirement in poison consumption handler unconditionally.

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-09-29 09:42:36 -04:00
Christian König
4624459c84 drm/amdgpu: add gang submit frontend v6
Allows submitting jobs as gang which needs to run on multiple engines at the
same time.

All members of the gang get the same implicit, explicit and VM dependencies. So
no gang member will start running until everything else is ready.

The last job is considered the gang leader (usually a submission to the GFX
ring) and used for signaling output dependencies.

Each job is remembered individually as user of a buffer object, so there is no
joining of work at the end.

v2: rebase and fix review comments from Andrey and Yogesh
v3: use READ instead of BOOKKEEP for now because of VM unmaps, set gang
    leader only when necessary
v4: fix order of pushing jobs and adding fences found by Trigger.
v5: fix job index calculation and adding IBs to jobs
v6: fix typo found by Alex

Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-09-20 12:40:46 -04:00
Jingyu Wang
2f4ca1ba6c drm/amdgpu: cleanup coding style in amdgpu_amdkfd.c
Fix everything checkpatch.pl complained about in amdgpu_amdkfd.c

Signed-off-by: Jingyu Wang <jingyuwang_vip@163.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-09-13 14:32:58 -04:00
Victor Zhao
dac6b80818 drm/amdgpu: let mode2 reset fallback to default when failure
- introduce AMDGPU_SKIP_MODE2_RESET flag
- let mode2 reset fallback to default reset method if failed

v2: move this part out from the asic specific part

Signed-off-by: Victor Zhao <Victor.Zhao@amd.com>
Acked-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-08-16 18:14:31 -04:00
Likun Gao
f1549c09c5 drm/amdgpu: support reset flag set for gpu reset
Move reset_context out of gpu recover function to make it configurable
for different reset purpose.
For the reset way of call gpu_recovery sysfs, force to use full reset
method. Otherwise, try soft reset by default if the related ASIC
supportted, if soft reset failed, will use full reset.

Signed-off-by: Likun Gao <Likun.Gao@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-07-13 11:25:17 -04:00
Andrey Grodzovsky
9ae55f030d drm/amdgpu: Follow up change to previous drm scheduler change.
Align refcount behaviour for amdgpu_job embedded HW fence with
classic pointer style HW fences by increasing refcount each
time emit is called so amdgpu code doesn't need to make workarounds
using amdgpu_job.job_run_counter to keep the HW fence refcount balanced.

Also since in the previous patch we resumed setting s_fence->parent to NULL
in drm_sched_stop switch to directly checking if job->hw_fence is
signaled to short circuit reset if already signed.

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Tested-by: Yiqing Yao <yiqing.yao@amd.com>
Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-06-28 11:24:41 -04:00
Ruili Ji
508f748b03 drm/amdgpu: To flush tlb for MMHUB of RAVEN series
amdgpu: [mmhub0] no-retry page fault (src_id:0 ring:40 vmid:8 pasid:32769, for process test_basic pid 3305 thread test_basic pid 3305)
amdgpu: in page starting at address 0x00007ff990003000 from IH client 0x12 (VMC)
amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00840051
amdgpu: Faulty UTCL2 client ID: MP1 (0x0)
amdgpu: MORE_FAULTS: 0x1
amdgpu: WALKER_ERROR: 0x0
amdgpu: PERMISSION_FAULTS: 0x5
amdgpu: MAPPING_ERROR: 0x0
amdgpu: RW: 0x1

When memory is allocated by kfd, no one triggers the tlb flush for MMHUB0.
There is page fault from MMHUB0.

v2:fix indentation
v3:change subject and fix indentation

Signed-off-by: Ruili Ji <ruiliji2@amd.com>
Reviewed-by: Philip Yang <philip.yang@amd.com>
Reviewed-by: Aaron Liu <aaron.liu@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-06-23 17:21:53 -04:00
Andrey Grodzovsky
cf72704414 drm/amdgpu: Rename amdgpu_device_gpu_recover_imp back to amdgpu_device_gpu_recover
We removed the wrapper that was queueing the recover function
into reset domain queue who was using this name.

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-06-10 15:26:12 -04:00
Andrey Grodzovsky
b5fd0cf3ea drm/amdgpu: Add work_struct for GPU reset from kfd.
We need to have a work_struct to cancel this reset if another
already in progress.

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-06-10 15:26:07 -04:00
Mukul Joshi
cc009e613d drm/amdkfd: Add KFD support for soc21 v3
Add initial support for soc21 in KFD compute
driver (Mukul)
- Add new definition for soc21 device.
- Add new file for amdgpu-kfd interface for GFX11 family.
- Add new file for queue management, interrupt handling,
  mqd management for GFX11 family in KFD driver.
- Related changes/updates for soc21 device in
  KFD driver.
- Repurpose last 2 entries of SDMA MQD for driver use.

v2: Add an optional argument into update queue operation (Mukul)

v3: Switch to ip version check, replace kgd_dev with
    amdgpu_device (Hawking)

Signed-off-by: Mukul Joshi <mukul.joshi@amd.com>
Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Oak Zeng <Oak.Zeng@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-05-04 10:43:54 -04:00
Felix Kuehling
3cd3e731f3 drm/amdkfd: Fix NULL pointer dereference
Check that adev->gfx.ras is valid before using it.

Fixes: 6475ae2b74 ("drm/amdgpu: add UTCL2 RAS poison query for Aldebaran (v2)")
CC: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
Reviewed-by: Mukul Joshi <mukul.joshi@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-04-07 16:37:39 -04:00
Tao Zhou
6475ae2b74 drm/amdgpu: add UTCL2 RAS poison query for Aldebaran (v2)
Add help functions to query and reset RAS UTCL2 poison status.

v2: implement it on amdgpu side and kfd only calls it.

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-03-25 12:40:26 -04:00
Nirmoy Das
ffb378fb30 drm/amdkfd: remove unused function
Remove unused amdgpu_amdkfd_get_vram_usage()

CC: Felix.Kuehling@amd.com

Signed-off-by: Nirmoy Das <nirmoy.das@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Fixes: dfcbe6d5f4 ("drm/amdgpu: Remove unused function pointers")
2022-01-11 15:44:26 -05:00
Tao Zhou
fec8c5244f drm/amdgpu: save error count in RAS poison handler
Otherwise the RAS error count couldn't be queried from sysfs.

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Stanley.Yang <Stanley.Yang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-12-30 08:54:45 -05:00
Tao Zhou
b6485bed40 drm/amdkfd: reset queue which consumes RAS poison (v2)
CP supports unmap queue with reset mode which only destroys specific queue without affecting others.
Replacing whole gpu reset with reset queue mode for RAS poison consumption
saves much time, and we can also fallback to gpu reset solution if reset
queue fails.

v2: Return directly if process is NULL;
    Reset queue solution is not applicable to SDMA, fallback to legacy
way;
    Call kfd_unref_process after lookup process.

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Acked-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-12-28 16:02:59 -05:00
Graham Sider
02274fc0f6 drm/amdkfd: replace trivial funcs with direct access
These get funcs simply return an adev field. Replace funcs/calls with
direct field accesses instead.

Signed-off-by: Graham Sider <Graham.Sider@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-11-17 16:58:11 -05:00
Graham Sider
b5d1d755c1 drm/amdkfd: remove kgd_dev declaration and initialization
Completes removal of kgd_dev. Direct references to amdgpu_device objects
should now be used instead.

Signed-off-by: Graham Sider <Graham.Sider@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-11-17 16:58:03 -05:00
Graham Sider
574c4183ef drm/amdkfd: replace kgd_dev in get amdgpu_amdkfd funcs
Modified definitions:

- amdgpu_amdkfd_get_fw_version
- amdgpu_amdkfd_get_local_mem_info
- amdgpu_amdkfd_get_gpu_clock_counter
- amdgpu_amdkfd_get_max_engine_clock_in_mhz
- amdgpu_amdkfd_get_cu_info
- amdgpu_amdkfd_get_dmabuf_info
- amdgpu_amdkfd_get_vram_usage
- amdgpu_amdkfd_get_hive_id
- amdgpu_amdkfd_get_unique_id
- amdgpu_amdkfd_get_mmio_remap_phys_addr
- amdgpu_amdkfd_get_num_gws
- amdgpu_amdkfd_get_asic_rev_id
- amdgpu_amdkfd_get_noretry
- amdgpu_amdkfd_get_xgmi_hops_count
- amdgpu_amdkfd_get_xgmi_bandwidth_mbytes
- amdgpu_amdkfd_get_pcie_bandwidth_mbytes

Also replaces kfd_device_by_kgd with kfd_device_by_adev, now
searching via adev rather than kgd.

Signed-off-by: Graham Sider <Graham.Sider@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-11-17 16:58:02 -05:00
Graham Sider
6bfc7c7e17 drm/amdkfd: replace kgd_dev in various amgpu_amdkfd funcs
Modified definitions:

- amdgpu_amdkfd_submit_ib
- amdgpu_amdkfd_set_compute_idle
- amdgpu_amdkfd_have_atomics_support
- amdgpu_amdkfd_flush_gpu_tlb_pasid
- amdgpu_amdkfd_flush_gpu_tlb_pasid
- amdgpu_amdkfd_gpu_reset
- amdgpu_amdkfd_alloc_gtt_mem
- amdgpu_amdkfd_free_gtt_mem
- amdgpu_amdkfd_alloc_gws
- amdgpu_amdkfd_free_gws
- amdgpu_amdkfd_ras_poison_consumption_handler

Signed-off-by: Graham Sider <Graham.Sider@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-11-17 16:58:01 -05:00
Alex Deucher
5b983db8c3 drm/amdkfd: clean up parameters in kgd2kfd_probe
We can get the pdev and asic type from the adev.  No need
to pass them explicitly.

v2: squash in build fix for !CONFIG_HSA_AMD from Anson

Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-10-04 15:23:02 -04:00
Tao Zhou
c749094923 amd/amdkfd: add ras page retirement handling for sq/sdma (v3)
In ras poison mode, page retirement will be handled by the irq handler of the
module which consumes corrupted data.

v2: rename ras_process_cb to ras_poison_consumption_handler.
    move the handler's implementation from ASIC specific file to common
file.

v3: call gpu reset for xGMI connected mode.

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-10-04 15:22:57 -04:00
James Zhu
8066008482 drm/amdgpu: add amdgpu_amdkfd_resume_iommu
Add amdgpu_amdkfd_resume_iommu for amdgpu.

Bug: https://bugzilla.kernel.org/show_bug.cgi?id=211277
Signed-off-by: James Zhu <James.Zhu@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org
2021-09-16 09:56:24 -04:00
Jack Zhang
c530b02f39 drm/amd/amdgpu embed hw_fence into amdgpu_job
Why: Previously hw fence is alloced separately with job.
It caused historical lifetime issues and corner cases.
The ideal situation is to take fence to manage both job
and fence's lifetime, and simplify the design of gpu-scheduler.

How:
We propose to embed hw_fence into amdgpu_job.
1. We cover the normal job submission by this method.
2. For ib_test, and submit without a parent job keep the
legacy way to create a hw fence separately.
v2:
use AMDGPU_FENCE_FLAG_EMBED_IN_JOB_BIT to show that the fence is
embedded in a job.
v3:
remove redundant variable ring in amdgpu_job
v4:
add tdr sequence support for this feature. Add a job_run_counter to
indicate whether this job is a resubmit job.
v5
add missing handling in amdgpu_fence_enable_signaling

Signed-off-by: Jingwen Chen <Jingwen.Chen2@amd.com>
Signed-off-by: Jack Zhang <Jack.Zhang7@hotmail.com>
Reviewed-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Reviewed by: Monk Liu <monk.liu@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-08-16 15:16:58 -04:00
Jonathan Kim
9330481038 drm/amdkfd: report pcie bandwidth to the kfd
Similar to xGMI reporting the min/max bandwidth between direct peers, PCIe
will report the min/max bandwidth to the KFD.

Signed-off-by: Jonathan Kim <jonathan.kim@amd.com>
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-07-23 10:07:59 -04:00
Jonathan Kim
3f46c4e9ce drm/amdkfd: report xgmi bandwidth between direct peers to the kfd
Report the min/max bandwidth in megabytes to the kfd for direct
xgmi connections only.  Indirect peers will report 0 since
indirect route is unknown.

Signed-off-by: Jonathan Kim <jonathan.kim@amd.com>
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-07-23 10:07:59 -04:00
Thomas Zimmermann
304ba5dca4 Merge drm/drm-next into drm-misc-next
Backmerging from drm/drm-next to the patches for AMD devices
for v5.14.

Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de>
2021-05-22 07:17:05 +02:00
Andrey Grodzovsky
e9669fb782 drm/amdgpu: Add early fini callback
Use it to call disply code dependent on device->drv_data
before it's set to NULL on device unplug

v5:
Move HW finilization into this callback to prevent MMIO accesses
post cpi remove.

v7:
Split kfd suspend from device exit to expdite HW related
stuff to amdgpu_pci_remove

v8:
Squash previous KFD commit into this commit to avoid compile break.

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Acked-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20210520032057.497334-1-andrey.grodzovsky@amd.com
2021-05-19 23:48:50 -04:00