1
0
Fork 0
mirror of synced 2025-03-06 20:59:54 +01:00
Commit graph

13998 commits

Author SHA1 Message Date
Stanley.Yang
939c475181 drm/amdgpu: Support setting reset_method at runtime
In order to support more test cases, support user
change reset_method at runtime.

Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-04-23 12:08:30 -04:00
Maxime Ripard
c058e7a8f8
Merge drm/drm-next into drm-misc-next
Maíra needs a backmerge to apply v3d patches, and Danilo for some
nouveau patches.

Signed-off-by: Maxime Ripard <mripard@kernel.org>
2024-04-23 08:48:56 +02:00
Arunpravin Paneer Selvam
a68c7eaa7a drm/amdgpu: Enable clear page functionality
Add clear page support in vram memory region.

v1(Christian):
  - Dont handle clear page as TTM flag since when moving the BO back
    in from GTT again we don't need that.
  - Make a specialized version of amdgpu_fill_buffer() which only
    clears the VRAM areas which are not already cleared
  - Drop the TTM_PL_FLAG_WIPE_ON_RELEASE check in
    amdgpu_object.c

v2:
  - Modify the function name amdgpu_ttm_* (Alex)
  - Drop the delayed parameter (Christian)
  - handle amdgpu_res_cleared(&cursor) just above the size
    calculation (Christian)
  - Use AMDGPU_GEM_CREATE_VRAM_WIPE_ON_RELEASE for clearing the buffers
    in the free path to properly wait for fences etc.. (Christian)

v3(Christian):
  - Remove buffer clear code in VRAM manager instead change the
    AMDGPU_GEM_CREATE_VRAM_WIPE_ON_RELEASE handling to set
    the DRM_BUDDY_CLEARED flag.
  - Remove ! from amdgpu_res_cleared(&cursor) check.

v4(Christian):
  - vres flag setting move to vram manager file
  - use dma_fence_get_stub in amdgpu_ttm_clear_buffer function
  - make fence a mandatory parameter and drop the if and the get/put dance

Signed-off-by: Arunpravin Paneer Selvam <Arunpravin.PaneerSelvam@amd.com>
Suggested-by: Christian König <christian.koenig@amd.com>
Acked-by: Felix Kuehling <felix.kuehling@amd.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240419063538.11957-2-Arunpravin.PaneerSelvam@amd.com
Signed-off-by: Christian König <christian.koenig@amd.com>
2024-04-22 19:44:16 +02:00
Arunpravin Paneer Selvam
96950929eb drm/buddy: Implement tracking clear page feature
- Add tracking clear page feature.

- Driver should enable the DRM_BUDDY_CLEARED flag if it
  successfully clears the blocks in the free path. On the otherhand,
  DRM buddy marks each block as cleared.

- Track the available cleared pages size

- If driver requests cleared memory we prefer cleared memory
  but fallback to uncleared if we can't find the cleared blocks.
  when driver requests uncleared memory we try to use uncleared but
  fallback to cleared memory if necessary.

- When a block gets freed we clear it and mark the freed block as cleared,
  when there are buddies which are cleared as well we can merge them.
  Otherwise, we prefer to keep the blocks as separated.

- Add a function to support defragmentation.

v1:
  - Depends on the flag check DRM_BUDDY_CLEARED, enable the block as
    cleared. Else, reset the clear flag for each block in the list(Christian)
  - For merging the 2 cleared blocks compare as below,
    drm_buddy_is_clear(block) != drm_buddy_is_clear(buddy)(Christian)
  - Defragment the memory beginning from min_order
    till the required memory space is available.

v2: (Matthew)
  - Add a wrapper drm_buddy_free_list_internal for the freeing of blocks
    operation within drm buddy.
  - Write a macro block_incompatible() to allocate the required blocks.
  - Update the xe driver for the drm_buddy_free_list change in arguments.
  - add a warning if the two blocks are incompatible on
    defragmentation
  - call full defragmentation in the fini() function
  - place a condition to test if min_order is equal to 0
  - replace the list with safe_reverse() variant as we might
    remove the block from the list.

v3:
  - fix Gitlab user reported lockup issue.
  - Keep DRM_BUDDY_HEADER_CLEAR define sorted(Matthew)
  - modify to pass the root order instead max_order in fini()
    function(Matthew)
  - change bool 1 to true(Matthew)
  - add check if min_block_size is power of 2(Matthew)
  - modify the min_block_size datatype to u64(Matthew)

v4:
  - rename the function drm_buddy_defrag with __force_merge.
  - Include __force_merge directly in drm buddy file and remove
    the defrag use in amdgpu driver.
  - Remove list_empty() check(Matthew)
  - Remove unnecessary space, headers and placement of new variables(Matthew)
  - Add a unit test case(Matthew)

v5:
  - remove force merge support to actual range allocation and not to bail
    out when contains && split(Matthew)
  - add range support to force merge function.

v6:
  - modify the alloc_range() function clear page non merged blocks
    allocation(Matthew)
  - correct the list_insert function name(Matthew).

Signed-off-by: Arunpravin Paneer Selvam <Arunpravin.PaneerSelvam@amd.com>
Signed-off-by: Matthew Auld <matthew.auld@intel.com>
Suggested-by: Christian König <christian.koenig@amd.com>
Suggested-by: Matthew Auld <matthew.auld@intel.com>
Reviewed-by: Matthew Auld <matthew.auld@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240419063538.11957-1-Arunpravin.PaneerSelvam@amd.com
Signed-off-by: Christian König <christian.koenig@amd.com>
2024-04-22 19:44:16 +02:00
Dave Airlie
377b5b397d amd-drm-next-6.10-2024-04-19:
amdgpu:
 - DC resource allocation logic updates
 - DC IPS fixes
 - DC YUV fixes
 - DMCUB fixes
 - DML2 fixes
 - Devcoredump updates
 - USB-C DSC fix
 - Misc display code cleanups
 - PSR fixes
 - MES timeout fix
 - RAS updates
 - UAF fix in VA IOCTL
 - Fix visible VRAM handling during faults
 - Fix IP discovery handling during PCI rescans
 - Misc code cleanups
 - PSP 14 updates
 - More runtime PM code rework
 - SMU 14.0.2 support
 - GPUVM page fault redirection to secondary IH rings for IH 6.x
 - Suspend/resume fixes
 - SR-IOV fixes
 
 amdkfd:
 - Fix eviction fence handling
 - Fix leak in GPU memory allocation failure case
 - DMABuf import handling fix
 
 radeon:
 - Silence UBSAN warnings related to flexible arrays
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQQgO5Idg2tXNTSZAr293/aFa7yZ2AUCZiLyawAKCRC93/aFa7yZ
 2BzfAPoDVZjTunizh6SyCFmQamR3eelnxWeY1xaVzmKBHqLCOAEAo2EyThRGyPCH
 SjD+f+ZlflaXQZtZpiQrOr0rkLvh5Q4=
 =96hT
 -----END PGP SIGNATURE-----

Merge tag 'amd-drm-next-6.10-2024-04-19' of https://gitlab.freedesktop.org/agd5f/linux into drm-next

amd-drm-next-6.10-2024-04-19:

amdgpu:
- DC resource allocation logic updates
- DC IPS fixes
- DC YUV fixes
- DMCUB fixes
- DML2 fixes
- Devcoredump updates
- USB-C DSC fix
- Misc display code cleanups
- PSR fixes
- MES timeout fix
- RAS updates
- UAF fix in VA IOCTL
- Fix visible VRAM handling during faults
- Fix IP discovery handling during PCI rescans
- Misc code cleanups
- PSP 14 updates
- More runtime PM code rework
- SMU 14.0.2 support
- GPUVM page fault redirection to secondary IH rings for IH 6.x
- Suspend/resume fixes
- SR-IOV fixes

amdkfd:
- Fix eviction fence handling
- Fix leak in GPU memory allocation failure case
- DMABuf import handling fix

radeon:
- Silence UBSAN warnings related to flexible arrays

Signed-off-by: Dave Airlie <airlied@redhat.com>

From: Alex Deucher <alexander.deucher@amd.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240419224332.2938259-1-alexander.deucher@amd.com
2024-04-22 12:28:49 +10:00
Lang Yu
81bf14519a drm/amdkfd: make sure VM is ready for updating operations
When page table BOs were evicted but not validated before
updating page tables, VM is still in evicting state,
amdgpu_vm_update_range returns -EBUSY and
restore_process_worker runs into a dead loop.

v2: Split the BO validation and page table update into two
separate loops in amdgpu_amdkfd_restore_process_bos. (Felix)
  1.Validate BOs
  2.Validate VM (and DMABuf attachments)
  3.Update page tables for the BOs validated above

Fixes: 50661eb1a2 ("drm/amdgpu: Auto-validate DMABuf imports in compute VMs")
Signed-off-by: Lang Yu <Lang.Yu@amd.com>
Acked-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-04-18 23:54:49 -04:00
Mukul Joshi
e53a1713de drm/amdgpu: Fix leak when GPU memory allocation fails
Free the sync object if the memory allocation fails for any
reason.

Signed-off-by: Mukul Joshi <mukul.joshi@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-04-18 23:54:49 -04:00
Zhigang Luo
6a009ca1bf drm/amdgpu: remove virt_init_data_exchange from poison consumption handler
Host will initiate an FLR for all poison consumption.
Guest should wait for FLR message to re-init data exchange.

Signed-off-by: Zhigang Luo <Zhigang.Luo@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-04-18 23:47:26 -04:00
Lijo Lazar
8954c3fbe7 drm/amdgpu: Change AID detection logic
On GFX 9.4.3 SOCs, only 2 SDMA instances need to be available to be
considered as a valid AID.

Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Asad Kamal <asad.kamal@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-04-18 23:47:19 -04:00
Sunil Khatri
93522c1948 drm/amdgpu: enable redirection of irq's for IH V6.1
Enable redirection of irq for pagefaults for specific
clients to avoid overflow without dropping interrupts.

So here we redirect the interrupts to another IH ring
i.e ring1 where only these interrupts are processed.

Signed-off-by: Sunil Khatri <sunil.khatri@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-04-18 23:46:56 -04:00
Ahmad Rehman
ea137071ad drm/amdgpu: Skip the coredump collection on reset during driver reload
In passthrough environment, the driver triggers the mode-1 reset on
reload. The reset causes the core dump collection which is delayed task
and prevents driver from unloading until it is completed. Since we do
not need to collect data on "reset on reload" case, we can skip core
dump collection.

v2: Use the same flag to avoid calling amdgpu_reset_reg_dumps as well.

Signed-off-by: Ahmad Rehman <Ahmad.Rehman@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-04-18 23:46:45 -04:00
Sunil Khatri
ca0afa2f41 drm/amdgpu: enable redirection of irq's for IH V6.0
Enable redirection of irq for pagefaults for specific
clients to avoid overflow without dropping interrupts.

So here we redirect the interrupts to another IH ring
i.e ring1 where only these interrupts are processed.

Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Sunil Khatri <sunil.khatri@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-04-18 23:46:37 -04:00
Sunil Khatri
5adcd78fa2 drm:amdgpu: enable IH ring1 for IH v6.1
We need IH ring1 for handling the pagefault
interrupts which over flow in default
ring for specific usecases.

Signed-off-by: Sunil Khatri <sunil.khatri@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-04-18 23:45:37 -04:00
Sunil Khatri
eefc85a277 drm:amdgpu: enable IH RB ring1 for IH v6.0
We need IH ring1 for handling the pagefault
interrupts which are overflowing the default
ring for specific usecases.

Signed-off-by: Sunil Khatri <sunil.khatri@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-04-18 23:45:22 -04:00
Christian König
a6ff969fe9 drm/amdgpu: fix visible VRAM handling during faults
When we removed the hacky start code check we actually didn't took into
account that *all* VRAM pages needs to be CPU accessible.

Clean up the code and unify the handling into a single helper which
checks if the whole resource is CPU accessible.

The only place where a partial check would make sense is during
eviction, but that is neglitible.

Signed-off-by: Christian König <christian.koenig@amd.com>
Fixes: aed01a6804 ("drm/amdgpu: Remove TTM resource->start visible VRAM condition v2")
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
CC: stable@vger.kernel.org
2024-04-17 11:50:43 -04:00
xinhui pan
6fef2d4c00 drm/amdgpu: validate the parameters of bo mapping operations more clearly
Verify the parameters of
amdgpu_vm_bo_(map/replace_map/clearing_mappings) in one common place.

Fixes: dc54d3d174 ("drm/amdgpu: implement AMDGPU_VA_OP_CLEAR v2")
Cc: stable@vger.kernel.org
Reported-by: Vlad Stolyarov <hexed@google.com>
Suggested-by: Christian König <christian.koenig@amd.com>
Signed-off-by: xinhui pan <xinhui.pan@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-04-17 11:50:43 -04:00
Christian König
ca7c4507ba drm/amdgpu: remove invalid resource->start check v2
The majority of those where removed in the commit aed01a6804
("drm/amdgpu: Remove TTM resource->start visible VRAM condition v2")

But this one was missed because it's working on the resource and not the
BO. Since we also no longer use a fake start address for visible BOs
this will now trigger invalid mapping errors.

v2: also remove the unused variable

Signed-off-by: Christian König <christian.koenig@amd.com>
Fixes: aed01a6804 ("drm/amdgpu: Remove TTM resource->start visible VRAM condition v2")
CC: stable@vger.kernel.org
Acked-by: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-04-17 11:02:57 -04:00
Dave Airlie
34633158b8 amd-drm-next-6.10-2024-04-13:
amdgpu:
 - HDCP fixes
 - ODM fixes
 - RAS fixes
 - Devcoredump improvements
 - Misc code cleanups
 - Expose VCN activity via sysfs
 - SMY 13.0.x updates
 - Enable fast updates on DCN 3.1.4
 - Add dclk and vclk reporting on additional devices
 - Add ACA RAS infrastructure
 - Implement TLB flush fence
 - EEPROM handling fixes
 - SMUIO 14.0.2 support
 - SMU 14.0.1 Updates
 - Sync page table freeing with TLB flushes
 - DML2 refactor
 - DC debug improvements
 - SR-IOV fixes
 - Suspend and Resume fixes
 - DCN 3.5.x Updates
 - Z8 fixes
 - UMSCH fixes
 - GPU reset fixes
 - HDP fix for second GFX pipe on GC 10.x
 - Enable secondary GFX pipe on GC 10.3
 - Refactor and clean up BACO/BOCO/BAMACO handling
 - VCN partitioning fix
 - DC DWB fixes
 - VSC SDP fixes
 - DCN 3.1.6 fix
 - GC 11.5 fixes
 - Remove invalid TTM resource start check
 - DCN 1.0 fixes
 
 amdkfd:
 - MQD handling cleanup
 - Preemption handling fixes for XCDs
 - TLB flush fix for GC 9.4.2
 - Properly clean up workqueue during module unload
 - Fix memory leak process create failure
 - Range check CP bad op exception targets to avoid reporting invalid exceptions to userspace
 
 radeon:
 - Misc code cleanups
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQQgO5Idg2tXNTSZAr293/aFa7yZ2AUCZhr4EAAKCRC93/aFa7yZ
 2B8jAP9z1JpOnjSQvc2mhHAooXRYO4Mj5HCQ25ZE8N4c8ZZhjAEAqefmEx5/UyLh
 lv2pWILL4o597qhq9nA7hJ6tTICLPAU=
 =HUwY
 -----END PGP SIGNATURE-----

Merge tag 'amd-drm-next-6.10-2024-04-13' of https://gitlab.freedesktop.org/agd5f/linux into drm-next

amd-drm-next-6.10-2024-04-13:

amdgpu:
- HDCP fixes
- ODM fixes
- RAS fixes
- Devcoredump improvements
- Misc code cleanups
- Expose VCN activity via sysfs
- SMY 13.0.x updates
- Enable fast updates on DCN 3.1.4
- Add dclk and vclk reporting on additional devices
- Add ACA RAS infrastructure
- Implement TLB flush fence
- EEPROM handling fixes
- SMUIO 14.0.2 support
- SMU 14.0.1 Updates
- Sync page table freeing with TLB flushes
- DML2 refactor
- DC debug improvements
- SR-IOV fixes
- Suspend and Resume fixes
- DCN 3.5.x Updates
- Z8 fixes
- UMSCH fixes
- GPU reset fixes
- HDP fix for second GFX pipe on GC 10.x
- Enable secondary GFX pipe on GC 10.3
- Refactor and clean up BACO/BOCO/BAMACO handling
- VCN partitioning fix
- DC DWB fixes
- VSC SDP fixes
- DCN 3.1.6 fix
- GC 11.5 fixes
- Remove invalid TTM resource start check
- DCN 1.0 fixes

amdkfd:
- MQD handling cleanup
- Preemption handling fixes for XCDs
- TLB flush fix for GC 9.4.2
- Properly clean up workqueue during module unload
- Fix memory leak process create failure
- Range check CP bad op exception targets to avoid reporting invalid exceptions to userspace

radeon:
- Misc code cleanups

From: Alex Deucher <alexander.deucher@amd.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240413213708.3427038-1-alexander.deucher@amd.com
Signed-off-by: Dave Airlie <airlied@redhat.com>
2024-04-17 15:48:59 +10:00
Kenneth Feng
0c1195ca0d drm/amd/swsmu: support smu block discovery for smu v14
Support for smu ip block add for SMU v14.

Signed-off-by: Kenneth Feng <kenneth.feng@amd.com>
Reviewed-by: Likun Gao <Likun.Gao@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-04-16 22:39:16 -04:00
Hawking Zhang
577cbed318 drm/amdgpu: rename DBG_DRV to HAD_DRV for psp v14
Add a psp bl command enum for HAD_DRV.

Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Likun Gao <Likun.Gao@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-04-16 22:39:16 -04:00
Ma Jun
1347853271 drm/amdgpu: refactoring the runtime pm mode detection code
refactor the code of runtime pm mode detection to support
amdgpu_runtime_pm =2 and 1 two cases

Signed-off-by: Ma Jun <Jun.Ma2@amd.com>
Reviewed-by: Yang Wang <kevinyang.wang@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-04-16 22:39:16 -04:00
Hawking Zhang
6c6acc5f33 drm/amdgpu: Load ipkeymgr drv for psp v14
while DBG_DRV is renamed to HAD_DRV for psp v14,
part of its APIs/functionality is moved to a new
component named Ipkeymgr_Drv.

Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Likun Gao <Likun.Gao@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-04-16 22:39:16 -04:00
Thorsten Blum
12b8b4e685 drm/amdgpu: Add missing space to DRM_WARN() message
s/,please/, please/

Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Thorsten Blum <thorsten.blum@toblux.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-04-16 22:39:15 -04:00
Ma Jun
959056982a drm/amdgpu: Fix discovery initialization failure during pci rescan
Waiting for system ready to fix the discovery initialization
failure issue. This failure usually occurs when dGPU is removed
and then rescanned via command line.
It's caused by following two errors:
[1] vram size is 0
[2] wrong binary signature

Signed-off-by: Ma Jun <Jun.Ma2@amd.com>
Acked-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-04-16 22:39:15 -04:00
Christian König
394ae0603a drm/amdgpu: fix visible VRAM handling during faults
When we removed the hacky start code check we actually didn't took into
account that *all* VRAM pages needs to be CPU accessible.

Clean up the code and unify the handling into a single helper which
checks if the whole resource is CPU accessible.

The only place where a partial check would make sense is during
eviction, but that is neglitible.

Signed-off-by: Christian König <christian.koenig@amd.com>
Fixes: aed01a6804 ("drm/amdgpu: Remove TTM resource->start visible VRAM condition v2")
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
CC: stable@vger.kernel.org
2024-04-16 22:39:15 -04:00
xinhui pan
98856136c4 drm/amdgpu: validate the parameters of bo mapping operations more clearly
Verify the parameters of
amdgpu_vm_bo_(map/replace_map/clearing_mappings) in one common place.

Fixes: dc54d3d174 ("drm/amdgpu: implement AMDGPU_VA_OP_CLEAR v2")
Cc: stable@vger.kernel.org
Reported-by: Vlad Stolyarov <hexed@google.com>
Suggested-by: Christian König <christian.koenig@amd.com>
Signed-off-by: xinhui pan <xinhui.pan@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-04-16 22:39:15 -04:00
Yang Wang
f23558627f drm/amdgpu: add new aca smu callback func parse_error_code()
add new aca smu callback parse_error_code{} to avoid specific asic check
in amdgpu_aca.c file

Signed-off-by: Yang Wang <kevinyang.wang@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-04-16 22:39:15 -04:00
Jonathan Kim
f7c161a4c2 drm/amdgpu: increase mes submission timeout
MES internally has a timeout allowance of 2 seconds.
Increase driver timeout to 3 seconds to be safe.

Signed-off-by: Jonathan Kim <Jonathan.Kim@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-04-16 22:38:59 -04:00
Sunil Khatri
3c858cf65e drm/amdgpu: add missing vbios version from devcoredump
Add vbios version in the devcoredump along with formatting
the information with proper alignment.

Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Sunil Khatri <sunil.khatri@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-04-16 21:25:28 -04:00
Alex Deucher
8b9130bae0 drm/amdgpu/gfx11: properly handle regGRBM_GFX_CNTL in soft reset
Need to take the srbm_mutex and while we are here, use the
helper function soc21_grbm_select();

Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-04-16 21:25:23 -04:00
Christian König
c8962679af drm/amdgpu: remove invalid resource->start check v2
The majority of those where removed in the commit aed01a6804
("drm/amdgpu: Remove TTM resource->start visible VRAM condition v2")

But this one was missed because it's working on the resource and not the
BO. Since we also no longer use a fake start address for visible BOs
this will now trigger invalid mapping errors.

v2: also remove the unused variable

Signed-off-by: Christian König <christian.koenig@amd.com>
Fixes: aed01a6804 ("drm/amdgpu: Remove TTM resource->start visible VRAM condition v2")
CC: stable@vger.kernel.org
Acked-by: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-04-12 00:33:51 -04:00
Jack Xiao
a0e002cdac drm/amdgpu/sdma6: set sdma hang watchdog
Set SDMAx_WATCHDOG_CNTL.QUEUE_HANG_COUNT registers
to improve SDMA reliability.

Signed-off-by: Jack Xiao <Jack.Xiao@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-04-12 00:33:43 -04:00
Luqmaan Irshad
6b0d78032f drm/amd/amdgpu: Update PF2VF Header
Adding a new field for GPU Capacity to align the header with the host.

Signed-off-by: Luqmaan Irshad <Luqmaan.Irshad@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-04-12 00:33:11 -04:00
Yifan Zhang
6dba20d23e drm/amdgpu: differentiate external rev id for gfx 11.5.0
This patch to differentiate external rev id for gfx 11.5.0.

Signed-off-by: Yifan Zhang <yifan1.zhang@amd.com>
Reviewed-by: Tim Huang <Tim.Huang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org
2024-04-10 00:00:32 -04:00
Tim Huang
bbca7f414a drm/amdgpu: fix incorrect number of active RBs for gfx11
The RB bitmap should be global active RB bitmap &
active RB bitmap based on active SA.

Signed-off-by: Tim Huang <Tim.Huang@amd.com>
Reviewed-by: Yifan Zhang <yifan1.zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org
2024-04-09 23:30:30 -04:00
ZhenGuo Yin
e33997e18d drm/amdgpu: clear set_q_mode_offs when VM changed
[Why]
set_q_mode_offs don't get cleared after GPU reset, nexting SET_Q_MODE
packet to init shadow memory will be skiped, hence there has a page fault.

[How]
VM flush is needed after GPU reset, clear set_q_mode_offs when
emitting VM flush.

Fixes: 8bc75586ea ("drm/amdgpu: workaround to avoid SET_Q_MODE packets v2")
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: ZhenGuo Yin <zhenguo.yin@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org
2024-04-09 23:27:58 -04:00
Lijo Lazar
f7e232de51 drm/amdgpu: Fix VCN allocation in CPX partition
VCN need not be shared in CPX mode always for all GFX 9.4.3 SOC SKUs. In
certain configs, VCN instance can be exclusively allocated to a
partition even under CPX mode.

Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: James Zhu <James.Zhu@amd.com>
Reviewed-by: Asad Kamal <asad.kamal@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-04-09 23:27:23 -04:00
Kenneth Feng
3818708e9c drm/amd/pm: fix the high voltage issue after unload
fix the high voltage issue after unload on smu 13.0.10

Signed-off-by: Kenneth Feng <kenneth.feng@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-04-09 23:26:32 -04:00
Tao Zhou
f886b49fea drm/amdgpu: implement IRQ_STATE_ENABLE for SDMA v4.4.2
SDMA_CNTL is not set in some cases, driver configures it by itself.

v2: simplify code

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-04-09 23:16:51 -04:00
Yifan Zhang
533eefb9be drm/amdgpu: add smu 14.0.1 discovery support
This patch to add smu 14.0.1 support

Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Yifan Zhang <yifan1.zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-04-09 23:15:05 -04:00
shaoyunl
5b0cd091d9 drm/amdgpu : Increase the mes log buffer size as per new MES FW version
From MES version 0x54, the log entry increased and require the log buffer
size to be increased. The 16k is maximum size agreed

Signed-off-by: shaoyunl <shaoyun.liu@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-04-09 23:14:40 -04:00
shaoyunl
a3a4c0b123 drm/amdgpu : Add mes_log_enable to control mes log feature
The MES log might slow down the performance for extra step of log the data,
disable it by default and introduce a parameter can enable it when necessary

Signed-off-by: shaoyunl <shaoyun.liu@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-04-09 23:14:18 -04:00
Lijo Lazar
8b2be55f4d drm/amdgpu: Reset dGPU if suspend got aborted
For SOC21 ASICs, there is an issue in re-enabling PM features if a
suspend got aborted. In such cases, reset the device during resume
phase. This is a workaround till a proper solution is finalized.

Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Yang Wang <kevinyang.wang@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org
2024-04-09 23:12:44 -04:00
Lang Yu
0f1bbcc2ba drm/amdgpu/umsch: reinitialize write pointer in hw init
Otherwise the old one will be used during GPU reset.
That's not expected.

Signed-off-by: Lang Yu <Lang.Yu@amd.com>
Reviewed-by: Feifei Xu <Feifei.Xu@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org
2024-04-09 23:12:20 -04:00
Lijo Lazar
4b18a91faf drm/amdgpu: Refine IB schedule error logging
Downgrade to debug information when IBs are skipped. Also, use dev_* to
identify the device.

Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Asad Kamal <asad.kamal@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-04-09 23:11:59 -04:00
Alex Deucher
65ff8092e4 drm/amdgpu: always force full reset for SOC21
There are cases where soft reset seems to succeed, but
does not, so always use mode1/2 for now.

Reviewed-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org
2024-04-09 23:10:06 -04:00
Yifan Zhang
526b184e88 drm/amdgpu: differentiate external rev id for gfx 11.5.0
This patch to differentiate external rev id for gfx 11.5.0.

Signed-off-by: Yifan Zhang <yifan1.zhang@amd.com>
Reviewed-by: Tim Huang <Tim.Huang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-04-09 22:20:01 -04:00
Tim Huang
d6d6561f93 drm/amdgpu: fix incorrect number of active RBs for gfx11
The RB bitmap should be global active RB bitmap &
active RB bitmap based on active SA.

Signed-off-by: Tim Huang <Tim.Huang@amd.com>
Reviewed-by: Yifan Zhang <yifan1.zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-04-09 22:14:55 -04:00
Zhigang Luo
d1999b4017 amd/amdgpu: improve VF recover time
1. change AMDGPU_VF2PF_UPDATE_MAX_RETRY_LIMIT from 30 to 5.
2. set fatel error detected flag.

Signed-off-by: Zhigang Luo <Zhigang.Luo@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-04-09 22:14:30 -04:00
Lijo Lazar
b41f742d6f drm/amdgpu: Set fatal errror detected flag earlier
In case of fatal errors, set FED status when interrupt is received. Set
the flag on other devices in the hive before RAS recovery work.

Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Asad Kamal <asad.kamal@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-04-09 22:13:36 -04:00