1
0
Fork 0
mirror of synced 2025-03-06 20:59:54 +01:00
Commit graph

1244 commits

Author SHA1 Message Date
Vignesh Chander
2efc30f016 drm/amdgpu: Fix hive reference count leak
both get_xgmi_hive and put_xgmi_hive can be skipped since the
reset domain is not necessary for VF

Signed-off-by: Vignesh Chander <Vignesh.Chander@amd.com>
Reviewed-by: Shaoyun Liu <Shaoyun.Liu@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-09-13 14:32:57 -04:00
shaoyunl
46c676600c drm/amdgpu: Use per device reset_domain for XGMI on sriov configuration
For SRIOV configuration, host driver control the reset method(either FLR or
heavier chain reset). The host will notify the guest individually with FLR
message if individual GPU within the hive need to be reset. So for guest
side, no need to use hive->reset_domain to replace the original per
device reset_domain

Signed-off-by: shaoyunl <shaoyun.liu@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-09-13 12:54:23 -04:00
YiPeng Chai
fac53471d0 drm/amdgpu: TA unload messages are not actually sent to psp when amdgpu is uninstalled
V1:
  The psp_cmd_submit_buf function is called by psp_hw_fini to send
TA unload messages to psp to terminate ras, asd and tmr. But when
amdgpu is uninstalled, drm_dev_unplug is called earlier than
psp_hw_fini in amdgpu_pci_remove, the calling order as follows:
static void amdgpu_pci_remove(struct pci_dev *pdev) {
	drm_dev_unplug
	......
	amdgpu_driver_unload_kms->amdgpu_device_fini_hw->...
		->.hw_fini->psp_hw_fini->...
		->psp_ta_unload->psp_cmd_submit_buf
	......
}
The program will return when calling drm_dev_enter in psp_cmd_submit_buf.

So the call to drm_dev_enter in psp_cmd_submit_buf should be
removed, so that the TA unload messages can be sent to the psp
when amdgpu is uninstalled.

V2:
1. Restore psp_cmd_submit_buf to its original code.
2. Move drm_dev_unplug call after amdgpu_driver_unload_kms in
   amdgpu_pci_remove.
3. Since amdgpu_device_fini_hw is called by amdgpu_driver_unload_kms,
   remove the unplug check to release device mmio resource in
   amdgpu_device_fini_hw before calling drm_dev_unplug.

Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-09-07 22:21:00 -04:00
Alex Sierra
ab23c5b9c7 drm/amdgpu: ensure no PCIe peer access for CPU XGMI iolinks
[Why] Devices with CPU XGMI iolink do not support PCIe peer access.

Signed-off-by: Alex Sierra <alex.sierra@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-08-29 17:59:30 -04:00
Chengming Gui
d3ef9d57f2 drm/amd/amdgpu: avoid soft reset check when gpu recovery disabled
Avoid soft reset, even ip hang check (ring/ib test) when gpu recovery
disabled.

v2: add missing "}"

Signed-off-by: Chengming Gui <Jack.Gui@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-08-25 13:35:18 -04:00
shaoyunl
947f63f17e drm/amdgpu: Remove the additional kfd pre reset call for sriov
The additional call is caused by merge conflict

Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: shaoyunl <shaoyun.liu@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-08-22 16:47:09 -04:00
YiPeng Chai
9dfa4860ef drm/amdgpu: fix hive reference leak when adding xgmi device
Only amdgpu_get_xgmi_hive but no amdgpu_put_xgmi_hive
which will leak the hive reference.

Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-08-22 16:47:09 -04:00
André Almeida
0ad7347a64 drm/amd: Add detailed GFXOFF stats to debugfs
Add debugfs interface to log GFXOFF statistics:

- Read amdgpu_gfxoff_count to get the total GFXOFF entry count at the
  time of query since system power-up

- Write 1 to amdgpu_gfxoff_residency to start logging, and 0 to stop.
  Read it to get average GFXOFF residency % multiplied by 100
  during the last logging interval.

Both features are designed to be keep the values persistent between
suspends.

Signed-off-by: André Almeida <andrealmeid@igalia.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-08-16 18:17:31 -04:00
Victor Zhao
72fadb1367 drm/amdgpu: revert context to stop engine before mode2 reset
For some hang caused by slow tests, engine cannot be stopped which
may cause resume failure after reset. In this case, force halt
engine by reverting context addresses

Signed-off-by: Victor Zhao <Victor.Zhao@amd.com>
Acked-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-08-16 18:14:31 -04:00
Victor Zhao
dac6b80818 drm/amdgpu: let mode2 reset fallback to default when failure
- introduce AMDGPU_SKIP_MODE2_RESET flag
- let mode2 reset fallback to default reset method if failed

v2: move this part out from the asic specific part

Signed-off-by: Victor Zhao <Victor.Zhao@amd.com>
Acked-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-08-16 18:14:31 -04:00
Lijo Lazar
0a83bb35d8 drm/amdgpu: Avoid another list of reset devices
A list of devices to be reset is already created in
amdgpu_device_gpu_recover function. Creating another list with the
same nodes is incorrect and not supported in list_head. Instead, pass
the device list as part of reset context.

Fixes: 9e08564727 (drm/amdgpu: Refactor mode2 reset logic for v13.0.2)
Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-08-10 15:07:14 -04:00
Jack Xiao
ed67f7292b drm/amdgpu: move mes self test after drm sched re-started
mes self test rely on vm mapping, move it after
drm sched re-started so that vm mapping can work
during gpu reset.

Signed-off-by: Jack Xiao <Jack.Xiao@amd.com>
Acked-and-tested-by: Evan Quan <evan.quan@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-07-28 16:28:54 -04:00
Evan Quan
0da0def770 drm/amdgpu: drop non-necessary call trace dump
This extra call trace dump comes out in every gpu reset.
And it gives people a wrong impression that something
went wrong. Although actually there was not.

Signed-off-by: Evan Quan <evan.quan@amd.com>
Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-07-28 16:28:54 -04:00
Andrey Grodzovsky
f6a3f66063 drm/amdgpu: Get rid of amdgpu_job->external_hw_fence
This is a follow-up cleanup to [1]. See bellow refcount balancing
for calling amdgpu_job_submit_direct after this cleanup as far
as I calculated.

amdgpu_fence_emit
	dma_fence_init 1
	dma_fence_get(fence) 2
	rcu_assign_pointer(*ptr, dma_fence_get(fence) 3

---> amdgpu_job_submit_direct completes before fence signaled
			amdgpu_sa_bo_free
				(*sa_bo)->fence = dma_fence_get(fence) 4

			amdgpu_job_free
				dma_fence_put 3

			amdgpu_vcn_enc_get_destroy_msg
				*fence = dma_fence_get(f) 4
				dma_fence_put(f); 3

			amdgpu_vcn_enc_ring_test_ib
				dma_fence_put(fence) 2

			amdgpu_fence_process
				dma_fence_put 1

			amdgpu_sa_bo_remove_locked
				dma_fence_put 0

---> amdgpu_job_submit_direct completes after fence signaled
			amdgpu_fence_process
				dma_fence_put 2

			amdgpu_job_free
				dma_fence_put 1

			amdgpu_vcn_enc_get_destroy_msg
				*fence = dma_fence_get(f) 2
				dma_fence_put(f); 1

			amdgpu_vcn_enc_ring_test_ib
				dma_fence_put(fence) 0

[1] - https://patchwork.kernel.org/project/dri-devel/cover/20220624180955.485440-1-andrey.grodzovsky@amd.com/

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Suggested-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-07-18 16:37:25 -04:00
Likun Gao
f1549c09c5 drm/amdgpu: support reset flag set for gpu reset
Move reset_context out of gpu recover function to make it configurable
for different reset purpose.
For the reset way of call gpu_recovery sysfs, force to use full reset
method. Otherwise, try soft reset by default if the related ASIC
supportted, if soft reset failed, will use full reset.

Signed-off-by: Likun Gao <Likun.Gao@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-07-13 11:25:17 -04:00
Alex Deucher
6e9c65f71e drm/amdgpu: fix documentation warning
Fixes this issue:
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c:5094: warning: expecting prototype for amdgpu_device_gpu_recover_imp(). Prototype was for amdgpu_device_gpu_recover() instead

Fixes: cf72704414 ("drm/amdgpu: Rename amdgpu_device_gpu_recover_imp back to amdgpu_device_gpu_recover")
Reviewed-by: Kent Russell <kent.russell@amd.com>
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-06-29 22:03:51 -04:00
Kent Russell
d193b12b2f drm/amdgpu: Fix typos in amdgpu_stop_pending_resets
Change amdggpu to amdgpu and pedning to pending

Signed-off-by: Kent Russell <kent.russell@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-06-29 17:10:21 -04:00
Andrey Grodzovsky
9ae55f030d drm/amdgpu: Follow up change to previous drm scheduler change.
Align refcount behaviour for amdgpu_job embedded HW fence with
classic pointer style HW fences by increasing refcount each
time emit is called so amdgpu code doesn't need to make workarounds
using amdgpu_job.job_run_counter to keep the HW fence refcount balanced.

Also since in the previous patch we resumed setting s_fence->parent to NULL
in drm_sched_stop switch to directly checking if job->hw_fence is
signaled to short circuit reset if already signed.

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Tested-by: Yiqing Yao <yiqing.yao@amd.com>
Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-06-28 11:24:41 -04:00
Andrey Grodzovsky
9e225fb9e6 drm/amdgpu: Prevent race between late signaled fences and GPU reset.
Problem:
After we start handling timed out jobs we assume there fences won't be
signaled but we cannot be sure and sometimes they fire late. We need
to prevent concurrent accesses to fence array from
amdgpu_fence_driver_clear_job_fences during GPU reset and amdgpu_fence_process
from a late EOP interrupt.

Fix:
Before accessing fence array in GPU disable EOP interrupt and flush
all pending interrupt handlers for amdgpu device's interrupt line.

v2: Switch from irq_get/put to full enable/disable_irq for amdgpu

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-06-28 11:24:24 -04:00
Alex Deucher
163d4cd26a drm/amdgpu: fix adev variable used in amdgpu_device_gpu_recover()
Use the correct adev variable for the drm_fb_helper in
amdgpu_device_gpu_recover().  Noticed by inspection.

Fixes: 087451f372 ("drm/amdgpu: use generic fb helpers instead of setting up AMD own's.")
Reviewed-by: Guchun Chen <guchun.chen@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-06-21 18:17:24 -04:00
Yifan Zhang
68ad7f90c7 drm/amdgpu: remove redundant enable_mes and enable_mes_kiq
enable_mes and enable_mes_kiq are set in both device init and
MES IP init. Leave the ones in MES IP init, since it is
a more accurate way to judge from GC IP version.

Signed-off-by: Yifan Zhang <yifan1.zhang@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Jack Xiao <Jack.Xiao@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-06-14 21:38:41 -04:00
Andrey Grodzovsky
247c7b0dac drm/amdgpu: Stop any pending reset if another in progress.
We skip rest requests if another one is already in progress.

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-06-10 15:26:18 -04:00
Andrey Grodzovsky
cf72704414 drm/amdgpu: Rename amdgpu_device_gpu_recover_imp back to amdgpu_device_gpu_recover
We removed the wrapper that was queueing the recover function
into reset domain queue who was using this name.

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-06-10 15:26:12 -04:00
Andrey Grodzovsky
b5fd0cf3ea drm/amdgpu: Add work_struct for GPU reset from kfd.
We need to have a work_struct to cancel this reset if another
already in progress.

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-06-10 15:26:07 -04:00
Andrey Grodzovsky
ab9a0b1f36 drm/amdgpu: Cache result of last reset at reset domain level.
Will be read by executors of async reset like debugfs.

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-06-10 15:25:34 -04:00
Ramesh Errabolu
08a2fd23c6 drm/amdgpu: Add peer-to-peer support among PCIe connected AMD GPUs
Add support for peer-to-peer communication among AMD GPUs over PCIe
bus. Support REQUIRES enablement of config HSA_AMD_P2P.

Signed-off-by: Ramesh Errabolu <Ramesh.Errabolu@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-06-08 11:40:12 -04:00
Somalapuram Amaranath
3d8785f6c0 drm/amdgpu: adding device coredump support
Added device coredump information:
- Kernel version
- Module
- Time
- VRAM status
- Guilty process name and PID
- GPU register dumps
v1 -> v2: Variable name change
v1 -> v2: NULL check
v1 -> v2: Code alignment
v1 -> v2: Adding dummy amdgpu_devcoredump_free
v1 -> v2: memset reset_task_info to zero
v2 -> v3: add CONFIG_DEV_COREDUMP for variables
v2 -> v3: remove NULL check on amdgpu_devcoredump_read

Signed-off-by: Somalapuram Amaranath <Amaranath.Somalapuram@amd.com>
Reviewed-by: Shashank Sharma <Shashank.sharma@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-06-06 14:41:19 -04:00
Somalapuram Amaranath
651d7ee63f drm/amdgpu: save the reset dump register value for devcoredump
Allocate memory for register value and use the same values for devcoredump.
v1 -> v2: Change krealloc_array() to kmalloc_array()
v2 -> v3: Fix alignment

Signed-off-by: Somalapuram Amaranath <Amaranath.Somalapuram@amd.com>
Reviewed-by: Shashank Sharma <Shashank.sharma@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-06-06 14:41:12 -04:00
Alex Deucher
b5a0168e14 drm/amdgpu: fix up comment in amdgpu_device_asic_has_dc_support()
LVDS support was implemented in DC a while ago.  Just DAC
support is left to do.

Reviewed-by: Evan Quan <evan.quan@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-06-03 16:45:01 -04:00
Alex Deucher
1d6c363330 drm/amdgpu: simplify the logic in amdgpu_device_parse_gpu_info_fw()
Drop all of the extra cases in the default case.

Reviewed-by: Guchun Chen <guchun.chen@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-06-03 16:45:00 -04:00
Mitchell Augustin
f74e78ca90 amdgpu: amdgpu_device.c: Removed trailing whitespace
Removed trailing whitespace from end of line in amdgpu_device.c

Signed-off-by: Mitchell Augustin <kernel@mitchellaugustin.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-06-03 16:43:36 -04:00
Alex Deucher
b8b64595d6 drm/amdgpu: simplify amdgpu_device_asic_has_dc_support()
Drop extra cases in the default case.

Reviewed-by: Guchun Chen <guchun.chen@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-06-03 16:43:36 -04:00
Sunil Khatri
4d33e7040d drm/amdgpu: move amdgpu_gmc_tmz_set after ip_version populated
To enable TMZ feature based on IP version needs adev->ip_version
populated but its empty. Move amdgpu_gmc_tmz_set to a place where
ip_version is populated.

Signed-off-by: Sunil Khatri <sunil.khatri@amd.com>
Reviewed-by: Alexander Deucher <Alexander.Deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-05-26 14:56:32 -04:00
Stanley.Yang
950d64250f drm/amdgpu: support ras on SRIOV
support umc/gfx/sdma ras on guest side

Changed from V1:
    move sriov judgment in amdgpu_ras_interrupt_fatal_error_handler

Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-05-26 14:56:32 -04:00
Likun Gao
8424f2ccb3 drm/amdgpu/psp: Add vbflash sysfs interface support
Add sysfs interface to copy VBIOS.

v2: squash in fix for proper vmalloc API (Alex)

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Signed-off-by: Likun Gao <Likun.Gao@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-05-10 17:53:10 -04:00
Yiqing Yao
98f5618846 drm/amdgpu: flush delete wq after wait fence
[why]
lru_list not empty warning in sw fini during repeated device bind unbind.
There should be a amdgpu_fence_wait_empty() before the flush_delayed_work()
call as Christian suggested.

[how]
Move to do flush_delayed_work for ttm bo delayed delete wq after fence_driver_hw_fini.

Tested by: Yiqing Yao <yiqing.yao@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Yiqing Yao <yiqing.yao@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-05-06 16:56:41 -04:00
Mukul Joshi
c004d44e10 drm/amdgpu: Enable KFD with MES enabled
Enable KFD initialization with MES enabled.

Signed-off-by: Mukul Joshi <mukul.joshi@amd.com>
Acked-by: Oak Zeng <Oak.Zeng@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Reviewed-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-05-04 10:43:52 -04:00
Jack Xiao
9c12f5cd06 drm/amdgpu: skip kfd routines when mes enabled
For kfd hasn't supported mes, skip kfd routines.

Signed-off-by: Jack Xiao <Jack.Xiao@amd.com>
Acked-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-05-04 10:43:52 -04:00
Jack Xiao
928fe236c0 drm/amdgpu: add mes_kiq module parameter v2
mes_kiq parameter is used to enable mes kiq pipe.
This module parameter is unneccessary or enabled by default
in final version.

v2: reword commit message.

Signed-off-by: Jack Xiao <Jack.Xiao@amd.com>
Acked-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-05-04 10:43:49 -04:00
Jack Xiao
de33a32968 drm/amdgpu: use the whole doorbell space for mes
Use the whole doorbell space for mes. Each queue in one process occupies
one doorbell slot to ring the queue submitting.

Signed-off-by: Jack Xiao <Jack.Xiao@amd.com>
Acked-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-05-04 10:04:01 -04:00
Hawking Zhang
85d1bcc6e0 drm/amdgpu: switch to atomfirmware_asic_init
Some initial settings now are not available from
the atom data table. The assumption that !ps[0]
|| !ps[1] in amdgpu_atom_asic_init is not valid.
In addition, driver needs to strictly follow
atomfirmware structure (asic_init_parameters) to
initialize parameters used to execute asic_init
function, otherwise, the execution of asic_init
would fail.

This shall be applicable to all soc15 adapters,but
let make the transition on soc21 first.

Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-04-28 17:48:00 -04:00
Alex Deucher
e24d0e91b3 drm/amdgpu/discovery: move all table parsing into amdgpu_discovery.c
This data has no dependencies, so encapsulate it all within
amdgpu_discovery.c.

Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-04-28 17:47:43 -04:00
Bokun Zhang
e15c9d06e9 drm/amd/amdgpu: Update PF2VF header
- In the latest version of the header, there is a variable name change.
  This should not cause any backward compatibility since the variable is
  at the same offset in the struct.

Signed-off-by: Bokun Zhang <Bokun.Zhang@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-04-21 16:00:14 -04:00
Evan Quan
25faeddcf3 drm/amdgpu: expand cg_flags from u32 to u64
With this, we can support more CG flags.

Signed-off-by: Evan Quan <evan.quan@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-04-08 17:24:24 -04:00
Alex Deucher
b818a5d374 drm/amdgpu/gmc: use PCI BARs for APUs in passthrough
If the GPU is passed through to a guest VM, use the PCI
BAR for CPU FB access rather than the physical address of
carve out.  The physical address is not valid in a guest.

v2: Fix HDP handing as suggested by Michel

Reviewed-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Michel Dänzer <mdaenzer@redhat.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-03-25 12:40:24 -04:00
Philip Yang
436afdfa35 drm/amdgpu: Move reset domain init before calling RREG32
amdgpu_detect_virtualization reads register, amdgpu_device_rreg access
adev->reset_domain->sem if kernel defined CONFIG_LOCKDEP, below is the
random boot hang backtrace on Vega10. It may get random NULL pointer
access backtrace if amdgpu_sriov_runtime is true too.

Move amdgpu_reset_create_reset_domain before calling to RREG32.

 BUG: kernel NULL pointer dereference, address:
 #PF: supervisor read access in kernel mode
 #PF: error_code(0x0000) - not-present page
 PGD 0 P4D 0
 Oops: 0000 [#1] PREEMPT SMP NOPTI
 Workqueue: events work_for_cpu_fn
 RIP: 0010:down_read_trylock+0x13/0xf0
 Call Trace:
  <TASK>
  amdgpu_device_skip_hw_access+0x38/0x80 [amdgpu]
  amdgpu_device_rreg+0x1b/0x170 [amdgpu]
  amdgpu_detect_virtualization+0x73/0x100 [amdgpu]
  amdgpu_device_init.cold.60+0xbe/0x16b1 [amdgpu]
  ? pci_bus_read_config_word+0x43/0x70
  amdgpu_driver_load_kms+0x15/0x120 [amdgpu]
  amdgpu_pci_probe+0x1a1/0x3a0 [amdgpu]

Fixes: d0fb18b535 ("drm/amdgpu: Move reset sem into reset_domain")
Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-03-15 14:40:49 -04:00
Alex Deucher
85ac2021fe drm/amdgpu: only check for _PR3 on dGPUs
We don't support runtime pm on APUs.  They support more
dynamic power savings using clock and powergating.

Reviewed-by: Mario Limonciello <mario.limonciello@amd.com>
Tested-by: Mario Limonciello <mario.limonciello@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-03-15 14:34:41 -04:00
Alex Deucher
1b537e6410 drm/amdgpu: remove unused gpu_info firmwares
These were leftover from bring up and are no longer
necessary.  The information is available via
the IP discovery table.

Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-03-02 18:40:07 -05:00
Yifan Zhang
957b0787ee drm/amdgpu: move amdgpu_gmc_noretry_set after ip_versions populated
otherwise adev->ip_versions is still empty when amdgpu_gmc_noretry_set
is called.

Reviewed-by: Huang Rui <ray.huang@amd.com>
Signed-off-by: Yifan Zhang <yifan1.zhang@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-03-02 18:40:06 -05:00
Dave Airlie
38a15ad948 Merge tag 'amd-drm-next-5.18-2022-02-25' of https://gitlab.freedesktop.org/agd5f/linux into drm-next
amd-drm-next-5.18-2022-02-25:

amdgpu:
- Raven2 suspend/resume fix
- SDMA 5.2.6 updates
- VCN 3.1.2 updates
- SMU 13.0.5 updates
- DCN 3.1.5 updates
- Virtual display fixes
- SMU code cleanup
- Harvest fixes
- Expose benchmark tests via debugfs
- Drop no longer relevant gart aperture tests
- More RAS restructuring
- W=1 fixes
- PSR rework
- DP/VGA adapter fixes
- DP MST fixes
- GPUVM eviction fix
- GPU reset debugfs register dumping support
- Misc display fixes
- SR-IOV fix
- Aldebaran mGPU fix
- Add module parameter to disable XGMI for testing

amdkfd:
- IH ring overflow logging fixes
- CRIU fixes
- Misc fixes

Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Alex Deucher <alexander.deucher@amd.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20220225183535.5907-1-alexander.deucher@amd.com
2022-03-01 16:19:02 +10:00