1
0
Fork 0
mirror of synced 2025-03-06 20:59:54 +01:00
Commit graph

47 commits

Author SHA1 Message Date
Jack Zhang
a89b5dae3e drm/amdgpu fix incorrect sysfs remove behavior for xgmi
Under xgmi setup,some sysfs fail to create for the second time of kmd
driver loading. It's due to sysfs nodes are not removed appropriately
in the last unlod time.

Changes of this patch:
1. remove sysfs for dev_attr_xgmi_error
2. remove sysfs_link adev->dev->kobj with target name.
   And it only needs to be removed once for a xgmi setup
3. remove sysfs_link hive->kobj with target name

In amdgpu_xgmi_remove_device:
1. amdgpu_xgmi_sysfs_rem_dev_info needs to be run per device
2. amdgpu_xgmi_sysfs_destroy needs to be run on the last node of
device.

v2: initialize array with memset

Signed-off-by: Jack Zhang <Jack.Zhang1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-05-21 12:48:42 -04:00
Colin Ian King
29c1ec244c drm/amdgpu: remove redundant assignment to variable ret
The variable ret is being initializeed with a value that is never read
and it is being updated later with a new value. The initialization
is redundant and can be removed.

Addresses-Coverity: ("Unused value")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-05-14 16:42:45 -04:00
Hawking Zhang
890900fe77 drm/amdgpu: use node_id and node_size to calcualte dram_base_address
physical_node_id * node_segment_size should be the
dram_base_address for current gpu node in xgmi config

Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: John Clements <john.clements@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-05-08 14:32:10 -04:00
Jonathan Kim
dfe31f255f drm/amdgpu: sw pstate switch should only be for vega20
Driver steered p-state switching is designed for Vega20 only.
Also simplify early return for temporary disable due to SMU FW
bug.

Signed-off-by: Jonathan Kim <Jonathan.Kim@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-04-27 15:51:42 -04:00
Jonathan Kim
d84a430d9f drm/amdgpu: fix race between pstate and remote buffer map
Vega20 arbitrates pstate at hive level and not device level. Last peer to
remote buffer unmap could drop P-State while another process is still
remote buffer mapped.

With this fix, P-States still needs to be disabled for now as SMU bug
was discovered on synchronous P2P transfers.  This should be fixed in the
next FW update.

Signed-off-by: Jonathan Kim <Jonathan.Kim@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-04-22 18:11:46 -04:00
John Clements
66399248fe drm/amdgpu: added xgmi ras error reset sequence
added mechanism to clear xgmi ras status inbetween error queries

Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: John Clements <john.clements@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-04-01 14:44:42 -04:00
Tao Zhou
204eaac625 drm/amdgpu: call ras_debugfs_create_all in debugfs_init
and remove each ras IP's own debugfs creation

this is required to fix ras when the driver does not use the drm load
and unload callbacks due to ordering issues with the drm device node.

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-03-10 15:55:11 -04:00
Hawking Zhang
a61f41b177 drm/amdgpu: enable PCS error report on arcturus
add arcturus xgmi/wafl pcs err status group to support
PCS error detection and report on arcturus

Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Guchun Chen <guchun.chen@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-03-06 14:31:43 -05:00
Hawking Zhang
18f36157f2 drm/amdgpu: add helper funcs to detect PCS error
Since from vega20, hardware supports run-time detect
and report XGMI/WAFL PCS ras error. Add helper functions
to walkthrough every type of ras error and report it if
any.

Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Guchun Chen <guchun.chen@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-03-06 14:31:28 -05:00
Hawking Zhang
938065d4cb drm/amdgpu: toggle DF-Cstate to protect DF reg access
driver needs to take DF out Cstate before any DF register
access. otherwise, the DF register may not be accessible.

Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Acked-by: Evan Quan <evan.quan@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-02-26 14:17:32 -05:00
Hawking Zhang
19744f5f2d drm/amdgpu: move get_xgmi_relative_phy_addr to amdgpu_xgmi.c
centralize all the xgmi related function to amdgpu_xgmi.c

Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Acked-by: Evan Quan <evan.quan@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-02-26 14:17:32 -05:00
Hawking Zhang
0b9d37609a drm/amdgpu: move xgmi init/fini to xgmi_add/remove_device call (v2)
For sriov, psp ip block has to be initialized before
ih block for the dynamic register programming interface
that needed for vf ih ring buffer. On the other hand,
current psp ip block hw_init function will initialize
xgmi session which actaully depends on interrupt to
return session context. This results an empty xgmi ta
session id and later failures on all the xgmi ta cmd
invoked from vf. xgmi ta session initialization has to
be done after ih ip block hw_init call.

to unify xgmi session init/fini for both bare-metal
sriov virtualization use scenario, move xgmi ta init
to xgmi_add_device call, and accordingly terminate xgmi
ta session in xgmi_remove_device call.

The existing suspend/resume sequence will not be changed.

v2: squash in return fix from Nirmoy

Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Frank Min <Frank.Min@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-02-06 15:04:36 -05:00
Joseph Greathouse
bdf84a80e0 drm/amdgpu: Create generic DF struct in adev
The only data fabric information the adev struct currently
contains is a function pointer table. In the near future,
we will be adding some cached DF information into adev. As
such, this patch creates a new amdgpu_df struct for adev.
Right now, it only containst the old function pointer table,
but new stuff will be added soon.

Signed-off-by: Joseph Greathouse <Joseph.Greathouse@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-01-14 10:18:41 -05:00
Evan Quan
9530273ec9 drm/amd/powerplay: cover the powerplay implementation details V3
This can save users much troubles. As they do not
actually need to care whether swSMU or traditional
powerplay routine should be used.

V2: apply the fixes to vi.c and cik.c also
V3: squash in oops fix

Signed-off-by: Evan Quan <evan.quan@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-01-14 10:18:08 -05:00
Andrey Grodzovsky
f33a8770cd drm/amdgpu: Add task barrier to XGMI hive.
Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Reviewed-by: Le Ma <Le.Ma@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-12-18 16:09:13 -05:00
Jonathan Kim
cb5932f866 drm/amdgpu: fix vega20 pstate status change
vega20 only requires all devices be set to same pstate level for low
pstate and not high.

Signed-off-by: Jonathan Kim <Jonathan.Kim@amd.com>
Reviewed-by: Evan Quan <Evan.Quan@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-11-07 18:08:07 -05:00
Evan Quan
5c5b2ba006 drm/amdgpu: fix possible pstate switch race condition
Added lock protection so that the p-state switch will
be guarded to be sequential. Also update the hive
pstate only all device from the hive are in the same
state.

Signed-off-by: Evan Quan <evan.quan@amd.com>
Reviewed-by: Feifei Xu <Feifei.Xu@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-11-06 16:27:48 -05:00
Evan Quan
3e454860f2 drm/amd/powerplay: support xgmi pstate setting on powerplay routine V2
Add xgmi pstate setting on powerplay routine.

V2: split the change of is_support_sw_smu_xgmi into a separate patch

Signed-off-by: Evan Quan <evan.quan@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-11-06 16:27:46 -05:00
Tao Zhou
be5b39d87a drm/amdgpu: move xgmi ras fini to xgmi block
it's more suitable to put xgmi ras fini in xgmi block

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Guchun Chen <guchun.chen@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-10-03 09:11:02 -05:00
Hawking Zhang
029fbd437e drm/amdgpu: initialize ras structures for xgmi block (v2)
init ras common interface and fs node for xgmi block

v2: remove unnecessary physical node number check before
invoking amdgpu_xgmi_ras_late_init

Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Guchun Chen <guchun.chen@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-09-16 10:08:51 -05:00
Jonathan Kim
24f9aacfb0 drm/amdgpu: adding xgmi error monitoring
monitor xgmi errors via mc pie status through fica registers.

Signed-off-by: Jonathan Kim <Jonathan.Kim@amd.com>
Reviewed-by: Kent Russell <Kent.Russell@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-07-30 23:22:34 -05:00
Yong Zhao
54bd77f3d0 amd/powerplay: No SW XGMI dpm for Arcturus rev 2
xgmi dpm is handled by the SMU.

Signed-off-by: Yong Zhao <Yong.Zhao@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-07-18 14:18:06 -05:00
Le Ma
75b2fce2d8 drm/amdgpu: skip get/update xgmi topology info when no psp exists
We don't currently have psp support for arcturus so provide a alternative
mechanism in the meantime.

Signed-off-by: Le Ma <le.ma@amd.com>
Reviewed-by: Feifei Xu <Feifei.Xu@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-07-18 14:18:05 -05:00
Oak Zeng
2f2eab3acc drm/amdgpu: Hack xgmi topology info when there is no psp fw
This is only needed on emulation platform where psp fw might
not be available, to hack xgmi topology info such as hive id and
node id.

v2: Add offset to hacked hive/node id
v3: Don't use introduce new module parameter.

Signed-off-by: Oak Zeng <Oak.Zeng@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-07-18 14:18:04 -05:00
Tom St Denis
1c1e53f7f2 drm/amd/doc: Add XGMI sysfs documentation
Acked-by: Slava Abramov <slava.abramov@amd.com>
Signed-off-by: Tom St Denis <tom.stdenis@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-05-24 12:46:49 -05:00
shaoyunl
e008299ea9 drm/amdgpu: Update latest xgmi topology info after each device is enumulated
Adjust the sequence of set/get xgmi topology, so driver can have the latest
XGMI topology info for future usage

Signed-off-by: shaoyunl <shaoyun.liu@amd.com>
Acked-by: Evan Quan <evan.quan@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-05-24 12:20:48 -05:00
shaoyunl
da361dd13f drm/amdgpu: Implement get num of hops between two xgmi device
KFD need to provide the info for upper level to determine the data path

Signed-off-by: shaoyunl <shaoyun.liu@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-05-24 12:20:48 -05:00
shaoyunl
93abb05fd5 drm/amdgpu: Set proper function to set xgmi pstate
Driver need to call SMU to set xgmi pstate

Signed-off-by: shaoyunl <shaoyun.liu@amd.com>
Reviewed-by: Evan Quan <evan.quan@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-04-12 11:29:46 -05:00
shaoyunl
df399b0641 drm/amdgpu: XGMI pstate switch initial support
Driver vote low to high pstate switch whenever there is an outstanding
XGMI mapping request. Driver vote high to low pstate when all the
outstanding XGMI mapping is terminated.

Signed-off-by: shaoyunl <shaoyun.liu@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-03-27 22:40:43 -05:00
Christian König
86f7bae5cf drm/amdgpu: revert "XGMI pstate switch initial support"
This reverts commit 9b638f9751.

Adding this to the mapping is complete nonsense and the whole
implementation looks racy. This patch wasn't thoughtfully reviewed
and should be reverted for now.

Signed-off-by: Christian König <christian.koenig@amd.com>
Acked-by: Liu, Shaoyun <Shaoyun.Liu@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-03-21 14:05:01 -05:00
shaoyunl
9b638f9751 drm/amdgpu: XGMI pstate switch initial support
Driver vote low to high pstate switch whenever there is an outstanding
XGMI mapping request. Driver vote high to low pstate when all the
outstanding XGMI mapping is terminated.

Signed-off-by: shaoyunl <shaoyun.liu@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-03-19 15:36:48 -05:00
Andrey Grodzovsky
b1fa8c8955 drm/amdgpu: Add sysfs entries for xgmi hive v2.
For each device a file xgmi_device_id is created.
On the first device a subdirectory named xgmi_hive_info is created,
It contains  a file named hive_id and symlinks named node 1-4 linking
to each device in the hive.

v2: Return error codes instead of '-1' and few misspellings.

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-03-19 15:36:48 -05:00
Colin Ian King
c1219b941c drm/amd/amdgpu: fix spelling mistake "matech" -> "match"
There is a spelling mistake in a dev_err message. Fix it.

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-02-05 21:15:32 -05:00
shaoyunl
47dd8048a1 drm/amdgpu: Show XGMI node and hive message per device only once
Reduce the repeated node and hive information during XGMI initialization

Signed-off-by: shaoyunl <shaoyun.liu@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-01-29 15:16:18 -05:00
Tom St Denis
22d6575b8d drm/amd/amdgpu: add missing mutex lock to amdgpu_get_xgmi_hive() (v3)
v2: Move locks around in other functions so that this
function can stand on its own.  Also only hold the hive
specific lock for add/remove device instead of the driver
global lock so you can't add/remove devices in parallel from
one hive.

v3: add reset_lock

Acked-by:  Shaoyun.liu < Shaoyun.liu@amd.com>
Signed-off-by: Tom St Denis <tom.stdenis@amd.com>
Reviewed-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-01-14 15:04:53 -05:00
shaoyunl
36ca09a02a drm/amdgpu: Add message print when unable to get valid hive
Add message print out and return -EINVAL when driver can not get valid hive
from hive  arrary on xgmi configuration

Signed-off-by: shaoyunl <shaoyun.liu@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-01-14 15:04:52 -05:00
Evan Quan
379c237e39 drm/amdgpu: correct the return value for error case
It should not return 0 for error case as '0' is actually
a special value for index.

Signed-off-by: Evan Quan <evan.quan@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2018-12-18 17:39:18 -05:00
Andrey Grodzovsky
5d66ef38bc drm/amdgpu: Update XGMI node print
amdgpu_xgmi_update_topology is called both on device registration
and reset. Fix misleading print since the device is added only once to
the hive on registration and not on reset.

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2018-12-04 14:37:23 -05:00
Andrey Grodzovsky
a82400b57a drm/amdgpu: Handle xgmi device removal.
XGMI hive has some resources allocted on device init which
needs to be deallocated when the device is unregistered.

v2: Remove creation of dedicated wq for XGMI hive reset.
v3: Use the gmc.xgmi.supported flag

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2018-12-03 11:15:08 -05:00
Alex Deucher
47622ba033 drm/amdgpu: add a xgmi supported flag
Use this to track whether an asic supports xgmi rather than
checking the asic type everywhere.

Reviewed-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2018-12-03 11:14:39 -05:00
Andrey Grodzovsky
ed2bf5229c drm/amdgpu: Expose hive adev list and xgmi_mutex
It's needed for device reset of entire hive.

v3:
Add per hive lock to allow avoiding duplicate resets triggered by
multiple members  of same hive.
Expose amdgpu_hive_info instead of adding getter functions.

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2018-11-28 15:55:36 -05:00
Andrey Grodzovsky
5183411b56 drm/amdgpu: Refactor amdgpu_xgmi_add_device
This is prep work for updating each PSP FW in hive after
GPU reset.
Split into build topology SW state and update each PSP FW in the hive.
Save topology and count of XGMI devices for reuse.

v2: Create seperate header for XGMI.

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2018-11-28 15:55:35 -05:00
shaoyunl
a82c15668c drm/amdgpu: Each PSP need to get latest topology info on XGMI configuration
Driver need to call each psp instance to get topology info before set topology

Signed-off-by: shaoyunl <Shaoyun.Liu@amd.com>
reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2018-11-09 16:28:32 -05:00
Hawking Zhang
db00491293 drm/amdgpu: fix frame size of amdgpu_xgmi_add_devices excceed 1024 bytes
Instead of stack-allocated psp_xgmi_topology_info in function
amdgpu_xgmi_add_device, dynamically allocated this structure to
avoid the frame size of this function excceed 1024 bytes

Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Xiaojie Yuan <xiaojie.yuan@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2018-11-07 17:05:54 -05:00
Hawking Zhang
593caa07ad drm/amdgpu/psp: update topology info structures
topology info structure needs to match with the one defined
in xgmi ta

Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Shaoyun Liu <Shaoyun.Liu@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2018-11-06 14:02:45 -05:00
Hawking Zhang
dd3c45d306 drm/amdgpu/psp: add get_node_id function
get_node_id function is used for driver to get node_id
for current device from xgmi ta

Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Huang Rui <ray.huang@amd.com>
Reviewed-by: Shaoyun Liu <Shaoyun.Liu@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2018-11-06 14:02:43 -05:00
Shaoyun Liu
fb30fc59a2 drm/amdgpu : Generate XGMI topology info from driver level
Driver will save an array of XGMI hive info, each hive will have a list of devices
that have the same hive ID.

Signed-off-by: Shaoyun Liu <Shaoyun.Liu@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2018-09-10 22:47:52 -05:00