Add missing parameters to fix a kerneldoc warning
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Add IP discovery data in sysfs. The format is:
/sys/class/drm/cardX/device/ip_discovery/die/D/B/I/<attrs>
where,
X is the card ID, an integer,
D is the die ID, an integer,
B is the IP HW ID, an integer, aka block type,
I is the IP HW ID instance, an integer.
<attrs> are the attributes of the block instance. At the moment these
include HW ID, instance number, major, minor, revision, number of base
addresses, and the base addresses themselves.
A symbolic link of the acronym HW ID is also created, under D/, if you
prefer to browse by something humanly accessible.
Cc: Alex Deucher <Alexander.Deucher@amd.com>
Cc: Tom StDenis <tom.stdenis@amd.com>
Signed-off-by: Luben Tuikov <luben.tuikov@amd.com>
Reviewed-by: Alex Deucher <Alexander.Deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Fix few kerneldoc warnings and one typo.
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
[Why]
SDMA ring buffer test failed if suspend is aborted during
S0i3 resume.
[How]
If suspend is aborted for some reason during S0i3 resume
cycle, it follows SDMA ring test failing and errors in amdgpu
resume. For RN/CZN/Picasso, SMU saves and restores SDMA
registers during S0ix cycle. So, skipping SDMA suspend and
resume from driver solves the issue. This time, the system
is able to resume gracefully even the suspend is aborted.
Reviewed-by: Mario Limonciello <mario.limonciello@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Rajib Mahapatra <rajib.mahapatra@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org
This is provided by TTM now.
Also switch man->size to bytes instead of pages and fix the double
printing of size and usage in debugfs.
v2: fix size checking as well
Signed-off-by: Christian König <christian.koenig@amd.com>
Tested-by: Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl>
Reviewed-by: Matthew Auld <matthew.auld@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20220214093439.2989-8-christian.koenig@amd.com
This is provided by TTM now.
Also switch man->size to bytes instead of pages and fix the double
printing of size and usage in debugfs.
v2: fix size checking as well
Signed-off-by: Christian König <christian.koenig@amd.com>
Tested-by: Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl>
Reviewed-by: Matthew Auld <matthew.auld@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20220214093439.2989-6-christian.koenig@amd.com
amd-drm-next-5.18-2022-02-11-1:
amdgpu:
- Clean up of power management code
- Enable freesync video mode by default
- Clean up of RAS code
- Improve VRAM access for debug using SDMA
- Coding style cleanups
- SR-IOV fixes
- More display FP reorg
- TLB flush fixes for Arcuturus, Vega20
- Misc display fixes
- Rework special register access methods for SR-IOV
- DP2 fixes
- DP tunneling fixes
- DSC fixes
- More IP discovery cleanups
- Misc RAS fixes
- Enable both SMU i2c buses where applicable
- s2idle improvements
- DPCS header cleanup
- Add new CAP firmware support for SR-IOV
amdkfd:
- Misc cleanups
- SVM fixes
- CRIU support
- Clean up MQD manager
UAPI:
- Add interface to amdgpu CTX ioctl to request a stable power state for profiling
https://gitlab.freedesktop.org/mesa/drm/-/merge_requests/207
- Add amdkfd support for CRIU
https://github.com/checkpoint-restore/criu/pull/1709
- Remove old unused amdkfd debugger interface
Was only implemented for Kaveri and was only ever used by an old HSA tool that was never open sourced
radeon:
- Fix error handling in radeon_driver_open_kms
- UVD suspend fix
- Misc fixes
From: Alex Deucher <alexander.deucher@amd.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20220211220706.5803-1-alexander.deucher@amd.com
the UMC_STATUS register is not linear, adjust offset
calculation formula to get correct address
Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
[Why]
SDMA ring buffer test failed if suspend is aborted during
S0i3 resume.
[How]
If suspend is aborted for some reason during S0i3 resume
cycle, it follows SDMA ring test failing and errors in amdgpu
resume. For RN/CZN/Picasso, SMU saves and restores SDMA
registers during S0ix cycle. So, skipping SDMA suspend and
resume from driver solves the issue. This time, the system
is able to resume gracefully even the suspend is aborted.
Reviewed-by: Mario Limonciello <mario.limonciello@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Rajib Mahapatra <rajib.mahapatra@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
KMD reports a warning on holding a lock from drm_syncobj_find_fence,
when running amdgpu_test case “syncobj timeline test”.
ctx->lock was designed to prevent concurrent "amdgpu_ctx_wait_prev_fence"
calls and avoid dead reservation lock from GPU reset. since no reservation
lock is held in latest GPU reset any more, ctx->lock can be simply removed
and concurrent "amdgpu_ctx_wait_prev_fence" call also can be prevented by
PD root bo reservation lock.
call stacks:
=================
//hold lock
amdgpu_cs_ioctl->amdgpu_cs_parser_init->mutex_lock(&parser->ctx->lock);
…
//report warning
amdgpu_cs_dependencies->amdgpu_cs_process_syncobj_timeline_in_dep \
->amdgpu_syncobj_lookup_and_add_to_sync -> drm_syncobj_find_fence \
-> lockdep_assert_none_held_once
…
amdgpu_cs_ioctl->amdgpu_cs_parser_fini->mutex_unlock(&parser->ctx->lock);
Signed-off-by: Ken Xue <Ken.Xue@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
The OOB table error count info should be reset after reset
eeprom table
Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
No need to check poison setting for each channel, check for umc0
channel0 is enough.
Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Add basic support for GC 10.1.4,
it uses same IP blocks with GC 10.1.3
Signed-off-by: Lang Yu <Lang.Yu@amd.com>
Reviewed-by: Huang Rui <ray.huang@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Seems I forgot to add this to the relevant commit
when submitting.
Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Reported-by: kernel test robot <lkp@intel.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Christian König <christian.koenig@amd.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20220210031724.440943-1-andrey.grodzovsky@amd.com
"adev->gfx.rlc.rlcg_reg_access_supported = true;"
the above varible were set too late during driver initialization.
it will cause the driver to fail to write/read register during GMC hw init
in sriov mode.
move gfx_xxx_init_rlcg_reg_access_ctrl() function to gfx early init stage
to avoid this issue.
Fixes: 5d447e2967 ("drm/amdgpu: add helper for rlcg indirect reg access")
Signed-off-by: Yang Wang <KevinYang.Wang@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
After the buffer object is successfully mapped,
call amdgpu_bo_kunmap before the function returns.
Signed-off-by: zhanglianjie <zhanglianjie@uniontech.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cleanup the kfd code by removing the unused old debugger
implementation.
The address watch was only ever implemented in the upstream
driver for GFXv7 (Kaveri). The user mode tools runtime using
this API was never open-sourced. Work on the old debugger
prototype that used this API has been discontinued years ago.
Only a small piece of resetting wavefronts is kept and
is moved to kfd_device_queue_manager.c.
Signed-off-by: Mukul Joshi <mukul.joshi@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Confirmed with hardware team, there is harvesting for gc 10.3.1.
Signed-off-by: Aaron Liu <aaron.liu@amd.com>
Reviewed-by: Huang Rui <ray.huang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Since we have a single instance of reset semaphore which we
lock only once even for XGMI hive we don't need the nested
locking hint anymore.
Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Link: https://www.spinics.net/lists/amd-gfx/msg74120.html
This functions needs to be split into 2 parts where
one is called only once for locking single instance of
reset_domain's sem and reset flag and the other part
which handles MP1 states should still be called for
each device in XGMI hive.
Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Link: https://www.spinics.net/lists/amd-gfx/msg74118.html
We should have a single instance per entrire reset domain.
Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Suggested-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Link: https://www.spinics.net/lists/amd-gfx/msg74116.html
We want single instance of reset sem across all
reset clients because in case of XGMI we should stop
access cross device MMIO because any of them could be
in a reset in the moment.
Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Link: https://www.spinics.net/lists/amd-gfx/msg74117.html
The reset domain contains register access semaphor
now and so needs to be present as long as each device
in a hive needs it and so it cannot be binded to XGMI
hive life cycle.
Adress this by making reset domain refcounted and pointed
by each member of the hive and the hive itself.
v4:
Fix crash on boot witrh XGMI hive by adding type to reset_domain.
XGMI will only create a new reset_domain if prevoius was of single
device type meaning it's first boot. Otherwsie it will take a
refocunt to exsiting reset_domain from the amdgou device.
Add a wrapper around reset_domain->refcount get/put
and a wrapper around send to reset wq (Lijo)
Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Acked-by: Christian König <christian.koenig@amd.com>
Link: https://www.spinics.net/lists/amd-gfx/msg74121.html
Since now all GPU resets are serialzied there is no need for this.
This patch also reverts 'drm/amdgpu: race issue when jobs on 2 ring timeout'
Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Link: https://www.spinics.net/lists/amd-gfx/msg74119.html
Since we serialize all resets no need to protect from concurrent
resets.
Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Link: https://www.spinics.net/lists/amd-gfx/msg74115.html
No need to to trigger another work queue inside the work queue.
v3:
Problem:
Extra reset caused by host side FLR notification
following guest side triggered reset.
Fix: Preven qeuing flr_work from mailbox irq if guest
already executing a reset.
Suggested-by: Liu Shaoyun <Shaoyun.Liu@amd.com>
Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Reviewed-by: Liu Shaoyun <Shaoyun.Liu@amd.com>
Link: https://www.spinics.net/lists/amd-gfx/msg74114.html
Use reset domain wq also for non TDR gpu recovery trigers
such as sysfs and RAS. We must serialize all possible
GPU recoveries to gurantee no concurrency there.
For TDR call the original recovery function directly since
it's already executed from within the wq. For others just
use a wrapper to qeueue work and wait on it to finish.
v2: Rename to amdgpu_recover_work_struct
Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Link: https://www.spinics.net/lists/amd-gfx/msg74113.html
Before we initialize schedulers we must know which reset
domain are we in - for single device there iis a single
domain per device and so single wq per device. For XGMI
the reset domain spans the entire XGMI hive and so the
reset wq is per hive.
Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Link: https://www.spinics.net/lists/amd-gfx/msg74112.html
Defined a reset_domain struct such that
all the entities that go through reset
together will be serialized one against
another. Do it for both single device and
XGMI hive cases.
Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Suggested-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Suggested-by: Christian König <ckoenig.leichtzumerken@gmail.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Link: https://www.spinics.net/lists/amd-gfx/msg74111.html
These have been at production level for a while. Drop
the flag.
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
We want to have lockdep annotation here, so make sure that we reserve
the PD while removing PRTs even if it isn't strictly necessary since the
VM object is about to be destroyed anyway.
Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Since newly added BOs don't have any mappings it's ok to add them
without holding the VM lock. Only when we add per VM BOs the lock is
mandatory.
Signed-off-by: Christian König <christian.koenig@amd.com>
Reported-by: Bhardwaj, Rajneesh <Rajneesh.Bhardwaj@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Bypass group programming (utcl2_harvest) aims to forbid UTCL2 to send
invalidation command to harvested SE/SA. Once invalidation command comes
into harvested SE/SA, SE/SA has no response and system hang.
This patch is to add checking if the GART table is already allocated before
invalidating TLB. The new procedure is as following:
1. Calling amdgpu_gtt_mgr_init() in amdgpu_ttm_init(). After this step GTT
BOs can be allocated, but GART mappings are still ignored.
2. Calling amdgpu_gart_table_vram_alloc() from the GMC code. This allocates
the GART backing store.
3. Initializing the hardware, and programming the backing store into VMID0
for all VMHUBs.
4. Calling amdgpu_gtt_mgr_recover() to make sure the table is updated with
the GTT allocations done before it was allocated.
Signed-off-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Aaron Liu <aaron.liu@amd.com>
Acked-by: Huang Rui <ray.huang@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Confirmed with hardware team, there is harvesting for gc 10.3.1.
Signed-off-by: Aaron Liu <aaron.liu@amd.com>
Reviewed-by: Huang Rui <ray.huang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
The parameter order in the list_add_tail is incorrect, it causes the
reuse of ras reserved page.
Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
The commit d5e8ff5f7b ("drm/amdgpu: Fixed the defect of soft lock caused by infinite loop")
had fixed this defect.
Revert workaround
commit a2170b4af6 ("drm/amdgpu: Add judgement to avoid infinite loop").
Signed-off-by: yipechai <YiPeng.Chai@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
The FRU and RAS EEPROMs share the same I2C bus on Aldebaran and Vega 20
ASICs. Set the FRU bus "pointer" to this single bus, as access to the FRU
is sought through that bus "pointer" and not through the RAS bus "pointer".
Cc: Roy Sun <Roy.Sun@amd.com>
Cc: Alex Deucher <Alexander.Deucher@amd.com>
Fixes: 2f60dd5076 ("drm/amd: Expose the FRU SMU I2C bus")
Signed-off-by: Luben Tuikov <luben.tuikov@amd.com>
Reviewed-by: Alex Deucher <Alexander.Deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Prevent random memory access in the FRU EEPROM code by passing the size of
the destination buffer to the reading routine, and reading no more than the
size of the buffer.
Cc: Kent Russell <kent.russell@amd.com>
Cc: Alex Deucher <Alexander.Deucher@amd.com>
Signed-off-by: Luben Tuikov <luben.tuikov@amd.com>
Acked-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
Reviewed-by: Kent Russell <kent.russell@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Read buffers no longer expose the I2C address, and so we don't need to
offset by two when we get the read data.
Cc: Alex Deucher <Alexander.Deucher@amd.com>
Cc: Kent Russell <kent.russell@amd.com>
Cc: Andrey Grodzovsky <Andrey.Grodzovsky@amd.com>
Fixes: bd607166af ("drm/amdgpu: Enable reading FRU chip via I2C v3")
Signed-off-by: Luben Tuikov <luben.tuikov@amd.com>
Acked-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
Reviewed-by: Kent Russell <kent.russell@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Buffer is abbreviated "buf" (buf-fer), not "buff" (buff-er).
This is consistent with the rest of the kernel code.
Cc: Kent Russell <kent.russell@amd.com>
Cc: Alex Deucher <Alexander.Deucher@amd.com>
Signed-off-by: Luben Tuikov <luben.tuikov@amd.com>
Acked-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
Reviewed-by: Kent Russell <kent.russell@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
This adds support to create userptr BOs on restore and introduces a new
ioctl op to restart memory notifiers for the restored userptr BOs.
When doing CRIU restore MMU notifications can happen anytime after we call
amdgpu_mn_register. Prevent MMU notifications until we reach stage-4 of the
restore process i.e. criu_resume ioctl op is received, and the process is
ready to be resumed. This ioctl is different from other KFD CRIU ioctls
since its called by CRIU master restore process for all the target
processes being resumed by CRIU.
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: David Yat Sin <david.yatsin@amd.com>
Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
This adds support to discover the buffer objects that belong to a
process being checkpointed. The data corresponding to these buffer
objects is returned to user space plugin running under criu master
context which then stores this info to recreate these buffer objects
during a restore operation.
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: David Yat Sin <david.yatsin@amd.com>
Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
MESA polls for errors every 2-3 seconds. Printing with dev_info() causes
the dmesg log to fill up with the same message, e.g,
[18028.206676] amdgpu 0000:0b:00.0: amdgpu: df doesn't config ras function.
Make it dev_dbg_once(), as it isn't something correctible during boot or
thereafter, so printing just once is sufficient. Also sanitize the message.
Cc: Alex Deucher <Alexander.Deucher@amd.com>
Cc: Hawking Zhang <Hawking.Zhang@amd.com>
Cc: John Clements <john.clements@amd.com>
Cc: Tao Zhou <tao.zhou1@amd.com>
Cc: yipechai <YiPeng.Chai@amd.com>
Fixes: 8b0fb0e967 ("drm/amdgpu: Modify gfx block to fit for the unified ras block data and ops")
Signed-off-by: Luben Tuikov <luben.tuikov@amd.com>
Reviewed-by: Alex Deucher <Alexander.Deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Some people complained about the name and this matches much
more Linux naming conventions for object functions.
Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Whenever a bo_va structure is added or removed the VM and eventually
added BO should be locked.
Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>