The amdgpu driver is composed of multiple components, each of which can be a source of some specific problem that the user/developer can see. This commit introduces steps to narrow down and collect display information. Cc: Leo Li <sunpeng.li@amd.com> Cc: Aurabindo Pillai <aurabindo.pillai@amd.com> Cc: Hamza Mahfooz <hamza.mahfooz@amd.com> Cc: Harry Wentland <harry.wentland@amd.com> Cc: Mario Limonciello <mario.limonciello@amd.com> Cc: Christian Konig <christian.koenig@amd.com> Cc: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Rodrigo Siqueira <Rodrigo.Siqueira@amd.com> Reviewed-by: Mario Limonciello <mario.limonciello@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
305 lines
13 KiB
ReStructuredText
305 lines
13 KiB
ReStructuredText
========================
|
||
Display Core Debug tools
|
||
========================
|
||
|
||
In this section, you will find helpful information on debugging the amdgpu
|
||
driver from the display perspective. This page introduces debug mechanisms and
|
||
procedures to help you identify if some issues are related to display code.
|
||
|
||
Narrow down display issues
|
||
==========================
|
||
|
||
Since the display is the driver's visual component, it is common to see users
|
||
reporting issues as a display when another component causes the problem. This
|
||
section equips users to determine if a specific issue was caused by the display
|
||
component or another part of the driver.
|
||
|
||
DC dmesg important messages
|
||
---------------------------
|
||
|
||
The dmesg log is the first source of information to be checked, and amdgpu
|
||
takes advantage of this feature by logging some valuable information. When
|
||
looking for the issues associated with amdgpu, remember that each component of
|
||
the driver (e.g., smu, PSP, dm, etc.) is loaded one by one, and this
|
||
information can be found in the dmesg log. In this sense, look for the part of
|
||
the log that looks like the below log snippet::
|
||
|
||
[ 4.254295] [drm] initializing kernel modesetting (IP DISCOVERY 0x1002:0x744C 0x1002:0x0E3B 0xC8).
|
||
[ 4.254718] [drm] register mmio base: 0xFCB00000
|
||
[ 4.254918] [drm] register mmio size: 1048576
|
||
[ 4.260095] [drm] add ip block number 0 <soc21_common>
|
||
[ 4.260318] [drm] add ip block number 1 <gmc_v11_0>
|
||
[ 4.260510] [drm] add ip block number 2 <ih_v6_0>
|
||
[ 4.260696] [drm] add ip block number 3 <psp>
|
||
[ 4.260878] [drm] add ip block number 4 <smu>
|
||
[ 4.261057] [drm] add ip block number 5 <dm>
|
||
[ 4.261231] [drm] add ip block number 6 <gfx_v11_0>
|
||
[ 4.261402] [drm] add ip block number 7 <sdma_v6_0>
|
||
[ 4.261568] [drm] add ip block number 8 <vcn_v4_0>
|
||
[ 4.261729] [drm] add ip block number 9 <jpeg_v4_0>
|
||
[ 4.261887] [drm] add ip block number 10 <mes_v11_0>
|
||
|
||
From the above example, you can see the line that reports that `<dm>`,
|
||
(**Display Manager**), was loaded, which means that display can be part of the
|
||
issue. If you do not see that line, something else might have failed before
|
||
amdgpu loads the display component, indicating that we don't have a
|
||
display issue.
|
||
|
||
After you identified that the DM was loaded correctly, you can check for the
|
||
display version of the hardware in use, which can be retrieved from the dmesg
|
||
log with the command::
|
||
|
||
dmesg | grep -i 'display core'
|
||
|
||
This command shows a message that looks like this::
|
||
|
||
[ 4.655828] [drm] Display Core v3.2.285 initialized on DCN 3.2
|
||
|
||
This message has two key pieces of information:
|
||
|
||
* **The DC version (e.g., v3.2.285)**: Display developers release a new DC version
|
||
every week, and this information can be advantageous in a situation where a
|
||
user/developer must find a good point versus a bad point based on a tested
|
||
version of the display code. Remember from page :ref:`Display Core <amdgpu-display-core>`,
|
||
that every week the new patches for display are heavily tested with IGT and
|
||
manual tests.
|
||
* **The DCN version (e.g., DCN 3.2)**: The DCN block is associated with the
|
||
hardware generation, and the DCN version conveys the hardware generation that
|
||
the driver is currently running. This information helps to narrow down the
|
||
code debug area since each DCN version has its files in the DC folder per DCN
|
||
component (from the example, the developer might want to focus on
|
||
files/folders/functions/structs with the dcn32 label might be executed).
|
||
However, keep in mind that DC reuses code across different DCN versions; for
|
||
example, it is expected to have some callbacks set in one DCN that are the same
|
||
as those from another DCN. In summary, use the DCN version just as a guide.
|
||
|
||
From the dmesg file, it is also possible to get the ATOM bios code by using::
|
||
|
||
dmesg | grep -i 'ATOM BIOS'
|
||
|
||
Which generates an output that looks like this::
|
||
|
||
[ 4.274534] amdgpu: ATOM BIOS: 113-D7020100-102
|
||
|
||
This type of information is useful to be reported.
|
||
|
||
Avoid loading display core
|
||
--------------------------
|
||
|
||
Sometimes, it might be hard to figure out which part of the driver is causing
|
||
the issue; if you suspect that the display is not part of the problem and your
|
||
bug scenario is simple (e.g., some desktop configuration) you can try to remove
|
||
the display component from the equation. First, you need to identify `dm` ID
|
||
from the dmesg log; for example, search for the following log::
|
||
|
||
[ 4.254295] [drm] initializing kernel modesetting (IP DISCOVERY 0x1002:0x744C 0x1002:0x0E3B 0xC8).
|
||
[..]
|
||
[ 4.260095] [drm] add ip block number 0 <soc21_common>
|
||
[ 4.260318] [drm] add ip block number 1 <gmc_v11_0>
|
||
[..]
|
||
[ 4.261057] [drm] add ip block number 5 <dm>
|
||
|
||
Notice from the above example that the `dm` id is 5 for this specific hardware.
|
||
Next, you need to run the following binary operation to identify the IP block
|
||
mask::
|
||
|
||
0xffffffff & ~(1 << [DM ID])
|
||
|
||
From our example the IP mask is::
|
||
|
||
0xffffffff & ~(1 << 5) = 0xffffffdf
|
||
|
||
Finally, to disable DC, you just need to set the below parameter in your
|
||
bootloader::
|
||
|
||
amdgpu.ip_block_mask = 0xffffffdf
|
||
|
||
If you can boot your system with the DC disabled and still see the issue, it
|
||
means you can rule DC out of the equation. However, if the bug disappears, you
|
||
still need to consider the DC part of the problem and keep narrowing down the
|
||
issue. In some scenarios, disabling DC is impossible since it might be
|
||
necessary to use the display component to reproduce the issue (e.g., play a
|
||
game).
|
||
|
||
**Note: This will probably lead to the absence of a display output.**
|
||
|
||
Display flickering
|
||
------------------
|
||
|
||
Display flickering might have multiple causes; one is the lack of proper power
|
||
to the GPU or problems in the DPM switches. A good first generic verification
|
||
is to set the GPU to use high voltage::
|
||
|
||
bash -c "echo high > /sys/class/drm/card0/device/power_dpm_force_performance_level"
|
||
|
||
The above command sets the GPU/APU to use the maximum power allowed which
|
||
disables DPM switches. If forcing DPM levels high does not fix the issue, it
|
||
is less likely that the issue is related to power management. If the issue
|
||
disappears, there is a good chance that other components might be involved, and
|
||
the display should not be ignored since this could be a DPM issues. From the
|
||
display side, if the power increase fixes the issue, it is worth debugging the
|
||
clock configuration and the pipe split police used in the specific
|
||
configuration.
|
||
|
||
Display artifacts
|
||
-----------------
|
||
|
||
Users may see some screen artifacts that can be categorized into two different
|
||
types: localized artifacts and general artifacts. The localized artifacts
|
||
happen in some specific areas, such as around the UI window corners; if you see
|
||
this type of issue, there is a considerable chance that you have a userspace
|
||
problem, likely Mesa or similar. The general artifacts usually happen on the
|
||
entire screen. They might be caused by a misconfiguration at the driver level
|
||
of the display parameters, but the userspace might also cause this issue. One
|
||
way to identify the source of the problem is to take a screenshot or make a
|
||
desktop video capture when the problem happens; after checking the
|
||
screenshot/video recording, if you don't see any of the artifacts, it means
|
||
that the issue is likely on the the driver side. If you can still see the
|
||
problem in the data collected, it is an issue that probably happened during
|
||
rendering, and the display code just got the framebuffer already corrupted.
|
||
|
||
Disabling/Enabling specific features
|
||
====================================
|
||
|
||
DC has a struct named `dc_debug_options`, which is statically initialized by
|
||
all DCE/DCN components based on the specific hardware characteristic. This
|
||
structure usually facilitates the bring-up phase since developers can start
|
||
with many disabled features and enable them individually. This is also an
|
||
important debug feature since users can change it when debugging specific
|
||
issues.
|
||
|
||
For example, dGPU users sometimes see a problem where a horizontal fillet of
|
||
flickering happens in some specific part of the screen. This could be an
|
||
indication of Sub-Viewport issues; after the users identified the target DCN,
|
||
they can set the `force_disable_subvp` field to true in the statically
|
||
initialized version of `dc_debug_options` to see if the issue gets fixed. Along
|
||
the same lines, users/developers can also try to turn off `fams2_config` and
|
||
`enable_single_display_2to1_odm_policy`. In summary, the `dc_debug_options` is
|
||
an interesting form for identifying the problem.
|
||
|
||
DC Visual Confirmation
|
||
======================
|
||
|
||
Display core provides a feature named visual confirmation, which is a set of
|
||
bars added at the scanout time by the driver to convey some specific
|
||
information. In general, you can enable this debug option by using::
|
||
|
||
echo <N> > /sys/kernel/debug/dri/0/amdgpu_dm_visual_confirm
|
||
|
||
Where `N` is an integer number for some specific scenarios that the developer
|
||
wants to enable, you will see some of these debug cases in the following
|
||
subsection.
|
||
|
||
Multiple Planes Debug
|
||
---------------------
|
||
|
||
If you want to enable or debug multiple planes in a specific user-space
|
||
application, you can leverage a debug feature named visual confirm. For
|
||
enabling it, you will need::
|
||
|
||
echo 1 > /sys/kernel/debug/dri/0/amdgpu_dm_visual_confirm
|
||
|
||
You need to reload your GUI to see the visual confirmation. When the plane
|
||
configuration changes or a full update occurs there will be a colored bar at
|
||
the bottom of each hardware plane being drawn on the screen.
|
||
|
||
* The color indicates the format - For example, red is AR24 and green is NV12
|
||
* The height of the bar indicates the index of the plane
|
||
* Pipe split can be observed if there are two bars with a difference in height
|
||
covering the same plane
|
||
|
||
Consider the video playback case in which a video is played in a specific
|
||
plane, and the desktop is drawn in another plane. The video plane should
|
||
feature one or two green bars at the bottom of the video depending on pipe
|
||
split configuration.
|
||
|
||
* There should **not** be any visual corruption
|
||
* There should **not** be any underflow or screen flashes
|
||
* There should **not** be any black screens
|
||
* There should **not** be any cursor corruption
|
||
* Multiple plane **may** be briefly disabled during window transitions or
|
||
resizing but should come back after the action has finished
|
||
|
||
Pipe Split Debug
|
||
----------------
|
||
|
||
Sometimes we need to debug if DCN is splitting pipes correctly, and visual
|
||
confirmation is also handy for this case. Similar to the MPO case, you can use
|
||
the below command to enable visual confirmation::
|
||
|
||
echo 1 > /sys/kernel/debug/dri/0/amdgpu_dm_visual_confirm
|
||
|
||
In this case, if you have a pipe split, you will see one small red bar at the
|
||
bottom of the display covering the entire display width and another bar
|
||
covering the second pipe. In other words, you will see a bit high bar in the
|
||
second pipe.
|
||
|
||
DTN Debug
|
||
=========
|
||
|
||
DC (DCN) provides an extensive log that dumps multiple details from our
|
||
hardware configuration. Via debugfs, you can capture those status values by
|
||
using Display Test Next (DTN) log, which can be captured via debugfs by using::
|
||
|
||
cat /sys/kernel/debug/dri/0/amdgpu_dm_dtn_log
|
||
|
||
Since this log is updated accordingly with DCN status, you can also follow the
|
||
change in real-time by using something like::
|
||
|
||
sudo watch -d cat /sys/kernel/debug/dri/0/amdgpu_dm_dtn_log
|
||
|
||
When reporting a bug related to DC, consider attaching this log before and
|
||
after you reproduce the bug.
|
||
|
||
Collect Firmware information
|
||
============================
|
||
|
||
When reporting issues, it is important to have the firmware information since
|
||
it can be helpful for debugging purposes. To get all the firmware information,
|
||
use the command::
|
||
|
||
cat /sys/kernel/debug/dri/0/amdgpu_firmware_info
|
||
|
||
From the display perspective, pay attention to the firmware of the DMCU and
|
||
DMCUB.
|
||
|
||
DMUB Firmware Debug
|
||
===================
|
||
|
||
Sometimes, dmesg logs aren't enough. This is especially true if a feature is
|
||
implemented primarily in DMUB firmware. In such cases, all we see in dmesg when
|
||
an issue arises is some generic timeout error. So, to get more relevant
|
||
information, we can trace DMUB commands by enabling the relevant bits in
|
||
`amdgpu_dm_dmub_trace_mask`.
|
||
|
||
Currently, we support the tracing of the following groups:
|
||
|
||
Trace Groups
|
||
------------
|
||
|
||
.. csv-table::
|
||
:header-rows: 1
|
||
:widths: 1, 1
|
||
:file: ./trace-groups-table.csv
|
||
|
||
**Note: Not all ASICs support all of the listed trace groups**
|
||
|
||
So, to enable just PSR tracing you can use the following command::
|
||
|
||
# echo 0x8020 > /sys/kernel/debug/dri/0/amdgpu_dm_dmub_trace_mask
|
||
|
||
Then, you need to enable logging trace events to the buffer, which you can do
|
||
using the following::
|
||
|
||
# echo 1 > /sys/kernel/debug/dri/0/amdgpu_dm_dmcub_trace_event_en
|
||
|
||
Lastly, after you are able to reproduce the issue you are trying to debug,
|
||
you can disable tracing and read the trace log by using the following::
|
||
|
||
# echo 0 > /sys/kernel/debug/dri/0/amdgpu_dm_dmcub_trace_event_en
|
||
# cat /sys/kernel/debug/dri/0/amdgpu_dm_dmub_tracebuffer
|
||
|
||
So, when reporting bugs related to features such as PSR and ABM, consider
|
||
enabling the relevant bits in the mask before reproducing the issue and
|
||
attach the log that you obtain from the trace buffer in any bug reports that you
|
||
create.
|