The display engine does not snoop the caches so we should mark
the DSB command buffer as I915_CACHE_NONE.
i915_gem_object_create_internal() always gives us I915_CACHE_LLC
on LLC platforms. And to make things 100% correct we should also
clflush at the end, if necessary.
Note that currently this is a non-issue as we always write the
command buffer through a WC mapping, so a cache flush is not actually
needed. But we might actually want to consider a WB mapping since
we also end up reading from the command buffer (in the indexed
reg write handling). Either that or we should do something else
to avoid those reads (might actually be even more sensible on DGFX
since we end up reading over PCIe). But we should measure the overhead
first...
Anyways, no real harm in adding the belts and suspenders here so
that the code will work correctly regardless of how we map the
buffer. If we do get a WC mapping (as we request)
i915_gem_object_flush_map() will be a nop. Well, apart form
a wmb() which may just flush the WC buffer a bit earlier
than would otherwise happen (at the latest the mmio accesses
would trigger the WC flush).
Signed-off-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20231009132204.15098-2-ville.syrjala@linux.intel.com
Reviewed-by: Uma Shankar <uma.shankar@intel.com>
Using system memory for the DSB command buffer doesn't appear to work.
On DG2 it seems like the hardware internally replaces the actual memory
reads with zeroes, and so we end up executing a bunch of NOOPs instead
of whatever commands we put in the buffer. To determine that I measured
the time it takes to execute the instructions, and the results are
always more or less consistent with executing a buffer full of NOOPs
from local memory.
Another theory I considered was some kind of cache coherency issue.
Looks like i915_gem_object_pin_map_unlocked() will in fact give you a
WB mapping for system memory on DGFX regardless of what mapping mode
was requested (WC in case of the DSB code). But clflush did not
change the behaviour at all, so that theory seems moot.
On DG1 it looks like the hardware might actually be fetching data from
system memory as the logs indicate that we just get underruns. But that
is equally bad, so doesn't look like we can really use system memory on
DG1 either.
Thus always allocate the DSB command buffer from local memory on
discrete GPUs.
Signed-off-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20231009132204.15098-1-ville.syrjala@linux.intel.com
Reviewed-by: Uma Shankar <uma.shankar@intel.com>
Normally we could be in a deep PkgC state all the way up to the
point when DSB starts its execution at the transcoders undelayed
vblank. The DSB will then have to wait for the hardware to
wake up before it can execute anything. This will waste a huge
chunk of the vblank time just waiting, and risks the DSB execution
spilling into the vertical active period. That will be very bad,
especially when programming the LUTs as the anti-collision logic
will cause DSB to corrupt LUT writes during vertical active.
To avoid these problems we can instruct the DSB to pre-wake the
display engine on a specific scanline so that everything will
be 100% ready to go when we hit the transcoder's undelayed vblank.
One annoyance is that the scanline is specified as just that,
a single scanline. So if we happen to start the DSB execution
after passing said scanline no DEwake will happen and we may drop
back into some PkgC state before reaching the transcoder's undelayed
vblank. To prevent that we'll use the "force DEwake" bit to manually
force the display engine to stay awake. We'll then have to clear
the force bit again after the DSB is done (the force bit remains
effective even when the DSB is otherwise disabled).
Signed-off-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20230606191504.18099-18-ville.syrjala@linux.intel.com
Reviewed-by: Uma Shankar <uma.shankar@intel.com>
Avoid the locking overhead for DSB registers. We don't need the locks
and intel_dsb_commit() in particular needs to be called from the
vblank evade critical section and thus needs to be fast.
Signed-off-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20230606191504.18099-3-ville.syrjala@linux.intel.com
Reviewed-by: Uma Shankar <uma.shankar@intel.com>
i915_gem_object_create_internal() does not hand out zeroed
memory. Thus we may confuse whatever stale garbage is in
there as a previous register write and mistakenly handle the
first actual register write as an indexed write. This can
end up corrupting the instruction sufficiently well to lose
the entire register write.
Make sure we've actually emitted a previous instruction before
attemting indexed register write merging.
Signed-off-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20230606191504.18099-7-ville.syrjala@linux.intel.com
Reviewed-by: Animesh Manna <animesh.manna@intel.com>
Introduce a function to emits whatever commands we need
at the end of the DSB command buffer. For the moment we
only do the tail cacheline alignment there, but eventually
we might want to eg. emit an interrupt.
Signed-off-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20230118163040.29808-5-ville.syrjala@linux.intel.com
Reviewed-by: Animesh Manna <animesh.manna@intel.com>
Starting the DSB execution vs. waiting for it stop are two
totally different things. Split intel_dsb_wait() from
intel_dsb_commit() so that we can eventually allow the DSB
to execute asynchronously.
Signed-off-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20230118163040.29808-4-ville.syrjala@linux.intel.com
Reviewed-by: Animesh Manna <animesh.manna@intel.com>
The caller should more or less know how many DSB commands it
wants to emit into the command buffer, so allow it to specify
the size of the command buffer rather than having the low level
DSB code guess it.
Technically we can emit as many as 134+1033 (for adl+ degamma +
10bit gamma) register writes but thanks to the DSB indexed register
write command we get significant space savings so the current size
estimate of 8KiB (~1024 DSB commands) is sufficient for now.
Signed-off-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20221216003810.13338-11-ville.syrjala@linux.intel.com
Reviewed-by: Animesh Manna <animesh.manna@intel.com>
The DSB indexed register write insturction is purely an internal
DSB implementation detail, no reason why the caller should have to
know about it. So let's just have the caller emit blind register
writes let the DSB code convert things to an indexed write if/when
multiple writes occur to the same register offset in a row.
Signed-off-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20221216003810.13338-9-ville.syrjala@linux.intel.com
Reviewed-by: Animesh Manna <animesh.manna@intel.com>
Currently intel_dsb_indexed_reg_write() just assumes the previous
instructions is also an indexed register write, and thus only
checks the register offset. Make the check more robust by
actually checking the instruction opcode as well.
Signed-off-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20221216003810.13338-8-ville.syrjala@linux.intel.com
Reviewed-by: Animesh Manna <animesh.manna@intel.com>
free_pos is in dwords, DSB_BUF_SIZE in bytes. Directly
comparing the two is nonsense. Fix it up, and make sure
we also account for the 8byte alignment requirement for
each instruction, and also assume that each instruction
normally eats two dwords.
Signed-off-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20221216003810.13338-5-ville.syrjala@linux.intel.com
Reviewed-by: Animesh Manna <animesh.manna@intel.com>
Every DSB instruction has to be 8byte aligned. Make sure
that is the case for the non-indexed register writes as well.
The way this could end up unaligned is we emitted an odd
number of indexed register writes beforehand.
Signed-off-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20221216003810.13338-4-ville.syrjala@linux.intel.com
Reviewed-by: Animesh Manna <animesh.manna@intel.com>
We could have many different uses for the DSB(s) during a
single commit, so the current approach of passing the whole
crtc_state to the DSB functions is far too high level. Lower
the abstraction a little bit so each DSB user can decide where
to stick the command buffer/etc.
v2: Document the intel_dsb_prepare() return value (Ankit)
Signed-off-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20221123152638.20622-10-ville.syrjala@linux.intel.com
Reviewed-by: Ankit Nautiyal <ankit.k.nautiyal@intel.com>
Reviewed-by: Uma Shankar <uma.shankar@intel.com>
The use of DSB has to be done differently on a case by case basis.
So no way this kind of blind mmio fallback in the guts of the DSB
code will work properly. Move it at least one level up into the
LUT loading code. Not sure if this is the way we want do the
DSB vs. mmio handling in the end, but at least it's a bit
closer than what we had before.
Signed-off-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20221123152638.20622-8-ville.syrjala@linux.intel.com
Reviewed-by: Ankit Nautiyal <ankit.k.nautiyal@intel.com>
Reviewed-by: Uma Shankar <uma.shankar@intel.com>
Turns out many of the files that need i915_reg.h get it implicitly via
{display/intel_de.h, gt/intel_context.h} -> i915_trace.h -> i915_irq.h
-> i915_reg.h. Since i915_trace.h doesn't actually need i915_irq.h,
makes sense to drop it, but that requires adding quite a few new
includes all over the place.
Prefer including i915_reg.h where needed instead of adding another
implicit include, because eventually we'll want to split up i915_reg.h
and only include the specific registers at each place.
Also some places actually needed i915_irq.h too.
Cc: Lucas De Marchi <lucas.demarchi@intel.com>
Reviewed-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Signed-off-by: Jani Nikula <jani.nikula@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/6e78a2e0ac1bffaf5af3b5ccc21dff05e6518cef.1668008071.git.jani.nikula@intel.com
The request to aqquire gem resources is failing for DSB in rare
scenario where it is busy and the register programming will be done
through mmio fallback path.
DSB has extra advantage of faster register programming which may
go away through mmio path. Adding wait for gem resource also may
not be right as anyways losing time.
To make the CI execution happy replaced drm_err() to drm_info()
for printing debug info during dsb buffer preparation.
v1: Initial version.
v2: Added print for mmio fallback at out label. [Nirmoy]
v3: Improved debug message. [Nirmoy]
Cc: Nirmoy Das <nirmoy.das@linux.intel.com>
Signed-off-by: Animesh Manna <animesh.manna@intel.com>
Signed-off-by: Uma Shankar <uma.shankar@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20220325161140.11906-1-animesh.manna@intel.com
Reviewed-by: Nirmoy Das <nirmoy.das@intel.com>
We have to bash in a lot of registers to load the higher
precision LUT modes. The locking overhead is significant, especially
as we have to get this done as quickly as possible during vblank.
So let's switch to unlocked accesses for these. Fortunately the LUT
registers are mostly spread around such that two pipes do not have
any registers on the same cacheline. So as long as commits on the
same pipe are serialized (which they are) we should get away with
this without angering the hardware.
The only exceptions are the PREC_PIPEGCMAX registers on ilk/snb which
we don't use atm as they are only used in the 12bit gamma mode. If/when
we add support for that we may need to remember to still serialize
those registers, though I'm not sure ilk/snb are actually affected
by the same cacheline issue. I think ivb/hsw at least were, but they
use a different set of registers for the precision LUT.
I have a test case which is updating the LUTs on two pipes from a
single atomic commit. Running that in a loop for a minute I get the
following worst case with the locks in place:
intel_crtc_vblank_work_start: pipe B, frame=10037, scanline=1081
intel_crtc_vblank_work_start: pipe A, frame=12274, scanline=769
intel_crtc_vblank_work_end: pipe A, frame=12274, scanline=58
intel_crtc_vblank_work_end: pipe B, frame=10037, scanline=74
And here's the worst case with the locks removed:
intel_crtc_vblank_work_start: pipe B, frame=5869, scanline=1081
intel_crtc_vblank_work_start: pipe A, frame=7616, scanline=769
intel_crtc_vblank_work_end: pipe B, frame=5869, scanline=1096
intel_crtc_vblank_work_end: pipe A, frame=7616, scanline=777
The test was done on a snb using the 10bit 1024 entry LUT mode.
The vtotals for the two displays are 793 and 1125. So we can
see that with the locks ripped out the LUT updates are pretty
nicely confined within the vblank, whereas with the locks in
place we're routinely blasting past the vblank end which causes
visual artifacts near the top of the screen.
Signed-off-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20211020223339.669-5-ville.syrjala@linux.intel.com
Reviewed-by: Uma Shankar <uma.shankar@intel.com>
Hoist the intel_de.h include from intel_display_types.h one
level up. I need this in order to untangle the include order
so that I can add tracepoints into intel_de.h.
This little cocci script did most of the work for me:
@find@
@@
(
intel_de_read(...)
|
intel_de_read_fw(...)
|
intel_de_write(...)
|
intel_de_write_fw(...)
)
@has_include@
@@
(
#include "intel_de.h"
|
#include "display/intel_de.h"
)
@depends on find && !has_include@
@@
+ #include "intel_de.h"
#include "intel_display_types.h"
@depends on find && !has_include@
@@
+ #include "display/intel_de.h"
#include "display/intel_display_types.h"
Cc: Cooper Chiou <cooper.chiou@intel.com>
Reviewed-by: Anshuman Gupta <anshuman.gupta@intel.com>
Reviewed-by: Jani Nikula <jani.nikula@intel.com>
Signed-off-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20210430143945.6776-1-ville.syrjala@linux.intel.com
Because of the long lifetime of the mapping, we cannot wrap this in a
simple limited ww lock. Just use the unlocked version of pin_map,
because we'll likely release the mapping a lot later, in a different
thread.
Signed-off-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Reviewed-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Link: https://patchwork.freedesktop.org/patch/msgid/20210323155059.628690-39-maarten.lankhorst@linux.intel.com
Currently there is no null check for a failed memory allocation
on the dsb object and without this a null pointer dereference
error can occur. Fix this by adding a null check.
Note: added a drm_err message in keeping with the error message style
in the function.
Addresses-Coverity: ("Dereference null return")
Fixes: afeda4f3b1 ("drm/i915/dsb: Pre allocate and late cleanup of cmd buffer")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Jani Nikula <jani.nikula@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20200616114221.73971-1-colin.king@canonical.com
drivers/gpu/drm/i915/display/intel_dsb.c:177 intel_dsb_reg_write() warn: variable dereferenced before check 'dsb' (see line 175)
Fixes: afeda4f3b1 ("drm/i915/dsb: Pre allocate and late cleanup of cmd buffer")
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Ville Syrjälä <ville.syrjala@linux.intel.com>
Cc: Animesh Manna <animesh.manna@intel.com>
Cc: Uma Shankar <uma.shankar@intel.com>
Reviewed-by: Animesh Manna <animesh.manna@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20200524233900.25598-1-chris@chris-wilson.co.uk
Pre-allocate command buffer in atomic_commit using intel_dsb_prepare
function which also includes pinning and map in cpu domain.
No functional change is dsb write/commit functions.
Now dsb get/put function is removed and ref-count mechanism is
not needed. Below dsb api added to do respective job mentioned
below.
intel_dsb_prepare - Allocate, pin and map the buffer.
intel_dsb_cleanup - Unpin and release the gem object.
RFC: Initial patch for design review.
v2: included _init() part in _prepare(). [Daniel, Ville]
v3: dsb_cleanup called after cleanup_planes. [Daniel]
v4: dsb structure is moved to intel_crtc_state from intel_crtc. [Maarten]
v5: dsb get/put/ref-count mechanism removed. [Maarten]
v6: Based on review feedback following changes are added,
- replaced intel_dsb structure by pointer in intel_crtc_state. [Maarten]
- passing intel_crtc_state to dsp-api to simplify the code. [Maarten]
- few dsb functions prototype modified to simplify code.
v7: added few cosmetic changes suggested by Jani and null check for
crtc_state in dsb_cleanup removed as suggested by Maarten.
v8: changed the function parameter to intel_crtc_state* of
ivb_load_lut_ext_max() from intel_crtc. [Maarten]
v9: error handling improved in _write() and prepare(). [Maarten]
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Cc: Ville Syrjälä <ville.syrjala@linux.intel.com>
Cc: Jani Nikula <jani.nikula@intel.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Reviewed-by: Maarten Lankhorst <maarten.lankhorst@intel.com>
Signed-off-by: Animesh Manna <animesh.manna@intel.com>
Signed-off-by: Uma Shankar <uma.shankar@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20200520130737.11240-1-animesh.manna@intel.com
Remove a number of inlines from .c files, and let the compiler decide
what's best. There's more to do, but need to start somewhere, and need
to start setting the example.
Acked-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Signed-off-by: Jani Nikula <jani.nikula@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20200420140438.14672-2-jani.nikula@intel.com
The "err" label is not really "err", but rather "out" since the return
path is shared between error condition and normal path. This broke when
commit 03cea61076 ("drm/i915/dsb: fix extra warning on error path
handling") added a "dsb->cmd_buf = NULL;" there, making DSB to stop
working since now all writes would pass-through via mmio.
Remove the set to NULL since it's actually not needed: we only set it if
all steps are successful. While at it, rename the label so this confusion
doesn't happen again.
Fixes: 03cea61076 ("drm/i915/dsb: fix extra warning on error path handling")
Resolves: https://gitlab.freedesktop.org/drm/intel/issues/8
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
Reviewed-by: Animesh Manna <animesh.manna@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20191127221119.384754-1-lucas.demarchi@intel.com
When we call intel_dsb_get(), the dsb initialization may fail for
various reasons. We already log the error message in that path, making
it unnecessary to trigger a warning that refcount == 0 when calling
intel_dsb_put().
So here we simplify the logic and do lazy shutdown: leaving the extra
refcount alive so when we call intel_dsb_put() we end up calling
i915_vma_unpin_and_release().
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
Reviewed-by: Matt Roper <matthew.d.roper@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20191111205024.22853-3-lucas.demarchi@intel.com
The current dsb API is not really prepared to handle multithread access.
I was debugging an issue that ended up fixed by commit a096883dda
("drm/i915/dsb: Remove PIN_MAPPABLE from the DSB object VMA") and was
puzzled how these atomic operations were guaranteeing atomicity.
if (atomic_add_return(1, &dsb->refcount) != 1)
return dsb;
Thread A could still be initializing dsb struct (and even fail in the
middle) while thread B would take a reference and use it (even
derefencing a NULL cmd_buf).
I don't think the atomic operations here will help much if this were
to support multithreaded scenario in future, so just remove them to
avoid confusion.
v2: Use refcount++ != 0 instead of ++refcount != 1 (from Ville)
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
Reviewed-by: Matt Roper <matthew.d.roper@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20191111205024.22853-2-lucas.demarchi@intel.com
Link: https://patchwork.freedesktop.org/patch/msgid/20191116011539.18230-1-lucas.demarchi@intel.com
Replace the struct_mutex requirement for pinning the i915_vma with the
local vm->mutex instead. Note that the vm->mutex is tainted by the
shrinker (we require unbinding from inside fs-reclaim) and so we cannot
allocate while holding that mutex. Instead we have to preallocate
workers to do allocate and apply the PTE updates after we have we
reserved their slot in the drm_mm (using fences to order the PTE writes
with the GPU work and with later unbind).
In adding the asynchronous vma binding, one subtle requirement is to
avoid coupling the binding fence into the backing object->resv. That is
the asynchronous binding only applies to the vma timeline itself and not
to the pages as that is a more global timeline (the binding of one vma
does not need to be ordered with another vma, nor does the implicit GEM
fencing depend on a vma, only on writes to the backing store). Keeping
the vma binding distinct from the backing store timelines is verified by
a number of async gem_exec_fence and gem_exec_schedule tests. The way we
do this is quite simple, we keep the fence for the vma binding separate
and only wait on it as required, and never add it to the obj->resv
itself.
Another consequence in reducing the locking around the vma is the
destruction of the vma is no longer globally serialised by struct_mutex.
A natural solution would be to add a kref to i915_vma, but that requires
decoupling the reference cycles, possibly by introducing a new
i915_mm_pages object that is own by both obj->mm and vma->pages.
However, we have not taken that route due to the overshadowing lmem/ttm
discussions, and instead play a series of complicated games with
trylocks to (hopefully) ensure that only one destruction path is called!
v2: Add some commentary, and some helpers to reduce patch churn.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20191004134015.13204-4-chris@chris-wilson.co.uk
Added docbook info regarding Display State Buffer(DSB) which
is added from gen12 onwards to batch submit display HW programming.
v1: Initial version as RFC.
Cc: Jani Nikula <jani.nikula@intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Shashank Sharma <shashank.sharma@intel.com>
Reviewed-by: Shashank Sharma <shashank.sharma@intel.com>
Signed-off-by: Animesh Manna <animesh.manna@intel.com>
Signed-off-by: Jani Nikula <jani.nikula@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20190920115930.27829-11-animesh.manna@intel.com