cgroup.h (therefore swap.h, therefore half of the universe)
includes bpf.h which in turn includes module.h and slab.h.
Since we're about to get rid of that dependency we need
to clean things up.
v2: drop the cpu.h include from cacheinfo.h, it's not necessary
and it makes riscv sensitive to ordering of include files.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Krzysztof Wilczyński <kw@linux.com>
Acked-by: Peter Chen <peter.chen@kernel.org>
Acked-by: SeongJae Park <sj@kernel.org>
Acked-by: Jani Nikula <jani.nikula@intel.com>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Link: https://lore.kernel.org/all/20211120035253.72074-1-kuba@kernel.org/ # v1
Link: https://lore.kernel.org/all/20211120165528.197359-1-kuba@kernel.org/ # cacheinfo discussion
Link: https://lore.kernel.org/bpf/20211202203400.1208663-1-kuba@kernel.org
We really only need memcpy restore for objects that affect the
operability of the migrate context. That is, primarily the page-table
objects of the migrate VM.
Add an object flag, I915_BO_ALLOC_PM_EARLY for objects that need early
restores using memcpy and a way to assign LMEM page-table object flags
to be used by the vms.
Restore objects without this flag with the gpu blitter and only objects
carrying the flag using TTM memcpy.
Initially mark the migrate, gt, gtt and vgpu vms to use this flag, and
defer for a later audit which vms actually need it. Most importantly, user-
allocated vms with pinned page-table objects can be restored using the
blitter.
Performance-wise memcpy restore is probably as fast as gpu restore if not
faster, but using gpu restore will help tackling future restrictions in
mappable LMEM size.
v4:
- Don't mark the aliasing ppgtt page table flags for early resume, but
rather the ggtt page table flags as intended. (Matthew Auld)
- The check for user buffer objects during early resume is pointless, since
they are never marked I915_BO_ALLOC_PM_EARLY. (Matthew Auld)
v5:
- Mark GuC LMEM objects with I915_BO_ALLOC_PM_EARLY to have them restored
before we fire up the migrate context.
Cc: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Reviewed-by: Matthew Auld <matthew.auld@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20210922062527.865433-8-thomas.hellstrom@linux.intel.com
The full audit is quite a bit of work:
- i915_dpt has very simple lifetime (somehow we create a display pagetable vm
per object, so its _very_ simple, there's only ever a single vma in there),
and uses i915_vm_close(), which internally does a i915_vm_put(). No rcu.
Aside: wtf is i915_dpt doing in the intel_display.c garbage collector as a new
feature, instead of added as a separate file with some clean-ish interface.
Also, i915_dpt unfortunately re-introduces some coding patterns from
pre-dma_resv_lock conversion times.
- i915_gem_proto_ctx is fully refcounted and no rcu, all protected by
fpriv->proto_context_lock.
- i915_gem_context is itself rcu protected, and that might leak to anything it
points at. Before
commit cf977e1861
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date: Wed Dec 2 11:21:40 2020 +0000
drm/i915/gem: Spring clean debugfs
and
commit db80a1294c
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date: Mon Jan 18 11:08:54 2021 +0000
drm/i915/gem: Remove per-client stats from debugfs/i915_gem_objects
we had a bunch of debugfs files that relied on rcu protecting everything, but
those are gone now. The main one was removed even earlier with
There doesn't seem to be anything left that's actually protecting
stuff now that the ctx->vm itself is invariant. See
commit ccbc1b9794
Author: Jason Ekstrand <jason@jlekstrand.net>
Date: Thu Jul 8 10:48:30 2021 -0500
drm/i915/gem: Don't allow changing the VM on running contexts (v4)
Note that we drop the vm refcount before the final release of the gem context
refcount, so this is all very dangerous even without rcu. Note that aside from
later on creating new engines (a defunct feature) and debug output we're never
looked at gem_ctx->vm for anything functional, hence why this is ok.
Fingers crossed.
Preceeding patches removed all vestiges of rcu use from gem_ctx->vm
derferencing to make it clear it's really not used.
The gem_ctx->rcu protection was introduced in
commit a4e7ccdac3
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date: Fri Oct 4 14:40:09 2019 +0100
drm/i915: Move context management under GEM
The commit message is somewhat entertaining because it fails to
mention this fact completely, and compensates that by an in-commit
changelog entry that claims that ctx->vm is protected by ctx->mutex.
Which was the case _before_ this commit, but no longer after it.
- intel_context holds a full reference. Unfortunately intel_context is also rcu
protected and the reference to the ->vm is dropped before the
rcu barrier - only the kfree is delayed. So again we need to check
whether that leaks anywhere on the intel_context->vm. RCU is only
used to protect intel_context sitting on the breadcrumb lists, which
don't look at the vm anywhere, so we are fine.
Nothing else relies on rcu protection of intel_context and hence is
fully protected by the kref refcount alone, which protects
intel_context->vm in turn.
The breadcrumbs rcu usage was added in
commit c744d50363
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date: Thu Nov 26 14:04:06 2020 +0000
drm/i915/gt: Split the breadcrumb spinlock between global and contexts
its parent commit added the intel_context rcu protection:
commit 14d1eaf088
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date: Thu Nov 26 14:04:05 2020 +0000
drm/i915/gt: Protect context lifetime with RCU
given some credence to my claim that I've actually caught them all.
- drm_i915_gem_object's shares_resv_from pointer has a full refcount to the
dma_resv, which is a sub-refcount that's released after the final
i915_vm_put() has been called. Safe.
Aside: Maybe we should have a struct dma_resv_shared which is just dma_resv +
kref as a stand-alone thing. It's a pretty useful pattern which other drivers
might want to copy.
For a bit more context see
commit 4d8151ae53
Author: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Date: Tue Jun 1 09:46:41 2021 +0200
drm/i915: Don't free shared locks while shared
- the fpriv->vm_xa was relying on rcu_read_lock for lookup, but that
was updated in a prep patch too to just be a spinlock-protected
lookup.
- intel_gt->vm is set at driver load in intel_gt_init() and released
in intel_gt_driver_release(). There seems to be some issue that
in some error paths this is called twice, but otherwise no rcu to be
found anywhere. This was added in the below commit, which
unfortunately doesn't explain why this complication exists.
commit e6ba764802
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date: Sat Dec 21 16:03:24 2019 +0000
drm/i915: Remove i915->kernel_context
The proper fix most likely for this is to start using drmm_ at large
scale, but that's also huge amounts of work.
- i915_vma->vm is some real pain, because rcu is rcu protected, at
least in the vma lookup in the context lookup cache in
eb_lookup_vma(). This was added in
commit 4ff4b44cbb
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date: Fri Jun 16 15:05:16 2017 +0100
drm/i915: Store a direct lookup from object handle to vma
This was changed to a radix tree from the hashtable in, but with the
locking unchanged, in
commit d1b48c1e71
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date: Wed Aug 16 09:52:08 2017 +0100
drm/i915: Replace execbuf vma ht with an idr
In
commit 93159e1235
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date: Mon Mar 23 09:28:41 2020 +0000
drm/i915/gem: Avoid gem_context->mutex for simple vma lookup
the locking was changed from dev->struct_mutex to rcu, which added
the requirement to rcu protect i915_vma. Somehow this was missed in
review (or I'm completely blind).
Irrespective of all that the vma lookup cache rcu_read_lock grabs a
full reference of the vma and the rcu doesn't leak further. So no
impact on i915_address_space from that.
I have not found any other rcu use for i915_vma, but given that it
seems broken I also didn't bother to do a careful in-depth audit.
Alltogether there's nothing left in-tree anymore which requires that a
pointer deref to an i915_address_space is safe undre rcu_read_lock
only.
rcu protection of i915_address_space was introduced in
commit b32fa81115
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date: Thu Jun 20 19:37:05 2019 +0100
drm/i915/gtt: Defer address space cleanup to an RCU worker
by mixing up a bugfixing (i915_address_space needs to be released from
a worker) with enabling rcu support. The commit message also seems
somewhat confused, because it talks about cleanup of WC pages
requiring sleep, while the code and linked bugzilla are about a
requirement to take dev->struct_mutex (which yes sleeps but it's a
much more specific problem). Since final kref_put can be called from
pretty much anywhere (including hardirq context through the
scheduler's i915_active cleanup) we need a worker here. Hence that
part must be kept.
Ideally all these reclaim workers should have some kind of integration
with our shrinkers, but for some of these it's rather tricky. Anyway,
that's a preexisting condition in the codeebase that we wont fix in
this patch here.
We also remove the rcu_barrier in ggtt_cleanup_hw added in
commit 60a4233a49
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date: Mon Jul 29 14:24:12 2019 +0100
drm/i915: Flush the i915_vm_release before ggtt shutdown
Reviewed-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
Cc: Jon Bloomfield <jon.bloomfield@intel.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: "Thomas Hellström" <thomas.hellstrom@linux.intel.com>
Cc: Matthew Auld <matthew.auld@intel.com>
Cc: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Cc: Dave Airlie <airlied@redhat.com>
Cc: Jason Ekstrand <jason@jlekstrand.net>
Link: https://patchwork.freedesktop.org/patch/msgid/20210902142057.929669-11-daniel.vetter@ffwll.ch
The min_page_size is only needed for pages inserted into the GTT, and
for our paging structures we only need at most 4K bytes, so simply
ignore the min_page_size restrictions here, otherwise we might see some
severe overallocation on some devices.
v2(Thomas): add some commentary
Signed-off-by: Matthew Auld <matthew.auld@intel.com>
Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Reviewed-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20210625103824.558481-2-matthew.auld@intel.com
This was done by the following semantic patch:
@@ expression i915; @@
- INTEL_GEN(i915)
+ GRAPHICS_VER(i915)
@@ expression i915; expression E; @@
- INTEL_GEN(i915) >= E
+ GRAPHICS_VER(i915) >= E
@@ expression dev_priv; expression E; @@
- !IS_GEN(dev_priv, E)
+ GRAPHICS_VER(dev_priv) != E
@@ expression dev_priv; expression E; @@
- IS_GEN(dev_priv, E)
+ GRAPHICS_VER(dev_priv) == E
@@
expression dev_priv;
expression from, until;
@@
- IS_GEN_RANGE(dev_priv, from, until)
+ IS_GRAPHICS_VER(dev_priv, from, until)
@def@
expression E;
identifier id =~ "^gen$";
@@
- id = GRAPHICS_VER(E)
+ ver = GRAPHICS_VER(E)
@@
identifier def.id;
@@
- id
+ ver
It also takes care of renaming the variable we assign to GRAPHICS_VER()
so to use "ver" rather than "gen".
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
Reviewed-by: Matt Roper <matthew.d.roper@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20210605155356.4183026-2-lucas.demarchi@intel.com
We are currently sharing the VM reservation locks across a number of
gem objects with page-table memory. Since TTM will individiualize the
reservation locks when freeing objects, including accessing the shared
locks, make sure that the shared locks are not freed until that is done.
For PPGTT we add an additional refcount, for GGTT we take additional
measures to make sure objects sharing the GGTT reservation lock are
freed at GGTT takedown
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Reviewed-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Signed-off-by: Matthew Auld <matthew.auld@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20210601074654.3103-3-thomas.hellstrom@linux.intel.com
It's a requirement that for dgfx we place all the paging structures in
device local-memory.
v2: use i915_coherent_map_type()
v3: improve the shared dma-resv object comment
Signed-off-by: Matthew Auld <matthew.auld@intel.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20210427085417.120246-4-matthew.auld@intel.com
We need to generalise our accessor for the page directories and tables from
using the simple kmap_atomic to support local memory, and this setup
must be done on acquisition of the backing storage prior to entering
fence execution contexts. Here we replace the kmap with the object
mapping code that for simple single page shmemfs object will return a
plain kmap, that is then kept for the lifetime of the page directory.
Note that keeping the mapping around is a potential concern here, since
while the vma is pinned the mapping remains there for the PDs
underneath, or at least until the used_count reaches zero, at which
point we can safely destroy the mapping. For 32b this will be even worse
since the address space is more limited, but since this change mostly
impacts full ppGTT platforms, the justification is that for modern
platforms we shouldn't care too much about 32b.
Signed-off-by: Matthew Auld <matthew.auld@intel.com>
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20210427085417.120246-3-matthew.auld@intel.com
We may create page table objects on the fly, but we may need to
wait with the ww lock held. Instead of waiting on a freed obj
lock, ensure we have the same lock for each object to keep
-EDEADLK working. This ensures that i915_vma_pin_ww can lock
the page tables when required.
Signed-off-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Reviewed-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Link: https://patchwork.freedesktop.org/patch/msgid/20210323155059.628690-41-maarten.lankhorst@linux.intel.com
This should be done as part of the ww loop, in order to remove a
i915_vma_pin that needs ww held.
Now only i915_ggtt_pin() callers remaining.
Signed-off-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Reviewed-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Link: https://patchwork.freedesktop.org/patch/msgid/20210323155059.628690-25-maarten.lankhorst@linux.intel.com
Primarily used by selftests, but also by runtime debugging of engine
w/a, is a routine to create a temporarily bound buffer for readback.
Almagamate the duplicated routines into one.
Suggested-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20201219020343.22681-2-chris@chris-wilson.co.uk
Since SKL the eLLC has been sitting on the far side of the system
agent, meaning the display engine can utilize it. Let's enable that.
I chose WB for the caching mode, because my numbers are indicating
that WT might actually be WB and WC might actually be UC. I'm not
100% sure that is indeed the case but at least my simple rendercopy
based benchmark didn't see any difference in performance.
Also if I configure things to do LLCeLLC+WT I still get cache dirt
on my screen, suggesting that is in fact operating in WB mode
anyway. This is also the reason I had to fix the MOCS target cache
to really say PTE rather than LLC+eLLC.
Since SKL the eLLC has been sitting on the far side of the system agent,
meaning the display engine can utilize it. Let's enable that.
Eero's earlier benchmarks numbers:
"* Results in GfxBench and Unigine (Valley/Heaven) tests were within daily
variation on the tested SKL machines
* SKL GT4e (128MB eLLC) / Wayland / Weston:
+15-20% SynMark TexMem512 (512MB of textures)
+4-6% SynMark TerrainFly*, CSCloth, ShMapVsm
-5-10% SynMark TexMem128 (128MB of textures)
* SKL GT3e (64MB eLLC) / Xorg / Unity:
+4-8% GpuTest Triangle fullscreen (FullHD)
-5-10% GpuTest Triangle windowed (1/2 screen)
* SKL GT2 (no eLLC) / Xorg / Unity:
* Some of the higher FPS SynMark pixel and vertex shader tests
are few percent higher, more than daily variance
=> Do you see any reason why this machine would be impacted
although it doesn't eLLC?"
Caveats:
- Still haven't tested with a prime setup
- Still not entirely sure this a good idea, but I've been
using it on my cfl anyway :)
v2: Split the MOCS PTE change out
Cc: Eero Tamminen <eero.t.tamminen@intel.com>
Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Link: https://patchwork.freedesktop.org/patch/msgid/20201007120329.17076-3-ville.syrjala@linux.intel.com
Link: https://patchwork.freedesktop.org/patch/msgid/20201015122138.30161-3-chris@chris-wilson.co.uk
The GEM object is grossly overweight for the practicality of tracking
large numbers of individual pages, yet it is currently our only
abstraction for tracking DMA allocations. Since those allocations need
to be reserved upfront before an operation, and that we need to break
away from simple system memory, we need to ditch using plain struct page
wrappers.
In the process, we drop the WC mapping as we ended up clflushing
everything anyway due to various issues across a wider range of
platforms. Though in a future step, we need to drop the kmap_atomic
approach which suggests we need to pre-map all the pages and keep them
mapped.
v2: Verify our large scratch page is suitably DMA aligned; and manually
clear the scratch since we are allocating plain struct pages full of
prior content.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Matthew Auld <matthew.auld@intel.com>
Reviewed-by: Matthew Auld <matthew.auld@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20200729164219.5737-2-chris@chris-wilson.co.uk
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Signed-off-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Pull the final atomic_dec of vm->open (marking the vm as closed)
underneath the same vm->mutex as used to close it. This is required to
correctly serialise with attempting to reuse the vma as the vm is closed
by a second thread.
References: 00de702c6c ("drm/i915: Check that the vma hasn't been closed before we insert it")
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20200227085723.1961649-10-chris@chris-wilson.co.uk
On TGL, bits 2-4 in the GGTT PTE are not ignored anymore and are
instead used for some extra VT-d capabilities. We don't (yet?) have
support for those capabilities, but, given that we shared the pte_encode
function betweed GGTT and PPGTT, we still set those bits to the PPGTT
PPAT values. The DMA engine gets very confused when those bits are
set while the iommu is enabled, leading to errors. E.g. when loading
the GuC we get:
[ 9.796218] DMAR: DRHD: handling fault status reg 2
[ 9.796235] DMAR: [DMA Write] Request device [00:02.0] PASID ffffffff fault addr 0 [fault reason 02] Present bit in context entry is clear
[ 9.899215] [drm:intel_guc_fw_upload [i915]] *ERROR* GuC firmware signature verification failed
To fix this, just have dedicated gen8_pte_encode function per type of
gtt. Also, explicitly set vm->pte_encode for gen8_ppgtt, even if we
don't use it, to make sure we don't accidentally assign it to the GGTT
one, like we do for gen6_ppgtt, in case we need it in the future.
Reported-by: "Sodhi, Vunny" <vunny.sodhi@intel.com>
Signed-off-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Cc: Matthew Auld <matthew.auld@intel.com>
Cc: Michal Wajdeczko <michal.wajdeczko@intel.com>
Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Link: https://patchwork.freedesktop.org/patch/msgid/20200226185657.26445-1-daniele.ceraolospurio@intel.com
Using a clear page for scratch means that we have relatively benign
errors in case it is accidentally used, but that can be rather too
benign for debugging. If we poison the scratch, ideally it quickly
results in an obvious error.
v2: Set each page individually just in case we are using highmem for our
scratch page.
v3: Pick a new scratch register as MI_STORE_REGISTER_MEM does not work
with GPR0 on gen7, unbelievably.
v4: Haswell still considers 3DPRIM a privileged register!
Suggested-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Cc: Matthew Auld <matthew.william.auld@gmail.com>
Reviewed-by: Matthew Auld <matthew.william.auld@gmail.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20200124115133.53360-1-chris@chris-wilson.co.uk
Attempt to split i915_gem_gtt.[ch] into more manageable chunks.
Suggested-by: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Matthew Auld <matthew.auld@intel.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Link: https://patchwork.freedesktop.org/patch/msgid/20200107134009.3255354-1-chris@chris-wilson.co.uk