1
0
Fork 0
mirror of synced 2025-03-06 20:59:54 +01:00
Commit graph

84 commits

Author SHA1 Message Date
Sean Christopherson
f3310e622f KVM: x86/mmu: Initialize kvm_page_fault's pfn and hva to error values
Explicitly set "pfn" and "hva" to error values in kvm_mmu_do_page_fault()
to harden KVM against using "uninitialized" values.  In quotes because the
fields are actually zero-initialized, and zero is a legal value for both
page frame numbers and virtual addresses.  E.g. failure to set "pfn" prior
to creating an SPTE could result in KVM pointing at physical address '0',
which is far less desirable than KVM generating a SPTE with reserved PA
bits set and thus effectively killing the VM.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Message-ID: <20240228024147.41573-16-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-05-07 11:59:23 -04:00
Sean Christopherson
b3d5dc629c KVM: x86/mmu: Use synthetic page fault error code to indicate private faults
Add and use a synthetic, KVM-defined page fault error code to indicate
whether a fault is to private vs. shared memory.  TDX and SNP have
different mechanisms for reporting private vs. shared, and KVM's
software-protected VMs have no mechanism at all.  Usurp an error code
flag to avoid having to plumb another parameter to kvm_mmu_page_fault()
and friends.

Alternatively, KVM could borrow AMD's PFERR_GUEST_ENC_MASK, i.e. set it
for TDX and software-protected VMs as appropriate, but that would require
*clearing* the flag for SEV and SEV-ES VMs, which support encrypted
memory at the hardware layer, but don't utilize private memory at the
KVM layer.

Opportunistically add a comment to call out that the logic for software-
protected VMs is (and was before this commit) broken for nested MMUs, i.e.
for nested TDP, as the GPA is an L2 GPA.  Punt on trying to play nice with
nested MMUs as there is a _lot_ of functionality that simply doesn't work
for software-protected VMs, e.g. all of the paths where KVM accesses guest
memory need to be updated to be aware of private vs. shared memory.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20240228024147.41573-6-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-05-07 11:59:19 -04:00
Isaku Yamahata
c9710130cc KVM: x86/mmu: Pass full 64-bit error code when handling page faults
Plumb the full 64-bit error code throughout the page fault handling code
so that KVM can use the upper 32 bits, e.g. SNP's PFERR_GUEST_ENC_MASK
will be used to determine whether or not a fault is private vs. shared.

Note, passing the 64-bit error code to FNAME(walk_addr)() does NOT change
the behavior of permission_fault() when invoked in the page fault path, as
KVM explicitly clears PFERR_IMPLICIT_ACCESS in kvm_mmu_page_fault().

Continue passing '0' from the async #PF worker, as guest_memfd and thus
private memory doesn't support async page faults.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
[mdr: drop references/changes on rebase, update commit message]
Signed-off-by: Michael Roth <michael.roth@amd.com>
[sean: drop truncation in call to FNAME(walk_addr)(), rewrite changelog]
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Message-ID: <20240228024147.41573-5-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-05-07 11:59:18 -04:00
Sean Christopherson
d0bf8e6e44 KVM: x86/mmu: Exit to userspace with -EFAULT if private fault hits emulation
Exit to userspace with -EFAULT / KVM_EXIT_MEMORY_FAULT if a private fault
triggers emulation of any kind, as KVM doesn't currently support emulating
access to guest private memory.  Practically speaking, private faults and
emulation are already mutually exclusive, but there are many flow that
can result in KVM returning RET_PF_EMULATE, and adding one last check
to harden against weird, unexpected combinations and/or KVM bugs is
inexpensive.

Suggested-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20240228024147.41573-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-05-07 11:59:16 -04:00
Ingo Molnar
4589f199eb Merge branch 'x86/bugs' into x86/core, to pick up pending changes before dependent patches
Merge in pending alternatives patching infrastructure changes, before
applying more patches.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2024-02-14 10:49:37 +01:00
Breno Leitao
aefb2f2e61 x86/bugs: Rename CONFIG_RETPOLINE => CONFIG_MITIGATION_RETPOLINE
Step 5/10 of the namespace unification of CPU mitigations related Kconfig options.

[ mingo: Converted a few more uses in comments/messages as well. ]

Suggested-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Ariel Miculas <amiculas@cisco.com>
Acked-by: Josh Poimboeuf <jpoimboe@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20231121160740.1249350-6-leitao@debian.org
2024-01-10 10:52:28 +01:00
Binbin Wu
a130066f74 KVM: x86/mmu: Drop non-PA bits when getting GFN for guest's PGD
Drop non-PA bits when getting GFN for guest's PGD with the maximum theoretical
mask for guest MAXPHYADDR.

Do it unconditionally because it's harmless for 32-bit guests, querying 64-bit
mode would be more expensive, and for EPT the mask isn't tied to guest mode.
Using PT_BASE_ADDR_MASK would be technically wrong (PAE paging has 64-bit
elements _except_ for CR3, which has only 32 valid bits), it wouldn't matter
in practice though.

Opportunistically use GENMASK_ULL() to define __PT_BASE_ADDR_MASK.

Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Tested-by: Xuelian Guo <xuelian.guo@intel.com>
Link: https://lore.kernel.org/r/20230913124227.12574-6-binbin.wu@linux.intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-11-28 17:54:04 -08:00
Sean Christopherson
89ea60c2c7 KVM: x86: Add support for "protected VMs" that can utilize private memory
Add a new x86 VM type, KVM_X86_SW_PROTECTED_VM, to serve as a development
and testing vehicle for Confidential (CoCo) VMs, and potentially to even
become a "real" product in the distant future, e.g. a la pKVM.

The private memory support in KVM x86 is aimed at AMD's SEV-SNP and
Intel's TDX, but those technologies are extremely complex (understatement),
difficult to debug, don't support running as nested guests, and require
hardware that's isn't universally accessible.  I.e. relying SEV-SNP or TDX
for maintaining guest private memory isn't a realistic option.

At the very least, KVM_X86_SW_PROTECTED_VM will enable a variety of
selftests for guest_memfd and private memory support without requiring
unique hardware.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20231027182217.3615211-24-seanjc@google.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Tested-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-11-14 08:01:05 -05:00
Chao Peng
8dd2eee9d5 KVM: x86/mmu: Handle page fault for private memory
Add support for resolving page faults on guest private memory for VMs
that differentiate between "shared" and "private" memory.  For such VMs,
KVM_MEM_GUEST_MEMFD memslots can include both fd-based private memory and
hva-based shared memory, and KVM needs to map in the "correct" variant,
i.e. KVM needs to map the gfn shared/private as appropriate based on the
current state of the gfn's KVM_MEMORY_ATTRIBUTE_PRIVATE flag.

For AMD's SEV-SNP and Intel's TDX, the guest effectively gets to request
shared vs. private via a bit in the guest page tables, i.e. what the guest
wants may conflict with the current memory attributes.  To support such
"implicit" conversion requests, exit to user with KVM_EXIT_MEMORY_FAULT
to forward the request to userspace.  Add a new flag for memory faults,
KVM_MEMORY_EXIT_FLAG_PRIVATE, to communicate whether the guest wants to
map memory as shared vs. private.

Like KVM_MEMORY_ATTRIBUTE_PRIVATE, use bit 3 for flagging private memory
so that KVM can use bits 0-2 for capturing RWX behavior if/when userspace
needs such information, e.g. a likely user of KVM_EXIT_MEMORY_FAULT is to
exit on missing mappings when handling guest page fault VM-Exits.  In
that case, userspace will want to know RWX information in order to
correctly/precisely resolve the fault.

Note, private memory *must* be backed by guest_memfd, i.e. shared mappings
always come from the host userspace page tables, and private mappings
always come from a guest_memfd instance.

Co-developed-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Tested-by: Fuad Tabba <tabba@google.com>
Message-Id: <20231027182217.3615211-21-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-11-14 08:01:04 -05:00
Sean Christopherson
0df9dab891 KVM: x86/mmu: Stop zapping invalidated TDP MMU roots asynchronously
Stop zapping invalidate TDP MMU roots via work queue now that KVM
preserves TDP MMU roots until they are explicitly invalidated.  Zapping
roots asynchronously was effectively a workaround to avoid stalling a vCPU
for an extended during if a vCPU unloaded a root, which at the time
happened whenever the guest toggled CR0.WP (a frequent operation for some
guest kernels).

While a clever hack, zapping roots via an unbound worker had subtle,
unintended consequences on host scheduling, especially when zapping
multiple roots, e.g. as part of a memslot.  Because the work of zapping a
root is no longer bound to the task that initiated the zap, things like
the CPU affinity and priority of the original task get lost.  Losing the
affinity and priority can be especially problematic if unbound workqueues
aren't affined to a small number of CPUs, as zapping multiple roots can
cause KVM to heavily utilize the majority of CPUs in the system, *beyond*
the CPUs KVM is already using to run vCPUs.

When deleting a memslot via KVM_SET_USER_MEMORY_REGION, the async root
zap can result in KVM occupying all logical CPUs for ~8ms, and result in
high priority tasks not being scheduled in in a timely manner.  In v5.15,
which doesn't preserve unloaded roots, the issues were even more noticeable
as KVM would zap roots more frequently and could occupy all CPUs for 50ms+.

Consuming all CPUs for an extended duration can lead to significant jitter
throughout the system, e.g. on ChromeOS with virtio-gpu, deleting memslots
is a semi-frequent operation as memslots are deleted and recreated with
different host virtual addresses to react to host GPU drivers allocating
and freeing GPU blobs.  On ChromeOS, the jitter manifests as audio blips
during games due to the audio server's tasks not getting scheduled in
promptly, despite the tasks having a high realtime priority.

Deleting memslots isn't exactly a fast path and should be avoided when
possible, and ChromeOS is working towards utilizing MAP_FIXED to avoid the
memslot shenanigans, but KVM is squarely in the wrong.  Not to mention
that removing the async zapping eliminates a non-trivial amount of
complexity.

Note, one of the subtle behaviors hidden behind the async zapping is that
KVM would zap invalidated roots only once (ignoring partial zaps from
things like mmu_notifier events).  Preserve this behavior by adding a flag
to identify roots that are scheduled to be zapped versus roots that have
already been zapped but not yet freed.

Add a comment calling out why kvm_tdp_mmu_invalidate_all_roots() can
encounter invalid roots, as it's not at all obvious why zapping
invalidated roots shouldn't simply zap all invalid roots.

Reported-by: Pattara Teerapong <pteerapong@google.com>
Cc: David Stevens <stevensd@google.com>
Cc: Yiwei Zhang<zzyiwei@google.com>
Cc: Paul Hsia <paulhsia@google.com>
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20230916003916.2545000-4-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-09-23 05:35:48 -04:00
Sean Christopherson
0e3223d8d0 KVM: x86/mmu: Use dummy root, backed by zero page, for !visible guest roots
When attempting to allocate a shadow root for a !visible guest root gfn,
e.g. that resides in MMIO space, load a dummy root that is backed by the
zero page instead of immediately synthesizing a triple fault shutdown
(using the zero page ensures any attempt to translate memory will generate
a !PRESENT fault and thus VM-Exit).

Unless the vCPU is racing with memslot activity, KVM will inject a page
fault due to not finding a visible slot in FNAME(walk_addr_generic), i.e.
the end result is mostly same, but critically KVM will inject a fault only
*after* KVM runs the vCPU with the bogus root.

Waiting to inject a fault until after running the vCPU fixes a bug where
KVM would bail from nested VM-Enter if L1 tried to run L2 with TDP enabled
and a !visible root.  Even though a bad root will *probably* lead to
shutdown, (a) it's not guaranteed and (b) the CPU won't read the
underlying memory until after VM-Enter succeeds.  E.g. if L1 runs L2 with
a VMX preemption timer value of '0', then architecturally the preemption
timer VM-Exit is guaranteed to occur before the CPU executes any
instruction, i.e. before the CPU needs to translate a GPA to a HPA (so
long as there are no injected events with higher priority than the
preemption timer).

If KVM manages to get to FNAME(fetch) with a dummy root, e.g. because
userspace created a memslot between installing the dummy root and handling
the page fault, simply unload the MMU to allocate a new root and retry the
instruction.  Use KVM_REQ_MMU_FREE_OBSOLETE_ROOTS to drop the root, as
invoking kvm_mmu_free_roots() while holding mmu_lock would deadlock, and
conceptually the dummy root has indeeed become obsolete.  The only
difference versus existing usage of KVM_REQ_MMU_FREE_OBSOLETE_ROOTS is
that the root has become obsolete due to memslot *creation*, not memslot
deletion or movement.

Reported-by: Reima Ishii <ishiir@g.ecc.u-tokyo.ac.jp>
Cc: Yu Zhang <yu.c.zhang@linux.intel.com>
Link: https://lore.kernel.org/r/20230729005200.1057358-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-08-31 14:08:24 -04:00
Sean Christopherson
3328dfe0ea KVM: x86/mmu: Use BUILD_BUG_ON_INVALID() for KVM_MMU_WARN_ON() stub
Use BUILD_BUG_ON_INVALID() instead of an empty do-while loop to stub out
KVM_MMU_WARN_ON() when CONFIG_KVM_PROVE_MMU=n, that way _some_ build
issues with the usage of KVM_MMU_WARN_ON() will be dected even if the
kernel is using the stubs, e.g. basic syntax errors will be detected.

Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Link: https://lore.kernel.org/r/20230729004722.1056172-11-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-08-31 13:48:48 -04:00
Sean Christopherson
870d4d4ed8 KVM: x86/mmu: Replace MMU_DEBUG with proper KVM_PROVE_MMU Kconfig
Replace MMU_DEBUG, which requires manually modifying KVM to enable the
macro, with a proper Kconfig, KVM_PROVE_MMU.  Now that pgprintk() and
rmap_printk() are gone, i.e. the macro guards only KVM_MMU_WARN_ON() and
won't flood the kernel logs, enabling the option for debug kernels is both
desirable and feasible.

Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Link: https://lore.kernel.org/r/20230729004722.1056172-10-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-08-31 13:48:47 -04:00
Sean Christopherson
20ba462dfd KVM: x86/mmu: Convert "runtime" WARN_ON() assertions to WARN_ON_ONCE()
Convert all "runtime" assertions, i.e. assertions that can be triggered
while running vCPUs, from WARN_ON() to WARN_ON_ONCE().  Every WARN in the
MMU that is tied to running vCPUs, i.e. not contained to loading and
initializing KVM, is likely to fire _a lot_ when it does trigger.  E.g. if
KVM ends up with a bug that causes a root to be invalidated before the
page fault handler is invoked, pretty much _every_ page fault VM-Exit
triggers the WARN.

If a WARN is triggered frequently, the resulting spam usually causes a lot
of damage of its own, e.g. consumes resources to log the WARN and pollutes
the kernel log, often to the point where other useful information can be
lost.  In many case, the damage caused by the spam is actually worse than
the bug itself, e.g. KVM can almost always recover from an unexpectedly
invalid root.

On the flip side, warning every time is rarely helpful for debug and
triage, i.e. a single splat is usually sufficient to point a debugger in
the right direction, and automated testing, e.g. syzkaller, typically runs
with warn_on_panic=1, i.e. will never get past the first WARN anyways.

Lastly, when an assertions fails multiple times, the stack traces in KVM
are almost always identical, i.e. the full splat only needs to be captured
once.  And _if_ there is value in captruing information about the failed
assert, a ratelimited printk() is sufficient and less likely to rack up a
large amount of collateral damage.

Link: https://lore.kernel.org/r/20230729004722.1056172-8-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-08-31 13:48:44 -04:00
Sean Christopherson
0fe6370eb3 KVM: x86/mmu: Rename MMU_WARN_ON() to KVM_MMU_WARN_ON()
Rename MMU_WARN_ON() to make it super obvious that the assertions are
all about KVM's MMU, not the primary MMU.

Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Link: https://lore.kernel.org/r/20230729004722.1056172-7-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-08-31 13:48:43 -04:00
Sean Christopherson
c4f92cfe02 KVM: x86/mmu: Delete the "dbg" module param
Delete KVM's "dbg" module param now that its usage in KVM is gone (it
used to guard pgprintk() and rmap_printk()).

Link: https://lore.kernel.org/r/20230729004722.1056172-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-08-31 13:48:41 -04:00
Sean Christopherson
350c49fdea KVM: x86/mmu: Delete rmap_printk() and all its usage
Delete rmap_printk() so that MMU_WARN_ON() and MMU_DEBUG can be morphed
into something that can be regularly enabled for debug kernels.  The
information provided by rmap_printk() isn't all that useful now that the
rmap and unsync code is mature, as the prints are simultaneously too
verbose (_lots_ of message) and yet not verbose enough to be helpful for
debug (most instances print just the SPTE pointer/value, which is rarely
sufficient to root cause anything but trivial bugs).

Alternatively, rmap_printk() could be reworked to into tracepoints, but
it's not clear there is a real need as rmap bugs rarely escape initial
development, and when bugs do escape to production, they are often edge
cases and/or reside in code that isn't directly related to the rmaps.
In other words, the problems with rmap_printk() being unhelpful also apply
to tracepoints.  And deleting rmap_printk() doesn't preclude adding
tracepoints in the future.

Link: https://lore.kernel.org/r/20230729004722.1056172-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-08-31 13:48:40 -04:00
Sean Christopherson
a98b889492 KVM: x86/mmu: Delete pgprintk() and all its usage
Delete KVM's pgprintk() and all its usage, as the code is very prone
to bitrot due to being buried behind MMU_DEBUG, and the functionality has
been rendered almost entirely obsolete by the tracepoints KVM has gained
over the years.  And for the situations where the information provided by
KVM's tracepoints is insufficient, pgprintk() rarely fills in the gaps,
and is almost always far too noisy, i.e. developers end up implementing
custom prints anyways.

Link: https://lore.kernel.org/r/20230729004722.1056172-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-08-31 13:48:39 -04:00
David Matlack
d478899605 KVM: Allow range-based TLB invalidation from common code
Make kvm_flush_remote_tlbs_range() visible in common code and create a
default implementation that just invalidates the whole TLB.

This paves the way for several future features/cleanups:

 - Introduction of range-based TLBI on ARM.
 - Eliminating kvm_arch_flush_remote_tlbs_memslot()
 - Moving the KVM/x86 TDP MMU to common code.

No functional change intended.

Signed-off-by: David Matlack <dmatlack@google.com>
Signed-off-by: Raghavendra Rao Ananta <rananta@google.com>
Reviewed-by: Gavin Shan <gshan@redhat.com>
Reviewed-by: Shaoqin Huang <shahuang@redhat.com>
Reviewed-by: Anup Patel <anup@brainfault.org>
Acked-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230811045127.3308641-6-rananta@google.com
2023-08-17 09:40:35 +01:00
David Matlack
9d4655da1a KVM: x86/mmu: Use gfn_t in kvm_flush_remote_tlbs_range()
Use gfn_t instead of u64 for kvm_flush_remote_tlbs_range()'s parameters,
since gfn_t is the standard type for GFNs throughout KVM.

Opportunistically rename pages to nr_pages to make its role even more
obvious.

No functional change intended.

Signed-off-by: David Matlack <dmatlack@google.com>
Link: https://lore.kernel.org/r/20230126184025.2294823-6-dmatlack@google.com
[sean: convert pages to gfn_t too, and rename]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-03-17 15:36:20 -07:00
David Matlack
8c63e8c217 KVM: x86/mmu: Rename kvm_flush_remote_tlbs_with_address()
Rename kvm_flush_remote_tlbs_with_address() to
kvm_flush_remote_tlbs_range(). This name is shorter, which reduces the
number of callsites that need to be broken up across multiple lines, and
more readable since it conveys a range of memory is being flushed rather
than a single address.

No functional change intended.

Signed-off-by: David Matlack <dmatlack@google.com>
Link: https://lore.kernel.org/r/20230126184025.2294823-5-dmatlack@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-03-17 15:16:12 -07:00
Sean Christopherson
258d985f6e KVM: x86/mmu: Use EMULTYPE flag to track write #PFs to shadow pages
Use a new EMULTYPE flag, EMULTYPE_WRITE_PF_TO_SP, to track page faults
on self-changing writes to shadowed page tables instead of propagating
that information to the emulator via a semi-persistent vCPU flag.  Using
a flag in "struct kvm_vcpu_arch" is confusing, especially as implemented,
as it's not at all obvious that clearing the flag only when emulation
actually occurs is correct.

E.g. if KVM sets the flag and then retries the fault without ever getting
to the emulator, the flag will be left set for future calls into the
emulator.  But because the flag is consumed if and only if both
EMULTYPE_PF and EMULTYPE_ALLOW_RETRY_PF are set, and because
EMULTYPE_ALLOW_RETRY_PF is deliberately not set for direct MMUs, emulated
MMIO, or while L2 is active, KVM avoids false positives on a stale flag
since FNAME(page_fault) is guaranteed to be run and refresh the flag
before it's ultimately consumed by the tail end of reexecute_instruction().

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20230202182817.407394-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-03-14 10:28:56 -04:00
Hou Wenlong
9ffe926537 KVM: x86/mmu: Fix wrong gfn range of tlb flushing in kvm_set_pte_rmapp()
When the spte of hupe page is dropped in kvm_set_pte_rmapp(), the whole
gfn range covered by the spte should be flushed. However,
rmap_walk_init_level() doesn't align down the gfn for new level like tdp
iterator does, then the gfn used in kvm_set_pte_rmapp() is not the base
gfn of huge page. And the size of gfn range is wrong too for huge page.
Use the base gfn of huge page and the size of huge page for flushing
tlbs for huge page. Also introduce a helper function to flush the given
page (huge or not) of guest memory, which would help prevent future
buggy use of kvm_flush_remote_tlbs_with_address() in such case.

Fixes: c3134ce240 ("KVM: Replace old tlb flush function with new one to flush a specified range.")
Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com>
Link: https://lore.kernel.org/r/0ce24d7078fa5f1f8d64b0c59826c50f32f8065e.1665214747.git.houwenlong.hwl@antgroup.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-01-24 10:05:46 -08:00
Hou Wenlong
c667a3baed KVM: x86/mmu: Move round_gfn_for_level() helper into mmu_internal.h
Rounding down the GFN to a huge page size is a common pattern throughout
KVM, so move round_gfn_for_level() helper in tdp_iter.c to
mmu_internal.h for common usage. Also rename it as gfn_round_for_level()
to use gfn_* prefix and clean up the other call sites.

Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com>
Link: https://lore.kernel.org/r/415c64782f27444898db650e21cf28eeb6441dfa.1665214747.git.houwenlong.hwl@antgroup.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-01-24 10:05:45 -08:00
David Matlack
6c882ef4fc KVM: x86/mmu: Rename __direct_map() to direct_map()
Rename __direct_map() to direct_map() since the leading underscores are
unnecessary. This also makes the page fault handler names more
consistent: kvm_tdp_mmu_page_fault() calls kvm_tdp_mmu_map() and
direct_page_fault() calls direct_map().

Opportunistically make some trivial cleanups to comments that had to be
modified anyway since they mentioned __direct_map(). Specifically, use
"()" when referring to functions, and include kvm_tdp_mmu_map() among
the various callers of disallowed_hugepage_adjust().

No functional change intended.

Signed-off-by: David Matlack <dmatlack@google.com>
Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20220921173546.2674386-11-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-12-29 15:33:24 -05:00
David Matlack
e5e6f8d254 KVM: x86/mmu: Initialize fault.{gfn,slot} earlier for direct MMUs
Move the initialization of fault.{gfn,slot} earlier in the page fault
handling code for fully direct MMUs. This will enable a future commit to
split out TDP MMU page fault handling without needing to duplicate the
initialization of these 2 fields.

Opportunistically take advantage of the fact that fault.gfn is
initialized in kvm_tdp_page_fault() rather than recomputing it from
fault->addr.

No functional change intended.

Signed-off-by: David Matlack <dmatlack@google.com>
Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20220921173546.2674386-8-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-12-29 15:33:21 -05:00
David Matlack
ba6e3fe255 KVM: x86/mmu: Grab mmu_invalidate_seq in kvm_faultin_pfn()
Grab mmu_invalidate_seq in kvm_faultin_pfn() and stash it in struct
kvm_page_fault. The eliminates duplicate code and reduces the amount of
parameters needed for is_page_fault_stale().

Preemptively split out __kvm_faultin_pfn() to a separate function for
use in subsequent commits.

No functional change intended.

Signed-off-by: David Matlack <dmatlack@google.com>
Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20220921173546.2674386-4-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-12-29 15:33:18 -05:00
Sean Christopherson
5e3edd7e8b KVM: x86/mmu: Add helper to convert SPTE value to its shadow page
Add a helper to convert a SPTE to its shadow page to deduplicate a
variety of flows and hopefully avoid future bugs, e.g. if KVM attempts to
get the shadow page for a SPTE without dropping high bits.

Opportunistically add a comment in mmu_free_root_page() documenting why
it treats the root HPA as a SPTE.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20221019165618.927057-7-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-11-09 12:31:33 -05:00
Sean Christopherson
61f9447854 KVM: x86/mmu: Set disallowed_nx_huge_page in TDP MMU before setting SPTE
Set nx_huge_page_disallowed in TDP MMU shadow pages before making the SP
visible to other readers, i.e. before setting its SPTE.  This will allow
KVM to query the flag when determining if a shadow page can be replaced
by a NX huge page without violating the rules of the mitigation.

Note, the shadow/legacy MMU holds mmu_lock for write, so it's impossible
for another CPU to see a shadow page without an up-to-date
nx_huge_page_disallowed, i.e. only the TDP MMU needs the complicated
dance.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: David Matlack <dmatlack@google.com>
Reviewed-by: Yan Zhao <yan.y.zhao@intel.com>
Message-Id: <20221019165618.927057-5-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-11-09 12:31:32 -05:00
Sean Christopherson
55c510e26a KVM: x86/mmu: Rename NX huge pages fields/functions for consistency
Rename most of the variables/functions involved in the NX huge page
mitigation to provide consistency, e.g. lpage vs huge page, and NX huge
vs huge NX, and also to provide clarity, e.g. to make it obvious the flag
applies only to the NX huge page mitigation, not to any condition that
prevents creating a huge page.

Add a comment explaining what the newly named "possible_nx_huge_pages"
tracks.

Leave the nx_lpage_splits stat alone as the name is ABI and thus set in
stone.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Mingwei Zhang <mizhang@google.com>
Message-Id: <20221019165618.927057-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-11-09 12:31:31 -05:00
Sean Christopherson
428e921611 KVM: x86/mmu: Tag disallowed NX huge pages even if they're not tracked
Tag shadow pages that cannot be replaced with an NX huge page regardless
of whether or not zapping the page would allow KVM to immediately create
a huge page, e.g. because something else prevents creating a huge page.

I.e. track pages that are disallowed from being NX huge pages regardless
of whether or not the page could have been huge at the time of fault.
KVM currently tracks pages that were disallowed from being huge due to
the NX workaround if and only if the page could otherwise be huge.  But
that fails to handled the scenario where whatever restriction prevented
KVM from installing a huge page goes away, e.g. if dirty logging is
disabled, the host mapping level changes, etc...

Failure to tag shadow pages appropriately could theoretically lead to
false negatives, e.g. if a fetch fault requests a small page and thus
isn't tracked, and a read/write fault later requests a huge page, KVM
will not reject the huge page as it should.

To avoid yet another flag, initialize the list_head and use list_empty()
to determine whether or not a page is on the list of NX huge pages that
should be recovered.

Note, the TDP MMU accounting is still flawed as fixing the TDP MMU is
more involved due to mmu_lock being held for read.  This will be
addressed in a future commit.

Fixes: 5bcaf3e171 ("KVM: x86/mmu: Account NX huge page disallowed iff huge page was requested")
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20221019165618.927057-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-11-09 12:31:31 -05:00
Sean Christopherson
a8ac499bb6 KVM: x86/mmu: Don't require refcounted "struct page" to create huge SPTEs
Drop the requirement that a pfn be backed by a refcounted, compound or
or ZONE_DEVICE, struct page, and instead rely solely on the host page
tables to identify huge pages.  The PageCompound() check is a remnant of
an old implementation that identified (well, attempt to identify) huge
pages without walking the host page tables.  The ZONE_DEVICE check was
added as an exception to the PageCompound() requirement.  In other words,
neither check is actually a hard requirement, if the primary has a pfn
backed with a huge page, then KVM can back the pfn with a huge page
regardless of the backing store.

Dropping the @pfn parameter will also allow KVM to query the max host
mapping level without having to first get the pfn, which is advantageous
for use outside of the page fault path where KVM wants to take action if
and only if a page can be mapped huge, i.e. avoids the pfn lookup for
gfns that can't be backed with a huge page.

Cc: Mingwei Zhang <mizhang@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Mingwei Zhang <mizhang@google.com>
Message-Id: <20220715232107.3775620-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-07-28 13:22:22 -04:00
David Matlack
6a97575d5c KVM: x86/mmu: Cache the access bits of shadowed translations
Splitting huge pages requires allocating/finding shadow pages to replace
the huge page. Shadow pages are keyed, in part, off the guest access
permissions they are shadowing. For fully direct MMUs, there is no
shadowing so the access bits in the shadow page role are always ACC_ALL.
But during shadow paging, the guest can enforce whatever access
permissions it wants.

In particular, eager page splitting needs to know the permissions to use
for the subpages, but KVM cannot retrieve them from the guest page
tables because eager page splitting does not have a vCPU.  Fortunately,
the guest access permissions are easy to cache whenever page faults or
FNAME(sync_page) update the shadow page tables; this is an extension of
the existing cache of the shadowed GFNs in the gfns array of the shadow
page.  The access bits only take up 3 bits, which leaves 61 bits left
over for gfns, which is more than enough.

Now that the gfns array caches more information than just GFNs, rename
it to shadowed_translation.

While here, preemptively fix up the WARN_ON() that detects gfn
mismatches in direct SPs. The WARN_ON() was paired with a
pr_err_ratelimited(), which means that users could sometimes see the
WARN without the accompanying error message. Fix this by outputting the
error message as part of the WARN splat, and opportunistically make
them WARN_ONCE() because if these ever fire, they are all but guaranteed
to fire a lot and will bring down the kernel.

Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20220516232138.1783324-18-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-24 04:51:58 -04:00
Ben Gardon
084cc29f8b KVM: x86/MMU: Allow NX huge pages to be disabled on a per-vm basis
In some cases, the NX hugepage mitigation for iTLB multihit is not
needed for all guests on a host. Allow disabling the mitigation on a
per-VM basis to avoid the performance hit of NX hugepages on trusted
workloads.

In order to disable NX hugepages on a VM, ensure that the userspace
actor has permission to reboot the system. Since disabling NX hugepages
would allow a guest to crash the system, it is similar to reboot
permissions.

Ideally, KVM would require userspace to prove it has access to KVM's
nx_huge_pages module param, e.g. so that userspace can opt out without
needing full reboot permissions.  But getting access to the module param
file info is difficult because it is buried in layers of sysfs and module
glue. Requiring CAP_SYS_BOOT is sufficient for all known use cases.

Suggested-by: Jim Mattson <jmattson@google.com>
Reviewed-by: David Matlack <dmatlack@google.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Ben Gardon <bgardon@google.com>
Message-Id: <20220613212523.3436117-9-bgardon@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-24 04:51:49 -04:00
Sean Christopherson
42c88ff893 KVM: x86/mmu: Dedup macros for computing various page table masks
Provide common helper macros to generate various masks, shifts, etc...
for 32-bit vs. 64-bit page tables.  Only the inputs differ, the actual
calculations are identical.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220614233328.3896033-5-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-20 06:21:27 -04:00
Sean Christopherson
1075d41efd KVM: x86/mmu: Expand and clean up page fault stats
Expand and clean up the page fault stats.  The current stats are at best
incomplete, and at worst misleading.  Differentiate between faults that
are actually fixed vs those that result in an MMIO SPTE being created,
track faults that are spurious, faults that trigger emulation, faults
that that are fixed in the fast path, and last but not least, track the
number of faults that are taken.

Note, the number of faults that require emulation for write-protected
shadow pages can roughly be calculated by subtracting the number of MMIO
SPTEs created from the overall number of faults that trigger emulation.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220423034752.1161007-10-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-05-12 09:51:43 -04:00
Sean Christopherson
8d5265b101 KVM: x86/mmu: Use IS_ENABLED() to avoid RETPOLINE for TDP page faults
Use IS_ENABLED() instead of an #ifdef to activate the anti-RETPOLINE fast
path for TDP page faults.  The generated code is identical, and the #ifdef
makes it dangerously difficult to extend the logic (guess who forgot to
add an "else" inside the #ifdef and ran through the page fault handler
twice).

No functional or binary change intented.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220423034752.1161007-9-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-05-12 09:51:43 -04:00
Sean Christopherson
8a009d5bca KVM: x86/mmu: Make all page fault handlers internal to the MMU
Move kvm_arch_async_page_ready() to mmu.c where it belongs, and move all
of the page fault handling collateral that was in mmu.h purely for the
async #PF handler into mmu_internal.h, where it belongs.  This will allow
kvm_mmu_do_page_fault() to act on the RET_PF_* return without having to
expose those enums outside of the MMU.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220423034752.1161007-8-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-05-12 09:51:42 -04:00
Sean Christopherson
5276c616ab KVM: x86/mmu: Add RET_PF_CONTINUE to eliminate bool+int* "returns"
Add RET_PF_CONTINUE and use it in handle_abnormal_pfn() and
kvm_faultin_pfn() to signal that the page fault handler should continue
doing its thing.  Aside from being gross and inefficient, using a boolean
return to signal continue vs. stop makes it extremely difficult to add
more helpers and/or move existing code to a helper.

E.g. hypothetically, if nested MMUs were to gain a separate page fault
handler in the future, everything up to the "is self-modifying PTE" check
can be shared by all shadow MMUs, but communicating up the stack whether
to continue on or stop becomes a nightmare.

More concretely, proposed support for private guest memory ran into a
similar issue, where it'll be forced to forego a helper in order to yield
sane code: https://lore.kernel.org/all/YkJbxiL%2FAz7olWlq@google.com.

No functional change intended.

Cc: David Matlack <dmatlack@google.com>
Cc: Chao Peng <chao.p.peng@linux.intel.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220423034752.1161007-7-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-05-12 09:51:42 -04:00
Paolo Bonzini
22b94c4b63 KVM: x86/mmu: Zap invalidated roots via asynchronous worker
Use the system worker threads to zap the roots invalidated
by the TDP MMU's "fast zap" mechanism, implemented by
kvm_tdp_mmu_invalidate_all_roots().

At this point, apart from allowing some parallelism in the zapping of
roots, the workqueue is a glorified linked list: work items are added and
flushed entirely within a single kvm->slots_lock critical section.  However,
the workqueue fixes a latent issue where kvm_mmu_zap_all_invalidated_roots()
assumes that it owns a reference to all invalid roots; therefore, no
one can set the invalid bit outside kvm_mmu_zap_all_fast().  Putting the
invalidated roots on a linked list... erm, on a workqueue ensures that
tdp_mmu_zap_root_work() only puts back those extra references that
kvm_mmu_zap_all_invalidated_roots() had gifted to it.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-03-08 10:55:27 -05:00
Sean Christopherson
c10743a182 KVM: x86/mmu: Zap only the target TDP MMU shadow page in NX recovery
When recovering a potential hugepage that was shattered for the iTLB
multihit workaround, precisely zap only the target page instead of
iterating over the TDP MMU to find the SP that was passed in.  This will
allow future simplification of zap_gfn_range() by having it zap only
leaf SPTEs.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220226001546.360188-14-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-03-08 09:31:54 -05:00
Ben Gardon
8283e36abf KVM: x86/mmu: Propagate memslot const qualifier
In preparation for implementing in-place hugepage promotion, various
functions will need to be called from zap_collapsible_spte_range, which
has the const qualifier on its memslot argument. Propagate the const
qualifier to the various functions which will be needed. This just serves
to simplify the following patch.

No functional change intended.

Signed-off-by: Ben Gardon <bgardon@google.com>
Message-Id: <20211115234603.2908381-11-bgardon@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-12-08 04:24:43 -05:00
Ben Gardon
4d78d0b39a KVM: x86/mmu: Remove need for a vcpu from mmu_try_to_unsync_pages
The vCPU argument to mmu_try_to_unsync_pages is now only used to get a
pointer to the associated struct kvm, so pass in the kvm pointer from
the beginning to remove the need for a vCPU when calling the function.

Signed-off-by: Ben Gardon <bgardon@google.com>
Message-Id: <20211115234603.2908381-7-bgardon@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-12-08 04:24:42 -05:00
Sean Christopherson
ce92ef7604 KVM: x86/mmu: Use shadow page role to detect PML-unfriendly pages for L2
Rework make_spte() to query the shadow page's role, specifically whether
or not it's a guest_mode page, a.k.a. a page for L2, when determining if
the SPTE is compatible with PML.  This eliminates a dependency on @vcpu,
with a future goal of being able to create SPTEs without a specific vCPU.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-12-08 04:24:41 -05:00
Paolo Bonzini
2839180ce5 KVM: x86/mmu: clean up prefetch/prefault/speculative naming
"prefetch", "prefault" and "speculative" are used throughout KVM to mean
the same thing.  Use a single name, standardizing on "prefetch" which
is already used by various functions such as direct_pte_prefetch,
FNAME(prefetch_gpte), FNAME(pte_prefetch), etc.

Suggested-by: David Matlack <dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-10-22 05:19:26 -04:00
David Matlack
53597858db KVM: x86/mmu: Avoid memslot lookup in make_spte and mmu_try_to_unsync_pages
mmu_try_to_unsync_pages checks if page tracking is active for the given
gfn, which requires knowing the memslot. We can pass down the memslot
via make_spte to avoid this lookup.

The memslot is also handy for make_spte's marking of the gfn as dirty:
we can test whether dirty page tracking is enabled, and if so ensure that
pages are mapped as writable with 4K granularity.  Apart from the warning,
no functional change is intended.

Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20210813203504.2742757-7-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-10-01 03:44:56 -04:00
Paolo Bonzini
ad67e4806e KVM: MMU: clean up make_spte return value
Now that make_spte is called directly by the shadow MMU (rather than
wrapped by set_spte), it only has to return one boolean value.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-10-01 03:44:54 -04:00
Paolo Bonzini
536f0e6ace KVM: MMU: change disallowed_hugepage_adjust() arguments to kvm_page_fault
Pass struct kvm_page_fault to disallowed_hugepage_adjust() instead of
extracting the arguments from the struct.  Tweak a bit the conditions
to avoid long lines.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-10-01 03:44:51 -04:00
Paolo Bonzini
73a3c65947 KVM: MMU: change kvm_mmu_hugepage_adjust() arguments to kvm_page_fault
Pass struct kvm_page_fault to kvm_mmu_hugepage_adjust() instead of
extracting the arguments from the struct; the results are also stored
in the struct, so the callers are adjusted consequently.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-10-01 03:44:51 -04:00
Lai Jiangshan
f1c4a88c41 KVM: X86: Don't unsync pagetables when speculative
We'd better only unsync the pagetable when there just was a really
write fault on a level-1 pagetable.

Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210918005636.3675-10-jiangshanlai@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-09-30 04:27:10 -04:00