Resolve nx_huge_pages to true/false when kvm.ko is loaded, leaving it as
-1 is technically undefined behavior when its value is read out by
param_get_bool(), as boolean values are supposed to be '0' or '1'.
Alternatively, KVM could define a custom getter for the param, but the
auto value doesn't depend on the vendor module in any way, and printing
"auto" would be unnecessarily unfriendly to the user.
In addition to fixing the undefined behavior, resolving the auto value
also fixes the scenario where the auto value resolves to N and no vendor
module is loaded. Previously, -1 would result in Y being printed even
though KVM would ultimately disable the mitigation.
Rename the existing MMU module init/exit helpers to clarify that they're
invoked with respect to the vendor module, and add comments to document
why KVM has two separate "module init" flows.
=========================================================================
UBSAN: invalid-load in kernel/params.c:320:33
load of value 255 is not a valid value for type '_Bool'
CPU: 6 PID: 892 Comm: tail Not tainted 5.17.0-rc3+ #799
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
Call Trace:
<TASK>
dump_stack_lvl+0x34/0x44
ubsan_epilogue+0x5/0x40
__ubsan_handle_load_invalid_value.cold+0x43/0x48
param_get_bool.cold+0xf/0x14
param_attr_show+0x55/0x80
module_attr_show+0x1c/0x30
sysfs_kf_seq_show+0x93/0xc0
seq_read_iter+0x11c/0x450
new_sync_read+0x11b/0x1a0
vfs_read+0xf0/0x190
ksys_read+0x5f/0xe0
do_syscall_64+0x3b/0xc0
entry_SYSCALL_64_after_hwframe+0x44/0xae
</TASK>
=========================================================================
Fixes: b8e8c8303f ("kvm: mmu: ITLB_MULTIHIT mitigation")
Cc: stable@vger.kernel.org
Reported-by: Bruno Goncalves <bgoncalv@redhat.com>
Reported-by: Jan Stancek <jstancek@redhat.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220331221359.3912754-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* Documentation improvements
* Prevent module exit until all VMs are freed
* PMU Virtualization fixes
* Fix for kvm_irq_delivery_to_apic_fast() NULL-pointer dereferences
* Other miscellaneous bugfixes
-----BEGIN PGP SIGNATURE-----
iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmJIGV8UHHBib256aW5p
QHJlZGhhdC5jb20ACgkQv/vSX3jHroO5FQgAhls4+Nu+NqId/yvvyNxr3vXq0dHI
hLlHtvzgGzZisZ7y2bNeyIpJVBDT5LCbrptPD/5eTvchVswDh0+kCVC0Uni5ugGT
tLT/Pv9Oq9e0X7aGdHRyuHIivIFDC20zIZO2DV48Lrj/+r6DafB2Fghq2XQLlBxN
p8KislvuqAAos543BPC1+Lk3dhOLuZ8qcFD8wGRlcCwjNwYaitrQ16rO04cLfUur
OwIks1I6TdI2JpLBhm6oWYVG/YnRsoo4bQE8cjdQ6yNSbwWtRpV33q7X6onw8x8K
BEeESoTnMqfaxIF/6mPl6bnDblVHFp6Xhld/vJcgeWQTdajFtuFE/K4sCA==
=xnQ6
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm fixes from Paolo Bonzini:
- Only do MSR filtering for MSRs accessed by rdmsr/wrmsr
- Documentation improvements
- Prevent module exit until all VMs are freed
- PMU Virtualization fixes
- Fix for kvm_irq_delivery_to_apic_fast() NULL-pointer dereferences
- Other miscellaneous bugfixes
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (42 commits)
KVM: x86: fix sending PV IPI
KVM: x86/mmu: do compare-and-exchange of gPTE via the user address
KVM: x86: Remove redundant vm_entry_controls_clearbit() call
KVM: x86: cleanup enter_rmode()
KVM: x86: SVM: fix tsc scaling when the host doesn't support it
kvm: x86: SVM: remove unused defines
KVM: x86: SVM: move tsc ratio definitions to svm.h
KVM: x86: SVM: fix avic spec based definitions again
KVM: MIPS: remove reference to trap&emulate virtualization
KVM: x86: document limitations of MSR filtering
KVM: x86: Only do MSR filtering when access MSR by rdmsr/wrmsr
KVM: x86/emulator: Emulate RDPID only if it is enabled in guest
KVM: x86/pmu: Fix and isolate TSX-specific performance event logic
KVM: x86: mmu: trace kvm_mmu_set_spte after the new SPTE was set
KVM: x86/svm: Clear reserved bits written to PerfEvtSeln MSRs
KVM: x86: Trace all APICv inhibit changes and capture overall status
KVM: x86: Add wrappers for setting/clearing APICv inhibits
KVM: x86: Make APICv inhibit reasons an enum and cleanup naming
KVM: X86: Handle implicit supervisor access with SMAP
KVM: X86: Rename variable smap to not_smap in permission_fault()
...
Add optional callback .vcpu_get_apicv_inhibit_reasons returning
extra inhibit reasons that prevent APICv from working on this vCPU.
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20220322174050.241850-6-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This sets the default TSC frequency for subsequently created vCPUs.
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Message-Id: <20220225145304.36166-2-dwmw2@infradead.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add support for SCHEDOP_poll hypercall.
This implementation is optimized for polling for a single channel, which
is what Linux does. Polling for multiple channels is not especially
efficient (and has not been tested).
PV spinlocks slow path uses this hypercall, and explicitly crash if it's
not supported.
[ dwmw2: Rework to use kvm_vcpu_halt(), not supported for 32-bit guests ]
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20220303154127.202856-17-dwmw2@infradead.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Windows uses a per-vCPU vector, and it's delivered via the local APIC
basically like an MSI (with associated EOI) unlike the traditional
guest-wide vector which is just magically asserted by Xen (and in the
KVM case by kvm_xen_has_interrupt() / kvm_cpu_get_extint()).
Now that the kernel is able to raise event channel events for itself,
being able to do so for Windows guests is also going to be useful.
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20220303154127.202856-15-dwmw2@infradead.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Turns out this is a fast path for PV guests because they use it to
trigger the event channel upcall. So letting it bounce all the way up
to userspace is not great.
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20220303154127.202856-14-dwmw2@infradead.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
If the guest has offloaded the timer virq, handle the following
hypercalls for programming the timer:
VCPUOP_set_singleshot_timer
VCPUOP_stop_singleshot_timer
set_timer_op(timestamp_ns)
The event channel corresponding to the timer virq is then used to inject
events once timer deadlines are met. For now we back the PV timer with
hrtimer.
[ dwmw2: Add save/restore, 32-bit compat mode, immediate delivery,
don't check timer in kvm_vcpu_has_event() ]
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20220303154127.202856-13-dwmw2@infradead.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
In order to intercept hypercalls such as VCPUOP_set_singleshot_timer, we
need to be aware of the Xen CPU numbering.
This looks a lot like the Hyper-V handling of vpidx, for obvious reasons.
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20220303154127.202856-12-dwmw2@infradead.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Userspace registers a sending @port to either deliver to an @eventfd
or directly back to a local event channel port.
After binding events the guest or host may wish to bind those
events to a particular vcpu. This is usually done for unbound
and and interdomain events. Update requests are handled via the
KVM_XEN_EVTCHN_UPDATE flag.
Unregistered ports are handled by the emulator.
Co-developed-by: Ankur Arora <ankur.a.arora@oracle.com>
Co-developed-By: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20220303154127.202856-10-dwmw2@infradead.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This switches the final pvclock to kvm_setup_pvclock_pfncache() and now
the old kvm_setup_pvclock_page() can be removed.
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20220303154127.202856-7-dwmw2@infradead.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Currently, the fast path of kvm_xen_set_evtchn_fast() doesn't set the
index bits in the target vCPU's evtchn_pending_sel, because it only has
a userspace virtual address with which to do so. It just sets them in
the kernel, and kvm_xen_has_interrupt() then completes the delivery to
the actual vcpu_info structure when the vCPU runs.
Using a gfn_to_pfn_cache allows kvm_xen_set_evtchn_fast() to do the full
delivery in the common case.
Clean up the fallback case too, by moving the deferred delivery out into
a separate kvm_xen_inject_pending_events() function which isn't ever
called in atomic contexts as __kvm_xen_has_interrupt() is.
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20220303154127.202856-6-dwmw2@infradead.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add a new kvm_setup_guest_pvclock() which parallels the existing
kvm_setup_pvclock_page(). The latter will be removed once we convert
all users to the gfn_to_pfn_cache version.
Using the new cache, we can potentially let kvm_set_guest_paused() set
the PVCLOCK_GUEST_STOPPED bit directly rather than having to delegate
to the vCPU via KVM_REQ_CLOCK_UPDATE. But not yet.
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20220303154127.202856-5-dwmw2@infradead.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20220303154127.202856-4-dwmw2@infradead.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM handles the VMCALL/VMMCALL instructions very strangely. Even though
both of these instructions really should #UD when executed on the wrong
vendor's hardware (i.e. VMCALL on SVM, VMMCALL on VMX), KVM replaces the
guest's instruction with the appropriate instruction for the vendor.
Nonetheless, older guest kernels without commit c1118b3602 ("x86: kvm:
use alternatives for VMCALL vs. VMMCALL if kernel text is read-only")
do not patch in the appropriate instruction using alternatives, likely
motivating KVM's intervention.
Add a quirk allowing userspace to opt out of hypercall patching. If the
quirk is disabled, KVM synthesizes a #UD in the guest.
Signed-off-by: Oliver Upton <oupton@google.com>
Message-Id: <20220316005538.2282772-2-oupton@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add set/clear wrappers for toggling APICv inhibits to make the call sites
more readable, and opportunistically rename the inner helpers to align
with the new wrappers and to make them more readable as well. Invert the
flag from "activate" to "set"; activate is painfully ambiguous as it's
not obvious if the inhibit is being activated, or if APICv is being
activated, in which case the inhibit is being deactivated.
For the functions that take @set, swap the order of the inhibit reason
and @set so that the call sites are visually similar to those that bounce
through the wrapper.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220311043517.17027-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Use an enum for the APICv inhibit reasons, there is no meaning behind
their values and they most definitely are not "unsigned longs". Rename
the various params to "reason" for consistency and clarity (inhibit may
be confused as a command, i.e. inhibit APICv, instead of the reason that
is getting toggled/checked).
No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220311043517.17027-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
There are two kinds of implicit supervisor access
implicit supervisor access when CPL = 3
implicit supervisor access when CPL < 3
Current permission_fault() handles only the first kind for SMAP.
But if the access is implicit when SMAP is on, data may not be read
nor write from any user-mode address regardless the current CPL.
So the second kind should be also supported.
The first kind can be detect via CPL and access mode: if it is
supervisor access and CPL = 3, it must be implicit supervisor access.
But it is not possible to detect the second kind without extra
information, so this patch adds an artificial PFERR_EXPLICIT_ACCESS
into @access. This extra information also works for the first kind, so
the logic is changed to use this information for both cases.
The value of PFERR_EXPLICIT_ACCESS is deliberately chosen to be bit 48
which is in the most significant 16 bits of u64 and less likely to be
forced to change due to future hardware uses it.
This patch removes the call to ->get_cpl() for access mode is determined
by @access. Not only does it reduce a function call, but also remove
confusions when the permission is checked for nested TDP. The nested
TDP shouldn't have SMAP checking nor even the L2's CPL have any bearing
on it. The original code works just because it is always user walk for
NPT and SMAP fault is not set for EPT in update_permission_bitmask.
Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Message-Id: <20220311070346.45023-5-jiangshanlai@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Change the type of access u32 to u64 for FNAME(walk_addr) and
->gva_to_gpa().
The kinds of accesses are usually combinations of UWX, and VMX/SVM's
nested paging adds a new factor of access: is it an access for a guest
page table or for a final guest physical address.
And SMAP relies a factor for supervisor access: explicit or implicit.
So @access in FNAME(walk_addr) and ->gva_to_gpa() is better to include
all these information to do the walk.
Although @access(u32) has enough bits to encode all the kinds, this
patch extends it to u64:
o Extra bits will be in the higher 32 bits, so that we can
easily obtain the traditional access mode (UWX) by converting
it to u32.
o Reuse the value for the access kind defined by SVM's nested
paging (PFERR_GUEST_FINAL_MASK and PFERR_GUEST_PAGE_MASK) as
@error_code in kvm_handle_page_fault().
Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Message-Id: <20220311070346.45023-2-jiangshanlai@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The third nybble of AMD's event select overlaps with Intel's IN_TX and
IN_TXCP bits. Therefore, we can't use AMD64_RAW_EVENT_MASK on Intel
platforms that support TSX.
Declare a raw_event_mask in the kvm_pmu structure, initialize it in
the vendor-specific pmu_refresh() functions, and use that mask for
PERF_TYPE_RAW configurations in reprogram_gp_counter().
Fixes: 710c476514 ("KVM: x86/pmu: Use AMD64_RAW_EVENT_MASK for PERF_TYPE_RAW")
Signed-off-by: Jim Mattson <jmattson@google.com>
Message-Id: <20220308012452.3468611-1-jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
If kvm->arch.tdp_mmu_zap_wq cannot be created, the failure has
to be propagated up to kvm_mmu_init_vm and kvm_arch_init_vm.
kvm_arch_init_vm also has to undo all the initialization, so
group all the MMU initialization code at the beginning and
handle cleaning up of kvm_page_track_init.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
- Proper emulation of the OSLock feature of the debug architecture
- Scalibility improvements for the MMU lock when dirty logging is on
- New VMID allocator, which will eventually help with SVA in VMs
- Better support for PMUs in heterogenous systems
- PSCI 1.1 support, enabling support for SYSTEM_RESET2
- Implement CONFIG_DEBUG_LIST at EL2
- Make CONFIG_ARM64_ERRATUM_2077057 default y
- Reduce the overhead of VM exit when no interrupt is pending
- Remove traces of 32bit ARM host support from the documentation
- Updated vgic selftests
- Various cleanups, doc updates and spelling fixes
RISC-V:
- Prevent KVM_COMPAT from being selected
- Optimize __kvm_riscv_switch_to() implementation
- RISC-V SBI v0.3 support
s390:
- memop selftest
- fix SCK locking
- adapter interruptions virtualization for secure guests
- add Claudio Imbrenda as maintainer
- first step to do proper storage key checking
x86:
- Continue switching kvm_x86_ops to static_call(); introduce
static_call_cond() and __static_call_ret0 when applicable.
- Cleanup unused arguments in several functions
- Synthesize AMD 0x80000021 leaf
- Fixes and optimization for Hyper-V sparse-bank hypercalls
- Implement Hyper-V's enlightened MSR bitmap for nested SVM
- Remove MMU auditing
- Eager splitting of page tables (new aka "TDP" MMU only) when dirty
page tracking is enabled
- Cleanup the implementation of the guest PGD cache
- Preparation for the implementation of Intel IPI virtualization
- Fix some segment descriptor checks in the emulator
- Allow AMD AVIC support on systems with physical APIC ID above 255
- Better API to disable virtualization quirks
- Fixes and optimizations for the zapping of page tables:
- Zap roots in two passes, avoiding RCU read-side critical sections
that last too long for very large guests backed by 4 KiB SPTEs.
- Zap invalid and defunct roots asynchronously via concurrency-managed
work queue.
- Allowing yielding when zapping TDP MMU roots in response to the root's
last reference being put.
- Batch more TLB flushes with an RCU trick. Whoever frees the paging
structure now holds RCU as a proxy for all vCPUs running in the guest,
i.e. to prolongs the grace period on their behalf. It then kicks the
the vCPUs out of guest mode before doing rcu_read_unlock().
Generic:
- Introduce __vcalloc and use it for very large allocations that
need memcg accounting
-----BEGIN PGP SIGNATURE-----
iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmI4fdwUHHBib256aW5p
QHJlZGhhdC5jb20ACgkQv/vSX3jHroMq8gf/WoeVHtw2QlL5Mmz6McvRRmPAYPLV
wLUIFNrRqRvd8Tw4kivzZoh/xTpwmnojv0YdK5SjKAiMjgv094YI1LrNp1JSPvmL
pitocMkA10RSJNWHeEMg9cMSKH0rKiqeYl6S1e2XsdB+UZZ2BINOCVtvglmjTAvJ
dFBdKdBkqjAUZbdXAGIvz4JEEER3N/LkFDKGaUGX+0QIQOzGBPIyLTxynxIDG6mt
RViCCFyXdy5NkVp5hZFm96vQ2qAlWL9B9+iKruQN++82+oqWbeTdSqPhdwF7GyFz
BfOv3gobQ2c4ef/aMLO5LswZ9joI1t/4kQbbAn6dNybpOAz/NXfDnbNefg==
=keox
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm updates from Paolo Bonzini:
"ARM:
- Proper emulation of the OSLock feature of the debug architecture
- Scalibility improvements for the MMU lock when dirty logging is on
- New VMID allocator, which will eventually help with SVA in VMs
- Better support for PMUs in heterogenous systems
- PSCI 1.1 support, enabling support for SYSTEM_RESET2
- Implement CONFIG_DEBUG_LIST at EL2
- Make CONFIG_ARM64_ERRATUM_2077057 default y
- Reduce the overhead of VM exit when no interrupt is pending
- Remove traces of 32bit ARM host support from the documentation
- Updated vgic selftests
- Various cleanups, doc updates and spelling fixes
RISC-V:
- Prevent KVM_COMPAT from being selected
- Optimize __kvm_riscv_switch_to() implementation
- RISC-V SBI v0.3 support
s390:
- memop selftest
- fix SCK locking
- adapter interruptions virtualization for secure guests
- add Claudio Imbrenda as maintainer
- first step to do proper storage key checking
x86:
- Continue switching kvm_x86_ops to static_call(); introduce
static_call_cond() and __static_call_ret0 when applicable.
- Cleanup unused arguments in several functions
- Synthesize AMD 0x80000021 leaf
- Fixes and optimization for Hyper-V sparse-bank hypercalls
- Implement Hyper-V's enlightened MSR bitmap for nested SVM
- Remove MMU auditing
- Eager splitting of page tables (new aka "TDP" MMU only) when dirty
page tracking is enabled
- Cleanup the implementation of the guest PGD cache
- Preparation for the implementation of Intel IPI virtualization
- Fix some segment descriptor checks in the emulator
- Allow AMD AVIC support on systems with physical APIC ID above 255
- Better API to disable virtualization quirks
- Fixes and optimizations for the zapping of page tables:
- Zap roots in two passes, avoiding RCU read-side critical
sections that last too long for very large guests backed by 4
KiB SPTEs.
- Zap invalid and defunct roots asynchronously via
concurrency-managed work queue.
- Allowing yielding when zapping TDP MMU roots in response to the
root's last reference being put.
- Batch more TLB flushes with an RCU trick. Whoever frees the
paging structure now holds RCU as a proxy for all vCPUs running
in the guest, i.e. to prolongs the grace period on their behalf.
It then kicks the the vCPUs out of guest mode before doing
rcu_read_unlock().
Generic:
- Introduce __vcalloc and use it for very large allocations that need
memcg accounting"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (246 commits)
KVM: use kvcalloc for array allocations
KVM: x86: Introduce KVM_CAP_DISABLE_QUIRKS2
kvm: x86: Require const tsc for RT
KVM: x86: synthesize CPUID leaf 0x80000021h if useful
KVM: x86: add support for CPUID leaf 0x80000021
KVM: x86: do not use KVM_X86_OP_OPTIONAL_RET0 for get_mt_mask
Revert "KVM: x86/mmu: Zap only TDP MMU leafs in kvm_zap_gfn_range()"
kvm: x86/mmu: Flush TLB before zap_gfn_range releases RCU
KVM: arm64: fix typos in comments
KVM: arm64: Generalise VM features into a set of flags
KVM: s390: selftests: Add error memop tests
KVM: s390: selftests: Add more copy memop tests
KVM: s390: selftests: Add named stages for memop test
KVM: s390: selftests: Add macro as abstraction for MEM_OP
KVM: s390: selftests: Split memop tests
KVM: s390x: fix SCK locking
RISC-V: KVM: Implement SBI HSM suspend call
RISC-V: KVM: Add common kvm_riscv_vcpu_wfi() function
RISC-V: Add SBI HSM suspend related defines
RISC-V: KVM: Implement SBI v0.3 SRST extension
...
- Fix address filtering for Intel/PT,ARM/CoreSight
- Enable Intel/PEBS format 5
- Allow more fixed-function counters for x86
- Intel/PT: Enable not recording Taken-Not-Taken packets
- Add a few branch-types
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmI4WdIRHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1jdTA/7BADTYzFCbdwPzHt2mR8osv7k+pDvYxs9
wxNjyi1X7N8cPkhqgIg9CfdhdyDOqo7+J4fG17f2qbwjNK7b2Fb1/U6ZoZaf+f8F
W0e2LX5KZTXUhkA+TEjrXvYD9FmJaCPM/l2RQg8U7okBs2kb0H6QT2Yn21wd1roC
WwI5KFiWSVS1IzpVLaXjDh+FJfJHd75ReMqJeus+QoVQ9NHeuI+t4DglSB1IBi54
d/zeVXE/Y4dFTQOrU06S2HxcOEptvXZsPmVLvKab/veeGGyWiGPxQpvu6bXm6u3x
0sV+dn67zut2m2pQlUZUucgGTSYIZTpOe+rNukTB9hJ4XeN4/1ohOOCrOuYM+63P
lGFbN1v+LD7Wc6C2eEhw8G5GEL0qbwzFNQ06O3EOFi7C7GKn7WS/ET6XuuMOERFk
uxEPb4pFtbBlJ0SriCprFJSd5NL3PORZlLIhv4hGH5hilLR1TFeKDuwZaM4noQxU
dL3rKGLi9H+P46Eni9H28+0gDISbv1xL+WivHOFQNmhBqAZO52ZcF3J+dgBaR1B5
pBxVTycFpZMjxSZnqTE0gMsFaLIpVGc+75Chns1rajR0mEtRtJUQUbYz4tK4zb0E
dZR1p+VF6+DYmSRhiqeaTi9uz9oE8kMa8o/EcbFIg/9BgEnUwJXU20bjnar30xQ7
9OIn7r9hjHI=
=XPuo
-----END PGP SIGNATURE-----
Merge tag 'perf-core-2022-03-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 perf event updates from Ingo Molnar:
- Fix address filtering for Intel/PT,ARM/CoreSight
- Enable Intel/PEBS format 5
- Allow more fixed-function counters for x86
- Intel/PT: Enable not recording Taken-Not-Taken packets
- Add a few branch-types
* tag 'perf-core-2022-03-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/x86/intel/uncore: Fix the build on !CONFIG_PHYS_ADDR_T_64BIT
perf: Add irq and exception return branch types
perf/x86/intel/uncore: Make uncore_discovery clean for 64 bit addresses
perf/x86/intel/pt: Add a capability and config bit for disabling TNTs
perf/x86/intel/pt: Add a capability and config bit for event tracing
perf/x86/intel: Increase max number of the fixed counters
KVM: x86: use the KVM side max supported fixed counter
perf/x86/intel: Enable PEBS format 5
perf/core: Allow kernel address filter when not filtering the kernel
perf/x86/intel/pt: Fix address filter config for 32-bit kernel
perf/core: Fix address filter parser for multiple filters
x86: Share definition of __is_canonical_address()
perf/x86/intel/pt: Relax address filter validation
KVM_CAP_DISABLE_QUIRKS is irrevocably broken. The capability does not
advertise the set of quirks which may be disabled to userspace, so it is
impossible to predict the behavior of KVM. Worse yet,
KVM_CAP_DISABLE_QUIRKS will tolerate any value for cap->args[0], meaning
it fails to reject attempts to set invalid quirk bits.
The only valid workaround for the quirky quirks API is to add a new CAP.
Actually advertise the set of quirks that can be disabled to userspace
so it can predict KVM's behavior. Reject values for cap->args[0] that
contain invalid bits.
Finally, add documentation for the new capability and describe the
existing quirks.
Signed-off-by: Oliver Upton <oupton@google.com>
Message-Id: <20220301060351.442881-5-oupton@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Use the system worker threads to zap the roots invalidated
by the TDP MMU's "fast zap" mechanism, implemented by
kvm_tdp_mmu_invalidate_all_roots().
At this point, apart from allowing some parallelism in the zapping of
roots, the workqueue is a glorified linked list: work items are added and
flushed entirely within a single kvm->slots_lock critical section. However,
the workqueue fixes a latent issue where kvm_mmu_zap_all_invalidated_roots()
assumes that it owns a reference to all invalid roots; therefore, no
one can set the invalid bit outside kvm_mmu_zap_all_fast(). Putting the
invalidated roots on a linked list... erm, on a workqueue ensures that
tdp_mmu_zap_root_work() only puts back those extra references that
kvm_mmu_zap_all_invalidated_roots() had gifted to it.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Zap only obsolete roots when responding to zapping a single root shadow
page. Because KVM keeps root_count elevated when stuffing a previous
root into its PGD cache, shadowing a 64-bit guest means that zapping any
root causes all vCPUs to reload all roots, even if their current root is
not affected by the zap.
For many kernels, zapping a single root is a frequent operation, e.g. in
Linux it happens whenever an mm is dropped, e.g. process exits, etc...
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Ben Gardon <bgardon@google.com>
Message-Id: <20220225182248.3812651-5-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
These functions only operate on a given MMU, of which there is more
than one in a vCPU (we care about two, because the third does not have
any roots and is only used to walk guest page tables). They do need a
struct kvm in order to lock the mmu_lock, but they do not needed anything
else in the struct kvm_vcpu. So, pass the vcpu->kvm directly to them.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The root_hpa and root_pgd fields form essentially a struct kvm_mmu_root_info.
Use the struct to have more consistency between mmu->root and
mmu->prev_roots.
The patch is entirely search and replace except for cached_root_available,
which does not need a temporary struct kvm_mmu_root_info anymore.
Reviewed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add a new capability, KVM_CAP_PMU_CAPABILITY, that takes a bitmask of
settings/features to allow userspace to configure PMU virtualization on
a per-VM basis. For now, support a single flag, KVM_PMU_CAP_DISABLE,
to allow disabling PMU virtualization for a VM even when KVM is configured
with enable_pmu=true a module level.
To keep KVM simple, disallow changing VM's PMU configuration after vCPUs
have been created.
Signed-off-by: David Dunn <daviddunn@google.com>
Message-Id: <20220223225743.2703915-2-daviddunn@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Cast kvm_x86_ops.func to 'void *' when updating KVM static calls that are
conditionally patched to __static_call_return0(). clang complains about
using mismatching pointers in the ternary operator, which breaks the
build when compiling with CONFIG_KVM_WERROR=y.
>> arch/x86/include/asm/kvm-x86-ops.h:82:1: warning: pointer type mismatch
('bool (*)(struct kvm_vcpu *)' and 'void *') [-Wpointer-type-mismatch]
Fixes: 5be2226f41 ("KVM: x86: allow defining return-0 static calls")
Reported-by: Like Xu <like.xu.linux@gmail.com>
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: David Dunn <daviddunn@google.com>
Reviewed-by: Nathan Chancellor <nathan@kernel.org>
Tested-by: Nathan Chancellor <nathan@kernel.org>
Message-Id: <20220223162355.3174907-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Remove mmu_audit.c and all its collateral, the auditing code has suffered
severe bitrot, ironically partly due to shadow paging being more stable
and thus not benefiting as much from auditing, but mostly due to TDP
supplanting shadow paging for non-nested guests and shadowing of nested
TDP not heavily stressing the logic that is being audited.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
A few vendor callbacks are only used by VMX, but they return an integer
or bool value. Introduce KVM_X86_OP_OPTIONAL_RET0 for them: if a func is
NULL in struct kvm_x86_ops, it will be changed to __static_call_return0
when updating static calls.
Reviewed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Use the newly corrected KVM_X86_OP annotations to warn about possible
NULL pointer dereferences as soon as the vendor module is loaded.
Reviewed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The original use of KVM_X86_OP_NULL, which was to mark calls
that do not follow a specific naming convention, is not in use
anymore. Instead, let's mark calls that are optional because
they are always invoked within conditionals or with static_call_cond.
Those that are _not_, i.e. those that are defined with KVM_X86_OP,
must be defined by both vendor modules or some kind of NULL pointer
dereference is bound to happen at runtime.
Reviewed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The two ioctls used to implement userspace-accelerated TPR,
KVM_TPR_ACCESS_REPORTING and KVM_SET_VAPIC_ADDR, are available
even if hardware-accelerated TPR can be used. So there is
no reason not to report KVM_CAP_VAPIC.
Reviewed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
kvm_vcpu_arch currently contains the guest supported features in both
guest_supported_xcr0 and guest_fpu.fpstate->user_xfeatures field.
Currently both fields are set to the same value in
kvm_vcpu_after_set_cpuid() and are not changed anywhere else after that.
Since it's not good to keep duplicated data, remove guest_supported_xcr0.
To keep the code more readable, introduce kvm_guest_supported_xcr()
and kvm_guest_supported_xfd() to replace the previous usages of
guest_supported_xcr0.
Signed-off-by: Leonardo Bras <leobras@redhat.com>
Message-Id: <20220217053028.96432-3-leobras@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When using KVM_DIRTY_LOG_INITIALLY_SET, huge pages are not
write-protected when dirty logging is enabled on the memslot. Instead
they are write-protected once userspace invokes KVM_CLEAR_DIRTY_LOG for
the first time and only for the specific sub-region being cleared.
Enhance KVM_CLEAR_DIRTY_LOG to also try to split huge pages prior to
write-protecting to avoid causing write-protection faults on vCPU
threads. This also allows userspace to smear the cost of huge page
splitting across multiple ioctls, rather than splitting the entire
memslot as is the case when initially-all-set is not used.
Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20220119230739.2234394-17-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When dirty logging is enabled without initially-all-set, try to split
all huge pages in the memslot down to 4KB pages so that vCPUs do not
have to take expensive write-protection faults to split huge pages.
Eager page splitting is best-effort only. This commit only adds the
support for the TDP MMU, and even there splitting may fail due to out
of memory conditions. Failures to split a huge page is fine from a
correctness standpoint because KVM will always follow up splitting by
write-protecting any remaining huge pages.
Eager page splitting moves the cost of splitting huge pages off of the
vCPU threads and onto the thread enabling dirty logging on the memslot.
This is useful because:
1. Splitting on the vCPU thread interrupts vCPUs execution and is
disruptive to customers whereas splitting on VM ioctl threads can
run in parallel with vCPU execution.
2. Splitting all huge pages at once is more efficient because it does
not require performing VM-exit handling or walking the page table for
every 4KiB page in the memslot, and greatly reduces the amount of
contention on the mmu_lock.
For example, when running dirty_log_perf_test with 96 virtual CPUs, 1GiB
per vCPU, and 1GiB HugeTLB memory, the time it takes vCPUs to write to
all of their memory after dirty logging is enabled decreased by 95% from
2.94s to 0.14s.
Eager Page Splitting is over 100x more efficient than the current
implementation of splitting on fault under the read lock. For example,
taking the same workload as above, Eager Page Splitting reduced the CPU
required to split all huge pages from ~270 CPU-seconds ((2.94s - 0.14s)
* 96 vCPU threads) to only 1.55 CPU-seconds.
Eager page splitting does increase the amount of time it takes to enable
dirty logging since it has split all huge pages. For example, the time
it took to enable dirty logging in the 96GiB region of the
aforementioned test increased from 0.001s to 1.55s.
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20220119230739.2234394-16-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Use slightly more verbose names for the so called "memory encrypt",
a.k.a. "mem enc", kvm_x86_ops hooks to bridge the gap between the current
super short kvm_x86_ops names and SVM's more verbose, but non-conforming
names. This is a step toward using kvm-x86-ops.h with KVM_X86_CVM_OP()
to fill svm_x86_ops.
Opportunistically rename mem_enc_op() to mem_enc_ioctl() to better
reflect its true nature, as it really is a full fledged ioctl() of its
own. Ideally, the hook would be named confidential_vm_ioctl() or so, as
the ioctl() is a gateway to more than just memory encryption, and because
its underlying purpose to support Confidential VMs, which can be provided
without memory encryption, e.g. if the TCB of the guest includes the host
kernel but not host userspace, or by isolation in hardware without
encrypting memory. But, diverging from KVM_MEMORY_ENCRYPT_OP even
further is undeseriable, and short of creating alises for all related
ioctl()s, which introduces a different flavor of divergence, KVM is stuck
with the nomenclature.
Defer renaming SVM's functions to a future commit as there are additional
changes needed to make SVM fully conforming and to match reality (looking
at you, svm_vm_copy_asid_from()).
No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220128005208.4008533-20-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Move kvm_get_cs_db_l_bits() to SVM and rename it appropriately so that
its svm_x86_ops entry can be filled via kvm-x86-ops, and to eliminate a
superfluous export from KVM x86.
No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220128005208.4008533-16-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Rename a variety of kvm_x86_op function pointers so that preferred name
for vendor implementations follows the pattern <vendor>_<function>, e.g.
rename .run() to .vcpu_run() to match {svm,vmx}_vcpu_run(). This will
allow vendor implementations to be wired up via the KVM_X86_OP macro.
In many cases, VMX and SVM "disagree" on the preferred name, though in
reality it's VMX and x86 that disagree as SVM blindly prepended _svm to
the kvm_x86_ops name. Justification for using the VMX nomenclature:
- set_{irq,nmi} => inject_{irq,nmi} because the helper is injecting an
event that has already been "set" in e.g. the vIRR. SVM's relevant
VMCB field is even named event_inj, and KVM's stat is irq_injections.
- prepare_guest_switch => prepare_switch_to_guest because the former is
ambiguous, e.g. it could mean switching between multiple guests,
switching from the guest to host, etc...
- update_pi_irte => pi_update_irte to allow for matching match the rest
of VMX's posted interrupt naming scheme, which is vmx_pi_<blah>().
- start_assignment => pi_start_assignment to again follow VMX's posted
interrupt naming scheme, and to provide context for what bit of code
might care about an otherwise undescribed "assignment".
The "tlb_flush" => "flush_tlb" creates an inconsistency with respect to
Hyper-V's "tlb_remote_flush" hooks, but Hyper-V really is the one that's
wrong. x86, VMX, and SVM all use flush_tlb, and even common KVM is on a
variant of the bandwagon with "kvm_flush_remote_tlbs", e.g. a more
appropriate name for the Hyper-V hooks would be flush_remote_tlbs. Leave
that change for another time as the Hyper-V hooks always start as NULL,
i.e. the name doesn't matter for using kvm-x86-ops.h, and changing all
names requires an astounding amount of churn.
VMX and SVM function names are intentionally left as is to minimize the
diff. Both VMX and SVM will need to rename even more functions in order
to fully utilize KVM_X86_OPS, i.e. an additional patch for each is
inevitable.
No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220128005208.4008533-5-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The "struct kvm_vcpu *vcpu" parameter of kvm_scale_tsc() is not used,
so remove it. No functional change intended.
Signed-off-by: Jinrong Liang <cloudliang@tencent.com>
Message-Id: <20220125095909.38122-18-cloudliang@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
- A couple of fixes when handling an exception while a SError has been
delivered
- Workaround for Cortex-A510's single-step[ erratum
-----BEGIN PGP SIGNATURE-----
iQJDBAABCgAtFiEEn9UcU+C1Yxj9lZw9I9DQutE9ekMFAmH9LlcPHG1hekBrZXJu
ZWwub3JnAAoJECPQ0LrRPXpDLTcP/3Ry8CzvPubZquMyNdRUFvEg2EcfTa6vtIGW
Fw7ap2hwPUaXUgJKDihMFIWj3Wf/wPmXw4t2Sr8R/yq8v9kWe+IG1isnT0yQhY3W
kLXEqc8Mu4Rf8+jvlFHsp5mLENHIswpWAv/EY49ChgZkNmtkKpnPm1qnD89d8bNv
tUwooDWidQ/7nXdM3z6zygSROJS24+OGTYTWzOQ1KgV3FGaXbqYiCleoPOpRR/Tc
DQQWF/tVl8bZCqgkGKZCv3aXT0ZUPrQggARJGai78vP0l2sE/Kyaydgq5I7npZja
2L2U4kDNoPYIVa8A1jvV3Ef3AqNFs6B7+jXWfYIgAcXjCYzDK3cZcxavf/Inq9F1
3udVGJGSzH1KkGaihW3BVhsqGORRHKCdksJzWRgqf6vGyJhJw0u0D2u1rTWcT+jw
Nm4KxShp0CX59HSLnVF5sR0Mct3jNNZ7UCCgH7q10wuBqYRfJT32hCo2ZrT7g9oD
IQ+pa2dVYa3SaKZ4O6T/lSlbLOuuxtvmcEIfxYpPD6m10S5RrxOdsW3MCtiYM5HQ
24oo2mk6NIu/va0XxhcW+NBMcYtLQD9JUGbkUkpcRy2mgilTi9b4YPp+muYM7plQ
/S1gj2kGY8vjMg0H+wysjMJyl2huEwSRsZ/UfxCAgW+MYhHLDxhxAnDWc8EcwGgE
tUzomowB
=Mbx/
-----END PGP SIGNATURE-----
Merge tag 'kvmarm-fixes-5.17-2' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD
KVM/arm64 fixes for 5.17, take #2
- A couple of fixes when handling an exception while a SError has been
delivered
- Workaround for Cortex-A510's single-step[ erratum
KVM vPMU doesn't support to emulate all the fixed counters that the
host PMU driver has supported, e.g. the fixed counter 3 used by
Topdown metrics hasn't been supported by KVM so far.
Rename MAX_FIXED_COUNTERS to KVM_PMC_MAX_FIXED to have a more
straightforward naming convention as INTEL_PMC_MAX_FIXED used by the
host PMU driver, and fix vPMU to use the KVM side KVM_PMC_MAX_FIXED
for the virtual fixed counter emulation, instead of the host side
INTEL_PMC_MAX_FIXED.
Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/1643750603-100733-2-git-send-email-kan.liang@linux.intel.com
Handle non-APICv interrupt delivery in vendor code, even though it means
VMX and SVM will temporarily have duplicate code. SVM's AVIC has a race
condition that requires KVM to fall back to legacy interrupt injection
_after_ the interrupt has been logged in the vIRR, i.e. to fix the race,
SVM will need to open code the full flow anyways[*]. Refactor the code
so that the SVM bug without introducing other issues, e.g. SVM would
return "success" and thus invoke trace_kvm_apicv_accept_irq() even when
delivery through the AVIC failed, and to opportunistically prepare for
using KVM_X86_OP to fill each vendor's kvm_x86_ops struct, which will
rely on the vendor function matching the kvm_x86_op pointer name.
No functional change intended.
[*] https://lore.kernel.org/all/20211213104634.199141-4-mlevitsk@redhat.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220128005208.4008533-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* Redo incorrect fix for SEV/SMAP erratum
* Windows 11 Hyper-V workaround
Other x86 changes:
* Various x86 cleanups
* Re-enable access_tracking_perf_test
* Fix for #GP handling on SVM
* Fix for CPUID leaf 0Dh in KVM_GET_SUPPORTED_CPUID
* Fix for ICEBP in interrupt shadow
* Avoid false-positive RCU splat
* Enable Enlightened MSR-Bitmap support for real
ARM:
* Correctly update the shadow register on exception injection when
running in nVHE mode
* Correctly use the mm_ops indirection when performing cache invalidation
from the page-table walker
* Restrict the vgic-v3 workaround for SEIS to the two known broken
implementations
Generic code changes:
* Dead code cleanup
There will be another pull request for ARM fixes next week, but
those patches need a bit more soak time.
-----BEGIN PGP SIGNATURE-----
iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmHz5eIUHHBib256aW5p
QHJlZGhhdC5jb20ACgkQv/vSX3jHroNv4wgAopj0Zlutrrtw3KT4/XnmSdMPgN0j
jQNzysSLTO5wGQCEogycjYXkGUDFu1Gdi+K91QAyjeKja20pIhPLeS2CBDRJyOc5
73K7sxqz51JnQiVFzkTuA+qzn+lXaJ9LUXtdg8BnQMSKyt2AJOqE8uT10kcYOD5q
mW4V3QUA0QpVKN0cYHv/G/zvBwQGGSLZetFbuAzwH2EDTpIi1aio5ZN1r0AoH18L
2x5kYPpqmnoBvo2cB4b7SNmxv3ZPQ5K+wta0uwZ4pO+UuYiRd84RPr5lErywJC3w
nci0eC0DoXrC6h+35UItqM8RqAGv6LADbDnr1RGojmfogSD0OtbX8y3hjw==
=iKnI
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm fixes from Paolo Bonzini:
"Two larger x86 series:
- Redo incorrect fix for SEV/SMAP erratum
- Windows 11 Hyper-V workaround
Other x86 changes:
- Various x86 cleanups
- Re-enable access_tracking_perf_test
- Fix for #GP handling on SVM
- Fix for CPUID leaf 0Dh in KVM_GET_SUPPORTED_CPUID
- Fix for ICEBP in interrupt shadow
- Avoid false-positive RCU splat
- Enable Enlightened MSR-Bitmap support for real
ARM:
- Correctly update the shadow register on exception injection when
running in nVHE mode
- Correctly use the mm_ops indirection when performing cache
invalidation from the page-table walker
- Restrict the vgic-v3 workaround for SEIS to the two known broken
implementations
Generic code changes:
- Dead code cleanup"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (43 commits)
KVM: eventfd: Fix false positive RCU usage warning
KVM: nVMX: Allow VMREAD when Enlightened VMCS is in use
KVM: nVMX: Implement evmcs_field_offset() suitable for handle_vmread()
KVM: nVMX: Rename vmcs_to_field_offset{,_table}
KVM: nVMX: eVMCS: Filter out VM_EXIT_SAVE_VMX_PREEMPTION_TIMER
KVM: nVMX: Also filter MSR_IA32_VMX_TRUE_PINBASED_CTLS when eVMCS
selftests: kvm: check dynamic bits against KVM_X86_XCOMP_GUEST_SUPP
KVM: x86: add system attribute to retrieve full set of supported xsave states
KVM: x86: Add a helper to retrieve userspace address from kvm_device_attr
selftests: kvm: move vm_xsave_req_perm call to amx_test
KVM: x86: Sync the states size with the XCR0/IA32_XSS at, any time
KVM: x86: Update vCPU's runtime CPUID on write to MSR_IA32_XSS
KVM: x86: Keep MSR_IA32_XSS unchanged for INIT
KVM: x86: Free kvm_cpuid_entry2 array on post-KVM_RUN KVM_SET_CPUID{,2}
KVM: nVMX: WARN on any attempt to allocate shadow VMCS for vmcs02
KVM: selftests: Don't skip L2's VMCALL in SMM test for SVM guest
KVM: x86: Check .flags in kvm_cpuid_check_equal() too
KVM: x86: Forcibly leave nested virt when SMM state is toggled
KVM: SVM: drop unnecessary code in svm_hv_vmcb_dirty_nested_enlightenments()
KVM: SVM: hyper-v: Enable Enlightened MSR-Bitmap support for real
...
Forcibly leave nested virtualization operation if userspace toggles SMM
state via KVM_SET_VCPU_EVENTS or KVM_SYNC_X86_EVENTS. If userspace
forces the vCPU out of SMM while it's post-VMXON and then injects an SMI,
vmx_enter_smm() will overwrite vmx->nested.smm.vmxon and end up with both
vmxon=false and smm.vmxon=false, but all other nVMX state allocated.
Don't attempt to gracefully handle the transition as (a) most transitions
are nonsencial, e.g. forcing SMM while L2 is running, (b) there isn't
sufficient information to handle all transitions, e.g. SVM wants access
to the SMRAM save state, and (c) KVM_SET_VCPU_EVENTS must precede
KVM_SET_NESTED_STATE during state restore as the latter disallows putting
the vCPU into L2 if SMM is active, and disallows tagging the vCPU as
being post-VMXON in SMM if SMM is not active.
Abuse of KVM_SET_VCPU_EVENTS manifests as a WARN and memory leak in nVMX
due to failure to free vmcs01's shadow VMCS, but the bug goes far beyond
just a memory leak, e.g. toggling SMM on while L2 is active puts the vCPU
in an architecturally impossible state.
WARNING: CPU: 0 PID: 3606 at free_loaded_vmcs arch/x86/kvm/vmx/vmx.c:2665 [inline]
WARNING: CPU: 0 PID: 3606 at free_loaded_vmcs+0x158/0x1a0 arch/x86/kvm/vmx/vmx.c:2656
Modules linked in:
CPU: 1 PID: 3606 Comm: syz-executor725 Not tainted 5.17.0-rc1-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
RIP: 0010:free_loaded_vmcs arch/x86/kvm/vmx/vmx.c:2665 [inline]
RIP: 0010:free_loaded_vmcs+0x158/0x1a0 arch/x86/kvm/vmx/vmx.c:2656
Code: <0f> 0b eb b3 e8 8f 4d 9f 00 e9 f7 fe ff ff 48 89 df e8 92 4d 9f 00
Call Trace:
<TASK>
kvm_arch_vcpu_destroy+0x72/0x2f0 arch/x86/kvm/x86.c:11123
kvm_vcpu_destroy arch/x86/kvm/../../../virt/kvm/kvm_main.c:441 [inline]
kvm_destroy_vcpus+0x11f/0x290 arch/x86/kvm/../../../virt/kvm/kvm_main.c:460
kvm_free_vcpus arch/x86/kvm/x86.c:11564 [inline]
kvm_arch_destroy_vm+0x2e8/0x470 arch/x86/kvm/x86.c:11676
kvm_destroy_vm arch/x86/kvm/../../../virt/kvm/kvm_main.c:1217 [inline]
kvm_put_kvm+0x4fa/0xb00 arch/x86/kvm/../../../virt/kvm/kvm_main.c:1250
kvm_vm_release+0x3f/0x50 arch/x86/kvm/../../../virt/kvm/kvm_main.c:1273
__fput+0x286/0x9f0 fs/file_table.c:311
task_work_run+0xdd/0x1a0 kernel/task_work.c:164
exit_task_work include/linux/task_work.h:32 [inline]
do_exit+0xb29/0x2a30 kernel/exit.c:806
do_group_exit+0xd2/0x2f0 kernel/exit.c:935
get_signal+0x4b0/0x28c0 kernel/signal.c:2862
arch_do_signal_or_restart+0x2a9/0x1c40 arch/x86/kernel/signal.c:868
handle_signal_work kernel/entry/common.c:148 [inline]
exit_to_user_mode_loop kernel/entry/common.c:172 [inline]
exit_to_user_mode_prepare+0x17d/0x290 kernel/entry/common.c:207
__syscall_exit_to_user_mode_work kernel/entry/common.c:289 [inline]
syscall_exit_to_user_mode+0x19/0x60 kernel/entry/common.c:300
do_syscall_64+0x42/0xb0 arch/x86/entry/common.c:86
entry_SYSCALL_64_after_hwframe+0x44/0xae
</TASK>
Cc: stable@vger.kernel.org
Reported-by: syzbot+8112db3ab20e70d50c31@syzkaller.appspotmail.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220125220358.2091737-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Pass the emulation type to kvm_x86_ops.can_emulate_insutrction() so that
a future commit can harden KVM's SEV support to WARN on emulation
scenarios that should never happen.
No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Liam Merwick <liam.merwick@oracle.com>
Message-Id: <20220120010719.711476-6-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Make kvm_vcpu_reload_apic_access_page() static
as it is no longer invoked directly by vmx
and it is also no longer exported.
No functional change intended.
Signed-off-by: Quanfa Fu <quanfafu@gmail.com>
Message-Id: <20211219091446.174584-1-quanfafu@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>