1
0
Fork 0
mirror of synced 2025-03-06 20:59:54 +01:00
Commit graph

1071 commits

Author SHA1 Message Date
Linus Torvalds
636b5284d8 Generic:
- selftest compilation fix for non-x86
 
 - KVM: avoid warning on s390 in mark_page_dirty
 
 x86:
 - fix page write-protection bug and improve comments
 
 - use binary search to lookup the PMU event filter, add test
 
 - enable_pmu module parameter support for Intel CPUs
 
 - switch blocked_vcpu_on_cpu_lock to raw spinlock
 
 - cleanups of blocked vCPU logic
 
 - partially allow KVM_SET_CPUID{,2} after KVM_RUN (5.16 regression)
 
 - various small fixes
 -----BEGIN PGP SIGNATURE-----
 
 iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmHpmT0UHHBib256aW5p
 QHJlZGhhdC5jb20ACgkQv/vSX3jHroOstggAi1VSpT43oGslQjXNDZacHEARoYQs
 b0XpoW7HXicGSGRMWspCmiAPdJyYTsioEACttAmXUMs7brAgHb9n/vzdlcLh1ymL
 rQw2YFQlfqqB1Ki1iRhNkWlH9xOECsu28WLng6ylrx51GuT/pzWRt+V3EGUFTxIT
 ldW9HgZg2oFJIaLjg2hQVR/8EbBf0QdsAD3KV3tyvhBlXPkyeLOMcGe9onfjZ/NE
 JQeW7FtKtP4SsIFt1KrJpDPjtiwFt3bRM0gfgGw7//clvtKIqt1LYXZiq4C3b7f5
 tfYiC8lO2vnOoYcfeYEmvybbSsoS/CgSliZB32qkwoVvRMIl82YmxtDD+Q==
 =/Mak
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull more kvm updates from Paolo Bonzini:
 "Generic:

   - selftest compilation fix for non-x86

   - KVM: avoid warning on s390 in mark_page_dirty

 x86:

   - fix page write-protection bug and improve comments

   - use binary search to lookup the PMU event filter, add test

   - enable_pmu module parameter support for Intel CPUs

   - switch blocked_vcpu_on_cpu_lock to raw spinlock

   - cleanups of blocked vCPU logic

   - partially allow KVM_SET_CPUID{,2} after KVM_RUN (5.16 regression)

   - various small fixes"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (46 commits)
  docs: kvm: fix WARNINGs from api.rst
  selftests: kvm/x86: Fix the warning in lib/x86_64/processor.c
  selftests: kvm/x86: Fix the warning in pmu_event_filter_test.c
  kvm: selftests: Do not indent with spaces
  kvm: selftests: sync uapi/linux/kvm.h with Linux header
  selftests: kvm: add amx_test to .gitignore
  KVM: SVM: Nullify vcpu_(un)blocking() hooks if AVIC is disabled
  KVM: SVM: Move svm_hardware_setup() and its helpers below svm_x86_ops
  KVM: SVM: Drop AVIC's intermediate avic_set_running() helper
  KVM: VMX: Don't do full kick when handling posted interrupt wakeup
  KVM: VMX: Fold fallback path into triggering posted IRQ helper
  KVM: VMX: Pass desired vector instead of bool for triggering posted IRQ
  KVM: VMX: Don't do full kick when triggering posted interrupt "fails"
  KVM: SVM: Skip AVIC and IRTE updates when loading blocking vCPU
  KVM: SVM: Use kvm_vcpu_is_blocking() in AVIC load to handle preemption
  KVM: SVM: Remove unnecessary APICv/AVIC update in vCPU unblocking path
  KVM: SVM: Don't bother checking for "running" AVIC when kicking for IPIs
  KVM: SVM: Signal AVIC doorbell iff vCPU is in guest mode
  KVM: x86: Remove defunct pre_block/post_block kvm_x86_ops hooks
  KVM: x86: Unexport LAPIC's switch_to_{hv,sw}_timer() helpers
  ...
2022-01-22 09:40:01 +02:00
Sean Christopherson
635e6357f9 KVM: VMX: Don't do full kick when handling posted interrupt wakeup
When waking vCPUs in the posted interrupt wakeup handling, do exactly
that and no more.  There is no need to kick the vCPU as the wakeup
handler just needs to get the vCPU task running, and if it's in the guest
then it's definitely running.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20211208015236.1616697-21-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-01-19 12:14:47 -05:00
Sean Christopherson
ccf8d68754 KVM: VMX: Fold fallback path into triggering posted IRQ helper
Move the fallback "wake_up" path into the helper to trigger posted
interrupt helper now that the nested and non-nested paths are identical.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20211208015236.1616697-20-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-01-19 12:14:46 -05:00
Sean Christopherson
296aa26644 KVM: VMX: Pass desired vector instead of bool for triggering posted IRQ
Refactor the posted interrupt helper to take the desired notification
vector instead of a bool so that the callers are self-documenting.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20211208015236.1616697-19-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-01-19 12:14:46 -05:00
Sean Christopherson
0f65a9d337 KVM: VMX: Don't do full kick when triggering posted interrupt "fails"
Replace the full "kick" with just the "wake" in the fallback path when
triggering a virtual interrupt via a posted interrupt fails because the
guest is not IN_GUEST_MODE.  If the guest transitions into guest mode
between the check and the kick, then it's guaranteed to see the pending
interrupt as KVM syncs the PIR to IRR (and onto GUEST_RVI) after setting
IN_GUEST_MODE.  Kicking the guest in this case is nothing more than an
unnecessary VM-Exit (and host IRQ).

Opportunistically update comments to explain the various ordering rules
and barriers at play.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20211208015236.1616697-17-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-01-19 12:14:45 -05:00
Sean Christopherson
c3e8abf0f3 KVM: x86: Remove defunct pre_block/post_block kvm_x86_ops hooks
Drop kvm_x86_ops' pre/post_block() now that all implementations are nops.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20211208015236.1616697-10-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-01-19 12:14:40 -05:00
Sean Christopherson
98c25ead5e KVM: VMX: Move preemption timer <=> hrtimer dance to common x86
Handle the switch to/from the hypervisor/software timer when a vCPU is
blocking in common x86 instead of in VMX.  Even though VMX is the only
user of a hypervisor timer, the logic and all functions involved are
generic x86 (unless future CPUs do something completely different and
implement a hypervisor timer that runs regardless of mode).

Handling the switch in common x86 will allow for the elimination of the
pre/post_blocks hooks, and also lets KVM switch back to the hypervisor
timer if and only if it was in use (without additional params).  Add a
comment explaining why the switch cannot be deferred to kvm_sched_out()
or kvm_vcpu_block().

Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20211208015236.1616697-8-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-01-19 12:14:39 -05:00
Sean Christopherson
12a8eee568 KVM: Move x86 VMX's posted interrupt list_head to vcpu_vmx
Move the seemingly generic block_vcpu_list from kvm_vcpu to vcpu_vmx, and
rename the list and all associated variables to clarify that it tracks
the set of vCPU that need to be poked on a posted interrupt to the wakeup
vector.  The list is not used to track _all_ vCPUs that are blocking, and
the term "blocked" can be misleading as it may refer to a blocking
condition in the host or the guest, where as the PI wakeup case is
specifically for the vCPUs that are actively blocking from within the
guest.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20211208015236.1616697-7-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-01-19 12:14:38 -05:00
Sean Christopherson
d76fb40637 KVM: VMX: Handle PI descriptor updates during vcpu_put/load
Move the posted interrupt pre/post_block logic into vcpu_put/load
respectively, using the kvm_vcpu_is_blocking() to determining whether or
not the wakeup handler needs to be set (and unset).  This avoids updating
the PI descriptor if halt-polling is successful, reduces the number of
touchpoints for updating the descriptor, and eliminates the confusing
behavior of intentionally leaving a "stale" PI.NDST when a blocking vCPU
is scheduled back in after preemption.

The downside is that KVM will do the PID update twice if the vCPU is
preempted after prepare_to_rcuwait() but before schedule(), but that's a
rare case (and non-existent on !PREEMPT kernels).

The notable wart is the need to send a self-IPI on the wakeup vector if
an outstanding notification is pending after configuring the wakeup
vector.  Ideally, KVM would just do a kvm_vcpu_wake_up() in this case,
but the scheduler doesn't support waking a task from its preemption
notifier callback, i.e. while the task is right in the middle of
being scheduled out.

Note, setting the wakeup vector before halt-polling is not necessary:
once the pending IRQ will be recorded in the PIR, kvm_vcpu_has_events()
will detect this (via kvm_cpu_get_interrupt(), kvm_apic_get_interrupt(),
apic_has_interrupt_for_ppr() and finally vmx_sync_pir_to_irr()) and
terminate the polling.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20211208015236.1616697-5-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-01-19 12:14:37 -05:00
Paolo Bonzini
4f5a884fc2 Merge branch 'kvm-pi-raw-spinlock' into HEAD
Bring in fix for VT-d posted interrupts before further changing the code in 5.17.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-01-19 12:14:02 -05:00
Sean Christopherson
fc4fad79fc KVM: VMX: Reject KVM_RUN if emulation is required with pending exception
Reject KVM_RUN if emulation is required (because VMX is running without
unrestricted guest) and an exception is pending, as KVM doesn't support
emulating exceptions except when emulating real mode via vm86.  The vCPU
is hosed either way, but letting KVM_RUN proceed triggers a WARN due to
the impossible condition.  Alternatively, the WARN could be removed, but
then userspace and/or KVM bugs would result in the vCPU silently running
in a bad state, which isn't very friendly to users.

Originally, the bug was hit by syzkaller with a nested guest as that
doesn't require kvm_intel.unrestricted_guest=0.  That particular flavor
is likely fixed by commit cd0e615c49 ("KVM: nVMX: Synthesize
TRIPLE_FAULT for L2 if emulation is required"), but it's trivial to
trigger the WARN with a non-nested guest, and userspace can likely force
bad state via ioctls() for a nested guest as well.

Checking for the impossible condition needs to be deferred until KVM_RUN
because KVM can't force specific ordering between ioctls.  E.g. clearing
exception.pending in KVM_SET_SREGS doesn't prevent userspace from setting
it in KVM_SET_VCPU_EVENTS, and disallowing KVM_SET_VCPU_EVENTS with
emulation_required would prevent userspace from queuing an exception and
then stuffing sregs.  Note, if KVM were to try and detect/prevent the
condition prior to KVM_RUN, handle_invalid_guest_state() and/or
handle_emulation_failure() would need to be modified to clear the pending
exception prior to exiting to userspace.

 ------------[ cut here ]------------
 WARNING: CPU: 6 PID: 137812 at arch/x86/kvm/vmx/vmx.c:1623 vmx_queue_exception+0x14f/0x160 [kvm_intel]
 CPU: 6 PID: 137812 Comm: vmx_invalid_nes Not tainted 5.15.2-7cc36c3e14ae-pop #279
 Hardware name: ASUS Q87M-E/Q87M-E, BIOS 1102 03/03/2014
 RIP: 0010:vmx_queue_exception+0x14f/0x160 [kvm_intel]
 Code: <0f> 0b e9 fd fe ff ff 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00
 RSP: 0018:ffffa45c83577d38 EFLAGS: 00010202
 RAX: 0000000000000003 RBX: 0000000080000006 RCX: 0000000000000006
 RDX: 0000000000000000 RSI: 0000000000010002 RDI: ffff9916af734000
 RBP: ffff9916af734000 R08: 0000000000000000 R09: 0000000000000000
 R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000006
 R13: 0000000000000000 R14: ffff9916af734038 R15: 0000000000000000
 FS:  00007f1e1a47c740(0000) GS:ffff99188fb80000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 00007f1e1a6a8008 CR3: 000000026f83b005 CR4: 00000000001726e0
 Call Trace:
  kvm_arch_vcpu_ioctl_run+0x13a2/0x1f20 [kvm]
  kvm_vcpu_ioctl+0x279/0x690 [kvm]
  __x64_sys_ioctl+0x83/0xb0
  do_syscall_64+0x3b/0xc0
  entry_SYSCALL_64_after_hwframe+0x44/0xae

Reported-by: syzbot+82112403ace4cbd780d8@syzkaller.appspotmail.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20211228232437.1875318-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-01-19 12:12:25 -05:00
Marcelo Tosatti
5f02ef741a KVM: VMX: switch blocked_vcpu_on_cpu_lock to raw spinlock
blocked_vcpu_on_cpu_lock is taken from hard interrupt context
(pi_wakeup_handler), therefore it cannot sleep.

Switch it to a raw spinlock.

Fixes:

[41297.066254] BUG: scheduling while atomic: CPU 0/KVM/635218/0x00010001
[41297.066323] Preemption disabled at:
[41297.066324] [<ffffffff902ee47f>] irq_enter_rcu+0xf/0x60
[41297.066339] Call Trace:
[41297.066342]  <IRQ>
[41297.066346]  dump_stack_lvl+0x34/0x44
[41297.066353]  ? irq_enter_rcu+0xf/0x60
[41297.066356]  __schedule_bug.cold+0x7d/0x8b
[41297.066361]  __schedule+0x439/0x5b0
[41297.066365]  ? task_blocks_on_rt_mutex.constprop.0.isra.0+0x1b0/0x440
[41297.066369]  schedule_rtlock+0x1e/0x40
[41297.066371]  rtlock_slowlock_locked+0xf1/0x260
[41297.066374]  rt_spin_lock+0x3b/0x60
[41297.066378]  pi_wakeup_handler+0x31/0x90 [kvm_intel]
[41297.066388]  sysvec_kvm_posted_intr_wakeup_ipi+0x9d/0xd0
[41297.066392]  </IRQ>
[41297.066392]  asm_sysvec_kvm_posted_intr_wakeup_ipi+0x12/0x20
...

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-01-18 04:44:08 -05:00
Like Xu
4732f2444a KVM: x86: Making the module parameter of vPMU more common
The new module parameter to control PMU virtualization should apply
to Intel as well as AMD, for situations where userspace is not trusted.
If the module parameter allows PMU virtualization, there could be a
new KVM_CAP or guest CPUID bits whereby userspace can enable/disable
PMU virtualization on a per-VM basis.

If the module parameter does not allow PMU virtualization, there
should be no userspace override, since we have no precedent for
authorizing that kind of override. If it's false, other counter-based
profiling features (such as LBR including the associated CPUID bits
if any) will not be exposed.

Change its name from "pmu" to "enable_pmu" as we have temporary
variables with the same name in our code like "struct kvm_pmu *pmu".

Fixes: b1d66dad65 ("KVM: x86/svm: Add module param to control PMU virtualization")
Suggested-by : Jim Mattson <jmattson@google.com>
Signed-off-by: Like Xu <likexu@tencent.com>
Message-Id: <20220111073823.21885-1-likexu@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-01-17 12:56:03 -05:00
Like Xu
a21864486f KVM: x86/pmu: Fix available_event_types check for REF_CPU_CYCLES event
According to CPUID 0x0A.EBX bit vector, the event [7] should be the
unrealized event "Topdown Slots" instead of the *kernel* generalized
common hardware event "REF_CPU_CYCLES", so we need to skip the cpuid
unavaliblity check in the intel_pmc_perf_hw_id() for the last
REF_CPU_CYCLES event and update the confusing comment.

If the event is marked as unavailable in the Intel guest CPUID
0AH.EBX leaf, we need to avoid any perf_event creation, whether
it's a gp or fixed counter. To distinguish whether it is a rejected
event or an event that needs to be programmed with PERF_TYPE_RAW type,
a new special returned value of "PERF_COUNT_HW_MAX + 1" is introduced.

Fixes: 62079d8a43 ("KVM: PMU: add proper support for fixed counter 2")
Signed-off-by: Like Xu <likexu@tencent.com>
Message-Id: <20220105051509.69437-1-likexu@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-01-17 12:19:41 -05:00
Linus Torvalds
79e06c4c49 RISCV:
- Use common KVM implementation of MMU memory caches
 
 - SBI v0.2 support for Guest
 
 - Initial KVM selftests support
 
 - Fix to avoid spurious virtual interrupts after clearing hideleg CSR
 
 - Update email address for Anup and Atish
 
 ARM:
 - Simplification of the 'vcpu first run' by integrating it into
   KVM's 'pid change' flow
 
 - Refactoring of the FP and SVE state tracking, also leading to
   a simpler state and less shared data between EL1 and EL2 in
   the nVHE case
 
 - Tidy up the header file usage for the nvhe hyp object
 
 - New HYP unsharing mechanism, finally allowing pages to be
   unmapped from the Stage-1 EL2 page-tables
 
 - Various pKVM cleanups around refcounting and sharing
 
 - A couple of vgic fixes for bugs that would trigger once
   the vcpu xarray rework is merged, but not sooner
 
 - Add minimal support for ARMv8.7's PMU extension
 
 - Rework kvm_pgtable initialisation ahead of the NV work
 
 - New selftest for IRQ injection
 
 - Teach selftests about the lack of default IPA space and
   page sizes
 
 - Expand sysreg selftest to deal with Pointer Authentication
 
 - The usual bunch of cleanups and doc update
 
 s390:
 - fix sigp sense/start/stop/inconsistency
 
 - cleanups
 
 x86:
 - Clean up some function prototypes more
 
 - improved gfn_to_pfn_cache with proper invalidation, used by Xen emulation
 
 - add KVM_IRQ_ROUTING_XEN_EVTCHN and event channel delivery
 
 - completely remove potential TOC/TOU races in nested SVM consistency checks
 
 - update some PMCs on emulated instructions
 
 - Intel AMX support (joint work between Thomas and Intel)
 
 - large MMU cleanups
 
 - module parameter to disable PMU virtualization
 
 - cleanup register cache
 
 - first part of halt handling cleanups
 
 - Hyper-V enlightened MSR bitmap support for nested hypervisors
 
 Generic:
 - clean up Makefiles
 
 - introduce CONFIG_HAVE_KVM_DIRTY_RING
 
 - optimize memslot lookup using a tree
 
 - optimize vCPU array usage by converting to xarray
 -----BEGIN PGP SIGNATURE-----
 
 iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmHhxvsUHHBib256aW5p
 QHJlZGhhdC5jb20ACgkQv/vSX3jHroPZkAf+Nz92UL/5nNGcdHtE4m7AToMmitE9
 bYkesf9BMQvAe5wjkABLuoHGi6ay4jabo4fiGzbdkiK7lO5YgfsWiMB3/MT5fl4E
 jRPzaVQabp3YZLM8UYCBmfUVuRj524S967SfSRe0AvYjDEH8y7klPf4+7sCsFT0/
 Px9Vf2KGuOlf0eM78yKg4rGaF0jS22eLgXm6FfNMY8/e29ZAo/jyUmqBY+Z2xxZG
 aWhceDtSheW1jwLHLj3nOlQJvHTn8LVGXBE/R8Gda3ZjrBV2rKaDi4Fh+HD+dz86
 2zVXwzQ7uck2CMW73GMoXMTWoKSHMyvlBOs1BdvBm4UsnGcXR+q8IFCeuQ==
 =s73m
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull kvm updates from Paolo Bonzini:
 "RISCV:

   - Use common KVM implementation of MMU memory caches

   - SBI v0.2 support for Guest

   - Initial KVM selftests support

   - Fix to avoid spurious virtual interrupts after clearing hideleg CSR

   - Update email address for Anup and Atish

  ARM:

   - Simplification of the 'vcpu first run' by integrating it into KVM's
     'pid change' flow

   - Refactoring of the FP and SVE state tracking, also leading to a
     simpler state and less shared data between EL1 and EL2 in the nVHE
     case

   - Tidy up the header file usage for the nvhe hyp object

   - New HYP unsharing mechanism, finally allowing pages to be unmapped
     from the Stage-1 EL2 page-tables

   - Various pKVM cleanups around refcounting and sharing

   - A couple of vgic fixes for bugs that would trigger once the vcpu
     xarray rework is merged, but not sooner

   - Add minimal support for ARMv8.7's PMU extension

   - Rework kvm_pgtable initialisation ahead of the NV work

   - New selftest for IRQ injection

   - Teach selftests about the lack of default IPA space and page sizes

   - Expand sysreg selftest to deal with Pointer Authentication

   - The usual bunch of cleanups and doc update

  s390:

   - fix sigp sense/start/stop/inconsistency

   - cleanups

  x86:

   - Clean up some function prototypes more

   - improved gfn_to_pfn_cache with proper invalidation, used by Xen
     emulation

   - add KVM_IRQ_ROUTING_XEN_EVTCHN and event channel delivery

   - completely remove potential TOC/TOU races in nested SVM consistency
     checks

   - update some PMCs on emulated instructions

   - Intel AMX support (joint work between Thomas and Intel)

   - large MMU cleanups

   - module parameter to disable PMU virtualization

   - cleanup register cache

   - first part of halt handling cleanups

   - Hyper-V enlightened MSR bitmap support for nested hypervisors

  Generic:

   - clean up Makefiles

   - introduce CONFIG_HAVE_KVM_DIRTY_RING

   - optimize memslot lookup using a tree

   - optimize vCPU array usage by converting to xarray"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (268 commits)
  x86/fpu: Fix inline prefix warnings
  selftest: kvm: Add amx selftest
  selftest: kvm: Move struct kvm_x86_state to header
  selftest: kvm: Reorder vcpu_load_state steps for AMX
  kvm: x86: Disable interception for IA32_XFD on demand
  x86/fpu: Provide fpu_sync_guest_vmexit_xfd_state()
  kvm: selftests: Add support for KVM_CAP_XSAVE2
  kvm: x86: Add support for getting/setting expanded xstate buffer
  x86/fpu: Add uabi_size to guest_fpu
  kvm: x86: Add CPUID support for Intel AMX
  kvm: x86: Add XCR0 support for Intel AMX
  kvm: x86: Disable RDMSR interception of IA32_XFD_ERR
  kvm: x86: Emulate IA32_XFD_ERR for guest
  kvm: x86: Intercept #NM for saving IA32_XFD_ERR
  x86/fpu: Prepare xfd_err in struct fpu_guest
  kvm: x86: Add emulation for IA32_XFD
  x86/fpu: Provide fpu_update_guest_xfd() for IA32_XFD emulation
  kvm: x86: Enable dynamic xfeatures at KVM_SET_CPUID2
  x86/fpu: Provide fpu_enable_guest_xfd_features() for KVM
  x86/fpu: Add guest support to xfd_enable_feature()
  ...
2022-01-16 16:15:14 +02:00
Kevin Tian
b5274b1b7b kvm: x86: Disable interception for IA32_XFD on demand
Always intercepting IA32_XFD causes non-negligible overhead when this
register is updated frequently in the guest.

Disable r/w emulation after intercepting the first WRMSR(IA32_XFD)
with a non-zero value.

Disable WRMSR emulation implies that IA32_XFD becomes out-of-sync
with the software states in fpstate and the per-cpu xfd cache. This
leads to two additional changes accordingly:

  - Call fpu_sync_guest_vmexit_xfd_state() after vm-exit to bring
    software states back in-sync with the MSR, before handle_exit_irqoff()
    is called.

  - Always trap #NM once write interception is disabled for IA32_XFD.
    The #NM exception is rare if the guest doesn't use dynamic
    features. Otherwise, there is at most one exception per guest
    task given a dynamic feature.

p.s. We have confirmed that SDM is being revised to say that
when setting IA32_XFD[18] the AMX register state is not guaranteed
to be preserved. This clarification avoids adding mess for a creative
guest which sets IA32_XFD[18]=1 before saving active AMX state to
its own storage.

Signed-off-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jing Liu <jing2.liu@intel.com>
Signed-off-by: Yang Zhong <yang.zhong@intel.com>
Message-Id: <20220105123532.12586-22-yang.zhong@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-01-14 13:44:43 -05:00
Jing Liu
61f208134a kvm: x86: Disable RDMSR interception of IA32_XFD_ERR
This saves one unnecessary VM-exit in guest #NM handler, given that the
MSR is already restored with the guest value before the guest is resumed.

Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Jing Liu <jing2.liu@intel.com>
Signed-off-by: Yang Zhong <yang.zhong@intel.com>
Message-Id: <20220105123532.12586-15-yang.zhong@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-01-14 13:44:33 -05:00
Jing Liu
ec5be88ab2 kvm: x86: Intercept #NM for saving IA32_XFD_ERR
Guest IA32_XFD_ERR is generally modified in two places:

  - Set by CPU when #NM is triggered;
  - Cleared by guest in its #NM handler;

Intercept #NM for the first case when a nonzero value is written
to IA32_XFD. Nonzero indicates that the guest is willing to do
dynamic fpstate expansion for certain xfeatures, thus KVM needs to
manage and virtualize guest XFD_ERR properly. The vcpu exception
bitmap is updated in XFD write emulation according to guest_fpu::xfd.

Save the current XFD_ERR value to the guest_fpu container in the #NM
VM-exit handler. This must be done with interrupt disabled, otherwise
the unsaved MSR value may be clobbered by host activity.

The saving operation is conducted conditionally only when guest_fpu:xfd
includes a non-zero value. Doing so also avoids misread on a platform
which doesn't support XFD but #NM is triggered due to L1 interception.

Queueing #NM to the guest is postponed to handle_exception_nmi(). This
goes through the nested_vmx check so a virtual vmexit is queued instead
when #NM is triggered in L2 but L1 wants to intercept it.

Restore the host value (always ZERO outside of the host #NM
handler) before enabling interrupt.

Restore the guest value from the guest_fpu container right before
entering the guest (with interrupt disabled).

Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Jing Liu <jing2.liu@intel.com>
Signed-off-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Yang Zhong <yang.zhong@intel.com>
Message-Id: <20220105123532.12586-13-yang.zhong@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-01-14 13:44:18 -05:00
Linus Torvalds
64ad946152 - Get rid of all the .fixup sections because this generates
misleading/wrong stacktraces and confuse RELIABLE_STACKTRACE and
 LIVEPATCH as the backtrace misses the function which is being fixed up.
 
 - Add Straight Light Speculation mitigation support which uses a new
 compiler switch -mharden-sls= which sticks an INT3 after a RET or an
 indirect branch in order to block speculation after them. Reportedly,
 CPUs do speculate behind such insns.
 
 - The usual set of cleanups and improvements
 -----BEGIN PGP SIGNATURE-----
 
 iQIyBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmHfKA0ACgkQEsHwGGHe
 VUqLJg/2I2X2xXr5filJVaK+sQgmvDzk67DKnbxRBW2xcPF+B5sSW5yhe3G5UPW7
 SJVdhQ3gHcTiliGGlBf/VE7KXbqxFN0vO4/VFHZm78r43g7OrXTxz6WXXQRJ1n67
 U3YwRH3b6cqXZNFMs+X4bJt6qsGJM1kdTTZ2as4aERnaFr5AOAfQvfKbyhxLe/XA
 3SakfYISVKCBQ2RkTfpMpwmqlsatGFhTC5IrvuDQ83dDsM7O+Dx1J6Gu3fwjKmie
 iVzPOjCh+xTpZQp/SIZmt7MzoduZvpSym4YVyHvEnMiexQT4AmyaRthWqrhnEXY/
 qOvj8/XIqxmix8EaooGqRIK0Y2ZegxkPckNFzaeC3lsWohwMIGIhNXwHNEeuhNyH
 yvNGAW9Cq6NeDRgz5MRUXcimYw4P4oQKYLObS1WqFZhNMqm4sNtoEAYpai/lPYfs
 zUDckgXF2AoPOsSqy3hFAVaGovAgzfDaJVzkt0Lk4kzzjX2WQiNLhmiior460w+K
 0l2Iej58IajSp3MkWmFH368Jo8YfUVmkjbbpsmjsBppA08e1xamJB7RmswI/Ezj6
 s5re6UioCD+UYdjWx41kgbvYdvIkkZ2RLrktoZd/hqHrOLWEIiwEbyFO2nRFJIAh
 YjvPkB1p7iNuAeYcP1x9Ft9GNYVIsUlJ+hK86wtFCqy+abV+zQ==
 =R52z
 -----END PGP SIGNATURE-----

Merge tag 'x86_core_for_v5.17_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 core updates from Borislav Petkov:

 - Get rid of all the .fixup sections because this generates
   misleading/wrong stacktraces and confuse RELIABLE_STACKTRACE and
   LIVEPATCH as the backtrace misses the function which is being fixed
   up.

 - Add Straight Line Speculation mitigation support which uses a new
   compiler switch -mharden-sls= which sticks an INT3 after a RET or an
   indirect branch in order to block speculation after them. Reportedly,
   CPUs do speculate behind such insns.

 - The usual set of cleanups and improvements

* tag 'x86_core_for_v5.17_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (32 commits)
  x86/entry_32: Fix segment exceptions
  objtool: Remove .fixup handling
  x86: Remove .fixup section
  x86/word-at-a-time: Remove .fixup usage
  x86/usercopy: Remove .fixup usage
  x86/usercopy_32: Simplify __copy_user_intel_nocache()
  x86/sgx: Remove .fixup usage
  x86/checksum_32: Remove .fixup usage
  x86/vmx: Remove .fixup usage
  x86/kvm: Remove .fixup usage
  x86/segment: Remove .fixup usage
  x86/fpu: Remove .fixup usage
  x86/xen: Remove .fixup usage
  x86/uaccess: Remove .fixup usage
  x86/futex: Remove .fixup usage
  x86/msr: Remove .fixup usage
  x86/extable: Extend extable functionality
  x86/entry_32: Remove .fixup usage
  x86/entry_64: Remove .fixup usage
  x86/copy_mc_64: Remove .fixup usage
  ...
2022-01-12 16:31:19 -08:00
Linus Torvalds
8e5b0adeea Peter Zijlstra says:
"Cleanup of the perf/kvm interaction."
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmHdvbkACgkQEsHwGGHe
 VUrX7w/9FwKUm0WlGcQIAOSdWk85N2qAVH3brYcQHNpTCVe68TOqTCrxCDrGgyUq
 2XnCOim99MUlnsVU6QRZqF4yJ8S1tGrc0COJ/qR4SGntucu0oYuDe2aMVq+mWUD7
 /IThA0oMRfhki9WwAyUuyCrXzk4blZdlrXyYIRMJGl9xeGNy3cvUtU8f68Kiy22E
 OcmQ/o9Etsr38dueAMU1KYEmgSTvG47rS8nfyRUu3QpJHbyLmRXH32PQrm3tduxS
 Bw3gMAH5vqq1UDZJ8ZvsPsO0vFX7dtnKEwEKz4qdtRWk9gi8oLGHIwIXC+VtNqpf
 mCmX33Jw8uFz9h3JhE84J0j/CgsWHoU6MOs0MOch4Tb69/BfCjQnw1enImhejG8q
 YEIDjJf/vgRNaw9PYshiTHT+EJTe9inT3S4eK/ynLRDUEslAqyWZZm7bUE/XrEDi
 yRyGIxry/hNZVvRkXT9QBw32fpgnIH2NAMPLEjJSGCRxT89Tfqz0aRDfacCuHTTh
 P8pAeiDuy/6RkDlQckOZJWOFFh2IHsykX2l3IJcHqVRqt4ob9b+SZB5qoH/Mv9qb
 MSAqdFUupYZFC+6XuPAeX5/Mo+wSkP+pYYSbWNxjUa0yNiYecOjE7/8T2SB2y6Mx
 lk2L0ypsZUYSmpHSfvOdPmf6ucj19/5B4+VCX6PQfcNJTnvvhTE=
 =tU5G
 -----END PGP SIGNATURE-----

Merge tag 'perf_core_for_v5.17_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull perf updates from Borislav Petkov:
 "Cleanup of the perf/kvm interaction."

* tag 'perf_core_for_v5.17_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf: Drop guest callback (un)register stubs
  KVM: arm64: Drop perf.c and fold its tiny bits of code into arm.c
  KVM: arm64: Hide kvm_arm_pmu_available behind CONFIG_HW_PERF_EVENTS=y
  KVM: arm64: Convert to the generic perf callbacks
  KVM: x86: Move Intel Processor Trace interrupt handler to vmx.c
  KVM: Move x86's perf guest info callbacks to generic KVM
  KVM: x86: More precisely identify NMI from guest when handling PMI
  KVM: x86: Drop current_vcpu for kvm_running_vcpu + kvm_arch_vcpu variable
  perf/core: Use static_call to optimize perf_guest_info_callbacks
  perf: Force architectures to opt-in to guest callbacks
  perf: Add wrappers for invoking guest callbacks
  perf/core: Rework guest callbacks to prepare for static_call support
  perf: Drop dead and useless guest "support" from arm, csky, nds32 and riscv
  perf: Stop pretending that perf can handle multiple guest callbacks
  KVM: x86: Register Processor Trace interrupt hook iff PT enabled in guest
  KVM: x86: Register perf callbacks after calling vendor's hardware_setup()
  perf: Protect perf_guest_cbs with RCU
2022-01-12 16:26:58 -08:00
Michael Roth
405329fc9a KVM: SVM: include CR3 in initial VMSA state for SEV-ES guests
Normally guests will set up CR3 themselves, but some guests, such as
kselftests, and potentially CONFIG_PVH guests, rely on being booted
with paging enabled and CR3 initialized to a pre-allocated page table.

Currently CR3 updates via KVM_SET_SREGS* are not loaded into the guest
VMCB until just prior to entering the guest. For SEV-ES/SEV-SNP, this
is too late, since it will have switched over to using the VMSA page
prior to that point, with the VMSA CR3 copied from the VMCB initial
CR3 value: 0.

Address this by sync'ing the CR3 value into the VMCB save area
immediately when KVM_SET_SREGS* is issued so it will find it's way into
the initial VMSA.

Suggested-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Michael Roth <michael.roth@amd.com>
Message-Id: <20211216171358.61140-10-michael.roth@amd.com>
[Remove vmx_post_set_cr3; add a remark about kvm_set_cr3 not calling the
 new hook. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-01-07 10:44:46 -05:00
Peter Zijlstra
907d139318 KVM: VMX: Provide vmread version using asm-goto-with-outputs
Use asm-goto-output for smaller fast path code.

Message-Id: <YbcbbGW2GcMx6KpD@hirez.programming.kicks-ass.net>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-01-07 10:44:46 -05:00
Eric Hankland
018d70ffcf KVM: x86: Update vPMCs when retiring branch instructions
When KVM retires a guest branch instruction through emulation,
increment any vPMCs that are configured to monitor "branch
instructions retired," and update the sample period of those counters
so that they will overflow at the right time.

Signed-off-by: Eric Hankland <ehankland@google.com>
[jmattson:
  - Split the code to increment "branch instructions retired" into a
    separate commit.
  - Moved/consolidated the calls to kvm_pmu_trigger_event() in the
    emulation of VMLAUNCH/VMRESUME to accommodate the evolution of
    that code.
]
Fixes: f5132b0138 ("KVM: Expose a version 2 architectural PMU to a guests")
Signed-off-by: Jim Mattson <jmattson@google.com>
Message-Id: <20211130074221.93635-7-likexu@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-01-07 10:44:43 -05:00
Like Xu
6ed1298eb0 KVM: x86/pmu: Reuse pmc_perf_hw_id() and drop find_fixed_event()
Since we set the same semantic event value for the fixed counter in
pmc->eventsel, returning the perf_hw_id for the fixed counter via
find_fixed_event() can be painlessly replaced by pmc_perf_hw_id()
with the help of pmc_is_fixed() check.

Signed-off-by: Like Xu <likexu@tencent.com>
Message-Id: <20211130074221.93635-4-likexu@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-01-07 10:44:42 -05:00
Like Xu
7c174f305c KVM: x86/pmu: Refactoring find_arch_event() to pmc_perf_hw_id()
The find_arch_event() returns a "unsigned int" value,
which is used by the pmc_reprogram_counter() to
program a PERF_TYPE_HARDWARE type perf_event.

The returned value is actually the kernel defined generic
perf_hw_id, let's rename it to pmc_perf_hw_id() with simpler
incoming parameters for better self-explanation.

Signed-off-by: Like Xu <likexu@tencent.com>
Message-Id: <20211130074221.93635-3-likexu@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-01-07 10:44:41 -05:00
Like Xu
761875634a KVM: x86/pmu: Setup pmc->eventsel for fixed PMCs
The current pmc->eventsel for fixed counter is underutilised. The
pmc->eventsel can be setup for all known available fixed counters
since we have mapping between fixed pmc index and
the intel_arch_events array.

Either gp or fixed counter, it will simplify the later checks for
consistency between eventsel and perf_hw_id.

Signed-off-by: Like Xu <likexu@tencent.com>
Message-Id: <20211130074221.93635-2-likexu@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-01-07 10:44:41 -05:00
Paolo Bonzini
006a0f0607 KVM: x86: avoid out of bounds indices for fixed performance counters
Because IceLake has 4 fixed performance counters but KVM only
supports 3, it is possible for reprogram_fixed_counters to pass
to reprogram_fixed_counter an index that is out of bounds for the
fixed_pmc_events array.

Ultimately intel_find_fixed_event, which is the only place that uses
fixed_pmc_events, handles this correctly because it checks against the
size of fixed_pmc_events anyway.  Every other place operates on the
fixed_counters[] array which is sized according to INTEL_PMC_MAX_FIXED.
However, it is cleaner if the unsupported performance counters are culled
early on in reprogram_fixed_counters.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-01-07 10:44:40 -05:00
Lai Jiangshan
5b61178cd2 KVM: VMX: Mark VCPU_EXREG_CR3 dirty when !CR0_PG -> CR0_PG if EPT + !URG
When !CR0_PG -> CR0_PG, vcpu->arch.cr3 becomes active, but GUEST_CR3 is
still vmx->ept_identity_map_addr if EPT + !URG.  So VCPU_EXREG_CR3 is
considered to be dirty and GUEST_CR3 needs to be updated in this case.

Reported-by: Maxim Levitsky <mlevitsk@redhat.com>
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Message-Id: <20211216021938.11752-4-jiangshanlai@gmail.com>
Fixes: c62c7bd4f9 ("KVM: VMX: Update vmcs.GUEST_CR3 only when the guest CR3 is dirty")
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-01-07 10:44:40 -05:00
Lai Jiangshan
a9f2705ec8 KVM: VMX: Save HOST_CR3 in vmx_set_host_fs_gs()
The host CR3 in the vcpu thread can only be changed when scheduling,
so commit 15ad9762d6 ("KVM: VMX: Save HOST_CR3 in vmx_prepare_switch_to_guest()")
changed vmx.c to only save it in vmx_prepare_switch_to_guest().

However, it also has to be synced in vmx_sync_vmcs_host_state() when switching VMCS.
vmx_set_host_fs_gs() is called in both places, so rename it to
vmx_set_vmcs_host_state() and make it update HOST_CR3.

Fixes: 15ad9762d6 ("KVM: VMX: Save HOST_CR3 in vmx_prepare_switch_to_guest()")
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Message-Id: <20211216021938.11752-2-jiangshanlai@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-01-07 10:44:39 -05:00
Paolo Bonzini
855fb0384a Merge remote-tracking branch 'kvm/master' into HEAD
Pick commit fdba608f15 ("KVM: VMX: Wake vCPU when delivering posted
IRQ even if vCPU == this vCPU").  In addition to fixing a bug, it
also aligns the non-nested and nested usage of triggering posted
interrupts, allowing for additional cleanups.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-12-21 12:51:09 -05:00
Sean Christopherson
fdba608f15 KVM: VMX: Wake vCPU when delivering posted IRQ even if vCPU == this vCPU
Drop a check that guards triggering a posted interrupt on the currently
running vCPU, and more importantly guards waking the target vCPU if
triggering a posted interrupt fails because the vCPU isn't IN_GUEST_MODE.
If a vIRQ is delivered from asynchronous context, the target vCPU can be
the currently running vCPU and can also be blocking, in which case
skipping kvm_vcpu_wake_up() is effectively dropping what is supposed to
be a wake event for the vCPU.

The "do nothing" logic when "vcpu == running_vcpu" mostly works only
because the majority of calls to ->deliver_posted_interrupt(), especially
when using posted interrupts, come from synchronous KVM context.  But if
a device is exposed to the guest using vfio-pci passthrough, the VFIO IRQ
and vCPU are bound to the same pCPU, and the IRQ is _not_ configured to
use posted interrupts, wake events from the device will be delivered to
KVM from IRQ context, e.g.

  vfio_msihandler()
  |
  |-> eventfd_signal()
      |
      |-> ...
          |
          |->  irqfd_wakeup()
               |
               |->kvm_arch_set_irq_inatomic()
                  |
                  |-> kvm_irq_delivery_to_apic_fast()
                      |
                      |-> kvm_apic_set_irq()

This also aligns the non-nested and nested usage of triggering posted
interrupts, and will allow for additional cleanups.

Fixes: 379a3c8ee4 ("KVM: VMX: Optimize posted-interrupt delivery for timer fastpath")
Cc: stable@vger.kernel.org
Reported-by: Longpeng (Mike) <longpeng2@huawei.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20211208015236.1616697-18-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-12-21 12:39:03 -05:00
Sean Christopherson
cd0e615c49 KVM: nVMX: Synthesize TRIPLE_FAULT for L2 if emulation is required
Synthesize a triple fault if L2 guest state is invalid at the time of
VM-Enter, which can happen if L1 modifies SMRAM or if userspace stuffs
guest state via ioctls(), e.g. KVM_SET_SREGS.  KVM should never emulate
invalid guest state, since from L1's perspective, it's architecturally
impossible for L2 to have invalid state while L2 is running in hardware.
E.g. attempts to set CR0 or CR4 to unsupported values will either VM-Exit
or #GP.

Modifying vCPU state via RSM+SMRAM and ioctl() are the only paths that
can trigger this scenario, as nested VM-Enter correctly rejects any
attempt to enter L2 with invalid state.

RSM is a straightforward case as (a) KVM follows AMD's SMRAM layout and
behavior, and (b) Intel's SDM states that loading reserved CR0/CR4 bits
via RSM results in shutdown, i.e. there is precedent for KVM's behavior.
Following AMD's SMRAM layout is important as AMD's layout saves/restores
the descriptor cache information, including CS.RPL and SS.RPL, and also
defines all the fields relevant to invalid guest state as read-only, i.e.
so long as the vCPU had valid state before the SMI, which is guaranteed
for L2, RSM will generate valid state unless SMRAM was modified.  Intel's
layout saves/restores only the selector, which means that scenarios where
the selector and cached RPL don't match, e.g. conforming code segments,
would yield invalid guest state.  Intel CPUs fudge around this issued by
stuffing SS.RPL and CS.RPL on RSM.  Per Intel's SDM on the "Default
Treatment of RSM", paraphrasing for brevity:

  IF internal storage indicates that the [CPU was post-VMXON]
  THEN
     enter VMX operation (root or non-root);
     restore VMX-critical state as defined in Section 34.14.1;
     set to their fixed values any bits in CR0 and CR4 whose values must
     be fixed in VMX operation [unless coming from an unrestricted guest];
     IF RFLAGS.VM = 0 AND (in VMX root operation OR the
        “unrestricted guest” VM-execution control is 0)
     THEN
       CS.RPL := SS.DPL;
       SS.RPL := SS.DPL;
     FI;
     restore current VMCS pointer;
  FI;

Note that Intel CPUs also overwrite the fixed CR0/CR4 bits, whereas KVM
will sythesize TRIPLE_FAULT in this scenario.  KVM's behavior is allowed
as both Intel and AMD define CR0/CR4 SMRAM fields as read-only, i.e. the
only way for CR0 and/or CR4 to have illegal values is if they were
modified by the L1 SMM handler, and Intel's SDM "SMRAM State Save Map"
section states "modifying these registers will result in unpredictable
behavior".

KVM's ioctl() behavior is less straightforward.  Because KVM allows
ioctls() to be executed in any order, rejecting an ioctl() if it would
result in invalid L2 guest state is not an option as KVM cannot know if
a future ioctl() would resolve the invalid state, e.g. KVM_SET_SREGS, or
drop the vCPU out of L2, e.g. KVM_SET_NESTED_STATE.  Ideally, KVM would
reject KVM_RUN if L2 contained invalid guest state, but that carries the
risk of a false positive, e.g. if RSM loaded invalid guest state and KVM
exited to userspace.  Setting a flag/request to detect such a scenario is
undesirable because (a) it's extremely unlikely to add value to KVM as a
whole, and (b) KVM would need to consider ioctl() interactions with such
a flag, e.g. if userspace migrated the vCPU while the flag were set.

Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20211207193006.120997-3-seanjc@google.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-12-20 08:06:54 -05:00
Sean Christopherson
a80dfc0259 KVM: VMX: Always clear vmx->fail on emulation_required
Revert a relatively recent change that set vmx->fail if the vCPU is in L2
and emulation_required is true, as that behavior is completely bogus.
Setting vmx->fail and synthesizing a VM-Exit is contradictory and wrong:

  (a) it's impossible to have both a VM-Fail and VM-Exit
  (b) vmcs.EXIT_REASON is not modified on VM-Fail
  (c) emulation_required refers to guest state and guest state checks are
      always VM-Exits, not VM-Fails.

For KVM specifically, emulation_required is handled before nested exits
in __vmx_handle_exit(), thus setting vmx->fail has no immediate effect,
i.e. KVM calls into handle_invalid_guest_state() and vmx->fail is ignored.
Setting vmx->fail can ultimately result in a WARN in nested_vmx_vmexit()
firing when tearing down the VM as KVM never expects vmx->fail to be set
when L2 is active, KVM always reflects those errors into L1.

  ------------[ cut here ]------------
  WARNING: CPU: 0 PID: 21158 at arch/x86/kvm/vmx/nested.c:4548
                                nested_vmx_vmexit+0x16bd/0x17e0
                                arch/x86/kvm/vmx/nested.c:4547
  Modules linked in:
  CPU: 0 PID: 21158 Comm: syz-executor.1 Not tainted 5.16.0-rc3-syzkaller #0
  Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
  RIP: 0010:nested_vmx_vmexit+0x16bd/0x17e0 arch/x86/kvm/vmx/nested.c:4547
  Code: <0f> 0b e9 2e f8 ff ff e8 57 b3 5d 00 0f 0b e9 00 f1 ff ff 89 e9 80
  Call Trace:
   vmx_leave_nested arch/x86/kvm/vmx/nested.c:6220 [inline]
   nested_vmx_free_vcpu+0x83/0xc0 arch/x86/kvm/vmx/nested.c:330
   vmx_free_vcpu+0x11f/0x2a0 arch/x86/kvm/vmx/vmx.c:6799
   kvm_arch_vcpu_destroy+0x6b/0x240 arch/x86/kvm/x86.c:10989
   kvm_vcpu_destroy+0x29/0x90 arch/x86/kvm/../../../virt/kvm/kvm_main.c:441
   kvm_free_vcpus arch/x86/kvm/x86.c:11426 [inline]
   kvm_arch_destroy_vm+0x3ef/0x6b0 arch/x86/kvm/x86.c:11545
   kvm_destroy_vm arch/x86/kvm/../../../virt/kvm/kvm_main.c:1189 [inline]
   kvm_put_kvm+0x751/0xe40 arch/x86/kvm/../../../virt/kvm/kvm_main.c:1220
   kvm_vcpu_release+0x53/0x60 arch/x86/kvm/../../../virt/kvm/kvm_main.c:3489
   __fput+0x3fc/0x870 fs/file_table.c:280
   task_work_run+0x146/0x1c0 kernel/task_work.c:164
   exit_task_work include/linux/task_work.h:32 [inline]
   do_exit+0x705/0x24f0 kernel/exit.c:832
   do_group_exit+0x168/0x2d0 kernel/exit.c:929
   get_signal+0x1740/0x2120 kernel/signal.c:2852
   arch_do_signal_or_restart+0x9c/0x730 arch/x86/kernel/signal.c:868
   handle_signal_work kernel/entry/common.c:148 [inline]
   exit_to_user_mode_loop kernel/entry/common.c:172 [inline]
   exit_to_user_mode_prepare+0x191/0x220 kernel/entry/common.c:207
   __syscall_exit_to_user_mode_work kernel/entry/common.c:289 [inline]
   syscall_exit_to_user_mode+0x2e/0x70 kernel/entry/common.c:300
   do_syscall_64+0x53/0xd0 arch/x86/entry/common.c:86
   entry_SYSCALL_64_after_hwframe+0x44/0xae

Fixes: c8607e4a08 ("KVM: x86: nVMX: don't fail nested VM entry on invalid guest state if !from_vmentry")
Reported-by: syzbot+f1d2136db9c80d4733e8@syzkaller.appspotmail.com
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20211207193006.120997-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-12-20 08:06:54 -05:00
Marc Orr
c5063551bf KVM: x86: Always set kvm_run->if_flag
The kvm_run struct's if_flag is a part of the userspace/kernel API. The
SEV-ES patches failed to set this flag because it's no longer needed by
QEMU (according to the comment in the source code). However, other
hypervisors may make use of this flag. Therefore, set the flag for
guests with encrypted registers (i.e., with guest_state_protected set).

Fixes: f1c6366e30 ("KVM: SVM: Add required changes to support intercepts under SEV-ES")
Signed-off-by: Marc Orr <marcorr@google.com>
Message-Id: <20211209155257.128747-1-marcorr@google.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
2021-12-20 08:06:53 -05:00
Peter Zijlstra
3e8ea7803a x86/vmx: Remove .fixup usage
In the vmread exceptin path, use the, thus far, unused output register
to push the @fault argument onto the stack. This, in turn, enables the
exception handler to not do pushes and only modify that register when
an exception does occur.

As noted by Sean the input constraint needs to be changed to "=&r" to
avoid the value and field occupying the same register.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Link: https://lore.kernel.org/r/20211110101325.781308550@infradead.org
2021-12-11 09:09:49 +01:00
Sean Christopherson
45af1bb99b KVM: VMX: Clean up PI pre/post-block WARNs
Move the WARN sanity checks out of the PI descriptor update loop so as
not to spam the kernel log if the condition is violated and the update
takes multiple attempts due to another writer.  This also eliminates a
few extra uops from the retry path.

Technically not checking every attempt could mean KVM will now fail to
WARN in a scenario that would have failed before, but any such failure
would be inherently racy as some other agent (CPU or device) would have
to concurrent modify the PI descriptor.

Add a helper to handle the actual write and more importantly to document
why the write may need to be retried.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20211208015236.1616697-4-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-12-09 09:30:47 -05:00
Sean Christopherson
83c98007d9 KVM: nVMX: Ensure vCPU honors event request if posting nested IRQ fails
Add a memory barrier between writing vcpu->requests and reading
vcpu->guest_mode to ensure the read is ordered after the write when
(potentially) delivering an IRQ to L2 via nested posted interrupt.  If
the request were to be completed after reading vcpu->mode, it would be
possible for the target vCPU to enter the guest without posting the
interrupt and without handling the event request.

Note, the barrier is only for documentation since atomic operations are
serializing on x86.

Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Fixes: 6b6977117f ("KVM: nVMX: Fix races when sending nested PI while dest enters/leaves L2")
Fixes: 705699a139 ("KVM: nVMX: Enable nested posted interrupt processing")
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20211208015236.1616697-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-12-09 09:30:46 -05:00
Vitaly Kuznetsov
502d2bf5f2 KVM: nVMX: Implement Enlightened MSR Bitmap feature
Updating MSR bitmap for L2 is not cheap and rearly needed. TLFS for Hyper-V
offers 'Enlightened MSR Bitmap' feature which allows L1 hypervisor to
inform L0 when it changes MSR bitmap, this eliminates the need to examine
L1's MSR bitmap for L2 every time when 'real' MSR bitmap for L2 gets
constructed.

Use 'vmx->nested.msr_bitmap_changed' flag to implement the feature.

Note, KVM already uses 'Enlightened MSR bitmap' feature when it runs as a
nested hypervisor on top of Hyper-V. The newly introduced feature is going
to be used by Hyper-V guests on KVM.

When the feature is enabled for Win10+WSL2, it shaves off around 700 CPU
cycles from a nested vmexit cost (tight cpuid loop test).

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20211129094704.326635-5-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-12-08 09:06:05 -05:00
Vitaly Kuznetsov
ed2a4800ae KVM: nVMX: Track whether changes in L0 require MSR bitmap for L2 to be rebuilt
Introduce a flag to keep track of whether MSR bitmap for L2 needs to be
rebuilt due to changes in MSR bitmap for L1 or switching to a different
L2. This information will be used for Enlightened MSR Bitmap feature for
Hyper-V guests.

Note, setting msr_bitmap_changed to 'true' from set_current_vmptr() is
not really needed for Enlightened MSR Bitmap as the feature can only
be used in conjunction with Enlightened VMCS but let's keep tracking
information complete, it's cheap and in the future similar PV feature can
easily be implemented for KVM on KVM too.

No functional change intended.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20211129094704.326635-4-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-12-08 09:06:04 -05:00
Vitaly Kuznetsov
b84155c380 KVM: VMX: Introduce vmx_msr_bitmap_l01_changed() helper
In preparation to enabling 'Enlightened MSR Bitmap' feature for Hyper-V
guests move MSR bitmap update tracking to a dedicated helper.

Note: vmx_msr_bitmap_l01_changed() is called when MSR bitmap might be
updated. KVM doesn't check if the bit we're trying to set is already set
(or the bit it's trying to clear is already cleared). Such situations
should not be common and a few false positives should not be a problem.

No functional change intended.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20211129094704.326635-3-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-12-08 09:06:04 -05:00
Peter Zijlstra
f94909ceb1 x86: Prepare asm files for straight-line-speculation
Replace all ret/retq instructions with RET in preparation of making
RET a macro. Since AS is case insensitive it's a big no-op without
RET defined.

  find arch/x86/ -name \*.S | while read file
  do
	sed -i 's/\<ret[q]*\>/RET/' $file
  done

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20211204134907.905503893@infradead.org
2021-12-08 12:25:37 +01:00
Paolo Bonzini
93b350f884 Merge branch 'kvm-on-hv-msrbm-fix' into HEAD
Merge bugfix for enlightened MSR Bitmap, before adding support
to KVM for exposing the feature to nested guests.
2021-12-08 05:30:48 -05:00
Vitaly Kuznetsov
250552b925 KVM: nVMX: Don't use Enlightened MSR Bitmap for L3
When KVM runs as a nested hypervisor on top of Hyper-V it uses Enlightened
VMCS and enables Enlightened MSR Bitmap feature for its L1s and L2s (which
are actually L2s and L3s from Hyper-V's perspective). When MSR bitmap is
updated, KVM has to reset HV_VMX_ENLIGHTENED_CLEAN_FIELD_MSR_BITMAP from
clean fields to make Hyper-V aware of the change. For KVM's L1s, this is
done in vmx_disable_intercept_for_msr()/vmx_enable_intercept_for_msr().
MSR bitmap for L2 is build in nested_vmx_prepare_msr_bitmap() by blending
MSR bitmap for L1 and L1's idea of MSR bitmap for L2. KVM, however, doesn't
check if the resulting bitmap is different and never cleans
HV_VMX_ENLIGHTENED_CLEAN_FIELD_MSR_BITMAP in eVMCS02. This is incorrect and
may result in Hyper-V missing the update.

The issue could've been solved by calling evmcs_touch_msr_bitmap() for
eVMCS02 from nested_vmx_prepare_msr_bitmap() unconditionally but doing so
would not give any performance benefits (compared to not using Enlightened
MSR Bitmap at all). 3-level nesting is also not a very common setup
nowadays.

Don't enable 'Enlightened MSR Bitmap' feature for KVM's L2s (real L3s) for
now.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20211129094704.326635-2-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-12-08 04:53:13 -05:00
Lai Jiangshan
2df4a5eb6c KVM: X86: Remove mmu parameter from load_pdptrs()
It uses vcpu->arch.walk_mmu always; nested EPT does not have PDPTRs,
and nested NPT treats them like all other non-leaf page table levels
instead of caching them.

Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Message-Id: <20211124122055.64424-11-jiangshanlai@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-12-08 04:25:14 -05:00
Lai Jiangshan
f8cd457f06 KVM: VMX: Use ept_caps_to_lpage_level() in hardware_setup()
Using ept_caps_to_lpage_level is simpler.

Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Message-Id: <20211124122055.64424-9-jiangshanlai@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-12-08 04:25:12 -05:00
Lai Jiangshan
cc022ae144 KVM: X86: Add parameter huge_page_level to kvm_init_shadow_ept_mmu()
The level of supported large page on nEPT affects the rsvds_bits_mask.

Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Message-Id: <20211124122055.64424-8-jiangshanlai@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-12-08 04:25:12 -05:00
Lai Jiangshan
fe26f91d30 KVM: VMX: Change comments about vmx_get_msr()
The variable name is changed in the code.

Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Message-Id: <20211118110814.2568-8-jiangshanlai@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-12-08 04:25:06 -05:00
Lai Jiangshan
ed07ef5a66 KVM: VMX: Use kvm_set_msr_common() for MSR_IA32_TSC_ADJUST in the default way
MSR_IA32_TSC_ADJUST can be left to the default way which also uese
kvm_set_msr_common().

Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Message-Id: <20211118110814.2568-7-jiangshanlai@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-12-08 04:25:06 -05:00
Lai Jiangshan
15ad9762d6 KVM: VMX: Save HOST_CR3 in vmx_prepare_switch_to_guest()
The host CR3 in the vcpu thread can only be changed when scheduling.
Moving the code in vmx_prepare_switch_to_guest() makes the code
simpler.

Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Message-Id: <20211118110814.2568-5-jiangshanlai@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-12-08 04:25:06 -05:00
Lai Jiangshan
3ab4ac877c KVM: VMX: Update msr value after kvm_set_user_return_msr() succeeds
Aoid earlier modification.

Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Message-Id: <20211118110814.2568-4-jiangshanlai@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-12-08 04:25:05 -05:00