Commit graph

1040 commits

Author SHA1 Message Date
Wanpeng Li
664f8e26b0 KVM: X86: Fix loss of exception which has not yet been injected
vmx_complete_interrupts() assumes that the exception is always injected,
so it can be dropped by kvm_clear_exception_queue().  However,
an exception cannot be injected immediately if it is: 1) originally
destined to a nested guest; 2) trapped to cause a vmexit; 3) happening
right after VMLAUNCH/VMRESUME, i.e. when nested_run_pending is true.

This patch applies to exceptions the same algorithm that is used for
NMIs, replacing exception.reinject with "exception.injected" (equivalent
to nmi_injected).

exception.pending now represents an exception that is queued and whose
side effects (e.g., updating RFLAGS.RF or DR7) have not been applied yet.
If exception.pending is true, the exception might also result in a nested
vmexit instead (in which case the side effects must not be applied).

exception.injected instead represents an exception that is going to be
injected into the guest at the next vmentry.
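
For illustration, a minimal sketch of the resulting per-vCPU exception
state; the field names follow the description above, while the exact
struct layout is assumed:

    /* Sketch of the reworked exception queue state (illustrative only). */
    struct kvm_queued_exception {
        bool pending;       /* queued; side effects not yet applied, and
                             * delivery may still become a nested vmexit */
        bool injected;      /* will be injected at the next vmentry */
        bool has_error_code;
        u8 nr;              /* exception vector */
        u32 error_code;
    };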

Reported-by: Radim Krčmář <rkrcmar@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-08-24 18:09:19 +02:00
Yu Zhang
fd8cb43373 KVM: MMU: Expose the LA57 feature to VM.
This patch exposes the 5-level page table feature to the VM.
At the same time, the canonical virtual address checking is
extended to support both 48-bit and 57-bit address widths.

Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-08-24 18:09:17 +02:00
Yu Zhang
855feb6736 KVM: MMU: Add 5 level EPT & Shadow page table support.
Extends the shadow paging code, so that a 5-level shadow page
table can be constructed if the VM is running in 5-level paging
mode.

Also extends the EPT code, so that a 5-level EPT table can be
constructed if the VM's MAXPHYADDR exceeds 48 bits. Unlike the
shadow logic, KVM should still use a 4-level EPT table for a VM
whose physical address width does not exceed 48 bits, even when
the VM is running in 5-level paging mode.

Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
[Unconditionally reset the MMU context in kvm_cpuid_update.
 Changing MAXPHYADDR invalidates the reserved bit bitmasks.
 - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-08-24 18:09:17 +02:00
Yu Zhang
2a7266a8f9 KVM: MMU: Rename PT64_ROOT_LEVEL to PT64_ROOT_4LEVEL.
Now that we have both 4-level and 5-level page tables in 64-bit
long mode, rename PT64_ROOT_LEVEL to PT64_ROOT_4LEVEL; we can
then use PT64_ROOT_5LEVEL for the 5-level page table, which
makes the code clearer.

Also, PT64_ROOT_MAX_LEVEL is defined as 4, so that we can simply
redefine it to 5 when 5-level paging support is added.
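
A sketch of the resulting definitions (the values follow from the
description above; treat this as illustrative, not the verbatim patch):

    #define PT64_ROOT_4LEVEL    4   /* was PT64_ROOT_LEVEL */
    #define PT64_ROOT_5LEVEL    5   /* for 5-level page tables */
    #define PT64_ROOT_MAX_LEVEL PT64_ROOT_4LEVEL /* redefined to 5 later */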

Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-08-24 18:09:16 +02:00
Yu Zhang
d1cd3ce900 KVM: MMU: check guest CR3 reserved bits based on its physical address width.
Currently, KVM uses CR3_L_MODE_RESERVED_BITS to check the
reserved bits in CR3. Yet the length of reserved bits in
guest CR3 should be based on the physical address width
exposed to the VM. This patch changes CR3 check logic to
calculate the reserved bits at runtime.
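
Conceptually the runtime check looks like this; rsvd_bits() and
cpuid_maxphyaddr() are existing KVM helpers, though the exact call site
is assumed:

    /* Reject CR3 values that set bits above the guest's MAXPHYADDR. */
    if (cr3 & rsvd_bits(cpuid_maxphyaddr(vcpu), 63))
        return 1;   /* caller injects #GP */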

Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-08-24 18:09:16 +02:00
Brijesh Singh
618232e219 KVM: x86: Avoid guest page table walk when gpa_available is set
When a guest causes a page fault which requires emulation, the
vcpu->arch.gpa_available flag is set to indicate that cr2 contains a
valid GPA.

Currently, emulator_read_write_onepage() makes use of the gpa_available
flag to avoid a guest page walk for known MMIO regions. Let's not limit
the gpa_available optimization to just MMIO regions: this patch extends
the check to avoid the page walk whenever the gpa_available flag is set.
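
In emulator_read_write_onepage() the idea amounts to roughly the
following; a hedged sketch, with the surrounding details assumed:

    /* The exit already provided a valid GPA in cr2: skip the page walk. */
    if (vcpu->arch.gpa_available &&
        (addr & ~PAGE_MASK) == (vcpu->arch.cr2 & ~PAGE_MASK))
        gpa = vcpu->arch.cr2;
    else
        ret = vcpu_mmio_gva_to_gpa(vcpu, addr, &gpa, exception, write);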

Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
[Fix EPT=0 according to Wanpeng Li's fix, plus ensure VMX also uses the
 new code. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
[Moved "ret < 0" to the else branch, as per David's review. - Radim]
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-08-18 14:37:49 +02:00
Paolo Bonzini
eebed24389 kvm: nVMX: Add support for fast unprotection of nested guest page tables
This is the same as commit 147277540b ("kvm: svm: Add support for
additional SVM NPF error codes", 2016-11-23), but for Intel processors.
In this case, the exit qualification field's bit 8 says whether the
EPT violation occurred while translating the guest's final physical
address or rather while translating the guest page tables.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-08-10 16:44:04 +02:00
Longpeng(Mike)
de63ad4cf4 KVM: X86: implement the logic for spinlock optimization
get_cpl requires vcpu_load, so we must cache the result (whether the
vcpu was preempted when its cpl=0) in kvm_vcpu_arch.

Signed-off-by: Longpeng(Mike) <longpeng2@huawei.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-08-08 10:57:43 +02:00
Tom Lendacky
d0ec49d4de kvm/x86/svm: Support Secure Memory Encryption within KVM
Update the KVM support to work with SME. The VMCB has a number of fields
where physical addresses are used and these addresses must contain the
memory encryption mask in order to properly access the encrypted memory.
Also, use the memory encryption mask when creating and using the nested
page tables.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brijesh Singh <brijesh.singh@amd.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Larry Woodman <lwoodman@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Toshimitsu Kani <toshi.kani@hpe.com>
Cc: kasan-dev@googlegroups.com
Cc: kvm@vger.kernel.org
Cc: linux-arch@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-efi@vger.kernel.org
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/89146eccfa50334409801ff20acd52a90fb5efcf.1500319216.git.thomas.lendacky@amd.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-07-18 11:38:04 +02:00
Roman Kagan
d3457c877b kvm: x86: hyperv: make VP_INDEX managed by userspace
Hyper-V identifies vCPUs by Virtual Processor Index, which can be
queried via the HV_X64_MSR_VP_INDEX MSR.  It is defined by the spec as a
sequential number which can't exceed the maximum number of vCPUs per VM.
APIC ids can be sparse and thus aren't a valid replacement for VP
indices.

Current KVM uses its internal vcpu index as VP_INDEX.  However, to make
it predictable and persistent across VM migrations, the userspace has to
control the value of VP_INDEX.

This patch achieves that, by storing vp_index explicitly on vcpu, and
allowing HV_X64_MSR_VP_INDEX to be set from the host side.  For
compatibility it's initialized to the KVM vcpu index.  Also, a few variables
are renamed to make a clear distinction between this Hyper-V vp_index and
the KVM vcpu_id (== APIC id).  Besides, a new capability,
KVM_CAP_HYPERV_VP_INDEX, is added to allow the userspace to skip
attempting msr writes where unsupported, to avoid spamming error logs.

Signed-off-by: Roman Kagan <rkagan@virtuozzo.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-07-14 16:28:18 +02:00
Wanpeng Li
52a5c155cf KVM: async_pf: Let guest support delivery of async_pf from guest mode
Adds another flag bit (bit 2) to MSR_KVM_ASYNC_PF_EN. If bit 2 is 1,
async page faults are delivered to L1 as #PF vmexits; if bit 2 is 0,
kvm_can_do_async_pf returns 0 if in guest mode.
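
A sketch of the resulting policy; KVM_ASYNC_PF_DELIVERY_AS_PF_VMEXIT is
the define added for bit 2, and the helper below is hypothetical:

    #define KVM_ASYNC_PF_DELIVERY_AS_PF_VMEXIT  (1 << 2)

    /* Hypothetical helper: only allow async #PF delivery while L2 runs
     * if L1 opted in via bit 2 of MSR_KVM_ASYNC_PF_EN. */
    static bool async_pf_allowed_in_guest_mode(struct kvm_vcpu *vcpu)
    {
        if (!is_guest_mode(vcpu))
            return true;
        return vcpu->arch.apf.msr_val & KVM_ASYNC_PF_DELIVERY_AS_PF_VMEXIT;
    }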

This is similar to what svm.c wanted to do all along, but it is only
enabled for Linux as L1 hypervisor.  Foreign hypervisors must never
receive async page faults as vmexits, because they'd probably be very
confused about that.

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-07-14 14:26:16 +02:00
Wanpeng Li
adfe20fb48 KVM: async_pf: Force a nested vmexit if the injected #PF is async_pf
Add a nested_apf field to vcpu->arch.exception to identify an async page
fault, and construct the expected VM-exit information fields. Force a
nested VM exit from nested_vmx_check_exception() if the injected #PF is
an async page fault.

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-07-14 14:26:16 +02:00
Wanpeng Li
1261bfa326 KVM: async_pf: Add L1 guest async_pf #PF vmexit handler
This patch adds the L1 guest async page fault #PF vmexit handler, so that
async page faults arriving in guest mode are handled by L1 much like an
ordinary async page fault.

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
[Passed insn parameters to kvm_mmu_page_fault().]
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-07-14 14:25:24 +02:00
Wanpeng Li
cfcd20e5ca KVM: x86: Simplify kvm_x86_ops->queue_exception parameter list
This patch removes all arguments except the first in
kvm_x86_ops->queue_exception, since the implementations can extract the
arguments from vcpu->arch.exception themselves.
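
Before and after, the callback shape changes roughly like this (the
prototypes are a sketch of the described change, not the verbatim patch):

    /* Before: all exception details passed explicitly. */
    void (*queue_exception)(struct kvm_vcpu *vcpu, unsigned nr,
                            bool has_error_code, u32 error_code,
                            bool reinject);

    /* After: implementations read vcpu->arch.exception themselves. */
    void (*queue_exception)(struct kvm_vcpu *vcpu);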

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-07-14 14:24:28 +02:00
Roman Kagan
efc479e690 kvm: x86: hyperv: add KVM_CAP_HYPERV_SYNIC2
There is a flaw in the Hyper-V SynIC implementation in KVM: when the message
page or event flags page is enabled by setting the corresponding MSR,
KVM zeroes it out.  This is problematic because on migration the
corresponding MSRs are loaded on the destination, so the content of
those pages is lost.

This went unnoticed so far because the only users of those pages were
the in-KVM Hyper-V SynIC timers, which could continue working despite
that zeroing.

Newer QEMU uses those pages for its Hyper-V VMBus implementation, and
zeroing them breaks migration.

Besides, in newer QEMU the content of those pages is fully managed by
QEMU, so zeroing them is undesirable even when writing the MSRs from the
guest side.

To support this new scheme, introduce a new capability,
KVM_CAP_HYPERV_SYNIC2, which, when enabled, makes sure that the synic
pages aren't zeroed out in KVM.

Signed-off-by: Roman Kagan <rkagan@virtuozzo.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-07-13 17:41:04 +02:00
Ladi Prosek
a826faf108 KVM: x86: make backwards_tsc_observed a per-VM variable
The backwards_tsc_observed global introduced in commit 16a9602 is never
reset to false. If a VM happens to be running while the host is suspended
(a common source of the TSC jumping backwards), master clock will never
be enabled again for any VM. In contrast, if no VM is running while the
host is suspended, master clock is unaffected. This is inconsistent and
unnecessarily strict. Let's track the backwards_tsc_observed variable
separately and let each VM start with a clean slate.

Real world impact: My Windows VMs get slower after my laptop undergoes a
suspend/resume cycle. The only way to get the perf back is unloading and
reloading the kvm module.

Signed-off-by: Ladi Prosek <lprosek@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-07-13 17:25:39 +02:00
Peter Feiner
ac8d57e573 kvm: x86: mmu: allow A/D bits to be disabled in an mmu
Adds the plumbing to disable A/D bits in the MMU based on a new role
bit, ad_disabled. When A/D bits are disabled, the MMU operates as though
they aren't available (i.e., it uses access-tracking faults instead).

To avoid SP -> kvm_mmu_page.role.ad_disabled lookups all over the
place, A/D disablement is now stored in the SPTE. This state is stored
in the SPTE by tweaking the use of SPTE_SPECIAL_MASK for access
tracking. Rather than just setting SPTE_SPECIAL_MASK when an
access-tracking SPTE is non-present, we now always set
SPTE_SPECIAL_MASK for access-tracking SPTEs.
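
The payoff is that a single bit test on the SPTE answers the question;
a minimal sketch, assuming SPTE_SPECIAL_MASK carries the access-tracking
meaning described above:

    /* A/D bits are in use for this SPTE unless it is marked as an
     * access-tracking SPTE via SPTE_SPECIAL_MASK. */
    static bool spte_ad_enabled(u64 spte)
    {
        return !(spte & SPTE_SPECIAL_MASK);
    }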

Signed-off-by: Peter Feiner <pfeiner@google.com>
[Use role.ad_disabled even for direct (non-shadow) EPT page tables.  Add
 documentation and a few MMU_WARN_ONs. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-07-03 11:19:54 +02:00
Paolo Bonzini
04a7ea04d5 KVM/ARM updates for 4.13
- vcpu request overhaul
 - allow timer and PMU to have their interrupt number
   selected from userspace
 - workaround for Cavium erratum 30115
 - handling of memory poisoning
 - the usual crop of fixes and cleanups

Merge tag 'kvmarm-for-4.13' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD

KVM/ARM updates for 4.13

- vcpu request overhaul
- allow timer and PMU to have their interrupt number
  selected from userspace
- workaround for Cavium erratum 30115
- handling of memory poisoning
- the usual crop of fixes and cleanups

Conflicts:
	arch/s390/include/asm/kvm_host.h
2017-06-30 12:38:26 +02:00
Andrew Jones
2387149ead KVM: improve arch vcpu request defining
Marc Zyngier suggested that we define the arch specific VCPU request
base, rather than requiring each arch to remember to start from 8.
That suggestion, along with Radim Krcmar's recent VCPU request flag
addition, snowballed into defining something of an arch VCPU request
defining API.

No functional change.

(Looks like x86 is running out of arch VCPU request bits.  Maybe
 someday we'll need to extend to 64.)

Signed-off-by: Andrew Jones <drjones@redhat.com>
Acked-by: Christoffer Dall <cdall@linaro.org>
Signed-off-by: Christoffer Dall <cdall@linaro.org>
2017-06-04 16:53:00 +02:00
Paolo Bonzini
b401ee0b85 KVM: x86: lower default for halt_poll_ns
In some fio benchmarks, halt_poll_ns=400000 caused CPU utilization to
increase heavily even in cases where the performance improvement was
small.  In particular, bandwidth divided by CPU usage was as much as
60% lower.

To some extent this is the expected effect of the patch, and the
additional CPU utilization is only visible when running the
benchmarks.  However, halving the threshold also halves the extra
CPU utilization (from +30-130% to +20-70%) and has no negative
effect on performance.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-05-16 21:15:50 +02:00
Bandan Das
bab4165e2f kvm: x86: Add a hook for arch specific dirty logging emulation
When KVM updates accessed/dirty bits, this hook can be used
to invoke an arch specific function that implements/emulates
dirty logging such as PML.

Signed-off-by: Bandan Das <bsd@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-05-09 11:54:16 +02:00
David Hildenbrand
5c0aea0e8d KVM: x86: don't hold kvm->lock in KVM_SET_GSI_ROUTING
We needed the lock to avoid racing with creation of the irqchip on x86. As
kvm_set_irq_routing() calls synchronize_srcu_expedited(), this lock
might be held for a longer time.

Let's introduce an arch specific callback to check if we can actually
add irq routes. For x86, all we have to do is check if we have an
irqchip in the kernel. We don't need kvm->lock at that point as the
irqchip is marked as initialized only when actually fully created.

Reported-by: Steve Rutherford <srutherford@google.com>
Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Fixes: 1df6ddede1 ("KVM: x86: race between KVM_SET_GSI_ROUTING and KVM_CREATE_IRQCHIP")
Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-05-02 14:45:45 +02:00
Paolo Bonzini
7a97cec26b KVM: mark requests that need synchronization
kvm_make_all_cpus_request() provides a synchronization that waits until all
kicked VCPUs have acknowledged the kick.  This is important for
KVM_REQ_MMU_RELOAD as it prevents freeing while lockless paging is
underway.

This patch adds the synchronization property into all requests that are
currently being used with kvm_make_all_cpus_request() in order to preserve
the current behavior and only introduce a new framework.  Removing it
from requests where it is not necessary is left for future patches.

Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-04-27 14:36:44 +02:00
Radim Krčmář
930f7fd6da KVM: mark requests that do not need a wakeup
Some operations must ensure that the guest is not running with stale
data, but if the guest is halted, then the update can wait until another
event happens.  kvm_make_all_cpus_request() currently doesn't wake up, so we
can mark all requests used with it.

The first 8 bits were arbitrarily reserved for request numbers.

Most uses of requests have the request type as a constant, so a compiler
will optimize the '&'.

An alternative would be to have an inline function that would return
whether the request needs a wake-up or not, but I like this one better
even though it might produce worse assembly.
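
A sketch of the encoding (names per the description; GENMASK/BIT usage
assumed):

    #define KVM_REQUEST_MASK        GENMASK(7, 0)  /* low bits: request number */
    #define KVM_REQUEST_NO_WAKEUP   BIT(8)         /* don't wake a halted VCPU */

    /* At kick time, callers test the flag directly; with a constant
     * request the compiler folds the '&' away: */
    if (!(req & KVM_REQUEST_NO_WAKEUP))
        kvm_vcpu_wake_up(vcpu);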

Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Reviewed-by: Andrew Jones <drjones@redhat.com>
Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-04-27 14:13:57 +02:00
Kyle Huey
db2336a804 KVM: x86: virtualize cpuid faulting
Hardware support for faulting on the cpuid instruction is not required to
emulate it, because cpuid triggers a VM exit anyway. KVM handles the relevant
MSRs (MSR_PLATFORM_INFO and MSR_MISC_FEATURES_ENABLES) and upon a
cpuid-induced VM exit checks the cpuid faulting state and the CPL.
kvm_require_cpl is even kind enough to inject the GP fault for us.
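
The exit-time check then amounts to roughly the following; a sketch,
with cpuid_fault_enabled() standing for a helper that reads the cached
MSR state:

    /* On a cpuid-induced VM exit: inject #GP if cpuid faulting is
     * enabled and the guest is not at CPL 0. */
    if (cpuid_fault_enabled(vcpu) && !kvm_require_cpl(vcpu, 0))
        return 1;   /* kvm_require_cpl queued the #GP */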

Signed-off-by: Kyle Huey <khuey@kylehuey.com>
Reviewed-by: David Matlack <dmatlack@google.com>
[Return "1" from kvm_emulate_cpuid, it's not void. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-04-21 12:50:06 +02:00
David Hildenbrand
637e3f86fa KVM: x86: new irqchip mode KVM_IRQCHIP_INIT_IN_PROGRESS
Let's add a new mode and set it while we create the irqchip via
KVM_CREATE_IRQCHIP and KVM_CAP_SPLIT_IRQCHIP.

This mode will be used later to test if adding routes
(in kvm_set_routing_entry()) is already allowed.

Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-04-12 20:17:13 +02:00
Paolo Bonzini
4b4357e025 kvm: make KVM_COALESCED_MMIO_PAGE_OFFSET public
Its value has never changed; we might as well make it part of the ABI instead
of using the return value of KVM_CHECK_EXTENSION(KVM_CAP_COALESCED_MMIO).

Because PPC does not always make MMIO available, the code has to be made
dependent on CONFIG_KVM_MMIO rather than KVM_COALESCED_MMIO_PAGE_OFFSET.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-04-07 16:49:01 +02:00
Paolo Bonzini
ae1e2d1082 kvm: nVMX: support EPT accessed/dirty bits
Now use bit 6 of EPTP to optionally enable EPT A/D bits.  Another
thing to change is that, when EPT accessed and dirty bits are not in use,
VMX treats accesses to guest paging structures as data reads.  When they
are in use (bit 6 of EPTP is set), they are treated as writes and the
corresponding EPT dirty bit is set.  The MMU didn't know this detail,
so this patch adds it.

We also have to fix up the exit qualification.  It may be wrong because
KVM sets bit 6 but the guest might not.

L1 emulates EPT A/D bits using write permissions, so in principle it may
be possible for EPT A/D bits to be used by L1 even though not available
in hardware.  The problem is that guest page-table walks will be treated
as reads rather than writes, so they would not cause an EPT violation.
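
The EPTP construction side of this is small; a sketch, assuming the
usual VMX EPTP field layout (bit 6 enables A/D):

    #define VMX_EPT_AD_ENABLE_BIT   (1ull << 6)    /* EPTP bit 6 */

    static u64 construct_eptp_sketch(u64 root_hpa, bool enable_ept_ad)
    {
        u64 eptp = VMX_EPT_DEFAULT_MT | VMX_EPT_DEFAULT_GAW << 3;

        if (enable_ept_ad)
            eptp |= VMX_EPT_AD_ENABLE_BIT;
        return eptp | (root_hpa & PAGE_MASK);
    }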

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
[Fixed typo in walk_addr_generic() comment and changed bit clear +
 conditional-set pattern in handle_ept_violation() to conditional-clear]
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-04-07 16:49:00 +02:00
Paolo Bonzini
bd7e5b0899 KVM: x86: remove code for lazy FPU handling
The FPU is always active now when running KVM.

Reviewed-by: David Matlack <dmatlack@google.com>
Reviewed-by: Bandan Das <bsd@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-02-17 12:28:01 +01:00
Paolo Bonzini
76dfafd536 KVM: x86: do not scan IRR twice on APICv vmentry
Calls to apic_find_highest_irr are scanning IRR twice, once
in vmx_sync_pir_from_irr and once in apic_search_irr.  Change
sync_pir_from_irr to get the new maximum IRR from kvm_apic_update_irr;
now that it does the computation, it can also do the RVI write.

In order to avoid complications in svm.c, make the callback optional.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-02-15 14:54:35 +01:00
Paolo Bonzini
0f1e261ead KVM: x86: add VCPU stat for KVM_REQ_EVENT processing
This statistic can be useful to estimate the cost of an IRQ injection
scenario, by comparing it with irq_injections.  For example the stat
shows that sti;hlt triggers more KVM_REQ_EVENT than sti;nop.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-01-09 14:47:59 +01:00
Tom Lendacky
0f89b207b0 kvm: svm: Use the hardware provided GPA instead of page walk
When a guest causes a NPF which requires emulation, KVM sometimes walks
the guest page tables to translate the GVA to a GPA. This is unnecessary
most of the time on AMD hardware since the hardware provides the GPA in
EXITINFO2.

The only exceptions involve rep-prefixed string operations or
operations that use two memory locations. With rep, the GPA will only be
the value of the initial NPF and with dual memory locations we won't know
which memory address was translated into EXITINFO2.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Reviewed-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-01-09 14:47:58 +01:00
Junaid Shahid
f160c7b7bb kvm: x86: mmu: Lockless access tracking for Intel CPUs without EPT A bits.
This change implements lockless access tracking for Intel CPUs without EPT
A bits. This is achieved by marking the PTEs as not-present (but not
completely clearing them) when clear_flush_young() is called after marking
the pages as accessed. When an EPT Violation is generated as a result of
the VM accessing those pages, the PTEs are restored to their original values.

Signed-off-by: Junaid Shahid <junaids@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-01-09 14:46:11 +01:00
Junaid Shahid
37f0e8fe6b kvm: x86: mmu: Do not use bit 63 for tracking special SPTEs
MMIO SPTEs currently set both bits 62 and 63 to distinguish them as special
PTEs. However, bit 63 is used as the SVE bit in Intel EPT PTEs. The SVE bit
is ignored for misconfigured PTEs but not necessarily for not-Present PTEs.
Since MMIO SPTEs use an EPT misconfiguration, using bit 63 for them is
acceptable. However, the upcoming fast access tracking feature adds another
type of special tracking PTE, which uses not-Present PTEs and hence should
not set bit 63.

In order to use common bits to distinguish both types of special PTEs, we
now use only bit 62 as the special bit.
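
In effect, the special bit becomes (mask value per the description):

    /* Bit 62 alone marks special SPTEs; bit 63 stays clear so the EPT
     * SVE bit is never set on not-present access-tracking PTEs. */
    #define SPTE_SPECIAL_MASK   (1ULL << 62)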

Signed-off-by: Junaid Shahid <junaids@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-01-09 14:46:10 +01:00
David Matlack
114df303a7 kvm: x86: reduce collisions in mmu_page_hash
When using two-dimensional paging, the mmu_page_hash (which provides
lookups for existing kvm_mmu_page structs) becomes imbalanced, with
too many collisions in buckets 0 and 512. This has been seen to cause
mmu_lock to be held for multiple milliseconds in kvm_mmu_get_page on
VMs with a large amount of RAM mapped with 4K pages.

The current hash function uses the lower 10 bits of gfn to index into
mmu_page_hash. When doing shadow paging, gfn is the address of the
guest page table being shadowed. These tables are 4K-aligned, which
makes the low bits of gfn a good hash. However, with two-dimensional
paging, no guest page tables are being shadowed, so gfn is the base
address that is mapped by the table. Thus page tables (level=1) have
a 2MB aligned gfn, page directories (level=2) have a 1GB aligned gfn,
etc. This means hashes will only differ in their 10th bit.

hash_64() provides a better hash. For example, on a VM with ~200G
(99458 direct=1 kvm_mmu_page structs):

hash            max_mmu_page_hash_collisions
--------------------------------------------
low 10 bits     49847
hash_64         105
perfect         97

While we're changing the hash, increase the table size by 4x to better
support large VMs (further reduces number of collisions in 200G VM to
29).

Note that hash_64() does not provide a good distribution prior to commit
ef703f49a6 ("Eliminate bad hash multipliers from hash_32() and
hash_64()").

Signed-off-by: David Matlack <dmatlack@google.com>
Change-Id: I5aa6b13c834722813c6cca46b8b1ed6f53368ade
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-01-09 14:46:04 +01:00
David Matlack
f3414bc774 kvm: x86: export maximum number of mmu_page_hash collisions
Report the maximum number of mmu_page_hash collisions as a per-VM stat.
This will make it easy to identify problems with the mmu_page_hash in
the future.

Signed-off-by: David Matlack <dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-01-09 14:46:03 +01:00
Radim Krčmář
49776faf93 KVM: x86: decouple irqchip_in_kernel() and pic_irqchip()
irqchip_in_kernel() tried to save a bit by reusing pic_irqchip(), but it
just complicated the code.
Add a separate state for the irqchip mode.

Reviewed-by: David Hildenbrand <david@redhat.com>
[Used Paolo's version of condition in irqchip_in_kernel().]
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-01-09 14:42:47 +01:00
Thomas Gleixner
a5a1d1c291 clocksource: Use a plain u64 instead of cycle_t
There is no point in having an extra type for extra confusion. u64 is
unambiguous.

Conversion was done with the following coccinelle script:

@rem@
@@
-typedef u64 cycle_t;

@fix@
typedef cycle_t;
@@
-cycle_t
+u64

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: John Stultz <john.stultz@linaro.org>
2016-12-25 11:04:12 +01:00
Paolo Bonzini
3f5ad8be37 KVM: hyperv: fix locking of struct kvm_hv fields
Introduce a new mutex to avoid an AB-BA deadlock between kvm->lock and
vcpu->mutex.  Protect accesses in kvm_hv_setup_tsc_page too, as suggested
by Roman.

Reported-by: Dmitry Vyukov <dvyukov@google.com>
Reviewed-by: Roman Kagan <rkagan@virtuozzo.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-12-16 17:53:38 +01:00
Ladi Prosek
9ed38ffad4 KVM: nVMX: introduce nested_vmx_load_cr3 and call it on vmentry
Loading CR3 as part of emulating vmentry is different from regular CR3 loads,
as implemented in kvm_set_cr3, in several ways.

* different rules are followed to check CR3 and it is desirable for the caller
to distinguish between the possible failures
* PDPTRs are not loaded if PAE paging and nested EPT are both enabled
* many MMU operations are not necessary

This patch introduces nested_vmx_load_cr3 suitable for CR3 loads as part of
nested vmentry and vmexit, and makes use of it on the nested vmentry path.

Signed-off-by: Ladi Prosek <lprosek@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-12-08 15:31:10 +01:00
Kyle Huey
6affcbedca KVM: x86: Add kvm_skip_emulated_instruction and use it.
kvm_skip_emulated_instruction calls both
kvm_x86_ops->skip_emulated_instruction and kvm_vcpu_check_singlestep,
skipping the emulated instruction and generating a trap if necessary.

Replacing skip_emulated_instruction calls with
kvm_skip_emulated_instruction is straightforward, except for:

- ICEBP, which is already inside a trap, so avoid triggering another trap.
- Instructions that can trigger exits to userspace, such as the IO insns,
  MOVs to CR8, and HALT. If kvm_skip_emulated_instruction does trigger a
  KVM_GUESTDBG_SINGLESTEP exit, and the handling code for
  IN/OUT/MOV CR8/HALT also triggers an exit to userspace, the latter will
  take precedence. The singlestep will be triggered again on the next
  instruction, which is the current behavior.
- Task switch instructions which would require additional handling (e.g.
  the task switch bit) and are instead left alone.
- Cases where VMLAUNCH/VMRESUME do not proceed to the next instruction,
  which do not trigger singlestep traps as mentioned previously.
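
A hedged sketch of the combined helper described above (the singlestep
plumbing is simplified):

    int kvm_skip_emulated_instruction(struct kvm_vcpu *vcpu)
    {
        unsigned long rflags = kvm_x86_ops->get_rflags(vcpu);
        int r = EMULATE_DONE;

        kvm_x86_ops->skip_emulated_instruction(vcpu);
        if (unlikely(rflags & X86_EFLAGS_TF))
            kvm_vcpu_check_singlestep(vcpu, rflags, &r);
        /* 0 tells the caller to exit to userspace. */
        return r == EMULATE_DONE;
    }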

Signed-off-by: Kyle Huey <khuey@kylehuey.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-12-08 15:31:05 +01:00
Kyle Huey
6a908b628c KVM: x86: Add a return value to kvm_emulate_cpuid
Once skipping the emulated instruction can potentially trigger an exit to
userspace (via KVM_GUESTDBG_SINGLESTEP) kvm_emulate_cpuid will need to
propagate a return value.

Signed-off-by: Kyle Huey <khuey@kylehuey.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-12-08 15:31:03 +01:00
Tom Lendacky
8370c3d08b kvm: svm: Add kvm_fast_pio_in support
Update the I/O interception support to add the kvm_fast_pio_in function
to speed up the in instruction, similar to what is done for the out
instruction.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Reviewed-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-11-24 18:32:45 +01:00
Tom Lendacky
147277540b kvm: svm: Add support for additional SVM NPF error codes
AMD hardware adds two additional bits to aid in nested page fault handling.

Bit 32 - NPF occurred while translating the guest's final physical address
Bit 33 - NPF occurred while translating the guest page tables

The guest page tables fault indicator can be used as an aid for nested
virtualization. Using V0 for the host, V1 for the first level guest and
V2 for the second level guest, when both V1 and V2 are using nested paging
there are currently a number of unnecessary instruction emulations. When
V2 is launched shadow paging is used in V1 for the nested tables of V2. As
a result, KVM marks these pages as RO in the host nested page tables. When
V2 exits and we resume V1, these pages are still marked RO.

Every nested walk for a guest page table is treated as a user-level write
access and this causes a lot of NPFs because the V1 page tables are marked
RO in the V0 nested tables. While executing V1, when these NPFs occur KVM
sees a write to a read-only page, emulates the V1 instruction and unprotects
the page (marking it RW). This patch looks for cases where we get a NPF due
to a guest page table walk where the page was marked RO. It immediately
unprotects the page and resumes the guest, leading to far fewer instruction
emulations when nested virtualization is used.
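
The two error-code bits, as seen on the KVM side (bit values per the
description above):

    #define PFERR_GUEST_FINAL_MASK  (1ULL << 32)  /* fault on final GPA */
    #define PFERR_GUEST_PAGE_MASK   (1ULL << 33)  /* fault on guest page tables */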

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Reviewed-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-11-24 18:32:26 +01:00
Paolo Bonzini
ea26e4ec08 KVM: x86: drop TSC offsetting kvm_x86_ops to fix KVM_GET/SET_CLOCK
Since commit a545ab6a00 ("kvm: x86: add tsc_offset field to struct
kvm_vcpu_arch", 2016-09-07) the offset between host and L1 TSC is
cached and need not be fished out of the VMCS or VMCB.  This means
that we can implement adjust_tsc_offset_guest and read_l1_tsc
entirely in generic code.  The simplification is particularly
significant for VMX code, where vmx->nested.vmcs01_tsc_offset
was duplicating what is now in vcpu->arch.tsc_offset.  Therefore
the vmcs01_tsc_offset can be dropped completely.

More importantly, this fixes KVM_GET_CLOCK/KVM_SET_CLOCK
which, after commit 108b249c45 ("KVM: x86: introduce get_kvmclock_ns",
2016-09-01) called read_l1_tsc while the VMCS was not loaded.
It thus returned bogus values on Intel CPUs.

Fixes: 108b249c45
Reported-by: Roman Kagan <rkagan@virtuozzo.com>
Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-11-02 20:03:07 +01:00
Paolo Bonzini
095cf55df7 KVM: x86: Hyper-V tsc page setup
Previously the TSC page was implemented but filled with empty
values. This patch sets up the TSC page scale and offset based
on the vCPU TSC, tsc_khz, and the HV_X64_MSR_TIME_REF_COUNT value.

A valid TSC page drops the HV_X64_MSR_TIME_REF_COUNT MSR read
count to zero, which potentially improves performance.
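
For reference, the guest computes time from the TSC page using the
Hyper-V formula, so KVM must choose scale and offset to match its own
clock; a sketch (function name is illustrative):

    /* Hyper-V reference time, in 100ns units:
     *   time = ((tsc * tsc_scale) >> 64) + tsc_offset */
    static u64 hv_tsc_page_time_sketch(u64 tsc, u64 tsc_scale, s64 tsc_offset)
    {
        return mul_u64_u64_shr(tsc, tsc_scale, 64) + tsc_offset;
    }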

Signed-off-by: Andrey Smetanin <asmetanin@virtuozzo.com>
Reviewed-by: Peter Hornyack <peterhornyack@google.com>
Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
CC: Paolo Bonzini <pbonzini@redhat.com>
CC: Roman Kagan <rkagan@virtuozzo.com>
CC: Denis V. Lunev <den@openvz.org>
[Computation of TSC page parameters rewritten to use the Linux timekeeper
 parameters. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-09-20 09:26:20 +02:00
Luiz Capitulino
3e3f50262e kvm: x86: drop read_tsc_offset()
The TSC offset can now be read directly from struct kvm_vcpu_arch.

Signed-off-by: Luiz Capitulino <lcapitulino@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-09-16 16:57:45 +02:00
Luiz Capitulino
a545ab6a00 kvm: x86: add tsc_offset field to struct kvm_vcpu_arch
A future commit will want to easily read a vCPU's TSC offset,
so we store it in struct kvm_vcpu_arch for easy access.

Signed-off-by: Luiz Capitulino <lcapitulino@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-09-16 16:57:45 +02:00
Paolo Bonzini
ad53e35ae5 Merge branch 'kvm-ppc-next' of git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc into HEAD
Paul Mackerras writes:

    The highlights are:

    * Reduced latency for interrupts from PCI pass-through devices, from
      Suresh Warrier and me.
    * Halt-polling implementation from Suraj Jitindar Singh.
    * 64-bit VCPU statistics, also from Suraj.
    * Various other minor fixes and improvements.
2016-09-13 15:20:55 +02:00
Suravee Suthikulpanit
5881f73757 svm: Introduce AMD IOMMU avic_ga_log_notifier
This patch introduces avic_ga_log_notifier, which will be called
by the IOMMU driver whenever it handles a Guest vAPIC (GA) log entry.

Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-09-08 13:11:56 +02:00