linux

mirror of synced 2025-03-06 20:59:54 +01:00

Author	SHA1	Message	Date
Paolo Bonzini	594bef7931	KVM: x86/mmu: do not consult levels when freeing roots Right now, PGD caching requires a complicated dance of first computing the MMU role and passing it to __kvm_mmu_new_pgd(), and then separately calling kvm_init_mmu(). Part of this is due to kvm_mmu_free_roots using mmu->root_level and mmu->shadow_root_level to distinguish whether the page table uses a single root or 4 PAE roots. Because kvm_init_mmu() can overwrite mmu->root_level, kvm_mmu_free_roots() must be called before kvm_init_mmu(). However, even after kvm_init_mmu() there is a way to detect whether the page table may hold PAE roots, as root.hpa isn't backed by a shadow when it points at PAE roots. Using this method results in simpler code, and is one less obstacle in moving all calls to __kvm_mmu_new_pgd() after the MMU has been initialized. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-02-25 08:20:17 -05:00
Paolo Bonzini	b9e5603c2a	KVM: x86: use struct kvm_mmu_root_info for mmu->root The root_hpa and root_pgd fields form essentially a struct kvm_mmu_root_info. Use the struct to have more consistency between mmu->root and mmu->prev_roots. The patch is entirely search and replace except for cached_root_available, which does not need a temporary struct kvm_mmu_root_info anymore. Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-02-25 08:20:16 -05:00
Paolo Bonzini	9191b8f074	KVM: x86/mmu: avoid NULL-pointer dereference on page freeing bugs WARN and bail if KVM attempts to free a root that isn't backed by a shadow page. KVM allocates a bare page for "special" roots, e.g. when using PAE paging or shadowing 2/3/4-level page tables with 4/5-level, and so root_hpa will be valid but won't be backed by a shadow page. It's all too easy to blindly call mmu_free_root_page() on root_hpa, be nice and WARN instead of crashing KVM and possibly the kernel. Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-02-25 08:20:16 -05:00
Liang Zhang	6f3c1fc53d	KVM: x86/mmu: make apf token non-zero to fix bug In current async pagefault logic, when a page is ready, KVM relies on kvm_arch_can_dequeue_async_page_present() to determine whether to deliver a READY event to the Guest. This function test token value of struct kvm_vcpu_pv_apf_data, which must be reset to zero by Guest kernel when a READY event is finished by Guest. If value is zero meaning that a READY event is done, so the KVM can deliver another. But the kvm_arch_setup_async_pf() may produce a valid token with zero value, which is confused with previous mention and may lead the loss of this READY event. This bug may cause task blocked forever in Guest: INFO: task stress:7532 blocked for more than 1254 seconds. Not tainted 5.10.0 #16 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. task:stress state:D stack: 0 pid: 7532 ppid: 1409 flags:0x00000080 Call Trace: __schedule+0x1e7/0x650 schedule+0x46/0xb0 kvm_async_pf_task_wait_schedule+0xad/0xe0 ? exit_to_user_mode_prepare+0x60/0x70 __kvm_handle_async_pf+0x4f/0xb0 ? asm_exc_page_fault+0x8/0x30 exc_page_fault+0x6f/0x110 ? asm_exc_page_fault+0x8/0x30 asm_exc_page_fault+0x1e/0x30 RIP: 0033:0x402d00 RSP: 002b:00007ffd31912500 EFLAGS: 00010206 RAX: 0000000000071000 RBX: ffffffffffffffff RCX: 00000000021a32b0 RDX: 000000000007d011 RSI: 000000000007d000 RDI: 00000000021262b0 RBP: 00000000021262b0 R08: 0000000000000003 R09: 0000000000000086 R10: 00000000000000eb R11: 00007fefbdf2baa0 R12: 0000000000000000 R13: 0000000000000002 R14: 000000000007d000 R15: 0000000000001000 Signed-off-by: Liang Zhang <zhangliang5@huawei.com> Message-Id: <20220222031239.1076682-1-zhangliang5@huawei.com> Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-02-24 13:04:46 -05:00
Sean Christopherson	1bbc60d0c7	KVM: x86/mmu: Remove MMU auditing Remove mmu_audit.c and all its collateral, the auditing code has suffered severe bitrot, ironically partly due to shadow paging being more stable and thus not benefiting as much from auditing, but mostly due to TDP supplanting shadow paging for non-nested guests and shadowing of nested TDP not heavily stressing the logic that is being audited. Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-02-18 13:46:23 -05:00
David Matlack	cb00a70bd4	KVM: x86/mmu: Split huge pages mapped by the TDP MMU during KVM_CLEAR_DIRTY_LOG When using KVM_DIRTY_LOG_INITIALLY_SET, huge pages are not write-protected when dirty logging is enabled on the memslot. Instead they are write-protected once userspace invokes KVM_CLEAR_DIRTY_LOG for the first time and only for the specific sub-region being cleared. Enhance KVM_CLEAR_DIRTY_LOG to also try to split huge pages prior to write-protecting to avoid causing write-protection faults on vCPU threads. This also allows userspace to smear the cost of huge page splitting across multiple ioctls, rather than splitting the entire memslot as is the case when initially-all-set is not used. Signed-off-by: David Matlack <dmatlack@google.com> Message-Id: <20220119230739.2234394-17-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-02-10 13:50:43 -05:00
David Matlack	a3fe5dbda0	KVM: x86/mmu: Split huge pages mapped by the TDP MMU when dirty logging is enabled When dirty logging is enabled without initially-all-set, try to split all huge pages in the memslot down to 4KB pages so that vCPUs do not have to take expensive write-protection faults to split huge pages. Eager page splitting is best-effort only. This commit only adds the support for the TDP MMU, and even there splitting may fail due to out of memory conditions. Failures to split a huge page is fine from a correctness standpoint because KVM will always follow up splitting by write-protecting any remaining huge pages. Eager page splitting moves the cost of splitting huge pages off of the vCPU threads and onto the thread enabling dirty logging on the memslot. This is useful because: 1. Splitting on the vCPU thread interrupts vCPUs execution and is disruptive to customers whereas splitting on VM ioctl threads can run in parallel with vCPU execution. 2. Splitting all huge pages at once is more efficient because it does not require performing VM-exit handling or walking the page table for every 4KiB page in the memslot, and greatly reduces the amount of contention on the mmu_lock. For example, when running dirty_log_perf_test with 96 virtual CPUs, 1GiB per vCPU, and 1GiB HugeTLB memory, the time it takes vCPUs to write to all of their memory after dirty logging is enabled decreased by 95% from 2.94s to 0.14s. Eager Page Splitting is over 100x more efficient than the current implementation of splitting on fault under the read lock. For example, taking the same workload as above, Eager Page Splitting reduced the CPU required to split all huge pages from ~270 CPU-seconds ((2.94s - 0.14s) * 96 vCPU threads) to only 1.55 CPU-seconds. Eager page splitting does increase the amount of time it takes to enable dirty logging since it has split all huge pages. For example, the time it took to enable dirty logging in the 96GiB region of the aforementioned test increased from 0.001s to 1.55s. Reviewed-by: Peter Xu <peterx@redhat.com> Signed-off-by: David Matlack <dmatlack@google.com> Message-Id: <20220119230739.2234394-16-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-02-10 13:50:42 -05:00
David Matlack	315d86da89	KVM: x86/mmu: Move restore_acc_track_spte() to spte.h restore_acc_track_spte() is pure SPTE bit manipulation, making it a good fit for spte.h. And now that the WARN_ON_ONCE() calls have been removed, there isn't any good reason to not inline it. This move also prepares for a follow-up commit that will need to call restore_acc_track_spte() from spte.c No functional change intended. Reviewed-by: Ben Gardon <bgardon@google.com> Reviewed-by: Peter Xu <peterx@redhat.com> Signed-off-by: David Matlack <dmatlack@google.com> Message-Id: <20220119230739.2234394-11-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-02-10 13:50:40 -05:00
David Matlack	77c23c77f9	KVM: x86/mmu: Drop new_spte local variable from restore_acc_track_spte() The new_spte local variable is unnecessary. Deleting it can save a line of code and simplify the remaining lines a bit. No functional change intended. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: David Matlack <dmatlack@google.com> Message-Id: <20220119230739.2234394-10-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-02-10 13:50:39 -05:00
David Matlack	59940e76d1	KVM: x86/mmu: Remove unnecessary warnings from restore_acc_track_spte() The warnings in restore_acc_track_spte() can be removed because the only caller checks is_access_track_spte(), and is_access_track_spte() checks !spte_ad_enabled(). In other words, the warning can never be triggered. No functional change intended. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: David Matlack <dmatlack@google.com> Message-Id: <20220119230739.2234394-9-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-02-10 13:50:39 -05:00
David Matlack	1346bbb6b4	KVM: x86/mmu: Rename __rmap_write_protect() to rmap_write_protect() The function formerly known as rmap_write_protect() has been renamed to kvm_vcpu_write_protect_gfn(), so we can get rid of the double underscores in front of __rmap_write_protect(). No functional change intended. Reviewed-by: Ben Gardon <bgardon@google.com> Reviewed-by: Peter Xu <peterx@redhat.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: David Matlack <dmatlack@google.com> Message-Id: <20220119230739.2234394-3-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-02-10 13:50:37 -05:00
David Matlack	cf48f9e286	KVM: x86/mmu: Rename rmap_write_protect() to kvm_vcpu_write_protect_gfn() rmap_write_protect() is a poor name because it also write-protects SPTEs in the TDP MMU, not just SPTEs in the rmap. It is also confusing that rmap_write_protect() is not a simple wrapper around __rmap_write_protect(), since that is the common pattern for functions with double-underscore names. Rename rmap_write_protect() to kvm_vcpu_write_protect_gfn() to convey that KVM is write-protecting a specific gfn in the context of a vCPU. No functional change intended. Reviewed-by: Ben Gardon <bgardon@google.com> Reviewed-by: Peter Xu <peterx@redhat.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: David Matlack <dmatlack@google.com> Message-Id: <20220119230739.2234394-2-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-02-10 13:50:36 -05:00
David Matlack	02844ac1eb	KVM: x86/mmu: Consolidate comments about {Host,MMU}-writable Consolidate the large comment above DEFAULT_SPTE_HOST_WRITABLE with the large comment above is_writable_pte() into one comment. This comment explains the different reasons why an SPTE may be non-writable and KVM keeps track of that with the {Host,MMU}-writable bits. No functional change intended. Signed-off-by: David Matlack <dmatlack@google.com> Message-Id: <20220125230723.1701061-1-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-02-10 13:50:33 -05:00
David Matlack	1ca87e015d	KVM: x86/mmu: Rename DEFAULT_SPTE_MMU_WRITEABLE to DEFAULT_SPTE_MMU_WRITABLE Both "writeable" and "writable" are valid, but we should be consistent about which we use. DEFAULT_SPTE_MMU_WRITEABLE was the odd one out in the SPTE code, so rename it to DEFAULT_SPTE_MMU_WRITABLE. No functional change intended. Signed-off-by: David Matlack <dmatlack@google.com> Message-Id: <20220125230713.1700406-1-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-02-10 13:50:33 -05:00
David Matlack	115111efd9	KVM: x86/mmu: Check SPTE writable invariants when setting leaf SPTEs Check SPTE writable invariants when setting SPTEs rather than in spte_can_locklessly_be_made_writable(). By the time KVM checks spte_can_locklessly_be_made_writable(), the SPTE has long been since corrupted. Note that these invariants only apply to shadow-present leaf SPTEs (i.e. not to MMIO SPTEs, non-leaf SPTEs, etc.). Add a comment explaining the restriction and only instrument the code paths that set shadow-present leaf SPTEs. To account for access tracking, also check the SPTE writable invariants when marking an SPTE as an access track SPTE. This also lets us remove a redundant WARN from mark_spte_for_access_track(). Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: David Matlack <dmatlack@google.com> Message-Id: <20220125230518.1697048-3-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-02-10 13:50:32 -05:00
Sean Christopherson	e27bc0440e	KVM: x86: Rename kvm_x86_ops pointers to align w/ preferred vendor names Rename a variety of kvm_x86_op function pointers so that preferred name for vendor implementations follows the pattern <vendor>_<function>, e.g. rename .run() to .vcpu_run() to match {svm,vmx}_vcpu_run(). This will allow vendor implementations to be wired up via the KVM_X86_OP macro. In many cases, VMX and SVM "disagree" on the preferred name, though in reality it's VMX and x86 that disagree as SVM blindly prepended _svm to the kvm_x86_ops name. Justification for using the VMX nomenclature: - set_{irq,nmi} => inject_{irq,nmi} because the helper is injecting an event that has already been "set" in e.g. the vIRR. SVM's relevant VMCB field is even named event_inj, and KVM's stat is irq_injections. - prepare_guest_switch => prepare_switch_to_guest because the former is ambiguous, e.g. it could mean switching between multiple guests, switching from the guest to host, etc... - update_pi_irte => pi_update_irte to allow for matching match the rest of VMX's posted interrupt naming scheme, which is vmx_pi_<blah>(). - start_assignment => pi_start_assignment to again follow VMX's posted interrupt naming scheme, and to provide context for what bit of code might care about an otherwise undescribed "assignment". The "tlb_flush" => "flush_tlb" creates an inconsistency with respect to Hyper-V's "tlb_remote_flush" hooks, but Hyper-V really is the one that's wrong. x86, VMX, and SVM all use flush_tlb, and even common KVM is on a variant of the bandwagon with "kvm_flush_remote_tlbs", e.g. a more appropriate name for the Hyper-V hooks would be flush_remote_tlbs. Leave that change for another time as the Hyper-V hooks always start as NULL, i.e. the name doesn't matter for using kvm-x86-ops.h, and changing all names requires an astounding amount of churn. VMX and SVM function names are intentionally left as is to minimize the diff. Both VMX and SVM will need to rename even more functions in order to fully utilize KVM_X86_OPS, i.e. an additional patch for each is inevitable. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220128005208.4008533-5-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-02-10 13:47:17 -05:00
Jinrong Liang	e8f6e7383c	KVM: x86/mmu: Remove unused "vcpu" of reset_{tdp,ept}_shadow_zero_bits_mask() The "struct kvm_vcpu *vcpu" parameter of reset_ept_shadow_zero_bits_mask() and reset_tdp_shadow_zero_bits_mask() is not used, so remove it. No functional change intended. Signed-off-by: Jinrong Liang <cloudliang@tencent.com> Message-Id: <20220125095909.38122-4-cloudliang@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-02-10 13:47:10 -05:00
Jinrong Liang	a0e72cd1e9	KVM: x86/mmu: Remove unused "kvm" of __rmap_write_protect() The "struct kvm *kvm" parameter of __rmap_write_protect() is not used, so remove it. No functional change intended. Signed-off-by: Jinrong Liang <cloudliang@tencent.com> Message-Id: <20220125095909.38122-3-cloudliang@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-02-10 13:47:10 -05:00
Jinrong Liang	61827671ca	KVM: x86/mmu: Remove unused "kvm" of kvm_mmu_unlink_parents() The "struct kvm *kvm" parameter of kvm_mmu_unlink_parents() is not used, so remove it. No functional change intended. Signed-off-by: Jinrong Liang <cloudliang@tencent.com> Message-Id: <20220125095909.38122-2-cloudliang@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-02-10 13:47:09 -05:00
David Matlack	6ff94f27fd	KVM: x86/mmu: Improve TLB flush comment in kvm_mmu_slot_remove_write_access() Rewrite the comment in kvm_mmu_slot_remove_write_access() that explains why it is safe to flush TLBs outside of the MMU lock after write-protecting SPTEs for dirty logging. The current comment is a long run-on sentence that was difficult to understand. In addition it was specific to the shadow MMU (mentioning mmu_spte_update()) when the TDP MMU has to handle this as well. The new comment explains: - Why the TLB flush is necessary at all. - Why it is desirable to do the TLB flush outside of the MMU lock. - Why it is safe to do the TLB flush outside of the MMU lock. No functional change intended. Signed-off-by: David Matlack <dmatlack@google.com> Message-Id: <20220113233020.3986005-5-dmatlack@google.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-01-19 12:09:07 -05:00
Paolo Bonzini	855fb0384a	Merge remote-tracking branch 'kvm/master' into HEAD Pick commit `fdba608f15` ("KVM: VMX: Wake vCPU when delivering posted IRQ even if vCPU == this vCPU"). In addition to fixing a bug, it also aligns the non-nested and nested usage of triggering posted interrupts, allowing for additional cleanups. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-12-21 12:51:09 -05:00
Sean Christopherson	18c841e1f4	KVM: x86: Retry page fault if MMU reload is pending and root has no sp Play nice with a NULL shadow page when checking for an obsolete root in the page fault handler by flagging the page fault as stale if there's no shadow page associated with the root and KVM_REQ_MMU_RELOAD is pending. Invalidating memslots, which is the only case where _all_ roots need to be reloaded, requests all vCPUs to reload their MMUs while holding mmu_lock for lock. The "special" roots, e.g. pae_root when KVM uses PAE paging, are not backed by a shadow page. Running with TDP disabled or with nested NPT explodes spectaculary due to dereferencing a NULL shadow page pointer. Skip the KVM_REQ_MMU_RELOAD check if there is a valid shadow page for the root. Zapping shadow pages in response to guest activity, e.g. when the guest frees a PGD, can trigger KVM_REQ_MMU_RELOAD even if the current vCPU isn't using the affected root. I.e. KVM_REQ_MMU_RELOAD can be seen with a completely valid root shadow page. This is a bit of a moot point as KVM currently unloads all roots on KVM_REQ_MMU_RELOAD, but that will be cleaned up in the future. Fixes: `a955cad84c` ("KVM: x86/mmu: Retry page fault if root is invalidated by memslot update") Cc: stable@vger.kernel.org Cc: Maxim Levitsky <mlevitsk@redhat.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20211209060552.2956723-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-12-19 19:38:58 +01:00
Lai Jiangshan	bb3b394d35	KVM: X86: Rename gpte_is_8_bytes to has_4_byte_gpte and invert the direction This bit is very close to mean "role.quadrant is not in use", except that it is false also when the MMU is mapping guest physical addresses directly. In that case, role.quadrant is indeed not in use, but there are no guest PTEs at all. Changing the name and direction of the bit removes the special case, since a guest with paging disabled, or not considering guest paging structures as is the case for two-dimensional paging, does not have to deal with 4-byte guest PTEs. Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Message-Id: <20211124122055.64424-10-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-12-08 04:25:13 -05:00
Lai Jiangshan	cc022ae144	KVM: X86: Add parameter huge_page_level to kvm_init_shadow_ept_mmu() The level of supported large page on nEPT affects the rsvds_bits_mask. Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Message-Id: <20211124122055.64424-8-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-12-08 04:25:12 -05:00
Lai Jiangshan	84ea5c09a6	KVM: X86: Add huge_page_level to __reset_rsvds_bits_mask_ept() Bit 7 on pte depends on the level of supported large page. Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Message-Id: <20211124122055.64424-7-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-12-08 04:25:11 -05:00
Lai Jiangshan	c59a0f57fa	KVM: X86: Remove mmu->translate_gpa Reduce an indirect function call (retpoline) and some intialization code. Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Message-Id: <20211124122055.64424-4-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-12-08 04:25:11 -05:00
Lai Jiangshan	1f5a21ee84	KVM: X86: Add parameter struct kvm_mmu mmu into mmu->gva_to_gpa() The mmu->gva_to_gpa() has no "struct kvm_mmu mmu", so an extra FNAME(gva_to_gpa_nested) is needed. Add the parameter can simplify the code. And it makes it explicit that the walk is upon vcpu->arch.walk_mmu for gva and vcpu->arch.mmu for L2 gpa in translate_nested_gpa() via the new parameter. Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Message-Id: <20211124122055.64424-3-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-12-08 04:25:10 -05:00
Lai Jiangshan	b46a13cb7e	KVM: X86: Calculate quadrant when !role.gpte_is_8_bytes role.quadrant is only valid when gpte size is 4 bytes and only be calculated when gpte size is 4 bytes. Although "vcpu->arch.mmu->root_level <= PT32_ROOT_LEVEL" also means gpte size is 4 bytes, but using "!role.gpte_is_8_bytes" is clearer Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Message-Id: <20211118110814.2568-15-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-12-08 04:25:10 -05:00
Lai Jiangshan	41e35604ea	KVM: X86: Remove useless code to set role.gpte_is_8_bytes when role.direct role.gpte_is_8_bytes is unused when role.direct; there is no point in changing a bit in the role, the value that was set when the MMU is initialized is just fine. Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Message-Id: <20211118110814.2568-14-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-12-08 04:25:09 -05:00
Lai Jiangshan	84432316cd	KVM: X86: Fix comment in __kvm_mmu_create() The allocation of special roots is moved to mmu_alloc_special_roots(). Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Message-Id: <20211118110814.2568-12-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-12-08 04:25:08 -05:00
Lai Jiangshan	27f4fca29f	KVM: X86: Skip allocating pae_root for vcpu->arch.guest_mmu when !tdp_enabled It is never used. Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Message-Id: <20211118110814.2568-11-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-12-08 04:25:08 -05:00
Vihas Mak	98a26b69d8	KVM: x86: change TLB flush indicator to bool change 0 to false and 1 to true to fix following cocci warnings: arch/x86/kvm/mmu/mmu.c:1485:9-10: WARNING: return of 0/1 in function 'kvm_set_pte_rmapp' with return type bool arch/x86/kvm/mmu/mmu.c:1636:10-11: WARNING: return of 0/1 in function 'kvm_test_age_rmapp' with return type bool Signed-off-by: Vihas Mak <makvihas@gmail.com> Cc: Sean Christopherson <seanjc@google.com> Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: Wanpeng Li <wanpengli@tencent.com> Cc: Jim Mattson <jmattson@google.com> Cc: Joerg Roedel <joro@8bytes.org> Message-Id: <20211114164312.GA28736@makvihas> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-12-08 04:24:44 -05:00
Ben Gardon	8283e36abf	KVM: x86/mmu: Propagate memslot const qualifier In preparation for implementing in-place hugepage promotion, various functions will need to be called from zap_collapsible_spte_range, which has the const qualifier on its memslot argument. Propagate the const qualifier to the various functions which will be needed. This just serves to simplify the following patch. No functional change intended. Signed-off-by: Ben Gardon <bgardon@google.com> Message-Id: <20211115234603.2908381-11-bgardon@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-12-08 04:24:43 -05:00
Ben Gardon	4d78d0b39a	KVM: x86/mmu: Remove need for a vcpu from mmu_try_to_unsync_pages The vCPU argument to mmu_try_to_unsync_pages is now only used to get a pointer to the associated struct kvm, so pass in the kvm pointer from the beginning to remove the need for a vCPU when calling the function. Signed-off-by: Ben Gardon <bgardon@google.com> Message-Id: <20211115234603.2908381-7-bgardon@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-12-08 04:24:42 -05:00
Ben Gardon	9d395a0a7a	KVM: x86/mmu: Remove need for a vcpu from kvm_slot_page_track_is_active kvm_slot_page_track_is_active only uses its vCPU argument to get a pointer to the assoicated struct kvm, so just pass in the struct KVM to remove the need for a vCPU pointer. No functional change intended. Signed-off-by: Ben Gardon <bgardon@google.com> Message-Id: <20211115234603.2908381-6-bgardon@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-12-08 04:24:42 -05:00
Maciej S. Szmigiero	f4209439b5	KVM: Optimize gfn lookup in kvm_zap_gfn_range() Introduce a memslots gfn upper bound operation and use it to optimize kvm_zap_gfn_range(). This way this handler can do a quick lookup for intersecting gfns and won't have to do a linear scan of the whole memslot set. Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com> Message-Id: <ef242146a87a335ee93b441dcf01665cb847c902.1638817641.git.maciej.szmigiero@oracle.com>	2021-12-08 04:24:35 -05:00
Maciej S. Szmigiero	a54d806688	KVM: Keep memslots in tree-based structures instead of array-based ones The current memslot code uses a (reverse gfn-ordered) memslot array for keeping track of them. Because the memslot array that is currently in use cannot be modified every memslot management operation (create, delete, move, change flags) has to make a copy of the whole array so it has a scratch copy to work on. Strictly speaking, however, it is only necessary to make copy of the memslot that is being modified, copying all the memslots currently present is just a limitation of the array-based memslot implementation. Two memslot sets, however, are still needed so the VM continues to run on the currently active set while the requested operation is being performed on the second, currently inactive one. In order to have two memslot sets, but only one copy of actual memslots it is necessary to split out the memslot data from the memslot sets. The memslots themselves should be also kept independent of each other so they can be individually added or deleted. These two memslot sets should normally point to the same set of memslots. They can, however, be desynchronized when performing a memslot management operation by replacing the memslot to be modified by its copy. After the operation is complete, both memslot sets once again point to the same, common set of memslot data. This commit implements the aforementioned idea. For tracking of gfns an ordinary rbtree is used since memslots cannot overlap in the guest address space and so this data structure is sufficient for ensuring that lookups are done quickly. The "last used slot" mini-caches (both per-slot set one and per-vCPU one), that keep track of the last found-by-gfn memslot, are still present in the new code. Co-developed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com> Message-Id: <17c0cf3663b760a0d3753d4ac08c0753e941b811.1638817641.git.maciej.szmigiero@oracle.com>	2021-12-08 04:24:34 -05:00
Maciej S. Szmigiero	f5756029ee	KVM: x86: Use nr_memslot_pages to avoid traversing the memslots array There is no point in recalculating from scratch the total number of pages in all memslots each time a memslot is created or deleted. Use KVM's cached nr_memslot_pages to compute the default max number of MMU pages. Note that even with nr_memslot_pages capped at ULONG_MAX we can't safely multiply it by KVM_PERMILLE_MMU_PAGES (20) since this operation can possibly overflow an unsigned long variable. Write this "* 20 / 1000" operation as "/ 50" instead to avoid such overflow. Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com> [sean: use common KVM field and rework changelog accordingly] Reviewed-by: Sean Christopherson <seanjc@google.com> Message-Id: <d14c5a24535269606675437d5602b7dac4ad8c0e.1638817640.git.maciej.szmigiero@oracle.com>	2021-12-08 04:24:29 -05:00
Sean Christopherson	a955cad84c	KVM: x86/mmu: Retry page fault if root is invalidated by memslot update Bail from the page fault handler if the root shadow page was obsoleted by a memslot update. Do the check _after_ acuiring mmu_lock, as the TDP MMU doesn't rely on the memslot/MMU generation, and instead relies on the root being explicit marked invalid by kvm_mmu_zap_all_fast(), which takes mmu_lock for write. For the TDP MMU, inserting a SPTE into an obsolete root can leak a SP if kvm_tdp_mmu_zap_invalidated_roots() has already zapped the SP, i.e. has moved past the gfn associated with the SP. For other MMUs, the resulting behavior is far more convoluted, though unlikely to be truly problematic. Installing SPs/SPTEs into the obsolete root isn't directly problematic, as the obsolete root will be unloaded and dropped before the vCPU re-enters the guest. But because the legacy MMU tracks shadow pages by their role, any SP created by the fault can can be reused in the new post-reload root. Again, that _shouldn't_ be problematic as any leaf child SPTEs will be created for the current/valid memslot generation, and kvm_mmu_get_page() will not reuse child SPs from the old generation as they will be flagged as obsolete. But, given that continuing with the fault is pointess (the root will be unloaded), apply the check to all MMUs. Fixes: `b7cccd397f` ("KVM: x86/mmu: Fast invalidation for TDP MMU") Cc: stable@vger.kernel.org Cc: Ben Gardon <bgardon@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20211120045046.3940942-5-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-12-02 04:12:12 -05:00
Sean Christopherson	f47491d7f3	KVM: x86/mmu: Handle "default" period when selectively waking kthread Account for the '0' being a default, "let KVM choose" period, when determining whether or not the recovery worker needs to be awakened in response to userspace reducing the period. Failure to do so results in the worker not being awakened properly, e.g. when changing the period from '0' to any small-ish value. Fixes: `4dfe4f40d8` ("kvm: x86: mmu: Make NX huge page recovery period configurable") Cc: stable@vger.kernel.org Cc: Junaid Shahid <junaids@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20211120015706.3830341-1-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-11-30 03:09:27 -05:00
Paolo Bonzini	28f091bc2f	KVM: MMU: shadow nested paging does not have PKU Initialize the mask for PKU permissions as if CR4.PKE=0, avoiding incorrect interpretations of the nested hypervisor's page tables. Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-11-30 03:09:26 -05:00
Sean Christopherson	4b85c921cd	KVM: x86/mmu: Remove spurious TLB flushes in TDP MMU zap collapsible path Drop the "flush" param and return values to/from the TDP MMU's helper for zapping collapsible SPTEs. Because the helper runs with mmu_lock held for read, not write, it uses tdp_mmu_zap_spte_atomic(), and the atomic zap handles the necessary remote TLB flush. Similarly, because mmu_lock is dropped and re-acquired between zapping legacy MMUs and zapping TDP MMUs, kvm_mmu_zap_collapsible_sptes() must handle remote TLB flushes from the legacy MMU before calling into the TDP MMU. Fixes: `e2209710cc` ("KVM: x86/mmu: Skip rmap operations if rmaps not allocated") Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20211120045046.3940942-4-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-11-30 03:09:25 -05:00
Lai Jiangshan	05b29633c7	KVM: X86: Use vcpu->arch.walk_mmu for kvm_mmu_invlpg() INVLPG operates on guest virtual address, which are represented by vcpu->arch.walk_mmu. In nested virtualization scenarios, kvm_mmu_invlpg() was using the wrong MMU structure; if L2's invlpg were emulated by L0 (in practice, it hardly happen) when nested two-dimensional paging is enabled, the call to ->tlb_flush_gva() would be skipped and the hardware TLB entry would not be invalidated. Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Message-Id: <20211124122055.64424-5-jiangshanlai@gmail.com> Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-11-26 08:14:21 -05:00
Lai Jiangshan	12ec33a705	KVM: X86: Fix when shadow_root_level=5 && guest root_level<4 If the is an L1 with nNPT in 32bit, the shadow walk starts with pae_root. Fixes: `a717a780fc` ("KVM: x86/mmu: Support shadowing NPT when 5-level paging is enabled in host) Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Message-Id: <20211124122055.64424-2-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-11-26 08:14:20 -05:00
Vitaly Kuznetsov	feb627e8d6	KVM: x86: Forbid KVM_SET_CPUID{,2} after KVM_RUN Commit `63f5a1909f` ("KVM: x86: Alert userspace that KVM_SET_CPUID{,2} after KVM_RUN is broken") officially deprecated KVM_SET_CPUID{,2} ioctls after first successful KVM_RUN and promissed to make this sequence forbiden in 5.16. It's time to fulfil the promise. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20211122175818.608220-3-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-11-26 08:14:20 -05:00
Hou Wenlong	8ed716ca7d	KVM: x86/mmu: Pass parameter flush as false in kvm_tdp_mmu_zap_collapsible_sptes() Since tlb flush has been done for legacy MMU before kvm_tdp_mmu_zap_collapsible_sptes(), so the parameter flush should be false for kvm_tdp_mmu_zap_collapsible_sptes(). Fixes: `e2209710cc` ("KVM: x86/mmu: Skip rmap operations if rmaps not allocated") Signed-off-by: Hou Wenlong <houwenlong93@linux.alibaba.com> Message-Id: <21453a1d2533afb6e59fb6c729af89e771ff2e76.1637140154.git.houwenlong93@linux.alibaba.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-11-18 07:05:58 -05:00
Hou Wenlong	c7785d85b6	KVM: x86/mmu: Skip tlb flush if it has been done in zap_gfn_range() If the parameter flush is set, zap_gfn_range() would flush remote tlb when yield, then tlb flush is not needed outside. So use the return value of zap_gfn_range() directly instead of OR on it in kvm_unmap_gfn_range() and kvm_tdp_mmu_unmap_gfn_range(). Fixes: `3039bcc744` ("KVM: Move x86's MMU notifier memslot walkers to generic code") Signed-off-by: Hou Wenlong <houwenlong93@linux.alibaba.com> Message-Id: <5e16546e228877a4d974f8c0e448a93d52c7a5a9.1637140154.git.houwenlong93@linux.alibaba.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-11-18 07:05:57 -05:00
Paolo Bonzini	817506df9d	Merge branch 'kvm-5.16-fixes' into kvm-master * Fixes for Xen emulation * Kill kvm_map_gfn() / kvm_unmap_gfn() and broken gfn_to_pfn_cache * Fixes for migration of 32-bit nested guests on 64-bit hypervisor * Compilation fixes * More SEV cleanups	2021-11-18 02:11:57 -05:00
Maxim Levitsky	b8453cdcf2	KVM: x86/mmu: include EFER.LMA in extended mmu role Incorporate EFER.LMA into kvm_mmu_extended_role, as it used to compute the guest root level and is not reflected in kvm_mmu_page_role.level when TDP is in use. When simply running the guest, it is impossible for EFER.LMA and kvm_mmu.root_level to get out of sync, as the guest cannot transition from PAE paging to 64-bit paging without toggling CR0.PG, i.e. without first bouncing through a different MMU context. And stuffing guest state via KVM_SET_SREGS{,2} also ensures a full MMU context reset. However, if KVM_SET_SREGS{,2} is followed by KVM_SET_NESTED_STATE, e.g. to set guest state when migrating the VM while L2 is active, the vCPU state will reflect L2, not L1. If L1 is using TDP for L2, then root_mmu will have been configured using L2's state, despite not being used for L2. If L2.EFER.LMA != L1.EFER.LMA, and L2 is using PAE paging, then root_mmu will be configured for guest PAE paging, but will match the mmu_role for 64-bit paging and cause KVM to not reconfigure root_mmu on the next nested VM-Exit. Alternatively, the root_mmu's role could be invalidated after a successful KVM_SET_NESTED_STATE that yields vcpu->arch.mmu != vcpu->arch.root_mmu, i.e. that switches the active mmu to guest_mmu, but doing so is unnecessarily tricky, and not even needed if L1 and L2 do have the same role (e.g., they are both 64-bit guests and run with the same CR4). Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20211115131837.195527-3-mlevitsk@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-11-18 02:03:42 -05:00
Paolo Bonzini	f5396f2d82	Merge branch 'kvm-5.16-fixes' into kvm-master * Fix misuse of gfn-to-pfn cache when recording guest steal time / preempted status * Fix selftests on APICv machines * Fix sparse warnings * Fix detection of KVM features in CPUID * Cleanups for bogus writes to MSR_KVM_PV_EOI_EN * Fixes and cleanups for MSR bitmap handling * Cleanups for INVPCID * Make x86 KVM_SOFT_MAX_VCPUS consistent with other architectures	2021-11-11 11:03:05 -05:00

... 5 6 7 8 9 ...

748 commits