summaryrefslogtreecommitdiff
path: root/arch/x86/kvm/mmu/mmu_internal.h
AgeCommit message (Collapse)AuthorFilesLines
2024-07-12KVM: x86/mmu: Make kvm_mmu_do_page_fault() return mapped levelPaolo Bonzini1-1/+4
The guest memory population logic will need to know what page size or level (4K, 2M, ...) is mapped. Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Message-ID: <eabc3f3e5eb03b370cadf6e1901ea34d7a020adc.1712785629.git.isaku.yamahata@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-07-12KVM: x86/mmu: Account pf_{fixed,emulate,spurious} in callers of "do page fault"Sean Christopherson1-13/+0
Move the accounting of the result of kvm_mmu_do_page_fault() to its callers, as only pf_fixed is common to guest page faults and async #PFs, and upcoming support KVM_PRE_FAULT_MEMORY won't bump _any_ stats. Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-07-12KVM: x86/mmu: Bump pf_taken stat only in the "real" page fault handlerSean Christopherson1-8/+0
Account stat.pf_taken in kvm_mmu_page_fault(), i.e. the actual page fault handler, instead of conditionally bumping it in kvm_mmu_do_page_fault(). The "real" page fault handler is the only path that should ever increment the number of taken page faults, as all other paths that "do page fault" are by definition not handling faults that occurred in the guest. Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-05-07KVM: x86/mmu: Initialize kvm_page_fault's pfn and hva to error valuesSean Christopherson1-0/+3
Explicitly set "pfn" and "hva" to error values in kvm_mmu_do_page_fault() to harden KVM against using "uninitialized" values. In quotes because the fields are actually zero-initialized, and zero is a legal value for both page frame numbers and virtual addresses. E.g. failure to set "pfn" prior to creating an SPTE could result in KVM pointing at physical address '0', which is far less desirable than KVM generating a SPTE with reserved PA bits set and thus effectively killing the VM. Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Kai Huang <kai.huang@intel.com> Message-ID: <20240228024147.41573-16-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-05-07KVM: x86/mmu: Use synthetic page fault error code to indicate private faultsSean Christopherson1-1/+1
Add and use a synthetic, KVM-defined page fault error code to indicate whether a fault is to private vs. shared memory. TDX and SNP have different mechanisms for reporting private vs. shared, and KVM's software-protected VMs have no mechanism at all. Usurp an error code flag to avoid having to plumb another parameter to kvm_mmu_page_fault() and friends. Alternatively, KVM could borrow AMD's PFERR_GUEST_ENC_MASK, i.e. set it for TDX and software-protected VMs as appropriate, but that would require *clearing* the flag for SEV and SEV-ES VMs, which support encrypted memory at the hardware layer, but don't utilize private memory at the KVM layer. Opportunistically add a comment to call out that the logic for software- protected VMs is (and was before this commit) broken for nested MMUs, i.e. for nested TDP, as the GPA is an L2 GPA. Punt on trying to play nice with nested MMUs as there is a _lot_ of functionality that simply doesn't work for software-protected VMs, e.g. all of the paths where KVM accesses guest memory need to be updated to be aware of private vs. shared memory. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20240228024147.41573-6-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-05-07KVM: x86/mmu: Pass full 64-bit error code when handling page faultsIsaku Yamahata1-2/+2
Plumb the full 64-bit error code throughout the page fault handling code so that KVM can use the upper 32 bits, e.g. SNP's PFERR_GUEST_ENC_MASK will be used to determine whether or not a fault is private vs. shared. Note, passing the 64-bit error code to FNAME(walk_addr)() does NOT change the behavior of permission_fault() when invoked in the page fault path, as KVM explicitly clears PFERR_IMPLICIT_ACCESS in kvm_mmu_page_fault(). Continue passing '0' from the async #PF worker, as guest_memfd and thus private memory doesn't support async page faults. Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> [mdr: drop references/changes on rebase, update commit message] Signed-off-by: Michael Roth <michael.roth@amd.com> [sean: drop truncation in call to FNAME(walk_addr)(), rewrite changelog] Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Message-ID: <20240228024147.41573-5-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-05-07KVM: x86/mmu: Exit to userspace with -EFAULT if private fault hits emulationSean Christopherson1-0/+19
Exit to userspace with -EFAULT / KVM_EXIT_MEMORY_FAULT if a private fault triggers emulation of any kind, as KVM doesn't currently support emulating access to guest private memory. Practically speaking, private faults and emulation are already mutually exclusive, but there are many flow that can result in KVM returning RET_PF_EMULATE, and adding one last check to harden against weird, unexpected combinations and/or KVM bugs is inexpensive. Suggested-by: Yan Zhao <yan.y.zhao@intel.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20240228024147.41573-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-02-14Merge branch 'x86/bugs' into x86/core, to pick up pending changes before ↵Ingo Molnar1-1/+1
dependent patches Merge in pending alternatives patching infrastructure changes, before applying more patches. Signed-off-by: Ingo Molnar <mingo@kernel.org>
2024-01-10x86/bugs: Rename CONFIG_RETPOLINE => CONFIG_MITIGATION_RETPOLINEBreno Leitao1-1/+1
Step 5/10 of the namespace unification of CPU mitigations related Kconfig options. [ mingo: Converted a few more uses in comments/messages as well. ] Suggested-by: Josh Poimboeuf <jpoimboe@kernel.org> Signed-off-by: Breno Leitao <leitao@debian.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Ariel Miculas <amiculas@cisco.com> Acked-by: Josh Poimboeuf <jpoimboe@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/r/20231121160740.1249350-6-leitao@debian.org
2023-11-29KVM: x86/mmu: Drop non-PA bits when getting GFN for guest's PGDBinbin Wu1-0/+1
Drop non-PA bits when getting GFN for guest's PGD with the maximum theoretical mask for guest MAXPHYADDR. Do it unconditionally because it's harmless for 32-bit guests, querying 64-bit mode would be more expensive, and for EPT the mask isn't tied to guest mode. Using PT_BASE_ADDR_MASK would be technically wrong (PAE paging has 64-bit elements _except_ for CR3, which has only 32 valid bits), it wouldn't matter in practice though. Opportunistically use GENMASK_ULL() to define __PT_BASE_ADDR_MASK. Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com> Tested-by: Xuelian Guo <xuelian.guo@intel.com> Link: https://lore.kernel.org/r/20230913124227.12574-6-binbin.wu@linux.intel.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-11-14KVM: x86: Add support for "protected VMs" that can utilize private memorySean Christopherson1-0/+1
Add a new x86 VM type, KVM_X86_SW_PROTECTED_VM, to serve as a development and testing vehicle for Confidential (CoCo) VMs, and potentially to even become a "real" product in the distant future, e.g. a la pKVM. The private memory support in KVM x86 is aimed at AMD's SEV-SNP and Intel's TDX, but those technologies are extremely complex (understatement), difficult to debug, don't support running as nested guests, and require hardware that's isn't universally accessible. I.e. relying SEV-SNP or TDX for maintaining guest private memory isn't a realistic option. At the very least, KVM_X86_SW_PROTECTED_VM will enable a variety of selftests for guest_memfd and private memory support without requiring unique hardware. Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20231027182217.3615211-24-seanjc@google.com> Reviewed-by: Fuad Tabba <tabba@google.com> Tested-by: Fuad Tabba <tabba@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-11-14KVM: x86/mmu: Handle page fault for private memoryChao Peng1-0/+1
Add support for resolving page faults on guest private memory for VMs that differentiate between "shared" and "private" memory. For such VMs, KVM_MEM_GUEST_MEMFD memslots can include both fd-based private memory and hva-based shared memory, and KVM needs to map in the "correct" variant, i.e. KVM needs to map the gfn shared/private as appropriate based on the current state of the gfn's KVM_MEMORY_ATTRIBUTE_PRIVATE flag. For AMD's SEV-SNP and Intel's TDX, the guest effectively gets to request shared vs. private via a bit in the guest page tables, i.e. what the guest wants may conflict with the current memory attributes. To support such "implicit" conversion requests, exit to user with KVM_EXIT_MEMORY_FAULT to forward the request to userspace. Add a new flag for memory faults, KVM_MEMORY_EXIT_FLAG_PRIVATE, to communicate whether the guest wants to map memory as shared vs. private. Like KVM_MEMORY_ATTRIBUTE_PRIVATE, use bit 3 for flagging private memory so that KVM can use bits 0-2 for capturing RWX behavior if/when userspace needs such information, e.g. a likely user of KVM_EXIT_MEMORY_FAULT is to exit on missing mappings when handling guest page fault VM-Exits. In that case, userspace will want to know RWX information in order to correctly/precisely resolve the fault. Note, private memory *must* be backed by guest_memfd, i.e. shared mappings always come from the host userspace page tables, and private mappings always come from a guest_memfd instance. Co-developed-by: Yu Zhang <yu.c.zhang@linux.intel.com> Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com> Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com> Co-developed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Fuad Tabba <tabba@google.com> Tested-by: Fuad Tabba <tabba@google.com> Message-Id: <20231027182217.3615211-21-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-09-23KVM: x86/mmu: Stop zapping invalidated TDP MMU roots asynchronouslySean Christopherson1-8/+7
Stop zapping invalidate TDP MMU roots via work queue now that KVM preserves TDP MMU roots until they are explicitly invalidated. Zapping roots asynchronously was effectively a workaround to avoid stalling a vCPU for an extended during if a vCPU unloaded a root, which at the time happened whenever the guest toggled CR0.WP (a frequent operation for some guest kernels). While a clever hack, zapping roots via an unbound worker had subtle, unintended consequences on host scheduling, especially when zapping multiple roots, e.g. as part of a memslot. Because the work of zapping a root is no longer bound to the task that initiated the zap, things like the CPU affinity and priority of the original task get lost. Losing the affinity and priority can be especially problematic if unbound workqueues aren't affined to a small number of CPUs, as zapping multiple roots can cause KVM to heavily utilize the majority of CPUs in the system, *beyond* the CPUs KVM is already using to run vCPUs. When deleting a memslot via KVM_SET_USER_MEMORY_REGION, the async root zap can result in KVM occupying all logical CPUs for ~8ms, and result in high priority tasks not being scheduled in in a timely manner. In v5.15, which doesn't preserve unloaded roots, the issues were even more noticeable as KVM would zap roots more frequently and could occupy all CPUs for 50ms+. Consuming all CPUs for an extended duration can lead to significant jitter throughout the system, e.g. on ChromeOS with virtio-gpu, deleting memslots is a semi-frequent operation as memslots are deleted and recreated with different host virtual addresses to react to host GPU drivers allocating and freeing GPU blobs. On ChromeOS, the jitter manifests as audio blips during games due to the audio server's tasks not getting scheduled in promptly, despite the tasks having a high realtime priority. Deleting memslots isn't exactly a fast path and should be avoided when possible, and ChromeOS is working towards utilizing MAP_FIXED to avoid the memslot shenanigans, but KVM is squarely in the wrong. Not to mention that removing the async zapping eliminates a non-trivial amount of complexity. Note, one of the subtle behaviors hidden behind the async zapping is that KVM would zap invalidated roots only once (ignoring partial zaps from things like mmu_notifier events). Preserve this behavior by adding a flag to identify roots that are scheduled to be zapped versus roots that have already been zapped but not yet freed. Add a comment calling out why kvm_tdp_mmu_invalidate_all_roots() can encounter invalid roots, as it's not at all obvious why zapping invalidated roots shouldn't simply zap all invalid roots. Reported-by: Pattara Teerapong <pteerapong@google.com> Cc: David Stevens <stevensd@google.com> Cc: Yiwei Zhang<zzyiwei@google.com> Cc: Paul Hsia <paulhsia@google.com> Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20230916003916.2545000-4-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-08-31KVM: x86/mmu: Use dummy root, backed by zero page, for !visible guest rootsSean Christopherson1-0/+10
When attempting to allocate a shadow root for a !visible guest root gfn, e.g. that resides in MMIO space, load a dummy root that is backed by the zero page instead of immediately synthesizing a triple fault shutdown (using the zero page ensures any attempt to translate memory will generate a !PRESENT fault and thus VM-Exit). Unless the vCPU is racing with memslot activity, KVM will inject a page fault due to not finding a visible slot in FNAME(walk_addr_generic), i.e. the end result is mostly same, but critically KVM will inject a fault only *after* KVM runs the vCPU with the bogus root. Waiting to inject a fault until after running the vCPU fixes a bug where KVM would bail from nested VM-Enter if L1 tried to run L2 with TDP enabled and a !visible root. Even though a bad root will *probably* lead to shutdown, (a) it's not guaranteed and (b) the CPU won't read the underlying memory until after VM-Enter succeeds. E.g. if L1 runs L2 with a VMX preemption timer value of '0', then architecturally the preemption timer VM-Exit is guaranteed to occur before the CPU executes any instruction, i.e. before the CPU needs to translate a GPA to a HPA (so long as there are no injected events with higher priority than the preemption timer). If KVM manages to get to FNAME(fetch) with a dummy root, e.g. because userspace created a memslot between installing the dummy root and handling the page fault, simply unload the MMU to allocate a new root and retry the instruction. Use KVM_REQ_MMU_FREE_OBSOLETE_ROOTS to drop the root, as invoking kvm_mmu_free_roots() while holding mmu_lock would deadlock, and conceptually the dummy root has indeeed become obsolete. The only difference versus existing usage of KVM_REQ_MMU_FREE_OBSOLETE_ROOTS is that the root has become obsolete due to memslot *creation*, not memslot deletion or movement. Reported-by: Reima Ishii <ishiir@g.ecc.u-tokyo.ac.jp> Cc: Yu Zhang <yu.c.zhang@linux.intel.com> Link: https://lore.kernel.org/r/20230729005200.1057358-6-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-08-31KVM: x86/mmu: Use BUILD_BUG_ON_INVALID() for KVM_MMU_WARN_ON() stubSean Christopherson1-1/+1
Use BUILD_BUG_ON_INVALID() instead of an empty do-while loop to stub out KVM_MMU_WARN_ON() when CONFIG_KVM_PROVE_MMU=n, that way _some_ build issues with the usage of KVM_MMU_WARN_ON() will be dected even if the kernel is using the stubs, e.g. basic syntax errors will be detected. Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org> Link: https://lore.kernel.org/r/20230729004722.1056172-11-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-08-31KVM: x86/mmu: Replace MMU_DEBUG with proper KVM_PROVE_MMU KconfigSean Christopherson1-3/+1
Replace MMU_DEBUG, which requires manually modifying KVM to enable the macro, with a proper Kconfig, KVM_PROVE_MMU. Now that pgprintk() and rmap_printk() are gone, i.e. the macro guards only KVM_MMU_WARN_ON() and won't flood the kernel logs, enabling the option for debug kernels is both desirable and feasible. Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org> Link: https://lore.kernel.org/r/20230729004722.1056172-10-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-08-31KVM: x86/mmu: Convert "runtime" WARN_ON() assertions to WARN_ON_ONCE()Sean Christopherson1-1/+1
Convert all "runtime" assertions, i.e. assertions that can be triggered while running vCPUs, from WARN_ON() to WARN_ON_ONCE(). Every WARN in the MMU that is tied to running vCPUs, i.e. not contained to loading and initializing KVM, is likely to fire _a lot_ when it does trigger. E.g. if KVM ends up with a bug that causes a root to be invalidated before the page fault handler is invoked, pretty much _every_ page fault VM-Exit triggers the WARN. If a WARN is triggered frequently, the resulting spam usually causes a lot of damage of its own, e.g. consumes resources to log the WARN and pollutes the kernel log, often to the point where other useful information can be lost. In many case, the damage caused by the spam is actually worse than the bug itself, e.g. KVM can almost always recover from an unexpectedly invalid root. On the flip side, warning every time is rarely helpful for debug and triage, i.e. a single splat is usually sufficient to point a debugger in the right direction, and automated testing, e.g. syzkaller, typically runs with warn_on_panic=1, i.e. will never get past the first WARN anyways. Lastly, when an assertions fails multiple times, the stack traces in KVM are almost always identical, i.e. the full splat only needs to be captured once. And _if_ there is value in captruing information about the failed assert, a ratelimited printk() is sufficient and less likely to rack up a large amount of collateral damage. Link: https://lore.kernel.org/r/20230729004722.1056172-8-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-08-31KVM: x86/mmu: Rename MMU_WARN_ON() to KVM_MMU_WARN_ON()Sean Christopherson1-2/+2
Rename MMU_WARN_ON() to make it super obvious that the assertions are all about KVM's MMU, not the primary MMU. Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org> Link: https://lore.kernel.org/r/20230729004722.1056172-7-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-08-31KVM: x86/mmu: Delete the "dbg" module paramSean Christopherson1-2/+0
Delete KVM's "dbg" module param now that its usage in KVM is gone (it used to guard pgprintk() and rmap_printk()). Link: https://lore.kernel.org/r/20230729004722.1056172-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-08-31KVM: x86/mmu: Delete rmap_printk() and all its usageSean Christopherson1-2/+0
Delete rmap_printk() so that MMU_WARN_ON() and MMU_DEBUG can be morphed into something that can be regularly enabled for debug kernels. The information provided by rmap_printk() isn't all that useful now that the rmap and unsync code is mature, as the prints are simultaneously too verbose (_lots_ of message) and yet not verbose enough to be helpful for debug (most instances print just the SPTE pointer/value, which is rarely sufficient to root cause anything but trivial bugs). Alternatively, rmap_printk() could be reworked to into tracepoints, but it's not clear there is a real need as rmap bugs rarely escape initial development, and when bugs do escape to production, they are often edge cases and/or reside in code that isn't directly related to the rmaps. In other words, the problems with rmap_printk() being unhelpful also apply to tracepoints. And deleting rmap_printk() doesn't preclude adding tracepoints in the future. Link: https://lore.kernel.org/r/20230729004722.1056172-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-08-31KVM: x86/mmu: Delete pgprintk() and all its usageSean Christopherson1-2/+0
Delete KVM's pgprintk() and all its usage, as the code is very prone to bitrot due to being buried behind MMU_DEBUG, and the functionality has been rendered almost entirely obsolete by the tracepoints KVM has gained over the years. And for the situations where the information provided by KVM's tracepoints is insufficient, pgprintk() rarely fills in the gaps, and is almost always far too noisy, i.e. developers end up implementing custom prints anyways. Link: https://lore.kernel.org/r/20230729004722.1056172-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-08-17KVM: Allow range-based TLB invalidation from common codeDavid Matlack1-3/+0
Make kvm_flush_remote_tlbs_range() visible in common code and create a default implementation that just invalidates the whole TLB. This paves the way for several future features/cleanups: - Introduction of range-based TLBI on ARM. - Eliminating kvm_arch_flush_remote_tlbs_memslot() - Moving the KVM/x86 TDP MMU to common code. No functional change intended. Signed-off-by: David Matlack <dmatlack@google.com> Signed-off-by: Raghavendra Rao Ananta <rananta@google.com> Reviewed-by: Gavin Shan <gshan@redhat.com> Reviewed-by: Shaoqin Huang <shahuang@redhat.com> Reviewed-by: Anup Patel <anup@brainfault.org> Acked-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20230811045127.3308641-6-rananta@google.com
2023-03-18KVM: x86/mmu: Use gfn_t in kvm_flush_remote_tlbs_range()David Matlack1-1/+2
Use gfn_t instead of u64 for kvm_flush_remote_tlbs_range()'s parameters, since gfn_t is the standard type for GFNs throughout KVM. Opportunistically rename pages to nr_pages to make its role even more obvious. No functional change intended. Signed-off-by: David Matlack <dmatlack@google.com> Link: https://lore.kernel.org/r/20230126184025.2294823-6-dmatlack@google.com [sean: convert pages to gfn_t too, and rename] Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-03-18KVM: x86/mmu: Rename kvm_flush_remote_tlbs_with_address()David Matlack1-4/+3
Rename kvm_flush_remote_tlbs_with_address() to kvm_flush_remote_tlbs_range(). This name is shorter, which reduces the number of callsites that need to be broken up across multiple lines, and more readable since it conveys a range of memory is being flushed rather than a single address. No functional change intended. Signed-off-by: David Matlack <dmatlack@google.com> Link: https://lore.kernel.org/r/20230126184025.2294823-5-dmatlack@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-03-14KVM: x86/mmu: Use EMULTYPE flag to track write #PFs to shadow pagesSean Christopherson1-1/+11
Use a new EMULTYPE flag, EMULTYPE_WRITE_PF_TO_SP, to track page faults on self-changing writes to shadowed page tables instead of propagating that information to the emulator via a semi-persistent vCPU flag. Using a flag in "struct kvm_vcpu_arch" is confusing, especially as implemented, as it's not at all obvious that clearing the flag only when emulation actually occurs is correct. E.g. if KVM sets the flag and then retries the fault without ever getting to the emulator, the flag will be left set for future calls into the emulator. But because the flag is consumed if and only if both EMULTYPE_PF and EMULTYPE_ALLOW_RETRY_PF are set, and because EMULTYPE_ALLOW_RETRY_PF is deliberately not set for direct MMUs, emulated MMIO, or while L2 is active, KVM avoids false positives on a stale flag since FNAME(page_fault) is guaranteed to be run and refresh the flag before it's ultimately consumed by the tail end of reexecute_instruction(). Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20230202182817.407394-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-01-24KVM: x86/mmu: Fix wrong gfn range of tlb flushing in kvm_set_pte_rmapp()Hou Wenlong1-0/+9
When the spte of hupe page is dropped in kvm_set_pte_rmapp(), the whole gfn range covered by the spte should be flushed. However, rmap_walk_init_level() doesn't align down the gfn for new level like tdp iterator does, then the gfn used in kvm_set_pte_rmapp() is not the base gfn of huge page. And the size of gfn range is wrong too for huge page. Use the base gfn of huge page and the size of huge page for flushing tlbs for huge page. Also introduce a helper function to flush the given page (huge or not) of guest memory, which would help prevent future buggy use of kvm_flush_remote_tlbs_with_address() in such case. Fixes: c3134ce240eed ("KVM: Replace old tlb flush function with new one to flush a specified range.") Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com> Link: https://lore.kernel.org/r/0ce24d7078fa5f1f8d64b0c59826c50f32f8065e.1665214747.git.houwenlong.hwl@antgroup.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-01-24KVM: x86/mmu: Move round_gfn_for_level() helper into mmu_internal.hHou Wenlong1-0/+5
Rounding down the GFN to a huge page size is a common pattern throughout KVM, so move round_gfn_for_level() helper in tdp_iter.c to mmu_internal.h for common usage. Also rename it as gfn_round_for_level() to use gfn_* prefix and clean up the other call sites. Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com> Link: https://lore.kernel.org/r/415c64782f27444898db650e21cf28eeb6441dfa.1665214747.git.houwenlong.hwl@antgroup.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2022-12-29KVM: x86/mmu: Rename __direct_map() to direct_map()David Matlack1-1/+1
Rename __direct_map() to direct_map() since the leading underscores are unnecessary. This also makes the page fault handler names more consistent: kvm_tdp_mmu_page_fault() calls kvm_tdp_mmu_map() and direct_page_fault() calls direct_map(). Opportunistically make some trivial cleanups to comments that had to be modified anyway since they mentioned __direct_map(). Specifically, use "()" when referring to functions, and include kvm_tdp_mmu_map() among the various callers of disallowed_hugepage_adjust(). No functional change intended. Signed-off-by: David Matlack <dmatlack@google.com> Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20220921173546.2674386-11-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-12-29KVM: x86/mmu: Initialize fault.{gfn,slot} earlier for direct MMUsDavid Matlack1-0/+5
Move the initialization of fault.{gfn,slot} earlier in the page fault handling code for fully direct MMUs. This will enable a future commit to split out TDP MMU page fault handling without needing to duplicate the initialization of these 2 fields. Opportunistically take advantage of the fact that fault.gfn is initialized in kvm_tdp_page_fault() rather than recomputing it from fault->addr. No functional change intended. Signed-off-by: David Matlack <dmatlack@google.com> Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20220921173546.2674386-8-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-12-29KVM: x86/mmu: Grab mmu_invalidate_seq in kvm_faultin_pfn()David Matlack1-0/+1
Grab mmu_invalidate_seq in kvm_faultin_pfn() and stash it in struct kvm_page_fault. The eliminates duplicate code and reduces the amount of parameters needed for is_page_fault_stale(). Preemptively split out __kvm_faultin_pfn() to a separate function for use in subsequent commits. No functional change intended. Signed-off-by: David Matlack <dmatlack@google.com> Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20220921173546.2674386-4-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-11-09KVM: x86/mmu: Add helper to convert SPTE value to its shadow pageSean Christopherson1-12/+0
Add a helper to convert a SPTE to its shadow page to deduplicate a variety of flows and hopefully avoid future bugs, e.g. if KVM attempts to get the shadow page for a SPTE without dropping high bits. Opportunistically add a comment in mmu_free_root_page() documenting why it treats the root HPA as a SPTE. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20221019165618.927057-7-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-11-09KVM: x86/mmu: Set disallowed_nx_huge_page in TDP MMU before setting SPTESean Christopherson1-3/+2
Set nx_huge_page_disallowed in TDP MMU shadow pages before making the SP visible to other readers, i.e. before setting its SPTE. This will allow KVM to query the flag when determining if a shadow page can be replaced by a NX huge page without violating the rules of the mitigation. Note, the shadow/legacy MMU holds mmu_lock for write, so it's impossible for another CPU to see a shadow page without an up-to-date nx_huge_page_disallowed, i.e. only the TDP MMU needs the complicated dance. Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: David Matlack <dmatlack@google.com> Reviewed-by: Yan Zhao <yan.y.zhao@intel.com> Message-Id: <20221019165618.927057-5-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-11-09KVM: x86/mmu: Rename NX huge pages fields/functions for consistencySean Christopherson1-8/+14
Rename most of the variables/functions involved in the NX huge page mitigation to provide consistency, e.g. lpage vs huge page, and NX huge vs huge NX, and also to provide clarity, e.g. to make it obvious the flag applies only to the NX huge page mitigation, not to any condition that prevents creating a huge page. Add a comment explaining what the newly named "possible_nx_huge_pages" tracks. Leave the nx_lpage_splits stat alone as the name is ABI and thus set in stone. Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Mingwei Zhang <mizhang@google.com> Message-Id: <20221019165618.927057-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-11-09KVM: x86/mmu: Tag disallowed NX huge pages even if they're not trackedSean Christopherson1-1/+9
Tag shadow pages that cannot be replaced with an NX huge page regardless of whether or not zapping the page would allow KVM to immediately create a huge page, e.g. because something else prevents creating a huge page. I.e. track pages that are disallowed from being NX huge pages regardless of whether or not the page could have been huge at the time of fault. KVM currently tracks pages that were disallowed from being huge due to the NX workaround if and only if the page could otherwise be huge. But that fails to handled the scenario where whatever restriction prevented KVM from installing a huge page goes away, e.g. if dirty logging is disabled, the host mapping level changes, etc... Failure to tag shadow pages appropriately could theoretically lead to false negatives, e.g. if a fetch fault requests a small page and thus isn't tracked, and a read/write fault later requests a huge page, KVM will not reject the huge page as it should. To avoid yet another flag, initialize the list_head and use list_empty() to determine whether or not a page is on the list of NX huge pages that should be recovered. Note, the TDP MMU accounting is still flawed as fixing the TDP MMU is more involved due to mmu_lock being held for read. This will be addressed in a future commit. Fixes: 5bcaf3e1715f ("KVM: x86/mmu: Account NX huge page disallowed iff huge page was requested") Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20221019165618.927057-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-07-28KVM: x86/mmu: Don't require refcounted "struct page" to create huge SPTEsSean Christopherson1-1/+1
Drop the requirement that a pfn be backed by a refcounted, compound or or ZONE_DEVICE, struct page, and instead rely solely on the host page tables to identify huge pages. The PageCompound() check is a remnant of an old implementation that identified (well, attempt to identify) huge pages without walking the host page tables. The ZONE_DEVICE check was added as an exception to the PageCompound() requirement. In other words, neither check is actually a hard requirement, if the primary has a pfn backed with a huge page, then KVM can back the pfn with a huge page regardless of the backing store. Dropping the @pfn parameter will also allow KVM to query the max host mapping level without having to first get the pfn, which is advantageous for use outside of the page fault path where KVM wants to take action if and only if a page can be mapped huge, i.e. avoids the pfn lookup for gfns that can't be backed with a huge page. Cc: Mingwei Zhang <mizhang@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Mingwei Zhang <mizhang@google.com> Message-Id: <20220715232107.3775620-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-24KVM: x86/mmu: Cache the access bits of shadowed translationsDavid Matlack1-2/+15
Splitting huge pages requires allocating/finding shadow pages to replace the huge page. Shadow pages are keyed, in part, off the guest access permissions they are shadowing. For fully direct MMUs, there is no shadowing so the access bits in the shadow page role are always ACC_ALL. But during shadow paging, the guest can enforce whatever access permissions it wants. In particular, eager page splitting needs to know the permissions to use for the subpages, but KVM cannot retrieve them from the guest page tables because eager page splitting does not have a vCPU. Fortunately, the guest access permissions are easy to cache whenever page faults or FNAME(sync_page) update the shadow page tables; this is an extension of the existing cache of the shadowed GFNs in the gfns array of the shadow page. The access bits only take up 3 bits, which leaves 61 bits left over for gfns, which is more than enough. Now that the gfns array caches more information than just GFNs, rename it to shadowed_translation. While here, preemptively fix up the WARN_ON() that detects gfn mismatches in direct SPs. The WARN_ON() was paired with a pr_err_ratelimited(), which means that users could sometimes see the WARN without the accompanying error message. Fix this by outputting the error message as part of the WARN splat, and opportunistically make them WARN_ONCE() because if these ever fire, they are all but guaranteed to fire a lot and will bring down the kernel. Signed-off-by: David Matlack <dmatlack@google.com> Message-Id: <20220516232138.1783324-18-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-24KVM: x86/MMU: Allow NX huge pages to be disabled on a per-vm basisBen Gardon1-3/+4
In some cases, the NX hugepage mitigation for iTLB multihit is not needed for all guests on a host. Allow disabling the mitigation on a per-VM basis to avoid the performance hit of NX hugepages on trusted workloads. In order to disable NX hugepages on a VM, ensure that the userspace actor has permission to reboot the system. Since disabling NX hugepages would allow a guest to crash the system, it is similar to reboot permissions. Ideally, KVM would require userspace to prove it has access to KVM's nx_huge_pages module param, e.g. so that userspace can opt out without needing full reboot permissions. But getting access to the module param file info is difficult because it is buried in layers of sysfs and module glue. Requiring CAP_SYS_BOOT is sufficient for all known use cases. Suggested-by: Jim Mattson <jmattson@google.com> Reviewed-by: David Matlack <dmatlack@google.com> Reviewed-by: Peter Xu <peterx@redhat.com> Signed-off-by: Ben Gardon <bgardon@google.com> Message-Id: <20220613212523.3436117-9-bgardon@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-20KVM: x86/mmu: Dedup macros for computing various page table masksSean Christopherson1-0/+14
Provide common helper macros to generate various masks, shifts, etc... for 32-bit vs. 64-bit page tables. Only the inputs differ, the actual calculations are identical. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220614233328.3896033-5-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-05-12KVM: x86/mmu: Expand and clean up page fault statsSean Christopherson1-2/+26
Expand and clean up the page fault stats. The current stats are at best incomplete, and at worst misleading. Differentiate between faults that are actually fixed vs those that result in an MMIO SPTE being created, track faults that are spurious, faults that trigger emulation, faults that that are fixed in the fast path, and last but not least, track the number of faults that are taken. Note, the number of faults that require emulation for write-protected shadow pages can roughly be calculated by subtracting the number of MMIO SPTEs created from the overall number of faults that trigger emulation. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220423034752.1161007-10-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-05-12KVM: x86/mmu: Use IS_ENABLED() to avoid RETPOLINE for TDP page faultsSean Christopherson1-3/+3
Use IS_ENABLED() instead of an #ifdef to activate the anti-RETPOLINE fast path for TDP page faults. The generated code is identical, and the #ifdef makes it dangerously difficult to extend the logic (guess who forgot to add an "else" inside the #ifdef and ran through the page fault handler twice). No functional or binary change intented. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220423034752.1161007-9-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-05-12KVM: x86/mmu: Make all page fault handlers internal to the MMUSean Christopherson1-1/+89
Move kvm_arch_async_page_ready() to mmu.c where it belongs, and move all of the page fault handling collateral that was in mmu.h purely for the async #PF handler into mmu_internal.h, where it belongs. This will allow kvm_mmu_do_page_fault() to act on the RET_PF_* return without having to expose those enums outside of the MMU. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220423034752.1161007-8-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-05-12KVM: x86/mmu: Add RET_PF_CONTINUE to eliminate bool+int* "returns"Sean Christopherson1-1/+8
Add RET_PF_CONTINUE and use it in handle_abnormal_pfn() and kvm_faultin_pfn() to signal that the page fault handler should continue doing its thing. Aside from being gross and inefficient, using a boolean return to signal continue vs. stop makes it extremely difficult to add more helpers and/or move existing code to a helper. E.g. hypothetically, if nested MMUs were to gain a separate page fault handler in the future, everything up to the "is self-modifying PTE" check can be shared by all shadow MMUs, but communicating up the stack whether to continue on or stop becomes a nightmare. More concretely, proposed support for private guest memory ran into a similar issue, where it'll be forced to forego a helper in order to yield sane code: https://lore.kernel.org/all/YkJbxiL%2FAz7olWlq@google.com. No functional change intended. Cc: David Matlack <dmatlack@google.com> Cc: Chao Peng <chao.p.peng@linux.intel.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220423034752.1161007-7-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-03-08KVM: x86/mmu: Zap invalidated roots via asynchronous workerPaolo Bonzini1-1/+7
Use the system worker threads to zap the roots invalidated by the TDP MMU's "fast zap" mechanism, implemented by kvm_tdp_mmu_invalidate_all_roots(). At this point, apart from allowing some parallelism in the zapping of roots, the workqueue is a glorified linked list: work items are added and flushed entirely within a single kvm->slots_lock critical section. However, the workqueue fixes a latent issue where kvm_mmu_zap_all_invalidated_roots() assumes that it owns a reference to all invalid roots; therefore, no one can set the invalid bit outside kvm_mmu_zap_all_fast(). Putting the invalidated roots on a linked list... erm, on a workqueue ensures that tdp_mmu_zap_root_work() only puts back those extra references that kvm_mmu_zap_all_invalidated_roots() had gifted to it. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-03-08KVM: x86/mmu: Zap only the target TDP MMU shadow page in NX recoverySean Christopherson1-1/+6
When recovering a potential hugepage that was shattered for the iTLB multihit workaround, precisely zap only the target page instead of iterating over the TDP MMU to find the SP that was passed in. This will allow future simplification of zap_gfn_range() by having it zap only leaf SPTEs. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220226001546.360188-14-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-12-08KVM: x86/mmu: Propagate memslot const qualifierBen Gardon1-1/+1
In preparation for implementing in-place hugepage promotion, various functions will need to be called from zap_collapsible_spte_range, which has the const qualifier on its memslot argument. Propagate the const qualifier to the various functions which will be needed. This just serves to simplify the following patch. No functional change intended. Signed-off-by: Ben Gardon <bgardon@google.com> Message-Id: <20211115234603.2908381-11-bgardon@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-12-08KVM: x86/mmu: Remove need for a vcpu from mmu_try_to_unsync_pagesBen Gardon1-1/+1
The vCPU argument to mmu_try_to_unsync_pages is now only used to get a pointer to the associated struct kvm, so pass in the kvm pointer from the beginning to remove the need for a vCPU when calling the function. Signed-off-by: Ben Gardon <bgardon@google.com> Message-Id: <20211115234603.2908381-7-bgardon@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-12-08KVM: x86/mmu: Use shadow page role to detect PML-unfriendly pages for L2Sean Christopherson1-4/+3
Rework make_spte() to query the shadow page's role, specifically whether or not it's a guest_mode page, a.k.a. a page for L2, when determining if the SPTE is compatible with PML. This eliminates a dependency on @vcpu, with a future goal of being able to create SPTEs without a specific vCPU. Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-10-22KVM: x86/mmu: clean up prefetch/prefault/speculative namingPaolo Bonzini1-1/+1
"prefetch", "prefault" and "speculative" are used throughout KVM to mean the same thing. Use a single name, standardizing on "prefetch" which is already used by various functions such as direct_pte_prefetch, FNAME(prefetch_gpte), FNAME(pte_prefetch), etc. Suggested-by: David Matlack <dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-10-01KVM: x86/mmu: Avoid memslot lookup in make_spte and mmu_try_to_unsync_pagesDavid Matlack1-2/+2
mmu_try_to_unsync_pages checks if page tracking is active for the given gfn, which requires knowing the memslot. We can pass down the memslot via make_spte to avoid this lookup. The memslot is also handy for make_spte's marking of the gfn as dirty: we can test whether dirty page tracking is enabled, and if so ensure that pages are mapped as writable with 4K granularity. Apart from the warning, no functional change is intended. Signed-off-by: David Matlack <dmatlack@google.com> Message-Id: <20210813203504.2742757-7-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-10-01KVM: MMU: clean up make_spte return valuePaolo Bonzini1-5/+0
Now that make_spte is called directly by the shadow MMU (rather than wrapped by set_spte), it only has to return one boolean value. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>