path: root/arch/x86/kvm
Age    Commit message    Author    Files    Lines
2023-06-03KVM: SVM: vNMI pending bit is V_NMI_PENDING_MASK not V_NMI_BLOCKING_MASKMaciej S. Szmigiero1-1/+1
While testing Hyper-V enabled Windows Server 2019 guests on Zen4 hardware I noticed that with a vCPU count large enough (> 16) they sometimes froze at boot. With a vCPU count of 64 they never booted successfully - suggesting some kind of a race condition. Since adding the "vnmi=0" module parameter made these guests boot successfully it was clear that the problem is most likely (v)NMI-related. Running kvm-unit-tests quickly showed failing NMI-related test cases, like "multiple nmi" and "pending nmi" from the apic-split, x2apic and xapic tests, and the NMI parts of the eventinj test. The issue was that once one NMI was being serviced no other NMI was allowed to be set pending (NMI limit = 0), which was traced to svm_is_vnmi_pending() wrongly testing for the "NMI blocked" flag rather than for the "NMI pending" flag. Fix this by testing for the right flag in svm_is_vnmi_pending(). Once this is done, the NMI-related kvm-unit-tests pass successfully and the Windows guest no longer freezes at boot. Fixes: fa4c027a7956 ("KVM: x86: Add support for SVM's Virtual NMI") Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Link: https://lore.kernel.org/r/be4ca192eb0c1e69a210db3009ca984e6a54ae69.1684495380.git.maciej.szmigiero@oracle.com Signed-off-by: Sean Christopherson <seanjc@google.com>
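Illustrative, self-contained model of the one-flag difference described above; the bit positions and struct layout below are assumptions standing in for the vmcb int_ctl definitions, not authoritative values.

    /*
     * Hypothetical, self-contained model of the svm_is_vnmi_pending() fix.
     * The bit positions are assumed for illustration; the real masks are
     * the vmcb's V_NMI_PENDING_MASK and V_NMI_BLOCKING_MASK definitions.
     */
    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    #define V_NMI_PENDING_MASK   (1u << 11)  /* assumed bit position */
    #define V_NMI_BLOCKING_MASK  (1u << 12)  /* assumed bit position */

    struct vmcb_ctrl { uint32_t int_ctl; };

    /* Buggy: reports "pending" based on the blocking (in-service) flag. */
    static bool vnmi_pending_buggy(const struct vmcb_ctrl *c)
    {
        return c->int_ctl & V_NMI_BLOCKING_MASK;
    }

    /* Fixed: tests the pending flag, as the one-line patch does. */
    static bool vnmi_pending_fixed(const struct vmcb_ctrl *c)
    {
        return c->int_ctl & V_NMI_PENDING_MASK;
    }

    int main(void)
    {
        /* One NMI is in service (blocking set), no further NMI is pending. */
        struct vmcb_ctrl c = { .int_ctl = V_NMI_BLOCKING_MASK };

        /* The buggy check claims an NMI is pending, so no new NMI can be queued. */
        printf("buggy=%d fixed=%d\n", vnmi_pending_buggy(&c), vnmi_pending_fixed(&c));
        return 0;
    }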
2023-06-03KVM: x86/mmu: Grab memslot for correct address space in NX recovery workerSean Christopherson1-1/+4
Factor in the address space (non-SMM vs. SMM) of the target shadow page when recovering potential NX huge pages, otherwise KVM will retrieve the wrong memslot when zapping shadow pages that were created for SMM. The bug most visibly manifests as a WARN on the memslot being non-NULL, but the worst case scenario is that KVM could unaccount the shadow page without ensuring KVM won't install a huge page, i.e. if the non-SMM slot is being dirty logged, but the SMM slot is not. ------------[ cut here ]------------ WARNING: CPU: 1 PID: 3911 at arch/x86/kvm/mmu/mmu.c:7015 kvm_nx_huge_page_recovery_worker+0x38c/0x3d0 [kvm] CPU: 1 PID: 3911 Comm: kvm-nx-lpage-re RIP: 0010:kvm_nx_huge_page_recovery_worker+0x38c/0x3d0 [kvm] RSP: 0018:ffff99b284f0be68 EFLAGS: 00010246 RAX: 0000000000000000 RBX: ffff99b284edd000 RCX: 0000000000000000 RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000 RBP: ffff9271397024e0 R08: 0000000000000000 R09: ffff927139702450 R10: 0000000000000000 R11: 0000000000000001 R12: ffff99b284f0be98 R13: 0000000000000000 R14: ffff9270991fcd80 R15: 0000000000000003 FS: 0000000000000000(0000) GS:ffff927f9f640000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f0aacad3ae0 CR3: 000000088fc2c005 CR4: 00000000003726e0 Call Trace: <TASK> __pfx_kvm_nx_huge_page_recovery_worker+0x10/0x10 [kvm] kvm_vm_worker_thread+0x106/0x1c0 [kvm] kthread+0xd9/0x100 ret_from_fork+0x2c/0x50 </TASK> ---[ end trace 0000000000000000 ]--- This bug was exposed by commit edbdb43fc96b ("KVM: x86: Preserve TDP MMU roots until they are explicitly invalidated"), which allowed KVM to retain SMM TDP MMU roots effectively indefinitely. Before commit edbdb43fc96b, KVM would zap all SMM TDP MMU roots and thus all SMM TDP MMU shadow pages once all vCPUs exited SMM, which made the window where this bug (recovering an SMM NX huge page) could be encountered quite tiny. To hit the bug, the NX recovery thread would have to run while at least one vCPU was in SMM. Most VMs typically only use SMM during boot, and so the problematic shadow pages were gone by the time the NX recovery thread ran. Now that KVM preserves TDP MMU roots until they are explicitly invalidated (e.g. by a memslot deletion), the window to trigger the bug is effectively never closed because most VMMs don't delete memslots after boot (except for a handful of special scenarios). Fixes: eb298605705a ("KVM: x86/mmu: Do not recover dirty-tracked NX Huge Pages") Reported-by: Fabio Coatti <fabio.coatti@gmail.com> Closes: https://lore.kernel.org/all/CADpTngX9LESCdHVu_2mQkNGena_Ng2CphWNwsRGSMxzDsTjU2A@mail.gmail.com Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20230602010137.784664-1-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-06-02KVM: VMX: Inject #GP, not #UD, if SGX2 ENCLS leafs are unsupportedSean Christopherson1-6/+9
Per Intel's SDM, unsupported ENCLS leafs result in a #GP, not a #UD. SGX1 is a special snowflake as the SGX1 flag is used by the CPU as a "soft" disable, e.g. if software disables machine check reporting, i.e. having SGX but not SGX1 is effectively "SGX completely unsupported" and thus #UDs. Fixes: 9798adbc04cf ("KVM: VMX: Frame in ENCLS handler for SGX virtualization") Cc: Binbin Wu <binbin.wu@linux.intel.com> Cc: Kai Huang <kai.huang@intel.com> Tested-by: Kai Huang <kai.huang@intel.com> Reviewed-by: Kai Huang <kai.huang@intel.com> Link: https://lore.kernel.org/r/20230405234556.696927-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-06-02KVM: VMX: Inject #GP on ENCLS if vCPU has paging disabled (CR0.PG==0)Sean Christopherson1-1/+1
Inject a #GP when emulating/forwarding a valid ENCLS leaf if the vCPU has paging disabled, e.g. if KVM is intercepting ECREATE to enforce additional restrictions. The pseudocode in the SDM lists all #GP triggers, including CR0.PG=0, as being checked after the ENCLS-exiting checks, i.e. the VM-Exit will occur before the CPU performs the CR0.PG check. Fixes: 70210c044b4e ("KVM: VMX: Add SGX ENCLS[ECREATE] handler to enforce CPUID restrictions") Cc: Binbin Wu <binbin.wu@linux.intel.com> Cc: Kai Huang <kai.huang@intel.com> Tested-by: Kai Huang <kai.huang@intel.com> Reviewed-by: Kai Huang <kai.huang@intel.com> Link: https://lore.kernel.org/r/20230405234556.696927-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
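A minimal sketch combining the two ENCLS rules from this and the previous entry (no SGX1 at all gives #UD; otherwise an unsupported leaf or CR0.PG=0 gives #GP). The enum, struct and helper names are hypothetical, not KVM's.

    /*
     * Sketch of the exception selection for intercepted ENCLS, per the two
     * entries above.  Enum and helper names are illustrative, not KVM's.
     */
    #include <stdbool.h>

    enum fault { FAULT_NONE, FAULT_GP, FAULT_UD };

    struct encls_ctx {
        bool sgx1_supported;   /* guest CPUID advertises SGX1 */
        bool leaf_supported;   /* this ENCLS leaf is supported/enabled */
        bool cr0_pg;           /* guest CR0.PG */
    };

    static enum fault encls_fault(const struct encls_ctx *c)
    {
        /* No SGX1: ENCLS is effectively completely unsupported -> #UD. */
        if (!c->sgx1_supported)
            return FAULT_UD;

        /* Per the SDM, an unsupported leaf or CR0.PG=0 yields #GP, not #UD. */
        if (!c->leaf_supported || !c->cr0_pg)
            return FAULT_GP;

        return FAULT_NONE;
    }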
2023-06-02KVM: VMX: restore vmx_vmexit alignmentJon Kohler1-1/+1
Commit 8bd200d23ec4 ("KVM: VMX: Flatten __vmx_vcpu_run()") changed vmx_vmexit from SYM_FUNC_START to SYM_INNER_LABEL, accidentally removing 16 byte alignment as SYM_FUNC_START uses SYM_A_ALIGN and SYM_INNER_LABEL does not. Josh mentioned [1] this was unintentional. Fix by changing to SYM_INNER_LABEL_ALIGN instead. [1] https://lore.kernel.org/lkml/Y3adkSe%2FJ70PqUyt@p183 Fixes: 8bd200d23ec4 ("KVM: VMX: Flatten __vmx_vcpu_run()") Signed-off-by: Jon Kohler <jon@nutanix.com> Suggested-by: Alexey Dobriyan <adobriyan@gmail.com> CC: Josh Poimboeuf <jpoimboe@kernel.org> Acked-by: Josh Poimboeuf <jpoimboe@kernel.org> Reviewed-by: Jim Mattson <jmattson@google.com> Link: https://lore.kernel.org/r/20230531155821.80590-1-jon@nutanix.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-06-02KVM: x86: Update number of entries for KVM_GET_CPUID2 on success, not failureSean Christopherson1-9/+4
Update cpuid->nent if and only if kvm_vcpu_ioctl_get_cpuid2() succeeds. The sole caller copies @cpuid to userspace only on success, i.e. the existing code effectively does nothing. Arguably, KVM should report the number of entries when returning -E2BIG so that userspace doesn't have to guess the size, but all other similar KVM ioctls() don't report the size either, i.e. userspace is conditioned to guess. Suggested-by: Takahiro Itazuri <itazur@amazon.com> Link: https://lore.kernel.org/all/20230410141820.57328-1-itazur@amazon.com Link: https://lore.kernel.org/r/20230526210340.2799158-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
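A hypothetical user-space sketch of the pattern this commit enforces: the caller-visible count is written back only on the success path, so an -E2BIG failure leaves the input untouched.

    /*
     * Hypothetical user-space sketch: the output count is written back if
     * and only if the copy succeeds, mirroring the kvm_vcpu_ioctl_get_cpuid2()
     * change described above.
     */
    #include <errno.h>
    #include <string.h>

    struct cpuid_entry { unsigned int function, index, eax, ebx, ecx, edx; };

    struct cpuid_array {
        struct cpuid_entry *entries;
        int nent;               /* in: capacity, out: number of entries used */
    };

    static int get_cpuid2(const struct cpuid_entry *src, int src_nent,
                          struct cpuid_array *out)
    {
        if (out->nent < src_nent)
            return -E2BIG;      /* failure: out->nent deliberately untouched */

        memcpy(out->entries, src, src_nent * sizeof(*src));
        out->nent = src_nent;   /* updated only on success */
        return 0;
    }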
2023-06-02KVM: x86/pmu: Remove redundant check for MSR_IA32_DS_AREA set handlerJinrong Liang1-2/+0
After commit 2de154f541fc ("KVM: x86/pmu: Provide "error" semantics for unsupported-but-known PMU MSRs"), the guest_cpuid_has(DS) check is no longer necessary: if the guest supports X86_FEATURE_DS, the check never returns 1, and if the guest does not support the feature, the set_msr handler gets false from kvm_pmu_is_valid_msr() before reaching this point. Therefore the check can never fire and can be safely removed, which also simplifies the code and improves its readability. Signed-off-by: Jinrong Liang <cloudliang@tencent.com> Link: https://lore.kernel.org/r/20230411130338.8592-1-cloudliang@tencent.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-06-01KVM: x86: Fix poll commandJinliang Zheng1-0/+3
According to the hardware manual, when the Poll command is issued, the byte returned by the I/O read has Bit 7 set to 1 when there is an interrupt, and carries the binary code of the highest-priority level in Bits 2:0. The current PIC emulation does not implement this strictly. Fix the implementation of pic_poll_read() by setting Bit 7 when there is an interrupt. Signed-off-by: Jinliang Zheng <alexjlzheng@tencent.com> Link: https://lore.kernel.org/r/20230419021924.1342184-1-alexjlzheng@tencent.com Signed-off-by: Sean Christopherson <seanjc@google.com>
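A self-contained sketch of the poll-byte layout described above (Bit 7 = interrupt pending, Bits 2:0 = highest-priority level); field and function names are illustrative, not the kernel's i8259 code.

    /*
     * Self-contained sketch of the 8259 poll byte described above; field and
     * function names are illustrative.
     */
    #include <stdbool.h>
    #include <stdint.h>

    static uint8_t pic_poll_byte(bool irq_pending, uint8_t highest_priority)
    {
        uint8_t ret = highest_priority & 0x7;   /* Bits 2:0: priority code */

        if (irq_pending)
            ret |= 0x80;                        /* Bit 7: interrupt present */

        return ret;
    }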
2023-06-01KVM: x86: Move common handling of PAT MSR writes to kvm_set_msr_common()Sean Christopherson3-13/+7
Move the common check-and-set handling of PAT MSR writes out of vendor code and into kvm_set_msr_common(). This aligns writes with reads, which are already handled in common code, i.e. makes the handling of reads and writes symmetrical in common code. Alternatively, the common handling in kvm_get_msr_common() could be moved to vendor code, but duplicating code is generally undesirable (even though the duplicated code is trivial in this case), and guest writes to PAT should be rare, i.e. the overhead of the extra function call is a non-issue in practice. Suggested-by: Kai Huang <kai.huang@intel.com> Reviewed-by: Kai Huang <kai.huang@intel.com> Link: https://lore.kernel.org/r/20230511233351.635053-9-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
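For illustration, a stand-alone version of the kind of check a common PAT write handler performs; the per-entry rule (only memory types 0, 1, 4, 5, 6 and 7 are legal) follows the SDM, but this is a sketch rather than KVM's exact helper.

    /*
     * Illustrative PAT sanity check: every one of the eight PAT entries must
     * be a legal memory type (0, 1, 4, 5, 6 or 7).  A sketch, not KVM's
     * actual helper.
     */
    #include <stdbool.h>
    #include <stdint.h>

    static bool pat_value_valid(uint64_t pat)
    {
        for (int i = 0; i < 8; i++) {
            uint8_t memtype = (pat >> (i * 8)) & 0xff;

            if (memtype == 2 || memtype == 3 || memtype > 7)
                return false;   /* 2 and 3 are reserved, >7 is out of range */
        }
        return true;
    }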
2023-06-01KVM: x86: Make kvm_mtrr_valid() static now that there are no external usersSean Christopherson2-3/+1
Make kvm_mtrr_valid() local to mtrr.c now that it's not used to check the validity of a PAT MSR value. Reviewed-by: Kai Huang <kai.huang@intel.com> Link: https://lore.kernel.org/r/20230511233351.635053-8-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-06-01KVM: x86: Move PAT MSR handling out of mtrr.cSean Christopherson2-13/+19
Drop handling of MSR_IA32_CR_PAT from mtrr.c now that SVM and VMX handle writes without bouncing through kvm_set_msr_common(). PAT isn't truly an MTRR even though it affects memory types, and more importantly KVM enables hardware virtualization of guest PAT (by NOT setting "ignore guest PAT") when a guest has non-coherent DMA, i.e. KVM doesn't need to zap SPTEs when the guest PAT changes. The read path is and always has been trivial, i.e. burying it in the MTRR code does more harm than good. WARN and continue for the PAT case in kvm_set_msr_common(), as that code is _currently_ reached if and only if KVM is buggy. Defer cleaning up the lack of symmetry between the read and write paths to a future patch. Reviewed-by: Kai Huang <kai.huang@intel.com> Link: https://lore.kernel.org/r/20230511233351.635053-7-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-06-01KVM: x86: Use MTRR macros to define possible MTRR MSR rangesSean Christopherson2-7/+10
Use the MTRR macros to identify the ranges of possible MTRR MSRs instead of bounding the ranges with a mishmash of open coded values and unrelated MSR indices. Carving out the gap for the machine check MSRs in particular is confusing, as it's easy to incorrectly think the case statement handles MCE MSRs instead of skipping them. Drop the range-based funneling of MSRs between the end of the MCE MSRs and MTRR_DEF_TYPE, i.e. 0x2A0-0x2FF, and instead handle MTRR_DEF_TYPE as the one-off case that it is. Extract PAT (0x277) as well in anticipation of dropping PAT "handling" from the MTRR code. Keep the range-based handling for the variable+fixed MTRRs even though capturing unknown MSRs 0x214-0x24F is arguably "wrong". There is a gap in the fixed MTRRs, 0x260-0x267, i.e. the MTRR code needs to filter out unknown MSRs anyways, and using a single range generates marginally better code for the big switch statement. Reviewed-by: Kai Huang <kai.huang@intel.com> Link: https://lore.kernel.org/r/20230511233351.635053-6-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-06-01KVM: x86: Add helper to get variable MTRR range from MSR indexSean Christopherson1-10/+13
Add a helper to dedup the logic for retrieving a variable MTRR range structure given a variable MTRR MSR index. No functional change intended. Reviewed-by: Kai Huang <kai.huang@intel.com> Link: https://lore.kernel.org/r/20230511233351.635053-5-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
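A sketch of the index-to-range mapping such a helper dedups, assuming the usual layout where variable MTRRs are base/mask pairs starting at MSR 0x200; the struct and macro names are illustrative stand-ins.

    /*
     * Sketch of the MSR-to-range mapping, assuming the usual layout in which
     * variable MTRRs are base/mask pairs starting at MSR 0x200.  Names are
     * illustrative.
     */
    #include <stddef.h>
    #include <stdint.h>

    #define MTRR_PHYS_BASE_0  0x200u
    #define NR_VAR_MTRR       8

    struct mtrr_range { uint64_t base, mask; };
    static struct mtrr_range var_ranges[NR_VAR_MTRR];

    static struct mtrr_range *var_mtrr_range_from_msr(uint32_t msr)
    {
        uint32_t index = (msr - MTRR_PHYS_BASE_0) / 2;   /* pair index */

        return index < NR_VAR_MTRR ? &var_ranges[index] : NULL;
    }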
2023-06-01KVM: x86: Add helper to query if variable MTRR MSR is base (versus mask)Sean Christopherson1-7/+9
Add a helper to query whether a variable MTRR MSR is a base versus a mask MSR. Replace the unnecessarily complex math with a simple check on bit 0; base MSRs are even, mask MSRs are odd. Link: https://lore.kernel.org/r/20230511233351.635053-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
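The even/odd check from the changelog, as a tiny stand-alone helper (hypothetical name):

    /* Hypothetical helper showing the even/odd check from the changelog. */
    #include <stdbool.h>
    #include <stdint.h>

    static bool var_mtrr_msr_is_base(uint32_t msr)
    {
        /* Variable MTRRs start at the even MSR 0x200 and alternate
         * base, mask, base, mask, ... so bit 0 distinguishes them. */
        return !(msr & 1);
    }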
2023-06-01KVM: SVM: Use kvm_pat_valid() directly instead of kvm_mtrr_valid()Ke Guo1-1/+1
Use kvm_pat_valid() directly instead of bouncing through kvm_mtrr_valid(). The PAT is not an MTRR, and kvm_mtrr_valid() just redirects to kvm_pat_valid(), i.e. is exempt from KVM's "zap SPTEs" logic that's needed to honor guest MTRRs when the VM has a passthrough device with non-coherent DMA (KVM does NOT set "ignore guest PAT" in this case, and so enables hardware virtualization of the guest's PAT, i.e. doesn't need to manually emulate the PAT memtype). Signed-off-by: Ke Guo <guoke@uniontech.com> [sean: massage changelog] Reviewed-by: Kai Huang <kai.huang@intel.com> Link: https://lore.kernel.org/r/20230511233351.635053-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-06-01KVM: VMX: Open code writing vCPU's PAT in VMX's MSR handlerWenyao Hai1-5/+3
Open code setting "vcpu->arch.pat" in vmx_set_msr() instead of bouncing through kvm_set_msr_common() to get to the same code in kvm_mtrr_set_msr(). This aligns VMX with SVM, avoids hiding a very simple operation behind a relatively complicated function call (finding the PAT MSR case in kvm_set_msr_common() is non-trivial), and most importantly, makes it clear that not unwinding the VMCS updates if kvm_set_msr_common() fails isn't a bug (because kvm_set_msr_common() can never fail for PAT). Opportunistically set vcpu->arch.pat before updating the VMCS info so that a future patch can move the common bits (back) into kvm_set_msr_common() without a functional change. Note, MSR_IA32_CR_PAT is 0x277, and is very subtly handled by case 0x200 ... MSR_IA32_MC0_CTL2 - 1: in kvm_set_msr_common(). Cc: Kai Huang <kai.huang@intel.com> Signed-off-by: Wenyao Hai <haiwenyao@uniontech.com> [sean: massage changelog, hoist setting vcpu->arch.pat up] Reviewed-by: Kai Huang <kai.huang@intel.com> Link: https://lore.kernel.org/r/20230511233351.635053-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-06-01KVM: SVM: Remove TSS reloading code after VMEXITMingwei Zhang2-25/+0
Remove the dedicated post-VMEXIT TSS reloading code now that KVM uses VMLOAD to load host segment state, which includes TSS state. Fixes: e79b91bb3c91 ("KVM: SVM: use vmsave/vmload for saving/restoring additional host state") Reported-by: Venkatesh Srinivas <venkateshs@google.com> Suggested-by: Jim Mattson <jmattson@google.com> Signed-off-by: Mingwei Zhang <mizhang@google.com> Link: https://lore.kernel.org/r/20230523165635.4002711-1-mizhang@google.com [sean: massage changelog] Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-05-26KVM: VMX: Use proper accessor to read guest CR4 in handle_desc()Sean Christopherson1-1/+7
Use kvm_is_cr4_bit_set() to read guest CR4.UMIP when sanity checking that a descriptor table VM-Exit occurs if and only if guest.CR4.UMIP=1. UMIP can't be guest-owned, i.e. using kvm_read_cr4_bits() to decache guest- owned bits isn't strictly necessary, but eliminating raw reads of vcpu->arch.cr4 is desirable as it makes it easy to visually audit KVM for correctness. Opportunistically add a compile-time assertion that UMIP isn't guest-owned as letting the guest own UMIP isn't compatible with emulation (or any CR4 bit that is emulated by KVM). Opportunistically change the WARN_ON() to a ONCE variant. When the WARN fires, it fires _a lot_, and spamming the kernel logs ends up doing more harm than whatever led to KVM's unnecessary emulation. Reported-by: Robert Hoo <robert.hu@intel.com> Link: https://lore.kernel.org/all/20230310125718.1442088-4-robert.hu@intel.com Link: https://lore.kernel.org/r/20230413231914.1482782-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-05-26KVM: VMX: Treat UMIP as emulated if and only if the host doesn't have UMIPSean Christopherson3-5/+4
Advertise UMIP as emulated if and only if the host doesn't natively support UMIP, otherwise vmx_umip_emulated() is misleading when the host _does_ support UMIP. Of the four users of vmx_umip_emulated(), two already check for native support, and the logic in vmx_set_cpu_caps() is relevant if and only if UMIP isn't natively supported as UMIP is set in KVM's caps by kvm_set_cpu_caps() when UMIP is present in hardware. That leaves KVM's stuffing of X86_CR4_UMIP into the default cr4_fixed1 value enumerated for nested VMX. In that case, checking for (lack of) host support is actually a bug fix of sorts, as enumerating UMIP support based solely on descriptor table exiting works only because KVM doesn't sanity check MSR_IA32_VMX_CR4_FIXED1. E.g. if a (very theoretical) host supported UMIP in hardware but didn't allow UMIP+VMX, KVM would advertise UMIP but not actually emulate UMIP. Of course, KVM would explode long before it could run a nested VM on said theoretical CPU, as KVM doesn't modify host CR4 when enabling VMX, i.e. would load an "illegal" value into vmcs.HOST_CR4. Reported-by: Robert Hoo <robert.hu@intel.com> Link: https://lore.kernel.org/all/20230310125718.1442088-2-robert.hu@intel.com Link: https://lore.kernel.org/r/20230413231914.1482782-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-05-26KVM: VMX: Move the comment of CR4.MCE handling right above the codeXiaoyao Li1-2/+2
Move the comment about keeping the host's CR4.MCE loaded in hardware above the code that actually modifies the hardware CR4 value. No functional change intended. Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com> Link: https://lore.kernel.org/r/20230410125017.1305238-3-xiaoyao.li@intel.com [sean: elaborate in changelog] Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-05-26KVM: VMX: Use kvm_read_cr4() to get cr4 valueXiaoyao Li1-1/+1
Using vcpu->arch.cr4 directly is not recommended since it can hold a stale value when CR4 is not available (i.e. has not been decached). Use kvm_read_cr4() instead to ensure the correct value is used. Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com> Link: https://lore.kernel.org/r/20230410125017.1305238-2-xiaoyao.li@intel.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-05-26KVM: x86/mmu: Assert on @mmu in the __kvm_mmu_invalidate_addr()Like Xu1-0/+8
Add assertion to track that "mmu == vcpu->arch.mmu" is always true in the context of __kvm_mmu_invalidate_addr(). for_each_shadow_entry_using_root() and kvm_sync_spte() operate on vcpu->arch.mmu, but the only reason that doesn't cause explosions is because handle_invept() frees roots instead of doing a manual invalidation. As of now, there are no major roadblocks to switching INVEPT emulation over to use kvm_mmu_invalidate_addr(). Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Like Xu <likexu@tencent.com> Link: https://lore.kernel.org/r/20230523032947.60041-1-likexu@tencent.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-05-26KVM: x86/mmu: Add comment on try_cmpxchg64 usage in tdp_mmu_set_spte_atomicUros Bizjak1-1/+4
Commit aee98a6838d5 ("KVM: x86/mmu: Use try_cmpxchg64 in tdp_mmu_set_spte_atomic") removed the comment that iter->old_spte is updated when different logical CPU modifies the page table entry. Although this is what try_cmpxchg does implicitly, it won't hurt if this fact is explicitly mentioned in a restored comment. Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Sean Christopherson <seanjc@google.com> Cc: David Matlack <dmatlack@google.com> Signed-off-by: Uros Bizjak <ubizjak@gmail.com> Link: https://lore.kernel.org/r/20230425113932.3148-1-ubizjak@gmail.com [sean: extend comment above try_cmpxchg64()] Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-05-21KVM: VMX: add MSR_IA32_TSX_CTRL into msrs_to_saveMingwei Zhang1-1/+5
Add MSR_IA32_TSX_CTRL into msrs_to_save[] to explicitly tell userspace to save/restore the register value during migration. Missing this may cause userspace that relies on the KVM ioctl(KVM_GET_MSR_INDEX_LIST) to fail to port the value to the target VM. In addition, there is no need to add MSR_IA32_TSX_CTRL when ARCH_CAP_TSX_CTRL_MSR is not reported by kvm_get_arch_capabilities(). So add the check in kvm_probe_msr_to_save(). Fixes: c11f83e0626b ("KVM: vmx: implement MSR_IA32_TSX_CTRL disable RTM functionality") Reported-by: Jim Mattson <jmattson@google.com> Signed-off-by: Mingwei Zhang <mizhang@google.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Reviewed-by: Jim Mattson <jmattson@google.com> Message-Id: <20230509032348.1153070-1-mizhang@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
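A sketch of the probe-time filtering this commit adds; the helper below is a hypothetical stand-in for kvm_probe_msr_to_save(), while the MSR index and the ARCH_CAPABILITIES bit follow the SDM.

    /*
     * Sketch of the probe-time filter; the helper name is a stand-in for
     * kvm_probe_msr_to_save().  MSR index and ARCH_CAPABILITIES bit per the SDM.
     */
    #include <stdbool.h>
    #include <stdint.h>

    #define MSR_IA32_TSX_CTRL       0x122
    #define ARCH_CAP_TSX_CTRL_MSR   (1ull << 7)

    static bool msr_worth_saving(uint32_t msr, uint64_t host_arch_capabilities)
    {
        switch (msr) {
        case MSR_IA32_TSX_CTRL:
            /* Only report the MSR when the host actually has TSX_CTRL. */
            return host_arch_capabilities & ARCH_CAP_TSX_CTRL_MSR;
        default:
            return true;        /* checks for other MSRs elided */
        }
    }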
2023-05-21KVM: x86: Don't adjust guest's CPUID.0x12.1 (allowed SGX enclave XFRM)Sean Christopherson1-16/+0
Drop KVM's manipulation of guest's CPUID.0x12.1 ECX and EDX, i.e. the allowed XFRM of SGX enclaves, now that KVM explicitly checks the guest's allowed XCR0 when emulating ECREATE. Note, this could theoretically break a setup where userspace advertises a "bad" XFRM and relies on KVM to provide a sane CPUID model, but QEMU is the only known user of KVM SGX, and QEMU explicitly sets the SGX CPUID XFRM subleaf based on the guest's XCR0. Reviewed-by: Kai Huang <kai.huang@intel.com> Tested-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20230503160838.3412617-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-05-21KVM: VMX: Don't rely _only_ on CPUID to enforce XCR0 restrictions for ECREATESean Christopherson1-2/+9
Explicitly check the vCPU's supported XCR0 when determining whether or not the XFRM for ECREATE is valid. Checking CPUID works because KVM updates guest CPUID.0x12.1 to restrict the leaf to a subset of the guest's allowed XCR0, but that is rather subtle and KVM should not modify guest CPUID except for modeling true runtime behavior (allowed XFRM is most definitely not "runtime" behavior). Reviewed-by: Kai Huang <kai.huang@intel.com> Tested-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20230503160838.3412617-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
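A stand-alone sketch of validating the ECREATE XFRM directly against the vCPU's allowed XCR0, per the description above; the mandatory x87+SSE rule comes from the SDM, and the helper name is hypothetical.

    /*
     * Sketch of validating the ECREATE XFRM against the vCPU's allowed XCR0;
     * the helper name is hypothetical, the x87+SSE requirement is per the SDM.
     */
    #include <stdbool.h>
    #include <stdint.h>

    #define XFEATURE_MASK_FPSSE  0x3ull   /* x87 | SSE */

    static bool ecreate_xfrm_valid(uint64_t xfrm, uint64_t guest_supported_xcr0)
    {
        if ((xfrm & XFEATURE_MASK_FPSSE) != XFEATURE_MASK_FPSSE)
            return false;       /* x87 and SSE are mandatory in XFRM */

        /* Reject any feature bit the vCPU's XCR0 doesn't permit. */
        return !(xfrm & ~guest_supported_xcr0);
    }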
2023-05-05Merge tag 'kvm-x86-mmu-6.4-2' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini1-65/+56
Fix a long-standing flaw in x86's TDP MMU where unloading roots on a vCPU can result in the root being freed even though the root is completely valid and can be reused as-is (with a TLB flush).
2023-05-01Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds29-1066/+1430
Pull kvm updates from Paolo Bonzini: "s390: - More phys_to_virt conversions - Improvement of AP management for VSIE (nested virtualization) ARM64: - Numerous fixes for the pathological lock inversion issue that plagued KVM/arm64 since... forever. - New framework allowing SMCCC-compliant hypercalls to be forwarded to userspace, hopefully paving the way for some more features being moved to VMMs rather than be implemented in the kernel. - Large rework of the timer code to allow a VM-wide offset to be applied to both virtual and physical counters as well as a per-timer, per-vcpu offset that complements the global one. This last part allows the NV timer code to be implemented on top. - A small set of fixes to make sure that we don't change anything affecting the EL1&0 translation regime just after having having taken an exception to EL2 until we have executed a DSB. This ensures that speculative walks started in EL1&0 have completed. - The usual selftest fixes and improvements. x86: - Optimize CR0.WP toggling by avoiding an MMU reload when TDP is enabled, and by giving the guest control of CR0.WP when EPT is enabled on VMX (VMX-only because SVM doesn't support per-bit controls) - Add CR0/CR4 helpers to query single bits, and clean up related code where KVM was interpreting kvm_read_cr4_bits()'s "unsigned long" return as a bool - Move AMD_PSFD to cpufeatures.h and purge KVM's definition - Avoid unnecessary writes+flushes when the guest is only adding new PTEs - Overhaul .sync_page() and .invlpg() to utilize .sync_page()'s optimizations when emulating invalidations - Clean up the range-based flushing APIs - Revamp the TDP MMU's reaping of Accessed/Dirty bits to clear a single A/D bit using a LOCK AND instead of XCHG, and skip all of the "handle changed SPTE" overhead associated with writing the entire entry - Track the number of "tail" entries in a pte_list_desc to avoid having to walk (potentially) all descriptors during insertion and deletion, which gets quite expensive if the guest is spamming fork() - Disallow virtualizing legacy LBRs if architectural LBRs are available, the two are mutually exclusive in hardware - Disallow writes to immutable feature MSRs (notably PERF_CAPABILITIES) after KVM_RUN, similar to CPUID features - Overhaul the vmx_pmu_caps selftest to better validate PERF_CAPABILITIES - Apply PMU filters to emulated events and add test coverage to the pmu_event_filter selftest - AMD SVM: - Add support for virtual NMIs - Fixes for edge cases related to virtual interrupts - Intel AMX: - Don't advertise XTILE_CFG in KVM_GET_SUPPORTED_CPUID if XTILE_DATA is not being reported due to userspace not opting in via prctl() - Fix a bug in emulation of ENCLS in compatibility mode - Allow emulation of NOP and PAUSE for L2 - AMX selftests improvements - Misc cleanups MIPS: - Constify MIPS's internal callbacks (a leftover from the hardware enabling rework that landed in 6.3) Generic: - Drop unnecessary casts from "void *" throughout kvm_main.c - Tweak the layout of "struct kvm_mmu_memory_cache" to shrink the struct size by 8 bytes on 64-bit kernels by utilizing a padding hole Documentation: - Fix goof introduced by the conversion to rST" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (211 commits) KVM: s390: pci: fix virtual-physical confusion on module unload/load KVM: s390: vsie: clarifications on setting the APCB KVM: s390: interrupt: fix virtual-physical confusion for next alert GISA KVM: arm64: Have kvm_psci_vcpu_on() use WRITE_ONCE() to update mp_state KVM: arm64: 
Acquire mp_state_lock in kvm_arch_vcpu_ioctl_vcpu_init() KVM: selftests: Test the PMU event "Instructions retired" KVM: selftests: Copy full counter values from guest in PMU event filter test KVM: selftests: Use error codes to signal errors in PMU event filter test KVM: selftests: Print detailed info in PMU event filter asserts KVM: selftests: Add helpers for PMC asserts in PMU event filter test KVM: selftests: Add a common helper for the PMU event filter guest code KVM: selftests: Fix spelling mistake "perrmited" -> "permitted" KVM: arm64: vhe: Drop extra isb() on guest exit KVM: arm64: vhe: Synchronise with page table walker on MMU update KVM: arm64: pkvm: Document the side effects of kvm_flush_dcache_to_poc() KVM: arm64: nvhe: Synchronise with page table walker on TLBI KVM: arm64: Handle 32bit CNTPCTSS traps KVM: arm64: nvhe: Synchronise with page table walker on vcpu run KVM: arm64: vgic: Don't acquire its_lock before config_lock KVM: selftests: Add test to verify KVM's supported XCR0 ...
2023-04-29Merge tag 'smp-core-2023-04-27' of ↵Linus Torvalds2-0/+6
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull SMP cross-CPU function-call updates from Ingo Molnar: - Remove diagnostics and adjust config for CSD lock diagnostics - Add a generic IPI-sending tracepoint, as currently there's no easy way to instrument IPI origins: it's arch dependent and for some major architectures it's not even consistently available. * tag 'smp-core-2023-04-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: trace,smp: Trace all smp_function_call*() invocations trace: Add trace_ipi_send_cpu() sched, smp: Trace smp callback causing an IPI smp: reword smp call IPI comment treewide: Trace IPIs sent via smp_send_reschedule() irq_work: Trace self-IPIs sent via arch_irq_work_raise() smp: Trace IPIs sent via arch_send_call_function_ipi_mask() sched, smp: Trace IPIs sent via send_call_function_single_ipi() trace: Add trace_ipi_send_cpumask() kernel/smp: Make csdlock_debug= resettable locking/csd_lock: Remove per-CPU data indirection from CSD lock debugging locking/csd_lock: Remove added data from CSD lock debugging locking/csd_lock: Add Kconfig option for csd_debug default
2023-04-27KVM: x86: Preserve TDP MMU roots until they are explicitly invalidatedSean Christopherson1-65/+56
Preserve TDP MMU roots until they are explicitly invalidated by gifting the TDP MMU itself a reference to a root when it is allocated. Keeping a reference in the TDP MMU fixes a flaw where the TDP MMU exhibits terrible performance, and can potentially even soft-hang a vCPU, if a vCPU frequently unloads its roots, e.g. when KVM is emulating SMI+RSM. When KVM emulates something that invalidates _all_ TLB entries, e.g. SMI and RSM, KVM unloads all of the vCPUs roots (KVM keeps a small per-vCPU cache of previous roots). Unloading roots is a simple way to ensure KVM flushes and synchronizes all roots for the vCPU, as KVM flushes and syncs when allocating a "new" root (from the vCPU's perspective). In the shadow MMU, KVM keeps track of all shadow pages, roots included, in a per-VM hash table. Unloading a shadow MMU root just wipes it from the per-vCPU cache; the root is still tracked in the per-VM hash table. When KVM loads a "new" root for the vCPU, KVM will find the old, unloaded root in the per-VM hash table. Unlike the shadow MMU, the TDP MMU doesn't track "inactive" roots in a per-VM structure, where "active" in this case means a root is either in-use or cached as a previous root by at least one vCPU. When a TDP MMU root becomes inactive, i.e. the last vCPU reference to the root is put, KVM immediately frees the root (asterisk on "immediately" as the actual freeing may be done by a worker, but for all intents and purposes the root is gone). The TDP MMU behavior is especially problematic for 1-vCPU setups, as unloading all roots effectively frees all roots. The issue is mitigated to some degree in multi-vCPU setups as a different vCPU usually holds a reference to an unloaded root and thus keeps the root alive, allowing the vCPU to reuse its old root after unloading (with a flush+sync). The TDP MMU flaw has been known for some time, as until very recently, KVM's handling of CR0.WP also triggered unloading of all roots. The CR0.WP toggling scenario was eventually addressed by not unloading roots when _only_ CR0.WP is toggled, but such an approach doesn't Just Work for emulating SMM as KVM must emulate a full TLB flush on entry and exit to/from SMM. Given that the shadow MMU plays nice with unloading roots at will, teaching the TDP MMU to do the same is far less complex than modifying KVM to track which roots need to be flushed before reuse. Note, preserving all possible TDP MMU roots is not a concern with respect to memory consumption. Now that the role for direct MMUs doesn't include information about the guest, e.g. CR0.PG, CR0.WP, CR4.SMEP, etc., there are _at most_ six possible roots (where "guest_mode" here means L2): 1. 4-level !SMM !guest_mode 2. 4-level SMM !guest_mode 3. 5-level !SMM !guest_mode 4. 5-level SMM !guest_mode 5. 4-level !SMM guest_mode 6. 5-level !SMM guest_mode And because each vCPU can track 4 valid roots, a VM can already have all 6 root combinations live at any given time. Not to mention that, in practice, no sane VMM will advertise different guest.MAXPHYADDR values across vCPUs, i.e. KVM won't ever use both 4-level and 5-level roots for a single VM. Furthermore, the vast majority of modern hypervisors will utilize EPT/NPT when available, thus the guest_mode=%true cases are also unlikely to be utilized. 
Reported-by: Jeremi Piotrowski <jpiotrowski@linux.microsoft.com> Link: https://lore.kernel.org/all/959c5bce-beb5-b463-7158-33fc4a4f910c@linux.microsoft.com Link: https://lkml.kernel.org/r/20220209170020.1775368-1-pbonzini%40redhat.com Link: https://lore.kernel.org/all/20230322013731.102955-1-minipli@grsecurity.net Link: https://lore.kernel.org/all/000000000000a0bc2b05f9dd7fab@google.com Link: https://lore.kernel.org/all/000000000000eca0b905fa0f7756@google.com Cc: Ben Gardon <bgardon@google.com> Cc: David Matlack <dmatlack@google.com> Cc: stable@vger.kernel.org Tested-by: Jeremi Piotrowski <jpiotrowski@linux.microsoft.com> Link: https://lore.kernel.org/r/20230426220323.3079789-1-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-04-26Merge tag 'kvm-x86-vmx-6.4' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini3-40/+91
KVM VMX changes for 6.4: - Fix a bug in emulation of ENCLS in compatibility mode - Allow emulation of NOP and PAUSE for L2 - Misc cleanups
2023-04-26Merge tag 'kvm-x86-svm-6.4' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini4-52/+267
KVM SVM changes for 6.4: - Add support for virtual NMIs - Fixes for edge cases related to virtual interrupts
2023-04-26Merge tag 'kvm-x86-selftests-6.4' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini3-4/+31
KVM selftests, and an AMX/XCR0 bugfix, for 6.4: - Don't advertise XTILE_CFG in KVM_GET_SUPPORTED_CPUID if XTILE_DATA is not being reported due to userspace not opting in via prctl() - Overhaul the AMX selftests to improve coverage and cleanup the test - Misc cleanups
2023-04-26Merge tag 'kvm-x86-pmu-6.4' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini10-113/+184
KVM x86 PMU changes for 6.4: - Disallow virtualizing legacy LBRs if architectural LBRs are available, the two are mutually exclusive in hardware - Disallow writes to immutable feature MSRs (notably PERF_CAPABILITIES) after KVM_RUN, and overhaul the vmx_pmu_caps selftest to better validate PERF_CAPABILITIES - Apply PMU filters to emulated events and add test coverage to the pmu_event_filter selftest - Misc cleanups and fixes
2023-04-26Merge tag 'kvm-x86-mmu-6.4' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini12-546/+500
KVM x86 MMU changes for 6.4: - Tweak FNAME(sync_spte) to avoid unnecessary writes+flushes when the guest is only adding new PTEs - Overhaul .sync_page() and .invlpg() to share the .sync_page() implementation, i.e. utilize .sync_page()'s optimizations when emulating invalidations - Clean up the range-based flushing APIs - Revamp the TDP MMU's reaping of Accessed/Dirty bits to clear a single A/D bit using a LOCK AND instead of XCHG, and skip all of the "handle changed SPTE" overhead associated with writing the entire entry - Track the number of "tail" entries in a pte_list_desc to avoid having to walk (potentially) all descriptors during insertion and deletion, which gets quite expensive if the guest is spamming fork() - Misc cleanups
2023-04-26Merge tag 'kvm-x86-misc-6.4' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini12-52/+138
KVM x86 changes for 6.4: - Optimize CR0.WP toggling by avoiding an MMU reload when TDP is enabled, and by giving the guest control of CR0.WP when EPT is enabled on VMX (VMX-only because SVM doesn't support per-bit controls) - Add CR0/CR4 helpers to query single bits, and clean up related code where KVM was interpreting kvm_read_cr4_bits()'s "unsigned long" return as a bool - Move AMD_PSFD to cpufeatures.h and purge KVM's definition - Misc cleanups
2023-04-26Merge tag 'kvmarm-6.4' of ↵Paolo Bonzini1-1/+5
git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD KVM/arm64 updates for 6.4 - Numerous fixes for the pathological lock inversion issue that plagued KVM/arm64 since... forever. - New framework allowing SMCCC-compliant hypercalls to be forwarded to userspace, hopefully paving the way for some more features being moved to VMMs rather than be implemented in the kernel. - Large rework of the timer code to allow a VM-wide offset to be applied to both virtual and physical counters as well as a per-timer, per-vcpu offset that complements the global one. This last part allows the NV timer code to be implemented on top. - A small set of fixes to make sure that we don't change anything affecting the EL1&0 translation regime just after having taken an exception to EL2 until we have executed a DSB. This ensures that speculative walks started in EL1&0 have completed. - The usual selftest fixes and improvements.
2023-04-26Merge tag 'v6.4-p1' of ↵Linus Torvalds1-0/+1
git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6 Pull crypto updates from Herbert Xu: "API: - Total usage stats now include all that returned errors (instead of just some) - Remove maximum hash statesize limit - Add cloning support for hmac and unkeyed hashes - Demote BUG_ON in crypto_unregister_alg to a WARN_ON Algorithms: - Use RIP-relative addressing on x86 to prepare for PIE build - Add accelerated AES/GCM stitched implementation on powerpc P10 - Add some test vectors for cmac(camellia) - Remove failure case where jent is unavailable outside of FIPS mode in drbg - Add permanent and intermittent health error checks in jitter RNG Drivers: - Add support for 402xx devices in qat - Add support for HiSTB TRNG - Fix hash concurrency issues in stm32 - Add OP-TEE firmware support in caam" * tag 'v6.4-p1' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (139 commits) i2c: designware: Add doorbell support for Mendocino i2c: designware: Use PCI PSP driver for communication powerpc: Move Power10 feature PPC_MODULE_FEATURE_P10 crypto: p10-aes-gcm - Remove POWER10_CPU dependency crypto: testmgr - Add some test vectors for cmac(camellia) crypto: cryptd - Add support for cloning hashes crypto: cryptd - Convert hash to use modern init_tfm/exit_tfm crypto: hmac - Add support for cloning crypto: hash - Add crypto_clone_ahash/shash crypto: api - Add crypto_clone_tfm crypto: api - Add crypto_tfm_get crypto: x86/sha - Use local .L symbols for code crypto: x86/crc32 - Use local .L symbols for code crypto: x86/aesni - Use local .L symbols for code crypto: x86/sha256 - Use RIP-relative addressing crypto: x86/ghash - Use RIP-relative addressing crypto: x86/des3 - Use RIP-relative addressing crypto: x86/crc32c - Use RIP-relative addressing crypto: x86/cast6 - Use RIP-relative addressing crypto: x86/cast5 - Use RIP-relative addressing ...
2023-04-25Merge tag 'pull-fd' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfsLinus Torvalds1-12/+14
Pull vfs fget updates from Al Viro: "fget() to fdget() conversions" * tag 'pull-fd' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: fuse_dev_ioctl(): switch to fdget() cgroup_get_from_fd(): switch to fdget_raw() bpf: switch to fdget_raw() build_mount_idmapped(): switch to fdget() kill the last remaining user of proc_ns_fget() SVM-SEV: convert the rest of fget() uses to fdget() in there convert sgx_set_attribute() to fdget()/fdput() convert setns(2) to fdget()/fdput()
2023-04-24Merge tag 'rcu.6.4.april5.2023.3' of ↵Linus Torvalds1-1/+0
git://git.kernel.org/pub/scm/linux/kernel/git/jfern/linux Pull RCU updates from Joel Fernandes: - Updates and additions to MAINTAINERS files, with Boqun being added to the RCU entry and Zqiang being added as an RCU reviewer. I have also transitioned from reviewer to maintainer; however, Paul will be taking over sending RCU pull-requests for the next merge window. - Resolution of hotplug warning in nohz code, achieved by fixing cpu_is_hotpluggable() through interaction with the nohz subsystem. Tick dependency modifications by Zqiang, focusing on fixing usage of the TICK_DEP_BIT_RCU_EXP bitmask. - Avoid needless calls to the rcu-lazy shrinker for CONFIG_RCU_LAZY=n kernels, fixed by Zqiang. - Improvements to rcu-tasks stall reporting by Neeraj. - Initial renaming of k[v]free_rcu() to k[v]free_rcu_mightsleep() for increased robustness, affecting several components like mac802154, drbd, vmw_vmci, tracing, and more. A report by Eric Dumazet showed that the API could be unknowingly used in an atomic context, so we'd rather make sure they know what they're asking for by being explicit: https://lore.kernel.org/all/20221202052847.2623997-1-edumazet@google.com/ - Documentation updates, including corrections to spelling, clarifications in comments, and improvements to the srcu_size_state comments. - Better srcu_struct cache locality for readers, by adjusting the size of srcu_struct in support of SRCU usage by Christoph Hellwig. - Teach lockdep to detect deadlocks between srcu_read_lock() vs synchronize_srcu() contributed by Boqun. Previously lockdep could not detect such deadlocks, now it can. - Integration of rcutorture and rcu-related tools, targeted for v6.4 from Boqun's tree, featuring new SRCU deadlock scenarios, test_nmis module parameter, and more - Miscellaneous changes, various code cleanups and comment improvements * tag 'rcu.6.4.april5.2023.3' of git://git.kernel.org/pub/scm/linux/kernel/git/jfern/linux: (71 commits) checkpatch: Error out if deprecated RCU API used mac802154: Rename kfree_rcu() to kvfree_rcu_mightsleep() rcuscale: Rename kfree_rcu() to kfree_rcu_mightsleep() ext4/super: Rename kfree_rcu() to kfree_rcu_mightsleep() net/mlx5: Rename kfree_rcu() to kfree_rcu_mightsleep() net/sysctl: Rename kvfree_rcu() to kvfree_rcu_mightsleep() lib/test_vmalloc.c: Rename kvfree_rcu() to kvfree_rcu_mightsleep() tracing: Rename kvfree_rcu() to kvfree_rcu_mightsleep() misc: vmw_vmci: Rename kvfree_rcu() to kvfree_rcu_mightsleep() drbd: Rename kvfree_rcu() to kvfree_rcu_mightsleep() rcu: Protect rcu_print_task_exp_stall() ->exp_tasks access rcu: Avoid stack overflow due to __rcu_irq_enter_check_tick() being kprobe-ed rcu-tasks: Report stalls during synchronize_srcu() in rcu_tasks_postscan() rcu: Permit start_poll_synchronize_rcu_expedited() to be invoked early rcu: Remove never-set needwake assignment from rcu_report_qs_rdp() rcu: Register rcu-lazy shrinker only for CONFIG_RCU_LAZY=y kernels rcu: Fix missing TICK_DEP_MASK_RCU_EXP dependency check rcu: Fix set/clear TICK_DEP_BIT_RCU_EXP bitmask race rcu/trace: use strscpy() to instead of strncpy() tick/nohz: Fix cpu_is_hotpluggable() by checking with nohz subsystem ...
2023-04-21SVM-SEV: convert the rest of fget() uses to fdget() in thereAl Viro1-12/+14
Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2023-04-11KVM: x86: Filter out XTILE_CFG if XTILE_DATA isn't permittedSean Christopherson1-1/+17
Filter out XTILE_CFG from the supported XCR0 reported to userspace if the current process doesn't have access to XTILE_DATA. Attempting to set XTILE_CFG in XCR0 will #GP if XTILE_DATA is also not set, and so keeping XTILE_CFG as supported results in explosions if userspace feeds KVM_GET_SUPPORTED_CPUID back into KVM and the guest doesn't sanity check CPUID. Fixes: 445ecdf79be0 ("kvm: x86: Exclude unpermitted xfeatures at KVM_GET_SUPPORTED_CPUID") Reported-by: Aaron Lewis <aaronlewis@google.com> Reviewed-by: Aaron Lewis <aaronlewis@google.com> Tested-by: Aaron Lewis <aaronlewis@google.com> Link: https://lore.kernel.org/r/20230405004520.421768-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-04-11KVM: x86: Add a helper to handle filtering of unpermitted XCR0 featuresAaron Lewis3-4/+15
Add a helper, kvm_get_filtered_xcr0(), to dedup code that needs to account for XCR0 features that require explicit opt-in on a per-process basis. In addition to documenting when KVM should/shouldn't consult xstate_get_guest_group_perm(), the helper will also allow sanitizing the filtered XCR0 to avoid enumerating architecturally illegal XCR0 values, e.g. XTILE_CFG without XTILE_DATA. No functional changes intended. Signed-off-by: Aaron Lewis <aaronlewis@google.com> Reviewed-by: Mingwei Zhang <mizhang@google.com> [sean: rename helper, move to x86.h, massage changelog] Reviewed-by: Aaron Lewis <aaronlewis@google.com> Tested-by: Aaron Lewis <aaronlewis@google.com> Link: https://lore.kernel.org/r/20230405004520.421768-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
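A sketch of the filtering such a helper performs, assuming the standard XSAVE feature numbering (XTILE_CFG = bit 17, XTILE_DATA = bit 18); the function name and signature are illustrative, not KVM's.

    /*
     * Sketch of the XCR0 filtering; bit numbers follow the XSAVE feature
     * numbering (XTILE_CFG = 17, XTILE_DATA = 18), function name is illustrative.
     */
    #include <stdint.h>

    #define XFEATURE_MASK_XTILE_CFG   (1ull << 17)
    #define XFEATURE_MASK_XTILE_DATA  (1ull << 18)

    static uint64_t filtered_xcr0(uint64_t supported_xcr0, uint64_t permitted_xcr0)
    {
        uint64_t xcr0 = supported_xcr0 & permitted_xcr0;

        /* XTILE_CFG without XTILE_DATA is architecturally illegal: drop it. */
        if (!(xcr0 & XFEATURE_MASK_XTILE_DATA))
            xcr0 &= ~XFEATURE_MASK_XTILE_CFG;

        return xcr0;
    }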
2023-04-11KVM: nVMX: Emulate NOPs in L2, and PAUSE if it's not interceptedSean Christopherson1-0/+15
Extend VMX's nested intercept logic for emulated instructions to handle "pause" interception, in quotes because KVM's emulator doesn't filter out NOPs when checking for nested intercepts. Failure to allow emulation of NOPs results in KVM injecting a #UD into L2 on any NOP that collides with the emulator's definition of PAUSE, i.e. on all single-byte NOPs. For PAUSE itself, honor L1's PAUSE-exiting control, but ignore PLE to avoid unnecessarily injecting a #UD into L2. Per the SDM, the first execution of PAUSE after VM-Entry is treated as the beginning of a new loop, i.e. will never trigger a PLE VM-Exit, and so L1 can't expect any given execution of PAUSE to deterministically exit. ... the processor considers this execution to be the first execution of PAUSE in a loop. (It also does so for the first execution of PAUSE at CPL 0 after VM entry.) All that said, the PLE side of things is currently a moot point, as KVM doesn't expose PLE to L1. Note, vmx_check_intercept() is still wildly broken when L1 wants to intercept an instruction, as KVM injects a #UD instead of synthesizing a nested VM-Exit. That issue extends far beyond NOP/PAUSE and needs far more effort to fix, i.e. is a problem for the future. Fixes: 07721feee46b ("KVM: nVMX: Don't emulate instructions in guest mode") Cc: Mathias Krause <minipli@grsecurity.net> Cc: stable@vger.kernel.org Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Link: https://lore.kernel.org/r/20230405002359.418138-1-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-04-11KVM: x86/mmu: Refresh CR0.WP prior to checking for emulated permission faultsSean Christopherson2-1/+40
Refresh the MMU's snapshot of the vCPU's CR0.WP prior to checking for permission faults when emulating a guest memory access and CR0.WP may be guest owned. If the guest toggles only CR0.WP and triggers emulation of a supervisor write, e.g. when KVM is emulating UMIP, KVM may consume a stale CR0.WP, i.e. use stale protection bits metadata. Note, KVM passes through CR0.WP if and only if EPT is enabled as CR0.WP is part of the MMU role for legacy shadow paging, and SVM (NPT) doesn't support per-bit interception controls for CR0. Don't bother checking for EPT vs. NPT as the "old == new" check will always be true under NPT, i.e. the only cost is the read of vcpu->arch.cr4 (SVM unconditionally grabs CR0 from the VMCB on VM-Exit). Reported-by: Mathias Krause <minipli@grsecurity.net> Link: https://lkml.kernel.org/r/677169b4-051f-fcae-756b-9a3e1bb9f8fe%40grsecurity.net Fixes: fb509f76acc8 ("KVM: VMX: Make CR0.WP a guest owned bit") Tested-by: Mathias Krause <minipli@grsecurity.net> Link: https://lore.kernel.org/r/20230405002608.418442-1-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
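A simplified model of the refresh described above: if CR0.WP is guest-owned and the MMU's snapshot no longer matches the guest's current value, update the snapshot and rebuild the permission metadata before consulting it. All names are stand-ins for KVM's structures, not the actual implementation.

    /*
     * Simplified model of the CR0.WP refresh; all names are stand-ins for
     * KVM's MMU structures.
     */
    #include <stdbool.h>

    struct mmu_model {
        bool role_cr0_wp;   /* CR0.WP value the permission metadata was built for */
    };

    static void rebuild_permission_metadata(struct mmu_model *mmu)
    {
        (void)mmu;          /* recompute permission bitmaps; elided */
    }

    static void refresh_cr0_wp(struct mmu_model *mmu, bool cr0_wp_guest_owned,
                               bool guest_cr0_wp)
    {
        /* Nothing to do if KVM owns CR0.WP or the snapshot is still accurate. */
        if (!cr0_wp_guest_owned || mmu->role_cr0_wp == guest_cr0_wp)
            return;

        mmu->role_cr0_wp = guest_cr0_wp;
        rebuild_permission_metadata(mmu);
    }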
2023-04-11KVM: x86/mmu: Move filling of Hyper-V's TLB range struct into Hyper-V codeSean Christopherson3-11/+23
Refactor Hyper-V's range-based TLB flushing API to take a gfn+nr_pages pair instead of a struct, and bury said struct in Hyper-V specific code. Passing along two params generates much better code for the common case where KVM is _not_ running on Hyper-V, as forwarding the flush on to Hyper-V's hv_flush_remote_tlbs_range() from kvm_flush_remote_tlbs_range() becomes a tail call. Cc: David Matlack <dmatlack@google.com> Reviewed-by: David Matlack <dmatlack@google.com> Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Link: https://lore.kernel.org/r/20230405003133.419177-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-04-11KVM: x86: Rename Hyper-V remote TLB hooks to match established schemeSean Christopherson5-22/+18
Rename the Hyper-V hooks for TLB flushing to match the naming scheme used by all the other TLB flushing hooks, e.g. in kvm_x86_ops, vendor code, arch hooks from common code, etc. Reviewed-by: David Matlack <dmatlack@google.com> Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Link: https://lore.kernel.org/r/20230405003133.419177-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-04-07KVM: x86/pmu: Prevent the PMU from counting disallowed eventsAaron Lewis1-5/+8
When counting "Instructions Retired" (0xc0) in a guest, KVM will occasionally increment the PMU counter regardless of if that event is being filtered. This is because some PMU events are incremented via kvm_pmu_trigger_event(), which doesn't know about the event filter. Add the event filter to kvm_pmu_trigger_event(), so events that are disallowed do not increment their counters. Fixes: 9cd803d496e7 ("KVM: x86: Update vPMCs when retiring instructions") Signed-off-by: Aaron Lewis <aaronlewis@google.com> Reviewed-by: Like Xu <likexu@tencent.com> Link: https://lore.kernel.org/r/20230307141400.1486314-2-aaronlewis@google.com [sean: prepend "pmc" to the new function] Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-04-07KVM: x86/pmu: Fix a typo in kvm_pmu_request_counter_reprogam()Like Xu4-5/+5
Fix a "reprogam" => "reprogram" typo in kvm_pmu_request_counter_reprogam(). Fixes: 68fb4757e867 ("KVM: x86/pmu: Defer reprogram_counter() to kvm_pmu_handle_event()") Signed-off-by: Like Xu <likexu@tencent.com> Link: https://lore.kernel.org/r/20230310113349.31799-1-likexu@tencent.com [sean: trim the changelog] Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-04-07KVM: x86/pmu: Rewrite reprogram_counters() to improve performanceLike Xu1-6/+6
A valid pmc is always tested before using pmu->reprogram_pmi. Eliminate this part of the redundancy by setting the counter's bitmask directly, and in addition, trigger KVM_REQ_PMU only once to save more cpu cycles. Signed-off-by: Like Xu <likexu@tencent.com> Link: https://lore.kernel.org/r/20230214050757.9623-4-likexu@tencent.com Signed-off-by: Sean Christopherson <seanjc@google.com>
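A sketch of the reworked flow described above: mark all overflowed counters in the reprogram bitmap in one go and raise the "PMU work pending" request a single time; the struct and field names are illustrative stand-ins, not KVM's.

    /*
     * Sketch of the reworked helper: set the reprogram bits for every
     * overflowed counter directly and raise the "PMU work pending" request
     * once.  Names are illustrative stand-ins.
     */
    #include <stdint.h>

    struct pmu_model {
        uint64_t reprogram_bitmap;   /* one bit per counter index */
        int      req_pmu_pending;    /* stand-in for KVM_REQ_PMU */
    };

    static void reprogram_counters(struct pmu_model *pmu, uint64_t diff)
    {
        if (!diff)
            return;

        pmu->reprogram_bitmap |= diff;   /* mark all set bits in one go */
        pmu->req_pmu_pending = 1;        /* single request, not one per bit */
    }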