summaryrefslogtreecommitdiff
path: root/arch/x86/kvm
AgeCommit message (Collapse)AuthorFilesLines
2019-11-14KVM: x86/mmu: Take slots_lock when using kvm_mmu_zap_all_fast()Sean Christopherson1-3/+2
Acquire the per-VM slots_lock when zapping all shadow pages as part of toggling nx_huge_pages. The fast zap algorithm relies on exclusivity (via slots_lock) to identify obsolete vs. valid shadow pages, because it uses a single bit for its generation number. Holding slots_lock also obviates the need to acquire a read lock on the VM's srcu. Failing to take slots_lock when toggling nx_huge_pages allows multiple instances of kvm_mmu_zap_all_fast() to run concurrently, as the other user, KVM_SET_USER_MEMORY_REGION, does not take the global kvm_lock. (kvm_mmu_zap_all_fast() does take kvm->mmu_lock, but it can be temporarily dropped by kvm_zap_obsolete_pages(), so it is not enough to enforce exclusivity). Concurrent fast zap instances causes obsolete shadow pages to be incorrectly identified as valid due to the single bit generation number wrapping, which results in stale shadow pages being left in KVM's MMU and leads to all sorts of undesirable behavior. The bug is easily confirmed by running with CONFIG_PROVE_LOCKING and toggling nx_huge_pages via its module param. Note, until commit 4ae5acbc4936 ("KVM: x86/mmu: Take slots_lock when using kvm_mmu_zap_all_fast()", 2019-11-13) the fast zap algorithm used an ulong-sized generation instead of relying on exclusivity for correctness, but all callers except the recently added set_nx_huge_pages() needed to hold slots_lock anyways. Therefore, this patch does not have to be backported to stable kernels. Given that toggling nx_huge_pages is by no means a fast path, force it to conform to the current approach instead of reintroducing the previous generation count. Fixes: b8e8c8303ff28 ("kvm: mmu: ITLB_MULTIHIT mitigation", but NOT FOR STABLE) Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-11-13KVM: X86: Reset the three MSR list number variables to 0 in kvm_init_msr_list()Xiaoyao Li1-0/+4
When applying commit 7a5ee6edb42e ("KVM: X86: Fix initialization of MSR lists"), it forgot to reset the three MSR lists number varialbes to 0 while removing the useless conditionals. Fixes: 7a5ee6edb42e (KVM: X86: Fix initialization of MSR lists) Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-11-13kvm: x86: disable shattered huge page recovery for PREEMPT_RT.Paolo Bonzini1-0/+5
If a huge page is recovered (and becomes no executable) while another thread is executing it, the resulting contention on mmu_lock can cause latency spikes. Disabling recovery for PREEMPT_RT kernels fixes this issue. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-11-13Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds4-37/+61
Pull kvm fixes from Paolo Bonzini: "Fix unwinding of KVM_CREATE_VM failure, VT-d posted interrupts, DAX/ZONE_DEVICE, and module unload/reload" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: KVM: MMU: Do not treat ZONE_DEVICE pages as being reserved KVM: VMX: Introduce pi_is_pir_empty() helper KVM: VMX: Do not change PID.NDST when loading a blocked vCPU KVM: VMX: Consider PID.PIR to determine if vCPU has pending interrupts KVM: VMX: Fix comment to specify PID.ON instead of PIR.ON KVM: X86: Fix initialization of MSR lists KVM: fix placement of refcount initialization KVM: Fix NULL-ptr deref after kvm_create_vm fails
2019-11-12KVM: MMU: Do not treat ZONE_DEVICE pages as being reservedSean Christopherson1-4/+4
Explicitly exempt ZONE_DEVICE pages from kvm_is_reserved_pfn() and instead manually handle ZONE_DEVICE on a case-by-case basis. For things like page refcounts, KVM needs to treat ZONE_DEVICE pages like normal pages, e.g. put pages grabbed via gup(). But for flows such as setting A/D bits or shifting refcounts for transparent huge pages, KVM needs to to avoid processing ZONE_DEVICE pages as the flows in question lack the underlying machinery for proper handling of ZONE_DEVICE pages. This fixes a hang reported by Adam Borowski[*] in dev_pagemap_cleanup() when running a KVM guest backed with /dev/dax memory, as KVM straight up doesn't put any references to ZONE_DEVICE pages acquired by gup(). Note, Dan Williams proposed an alternative solution of doing put_page() on ZONE_DEVICE pages immediately after gup() in order to simplify the auditing needed to ensure is_zone_device_page() is called if and only if the backing device is pinned (via gup()). But that approach would break kvm_vcpu_{un}map() as KVM requires the page to be pinned from map() 'til unmap() when accessing guest memory, unlike KVM's secondary MMU, which coordinates with mmu_notifier invalidations to avoid creating stale page references, i.e. doesn't rely on pages being pinned. [*] http://lkml.kernel.org/r/20190919115547.GA17963@angband.pl Reported-by: Adam Borowski <kilobyte@angband.pl> Analyzed-by: David Hildenbrand <david@redhat.com> Acked-by: Dan Williams <dan.j.williams@intel.com> Cc: stable@vger.kernel.org Fixes: 3565fce3a659 ("mm, x86: get_user_pages() for dax mappings") Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-11-12KVM: VMX: Introduce pi_is_pir_empty() helperJoao Martins2-3/+7
Streamline the PID.PIR check and change its call sites to use the newly added helper. Suggested-by: Liran Alon <liran.alon@oracle.com> Signed-off-by: Joao Martins <joao.m.martins@oracle.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-11-12KVM: VMX: Do not change PID.NDST when loading a blocked vCPUJoao Martins2-0/+20
When vCPU enters block phase, pi_pre_block() inserts vCPU to a per pCPU linked list of all vCPUs that are blocked on this pCPU. Afterwards, it changes PID.NV to POSTED_INTR_WAKEUP_VECTOR which its handler (wakeup_handler()) is responsible to kick (unblock) any vCPU on that linked list that now has pending posted interrupts. While vCPU is blocked (in kvm_vcpu_block()), it may be preempted which will cause vmx_vcpu_pi_put() to set PID.SN. If later the vCPU will be scheduled to run on a different pCPU, vmx_vcpu_pi_load() will clear PID.SN but will also *overwrite PID.NDST to this different pCPU*. Instead of keeping it with original pCPU which vCPU had entered block phase on. This results in an issue because when a posted interrupt is delivered, as the wakeup_handler() will be executed and fail to find blocked vCPU on its per pCPU linked list of all vCPUs that are blocked on this pCPU. Which is due to the vCPU being placed on a *different* per pCPU linked list i.e. the original pCPU in which it entered block phase. The regression is introduced by commit c112b5f50232 ("KVM: x86: Recompute PID.ON when clearing PID.SN"). Therefore, partially revert it and reintroduce the condition in vmx_vcpu_pi_load() responsible for avoiding changing PID.NDST when loading a blocked vCPU. Fixes: c112b5f50232 ("KVM: x86: Recompute PID.ON when clearing PID.SN") Tested-by: Nathan Ni <nathan.ni@oracle.com> Co-developed-by: Liran Alon <liran.alon@oracle.com> Signed-off-by: Liran Alon <liran.alon@oracle.com> Signed-off-by: Joao Martins <joao.m.martins@oracle.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-11-12KVM: VMX: Consider PID.PIR to determine if vCPU has pending interruptsJoao Martins1-1/+5
Commit 17e433b54393 ("KVM: Fix leak vCPU's VMCS value into other pCPU") introduced vmx_dy_apicv_has_pending_interrupt() in order to determine if a vCPU have a pending posted interrupt. This routine is used by kvm_vcpu_on_spin() when searching for a a new runnable vCPU to schedule on pCPU instead of a vCPU doing busy loop. vmx_dy_apicv_has_pending_interrupt() determines if a vCPU has a pending posted interrupt solely based on PID.ON. However, when a vCPU is preempted, vmx_vcpu_pi_put() sets PID.SN which cause raised posted interrupts to only set bit in PID.PIR without setting PID.ON (and without sending notification vector), as depicted in VT-d manual section 5.2.3 "Interrupt-Posting Hardware Operation". Therefore, checking PID.ON is insufficient to determine if a vCPU has pending posted interrupts and instead we should also check if there is some bit set on PID.PIR if PID.SN=1. Fixes: 17e433b54393 ("KVM: Fix leak vCPU's VMCS value into other pCPU") Reviewed-by: Jagannathan Raman <jag.raman@oracle.com> Co-developed-by: Liran Alon <liran.alon@oracle.com> Signed-off-by: Liran Alon <liran.alon@oracle.com> Signed-off-by: Joao Martins <joao.m.martins@oracle.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-11-12KVM: VMX: Fix comment to specify PID.ON instead of PIR.ONLiran Alon1-1/+1
The Outstanding Notification (ON) bit is part of the Posted Interrupt Descriptor (PID) as opposed to the Posted Interrupts Register (PIR). The latter is a bitmap for pending vectors. Reviewed-by: Joao Martins <joao.m.martins@oracle.com> Signed-off-by: Liran Alon <liran.alon@oracle.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-11-12KVM: X86: Fix initialization of MSR listsChenyi Qiang1-30/+26
The three MSR lists(msrs_to_save[], emulated_msrs[] and msr_based_features[]) are global arrays of kvm.ko, which are adjusted (copy supported MSRs forward to override the unsupported MSRs) when insmod kvm-{intel,amd}.ko, but it doesn't reset these three arrays to their initial value when rmmod kvm-{intel,amd}.ko. Thus, at the next installation, kvm-{intel,amd}.ko will do operations on the modified arrays with some MSRs lost and some MSRs duplicated. So define three constant arrays to hold the initial MSR lists and initialize msrs_to_save[], emulated_msrs[] and msr_based_features[] based on the constant arrays. Cc: stable@vger.kernel.org Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com> [Remove now useless conditionals. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-11-04kvm: x86: mmu: Recovery of shattered NX large pagesJunaid Shahid3-0/+144
The page table pages corresponding to broken down large pages are zapped in FIFO order, so that the large page can potentially be recovered, if it is not longer being used for execution. This removes the performance penalty for walking deeper EPT page tables. By default, one large page will last about one hour once the guest reaches a steady state. Signed-off-by: Junaid Shahid <junaids@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2019-11-04kvm: mmu: ITLB_MULTIHIT mitigationPaolo Bonzini3-12/+167
With some Intel processors, putting the same virtual address in the TLB as both a 4 KiB and 2 MiB page can confuse the instruction fetch unit and cause the processor to issue a machine check resulting in a CPU lockup. Unfortunately when EPT page tables use huge pages, it is possible for a malicious guest to cause this situation. Add a knob to mark huge pages as non-executable. When the nx_huge_pages parameter is enabled (and we are using EPT), all huge pages are marked as NX. If the guest attempts to execute in one of those pages, the page is broken down into 4K pages, which are then marked executable. This is not an issue for shadow paging (except nested EPT), because then the host is in control of TLB flushes and the problematic situation cannot happen. With nested EPT, again the nested guest can cause problems shadow and direct EPT is treated in the same way. [ tglx: Fixup default to auto and massage wording a bit ] Originally-by: Junaid Shahid <junaids@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2019-11-04Merge git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linuxThomas Gleixner2-13/+11
to pick up the KVM fix which is required for the NX series.
2019-10-31KVM: vmx, svm: always run with EFER.NXE=1 when shadow paging is activePaolo Bonzini2-13/+11
VMX already does so if the host has SMEP, in order to support the combination of CR0.WP=1 and CR4.SMEP=1. However, it is perfectly safe to always do so, and in fact VMX already ends up running with EFER.NXE=1 on old processors that lack the "load EFER" controls, because it may help avoiding a slow MSR write. Removing all the conditionals simplifies the code. SVM does not have similar code, but it should since recent AMD processors do support SMEP. So this patch also makes the code for the two vendors more similar while fixing NPT=0, CR0.WP=1 and CR4.SMEP=1 on AMD processors. Cc: stable@vger.kernel.org Cc: Joerg Roedel <jroedel@suse.de> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-10-28kvm/x86: Export MDS_NO=0 to guests when TSX is enabledPawan Gupta1-0/+19
Export the IA32_ARCH_CAPABILITIES MSR bit MDS_NO=0 to guests on TSX Async Abort(TAA) affected hosts that have TSX enabled and updated microcode. This is required so that the guests don't complain, "Vulnerable: Clear CPU buffers attempted, no microcode" when the host has the updated microcode to clear CPU buffers. Microcode update also adds support for MSR_IA32_TSX_CTRL which is enumerated by the ARCH_CAP_TSX_CTRL bit in IA32_ARCH_CAPABILITIES MSR. Guests can't do this check themselves when the ARCH_CAP_TSX_CTRL bit is not exported to the guests. In this case export MDS_NO=0 to the guests. When guests have CPUID.MD_CLEAR=1, they deploy MDS mitigation which also mitigates TAA. Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Neelima Krishnan <neelima.krishnan@intel.com> Reviewed-by: Tony Luck <tony.luck@intel.com> Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
2019-10-22KVM: nVMX: Don't leak L1 MMIO regions to L2Jim Mattson3-31/+54
If the "virtualize APIC accesses" VM-execution control is set in the VMCS, the APIC virtualization hardware is triggered when a page walk in VMX non-root mode terminates at a PTE wherein the address of the 4k page frame matches the APIC-access address specified in the VMCS. On hardware, the APIC-access address may be any valid 4k-aligned physical address. KVM's nVMX implementation enforces the additional constraint that the APIC-access address specified in the vmcs12 must be backed by a "struct page" in L1. If not, L0 will simply clear the "virtualize APIC accesses" VM-execution control in the vmcs02. The problem with this approach is that the L1 guest has arranged the vmcs12 EPT tables--or shadow page tables, if the "enable EPT" VM-execution control is clear in the vmcs12--so that the L2 guest physical address(es)--or L2 guest linear address(es)--that reference the L2 APIC map to the APIC-access address specified in the vmcs12. Without the "virtualize APIC accesses" VM-execution control in the vmcs02, the APIC accesses in the L2 guest will directly access the APIC-access page in L1. When there is no mapping whatsoever for the APIC-access address in L1, the L2 VM just loses the intended APIC virtualization. However, when the APIC-access address is mapped to an MMIO region in L1, the L2 guest gets direct access to the L1 MMIO device. For example, if the APIC-access address specified in the vmcs12 is 0xfee00000, then L2 gets direct access to L1's APIC. Since this vmcs12 configuration is something that KVM cannot faithfully emulate, the appropriate response is to exit to userspace with KVM_INTERNAL_ERROR_EMULATION. Fixes: fe3ef05c7572 ("KVM: nVMX: Prepare vmcs02 from vmcs01 and vmcs12") Reported-by: Dan Cross <dcross@google.com> Signed-off-by: Jim Mattson <jmattson@google.com> Reviewed-by: Peter Shier <pshier@google.com> Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-10-22KVM: SVM: Fix potential wrong physical id in avic_handle_ldr_updateMiaohe Lin3-8/+8
Guest physical APIC ID may not equal to vcpu->vcpu_id in some case. We may set the wrong physical id in avic_handle_ldr_update as we always use vcpu->vcpu_id. Get physical APIC ID from vAPIC page instead. Export and use kvm_xapic_id here and in avic_handle_apic_id_update as suggested by Vitaly. Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-10-22kvm: clear kvmclock MSR on resetPaolo Bonzini1-5/+3
After resetting the vCPU, the kvmclock MSR keeps the previous value but it is not enabled. This can be confusing, so fix it. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-10-22KVM: x86: fix bugon.cocci warningskbuild test robot1-2/+1
Use BUG_ON instead of a if condition followed by BUG. Generated by: scripts/coccinelle/misc/bugon.cocci Fixes: 4b526de50e39 ("KVM: x86: Check kvm_rebooting in kvm_spurious_fault()") CC: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: kbuild test robot <lkp@intel.com> Signed-off-by: Julia Lawall <julia.lawall@lip6.fr> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-10-22KVM: VMX: Remove specialized handling of unexpected exit-reasonsLiran Alon1-12/+0
Commit bf653b78f960 ("KVM: vmx: Introduce handle_unexpected_vmexit and handle WAITPKG vmexit") introduced specialized handling of specific exit-reasons that should not be raised by CPU because KVM configures VMCS such that they should never be raised. However, since commit 7396d337cfad ("KVM: x86: Return to userspace with internal error on unexpected exit reason"), VMX & SVM exit handlers were modified to generically handle all unexpected exit-reasons by returning to userspace with internal error. Therefore, there is no need for specialized handling of specific unexpected exit-reasons (This specialized handling also introduced inconsistency for these exit-reasons to silently skip guest instruction instead of return to userspace on internal-error). Fixes: bf653b78f960 ("KVM: vmx: Introduce handle_unexpected_vmexit and handle WAITPKG vmexit") Signed-off-by: Liran Alon <liran.alon@oracle.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-10-22kvm: x86: Expose RDPID in KVM_GET_SUPPORTED_CPUIDJim Mattson1-1/+1
When the RDPID instruction is supported on the host, enumerate it in KVM_GET_SUPPORTED_CPUID. Signed-off-by: Jim Mattson <jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-10-04KVM: x86: omit "impossible" pmu MSRs from MSR listPaolo Bonzini1-16/+2
INTEL_PMC_MAX_GENERIC is currently 32, which exceeds the 18 contiguous MSR indices reserved by Intel for event selectors. Since some machines actually have MSRs past the reserved range, filtering them against x86_pmu.num_counters_gp may have false positives. Cut the list to 18 entries to avoid this. Reported-by: Vitaly Kuznetsov <vkuznets@redhat.com> Suggested-by: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: Jim Mattson <jamttson@google.com> Fixes: e2ada66ec418 ("kvm: x86: Add Intel PMU MSRs to msrs_to_save[]", 2019-08-21) Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-10-03KVM: nVMX: Fix consistency check on injected exception error codeSean Christopherson1-1/+1
Current versions of Intel's SDM incorrectly state that "bits 31:15 of the VM-Entry exception error-code field" must be zero. In reality, bits 31:16 must be zero, i.e. error codes are 16-bit values. The bogus error code check manifests as an unexpected VM-Entry failure due to an invalid code field (error number 7) in L1, e.g. when injecting a #GP with error_code=0x9f00. Nadav previously reported the bug[*], both to KVM and Intel, and fixed the associated kvm-unit-test. [*] https://patchwork.kernel.org/patch/11124749/ Reported-by: Nadav Amit <namit@vmware.com> Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Reviewed-by: Jim Mattson <jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-10-03KVM: x86: omit absent pmu MSRs from MSR listPaolo Bonzini1-2/+12
INTEL_PMC_MAX_GENERIC is currently 32, which exceeds the 18 contiguous MSR indices reserved by Intel for event selectors. Since some machines actually have MSRs past the reserved range, these may survive the filtering of msrs_to_save array and would be rejected by KVM_GET/SET_MSR. To avoid this, cut the list to whatever CPUID reports for the host's architectural PMU. Reported-by: Vitaly Kuznetsov <vkuznets@redhat.com> Suggested-by: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: Jim Mattson <jmattson@google.com> Fixes: e2ada66ec418 ("kvm: x86: Add Intel PMU MSRs to msrs_to_save[]", 2019-08-21) Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-10-01kvm: vmx: Limit guest PMCs to those supported on the hostJim Mattson1-2/+5
KVM can only virtualize as many PMCs as the host supports. Limit the number of generic counters and fixed counters to the number of corresponding counters supported on the host, rather than to INTEL_PMC_MAX_GENERIC and INTEL_PMC_MAX_FIXED, respectively. Note that INTEL_PMC_MAX_GENERIC is currently 32, which exceeds the 18 contiguous MSR indices reserved by Intel for event selectors. Since the existing code relies on a contiguous range of MSR indices for event selectors, it can't possibly work for more than 18 general purpose counters. Fixes: f5132b01386b5a ("KVM: Expose a version 2 architectural PMU to a guests") Signed-off-by: Jim Mattson <jmattson@google.com> Reviewed-by: Marc Orr <marcorr@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-09-30kvm: x86, powerpc: do not allow clearing largepages debugfs entryPaolo Bonzini1-3/+3
The largepages debugfs entry is incremented/decremented as shadow pages are created or destroyed. Clearing it will result in an underflow, which is harmless to KVM but ugly (and could be misinterpreted by tools that use debugfs information), so make this particular statistic read-only. Cc: kvm-ppc@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-09-27KVM: VMX: Set VMENTER_L1D_FLUSH_NOT_REQUIRED if !X86_BUG_L1TFWaiman Long1-6/+9
The l1tf_vmx_mitigation is only set to VMENTER_L1D_FLUSH_NOT_REQUIRED when the ARCH_CAPABILITIES MSR indicates that L1D flush is not required. However, if the CPU is not affected by L1TF, l1tf_vmx_mitigation will still be set to VMENTER_L1D_FLUSH_AUTO. This is certainly not the best option for a !X86_BUG_L1TF CPU. So force l1tf_vmx_mitigation to VMENTER_L1D_FLUSH_NOT_REQUIRED to make it more explicit in case users are checking the vmentry_l1d_flush parameter. Signed-off-by: Waiman Long <longman@redhat.com> [Patch rewritten accoring to Borislav Petkov's suggestion. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-09-27KVM: x86: fix nested guest live migration with PMLPaolo Bonzini1-7/+32
Shadow paging is fundamentally incompatible with the page-modification log, because the GPAs in the log come from the wrong memory map. In particular, for the EPT page-modification log, the GPAs in the log come from L2 rather than L1. (If there was a non-EPT page-modification log, we couldn't use it for shadow paging because it would log GVAs rather than GPAs). Therefore, we need to rely on write protection to record dirty pages. This has the side effect of bypassing PML, since writes now result in an EPT violation vmexit. This is relatively easy to add to KVM, because pretty much the only place that needs changing is spte_clear_dirty. The first access to the page already goes through the page fault path and records the correct GPA; it's only subsequent accesses that are wrong. Therefore, we can equip set_spte (where the first access happens) to record that the SPTE will have to be write protected, and then spte_clear_dirty will use this information to do the right thing. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-09-27KVM: x86: assign two bits to track SPTE kindsPaolo Bonzini1-10/+18
Currently, we are overloading SPTE_SPECIAL_MASK to mean both "A/D bits unavailable" and MMIO, where the difference between the two is determined by mio_mask and mmio_value. However, the next patch will need two bits to distinguish availability of A/D bits from write protection. So, while at it give MMIO its own bit pattern, and move the two bits from bit 62 to bits 52..53 since Intel is allocating EPT page table bits from the top. Reviewed-by: Junaid Shahid <junaids@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-09-26KVM: x86: Expose XSAVEERPTR to the guestSebastian Andrzej Siewior1-3/+3
I was surprised to see that the guest reported `fxsave_leak' while the host did not. After digging deeper I noticed that the bits are simply masked out during enumeration. The XSAVEERPTR feature is actually a bug fix on AMD which means the kernel can disable a workaround. Pass XSAVEERPTR to the guest if available on the host. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-09-26kvm: x86: Enumerate support for CLZERO instructionJim Mattson1-2/+3
CLZERO is available to the guest if it is supported on the host. Therefore, enumerate support for the instruction in KVM_GET_SUPPORTED_CPUID whenever it is supported on the host. Signed-off-by: Jim Mattson <jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-09-26kvm: x86: Use AMD CPUID semantics for AMD vCPUsJim Mattson1-2/+4
When the guest CPUID information represents an AMD vCPU, return all zeroes for queries of undefined CPUID leaves, whether or not they are in range. Signed-off-by: Jim Mattson <jmattson@google.com> Fixes: bd22f5cfcfe8f6 ("KVM: move and fix substitue search for missing CPUID entries") Reviewed-by: Marc Orr <marcorr@google.com> Reviewed-by: Peter Shier <pshier@google.com> Reviewed-by: Jacob Xu <jacobhxu@google.com> Cc: Sean Christopherson <sean.j.christopherson@intel.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-09-26kvm: x86: Improve emulation of CPUID leaves 0BH and 1FHJim Mattson1-36/+47
For these CPUID leaves, the EDX output is not dependent on the ECX input (i.e. the SIGNIFCANT_INDEX flag doesn't apply to EDX). Furthermore, the low byte of the ECX output is always identical to the low byte of the ECX input. KVM does not produce the correct ECX and EDX outputs for any undefined subleaves beyond the first. Special-case these CPUID leaves in kvm_cpuid, so that the ECX and EDX outputs are properly generated for all undefined subleaves. Fixes: 0771671749b59a ("KVM: Enhance guest cpuid management") Fixes: a87f2d3a6eadab ("KVM: x86: Add Intel CPUID.1F cpuid emulation support") Signed-off-by: Jim Mattson <jmattson@google.com> Reviewed-by: Marc Orr <marcorr@google.com> Reviewed-by: Peter Shier <pshier@google.com> Reviewed-by: Jacob Xu <jacobhxu@google.com> Cc: Sean Christopherson <sean.j.christopherson@intel.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-09-26KVM: X86: Fix userspace set invalid CR4Wanpeng Li1-17/+21
Reported by syzkaller: WARNING: CPU: 0 PID: 6544 at /home/kernel/data/kvm/arch/x86/kvm//vmx/vmx.c:4689 handle_desc+0x37/0x40 [kvm_intel] CPU: 0 PID: 6544 Comm: a.out Tainted: G OE 5.3.0-rc4+ #4 RIP: 0010:handle_desc+0x37/0x40 [kvm_intel] Call Trace: vmx_handle_exit+0xbe/0x6b0 [kvm_intel] vcpu_enter_guest+0x4dc/0x18d0 [kvm] kvm_arch_vcpu_ioctl_run+0x407/0x660 [kvm] kvm_vcpu_ioctl+0x3ad/0x690 [kvm] do_vfs_ioctl+0xa2/0x690 ksys_ioctl+0x6d/0x80 __x64_sys_ioctl+0x1a/0x20 do_syscall_64+0x74/0x720 entry_SYSCALL_64_after_hwframe+0x49/0xbe When CR4.UMIP is set, guest should have UMIP cpuid flag. Current kvm set_sregs function doesn't have such check when userspace inputs sregs values. SECONDARY_EXEC_DESC is enabled on writes to CR4.UMIP in vmx_set_cr4 though guest doesn't have UMIP cpuid flag. The testcast triggers handle_desc warning when executing ltr instruction since guest architectural CR4 doesn't set UMIP. This patch fixes it by adding valid CR4 and CPUID combination checking in __set_sregs. syzkaller source: https://syzkaller.appspot.com/x/repro.c?x=138efb99600000 Reported-by: syzbot+0f1819555fbdce992df9@syzkaller.appspotmail.com Cc: stable@vger.kernel.org Signed-off-by: Wanpeng Li <wanpengli@tencent.com> Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-09-26kvm: x86: Fix a spurious -E2BIG in __do_cpuid_funcJim Mattson1-6/+10
Don't return -E2BIG from __do_cpuid_func when processing function 0BH or 1FH and the last interesting subleaf occupies the last allocated entry in the result array. Cc: Paolo Bonzini <pbonzini@redhat.com> Fixes: 831bf664e9c1fc ("KVM: Refactor and simplify kvm_dev_ioctl_get_supported_cpuid") Signed-off-by: Jim Mattson <jmattson@google.com> Reviewed-by: Peter Shier <pshier@google.com> Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-09-26KVM: LAPIC: Loosen filter for adaptive tuning of lapic_timer_advance_nsWanpeng Li1-6/+7
5000 guest cycles delta is easy to encounter on desktop, per-vCPU lapic_timer_advance_ns always keeps at 1000ns initial value, let's loosen the filter a bit to let adaptive tuning make progress. Signed-off-by: Wanpeng Li <wanpengli@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-09-25KVM: nVMX: cleanup and fix host 64-bit mode checksPaolo Bonzini1-31/+19
KVM was incorrectly checking vmcs12->host_ia32_efer even if the "load IA32_EFER" exit control was reset. Also, some checks were not using the new CC macro for tracing. Cleanup everything so that the vCPU's 64-bit mode is determined directly from EFER_LMA and the VMCS checks are based on that, which matches section 26.2.4 of the SDM. Cc: Sean Christopherson <sean.j.christopherson@intel.com> Cc: Krish Sadhukhan <krish.sadhukhan@oracle.com> Fixes: 5845038c111db27902bc220a4f70070fe945871c Reviewed-by: Jim Mattson <jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-09-25KVM: vmx: fix build warnings in hv_enable_direct_tlbflush() on i386Vitaly Kuznetsov1-9/+5
The following was reported on i386: arch/x86/kvm/vmx/vmx.c: In function 'hv_enable_direct_tlbflush': arch/x86/kvm/vmx/vmx.c:503:10: warning: cast from pointer to integer of different size [-Wpointer-to-int-cast] pr_debugs() in this function are more or less useless, let's just remove them. evmcs->hv_vm_id can use 'unsigned long' instead of 'u64'. Also, simplify the code a little bit. Reported-by: kbuild test robot <lkp@intel.com> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-09-25KVM: VMX: Add error handling to VMREAD helperSean Christopherson2-4/+25
Now that VMREAD flows require a taken branch, courtesy of commit 3901336ed9887 ("x86/kvm: Don't call kvm_spurious_fault() from .fixup") bite the bullet and add full error handling to VMREAD, i.e. replace the JMP added by __ex()/____kvm_handle_fault_on_reboot() with a hinted Jcc. To minimize the code footprint, add a helper function, vmread_error(), to handle both faults and failures so that the inline flow has a single CALL. Acked-by: Paolo Bonzini <pbonzini@redhat.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-09-25KVM: VMX: Optimize VMX instruction error and fault handlingSean Christopherson2-32/+74
Rework the VMX instruction helpers using asm-goto to branch directly to error/fault "handlers" in lieu of using __ex(), i.e. the generic ____kvm_handle_fault_on_reboot(). Branching directly to fault handling code during fixup avoids the extra JMP that is inserted after every VMX instruction when using the generic "fault on reboot" (see commit 3901336ed9887, "x86/kvm: Don't call kvm_spurious_fault() from .fixup"). Opportunistically clean up the helpers so that they all have consistent error handling and messages. Leave the usage of ____kvm_handle_fault_on_reboot() (via __ex()) in kvm_cpu_vmxoff() and nested_vmx_check_vmentry_hw() as is. The VMXOFF case is not a fast path, i.e. the cleanliness of __ex() is worth the JMP, and the extra JMP in nested_vmx_check_vmentry_hw() is unavoidable. Note, VMREAD cannot get the asm-goto treatment as output operands aren't compatible with GCC's asm-goto due to internal compiler restrictions. Acked-by: Paolo Bonzini <pbonzini@redhat.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-09-25KVM: x86: Check kvm_rebooting in kvm_spurious_fault()Sean Christopherson1-1/+2
Explicitly check kvm_rebooting in kvm_spurious_fault() prior to invoking BUG(), as opposed to assuming the caller has already done so. Letting kvm_spurious_fault() be called "directly" will allow VMX to better optimize its low level assembly flows. As a happy side effect, kvm_spurious_fault() no longer needs to be marked as a dead end since it doesn't unconditionally BUG(). Acked-by: Paolo Bonzini <pbonzini@redhat.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-09-24kvm: nvmx: limit atomic switch MSRsMarc Orr1-11/+32
Allowing an unlimited number of MSRs to be specified via the VMX load/store MSR lists (e.g., vm-entry MSR load list) is bad for two reasons. First, a guest can specify an unreasonable number of MSRs, forcing KVM to process all of them in software. Second, the SDM bounds the number of MSRs allowed to be packed into the atomic switch MSR lists. Quoting the "Miscellaneous Data" section in the "VMX Capability Reporting Facility" appendix: "Bits 27:25 is used to compute the recommended maximum number of MSRs that should appear in the VM-exit MSR-store list, the VM-exit MSR-load list, or the VM-entry MSR-load list. Specifically, if the value bits 27:25 of IA32_VMX_MISC is N, then 512 * (N + 1) is the recommended maximum number of MSRs to be included in each list. If the limit is exceeded, undefined processor behavior may result (including a machine check during the VMX transition)." Because KVM needs to protect itself and can't model "undefined processor behavior", arbitrarily force a VM-entry to fail due to MSR loading when the MSR load list is too large. Similarly, trigger an abort during a VM exit that encounters an MSR load list or MSR store list that is too large. The MSR list size is intentionally not pre-checked so as to maintain compatibility with hardware inasmuch as possible. Test these new checks with the kvm-unit-test "x86: nvmx: test max atomic switch MSRs". Suggested-by: Jim Mattson <jmattson@google.com> Reviewed-by: Jim Mattson <jmattson@google.com> Reviewed-by: Peter Shier <pshier@google.com> Signed-off-by: Marc Orr <marcorr@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-09-24kvm: svm: Intercept RDPRUJim Mattson1-0/+8
The RDPRU instruction gives the guest read access to the IA32_APERF MSR and the IA32_MPERF MSR. According to volume 3 of the APM, "When virtualization is enabled, this instruction can be intercepted by the Hypervisor. The intercept bit is at VMCB byte offset 10h, bit 14." Since we don't enumerate the instruction in KVM_SUPPORTED_CPUID, intercept it and synthesize #UD. Signed-off-by: Jim Mattson <jmattson@google.com> Reviewed-by: Drew Schmitt <dasch@google.com> Reviewed-by: Jacob Xu <jacobhxu@google.com> Reviewed-by: Peter Shier <pshier@google.com> Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-09-24kvm: x86: Add "significant index" flag to a few CPUID leavesJim Mattson1-0/+6
According to the Intel SDM, volume 2, "CPUID," the index is significant (or partially significant) for CPUID leaves 0FH, 10H, 12H, 17H, 18H, and 1FH. Add the corresponding flag to these CPUID leaves in do_host_cpuid(). Signed-off-by: Jim Mattson <jmattson@google.com> Reviewed-by: Peter Shier <pshier@google.com> Reviewed-by: Steve Rutherford <srutherford@google.com> Fixes: a87f2d3a6eadab ("KVM: x86: Add Intel CPUID.1F cpuid emulation support") Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com> Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-09-24KVM: x86/mmu: Skip invalid pages during zapping iff root_count is zeroSean Christopherson1-4/+5
Do not skip invalid shadow pages when zapping obsolete pages if the pages' root_count has reached zero, in which case the page can be immediately zapped and freed. Update the comment accordingly. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-09-24KVM: x86/mmu: Explicitly track only a single invalid mmu generationSean Christopherson2-10/+20
Toggle mmu_valid_gen between '0' and '1' instead of blindly incrementing the generation. Because slots_lock is held for the entire duration of zapping obsolete pages, it's impossible for there to be multiple invalid generations associated with shadow pages at any given time. Toggling between the two generations (valid vs. invalid) allows changing mmu_valid_gen from an unsigned long to a u8, which reduces the size of struct kvm_mmu_page from 160 to 152 bytes on 64-bit KVM, i.e. reduces KVM's memory footprint by 8 bytes per shadow page. Set sp->mmu_valid_gen before it is added to active_mmu_pages. Functionally this has no effect as kvm_mmu_alloc_page() has a single caller that sets sp->mmu_valid_gen soon thereafter, but visually it is jarring to see a shadow page being added to the list without its mmu_valid_gen first being set. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-09-24KVM: x86/mmu: Revert "KVM: x86/mmu: Remove is_obsolete() call"Sean Christopherson2-6/+24
Now that the fast invalidate mechanism has been reintroduced, restore the performance tweaks for fast invalidation that existed prior to its removal. Paraphrasing the original changelog (commit 5ff0568374ed2 was itself a partial revert): Don't force reloading the remote mmu when zapping an obsolete page, as a MMU_RELOAD request has already been issued by kvm_mmu_zap_all_fast() immediately after incrementing mmu_valid_gen, i.e. after marking pages obsolete. This reverts commit 5ff0568374ed2e585376a3832857ade5daccd381. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-09-24KVM: x86/mmu: Revert "Revert "KVM: MMU: collapse TLB flushes when zap all ↵Sean Christopherson1-3/+22
pages"" Now that the fast invalidate mechanism has been reintroduced, restore the performance tweaks for fast invalidation that existed prior to its removal. Paraphrashing the original changelog: Reload the mmu on all vCPUs after updating the generation number so that obsolete pages are not used by any vCPUs. This allows collapsing all TLB flushes during obsolete page zapping into a single flush, as there is no need to flush when dropping mmu_lock (to reschedule). Note: a remote TLB flush is still needed before freeing the pages as other vCPUs may be doing a lockless shadow page walk. Opportunstically improve the comments restored by the revert (the code itself is a true revert). This reverts commit f34d251d66ba263c077ed9d2bbd1874339a4c887. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-09-24KVM: x86/mmu: Revert "Revert "KVM: MMU: zap pages in batch""Sean Christopherson1-26/+9
Now that the fast invalidate mechanism has been reintroduced, restore the performance tweaks for fast invalidation that existed prior to its removal. Paraphrashing the original changelog: Zap at least 10 shadow pages before releasing mmu_lock to reduce the overhead associated with re-acquiring the lock. Note: "10" is an arbitrary number, speculated to be high enough so that a vCPU isn't stuck zapping obsolete pages for an extended period, but small enough so that other vCPUs aren't starved waiting for mmu_lock. This reverts commit 43d2b14b105fb00b8864c7b0ee7043cc1cc4a969. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-09-24KVM: x86/mmu: Revert "Revert "KVM: MMU: add tracepoint for ↵Sean Christopherson2-0/+22
kvm_mmu_invalidate_all_pages"" Now that the fast invalidate mechanism has been reintroduced, restore the tracepoint associated with said mechanism. Note, the name of the tracepoint deviates from the original tracepoint so as to match KVM's current nomenclature. This reverts commit 42560fb1f3c6c7f730897b7fa7a478bc37e0be50. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>