path: root/arch/x86/kvm/vmx/vmx.h
Age | Commit message | Author | Files | Lines
2020-05-13 | KVM: VMX: Add proper cache tracking for CR4 | Sean Christopherson | 1 | -0/+1
Move CR4 caching into the standard register caching mechanism in order to take advantage of the availability checks provided by regs_avail. This avoids multiple VMREADs and retpolines (when configured) during nested VMX transitions as kvm_read_cr4_bits() is invoked multiple times on each transition, e.g. when stuffing CR0 and CR3. As an added bonus, this eliminates a kvm_x86_ops hook, saves a retpoline on SVM when reading CR4, and squashes the confusing naming discrepancy of "cache_reg" vs. "decache_cr4_guest_bits". No functional change intended. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200502043234.12481-7-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-13 | KVM: nVMX: Skip IBPB when temporarily switching between vmcs01 and vmcs02 | Sean Christopherson | 1 | -1/+0
Skip the Indirect Branch Prediction Barrier that is triggered on a VMCS switch when temporarily loading vmcs02 to synchronize it to vmcs12, i.e. give copy_vmcs02_to_vmcs12_rare() the same treatment as vmx_switch_vmcs(). Make vmx_vcpu_load() static now that it's only referenced within vmx.c. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200506235850.22600-3-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-13 | KVM: nVMX: Skip IBPB when switching between vmcs01 and vmcs02 | Sean Christopherson | 1 | -1/+2
Skip the Indirect Branch Prediction Barrier that is triggered on a VMCS switch when running with spectre_v2_user=on/auto if the switch is between two VMCSes in the same guest, i.e. between vmcs01 and vmcs02. The IBPB is intended to prevent one guest from attacking another, which is unnecessary in the nested case as it's the same guest from KVM's perspective. This all but eliminates the overhead observed for nested VMX transitions when running with CONFIG_RETPOLINE=y and spectre_v2_user=on/auto, which can be significant, e.g. roughly 3x on current systems. Reported-by: Alexander Graf <graf@amazon.com> Cc: KarimAllah Raslan <karahmed@amazon.de> Cc: stable@vger.kernel.org Fixes: 15d45071523d ("KVM/x86: Add IBPB support") Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200501163117.4655-1-sean.j.christopherson@intel.com> [Invert direction of bool argument. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-13 | KVM: VMX: Split out architectural interrupt/NMI blocking checks | Sean Christopherson | 1 | -0/+2
Move the architectural (non-KVM specific) interrupt/NMI blocking checks to a separate helper so that they can be used in a future patch by vmx_check_nested_events(). No functional change intended. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200423022550.15113-8-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-04-21 | KVM: VMX: Cache vmcs.EXIT_INTR_INFO using arch avail_reg flags | Sean Christopherson | 1 | -1/+13
Introduce a new "extended register" type, EXIT_INFO_2 (to pair with the nomenclature in .get_exit_info()), and use it to cache VMX's vmcs.EXIT_INTR_INFO. Drop a comment in vmx_recover_nmi_blocking() that is obsoleted by the generic caching mechanism. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200415203454.8296-6-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-04-21 | KVM: VMX: Cache vmcs.EXIT_QUALIFICATION using arch avail_reg flags | Sean Christopherson | 1 | -1/+14
Introduce a new "extended register" type, EXIT_INFO_1 (to pair with the nomenclature in .get_exit_info()), and use it to cache VMX's vmcs.EXIT_QUALIFICATION. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200415203454.8296-5-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-04-21 | KVM: nVMX: Drop manual clearing of segment cache on nested VMCS switch | Sean Christopherson | 1 | -5/+0
Drop the call to vmx_segment_cache_clear() in vmx_switch_vmcs() now that the entire register cache is reset when switching the active VMCS, e.g. vmx_segment_cache_test_set() will reset the segment cache due to VCPU_EXREG_SEGMENTS being unavailable. Move vmx_segment_cache_clear() to vmx.c now that it's no longer invoked by the nested code. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200415203454.8296-4-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-04-21 | KVM: nVMX: Reset register cache (available and dirty masks) on VMCS switch | Sean Christopherson | 1 | -0/+11
Reset the per-vCPU available and dirty register masks when switching between vmcs01 and vmcs02, as the masks track state relative to the current VMCS. The stale masks don't cause problems in the current code base because the registers are either unconditionally written on nested transitions or, in the case of segment registers, have an additional tracker that is manually reset. Note, by dropping (previously implicitly, now explicitly) the dirty mask when switching the active VMCS, KVM is technically losing writes to the associated fields. But, the only regs that can be dirtied (RIP, RSP and PDPTRs) are unconditionally written on nested transitions, e.g. explicit writeback is a waste of cycles, and a WARN_ON would be rather pointless. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200415203454.8296-3-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
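The available/dirty mask scheme described in the commit above can be illustrated with a small userspace sketch. This is not the kernel's code: the register list, `struct vcpu_cache`, `fake_vmread()` and the array standing in for a VMCS are all hypothetical names invented for the illustration. It only shows why the masks must be dropped when the active VMCS changes, since a set "available" bit would otherwise return a value read from the previous VMCS.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical register identifiers; the kernel's list is larger. */
enum reg { REG_RSP, REG_RIP, REG_CR4, NR_REGS };

struct vcpu_cache {
    uint32_t avail;          /* bit set: regs[] holds the current VMCS value */
    uint32_t dirty;          /* bit set: regs[] must be written back */
    uint64_t regs[NR_REGS];
    unsigned int vmreads;    /* counts simulated VMREADs */
};

/* Stand-in for a VMREAD from the currently loaded VMCS (assumption). */
static inline uint64_t fake_vmread(struct vcpu_cache *c, const uint64_t *vmcs,
                                   enum reg r)
{
    c->vmreads++;
    return vmcs[r];
}

/* Read through the cache; touch the "hardware" only on a miss. */
static inline uint64_t cache_read(struct vcpu_cache *c, const uint64_t *vmcs,
                                  enum reg r)
{
    assert(r < NR_REGS);
    if (!(c->avail & (UINT32_C(1) << r))) {
        c->regs[r] = fake_vmread(c, vmcs, r);
        c->avail |= UINT32_C(1) << r;
    }
    return c->regs[r];
}

/* The masks track state relative to one VMCS, so drop both on a switch. */
static inline void cache_reset_on_vmcs_switch(struct vcpu_cache *c)
{
    c->avail = 0;
    c->dirty = 0;
}
```

In this sketch, two consecutive reads of `REG_CR4` cost a single simulated VMREAD, and a read after `cache_reset_on_vmcs_switch()` fetches the value from the newly active VMCS instead of returning the stale one.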
2020-04-21 | KVM: nVMX: Reload APIC access page on nested VM-Exit only if necessary | Sean Christopherson | 1 | -0/+1
Defer reloading L1's APIC page by logging the need for a reload and processing it during nested VM-Exit instead of unconditionally reloading the APIC page on nested VM-Exit. This eliminates a TLB flush on the majority of VM-Exits as the APIC page rarely needs to be reloaded. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200320212833.3507-28-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-04-21 | KVM: VMX: Move vmx_flush_tlb() to vmx.c | Sean Christopherson | 1 | -25/+0
Move vmx_flush_tlb() to vmx.c and make it non-inline static now that all its callers live in vmx.c. No functional change intended. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200320212833.3507-19-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-04-21 | KVM: x86: Drop @invalidate_gpa param from kvm_x86_ops' tlb_flush() | Sean Christopherson | 1 | -30/+12
Drop @invalidate_gpa from ->tlb_flush() and kvm_vcpu_flush_tlb() now that all callers pass %true for said param, or ignore the param (SVM has an internal call to svm_flush_tlb() in svm_flush_tlb_guest that somewhat arbitrarily passes %false). Remove __vmx_flush_tlb() as it is no longer used. No functional change intended. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200320212833.3507-17-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-04-21 | KVM: VMX: Drop redundant capability checks in low level INVVPID helpers | Sean Christopherson | 1 | -1/+0
Remove the INVVPID capabilities checks from vpid_sync_vcpu_single() and vpid_sync_vcpu_global() now that all callers ensure the INVVPID variant is supported. Note, in some cases the guarantee is provided in concert with hardware_setup(), which enables VPID if and only if at least one of invvpid_single() or invvpid_global() is supported. Drop the WARN_ON_ONCE() from vmx_flush_tlb() as vpid_sync_vcpu_single() will trigger a WARN() on INVVPID failure, i.e. if SINGLE_CONTEXT isn't supported. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200320212833.3507-13-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-04-15 | KVM: VMX: Flush all EPTP/VPID contexts on remote TLB flush | Sean Christopherson | 1 | -1/+27
Flush all EPTP/VPID contexts if a TLB flush _may_ have been triggered by a remote or deferred TLB flush, i.e. by KVM_REQ_TLB_FLUSH. Remote TLB flushes require all contexts to be invalidated, not just the active contexts, e.g. all mappings in all contexts for a given HVA need to be invalidated on a mmu_notifier invalidation. Similarly, the instigator of the deferred TLB flush may be expecting all contexts to be flushed, e.g. vmx_vcpu_load_vmcs().
Without nested VMX, flushing only the current EPTP/VPID context isn't problematic because KVM uses a constant VPID for each vCPU, and mmu_alloc_direct_roots() all but guarantees KVM will use a single EPTP for L1. In the rare case where a different EPTP is created or reused, KVM (currently) unconditionally flushes the new EPTP context prior to entering the guest.
With nested VMX, KVM conditionally uses a different VPID for L2, and unconditionally uses a different EPTP for L2. Because KVM doesn't _intentionally_ guarantee L2's EPTP/VPID context is flushed on nested VM-Enter, it'd be possible for a malicious L1 to attack the host and/or different VMs by exploiting the lack of flushing for L2.
1) Launch nested guest from malicious L1.
2) Nested VM-Enter to L2.
3) Access target GPA 'g'. CPU inserts TLB entry tagged with L2's ASID mapping 'g' to host PFN 'x'.
4) Nested VM-Exit to L1.
5) L1 triggers kernel same-page merging (ksm) by duplicating/zeroing the page for PFN 'x'.
6) Host kernel merges PFN 'x' with PFN 'y', i.e. unmaps PFN 'x' and remaps the page to PFN 'y'. mmu_notifier sends invalidate command, KVM flushes TLB only for L1's ASID.
7) Host kernel reallocates PFN 'x' to some other task/guest.
8) Nested VM-Enter to L2. KVM does not invalidate L2's EPTP or VPID.
9) L2 accesses GPA 'g' and gains read/write access to PFN 'x' via its stale TLB entry.
However, current KVM unconditionally flushes L1's EPTP/VPID context on nested VM-Exit. But that behavior is mostly unintentional: KVM doesn't go out of its way to flush EPTP/VPID on nested VM-Enter/VM-Exit; rather, a TLB flush is guaranteed to occur prior to re-entering L1 due to __kvm_mmu_new_cr3() always being called with skip_tlb_flush=false. On nested VM-Enter, this happens via kvm_init_shadow_ept_mmu() (nested EPT enabled) or in nested_vmx_load_cr3() (nested EPT disabled). On nested VM-Exit it occurs via nested_vmx_load_cr3().
This also fixes a bug where a deferred TLB flush in the context of L2, with EPT disabled, would flush L1's VPID instead of L2's VPID, as vmx_flush_tlb() flushes L1's VPID regardless of is_guest_mode().
Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: Ben Gardon <bgardon@google.com> Cc: Jim Mattson <jmattson@google.com> Cc: Junaid Shahid <junaids@google.com> Cc: Liran Alon <liran.alon@oracle.com> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: John Haxby <john.haxby@oracle.com> Reviewed-by: Liran Alon <liran.alon@oracle.com> Fixes: efebf0aaec3d ("KVM: nVMX: Do not flush TLB on L1<->L2 transitions if L1 uses VPID and EPT") Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200320212833.3507-2-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-04-03 | Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm | Linus Torvalds | 1 | -5/+3
Pull kvm updates from Paolo Bonzini: "ARM: - GICv4.1 support - 32bit host removal PPC: - secure (encrypted) guests under the Protected Execution Framework ultravisor s390: - allow disabling GISA (hardware interrupt injection) and protected VMs/ultravisor support. x86: - New dirty bitmap flag that sets all bits in the bitmap when dirty page logging is enabled; this is faster because it doesn't require bulk modification of the page tables. - Initial work on making nested SVM event injection more similar to VMX, and less buggy. - Various cleanups to MMU code (though the big ones and related optimizations were delayed to 5.8). Instead of using cr3 in function names which occasionally means eptp, KVM too has standardized on "pgd". - A large refactoring of CPUID features, which now use an array that parallels the core x86_features. - Some removal of pointer chasing from kvm_x86_ops, which will also be switched to static calls as soon as they are available. - New Tigerlake CPUID features. - More bugfixes, optimizations and cleanups.
Generic: - selftests: cleanups, new MMU notifier stress test, steal-time test - CSV output for kvm_stat" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (277 commits) x86/kvm: fix a missing-prototypes "vmread_error" KVM: x86: Fix BUILD_BUG() in __cpuid_entry_get_reg() w/ CONFIG_UBSAN=y KVM: VMX: Add a trampoline to fix VMREAD error handling KVM: SVM: Annotate svm_x86_ops as __initdata KVM: VMX: Annotate vmx_x86_ops as __initdata KVM: x86: Drop __exit from kvm_x86_ops' hardware_unsetup() KVM: x86: Copy kvm_x86_ops by value to eliminate layer of indirection KVM: x86: Set kvm_x86_ops only after ->hardware_setup() completes KVM: VMX: Configure runtime hooks using vmx_x86_ops KVM: VMX: Move hardware_setup() definition below vmx_x86_ops KVM: x86: Move init-only kvm_x86_ops to separate struct KVM: Pass kvm_init()'s opaque param to additional arch funcs s390/gmap: return proper error code on ksm unsharing KVM: selftests: Fix cosmetic copy-paste error in vm_mem_region_move() KVM: Fix out of range accesses to memslots KVM: X86: Micro-optimize IPI fastpath delay KVM: X86: Delay read msr data iff writes ICR MSR KVM: PPC: Book3S HV: Add a capability for enabling secure guests KVM: arm64: GICv4.1: Expose HW-based SGIs in debugfs KVM: arm64: GICv4.1: Allow non-trapping WFI when using HW SGIs ...
2020-03-31 | Merge branch 'x86-cleanups-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip | Linus Torvalds | 1 | -2/+0
Pull x86 cleanups from Ingo Molnar: "This topic tree contains more commits than usual: - most of them are uaccess cleanups/reorganization by Al - there's a bunch of prototype declaration (-Wmissing-prototypes) cleanups - misc other cleanups all around the map" * 'x86-cleanups-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (36 commits) x86/mm/set_memory: Fix -Wmissing-prototypes warnings x86/efi: Add a prototype for efi_arch_mem_reserve() x86/mm: Mark setup_emu2phys_nid() static x86/jump_label: Move 'inline' keyword placement x86/platform/uv: Add a missing prototype for uv_bau_message_interrupt() kill uaccess_try() x86: unsafe_put-style macro for sigmask x86: x32_setup_rt_frame(): consolidate uaccess areas x86: __setup_rt_frame(): consolidate uaccess areas x86: __setup_frame(): consolidate uaccess areas x86: setup_sigcontext(): list user_access_{begin,end}() into callers x86: get rid of put_user_try in __setup_rt_frame() (both 32bit and 64bit) x86: ia32_setup_rt_frame(): consolidate uaccess areas x86: ia32_setup_frame(): consolidate uaccess areas x86: ia32_setup_sigcontext(): lift user_access_{begin,end}() into the callers x86/alternatives: Mark text_poke_loc_init() static x86/cpu: Fix a -Wmissing-prototypes warning for init_ia32_feat_ctl() x86/mm: Drop pud_mknotpresent() x86: Replace setup_irq() by request_irq() x86/configs: Slightly reduce defconfigs ...
2020-03-23 | KVM: VMX: Fold loaded_vmcs_init() into alloc_loaded_vmcs() | Sean Christopherson | 1 | -1/+0
Subsume loaded_vmcs_init() into alloc_loaded_vmcs(), its only remaining caller, and drop the VMCLEAR on the shadow VMCS, which is guaranteed to be NULL. loaded_vmcs_init() was previously used by loaded_vmcs_clear(), but loaded_vmcs_clear() also subsumed loaded_vmcs_init() to properly handle smp_wmb() with respect to VMCLEAR. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200321193751.24985-3-sean.j.christopherson@intel.com> Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-03-16 | KVM: x86: rename set_cr3 callback and related flags to load_mmu_pgd | Paolo Bonzini | 1 | -1/+1
The set_cr3 callback is not setting the guest CR3, it is setting the root of the guest page tables, either shadow or two-dimensional. To make this clearer as well as to indicate that the MMU calls it via kvm_mmu_load_cr3, rename it to load_mmu_pgd. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-03-16 | KVM: x86: Move VMX's host_efer to common x86 code | Sean Christopherson | 1 | -1/+0
Move host_efer to common x86 code and use it for CPUID's is_efer_nx() to avoid constantly re-reading the MSR. No functional change intended. Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-03-16 | KVM: VMX: Add helpers to query Intel PT mode | Sean Christopherson | 1 | -2/+2
Add helpers to query which of the (two) supported PT modes is active. The primary motivation is to help document that there is a third PT mode (host-only) that's currently not supported by KVM. As is, it's not obvious that PT_MODE_SYSTEM != !PT_MODE_HOST_GUEST and vice versa, e.g. that "pt_mode == PT_MODE_SYSTEM" and "pt_mode != PT_MODE_HOST_GUEST" are two distinct checks. No functional change intended. Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
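The point about the two checks being distinct can be seen in a minimal sketch. The enum below is an assumption made for illustration: `PT_MODE_HOST_ONLY` is a hypothetical name for the third, unsupported mode, and the helper names are illustrative rather than the ones added by the patch.

```c
#include <stdbool.h>

/* Two modes named in the commit message plus a hypothetical third one. */
enum pt_mode {
    PT_MODE_SYSTEM,
    PT_MODE_HOST_GUEST,
    PT_MODE_HOST_ONLY,   /* hypothetical name for the unsupported mode */
};

/* Each helper tests for exactly one mode instead of negating another. */
static inline bool pt_mode_is_system(enum pt_mode m)
{
    return m == PT_MODE_SYSTEM;
}

static inline bool pt_mode_is_host_guest(enum pt_mode m)
{
    return m == PT_MODE_HOST_GUEST;
}
```

With three possible values, `!pt_mode_is_host_guest(m)` is true for both system mode and the third mode, so it cannot be used as a substitute for `pt_mode_is_system(m)`.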
2020-02-23 | KVM: nVMX: Emulate MTF when performing instruction emulation | Oliver Upton | 1 | -0/+3
Since commit 5f3d45e7f282 ("kvm/x86: add support for MONITOR_TRAP_FLAG"), KVM has allowed an L1 guest to use the monitor trap flag processor-based execution control for its L2 guest. KVM simply forwards any MTF VM-exits to the L1 guest, which works for normal instruction execution. However, when KVM needs to emulate an instruction on the behalf of an L2 guest, the monitor trap flag is not emulated. Add the necessary logic to kvm_skip_emulated_instruction() to synthesize an MTF VM-exit to L1 upon instruction emulation for L2. Fixes: 5f3d45e7f282 ("kvm/x86: add support for MONITOR_TRAP_FLAG") Signed-off-by: Oliver Upton <oupton@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-02-17 | x86/cpu: Move prototype for get_umwait_control_msr() to a global location | Benjamin Thiel | 1 | -2/+0
.. in order to fix a -Wmissing-prototypes warning. No functional change. Signed-off-by: Benjamin Thiel <b.thiel@posteo.de> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: kvm@vger.kernel.org Link: https://lkml.kernel.org/r/20200123172945.7235-1-b.thiel@posteo.de
2020-01-13 | x86/msr-index: Clean up bit defines for IA32_FEATURE_CONTROL MSR | Sean Christopherson | 1 | -1/+1
As pointed out by Boris, the defines for bits in IA32_FEATURE_CONTROL are quite a mouthful, especially the VMX bits which must differentiate between enabling VMX inside and outside SMX (TXT) operation. Rename the MSR and its bit defines to abbreviate FEATURE_CONTROL as FEAT_CTL to make them a little friendlier on the eyes. Arguably, the MSR itself should keep the full IA32_FEATURE_CONTROL name to match Intel's SDM, but a future patch will add a dedicated Kconfig, file and functions for the MSR. Using the full name for those assets is rather unwieldy, so bite the bullet and use IA32_FEAT_CTL so that its nomenclature is consistent throughout the kernel. Opportunistically, fix a few other annoyances with the defines: - Relocate the bit defines so that they immediately follow the MSR define, e.g. aren't mistaken as belonging to MISC_FEATURE_CONTROL. - Add whitespace around the block of feature control defines to make it clear they're all related. - Use BIT() instead of manually encoding the bit shift. - Use "VMX" instead of "VMXON" to match the SDM. - Append "_ENABLED" to the LMCE (Local Machine Check Exception) bit to be consistent with the kernel's verbiage used for all other feature control bits. Note, the SDM refers to the LMCE bit as LMCE_ON, likely to differentiate it from IA32_MCG_EXT_CTL.LMCE_EN. Ignore the (literal) one-off usage of _ON, the SDM is simply "wrong". Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20191221044513.21680-2-sean.j.christopherson@intel.com
2019-12-04 | kvm: vmx: Stop wasting a page for guest_msrs | Jim Mattson | 1 | -1/+7
We will never need more guest_msrs than there are indices in vmx_msr_index. Thus, at present, the guest_msrs array will not exceed 168 bytes. Signed-off-by: Jim Mattson <jmattson@google.com> Reviewed-by: Liran Alon <liran.alon@oracle.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-11-21 | Merge branch 'kvm-tsx-ctrl' into HEAD | Paolo Bonzini | 1 | -0/+11
Conflicts: arch/x86/kvm/vmx/vmx.c
2019-11-15 | KVM: nVMX: Add support for capturing highest observable L2 TSC | Aaron Lewis | 1 | -0/+5
The L1 hypervisor may include the IA32_TIME_STAMP_COUNTER MSR in the vmcs12 MSR VM-exit MSR-store area as a way of determining the highest TSC value that might have been observed by L2 prior to VM-exit. The current implementation does not capture a very tight bound on this value. To tighten the bound, add the IA32_TIME_STAMP_COUNTER MSR to the vmcs02 VM-exit MSR-store area whenever it appears in the vmcs12 VM-exit MSR-store area. When L0 processes the vmcs12 VM-exit MSR-store area during the emulation of an L2->L1 VM-exit, special-case the IA32_TIME_STAMP_COUNTER MSR, using the value stored in the vmcs02 VM-exit MSR-store area to derive the value to be stored in the vmcs12 VM-exit MSR-store area. Reviewed-by: Liran Alon <liran.alon@oracle.com> Reviewed-by: Jim Mattson <jmattson@google.com> Signed-off-by: Aaron Lewis <aaronlewis@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-11-15 | kvm: vmx: Rename NR_AUTOLOAD_MSRS to NR_LOADSTORE_MSRS | Aaron Lewis | 1 | -2/+2
Rename NR_AUTOLOAD_MSRS to NR_LOADSTORE_MSRS. This needs to be done due to the addition of the MSR-autostore area that will be added in a future patch. After that the name AUTOLOAD will no longer make sense. Reviewed-by: Jim Mattson <jmattson@google.com> Signed-off-by: Aaron Lewis <aaronlewis@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-11-15 | KVM: nVMX: Update vmcs01 TPR_THRESHOLD if L2 changed L1 TPR | Liran Alon | 1 | -0/+3
When L1 doesn't use TPR-Shadow to run L2, L0 configures vmcs02 without TPR-Shadow and installs intercepts on CR8 accesses (load and store). If L1 does not intercept L2's CR8 accesses, L0's intercepts on those accesses will emulate the load/store on L1's LAPIC TPR. If in this case L2 lowers TPR such that there is now an injectable interrupt to L1, apic_update_ppr() will request a KVM_REQ_EVENT, which will trigger a call to update_cr8_intercept() to update TPR-Threshold to the highest pending IRR priority. However, this update to TPR-Threshold is done while the active VMCS is vmcs02 instead of vmcs01. Thus, when L0 later emulates an exit from L2 to L1, L1 will still run with a high TPR-Threshold. This will result in every VM-Entry to L1 immediately exiting on TPR_BELOW_THRESHOLD, and continuing to do so indefinitely until some condition causes KVM_REQ_EVENT to be set. (Note that the TPR_BELOW_THRESHOLD exit handler does not set KVM_REQ_EVENT until apic_update_ppr() notices a new injectable interrupt for PPR.) To fix this issue, change update_cr8_intercept() such that if L2 lowers L1's TPR in a way that requires lowering L1's TPR-Threshold, the update to TPR-Threshold is saved and applied to vmcs01 when L0 emulates an exit from L2 to L1. Reviewed-by: Joao Martins <joao.m.martins@oracle.com> Signed-off-by: Liran Alon <liran.alon@oracle.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-11-12 | KVM: VMX: Introduce pi_is_pir_empty() helper | Joao Martins | 1 | -0/+5
Streamline the PID.PIR check and change its call sites to use the newly added helper. Suggested-by: Liran Alon <liran.alon@oracle.com> Signed-off-by: Joao Martins <joao.m.martins@oracle.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
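What a "PIR is empty" check amounts to can be sketched in plain C. The layout below is an assumption for illustration only: a 256-bit posted-interrupt request bitmap stored as eight 32-bit words, with `struct pi_desc_sketch`, `pir_is_empty()` and `pir_post()` being hypothetical names rather than the kernel's definitions.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PIR_WORDS 8   /* 256 interrupt vectors / 32 bits per word */

/* Hypothetical stand-in for the PIR part of a posted-interrupt descriptor. */
struct pi_desc_sketch {
    uint32_t pir[PIR_WORDS];
};

/* True when no vector is pending in the request bitmap. */
static inline bool pir_is_empty(const struct pi_desc_sketch *pd)
{
    for (size_t i = 0; i < PIR_WORDS; i++)
        if (pd->pir[i])
            return false;
    return true;
}

/* Mark one vector as pending; uint8_t keeps the index within 0..255. */
static inline void pir_post(struct pi_desc_sketch *pd, uint8_t vector)
{
    pd->pir[vector / 32] |= UINT32_C(1) << (vector % 32);
}
```

Wrapping the scan in one helper means call sites ask a single question ("is anything pending?") instead of each repeating the bitmap walk.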
2019-11-12 | KVM: VMX: Do not change PID.NDST when loading a blocked vCPU | Joao Martins | 1 | -0/+6
When a vCPU enters the block phase, pi_pre_block() inserts the vCPU into a per-pCPU linked list of all vCPUs that are blocked on this pCPU. Afterwards, it changes PID.NV to POSTED_INTR_WAKEUP_VECTOR, whose handler (wakeup_handler()) is responsible for kicking (unblocking) any vCPU on that linked list that now has pending posted interrupts. While the vCPU is blocked (in kvm_vcpu_block()), it may be preempted, which will cause vmx_vcpu_pi_put() to set PID.SN. If the vCPU is later scheduled to run on a different pCPU, vmx_vcpu_pi_load() will clear PID.SN but will also *overwrite PID.NDST to this different pCPU*, instead of keeping it set to the original pCPU on which the vCPU entered the block phase. This is a problem because, when a posted interrupt is delivered, wakeup_handler() will be executed and fail to find the blocked vCPU on its per-pCPU linked list, as the vCPU was placed on a *different* per-pCPU linked list, i.e. that of the original pCPU on which it entered the block phase. The regression was introduced by commit c112b5f50232 ("KVM: x86: Recompute PID.ON when clearing PID.SN"). Therefore, partially revert it and reintroduce the condition in vmx_vcpu_pi_load() responsible for avoiding changing PID.NDST when loading a blocked vCPU. Fixes: c112b5f50232 ("KVM: x86: Recompute PID.ON when clearing PID.SN") Tested-by: Nathan Ni <nathan.ni@oracle.com> Co-developed-by: Liran Alon <liran.alon@oracle.com> Signed-off-by: Liran Alon <liran.alon@oracle.com> Signed-off-by: Joao Martins <joao.m.martins@oracle.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-09-24 | KVM: vmx: Emulate MSR IA32_UMWAIT_CONTROL | Tao Xu | 1 | -0/+9
The UMWAIT and TPAUSE instructions use the 32-bit IA32_UMWAIT_CONTROL MSR at index E1H to determine the maximum time in TSC-quanta that the processor can reside in either C0.1 or C0.2. This patch emulates MSR IA32_UMWAIT_CONTROL in the guest and differentiates IA32_UMWAIT_CONTROL between host and guest. The variable umwait_control_cached in arch/x86/kernel/cpu/umwait.c caches the MSR value, so this patch uses it to avoid frequent RDMSRs of IA32_UMWAIT_CONTROL. Co-developed-by: Jingqi Liu <jingqi.liu@intel.com> Signed-off-by: Jingqi Liu <jingqi.liu@intel.com> Signed-off-by: Tao Xu <tao3.xu@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-09-10 | KVM: VMX: Change ple_window type to unsigned int | Peter Xu | 1 | -1/+1
The VMX ple_window is 32 bits wide, so logically it can overflow with an int. The module parameter is declared as unsigned int, which is good; however, the dynamic variable is not. Switch all the ple_window references to use unsigned int. The tracepoint changes will also affect SVM, but SVM is using an even smaller width (16 bits) so it's always fine. Suggested-by: Sean Christopherson <sean.j.christopherson@intel.com> Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Peter Xu <peterx@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-06-18 | KVM: VMX: Explicitly initialize controls shadow at VMCS allocation | Sean Christopherson | 1 | -5/+0
Or: Don't re-initialize vmcs02's controls on every nested VM-Entry. VMWRITEs to the major VMCS controls are deceptively expensive. Intel CPUs with VMCS caching (Westmere and later) also optimize away consistency checks on VM-Entry, i.e. skip consistency checks if the relevant fields have not changed since the last successful VM-Entry (of the cached VMCS). Because uops are a precious commodity, uCode's dirty VMCS field tracking isn't as precise as software would prefer. Notably, writing any of the major VMCS fields effectively marks the entire VMCS dirty, i.e. causes the next VM-Entry to perform all consistency checks, which consumes several hundred cycles. Zero out the controls' shadow copies during VMCS allocation and use the optimized setter when "initializing" controls. While this technically affects both non-nested and nested virtualization, nested virtualization is the primary beneficiary, as avoiding VMWRITEs when preparing vmcs02 allows hardware to optimize away consistency checks. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-06-18 | KVM: nVMX: Don't reset VMCS controls shadow on VMCS switch | Sean Christopherson | 1 | -4/+0
... now that the shadow copies are per-VMCS. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-06-18 | KVM: nVMX: Shadow VMCS controls on a per-VMCS basis | Sean Christopherson | 1 | -15/+7
... to pave the way for not preserving the shadow copies across switches between vmcs01 and vmcs02, and eventually to avoid VMWRITEs to vmcs02 when the desired value is unchanged across nested VM-Enters. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-06-18 | KVM: VMX: Shadow VMCS secondary execution controls | Sean Christopherson | 1 | -0/+2
Prepare to shadow all major control fields on a per-VMCS basis, which allows KVM to avoid costly VMWRITEs when switching between vmcs01 and vmcs02. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-06-18 | KVM: VMX: Shadow VMCS primary execution controls | Sean Christopherson | 1 | -0/+2
Prepare to shadow all major control fields on a per-VMCS basis, which allows KVM to avoid VMREADs when switching between vmcs01 and vmcs02, and more importantly can eliminate costly VMWRITEs to controls when preparing vmcs02. Shadowing exec controls also saves a VMREAD when opening virtual INTR/NMI windows, yay... Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-06-18 | KVM: VMX: Shadow VMCS pin controls | Sean Christopherson | 1 | -0/+2
Prepare to shadow all major control fields on a per-VMCS basis, which allows KVM to avoid costly VMWRITEs when switching between vmcs01 and vmcs02. Shadowing pin controls also allows a future patch to remove the per-VMCS 'hv_timer_armed' flag, as the shadow copy is a superset of said flag. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-06-18 | KVM: VMX: Add builder macros for shadowing controls | Sean Christopherson | 1 | -64/+36
... to pave the way for shadowing all (five) major VMCS control fields without massive amounts of error prone copy+paste+modify. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
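The idea of generating one set of shadowed accessors per control field from a single macro can be sketched as follows. This is a userspace illustration under stated assumptions, not the macro from the patch: `struct vmcs_sketch`, the `hw_*` fields that stand in for hardware VMCS fields, the `vmwrites` counter and the macro body are all hypothetical. The setter writes the "hardware" field only when the value differs from the software copy, which is the behaviour the shadowing commits in this series rely on.

```c
#include <stdint.h>

/* Hypothetical VMCS model: hw_* simulate hardware fields, shadow_* are
 * the software copies kept per VMCS. */
struct vmcs_sketch {
    uint32_t hw_pin, hw_exec;
    uint32_t shadow_pin, shadow_exec;
    unsigned int vmwrites;   /* counts simulated VMWRITEs */
};

/* Generate set/get/setbit accessors for one control field. */
#define BUILD_CONTROLS_SHADOW(name)                                          \
static inline void name##_controls_set(struct vmcs_sketch *v, uint32_t val)  \
{                                                                            \
    if (v->shadow_##name != val) {                                           \
        v->hw_##name = val;                                                  \
        v->shadow_##name = val;                                              \
        v->vmwrites++;                                                       \
    }                                                                        \
}                                                                            \
static inline uint32_t name##_controls_get(const struct vmcs_sketch *v)      \
{                                                                            \
    return v->shadow_##name;                                                 \
}                                                                            \
static inline void name##_controls_setbit(struct vmcs_sketch *v, uint32_t b) \
{                                                                            \
    name##_controls_set(v, name##_controls_get(v) | b);                      \
}

BUILD_CONTROLS_SHADOW(pin)
BUILD_CONTROLS_SHADOW(exec)
```

In this sketch a repeated `pin_controls_set()` with the same value costs no simulated VMWRITE, and reads are served from the shadow copy without touching the hardware field. Adding another control field is one more `BUILD_CONTROLS_SHADOW(...)` line plus the two struct members.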
2019-06-18 | KVM: nVMX: Use adjusted pin controls for vmcs02 | Sean Christopherson | 1 | -0/+1
KVM provides a module parameter to allow disabling virtual NMI support to simplify testing (hardware *without* virtual NMI support is hard to come by but it does have users). When preparing vmcs02, use the accessor for pin controls to ensure that the module param is respected for nested guests. Opportunistically swap the order of applying L0's and L1's pin controls to better align with other controls and to prepare for a future patch that will ignore L1's, but not L0's, preemption timer flag. Fixes: d02fcf50779ec ("kvm: vmx: Allow disabling virtual NMI support") Cc: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-06-18 | KVM: nVMX: Don't reread VMCS-agnostic state when switching VMCS | Sean Christopherson | 1 | -0/+1
When switching between vmcs01 and vmcs02, there is no need to update state tracking for values that aren't tied to any particular VMCS as the per-vCPU values are already up-to-date (vmx_switch_vmcs() can only be called when the vCPU is loaded). Avoiding the update eliminates a RDMSR, and potentially a RDPKRU and posted-interrupt update (cmpxchg64() and more). Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-06-18KVM: nVMX: Don't "put" vCPU or host state when switching VMCSSean Christopherson1-1/+2
When switching between vmcs01 and vmcs02, KVM isn't actually switching between guest and host. If guest state is already loaded (the likely, if not guaranteed, case), keep the guest state loaded and manually swap the loaded_cpu_state pointer after propagating saved host state to the new vmcs0{1,2}. Avoiding the switch between guest and host reduces the latency of switching between vmcs01 and vmcs02 by several hundred cycles, and reduces the roundtrip time of a nested VM by upwards of 1000 cycles. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-06-18KVM: VMX: simplify vmx_prepare_switch_to_{guest,host}Paolo Bonzini1-6/+12
vmx->loaded_cpu_state can only be NULL or equal to vmx->loaded_vmcs, so change it to a bool. Because the direction of the bool is now the opposite of vmx->guest_msrs_dirty, change the direction of vmx->guest_msrs_dirty so that they match. Finally, do not imply that MSRs have to be reloaded when vmx->guest_state_loaded is false; instead, set vmx->guest_msrs_ready to false explicitly in vmx_prepare_switch_to_host. Cc: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
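The two-flag arrangement described above can be sketched as follows. This is a simplified model under stated assumptions, not the kernel's implementation: `struct fake_vcpu` and the counters are hypothetical, and the real functions do far more work. The sketch only shows that both flags now point the same direction (true means "nothing to do") and that the host path clears `guest_msrs_ready` explicitly.

```c
#include <stdbool.h>

/* Hypothetical per-vCPU state; only the two flags mirror names from the commit. */
struct fake_vcpu {
	bool guest_state_loaded;	/* replaces the old loaded_cpu_state pointer */
	bool guest_msrs_ready;		/* inverted sense of the old guest_msrs_dirty */
	unsigned int nr_msr_loads;	/* counts simulated guest MSR reloads */
	unsigned int nr_host_saves;	/* counts simulated host-state saves */
};

static inline void prepare_switch_to_guest(struct fake_vcpu *v)
{
	/* Reload guest MSRs only if they were invalidated. */
	if (!v->guest_msrs_ready) {
		v->guest_msrs_ready = true;
		v->nr_msr_loads++;
	}

	if (v->guest_state_loaded)
		return;

	/* Save host state, then mark guest state as live. */
	v->nr_host_saves++;
	v->guest_state_loaded = true;
}

static inline void prepare_switch_to_host(struct fake_vcpu *v)
{
	if (!v->guest_state_loaded)
		return;

	v->guest_state_loaded = false;
	/* Explicit, rather than implied by guest_state_loaded being false. */
	v->guest_msrs_ready = false;
}
```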
2019-06-18KVM: nVMX: Sync rarely accessed guest fields only when neededSean Christopherson1-0/+7
Many guest fields are rarely read (or written) by VMMs, i.e. likely aren't accessed between runs of a nested VMCS. Delay pulling rarely accessed guest fields from vmcs02 until they are VMREAD or until vmcs12 is dirtied. The latter case is necessary because nested VM-Entry will consume all manner of fields when vmcs12 is dirty, e.g. for consistency checks. Note, an alternative to synchronizing all guest fields on VMREAD would be to read *only* the field being accessed, but switching VMCS pointers is expensive and odds are good if one guest field is being accessed then others will soon follow, or that vmcs12 will be dirtied due to a VMWRITE (see above). And the full synchronization results in slightly cleaner code. Note, although GUEST_PDPTRs are relevant only for a 32-bit PAE guest, they are accessed quite frequently for said guests, and a separate patch is in flight to optimize away GUEST_PDPTR synchronization for non-PAE guests. Skipping rarely accessed guest fields reduces the latency of a nested VM-Exit by ~200 cycles. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
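The deferred synchronization can be modeled with a single "stale" flag. The sketch below is illustrative only: `struct nested_state`, `NR_RARE_FIELDS` and the helper names are hypothetical, and plain arrays stand in for vmcs02 and vmcs12. A nested VM-Exit merely marks the vmcs12 copy stale, and the full copy happens on the first emulated VMREAD or VMWRITE that follows.

```c
#include <stdbool.h>
#include <stdint.h>

enum { NR_RARE_FIELDS = 4 };	/* arbitrary size for the sketch */

/* Hypothetical model: arrays stand in for the rarely accessed guest fields. */
struct nested_state {
	uint64_t vmcs02_rare[NR_RARE_FIELDS];	/* values held by "hardware" */
	uint64_t vmcs12_rare[NR_RARE_FIELDS];	/* copy visible to L1 */
	bool need_sync_rare;			/* vmcs12 copy may be stale */
	unsigned int nr_full_syncs;		/* counts full synchronizations */
};

/* Pull every rare field at once, as the commit message prefers. */
static inline void sync_rare_fields(struct nested_state *n)
{
	for (int i = 0; i < NR_RARE_FIELDS; i++)
		n->vmcs12_rare[i] = n->vmcs02_rare[i];
	n->need_sync_rare = false;
	n->nr_full_syncs++;
}

/* Nested VM-Exit: defer the copy, only record that it is pending. */
static inline void nested_vmexit(struct nested_state *n)
{
	n->need_sync_rare = true;
}

/* Emulated VMREAD of a rare field: synchronize first if needed. */
static inline uint64_t emulate_vmread(struct nested_state *n, int idx)
{
	if (n->need_sync_rare)
		sync_rare_fields(n);
	return n->vmcs12_rare[idx];
}

/* Emulated VMWRITE dirties vmcs12, so pending fields are pulled in first. */
static inline void emulate_vmwrite(struct nested_state *n, int idx, uint64_t val)
{
	if (n->need_sync_rare)
		sync_rare_fields(n);
	n->vmcs12_rare[idx] = val;
}
```

In this model, a run of nested VM-Exits with no intervening VMREAD or VMWRITE never pays for the copy, which is where the latency saving would come from.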
2019-06-18KVM: nVMX: Use descriptive names for VMCS sync functions and flagsSean Christopherson1-1/+1
Nested virtualization involves copying data between many different types of VMCSes, e.g. vmcs02, vmcs12, shadow VMCS and eVMCS. Rename a variety of functions and flags to document both the source and destination of each sync. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-06-18KVM: VMX: Store the host kernel's IDT base in a global variableSean Christopherson1-1/+0
Although the kernel may use multiple IDTs, KVM should only ever see the "real" IDT, e.g. the early init IDT is long gone by the time KVM runs and the debug stack IDT is only used for small windows of time in very specific flows. Before commit a547c6db4d2f1 ("KVM: VMX: Enable acknowledge interupt on vmexit"), the kernel's IDT base was consumed by KVM only when setting constant VMCS state, i.e. to set VMCS.HOST_IDTR_BASE. Because constant host state is done once per vCPU, there was ostensibly no need to cache the kernel's IDT base. When support for "ack interrupt on exit" was introduced, KVM added a second consumer of the IDT base as handling already-acked interrupts requires directly calling the interrupt handler, i.e. KVM uses the IDT base to find the address of the handler. Because interrupts are a fast path, KVM cached the IDT base to avoid having to VMREAD HOST_IDTR_BASE. Presumably, the IDT base was cached on a per-vCPU basis simply because the existing code grabbed the IDT base on a per-vCPU (VMCS) basis. Note, all post-boot IDTs use the same handlers for external interrupts, i.e. the "ack interrupt on exit" use of the IDT base would be unaffected even if the cached IDT somehow did not match the current IDT. And as for the original use case of setting VMCS.HOST_IDTR_BASE, if any of the above analysis is wrong then KVM has had a bug since the beginning of time since KVM has effectively been caching the IDT at vCPU creation since commit a8b732ca01c ("[PATCH] kvm: userspace interface"). Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-06-18KVM: x86: move MSR_IA32_POWER_CTL handling to common codePaolo Bonzini1-2/+0
Make it available to AMD hosts as well, just in case someone is trying to use an Intel processor's CPUID setup. Suggested-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-05-24kvm: vmx: Fix -Wmissing-prototypes warningsYi Wang1-0/+1
We get a warning when building the kernel with W=1: arch/x86/kvm/vmx/vmx.c:6365:6: warning: no previous prototype for ‘vmx_update_host_rsp’ [-Wmissing-prototypes] void vmx_update_host_rsp(struct vcpu_vmx *vmx, unsigned long host_rsp) Add the missing declaration to fix this. Signed-off-by: Yi Wang <wang.yi59@zte.com.cn> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-04-30KVM/nVMX: Use kvm_vcpu_map for accessing the enlightened VMCSKarimAllah Ahmed1-1/+1
Use kvm_vcpu_map for accessing the enlightened VMCS since using kvm_vcpu_gpa_to_page() and kmap() will only work for guest memory that has a "struct page". Signed-off-by: KarimAllah Ahmed <karahmed@amazon.de> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-04-30KVM/nVMX: Use kvm_vcpu_map when mapping the posted interrupt descriptor tableKarimAllah Ahmed1-1/+1
Use kvm_vcpu_map when mapping the posted interrupt descriptor table since using kvm_vcpu_gpa_to_page() and kmap() will only work for guest memory that has a "struct page". One additional semantic change is that the lifecycle of the host virtual mapping has changed a bit. It now has the same lifetime as the pinning of the interrupt descriptor table page on the host side. Signed-off-by: KarimAllah Ahmed <karahmed@amazon.de> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-04-30KVM/nVMX: Use kvm_vcpu_map when mapping the virtual APIC pageKarimAllah Ahmed1-1/+1
Use kvm_vcpu_map when mapping the virtual APIC page since using kvm_vcpu_gpa_to_page() and kmap() will only work for guest memory that has a "struct page". One additional semantic change is that the lifecycle of the host virtual mapping has changed a bit. It now has the same lifetime as the pinning of the virtual APIC page on the host side. Signed-off-by: KarimAllah Ahmed <karahmed@amazon.de> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>