path: root/arch/x86/kvm/vmx/vmx.h
Age | Commit message | Author | Files | Lines
2022-09-26 | KVM: VMX: Add missing CPU based VM execution controls to vmcs_config | Vitaly Kuznetsov | 1 | -1/+5
As a preparation to reusing the result of setup_vmcs_config() in nested VMX MSR setup, add the CPU based VM execution controls which KVM doesn't use but supports for nVMX to KVM_OPT_VMX_CPU_BASED_VM_EXEC_CONTROL and filter them out in vmx_exec_control(). No functional change intended. Reviewed-by: Jim Mattson <jmattson@google.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Link: https://lore.kernel.org/r/20220830133737.1539624-27-vkuznets@redhat.com Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-09-26 | KVM: VMX: Add missing VMEXIT controls to vmcs_config | Vitaly Kuznetsov | 1 | -0/+3
As a preparation to reusing the result of setup_vmcs_config() in nested VMX MSR setup, add the VMEXIT controls which KVM doesn't use but supports for nVMX to KVM_OPT_VMX_VM_EXIT_CONTROLS and filter them out in vmx_vmexit_ctrl(). No functional change intended. Reviewed-by: Jim Mattson <jmattson@google.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Link: https://lore.kernel.org/r/20220830133737.1539624-26-vkuznets@redhat.com Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-09-26 | KVM: VMX: Extend VMX controls macro shenanigans | Vitaly Kuznetsov | 1 | -23/+132
When VMX controls macros are used to set or clear a control bit, make sure that this bit was checked in setup_vmcs_config() and thus is properly reflected in vmcs_config. Opportunistically drop pointless "< 0" check for adjust_vmx_controls()'s return value. No functional change intended. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Link: https://lore.kernel.org/r/20220830133737.1539624-24-vkuznets@redhat.com Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
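The idea of tying each set/clear helper to the bits that the setup code has vetted can be sketched in plain C as a compile-time check. This is only an illustration, not the kernel's macros: the bit names, `KNOWN_CTRLS`, `ctrl_shadow` and both helper macros are hypothetical, and C11 `static_assert` stands in for whatever assertion mechanism the kernel actually uses.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical control bits, for illustration only. */
#define CTRL_A (1u << 0)
#define CTRL_B (1u << 1)
#define CTRL_C (1u << 2)

/* Bits that the (hypothetical) setup code checked against hardware. */
#define KNOWN_CTRLS (CTRL_A | CTRL_B)

/* Software copy of the control field. */
static uint32_t ctrl_shadow;

/* Setting a bit outside KNOWN_CTRLS is rejected at compile time. */
#define ctrl_setbit(bit)                                                \
	do {                                                            \
		static_assert(((bit) & ~KNOWN_CTRLS) == 0,              \
			      "bit was never checked by setup code");   \
		ctrl_shadow |= (bit);                                   \
	} while (0)

/* Clearing is guarded the same way. */
#define ctrl_clearbit(bit)                                              \
	do {                                                            \
		static_assert(((bit) & ~KNOWN_CTRLS) == 0,              \
			      "bit was never checked by setup code");   \
		ctrl_shadow &= ~(uint32_t)(bit);                        \
	} while (0)

uint32_t demo_toggle(void)
{
	ctrl_shadow = 0;
	ctrl_setbit(CTRL_A);
	ctrl_setbit(CTRL_B);
	ctrl_clearbit(CTRL_A);
	/* ctrl_setbit(CTRL_C) would fail to compile: CTRL_C is not vetted. */
	return ctrl_shadow;
}
```

The point of the design is that a forgotten bit becomes a build failure rather than a silent mismatch between the shadow and the vetted configuration.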
2022-09-26 | KVM: nVMX: Treat eVMCS as enabled for guest iff Hyper-V is also enabled | Sean Christopherson | 1 | -0/+10
When querying whether or not eVMCS is enabled on behalf of the guest, treat eVMCS as enabled if and only if Hyper-V is enabled/exposed to the guest. Note, flows that come from the host, e.g. KVM_SET_NESTED_STATE, must NOT check for Hyper-V being enabled as KVM doesn't require guest CPUID to be set before most ioctls(). Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Link: https://lore.kernel.org/r/20220830133737.1539624-7-vkuznets@redhat.com Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-08-10 | KVM: VMX: Adjust number of LBR records for PERF_CAPABILITIES at refresh | Sean Christopherson | 1 | -2/+5
Now that the PMU is refreshed when MSR_IA32_PERF_CAPABILITIES is written by host userspace, zero out the number of LBR records for a vCPU during PMU refresh if PMU_CAP_LBR_FMT is not set in PERF_CAPABILITIES instead of handling the check at run-time. guest_cpuid_has() is expensive due to the linear search of guest CPUID entries, intel_pmu_lbr_is_enabled() is checked on every VM-Enter, _and_ simply enumerating the same "Model" as the host causes KVM to set the number of LBR records to a non-zero value. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220727233424.2968356-4-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-08-10 | KVM: VMX: Use proper type-safe functions for vCPU => LBRs helpers | Sean Christopherson | 1 | -9/+17
Turn vcpu_to_lbr_desc() and vcpu_to_lbr_records() into functions in order to provide type safety, to document exactly what they return, and to allow consuming the helpers in vmx.h. Move the definitions as necessary (the macros "reference" to_vmx() before its definition). No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220727233424.2968356-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
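The difference between a macro and a typed inline function can be sketched with simplified stand-in types. Every struct and function name below carries a `_sketch` suffix because it is hypothetical; none of this is the kernel's actual layout.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for the real structures (assumed layout). */
struct kvm_vcpu_sketch {
	int id;
};

struct lbr_desc_sketch {
	unsigned int nr_records;
};

struct vcpu_vmx_sketch {
	struct kvm_vcpu_sketch vcpu;     /* embedded generic vCPU */
	struct lbr_desc_sketch lbr_desc; /* VMX-specific LBR state */
};

/* Recover the containing VMX structure from the embedded vCPU. */
static inline struct vcpu_vmx_sketch *to_vmx_sketch(struct kvm_vcpu_sketch *vcpu)
{
	assert(vcpu != NULL);
	return (struct vcpu_vmx_sketch *)
		((char *)vcpu - offsetof(struct vcpu_vmx_sketch, vcpu));
}

/*
 * Unlike a macro, the signature documents the argument and return types,
 * and passing any other pointer type draws a compiler diagnostic.
 */
static inline struct lbr_desc_sketch *
vcpu_to_lbr_desc_sketch(struct kvm_vcpu_sketch *vcpu)
{
	return &to_vmx_sketch(vcpu)->lbr_desc;
}
```

Because the helper is a function, it also has to appear after `to_vmx_sketch()` is defined, which mirrors the reordering the commit message mentions.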
2022-08-01 | Merge remote-tracking branch 'kvm/next' into kvm-next-5.20 | Paolo Bonzini | 1 | -33/+62
KVM/s390, KVM/x86 and common infrastructure changes for 5.20

x86:
* Permit guests to ignore single-bit ECC errors
* Fix races in gfn->pfn cache refresh; do not pin pages tracked by the cache
* Intel IPI virtualization
* Allow getting/setting pending triple fault with KVM_GET/SET_VCPU_EVENTS
* PEBS virtualization
* Simplify PMU emulation by just using PERF_TYPE_RAW events
* More accurate event reinjection on SVM (avoid retrying instructions)
* Allow getting/setting the state of the speaker port data bit
* Refuse starting the kvm-intel module if VM-Entry/VM-Exit controls are inconsistent
* "Notify" VM exit (detect microarchitectural hangs) for Intel
* Cleanups for MCE MSR emulation

s390:
* add an interface to provide a hypervisor dump for secure guests
* improve selftests to use TAP interface
* enable interpretive execution of zPCI instructions (for PCI passthrough)
* First part of deferred teardown
* CPU Topology
* PV attestation
* Minor fixes

Generic:
* new selftests API using struct kvm_vcpu instead of a (vm, id) tuple

x86:
* Use try_cmpxchg64 instead of cmpxchg64
* Bugfixes
* Ignore benign host accesses to PMU MSRs when PMU is disabled
* Allow disabling KVM's "MONITOR/MWAIT are NOPs!" behavior
* x86/MMU: Allow NX huge pages to be disabled on a per-vm basis
* Port eager page splitting to shadow MMU as well
* Enable CMCI capability by default and handle injected UCNA errors
* Expose pid of vcpu threads in debugfs
* x2AVIC support for AMD
* cleanup PIO emulation
* Fixes for LLDT/LTR emulation
* Don't require refcounted "struct page" to create huge SPTEs

x86 cleanups:
* Use separate namespaces for guest PTEs and shadow PTEs bitmasks
* PIO emulation
* Reorganize rmap API, mostly around rmap destruction
* Do not workaround very old KVM bugs for L0 that runs with nesting enabled
* new selftests API for CPUID
2022-07-28 | KVM: VMX: Add helper to check if the guest PMU has PERF_GLOBAL_CTRL | Sean Christopherson | 1 | -0/+12
Add a helper to check if the guest PMU has PERF_GLOBAL_CTRL, which is unintuitive _and_ diverges from Intel's architecturally defined behavior. Even worse, KVM currently implements the check using two different (but equivalent) checks, _and_ there has been at least one attempt to add a _third_ flavor. Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220722224409.1336532-4-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-27 | KVM: VMX: Prevent RSB underflow before vmenter | Josh Poimboeuf | 1 | -2/+2
On VMX, there are some balanced returns between the time the guest's SPEC_CTRL value is written, and the vmenter. Balanced returns (matched by a preceding call) are usually ok, but it's at least theoretically possible an NMI with a deep call stack could empty the RSB before one of the returns. For maximum paranoia, don't allow *any* returns (balanced or otherwise) between the SPEC_CTRL write and the vmenter. [ bp: Fix 32-bit build. ] Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Borislav Petkov <bp@suse.de>
2022-06-27 | KVM: VMX: Prevent guest RSB poisoning attacks with eIBRS | Josh Poimboeuf | 1 | -0/+1
On eIBRS systems, the returns in the vmexit return path from __vmx_vcpu_run() to vmx_vcpu_run() are exposed to RSB poisoning attacks. Fix that by moving the post-vmexit spec_ctrl handling to immediately after the vmexit. Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Borislav Petkov <bp@suse.de>
2022-06-27 | KVM: VMX: Convert launched argument to flags | Josh Poimboeuf | 1 | -1/+4
Convert __vmx_vcpu_run()'s 'launched' argument to 'flags', in preparation for doing SPEC_CTRL handling immediately after vmexit, which will need another flag. This is much easier than adding a fourth argument, because this code supports both 32-bit and 64-bit, and the fourth argument on 32-bit would have to be pushed on the stack. Note that __vmx_vcpu_run_flags() is called outside of the noinstr critical section because it will soon start calling potentially traceable functions. Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Borislav Petkov <bp@suse.de>
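Folding a boolean argument into a flags word, so that a later condition does not need an extra argument, might look like the sketch below. The flag names and `build_run_flags()` are hypothetical stand-ins; the second flag only illustrates the "another flag" the message anticipates.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag bits, for illustration only. */
#define RUN_VMRESUME       (1u << 0) /* VMCS already launched: resume */
#define RUN_SAVE_SPEC_CTRL (1u << 1) /* assumed follow-up flag */

static_assert((RUN_VMRESUME & RUN_SAVE_SPEC_CTRL) == 0,
	      "flags must not overlap");

/* Pack independent yes/no conditions into one register-sized argument. */
unsigned int build_run_flags(bool launched, bool save_spec_ctrl)
{
	unsigned int flags = 0;

	if (launched)
		flags |= RUN_VMRESUME;
	if (save_spec_ctrl)
		flags |= RUN_SAVE_SPEC_CTRL;

	return flags;
}
```

A single flags word keeps the callee's argument count fixed, which matters on 32-bit where, as the message notes, a fourth argument would have to go on the stack.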
2022-06-20 | KVM: nVMX: Use kvm_vcpu_map() to get/pin vmcs12's APIC-access page | Sean Christopherson | 1 | -1/+1
Use kvm_vcpu_map() to get/pin the backing for vmcs12's APIC-access page, there's no reason it has to be restricted to 'struct page' backing. The APIC-access page actually doesn't need to be backed by anything, which is ironically why it got left behind by the series which introduced kvm_vcpu_map()[1]; the plan was to shove a dummy pfn into vmcs02[2], but that code never got merged. Switching the APIC-access page to kvm_vcpu_map() doesn't preclude using a magic pfn in the future, and will allow a future patch to drop kvm_vcpu_gpa_to_page(). [1] https://lore.kernel.org/all/1547026933-31226-1-git-send-email-karahmed@amazon.de [2] https://lore.kernel.org/lkml/1543845551-4403-1-git-send-email-karahmed@amazon.de Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220429010416.2788472-6-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-20 | KVM: nVMX: Rename nested.vmcs01_* fields to nested.pre_vmenter_* | Sean Christopherson | 1 | -3/+12
Rename the fields in struct nested_vmx used to snapshot pre-VM-Enter values to reflect that they can hold L2's values when restoring nested state, e.g. if userspace restores MSRs before nested state. As crazy as it seems, restoring MSRs before nested state actually works (because KVM goes out of its way to make it work), even though the initial MSR writes will hit vmcs01 despite holding L2 values. Add a related comment to vmx_enter_smm() to call out that using the common VM-Exit and VM-Enter helpers to emulate SMI and RSM is wrong and broken. The few MSRs that have snapshots _could_ be fixed by taking a snapshot prior to the forced VM-Exit instead of at forced VM-Enter, but that's just the tip of the iceberg as the rather long list of MSRs that aren't snapshotted (hello, VM-Exit MSR load list) can't be handled this way. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220614215831.3762138-4-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-08 | KVM: x86/cpuid: Refactor host/guest CPU model consistency check | Like Xu | 1 | -1/+0
For the same purpose, the legacy intel_pmu_lbr_is_compatible() can be renamed for reuse by more callers, and the comment about the LBR use case can be deleted along the way. Signed-off-by: Like Xu <like.xu@linux.intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Message-Id: <20220411101946.20262-17-likexu@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-08 | KVM: x86/pmu: Disable guest PEBS temporarily in two rare situations | Like Xu | 1 | -0/+1
The guest PEBS will be disabled when some users try to perf KVM and its user-space through the same PEBS facility OR when the host perf doesn't schedule the guest PEBS counter in a one-to-one mapping manner (neither of these are typical scenarios). The PEBS records in the guest DS buffer are still accurate and the above two restrictions will be checked before each vm-entry only if guest PEBS is deemed to be enabled. Suggested-by: Wei Wang <wei.w.wang@intel.com> Signed-off-by: Like Xu <like.xu@linux.intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Message-Id: <20220411101946.20262-15-likexu@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-08 | KVM: VMX: enable IPI virtualization | Chao Gao | 1 | -0/+7
With IPI virtualization enabled, the processor emulates writes to APIC registers that would send IPIs. The processor sets the bit corresponding to the vector in target vCPU's PIR and may send a notification (IPI) specified by NDST and NV fields in target vCPU's Posted-Interrupt Descriptor (PID). It is similar to what IOMMU engine does when dealing with posted interrupt from devices. A PID-pointer table is used by the processor to locate the PID of a vCPU with the vCPU's APIC ID. The table size depends on maximum APIC ID assigned for current VM session from userspace. Allocating memory for PID-pointer table is deferred to vCPU creation, because irqchip mode and VM-scope maximum APIC ID is settled at that point. KVM can skip PID-pointer table allocation if !irqchip_in_kernel(). Like VT-d PI, if a vCPU goes to blocked state, VMM needs to switch its notification vector to wakeup vector. This can ensure that when an IPI for blocked vCPUs arrives, VMM can get control and wake up blocked vCPUs. And if a vCPU is preempted, its posted interrupt notification is suppressed. Note that IPI virtualization can only virtualize physical-addressing, flat mode, unicast IPIs. Sending other IPIs would still cause a trap-like APIC-write VM-exit and need to be handled by VMM. Signed-off-by: Chao Gao <chao.gao@intel.com> Signed-off-by: Zeng Guang <guang.zeng@intel.com> Message-Id: <20220419154510.11938-1-guang.zeng@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-08 | KVM: VMX: Detect Tertiary VM-Execution control when setup VMCS config | Robert Hoo | 1 | -0/+1
Check VMX features on tertiary execution control in VMCS config setup. Sub-features in tertiary execution control to be enabled are adjusted according to hardware capabilities although no sub-feature is enabled in this patch. EVMCSv1 doesn't support tertiary VM-execution control, so disable it when EVMCSv1 is in use. And define the auxiliary functions for Tertiary control field here, using the new BUILD_CONTROLS_SHADOW(). Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Signed-off-by: Robert Hoo <robert.hu@linux.intel.com> Signed-off-by: Zeng Guang <guang.zeng@intel.com> Message-Id: <20220419153400.11642-1-guang.zeng@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-08 | KVM: VMX: Extend BUILD_CONTROLS_SHADOW macro to support 64-bit variation | Robert Hoo | 1 | -28/+28
The Tertiary VM-Exec Control, different from previous control fields, is 64 bit. So extend BUILD_CONTROLS_SHADOW() by adding a 'bit' parameter, to support both 32 bit and 64 bit fields' auxiliary functions building. Suggested-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Robert Hoo <robert.hu@linux.intel.com> Signed-off-by: Zeng Guang <guang.zeng@intel.com> Message-Id: <20220419153318.11595-1-guang.zeng@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
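A builder macro that takes the field width as a parameter can be sketched as follows. This is a simplified illustration, not the kernel's `BUILD_CONTROLS_SHADOW()`: `struct controls_shadow`, `BUILD_SHADOW()` and the generated helper names are hypothetical, and the helpers only touch an in-memory copy rather than a real VMCS.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical software copies of two control fields of different width. */
struct controls_shadow {
	uint32_t exec;          /* a 32-bit control field */
	uint64_t tertiary_exec; /* a 64-bit control field */
};

static_assert(sizeof(((struct controls_shadow *)0)->tertiary_exec) == 8,
	      "tertiary controls are 64 bits wide");

/* Generate typed accessors; 'bits' selects uint32_t or uint64_t. */
#define BUILD_SHADOW(lname, bits)                                              \
static inline void lname##_set(struct controls_shadow *s, uint##bits##_t val)  \
{                                                                              \
	s->lname = val;                                                        \
}                                                                              \
static inline uint##bits##_t lname##_get(const struct controls_shadow *s)      \
{                                                                              \
	return s->lname;                                                       \
}                                                                              \
static inline void lname##_setbit(struct controls_shadow *s,                   \
				  uint##bits##_t val)                          \
{                                                                              \
	lname##_set(s, lname##_get(s) | val);                                  \
}                                                                              \
static inline void lname##_clearbit(struct controls_shadow *s,                 \
				    uint##bits##_t val)                        \
{                                                                              \
	lname##_set(s, lname##_get(s) & ~val);                                 \
}

BUILD_SHADOW(exec, 32)
BUILD_SHADOW(tertiary_exec, 64)
```

With the width baked into the generated signatures, a bit above position 31 survives a round trip through the 64-bit helpers instead of being silently truncated.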
2022-05-21 | KVM: x86/speculation: Disable Fill buffer clear within guests | Pawan Gupta | 1 | -0/+2
The enumeration of MD_CLEAR in CPUID(EAX=7,ECX=0).EDX{bit 10} is not an accurate indicator on all CPUs of whether the VERW instruction will overwrite fill buffers. FB_CLEAR enumeration in IA32_ARCH_CAPABILITIES{bit 17} covers the case of CPUs that are not vulnerable to MDS/TAA, indicating that microcode does overwrite fill buffers. Guests running in VMM environments may not be aware of all the capabilities/vulnerabilities of the host CPU. Specifically, a guest may apply MDS/TAA mitigations when a virtual CPU is enumerated as vulnerable to MDS/TAA even when the physical CPU is not. On CPUs that enumerate FB_CLEAR_CTRL the VMM may set FB_CLEAR_DIS to skip overwriting of fill buffers by the VERW instruction. This is done by setting FB_CLEAR_DIS during VMENTER and resetting on VMEXIT. For guests that enumerate FB_CLEAR (explicitly asking for fill buffer clear capability) the VMM will not use FB_CLEAR_DIS. Irrespective of guest state, host overwrites CPU buffers before VMENTER to protect itself from an MMIO capable guest, as part of mitigation for MMIO Stale Data vulnerabilities. Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Signed-off-by: Borislav Petkov <bp@suse.de>
2022-04-21 | KVM: nVMX: Defer APICv updates while L2 is active until L1 is active | Sean Christopherson | 1 | -0/+1
Defer APICv updates that occur while L2 is active until nested VM-Exit, i.e. until L1 regains control. vmx_refresh_apicv_exec_ctrl() assumes L1 is active and (a) stomps all over vmcs02 and (b) neglects to ever update vmcs01. E.g. if vmcs12 doesn't enable the TPR shadow for L2 (and thus no APICv controls), L1 performs nested VM-Enter with APICv inhibited, and APICv becomes uninhibited while L2 is active, KVM will set various APICv controls in vmcs02 and trigger a failed VM-Entry. The kicker is that, unless running with nested_early_check=1, KVM blames L1 and chaos ensues. In all cases, ignoring vmcs02 and always deferring the inhibition change to vmcs01 is correct (or at least acceptable). The ABSENT and DISABLE inhibitions cannot truly change while L2 is active (see below). IRQ_BLOCKING can change, but it is firmly a best effort debug feature. Furthermore, only L2's APIC is accelerated/virtualized to the full extent possible, e.g. even if L1 passes through its APIC to L2, normal MMIO/MSR interception will apply to the virtual APIC managed by KVM. The exception is the SELF_IPI register when x2APIC is enabled, but that's an acceptable hole. Lastly, Hyper-V's Auto EOI can technically be toggled if L1 exposes the MSRs to L2, but for that to work in any sane capacity, L1 would need to pass through IRQs to L2 as well, and IRQs must be intercepted to enable virtual interrupt delivery. I.e. exposing Auto EOI to L2 and enabling VID for L2 are, for all intents and purposes, mutually exclusive. Lack of dynamic toggling is also why this scenario is all but impossible to encounter in KVM's current form. But a future patch will pend an APICv update request _during_ vCPU creation to plug a race where a vCPU that's being created doesn't get included in the "all vCPUs request" because it's not yet visible to other vCPUs.
If userspace restores L2 after VM creation (hello, KVM selftests), the first KVM_RUN will occur while L2 is active and thus service the APICv update request made during VM creation. Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220420013732.3308816-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-02-25 | Revert "KVM: VMX: Save HOST_CR3 in vmx_set_host_fs_gs()" | Sean Christopherson | 1 | -3/+2
Undo a nested VMX fix as a step toward reverting the commit it fixed, 15ad9762d69f ("KVM: VMX: Save HOST_CR3 in vmx_prepare_switch_to_guest()"), as the underlying premise that "host CR3 in the vcpu thread can only be changed when scheduling" is wrong. This reverts commit a9f2705ec84449e3b8d70c804766f8e97e23080d. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220224191917.3508476-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-01-19 | KVM: Move x86 VMX's posted interrupt list_head to vcpu_vmx | Sean Christopherson | 1 | -0/+3
Move the seemingly generic block_vcpu_list from kvm_vcpu to vcpu_vmx, and rename the list and all associated variables to clarify that it tracks the set of vCPUs that need to be poked on a posted interrupt to the wakeup vector. The list is not used to track _all_ vCPUs that are blocking, and the term "blocked" can be misleading as it may refer to a blocking condition in the host or the guest, whereas the PI wakeup case is specifically for the vCPUs that are actively blocking from within the guest. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20211208015236.1616697-7-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-01-14 | kvm: x86: Disable interception for IA32_XFD on demand | Kevin Tian | 1 | -1/+1
Always intercepting IA32_XFD causes non-negligible overhead when this register is updated frequently in the guest. Disable r/w emulation after intercepting the first WRMSR(IA32_XFD) with a non-zero value. Disabling WRMSR emulation implies that IA32_XFD becomes out-of-sync with the software states in fpstate and the per-cpu xfd cache. This leads to two additional changes accordingly: - Call fpu_sync_guest_vmexit_xfd_state() after vm-exit to bring software states back in-sync with the MSR, before handle_exit_irqoff() is called. - Always trap #NM once write interception is disabled for IA32_XFD. The #NM exception is rare if the guest doesn't use dynamic features. Otherwise, there is at most one exception per guest task given a dynamic feature. p.s. We have confirmed that SDM is being revised to say that when setting IA32_XFD[18] the AMX register state is not guaranteed to be preserved. This clarification avoids adding mess for a creative guest which sets IA32_XFD[18]=1 before saving active AMX state to its own storage. Signed-off-by: Kevin Tian <kevin.tian@intel.com> Signed-off-by: Jing Liu <jing2.liu@intel.com> Signed-off-by: Yang Zhong <yang.zhong@intel.com> Message-Id: <20220105123532.12586-22-yang.zhong@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-01-14 | kvm: x86: Disable RDMSR interception of IA32_XFD_ERR | Jing Liu | 1 | -1/+1
This saves one unnecessary VM-exit in guest #NM handler, given that the MSR is already restored with the guest value before the guest is resumed. Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Jing Liu <jing2.liu@intel.com> Signed-off-by: Yang Zhong <yang.zhong@intel.com> Message-Id: <20220105123532.12586-15-yang.zhong@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-01-07 | KVM: VMX: Save HOST_CR3 in vmx_set_host_fs_gs() | Lai Jiangshan | 1 | -2/+3
The host CR3 in the vcpu thread can only be changed when scheduling, so commit 15ad9762d69f ("KVM: VMX: Save HOST_CR3 in vmx_prepare_switch_to_guest()") changed vmx.c to only save it in vmx_prepare_switch_to_guest(). However, it also has to be synced in vmx_sync_vmcs_host_state() when switching VMCS. vmx_set_host_fs_gs() is called in both places, so rename it to vmx_set_vmcs_host_state() and make it update HOST_CR3. Fixes: 15ad9762d69f ("KVM: VMX: Save HOST_CR3 in vmx_prepare_switch_to_guest()") Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Message-Id: <20211216021938.11752-2-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-12-08 | KVM: nVMX: Track whether changes in L0 require MSR bitmap for L2 to be rebuilt | Vitaly Kuznetsov | 1 | -0/+9
Introduce a flag to keep track of whether MSR bitmap for L2 needs to be rebuilt due to changes in MSR bitmap for L1 or switching to a different L2. This information will be used for Enlightened MSR Bitmap feature for Hyper-V guests. Note, setting msr_bitmap_changed to 'true' from set_current_vmptr() is not really needed for Enlightened MSR Bitmap as the feature can only be used in conjunction with Enlightened VMCS but let's keep tracking information complete, it's cheap and in the future similar PV feature can easily be implemented for KVM on KVM too. No functional change intended. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20211129094704.326635-4-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-12-08 | KVM: vmx, svm: clean up mass updates to regs_avail/regs_dirty bits | Paolo Bonzini | 1 | -13/+15
Document the meaning of the three combinations of regs_avail and regs_dirty. Update regs_dirty just after writeback instead of doing it later after vmexit. After vmexit, instead, we clear the regs_avail bits corresponding to lazily-loaded registers. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
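One way to read the three combinations is as a small state machine per register. The sketch below is an interpretation under stated assumptions, not the kernel's code: the enum, `struct reg_cache` and the helper names are hypothetical, and the meaning attached to each combination (value lives in hardware state, cached and clean, cached and needing writeback) is my reading of the description rather than something this log spells out.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed interpretation of the (avail, dirty) pairs. */
enum reg_state {
	REG_IN_HW,        /* avail=0, dirty=0: must be read lazily */
	REG_CACHED_CLEAN, /* avail=1, dirty=0: cached copy is current */
	REG_CACHED_DIRTY, /* avail=1, dirty=1: cached copy needs writeback */
	REG_INVALID       /* avail=0, dirty=1: not a meaningful state */
};

struct reg_cache {
	uint32_t avail; /* one bit per register */
	uint32_t dirty; /* one bit per register */
};

enum reg_state reg_state(const struct reg_cache *c, unsigned int reg)
{
	assert(reg < 32);
	unsigned int avail = (c->avail >> reg) & 1;
	unsigned int dirty = (c->dirty >> reg) & 1;

	if (avail)
		return dirty ? REG_CACHED_DIRTY : REG_CACHED_CLEAN;
	return dirty ? REG_INVALID : REG_IN_HW;
}

/* Right after writeback nothing is dirty any more. */
void mark_written_back(struct reg_cache *c)
{
	c->dirty = 0;
}

/* After an exit, lazily loaded registers must be re-read on demand. */
void after_vmexit(struct reg_cache *c, uint32_t lazy_mask)
{
	c->avail &= ~lazy_mask;
}
```

Clearing the dirty bits at writeback, before the exit path clears the available bits, means the fourth combination never arises in this model.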
2021-11-18 | KVM: nVMX: Use a gfn_to_hva_cache for vmptrld | David Woodhouse | 1 | -0/+5
And thus another call to kvm_vcpu_map() can die. Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Message-Id: <20211115165030.7422-7-dwmw2@infradead.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-11-18 | KVM: nVMX: Use kvm_{read,write}_guest_cached() for shadow_vmcs12 | David Woodhouse | 1 | -0/+5
Using kvm_vcpu_map() for reading from the guest is entirely gratuitous, when all we do is a single memcpy and unmap it again. Fix it up to use kvm_read_guest()... but in fact I couldn't bring myself to do that without also making it use a gfn_to_hva_cache for both that *and* the copy in the other direction. Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Message-Id: <20211115165030.7422-5-dwmw2@infradead.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-11-11 | KVM: VMX: Add a helper function to retrieve the GPR index for INVPCID, INVVPID, and INVEPT | Vipin Sharma | 1 | -0/+5
handle_invept(), handle_invvpid(), handle_invpcid() read the same reg2 field in vmcs.VMX_INSTRUCTION_INFO to get the index of the GPR that holds the invalidation type. Add a helper to retrieve reg2 from VMX instruction info to consolidate and document the shift+mask magic. Signed-off-by: Vipin Sharma <vipinsh@google.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Message-Id: <20211109174426.2350547-2-vipinsh@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
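The "shift+mask magic" can be sketched as a one-line helper. The function name below is hypothetical, and the bit position is an assumption taken from the architectural layout of the instruction-information field for these three instructions (reg2 in bits 31:28), so check it against the SDM before relying on it.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Extract the index of the second register operand from a VMX
 * instruction-information value. Assumes reg2 occupies bits 31:28.
 */
static inline int instr_info_reg2(uint32_t instr_info)
{
	return (instr_info >> 28) & 0xf;
}
```

Naming the extraction once keeps the three handlers from each carrying their own copy of the shift and mask.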
2021-11-11 | KVM: VMX: Macrofy the MSR bitmap getters and setters | Sean Christopherson | 1 | -60/+25
Add builder macros to generate the MSR bitmap helpers to reduce the amount of copy-paste code, especially with respect to all the magic numbers needed to calc the correct bit location. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20211109013047.2041518-4-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
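A builder macro for bitmap helpers might look like the sketch below. It is not the kernel's macro: all names are hypothetical, plain byte arithmetic replaces the kernel's bit operations, and the layout (a 4 KiB bitmap with read-low, read-high, write-low and write-high regions at byte offsets 0x000, 0x400, 0x800 and 0xc00) is the architectural MSR-bitmap layout as I understand it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MSR_BITMAP_BYTES 4096

static_assert(MSR_BITMAP_BYTES == 4 * 0x400, "four 1 KiB regions");

/*
 * Map an MSR to its bit position, given the byte offset of the region
 * for low MSRs. Returns -1 for MSRs the bitmap does not cover.
 */
static inline long msr_bit_index(uint32_t msr, long base)
{
	if (msr <= 0x1fff)
		return base * 8 + (long)msr;
	if (msr >= 0xc0000000u && msr <= 0xc0001fffu)
		return (base + 0x400) * 8 + (long)(msr & 0x1fff);
	return -1;
}

/* Generate test/set/clear helpers for one access type. */
#define BUILD_MSR_BITMAP_HELPERS(access, base)                                \
static inline bool msr_test_intercept_##access(const uint8_t *bm,             \
					       uint32_t msr)                  \
{                                                                             \
	long bit = msr_bit_index(msr, base);                                  \
	/* Uncovered MSRs are reported as intercepted. */                     \
	return bit < 0 || ((bm[bit / 8] >> (bit % 8)) & 1);                   \
}                                                                             \
static inline void msr_set_intercept_##access(uint8_t *bm, uint32_t msr)      \
{                                                                             \
	long bit = msr_bit_index(msr, base);                                  \
	if (bit >= 0)                                                         \
		bm[bit / 8] |= (uint8_t)(1u << (bit % 8));                    \
}                                                                             \
static inline void msr_clear_intercept_##access(uint8_t *bm, uint32_t msr)    \
{                                                                             \
	long bit = msr_bit_index(msr, base);                                  \
	if (bit >= 0)                                                         \
		bm[bit / 8] &= (uint8_t)~(1u << (bit % 8));                   \
}

BUILD_MSR_BITMAP_HELPERS(read, 0x000)
BUILD_MSR_BITMAP_HELPERS(write, 0x800)
```

The offsets appear exactly once, in the two macro invocations, which is the copy-paste reduction the message is after.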
2021-11-11 | KVM: nVMX: Handle dynamic MSR intercept toggling | Sean Christopherson | 1 | -0/+63
Always check vmcs01's MSR bitmap when merging L0 and L1 bitmaps for L2, and always update the relevant bits in vmcs02. This fixes two distinct, but intertwined bugs related to dynamic MSR bitmap modifications. The first issue is that KVM fails to enable MSR interception in vmcs02 for the FS/GS base MSRs if L1 first runs L2 with interception disabled, and later enables interception. The second issue is that KVM fails to honor userspace MSR filtering when preparing vmcs02. Fix both issues simultaneously as fixing only one of the issues (doesn't matter which) would create a mess that no one should have to bisect. Fixing only the first bug would exacerbate the MSR filtering issue as userspace would see inconsistent behavior depending on the whims of L1. Fixing only the second bug (MSR filtering) effectively requires fixing the first, as the nVMX code only knows how to transition vmcs02's bitmap from 1->0. Move the various accessor/mutators that are currently buried in vmx.c into vmx.h so that they can be shared by the nested code. Fixes: 1a155254ff93 ("KVM: x86: Introduce MSR filtering") Fixes: d69129b4e46a ("KVM: nVMX: Disable intercept for FS/GS base MSRs in vmcs02 when possible") Cc: stable@vger.kernel.org Cc: Alexander Graf <graf@amazon.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20211109013047.2041518-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
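The merge rule described here, where L2 sees an intercept if either L0 or L1 wants one and every bit is rewritten on each merge, can be sketched as a plain OR over the two source bitmaps. This is an assumption-laden simplification: the function name is hypothetical, and the sketch merges whole bitmaps, whereas real nested code may only handle selected MSRs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Recompute the L2 bitmap from scratch: a set bit means "intercept".
 * Because every byte is overwritten, both 1->0 and 0->1 transitions in
 * either source bitmap are picked up on the next merge.
 */
void merge_l2_intercepts(uint8_t *bm02, const uint8_t *bm01,
			 const uint8_t *bm12, size_t len)
{
	assert(bm02 != NULL && bm01 != NULL && bm12 != NULL);

	for (size_t i = 0; i < len; i++)
		bm02[i] = bm01[i] | bm12[i];
}
```

Writing the result unconditionally avoids the one-directional update that the message identifies as the root of the first bug.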
2021-10-22 | KVM: VMX: Rename pt_desc.addr_range to pt_desc.num_address_ranges | Xiaoyao Li | 1 | -1/+1
The new name better explains the meaning of this field and matches the PT_CAP_num_address_ranges constant. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com> Message-Id: <20210827070249.924633-4-xiaoyao.li@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-09-22 | KVM: x86: nVMX: re-evaluate emulation_required on nested VM exit | Maxim Levitsky | 1 | -0/+1
If L1 had invalid state on VM entry (can happen on SMM transitions when we enter from real mode, straight to nested guest), then after we load 'host' state from VMCS12, the state has to become valid again, but since we load the segment registers with __vmx_set_segment we weren't always updating emulation_required. Update emulation_required explicitly at end of load_vmcs12_host_state. Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20210913140954.165665-8-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-09-22 | KVM: VMX: Remove defunct "nr_active_uret_msrs" field | Sean Christopherson | 1 | -4/+0
Remove vcpu_vmx.nr_active_uret_msrs and its associated comment, which are both defunct now that KVM keeps the list constant and instead explicitly tracks which entries need to be loaded into hardware. No functional change intended. Fixes: ee9d22e08d13 ("KVM: VMX: Use flag to indicate "active" uret MSRs instead of sorting list") Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210908002401.1947049-1-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-13 | KVM: VMX: Hide VMCS control calculators in vmx.c | Sean Christopherson | 1 | -26/+0
Now that nested VMX pulls KVM's desired VMCS controls from vmcs01 instead of re-calculating on the fly, bury the helpers that do the calculations in vmx.c. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210810171952.2758100-5-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-13 | KVM: VMX: Drop caching of KVM's desired sec exec controls for vmcs01 | Sean Christopherson | 1 | -2/+1
Remove the secondary execution controls cache now that it's effectively dead code; it is only read immediately after it is written. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210810171952.2758100-4-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-13 | KVM: nVMX: Pull KVM L0's desired controls directly from vmcs01 | Sean Christopherson | 1 | -1/+5
When preparing controls for vmcs02, grab KVM's desired controls from vmcs01's shadow state instead of recalculating the controls from scratch, or in the secondary execution controls, instead of using the dedicated cache. Calculating secondary exec controls is eye-poppingly expensive due to the guest CPUID checks, hence the dedicated cache, but the other calculations aren't exactly free either. Explicitly clear several bits (x2APIC, DESC exiting, and load EFER on exit) as appropriate as they may be set in vmcs01, whereas the previous implementation relied on dynamic bits being cleared in the calculator. Intentionally propagate VM_{ENTRY,EXIT}_LOAD_IA32_PERF_GLOBAL_CTRL from vmcs01 to vmcs02. Whether or not PERF_GLOBAL_CTRL is loaded depends on whether or not perf itself is active, so unless perf stops between the exit from L1 and entry to L2, vmcs01 will hold the desired value. This is purely an optimization as atomic_switch_perf_msrs() will set/clear the control as needed at VM-Enter, i.e. it avoids two extra VMWRITEs in the case where perf is active (versus starting with the bits clear in vmcs02, which was the previous behavior). Cc: Zeng Guang <guang.zeng@intel.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210810171952.2758100-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-10 | Merge branch 'kvm-vmx-secctl' into HEAD | Paolo Bonzini | 1 | -1/+1
Merge common topic branch for 5.14-rc6 and 5.15 merge window.
2021-08-10 | KVM: VMX: Use current VMCS to query WAITPKG support for MSR emulation | Sean Christopherson | 1 | -1/+1
Use the secondary_exec_controls_get() accessor in vmx_has_waitpkg() to effectively get the controls for the current VMCS, as opposed to using vmx->secondary_exec_controls, which is the cached value of KVM's desired controls for vmcs01 and truly not reflective of any particular VMCS. While the waitpkg control is not dynamic, i.e. vmcs01 will always hold the same waitpkg configuration as vmx->secondary_exec_controls, the same does not hold true for vmcs02 if the L1 VMM hides the feature from L2. If L1 hides the feature _and_ does not intercept MSR_IA32_UMWAIT_CONTROL, L2 could incorrectly read/write L1's virtual MSR instead of taking a #GP. Fixes: 6e3ba4abcea5 ("KVM: vmx: Emulate MSR IA32_UMWAIT_CONTROL") Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210810171952.2758100-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-02 KVM: VMX: Smush x2APIC MSR bitmap adjustments into single function Sean Christopherson 1 -1/+1
Consolidate all of the dynamic MSR bitmap adjustments into vmx_update_msr_bitmap_x2apic(), and rename the mode tracker to reflect that it is x2APIC specific. If KVM gains more cases of dynamic MSR pass-through, odds are very good that those new cases will be better off with their own logic, e.g. see Intel PT MSRs and MSR_IA32_SPEC_CTRL. Attempting to handle all updates in a common helper did more harm than good, as KVM ended up collecting a large number of useless "updates". Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210713163324.627647-42-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-02 KVM: nVMX: Remove obsolete MSR bitmap refresh at nested transitions Sean Christopherson 1 -1/+0
Drop unnecessary MSR bitmap updates during nested transitions, as L1's APIC_BASE MSR is not modified by the standard VM-Enter/VM-Exit flows, and L2's MSR bitmap is managed separately. In the unlikely event that L1 is pathological and loads APIC_BASE via the VM-Exit load list, KVM will handle updating the bitmap in its normal WRMSR flows. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210713163324.627647-39-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-02 KVM: nVMX: Don't evaluate "emulation required" on nested VM-Exit Sean Christopherson 1 -1/+1
Use the "internal" variants of setting segment registers when stuffing state on nested VM-Exit in order to skip the "emulation required" updates. VM-Exit must always go to protected mode, and all segments are mostly hardcoded (to valid values) on VM-Exit. The bits of the segments that aren't hardcoded are explicitly checked during VM-Enter, e.g. the selector RPLs must all be zero. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210713163324.627647-30-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-07-15 KVM: VMX: Remove vmx_msr_index from vmx.h Yu Zhang 1 -2/+0
vmx_msr_index was used to record the list of MSRs which can be lazily restored when kvm returns to userspace. It is now reimplemented as kvm_uret_msrs_list, a common x86 list which is only used inside x86.c. So just remove the obsolete declaration in vmx.h. Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com> Message-Id: <20210707235702.31595-1-yu.c.zhang@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-24 KVM: nVMX: Handle split-lock #AC exceptions that happen in L2 Sean Christopherson 1 -0/+1
Mark #ACs that won't be reinjected to the guest as wanted by L0 so that KVM handles split-lock #AC from L2 instead of forwarding the exception to L1. Split-lock #AC isn't yet virtualized, i.e. L1 will treat it like a regular #AC and do the wrong thing, e.g. reinject it into L2. Fixes: e6f8b6c12f03 ("KVM: VMX: Extend VMXs #AC interceptor to handle split lock #AC in guest") Cc: Xiaoyao Li <xiaoyao.li@intel.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210622172244.3561540-1-seanjc@google.com> Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-17 KVM: x86: hyper-v: Move the remote TLB flush logic out of vmx Vineeth Pillai 1 -9/+0
Currently the remote TLB flush logic is specific to VMX. Move it to a common place so that SVM can use it as well. Signed-off-by: Vineeth Pillai <viremana@linux.microsoft.com> Message-Id: <4f4e4ca19778437dae502f44363a38e99e3ef5d1.1622730232.git.viremana@linux.microsoft.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-17 KVM: X86: Add vendor callbacks for writing the TSC multiplier Ilias Stamatis 1 -8/+0
Currently vmx_vcpu_load_vmcs() writes the TSC_MULTIPLIER field of the VMCS every time the VMCS is loaded. Instead of doing this, set this field from common code on initialization and whenever the scaling ratio changes. Additionally remove vmx->current_tsc_ratio. This field is redundant as vcpu->arch.tsc_scaling_ratio already tracks the current TSC scaling ratio. The vmx->current_tsc_ratio field is only used for avoiding unnecessary writes but it is no longer needed after removing the code from the VMCS load path. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Ilias Stamatis <ilstam@amazon.com> Message-Id: <20210607105438.16541-1-ilstam@amazon.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
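A minimal userspace sketch of the idea, under stated assumptions: fake_vcpu, fake_write_tsc_multiplier, fake_set_tsc_scaling_ratio and fake_load_vmcs are hypothetical names, and a counter stands in for the VMWRITE. The point it illustrates is that the field is written when the ratio is set, not on every VMCS load.

```c
#include <assert.h>
#include <stdint.h>

struct fake_vcpu {
	uint64_t tsc_scaling_ratio;	/* current ratio, tracked in common code */
	uint64_t vmcs_tsc_multiplier;	/* stand-in for the VMCS TSC_MULTIPLIER field */
	int nr_vmwrites;		/* counts simulated VMWRITEs */
};

/* Stand-in for the vendor callback that writes the VMCS field. */
static void fake_write_tsc_multiplier(struct fake_vcpu *vcpu, uint64_t multiplier)
{
	assert(vcpu);
	vcpu->vmcs_tsc_multiplier = multiplier;
	vcpu->nr_vmwrites++;
}

/* Common-code path: update the tracked ratio and push it to hardware once. */
static void fake_set_tsc_scaling_ratio(struct fake_vcpu *vcpu, uint64_t ratio)
{
	vcpu->tsc_scaling_ratio = ratio;
	fake_write_tsc_multiplier(vcpu, ratio);
}

/* VMCS load path: no longer touches the TSC multiplier at all. */
static void fake_load_vmcs(struct fake_vcpu *vcpu)
{
	(void)vcpu;
}
```

Because the load path does no TSC work, a separate "last written" field for suppressing redundant writes has nothing left to guard in this sketch.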
2021-06-17 KVM: X86: Add functions for retrieving L2 TSC fields from common code Ilias Stamatis 1 -0/+3
In order to implement as much of the nested TSC scaling logic as possible in common code, we need these vendor callbacks for retrieving the TSC offset and the TSC multiplier that L1 has set for L2. Signed-off-by: Ilias Stamatis <ilstam@amazon.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20210526184418.28881-7-ilstam@amazon.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-05-07 KVM: VMX: Use flag to indicate "active" uret MSRs instead of sorting list Sean Christopherson 1 -1/+1
Explicitly flag a uret MSR as needing to be loaded into hardware instead of resorting the list of "active" MSRs and tracking how many MSRs in total need to be loaded. The only benefit to sorting the list is that the loop to load MSRs during vmx_prepare_switch_to_guest() doesn't need to iterate over all supported uret MSRs, only those that are active. But that is a pointless optimization, as the most common case, running a 64-bit guest, will load the vast majority of MSRs. Not to mention that a single WRMSR is far more expensive than iterating over the list. Providing a stable list order obviates the need to track a given MSR's "slot" in the per-CPU list of user return MSRs; all lists simply use the same ordering. Future patches will take advantage of the stable order to further simplify the related code. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210504171734.1434054-10-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
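The flag-based approach can be sketched in userspace as follows. This is an illustration under assumptions, not the kernel code: fake_uret_msr, fake_wrmsr, fake_load_active_uret_msrs and NR_FAKE_URET_MSRS are hypothetical names, and an array write stands in for WRMSR. Every slot is visited in a fixed order and unflagged slots are skipped, so no sorting or "number of active MSRs" bookkeeping is needed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NR_FAKE_URET_MSRS 4

struct fake_uret_msr {
	bool load_into_hardware;	/* set when this MSR must be loaded for the guest */
	uint64_t data;			/* guest value to load */
};

/* Simulated hardware state and a write counter, standing in for WRMSR. */
static uint64_t fake_hw[NR_FAKE_URET_MSRS];
static int fake_nr_writes;

static void fake_wrmsr(int slot, uint64_t val)
{
	assert(slot >= 0 && slot < NR_FAKE_URET_MSRS);
	fake_hw[slot] = val;
	fake_nr_writes++;
}

/* Iterate over all slots in their stable order; only flagged slots are written. */
static void fake_load_active_uret_msrs(const struct fake_uret_msr *msrs)
{
	for (int i = 0; i < NR_FAKE_URET_MSRS; i++) {
		if (!msrs[i].load_into_hardware)
			continue;
		fake_wrmsr(i, msrs[i].data);
	}
}
```

Since the slot index is the same in every list, a given MSR's position never has to be looked up or recomputed when its flag changes.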
2021-05-07 KVM: VMX: Configure list of user return MSRs at module init Sean Christopherson 1 -1/+9
Configure the list of user return MSRs that are actually supported at module init instead of reprobing the list of possible MSRs every time a vCPU is created. Curating the list on a per-vCPU basis is pointless; KVM is completely hosed if the set of supported MSRs changes after module init, or if the set of MSRs differs per physical CPU. The per-vCPU lists also increase complexity (see __vmx_find_uret_msr()) and create corner cases that _should_ be impossible, but theoretically exist in KVM, e.g. advertising RDTSCP to userspace without actually being able to virtualize RDTSCP if probing MSR_TSC_AUX fails. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210504171734.1434054-9-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
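As a rough illustration of probing once at init, the sketch below builds a single global list from a set of candidate MSRs and lets later lookups consult only that list. The names (fake_uret_msrs_list, fake_setup_uret_msrs, fake_find_uret_msr_slot, sample_probe_no_tsc_aux) are hypothetical, and the probe is a stub that pretends MSR_TSC_AUX (0xc0000103) is unsupported; real probing is not modeled here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_FAKE_URET_MSRS 8

/* One global list, filled once; shared by every vCPU in this sketch. */
static uint32_t fake_uret_msrs_list[MAX_FAKE_URET_MSRS];
static int fake_nr_uret_msrs;

typedef bool (*fake_probe_fn)(uint32_t msr);

/* Run once at "module init": keep only the candidates that probe successfully. */
static void fake_setup_uret_msrs(const uint32_t *candidates, int n, fake_probe_fn probe)
{
	fake_nr_uret_msrs = 0;
	for (int i = 0; i < n; i++) {
		if (!probe(candidates[i]))
			continue;
		assert(fake_nr_uret_msrs < MAX_FAKE_URET_MSRS);
		fake_uret_msrs_list[fake_nr_uret_msrs++] = candidates[i];
	}
}

/* Returns the slot of a supported MSR, or -1 if it was filtered out at init. */
static int fake_find_uret_msr_slot(uint32_t msr)
{
	for (int i = 0; i < fake_nr_uret_msrs; i++) {
		if (fake_uret_msrs_list[i] == msr)
			return i;
	}
	return -1;
}

/* Stub probe for illustration: pretend MSR_TSC_AUX cannot be accessed. */
static bool sample_probe_no_tsc_aux(uint32_t msr)
{
	return msr != 0xc0000103u;
}
```

A feature that depends on a filtered-out MSR can then be gated on the lookup result at init time, which is the kind of check that closes the RDTSCP corner case described above.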