path: root/arch/x86/kvm/vmx
Age | Commit message | Author | Files | Lines
2019-08-05 | KVM: Fix leak vCPU's VMCS value into other pCPU | Wanpeng Li | 1 | -0/+6
After commit d73eb57b80b (KVM: Boost vCPUs that are delivering interrupts), a five-year-old bug is exposed. Running the ebizzy benchmark in three 80-vCPU VMs on one 80-pCPU Skylake server produces a lot of rcu_sched stall warnings in the VMs after stress testing:

  INFO: rcu_sched detected stalls on CPUs/tasks: { 4 41 57 62 77} (detected by 15, t=60004 jiffies, g=899, c=898, q=15073)
  Call Trace:
   flush_tlb_mm_range+0x68/0x140
   tlb_flush_mmu.part.75+0x37/0xe0
   tlb_finish_mmu+0x55/0x60
   zap_page_range+0x142/0x190
   SyS_madvise+0x3cd/0x9c0
   system_call_fastpath+0x1c/0x21

swait_active() remains true before finish_swait() is called in kvm_vcpu_block(), so voluntarily preempted vCPUs are taken into account by the kvm_vcpu_on_spin() loop. This greatly increases the probability that the condition kvm_arch_vcpu_runnable(vcpu) is checked and found true; when APICv is enabled, the yield-candidate vCPU's VMCS RVI field then leaks (via vmx_sync_pir_to_irr()) into the current VMCS of the vCPU spinning on a taken lock. This patch fixes it by conservatively checking only a subset of events.
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Marc Zyngier <Marc.Zyngier@arm.com>
Cc: stable@vger.kernel.org
Fixes: 98f4a1467 (KVM: add kvm_arch_vcpu_runnable() test to kvm_vcpu_on_spin() loop)
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
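A minimal sketch of the idea, assuming a hypothetical helper that consults only state which is safe to read from another pCPU (the exact fields checked by the actual patch differ):

    /*
     * Illustrative sketch: decide whether a yield candidate is worth
     * boosting without calling kvm_arch_vcpu_runnable(), which can touch
     * the loaded VMCS (e.g. RVI via vmx_sync_pir_to_irr()) and thus leak
     * state across pCPUs. Only a conservative subset of events is checked.
     */
    static bool kvm_vcpu_dy_runnable(struct kvm_vcpu *vcpu)
    {
            if (READ_ONCE(vcpu->arch.pv.pv_unhalted))
                    return true;

            if (kvm_test_request(KVM_REQ_NMI, vcpu) ||
                kvm_test_request(KVM_REQ_EVENT, vcpu))
                    return true;

            return false;
    }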
2019-07-22 | KVM: nVMX: Set cached_vmcs12 and cached_shadow_vmcs12 NULL after free | Jan Kiszka | 1 | -0/+2
This should help find use-after-free bugs earlier. Suggested-by: Liran Alon <liran.alon@oracle.com> Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
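The change itself is tiny; a sketch of the pattern in the nested teardown path:

    kfree(vmx->nested.cached_vmcs12);
    vmx->nested.cached_vmcs12 = NULL;          /* fault fast on stale use */
    kfree(vmx->nested.cached_shadow_vmcs12);
    vmx->nested.cached_shadow_vmcs12 = NULL;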
2019-07-22 | KVM: X86: Dynamically allocate user_fpu | Wanpeng Li | 1 | -1/+12
After reverting commit 240c35a3783a (kvm: x86: Use task structs fpu field for user), struct kvm_vcpu is 19456 bytes on my server; PAGE_ALLOC_COSTLY_ORDER (3) is the order at which allocations are deemed costly to service. In serverless scenarios, one host can service hundreds or thousands of firecracker/kata-container instances; however, a new instance will fail to launch once host memory is too fragmented to allocate the kvm_vcpu struct. This was observed in some cloud providers' production environments. This patch dynamically allocates user_fpu; kvm_vcpu is now 15168 bytes on my Skylake server. Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Wanpeng Li <wanpengli@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
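A sketch of the allocation change, assuming user_fpu comes from the same x86_fpu_cache kmem cache used for guest_fpu (the error label is illustrative):

    vcpu->arch.user_fpu = kmem_cache_zalloc(x86_fpu_cache,
                                            GFP_KERNEL_ACCOUNT);
    if (!vcpu->arch.user_fpu) {
            /* kvm_vcpu itself stays small; only the FPU state is heap-allocated */
            err = -ENOMEM;
            goto free_user_fpu;     /* illustrative error-path label */
    }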
2019-07-22 | KVM: nVMX: Clear pending KVM_REQ_GET_VMCS12_PAGES when leaving nested | Jan Kiszka | 1 | -0/+2
Letting this request stay pending may cause nested_get_vmcs12_pages to run against an invalid state, corrupting the effective VMCS of L1. This was triggerable in QEMU after a guest corruption in L2, followed by an L1 reset. Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Reviewed-by: Liran Alon <liran.alon@oracle.com> Cc: stable@vger.kernel.org Fixes: 7f7f1ba33cf2 ("KVM: x86: do not load vmcs12 pages while still in SMM") Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
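The fix amounts to dropping the stale request on the nested teardown path; sketched (exact placement per the commit message):

    /* Don't let a stale request corrupt L1's VMCS after leaving nested. */
    kvm_clear_request(KVM_REQ_GET_VMCS12_PAGES, vcpu);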
2019-07-20 | KVM: nVMX: do not use dangling shadow VMCS after guest reset | Paolo Bonzini | 1 | -1/+7
If a KVM guest is reset while running a nested guest, free_nested will disable the shadow VMCS execution control in the vmcs01. However, on the next KVM_RUN vmx_vcpu_run would nevertheless try to sync the VMCS12 to the shadow VMCS, which has since been freed. This causes a vmptrld of a NULL pointer on my machine, but Jan reports the host hanging altogether. Let's see how much this trivial patch fixes. Reported-by: Jan Kiszka <jan.kiszka@siemens.com> Cc: Liran Alon <liran.alon@oracle.com> Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-07-20 | KVM: VMX: dump VMCS on failed entry | Paolo Bonzini | 1 | -0/+1
This is useful for debugging, and is ratelimited nowadays. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-07-20 | KVM: LAPIC: Inject timer interrupt via posted interrupt | Wanpeng Li | 1 | -1/+2
Dedicated instances are currently disturbed by unnecessary jitter because the emulated LAPIC timers fire on the same pCPUs where the vCPUs reside. Unlike ARM, Intel has no hardware virtual timer for the guest, so both programming the timer in the guest and the emulated timer firing incur vmexits. This patch tries to avoid the vmexit when the emulated timer fires, at least in the dedicated-instance scenario when nohz_full is enabled. In that case the emulated timers can be offloaded to the nearest busy housekeeping CPUs, since APICv has been present in server processors for several years. The guest timer interrupt can then be injected via posted interrupts, which are delivered by the housekeeping CPU once the emulated timer fires. The host should be tuned so that vCPUs are placed on isolated physical processors, with several spare pCPUs for busy housekeeping. If mwait/hlt/pause vmexits are disabled so that the vCPUs stay in non-root mode, a ~3% redis performance benefit can be observed on a Skylake server, and the number of external-interrupt vmexits drops substantially.

Without the patch:

  VM-EXIT             Samples  Samples%  Time%   Min Time  Max Time  Avg time
  EXTERNAL_INTERRUPT  42916    49.43%    39.30%  0.47us    106.09us  0.71us ( +- 1.09% )

With the patch:

  VM-EXIT             Samples  Samples%  Time%   Min Time  Max Time  Avg time
  EXTERNAL_INTERRUPT  6871     9.29%     2.96%   0.44us    57.88us   0.72us ( +- 4.02% )

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-07-17 | KVM: x86/vPMU: reset pmc->counter to 0 for pmu fixed_counters | Like Xu | 1 | -3/+8
To avoid semantic inconsistency, the fixed_counters in the Intel vPMU need to be reset to 0 in intel_pmu_reset(), as the gp_counters already are. Signed-off-by: Like Xu <like.xu@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-07-15 | KVM: nVMX: Ignore segment base for VMX memory operand when segment not FS or GS | Liran Alon | 1 | -1/+4
As reported by Maxime at https://bugzilla.kernel.org/show_bug.cgi?id=204175: in vmx/nested.c::get_vmx_mem_address(), when the guest runs in long mode, the base address of the memory operand is computed with a simple:

  *ret = s.base + off;

This is incorrect: the base applies only to FS and GS, not to the other segments. Because of that, if the guest uses a VMX instruction based on DS and has a non-zero DS.base, KVM wrongfully adds the base to the resulting address. Reported-by: Maxime Villard <max@m00nbsd.net> Reviewed-by: Joao Martins <joao.m.martins@oracle.com> Signed-off-by: Liran Alon <liran.alon@oracle.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
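A sketch of the corrected computation in long mode (control flow simplified for illustration):

    if (is_64_bit_mode(vcpu)) {
            /*
             * Segment bases are architecturally ignored in long mode,
             * except for FS and GS.
             */
            if (seg_reg == VCPU_SREG_FS || seg_reg == VCPU_SREG_GS)
                    *ret = s.base + off;
            else
                    *ret = off;
    }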
2019-07-15 | kvm: vmx: fix coccinelle warnings | Yi Wang | 1 | -1/+1
This fixes the following coccinelle warning:

  WARNING: return of 0/1 in function 'vmx_need_emulation_on_page_fault' with return type bool

Return false instead of 0. Signed-off-by: Yi Wang <wang.yi59@zte.com.cn> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
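The one-line change, for illustration:

    static bool vmx_need_emulation_on_page_fault(struct kvm_vcpu *vcpu)
    {
            return false;   /* was "return 0", flagged by coccinelle */
    }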
2019-07-13 | Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm | Linus Torvalds | 10 | -640/+873
Pull KVM updates from Paolo Bonzini:

 "ARM:
  - support for chained PMU counters in guests
  - improved SError handling
  - handle Neoverse N1 erratum #1349291
  - allow side-channel mitigation status to be migrated
  - standardise most AArch64 system register accesses to msr_s/mrs_s
  - fix host MPIDR corruption on 32bit
  - selftests cleanups

  x86:
  - PMU event {white,black}listing
  - ability for the guest to disable host-side interrupt polling
  - fixes for enlightened VMCS (Hyper-V pv nested virtualization)
  - new hypercall to yield to IPI target
  - support for passing cstate MSRs through to the guest
  - lots of cleanups and optimizations

  Generic:
  - some txt->rST conversions for the documentation"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (128 commits)
  Documentation: virtual: Add toctree hooks
  Documentation: kvm: Convert cpuid.txt to .rst
  Documentation: virtual: Convert paravirt_ops.txt to .rst
  KVM: x86: Unconditionally enable irqs in guest context
  KVM: x86: PMU Event Filter
  kvm: x86: Fix -Wmissing-prototypes warnings
  KVM: Properly check if "page" is valid in kvm_vcpu_unmap
  KVM: arm/arm64: Initialise host's MPIDRs by reading the actual register
  KVM: LAPIC: Retry tune per-vCPU timer_advance_ns if adaptive tuning goes insane
  kvm: LAPIC: write down valid APIC registers
  KVM: arm64: Migrate _elx sysreg accessors to msr_s/mrs_s
  KVM: doc: Add API documentation on the KVM_REG_ARM_WORKAROUNDS register
  KVM: arm/arm64: Add save/restore support for firmware workaround state
  arm64: KVM: Propagate full Spectre v2 workaround state to KVM guests
  KVM: arm/arm64: Support chained PMU counters
  KVM: arm/arm64: Remove pmc->bitmask
  KVM: arm/arm64: Re-create event when setting counter value
  KVM: arm/arm64: Extract duplicated code to own function
  KVM: arm/arm64: Rename kvm_pmu_{enable/disable}_counter functions
  KVM: LAPIC: ARBPRI is a reserved register for x2APIC
  ...
2019-07-11 | Merge tag 'kvm-arm-for-5.3' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD | Paolo Bonzini | 4 | -55/+63
KVM/arm updates for 5.3

- Add support for chained PMU counters in guests
- Improve SError handling
- Handle Neoverse N1 erratum #1349291
- Allow side-channel mitigation status to be migrated
- Standardise most AArch64 system register accesses to msr_s/mrs_s
- Fix host MPIDR corruption on 32bit
2019-07-05 | KVM nVMX: Check Host Segment Registers and Descriptor Tables on vmentry of nested guests | Krish Sadhukhan | 1 | -2/+24
According to the section "Checks on Host Segment and Descriptor-Table Registers" in Intel SDM vol 3C, the following checks are performed on vmentry of nested guests (a sketch of the selector checks follows this list):

- In the selector field for each of CS, SS, DS, ES, FS, GS and TR, the RPL (bits 1:0) and the TI flag (bit 2) must be 0.
- The selector fields for CS and TR cannot be 0000H.
- The selector field for SS cannot be 0000H if the "host address-space size" VM-exit control is 0.
- On processors that support Intel 64 architecture, the base-address fields for FS, GS and TR must contain canonical addresses.

Signed-off-by: Krish Sadhukhan <krish.sadhukhan@oracle.com> Reviewed-by: Karl Heubaum <karl.heubaum@oracle.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
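A sketch of the selector checks, using the existing segment mask macros (the helper name is illustrative, not the exact patch):

    /* RPL (bits 1:0) and TI (bit 2) must be 0 in host selectors. */
    static bool nested_host_selector_valid(u16 sel, bool cs_or_tr)
    {
            if (sel & (SEGMENT_RPL_MASK | SEGMENT_TI_MASK))
                    return false;
            if (cs_or_tr && sel == 0)
                    return false;   /* CS and TR cannot be 0000H */
            return true;
    }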
2019-07-05 | KVM: nVMX: Stash L1's CR3 in vmcs01.GUEST_CR3 on nested entry w/o EPT | Sean Christopherson | 1 | -21/+23
KVM does not have 100% coverage of VMX consistency checks, i.e. some checks that cause VM-Fail may only be detected by hardware during a nested VM-Entry. In such a case, KVM must restore L1's state to the pre-VM-Enter state, as L2's state has already been loaded into KVM's software model.

L1's CR3 and PDPTRs in particular are loaded from vmcs01.GUEST_*. But when EPT is disabled, the associated fields hold KVM's shadow values, not L1's "real" values. Fortunately, when EPT is disabled the PDPTRs come from memory, i.e. are not cached in the VMCS, which leaves CR3 as the sole anomaly.

A previously applied workaround was to force nested early checks if EPT is disabled: commit 2b27924bb1d48 ("KVM: nVMX: always use early vmcs check when EPT is disabled"). Forcing nested early checks is undesirable, as doing so adds hundreds of cycles to every nested VM-Entry. Rather than take this performance hit, handle CR3 by overwriting vmcs01.GUEST_CR3 with L1's CR3 during nested VM-Entry when EPT is disabled *and* nested early checks are disabled.

By stuffing vmcs01.GUEST_CR3, nested_vmx_restore_host_state() will naturally restore the correct vcpu->arch.cr3 from vmcs01.GUEST_CR3. These shenanigans work because nested_vmx_restore_host_state() does a full kvm_mmu_reset_context(), i.e. unloads the current MMU, which guarantees vmcs01.GUEST_CR3 will be rewritten with a new shadow CR3 prior to re-entering L1. vcpu->arch.root_mmu.root_hpa is set to INVALID_PAGE via:

  nested_vmx_restore_host_state() ->
    kvm_mmu_reset_context() ->
      kvm_mmu_unload() ->
        kvm_mmu_free_roots()

kvm_mmu_unload() has WARN_ON(root_hpa != INVALID_PAGE), i.e. we can bank on 'root_hpa == INVALID_PAGE' unless the implementation of kvm_mmu_reset_context() is changed.

On the way into L1, VMCS.GUEST_CR3 is guaranteed to be written (on a successful entry) via:

  vcpu_enter_guest() ->
    kvm_mmu_reload() ->
      kvm_mmu_load() ->
        kvm_mmu_load_cr3() ->
          vmx_set_cr3()

Stuff vmcs01.GUEST_CR3 if and only if nested early checks are disabled, as a "late" VM-Fail should never happen in that case (KVM WARNs), and the conditional write avoids the need to restore the correct GUEST_CR3 when nested_vmx_check_vmentry_hw() fails.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20190607185534.24368-1-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
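The core of the change is one conditional write when entering L2; a sketch:

    /*
     * vmcs01.GUEST_CR3 holds KVM's shadow CR3 when EPT is off, so stash
     * L1's real CR3 there for nested_vmx_restore_host_state() to find.
     */
    if (!enable_ept && !nested_early_check)
            vmcs_writel(GUEST_CR3, vcpu->arch.cr3);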
2019-07-02 | KVM: nVMX: Change KVM_STATE_NESTED_EVMCS to signal vmcs12 is copied from eVMCS | Liran Alon | 1 | -9/+16
Currently KVM_STATE_NESTED_EVMCS is used to signal that the eVMCS capability is enabled on the vCPU, as indicated by vmx->nested.enlightened_vmcs_enabled. This is quite bizarre, as the userspace VMM should make sure to expose the vCPU with the same CPUID values in both source and destination. If a vCPU is exposed with eVMCS support in CPUID, it is also expected to enable the KVM_CAP_HYPERV_ENLIGHTENED_VMCS capability. Therefore, KVM_STATE_NESTED_EVMCS is redundant.

KVM_STATE_NESTED_EVMCS is currently used on the restore path (vmx_set_nested_state()) only to enable the eVMCS capability in KVM and to signal need_vmcs12_sync, such that on the next VMEntry to the guest nested_sync_from_vmcs12() will be called to sync vmcs12 content into the eVMCS in guest memory. However, because restoring nested state is rare enough, we could have just modified vmx_set_nested_state() to always signal need_vmcs12_sync.

From all the above, it seems that we could have just removed the usage of KVM_STATE_NESTED_EVMCS. However, in order to preserve backwards migration compatibility, we cannot do that (vmx_get_nested_state() needs to signal the flag when migrating from a new kernel to an old kernel).

Returning KVM_STATE_NESTED_EVMCS when the vCPU merely has eVMCS enabled has the bad side effect of the userspace VMM having to send nested state from source to destination as part of the migration stream, even if the guest has never used eVMCS because it doesn't even run a nested hypervisor workload. This requires the destination userspace VMM and KVM to support setting nested state, which makes it more difficult to migrate from a new host to an older host.

To avoid this, change KVM_STATE_NESTED_EVMCS to signal that the eVMCS is not only enabled but also active, i.e. the guest has made some eVMCS active via an enlightened VMEntry; i.e. vmcs12 is copied from the eVMCS and therefore should be restored into the eVMCS resident in memory (by copy_vmcs12_to_enlightened()).

Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Maran Wilson <maran.wilson@oracle.com>
Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Signed-off-by: Liran Alon <liran.alon@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-07-02 | KVM: nVMX: Allow restore nested-state to enable eVMCS when vCPU in SMM | Liran Alon | 1 | -1/+4
As the comment in the code specifies, SMM temporarily disables VMX, so we cannot be in guest mode, nor can VMLAUNCH/VMRESUME be pending. However, the code currently assumes that these are the only flags that can be set on kvm_state->flags. This is not true, as KVM_STATE_NESTED_EVMCS can also be set on this field to signal that eVMCS should be enabled. Therefore, fix the code to check for guest mode and pending VMLAUNCH/VMRESUME explicitly. Reviewed-by: Joao Martins <joao.m.martins@oracle.com> Signed-off-by: Liran Alon <liran.alon@oracle.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-07-02 | kvm: nVMX: Remove unnecessary sync_roots from handle_invept | Jim Mattson | 1 | -5/+3
When L0 is executing handle_invept(), the TDP MMU is active. Emulating an L1 INVEPT does require synchronizing the appropriate shadow EPT root(s), but a call to kvm_mmu_sync_roots in this context won't do that. Similarly, the hardware TLB and paging-structure-cache entries associated with the appropriate shadow EPT root(s) must be flushed, but requesting a TLB_FLUSH from this context won't do that either. How did this ever work? KVM always does a sync_roots and TLB flush (in the correct context) when transitioning from L1 to L2. That isn't the best choice for nested VM performance, but it effectively papers over the mistakes here. Remove the unnecessary operations and leave a comment to try to do better in the future. Reported-by: Junaid Shahid <junaids@google.com> Fixes: bfd0a56b90005f ("nEPT: Nested INVEPT") Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Cc: Nadav Har'El <nyh@il.ibm.com> Cc: Jun Nakajima <jun.nakajima@intel.com> Cc: Xinhao Xu <xinhao.xu@intel.com> Cc: Yang Zhang <yang.z.zhang@Intel.com> Cc: Gleb Natapov <gleb@redhat.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: Peter Shier <pshier@google.com> Reviewed-by: Junaid Shahid <junaids@google.com> Signed-off-by: Jim Mattson <jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-07-02 | x86/kvm/nVMX: fix VMCLEAR when Enlightened VMCS is in use | Vitaly Kuznetsov | 3 | -14/+37
When Enlightened VMCS is in use, it is valid to do VMCLEAR and, according to the TLFS, this should "transition an enlightened VMCS from the active to the non-active state". It is, however, wrong to assume that it is only valid to do VMCLEAR for the eVMCS which is currently active on the vCPU performing the VMCLEAR. Currently, the logic in handle_vmclear() is broken: in case there is no active eVMCS on the vCPU doing VMCLEAR, we treat the argument as a 'normal' VMCS, and kvm_vcpu_write_guest() to the 'launch_state' field irreversibly corrupts the memory area. So, in case the VMCLEAR argument is not the currently active eVMCS on the vCPU, how can we know if the area it is pointing to is a normal or an enlightened VMCS? Thanks to the bug in Hyper-V (see commit 72aeb60c52bf7 ("KVM: nVMX: Verify eVMCS revision id match supported eVMCS version on eVMCS VMPTRLD")) we cannot: the revision can't be used to distinguish between them. So let's assume it is always enlightened in case enlightened vmentry is enabled in the assist page. Also, check vmx->nested.enlightened_vmcs_enabled to minimize the impact for 'unenlightened' workloads. Fixes: b8bbab928fb1 ("KVM: nVMX: implement enlightened VMPTRLD and VMCLEAR") Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-07-02 | x86/KVM/nVMX: don't use clean fields data on enlightened VMLAUNCH | Vitaly Kuznetsov | 1 | -8/+12
Apparently, Windows doesn't maintain clean fields data after it does VMCLEAR for an enlightened VMCS so we can only use it on VMRESUME. The issue went unnoticed because currently we do nested_release_evmcs() in handle_vmclear() and the consecutive enlightened VMPTRLD invalidates clean fields when a new eVMCS is mapped but we're going to change the logic. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-07-02 | KVM: nVMX: list VMX MSRs in KVM_GET_MSR_INDEX_LIST | Paolo Bonzini | 1 | -0/+2
This allows userspace to know which MSRs are supported by the hypervisor. Unfortunately userspace must resort to tricks for everything except MSR_IA32_VMX_VMFUNC (which was just added in the previous patch). One possibility is to use the feature control MSR, which is tied to nested VMX as well and is present on all KVM versions that support feature MSRs. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-07-02 | KVM: nVMX: allow setting the VMFUNC controls MSR | Paolo Bonzini | 1 | -0/+5
Allow userspace to set a custom value for the VMFUNC controls MSR, as long as the capabilities it advertises do not exceed those of the host. Fixes: 27c42a1bb ("KVM: nVMX: Enable VMFUNC for the L1 hypervisor", 2017-08-03) Reviewed-by: Liran Alon <liran.alon@oracle.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
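A sketch of the setter, following the usual pattern for restorable VMX capability MSRs of letting userspace only clear bits (the function name is illustrative):

    static int vmx_restore_vmx_vmfunc(struct vcpu_vmx *vmx, u64 data)
    {
            /* Userspace may not advertise more than the host supports. */
            if (data & ~vmx->nested.msrs.vmfunc_controls)
                    return -EINVAL;

            vmx->nested.msrs.vmfunc_controls = data;
            return 0;
    }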
2019-07-02 | KVM: nVMX: include conditional controls in /dev/kvm KVM_GET_MSRS | Paolo Bonzini | 1 | -1/+6
Some secondary controls are automatically enabled/disabled based on the CPUID values that are set for the guest. However, they are still available at a global level and therefore should be present when KVM_GET_MSRS is sent to /dev/kvm. Fixes: 1389309c811 ("KVM: nVMX: expose VMX capabilities for nested hypervisors to userspace", 2018-02-26) Reviewed-by: Liran Alon <liran.alon@oracle.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-06-21 | Merge tag 'spdx-5.2-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/spdx | Linus Torvalds | 2 | -8/+2
Pull still more SPDX updates from Greg KH:

 "Another round of SPDX updates for 5.2-rc6

  Here is what I am guessing is going to be the last "big" SPDX update for 5.2. It contains all of the remaining GPLv2 and GPLv2+ updates that were "easy" to determine by pattern matching. The ones after this are going to be a bit more difficult and the people on the spdx list will be discussing them on a case-by-case basis now.

  Another 5000+ files are fixed up, so our overall totals are:
    Files checked: 64545
    Files with SPDX: 45529

  Compared to the 5.1 kernel, which was:
    Files checked: 63848
    Files with SPDX: 22576

  This is a huge improvement.

  Also, we deleted another 20000 lines of boilerplate license crud, always nice to see in a diffstat"

* tag 'spdx-5.2-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/spdx: (65 commits)
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 507
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 506
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 505
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 504
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 503
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 502
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 501
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 500
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 499
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 498
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 497
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 496
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 495
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 491
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 490
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 489
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 488
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 487
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 486
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 485
  ...
2019-06-20 | KVM: nVMX: reorganize initial steps of vmx_set_nested_state | Paolo Bonzini | 1 | -11/+15
Commit 332d079735f5 ("KVM: nVMX: KVM_SET_NESTED_STATE - Tear down old EVMCS state before setting new state", 2019-05-02) broke evmcs_test because the eVMCS setup must be performed even if there is no VMXON region defined, as long as the eVMCS bit is set in the assist page. While the simplest possible fix would be to add a check on kvm_state->flags & KVM_STATE_NESTED_EVMCS in the initial "if" that covers kvm_state->hdr.vmx.vmxon_pa == -1ull, that is quite ugly. Instead, this patch moves checks earlier in the function and conditionalizes them on kvm_state->hdr.vmx.vmxon_pa, so that vmx_set_nested_state always goes through vmx_leave_nested and nested_enable_evmcs. Fixes: 332d079735f5 ("KVM: nVMX: KVM_SET_NESTED_STATE - Tear down old EVMCS state before setting new state") Cc: Aaron Lewis <aaronlewis@google.com> Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-06-20 | KVM: VMX: check CPUID before allowing read/write of IA32_XSS | Wanpeng Li | 1 | -2/+8
Raise #GP when the guest reads or writes IA32_XSS but the CPUID bits say that it shouldn't exist. Fixes: 203000993de5 (kvm: vmx: add MSR logic for XSAVES) Reported-by: Xiaoyao Li <xiaoyao.li@linux.intel.com> Reported-by: Tao Xu <tao3.xu@intel.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Wanpeng Li <wanpengli@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
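A sketch of the read-side guard (the exact CPUID bits consulted by the patch may differ; returning nonzero from the MSR handler makes KVM inject #GP):

    case MSR_IA32_XSS:
            if (!guest_cpuid_has(vcpu, X86_FEATURE_XSAVES))
                    return 1;       /* MSR doesn't exist per CPUID: #GP */
            msr_info->data = vcpu->arch.ia32_xss;
            break;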
2019-06-19 | treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 499 | Thomas Gleixner | 2 | -8/+2
Based on 1 normalized pattern(s):

  this work is licensed under the terms of the gnu gpl version 2 see the copying file in the top level directory

extracted by the scancode license scanner, the SPDX license identifier

  GPL-2.0-only

has been chosen to replace the boilerplate/reference in 35 file(s). Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org> Reviewed-by: Enrico Weigelt <info@metux.net> Reviewed-by: Allison Randal <allison@lohutok.net> Cc: linux-spdx@vger.kernel.org Link: https://lkml.kernel.org/r/20190604081206.797835076@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-19 | KVM: x86: Modify struct kvm_nested_state to have explicit fields for data | Liran Alon | 2 | -37/+47
Improve the KVM_{GET,SET}_NESTED_STATE structs by detailing the format of VMX nested state data in a struct. In order to avoid changing the ioctl values of KVM_{GET,SET}_NESTED_STATE, there is a need to preserve sizeof(struct kvm_nested_state). This is done by defining the data struct as "data.vmx[0]". It was the most elegant way I found to preserve struct size while still keeping the struct readable and easy to maintain. It does have the unfortunate side effect that the field now has to be accessed as "data.vmx[0]" rather than just "data.vmx". Because we are already modifying these structs, I also modified the following:
* Define the "format" field values as macros.
* Rename vmcs_pa to vmcs12_pa for better readability.
Signed-off-by: Liran Alon <liran.alon@oracle.com> [Remove SVM stubs, add KVM_STATE_NESTED_VMX_VMCS12_SIZE. - Paolo] Reviewed-by: Liran Alon <liran.alon@oracle.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-06-18 | KVM: nVMX: shadow pin based execution controls | Paolo Bonzini | 1 | -0/+1
The VMX_PREEMPTION_TIMER flag may be toggled frequently, though not *very* frequently. Since it does not affect KVM's dirty logic, e.g. the preemption timer value is loaded from vmcs12 even if vmcs12 is "clean", there is no need to mark vmcs12 dirty when L1 writes pin controls, and shadowing the field achieves that. Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-06-18 | KVM: VMX: Leave preemption timer running when it's disabled | Sean Christopherson | 3 | -25/+41
VMWRITEs to the major VMCS controls, pin controls included, are deceptively expensive. CPUs with VMCS caching (Westmere and later) also optimize away consistency checks on VM-Entry, i.e. skip consistency checks if the relevant fields have not changed since the last successful VM-Entry (of the cached VMCS). Because uops are a precious commodity, uCode's dirty VMCS field tracking isn't as precise as software would prefer. Notably, writing any of the major VMCS fields effectively marks the entire VMCS dirty, i.e. causes the next VM-Entry to perform all consistency checks, which consumes several hundred cycles.

As it pertains to KVM, toggling PIN_BASED_VMX_PREEMPTION_TIMER more than doubles the latency of the next VM-Entry (and again when/if the flag is toggled back). In a non-nested scenario, running a "standard" guest with the preemption timer enabled, toggling the timer flag is uncommon but not rare, e.g. roughly 1 in 10 entries. Disabling the preemption timer can change these numbers due to its use for "immediate exits", even when explicitly disabled by userspace. Nested virtualization in particular is painful, as the timer flag is set for the majority of VM-Enters, but prepare_vmcs02() initializes vmcs02's pin controls to *clear* the flag since the timer's final state isn't known until vmx_vcpu_run(). I.e. the majority of nested VM-Enters end up unnecessarily writing pin controls *twice*.

Rather than toggle the timer flag in pin controls, set the timer value itself to the largest allowed value to put it into a "soft disabled" state, and ignore any spurious preemption timer exits. Sadly, the timer is a 32-bit value and so theoretically it can fire before the heat death of the universe, i.e. spurious exits are possible. But because KVM does *not* save the timer value on VM-Exit and because the timer runs at a slower rate than the TSC, the maximum timer value is still sufficiently large for KVM's purposes. E.g. on a modern CPU with a timer that runs at 1/32 the frequency of a 2.4GHz constant-rate TSC, the timer will fire after ~55 seconds of *uninterrupted* guest execution. In other words, spurious VM-Exits are effectively only possible if the host is completely tickless on the logical CPU, the guest is not using the preemption timer, and the guest is not generating VM-Exits for any other reason.

To be safe from bad/weird hardware, disable the preemption timer if its maximum delay is less than ten seconds. Ten seconds is mostly arbitrary and was selected in no small part because it's a nice round number.

For simplicity and paranoia, fall back to __kvm_request_immediate_exit() if the preemption timer is disabled by KVM or userspace. Previously KVM continued to use the preemption timer to force immediate exits even when the timer was disabled by userspace. Now that KVM leaves the timer running instead of truly disabling it, allow userspace to kill it entirely in the unlikely event the timer (or KVM) malfunctions.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
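A sketch of the "soft disabled" state (VMX_PREEMPTION_TIMER_VALUE is the existing VMCS field; the surrounding logic is simplified):

    /*
     * Arm the timer with the max 32-bit value instead of toggling
     * PIN_BASED_VMX_PREEMPTION_TIMER in pin controls; a spurious exit
     * after ~55s of uninterrupted guest time is simply ignored.
     */
    vmcs_write32(VMX_PREEMPTION_TIMER_VALUE, -1u);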
2019-06-18 | KVM: VMX: Drop hv_timer_armed from 'struct loaded_vmcs' | Sean Christopherson | 3 | -8/+2
... now that it is fully redundant with the pin controls shadow. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-06-18 | KVM: nVMX: Preset *DT exiting in vmcs02 when emulating UMIP | Sean Christopherson | 1 | -0/+8
KVM dynamically toggles SECONDARY_EXEC_DESC to intercept (a subset of) instructions that are subject to User-Mode Instruction Prevention, i.e. VMCS.SECONDARY_EXEC_DESC == CR4.UMIP when emulating UMIP. Preset the VMCS control when preparing vmcs02 to avoid unnecessary VMWRITEs, e.g. KVM will clear VMCS.SECONDARY_EXEC_DESC in prepare_vmcs02_early() and then set it in vmx_set_cr4(). Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-06-18 | KVM: nVMX: Preserve last USE_MSR_BITMAPS when preparing vmcs02 | Sean Christopherson | 1 | -1/+11
KVM dynamically toggles the CPU_BASED_USE_MSR_BITMAPS execution control for nested guests based on whether or not both L0 and L1 want to pass through the same MSRs to L2. Preserve the last used value from vmcs02 so as to avoid multiple VMWRITEs to (re)set/(re)clear the bit on nested VM-Entry. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-06-18 | KVM: VMX: Explicitly initialize controls shadow at VMCS allocation | Sean Christopherson | 3 | -15/+12
Or: Don't re-initialize vmcs02's controls on every nested VM-Entry. VMWRITEs to the major VMCS controls are deceptively expensive. Intel CPUs with VMCS caching (Westmere and later) also optimize away consistency checks on VM-Entry, i.e. skip consistency checks if the relevant fields have not changed since the last successful VM-Entry (of the cached VMCS). Because uops are a precious commodity, uCode's dirty VMCS field tracking isn't as precise as software would prefer. Notably, writing any of the major VMCS fields effectively marks the entire VMCS dirty, i.e. causes the next VM-Entry to perform all consistency checks, which consumes several hundred cycles. Zero out the controls' shadow copies during VMCS allocation and use the optimized setter when "initializing" controls. While this technically affects both non-nested and nested virtualization, nested virtualization is the primary beneficiary, as avoiding VMWRITEs when preparing vmcs02 allows hardware to optimize away consistency checks. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-06-18 | KVM: nVMX: Don't reset VMCS controls shadow on VMCS switch | Sean Christopherson | 2 | -9/+0
... now that the shadow copies are per-VMCS. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-06-18 | KVM: nVMX: Shadow VMCS controls on a per-VMCS basis | Sean Christopherson | 2 | -15/+16
... to pave the way for not preserving the shadow copies across switches between vmcs01 and vmcs02, and eventually to avoid VMWRITEs to vmcs02 when the desired value is unchanged across nested VM-Enters. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-06-18 | KVM: VMX: Shadow VMCS secondary execution controls | Sean Christopherson | 3 | -26/+28
Prepare to shadow all major control fields on a per-VMCS basis, which allows KVM to avoid costly VMWRITEs when switching between vmcs01 and vmcs02. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-06-18 | KVM: VMX: Shadow VMCS primary execution controls | Sean Christopherson | 3 | -31/+23
Prepare to shadow all major control fields on a per-VMCS basis, which allows KVM to avoid VMREADs when switching between vmcs01 and vmcs02, and more importantly can eliminate costly VMWRITEs to controls when preparing vmcs02. Shadowing exec controls also saves a VMREAD when opening virtual INTR/NMI windows, yay... Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-06-18 | KVM: VMX: Shadow VMCS pin controls | Sean Christopherson | 3 | -7/+8
Prepare to shadow all major control fields on a per-VMCS basis, which allows KVM to avoid costly VMWRITEs when switching between vmcs01 and vmcs02. Shadowing pin controls also allows a future patch to remove the per-VMCS 'hv_timer_armed' flag, as the shadow copy is a superset of said flag. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-06-18 | KVM: VMX: Add builder macros for shadowing controls | Sean Christopherson | 1 | -64/+36
... to pave the way for shadowing all (five) major VMCS control fields without massive amounts of error prone copy+paste+modify. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
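The shape of the builder macro, sketched and simplified (the real helpers also provide set-bit/clear-bit variants):

    #define BUILD_CONTROLS_SHADOW(lname, uname)                               \
    static inline void lname##_controls_set(struct vcpu_vmx *vmx, u32 val)   \
    {                                                                         \
            /* Skip the VMWRITE if the shadow already holds the value. */    \
            if (vmx->loaded_vmcs->controls_shadow.lname != val) {             \
                    vmcs_write32(uname, val);                                 \
                    vmx->loaded_vmcs->controls_shadow.lname = val;            \
            }                                                                 \
    }                                                                         \
    static inline u32 lname##_controls_get(struct vcpu_vmx *vmx)              \
    {                                                                         \
            return vmx->loaded_vmcs->controls_shadow.lname;                   \
    }

    BUILD_CONTROLS_SHADOW(pin, PIN_BASED_VM_EXEC_CONTROL)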
2019-06-18 | KVM: nVMX: Use adjusted pin controls for vmcs02 | Sean Christopherson | 3 | -4/+4
KVM provides a module parameter to allow disabling virtual NMI support to simplify testing (hardware *without* virtual NMI support is hard to come by, but it does have users). When preparing vmcs02, use the accessor for pin controls to ensure that the module param is respected for nested guests. Opportunistically swap the order of applying L0's and L1's pin controls to better align with other controls and to prepare for a future patch that will ignore L1's, but not L0's, preemption timer flag. Fixes: d02fcf50779ec ("kvm: vmx: Allow disabling virtual NMI support") Cc: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-06-18 | KVM: nVMX: Copy PDPTRs to/from vmcs12 only when necessary | Sean Christopherson | 1 | -5/+22
Per Intel's SDM:

  ... the logical processor uses PAE paging if CR0.PG=1, CR4.PAE=1 and IA32_EFER.LME=0. A VM entry to a guest that uses PAE paging loads the PDPTEs into internal, non-architectural registers based on the setting of the "enable EPT" VM-execution control.

and:

  [GUEST_PDPTR] values are saved into the four PDPTE fields as follows:
  - If the "enable EPT" VM-execution control is 0 or the logical processor was not using PAE paging at the time of the VM exit, the values saved are undefined.

In other words, if EPT is disabled or the guest isn't using PAE paging, then the PDPTRs aren't consumed by hardware on VM-Entry and are loaded with junk on VM-Exit. From a nesting perspective, all of the above hold true, i.e. KVM can effectively ignore the VMCS PDPTRs. E.g. KVM already loads the PDPTRs from memory when nested EPT is disabled (see nested_vmx_load_cr3()).

Because KVM intercepts setting CR4.PAE, there is no danger of consuming a stale value or crushing L1's VMWRITEs regardless of whether L1 intercepts CR4.PAE. The vmcs12's values are unchanged up until the VM-Exit where L2 sets CR4.PAE, i.e. L0 will see the new PAE state on the subsequent VM-Entry and propagate the PDPTRs from vmcs12 to vmcs02.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-06-18 | KVM: x86: introduce is_pae_paging | Paolo Bonzini | 2 | -4/+3
Checking for 32-bit PAE is quite common around code that fiddles with the PDPTRs. Add a function to compress all checks into a single invocation. Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
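The helper follows directly from the definition in the text (32-bit PAE paging means not in long mode, PAE enabled, and paging enabled); a sketch:

    static inline bool is_pae_paging(struct kvm_vcpu *vcpu)
    {
            return !is_long_mode(vcpu) && is_pae(vcpu) && is_paging(vcpu);
    }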
2019-06-18 | KVM: nVMX: Don't update GUEST_BNDCFGS if it's clean in HV eVMCS | Sean Christopherson | 1 | -4/+4
L1 is responsible for dirtying GUEST_GRP1 if it writes GUEST_BNDCFGS. Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-06-18 | KVM: nVMX: Update vmcs12 for MSR_IA32_DEBUGCTLMSR when it's written | Sean Christopherson | 2 | -3/+9
KVM unconditionally intercepts WRMSR to MSR_IA32_DEBUGCTLMSR. In the unlikely event that L1 allows L2 to write L1's MSR_IA32_DEBUGCTLMSR but saves L2's value on VM-Exit, update vmcs12 during L2's WRMSR so as to eliminate the need to VMREAD the value from vmcs02 on nested VM-Exit. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-06-18 | KVM: nVMX: Update vmcs12 for SYSENTER MSRs when they're written | Sean Christopherson | 2 | -3/+10
For L2, KVM always intercepts WRMSR to SYSENTER MSRs. Update vmcs12 in the WRMSR handler so that they don't need to be (re)read from vmcs02 on every nested VM-Exit. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
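A sketch of the WRMSR-side update for one of the SYSENTER MSRs (field names per vmcs12; surrounding handler structure simplified):

    case MSR_IA32_SYSENTER_CS:
            /* Keep vmcs12 current so nested VM-Exit needn't VMREAD vmcs02. */
            if (is_guest_mode(vcpu))
                    get_vmcs12(vcpu)->guest_sysenter_cs = data;
            vmcs_write32(GUEST_SYSENTER_CS, data);
            break;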
2019-06-18 | KVM: nVMX: Update vmcs12 for MSR_IA32_CR_PAT when it's written | Sean Christopherson | 2 | -4/+4
As alluded to by the TODO comment, KVM unconditionally intercepts writes to the PAT MSR. In the unlikely event that L1 allows L2 to write L1's PAT directly but saves L2's PAT on VM-Exit, update vmcs12 when L2 writes the PAT. This eliminates the need to VMREAD the value from vmcs02 on VM-Exit as vmcs12 is already up to date in all situations. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-06-18 | KVM: nVMX: Don't speculatively write APIC-access page address | Sean Christopherson | 1 | -8/+0
If nested_get_vmcs12_pages() fails to map L1's APIC_ACCESS_ADDR into L2, then it disables SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES in vmcs02. In other words, the APIC_ACCESS_ADDR in vmcs02 is guaranteed to be written with the correct value before being consumed by hardware; drop the unnecessary VMWRITE. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-06-18 | KVM: nVMX: Don't speculatively write virtual-APIC page address | Sean Christopherson | 1 | -13/+8
The VIRTUAL_APIC_PAGE_ADDR in vmcs02 is guaranteed to be updated before it is consumed by hardware, either in nested_vmx_enter_non_root_mode() or via the KVM_REQ_GET_VMCS12_PAGES callback. Avoid an extra VMWRITE and only stuff a bad value into vmcs02 when mapping vmcs12's address fails. This also eliminates the need for extra comments to connect the dots between prepare_vmcs02_early() and nested_get_vmcs12_pages(). Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-06-18 | KVM: nVMX: Don't dump VMCS if virtual APIC page can't be mapped | Sean Christopherson | 1 | -3/+0
... as a malicious userspace can run a toy guest to generate invalid virtual-APIC page addresses in L1, i.e. flood the kernel log with error messages. Fixes: 690908104e39d ("KVM: nVMX: allow tests to use bad virtual-APIC page address") Cc: stable@vger.kernel.org Cc: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-06-18 | KVM: nVMX: Don't reread VMCS-agnostic state when switching VMCS | Sean Christopherson | 3 | -6/+15
When switching between vmcs01 and vmcs02, there is no need to update state tracking for values that aren't tied to any particular VMCS as the per-vCPU values are already up-to-date (vmx_switch_vmcs() can only be called when the vCPU is loaded). Avoiding the update eliminates a RDMSR, and potentially a RDPKRU and posted-interrupt update (cmpxchg64() and more). Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>