summaryrefslogtreecommitdiff
path: root/arch/x86/kvm
AgeCommit message (Collapse)AuthorFilesLines
2020-09-28KVM: SVM: use __GFP_ZERO instead of clear_pageMaxim Levitsky1-4/+2
Another small refactoring. Suggested-by: Jim Mattson <jmattson@google.com> Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20200827171145.374620-5-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-09-28KVM: SVM: refactor msr permission bitmap allocationMaxim Levitsky1-22/+23
Replace svm_vcpu_init_msrpm with svm_vcpu_alloc_msrpm, that also allocates the msr bitmap and add svm_vcpu_free_msrpm to free it. This will be used later to move the nested msr permission bitmap allocation to nested.c Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20200827171145.374620-4-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-09-28KVM: nSVM: rename nested vmcb to vmcb12Maxim Levitsky3-119/+117
This is to be more consistient with VMX, and to support upcoming addition of vmcb02 Hopefully no functional changes. Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20200827171145.374620-3-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-09-28KVM: SVM: rename a variable in the svm_create_vcpuMaxim Levitsky1-6/+6
The 'page' is to hold the vcpu's vmcb so name it as such to avoid confusion. Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Reviewed-by: Jim Mattson <jmattson@google.com> Message-Id: <20200827171145.374620-2-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-09-28KVM: LAPIC: Reduce world switch latency caused by timer_advance_nsWanpeng Li3-12/+7
All the checks in lapic_timer_int_injected(), __kvm_wait_lapic_expire(), and these function calls waste cpu cycles when the timer mode is not tscdeadline. We can observe ~1.3% world switch time overhead by kvm-unit-tests/vmexit.flat vmcall testing on AMD server. This patch reduces the world switch latency caused by timer_advance_ns feature when the timer mode is not tscdeadline by simpling move the check against apic->lapic_timer.expired_tscdeadline much earlier. Signed-off-by: Wanpeng Li <wanpengli@tencent.com> Message-Id: <1599731444-3525-7-git-send-email-wanpengli@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-09-28KVM: LAPIC: Narrow down the kick target vCPUWanpeng Li3-8/+3
The kick after setting KVM_REQ_PENDING_TIMER is used to handle the timer fires on a different pCPU which vCPU is running on. This kick costs about 1000 clock cycles and we don't need this when injecting already-expired timer or when using the VMX preemption timer because kvm_lapic_expired_hv_timer() is called from the target vCPU. Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Wanpeng Li <wanpengli@tencent.com> Message-Id: <1599731444-3525-6-git-send-email-wanpengli@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-09-28KVM: LAPIC: Guarantee the timer is in tsc-deadline mode when settingWanpeng Li1-2/+1
Check apic_lvtt_tscdeadline() mode directly instead of apic_lvtt_oneshot() and apic_lvtt_period() to guarantee the timer is in tsc-deadline mode when wrmsr MSR_IA32_TSCDEADLINE. Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Wanpeng Li <wanpengli@tencent.com> Message-Id: <1599731444-3525-3-git-send-email-wanpengli@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-09-28KVM: LAPIC: Return 0 when getting the tscdeadline timer if the lapic is hw ↵Wanpeng Li1-2/+1
disabled Return 0 when getting the tscdeadline timer if the lapic is hw disabled. Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Wanpeng Li <wanpengli@tencent.com> Message-Id: <1599731444-3525-2-git-send-email-wanpengli@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-09-28KVM: LAPIC: Fix updating DFR missing apic map recalculationWanpeng Li1-5/+10
There is missing apic map recalculation after updating DFR, if it is INIT RESET, in x2apic mode, local apic is software enabled before. This patch fix it by introducing the function kvm_apic_set_dfr() to be called in INIT RESET handling path. Signed-off-by: Wanpeng Li <wanpengli@tencent.com> Message-Id: <1597827327-25055-1-git-send-email-wanpengli@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-09-28KVM: nVMX: Simplify the initialization of nested_vmx_msrsChenyi Qiang1-2/+1
The nested VMX controls MSRs can be initialized by the global capability values stored in vmcs_config. Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Message-Id: <20200828085622.8365-6-chenyi.qiang@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-09-28KVM: nVMX: Fix VMX controls MSRs setup when nested VMX enabledChenyi Qiang1-2/+4
KVM supports the nested VM_{EXIT, ENTRY}_LOAD_IA32_PERF_GLOBAL_CTRL and VM_{ENTRY_LOAD, EXIT_CLEAR}_BNDCFGS, but they are not exposed by the system ioctl KVM_GET_MSR. Add them to the setup of nested VMX controls MSR. Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com> Message-Id: <20200828085622.8365-2-chenyi.qiang@intel.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-09-28KVM: nSVM: Avoid freeing uninitialized pointers in svm_set_nested_state()Vitaly Kuznetsov1-5/+3
The save and ctl pointers are passed uninitialized to kfree() when svm_set_nested_state() follows the 'goto out_set_gif' path. While the issue could've been fixed by initializing these on-stack varialbles to NULL, it seems preferable to eliminate 'out_set_gif' label completely as it is not actually a failure path and duplicating a single svm_set_gif() call doesn't look too bad. [ bp: Drop obscure Addresses-Coverity: tag. ] Fixes: 6ccbd29ade0d ("KVM: SVM: nested: Don't allocate VMCB structures on stack") Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Reported-by: Joerg Roedel <jroedel@suse.de> Reported-by: Colin King <colin.king@canonical.com> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com> Acked-by: Joerg Roedel <jroedel@suse.de> Tested-by: Tom Lendacky <thomas.lendacky@amd.com> Link: https://lkml.kernel.org/r/20200914133725.650221-1-vkuznets@redhat.com Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-09-27x86/hyperv: Remove aliases with X64 in their nameJoseph Salisbury1-10/+10
In the architecture independent version of hyperv-tlfs.h, commit c55a844f46f958b removed the "X64" in the symbol names so they would make sense for both x86 and ARM64. That commit added aliases with the "X64" in the x86 version of hyperv-tlfs.h so that existing x86 code would continue to compile. As a cleanup, update the x86 code to use the symbols without the "X64", then remove the aliases. There's no functional change. Signed-off-by: Joseph Salisbury <joseph.salisbury@microsoft.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Michael Kelley <mikelley@microsoft.com> Acked-by: Paolo Bonzini <pbonzini@redhat.com> Link: https://lore.kernel.org/r/1601130386-11111-1-git-send-email-jsalisbury@linux.microsoft.com
2020-09-26Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds4-11/+39
Pull more kvm fixes from Paolo Bonzini: "Five small fixes. The nested migration bug will be fixed with a better API in 5.10 or 5.11, for now this is a fix that works with existing userspace but keeps the current ugly API" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: KVM: SVM: Add a dedicated INVD intercept routine KVM: x86: Reset MMU context if guest toggles CR4.SMAP or CR4.PKE KVM: x86: fix MSR_IA32_TSC read for nested migration selftests: kvm: Fix assert failure in single-step test KVM: x86: VMX: Make smaller physical guest address space support user-configurable
2020-09-25KVM: SVM: Add a dedicated INVD intercept routineTom Lendacky1-1/+7
The INVD instruction intercept performs emulation. Emulation can't be done on an SEV guest because the guest memory is encrypted. Provide a dedicated intercept routine for the INVD intercept. And since the instruction is emulated as a NOP, just skip it instead. Fixes: 1654efcbc431 ("KVM: SVM: Add KVM_SEV_INIT command") Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com> Message-Id: <a0b9a19ffa7fef86a3cc700c7ea01cb2731e04e5.1600972918.git.thomas.lendacky@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-09-25KVM: x86: Reset MMU context if guest toggles CR4.SMAP or CR4.PKESean Christopherson1-1/+2
Reset the MMU context during kvm_set_cr4() if SMAP or PKE is toggled. Recent commits to (correctly) not reload PDPTRs when SMAP/PKE are toggled inadvertantly skipped the MMU context reset due to the mask of bits that triggers PDPTR loads also being used to trigger MMU context resets. Fixes: 427890aff855 ("kvm: x86: Toggling CR4.SMAP does not load PDPTEs in PAE mode") Fixes: cb957adb4ea4 ("kvm: x86: Toggling CR4.PKE does not load PDPTEs in PAE mode") Cc: Jim Mattson <jmattson@google.com> Cc: Peter Shier <pshier@google.com> Cc: Oliver Upton <oupton@google.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200923215352.17756-1-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-09-24KVM: x86: fix MSR_IA32_TSC read for nested migrationMaxim Levitsky1-2/+15
MSR reads/writes should always access the L1 state, since the (nested) hypervisor should intercept all the msrs it wants to adjust, and these that it doesn't should be read by the guest as if the host had read it. However IA32_TSC is an exception. Even when not intercepted, guest still reads the value + TSC offset. The write however does not take any TSC offset into account. This is documented in Intel's SDM and seems also to happen on AMD as well. This creates a problem when userspace wants to read the IA32_TSC value and then write it. (e.g for migration) In this case it reads L2 value but write is interpreted as an L1 value. To fix this make the userspace initiated reads of IA32_TSC return L1 value as well. Huge thanks to Dave Gilbert for helping me understand this very confusing semantic of MSR writes. Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20200921103805.9102-2-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-09-23KVM: x86: VMX: Make smaller physical guest address space support ↵Mohammed Gamal3-7/+15
user-configurable This patch exposes allow_smaller_maxphyaddr to the user as a module parameter. Since smaller physical address spaces are only supported on VMX, the parameter is only exposed in the kvm_intel module. For now disable support by default, and let the user decide if they want to enable it. Modifications to VMX page fault and EPT violation handling will depend on whether that parameter is enabled. Signed-off-by: Mohammed Gamal <mgamal@redhat.com> Message-Id: <20200903141122.72908-1-mgamal@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-09-22Merge branch 'x86-seves-for-paolo' of ↵Paolo Bonzini10-36/+54
https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip into HEAD
2020-09-19KVM: SVM: Don't flush cache if hardware enforces cache coherency across ↵Krish Sadhukhan1-1/+2
encryption domains In some hardware implementations, coherency between the encrypted and unencrypted mappings of the same physical page in a VM is enforced. In such a system, it is not required for software to flush the VM's page from all CPU caches in the system prior to changing the value of the C-bit for the page. So check that bit before flushing the cache. Signed-off-by: Krish Sadhukhan <krish.sadhukhan@oracle.com> Signed-off-by: Borislav Petkov <bp@suse.de> Acked-by: Paolo Bonzini <pbonzini@redhat.com> Link: https://lkml.kernel.org/r/20200917212038.5090-4-krish.sadhukhan@oracle.com
2020-09-13Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds9-26/+63
Pull kvm fixes from Paolo Bonzini: "A bit on the bigger side, mostly due to me being on vacation, then busy, then on parental leave, but there's nothing worrisome. ARM: - Multiple stolen time fixes, with a new capability to match x86 - Fix for hugetlbfs mappings when PUD and PMD are the same level - Fix for hugetlbfs mappings when PTE mappings are enforced (dirty logging, for example) - Fix tracing output of 64bit values x86: - nSVM state restore fixes - Async page fault fixes - Lots of small fixes everywhere" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (25 commits) KVM: emulator: more strict rsm checks. KVM: nSVM: more strict SMM checks when returning to nested guest SVM: nSVM: setup nested msr permission bitmap on nested state load SVM: nSVM: correctly restore GIF on vmexit from nesting after migration x86/kvm: don't forget to ACK async PF IRQ x86/kvm: properly use DEFINE_IDTENTRY_SYSVEC() macro KVM: VMX: Don't freeze guest when event delivery causes an APIC-access exit KVM: SVM: avoid emulation with stale next_rip KVM: x86: always allow writing '0' to MSR_KVM_ASYNC_PF_EN KVM: SVM: Periodically schedule when unregistering regions on destroy KVM: MIPS: Change the definition of kvm type kvm x86/mmu: use KVM_REQ_MMU_SYNC to sync when needed KVM: nVMX: Fix the update value of nested load IA32_PERF_GLOBAL_CTRL control KVM: fix memory leak in kvm_io_bus_unregister_dev() KVM: Check the allocation of pv cpu mask KVM: nVMX: Update VMCS02 when L2 PAE PDPTE updates detected KVM: arm64: Update page shift if stage 2 block mapping not supported KVM: arm64: Fix address truncation in traces KVM: arm64: Do not try to map PUDs when they are folded into PMD arm64/x86: KVM: Introduce steal-time cap ...
2020-09-12KVM: emulator: more strict rsm checks.Maxim Levitsky1-5/+17
Don't ignore return values in rsm_load_state_64/32 to avoid loading invalid state from SMM state area if it was tampered with by the guest. This is primarly intended to avoid letting guest set bits in EFER (like EFER.SVME when nesting is disabled) by manipulating SMM save area. Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20200827171145.374620-8-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-09-12KVM: nSVM: more strict SMM checks when returning to nested guestMaxim Levitsky1-11/+18
* check that guest is 64 bit guest, otherwise the SVM related fields in the smm state area are not defined * If the SMM area indicates that SMM interrupted a running guest, check that EFER.SVME which is also saved in this area is set, otherwise the guest might have tampered with SMM save area, and so indicate emulation failure which should triple fault the guest. * Check that that guest CPUID supports SVM (due to the same issue as above) Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20200827162720.278690-4-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-09-12SVM: nSVM: setup nested msr permission bitmap on nested state loadMaxim Levitsky1-0/+3
This code was missing and was forcing the L2 run with L1's msr permission bitmap Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20200827162720.278690-3-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-09-12SVM: nSVM: correctly restore GIF on vmexit from nesting after migrationMaxim Levitsky1-1/+3
Currently code in svm_set_nested_state copies the current vmcb control area to L1 control area (hsave->control), under assumption that it mostly reflects the defaults that kvm choose, and later qemu overrides these defaults with L2 state using standard KVM interfaces, like KVM_SET_REGS. However nested GIF (which is AMD specific thing) is by default is true, and it is copied to hsave area as such. This alone is not a big deal since on VMexit, GIF is always set to false, regardless of what it was on VM entry. However in nested_svm_vmexit we were first were setting GIF to false, but then we overwrite the control fields with value from the hsave area. (including the nested GIF field itself if GIF virtualization is enabled). Now on normal vm entry this is not a problem, since GIF is usually false prior to normal vm entry, and this is the value that copied to hsave, and then restored, but this is not always the case when the nested state is loaded as explained above. To fix this issue, move svm_set_gif after we restore the L1 control state in nested_svm_vmexit, so that even with wrong GIF in the saved L1 control area, we still clear GIF as the spec says. Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20200827162720.278690-2-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-09-12KVM: VMX: Don't freeze guest when event delivery causes an APIC-access exitWanpeng Li1-0/+1
According to SDM 27.2.4, Event delivery causes an APIC-access VM exit. Don't report internal error and freeze guest when event delivery causes an APIC-access exit, it is handleable and the event will be re-injected during the next vmentry. Signed-off-by: Wanpeng Li <wanpengli@tencent.com> Message-Id: <1597827327-25055-2-git-send-email-wanpengli@tencent.com> Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-09-12KVM: SVM: avoid emulation with stale next_ripWanpeng Li1-4/+3
svm->next_rip is reset in svm_vcpu_run() only after calling svm_exit_handlers_fastpath(), which will cause SVM's skip_emulated_instruction() to write a stale RIP. We can move svm_exit_handlers_fastpath towards the end of svm_vcpu_run(). To align VMX with SVM, keep svm_complete_interrupts() close as well. Suggested-by: Sean Christopherson <sean.j.christopherson@intel.com> Cc: Paul K. <kronenpj@kronenpj.dyndns.org> Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Wanpeng Li <wanpengli@tencent.com> [Also move vmcb_mark_all_clean before any possible write to the VMCB. - Paolo]
2020-09-11KVM: x86: always allow writing '0' to MSR_KVM_ASYNC_PF_ENVitaly Kuznetsov1-1/+1
Even without in-kernel LAPIC we should allow writing '0' to MSR_KVM_ASYNC_PF_EN as we're not enabling the mechanism. In particular, QEMU with 'kernel-irqchip=off' fails to start a guest with qemu-system-x86_64: error: failed to set MSR 0x4b564d02 to 0x0 Fixes: 9d3c447c72fb2 ("KVM: X86: Fix async pf caused null-ptr-deref") Reported-by: Dr. David Alan Gilbert <dgilbert@redhat.com> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20200911093147.484565-1-vkuznets@redhat.com> [Actually commit the version proposed by Sean Christopherson. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-09-11KVM: SVM: Periodically schedule when unregistering regions on destroyDavid Rientjes1-0/+1
There may be many encrypted regions that need to be unregistered when a SEV VM is destroyed. This can lead to soft lockups. For example, on a host running 4.15: watchdog: BUG: soft lockup - CPU#206 stuck for 11s! [t_virtual_machi:194348] CPU: 206 PID: 194348 Comm: t_virtual_machi RIP: 0010:free_unref_page_list+0x105/0x170 ... Call Trace: [<0>] release_pages+0x159/0x3d0 [<0>] sev_unpin_memory+0x2c/0x50 [kvm_amd] [<0>] __unregister_enc_region_locked+0x2f/0x70 [kvm_amd] [<0>] svm_vm_destroy+0xa9/0x200 [kvm_amd] [<0>] kvm_arch_destroy_vm+0x47/0x200 [<0>] kvm_put_kvm+0x1a8/0x2f0 [<0>] kvm_vm_release+0x25/0x30 [<0>] do_exit+0x335/0xc10 [<0>] do_group_exit+0x3f/0xa0 [<0>] get_signal+0x1bc/0x670 [<0>] do_signal+0x31/0x130 Although the CLFLUSH is no longer issued on every encrypted region to be unregistered, there are no other changes that can prevent soft lockups for very large SEV VMs in the latest kernel. Periodically schedule if necessary. This still holds kvm->lock across the resched, but since this only happens when the VM is destroyed this is assumed to be acceptable. Signed-off-by: David Rientjes <rientjes@google.com> Message-Id: <alpine.DEB.2.23.453.2008251255240.2987727@chino.kir.corp.google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-09-11kvm x86/mmu: use KVM_REQ_MMU_SYNC to sync when neededLai Jiangshan1-1/+1
When kvm_mmu_get_page() gets a page with unsynced children, the spt pagetable is unsynchronized with the guest pagetable. But the guest might not issue a "flush" operation on it when the pagetable entry is changed from zero or other cases. The hypervisor has the responsibility to synchronize the pagetables. KVM behaved as above for many years, But commit 8c8560b83390 ("KVM: x86/mmu: Use KVM_REQ_TLB_FLUSH_CURRENT for MMU specific flushes") inadvertently included a line of code to change it without giving any reason in the changelog. It is clear that the commit's intention was to change KVM_REQ_TLB_FLUSH -> KVM_REQ_TLB_FLUSH_CURRENT, so we don't needlessly flush other contexts; however, one of the hunks changed a nearby KVM_REQ_MMU_SYNC instead. This patch changes it back. Link: https://lore.kernel.org/lkml/20200320212833.3507-26-sean.j.christopherson@intel.com/ Cc: Sean Christopherson <sean.j.christopherson@intel.com> Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Message-Id: <20200902135421.31158-1-jiangshanlai@gmail.com> fixes: 8c8560b83390 ("KVM: x86/mmu: Use KVM_REQ_TLB_FLUSH_CURRENT for MMU specific flushes") Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-09-11KVM: nVMX: Fix the update value of nested load IA32_PERF_GLOBAL_CTRL controlChenyi Qiang1-1/+1
A minor fix for the update of VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL field in exit_ctls_high. Fixes: 03a8871add95 ("KVM: nVMX: Expose load IA32_PERF_GLOBAL_CTRL VM-{Entry,Exit} control") Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Message-Id: <20200828085622.8365-5-chenyi.qiang@intel.com> Reviewed-by: Jim Mattson <jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-09-11KVM: nVMX: Update VMCS02 when L2 PAE PDPTE updates detectedPeter Shier3-2/+11
When L2 uses PAE, L0 intercepts of L2 writes to CR0/CR3/CR4 call load_pdptrs to read the possibly updated PDPTEs from the guest physical address referenced by CR3. It loads them into vcpu->arch.walk_mmu->pdptrs and sets VCPU_EXREG_PDPTR in vcpu->arch.regs_dirty. At the subsequent assumed reentry into L2, the mmu will call vmx_load_mmu_pgd which calls ept_load_pdptrs. ept_load_pdptrs sees VCPU_EXREG_PDPTR set in vcpu->arch.regs_dirty and loads VMCS02.GUEST_PDPTRn from vcpu->arch.walk_mmu->pdptrs[]. This all works if the L2 CRn write intercept always resumes L2. The resume path calls vmx_check_nested_events which checks for exceptions, MTF, and expired VMX preemption timers. If vmx_check_nested_events finds any of these conditions pending it will reflect the corresponding exit into L1. Live migration at this point would also cause a missed immediate reentry into L2. After L1 exits, vmx_vcpu_run calls vmx_register_cache_reset which clears VCPU_EXREG_PDPTR in vcpu->arch.regs_dirty. When L2 next resumes, ept_load_pdptrs finds VCPU_EXREG_PDPTR clear in vcpu->arch.regs_dirty and does not load VMCS02.GUEST_PDPTRn from vcpu->arch.walk_mmu->pdptrs[]. prepare_vmcs02 will then load VMCS02.GUEST_PDPTRn from vmcs12->pdptr0/1/2/3 which contain the stale values stored at last L2 exit. A repro of this bug showed L2 entering triple fault immediately due to the bad VMCS02.GUEST_PDPTRn values. When L2 is in PAE paging mode add a call to ept_load_pdptrs before leaving L2. This will update VMCS02.GUEST_PDPTRn if they are dirty in vcpu->arch.walk_mmu->pdptrs[]. Tested: kvm-unit-tests with new directed test: vmx_mtf_pdpte_test. Verified that test fails without the fix. Also ran Google internal VMM with an Ubuntu 16.04 4.4.0-83 guest running a custom hypervisor with a 32-bit Windows XP L2 guest using PAE. Prior to fix would repro readily. Ran 14 simultaneous L2s for 140 iterations with no failures. Signed-off-by: Peter Shier <pshier@google.com> Reviewed-by: Jim Mattson <jmattson@google.com> Message-Id: <20200820230545.2411347-1-pshier@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-09-11Merge tag 'kvmarm-fixes-5.9-1' of ↵Paolo Bonzini1-2/+13
git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD KVM/arm64 fixes for Linux 5.9, take #1 - Multiple stolen time fixes, with a new capability to match x86 - Fix for hugetlbfs mappings when PUD and PMD are the same level - Fix for hugetlbfs mappings when PTE mappings are enforced (dirty logging, for example) - Fix tracing output of 64bit values
2020-09-10objtool: Rename frame.h -> objtool.hJulien Thierry3-3/+3
Header frame.h is getting more code annotations to help objtool analyze object files. Rename the file to objtool.h. [ jpoimboe: add objtool.h to MAINTAINERS ] Signed-off-by: Julien Thierry <jthierry@redhat.com> Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
2020-09-07KVM: SVM: Add GHCB definitionsTom Lendacky1-0/+2
Extend the vmcb_safe_area with SEV-ES fields and add a new 'struct ghcb' which will be used for guest-hypervisor communication. Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: Joerg Roedel <jroedel@suse.de> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20200907131613.12703-3-joro@8bytes.org
2020-09-07KVM: SVM: nested: Don't allocate VMCB structures on stackJoerg Roedel1-14/+33
Do not allocate a vmcb_control_area and a vmcb_save_area on the stack, as these structures will become larger with future extenstions of SVM and thus the svm_set_nested_state() function will become a too large stack frame. Signed-off-by: Joerg Roedel <jroedel@suse.de> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20200907131613.12703-2-joro@8bytes.org
2020-09-07Merge 'x86/cpu' to pick up dependent bitsBorislav Petkov1-1/+1
Pick up work happening in parallel to avoid nasty merge conflicts later. Signed-off-by: Borislav Petkov <bp@suse.de>
2020-08-30x86/kvm: Expose TSX Suspend Load Tracking featureCathy Zhang1-1/+1
TSX suspend load tracking instruction is supported by the Intel uarch Sapphire Rapids. It aims to give a way to choose which memory accesses do not need to be tracked in the TSX read set. It's availability is indicated as CPUID.(EAX=7,ECX=0):EDX[bit 16]. Expose TSX Suspend Load Address Tracking feature in KVM CPUID, so KVM could pass this information to guests and they can make use of this feature accordingly. Signed-off-by: Cathy Zhang <cathy.zhang@intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Tony Luck <tony.luck@intel.com> Link: https://lkml.kernel.org/r/1598316478-23337-3-git-send-email-cathy.zhang@intel.com
2020-08-25kvm: mmu: page_track: Fix RCU list API usageMadhuparna Bhowmik1-2/+4
Use hlist_for_each_entry_srcu() instead of hlist_for_each_entry_rcu() as it also checkes if the right lock is held. Using hlist_for_each_entry_rcu() with a condition argument will not report the cases where a SRCU protected list is traversed using rcu_read_lock(). Hence, use hlist_for_each_entry_srcu(). Signed-off-by: Madhuparna Bhowmik <madhuparnabhowmik10@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Acked-by: Paolo Bonzini <pbonzini@redhat.com> Cc: <kvm@vger.kernel.org>
2020-08-24treewide: Use fallthrough pseudo-keywordGustavo A. R. Silva8-21/+18
Replace the existing /* fall through */ comments and its variants with the new pseudo-keyword macro fallthrough[1]. Also, remove unnecessary fall-through markings when it is the case. [1] https://www.kernel.org/doc/html/v5.7/process/deprecated.html?highlight=fallthrough#implicit-switch-case-fall-through Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
2020-08-22Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds2-3/+6
Pull kvm fixes from Paolo Bonzini: - PAE and PKU bugfixes for x86 - selftests fix for new binutils - MMU notifier fix for arm64 * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: KVM: arm64: Only reschedule if MMU_NOTIFIER_RANGE_BLOCKABLE is not set KVM: Pass MMU notifier range flags to kvm_unmap_hva_range() kvm: x86: Toggling CR4.PKE does not load PDPTEs in PAE mode kvm: x86: Toggling CR4.SMAP does not load PDPTEs in PAE mode KVM: x86: fix access code passed to gva_to_gpa selftests: kvm: Use a shorter encoding to clear RAX
2020-08-22KVM: Pass MMU notifier range flags to kvm_unmap_hva_range()Will Deacon1-1/+2
The 'flags' field of 'struct mmu_notifier_range' is used to indicate whether invalidate_range_{start,end}() are permitted to block. In the case of kvm_mmu_notifier_invalidate_range_start(), this field is not forwarded on to the architecture-specific implementation of kvm_unmap_hva_range() and therefore the backend cannot sensibly decide whether or not to block. Add an extra 'flags' parameter to kvm_unmap_hva_range() so that architectures are aware as to whether or not they are permitted to block. Cc: <stable@vger.kernel.org> Cc: Marc Zyngier <maz@kernel.org> Cc: Suzuki K Poulose <suzuki.poulose@arm.com> Cc: James Morse <james.morse@arm.com> Signed-off-by: Will Deacon <will@kernel.org> Message-Id: <20200811102725.7121-2-will@kernel.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-08-21arm64/x86: KVM: Introduce steal-time capAndrew Jones1-0/+3
arm64 requires a vcpu fd (KVM_HAS_DEVICE_ATTR vcpu ioctl) to probe support for steal-time. However this is unnecessary, as only a KVM fd is required, and it complicates userspace (userspace may prefer delaying vcpu creation until after feature probing). Introduce a cap that can be checked instead. While x86 can already probe steal-time support with a kvm fd (KVM_GET_SUPPORTED_CPUID), we add the cap there too for consistency. Signed-off-by: Andrew Jones <drjones@redhat.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Reviewed-by: Steven Price <steven.price@arm.com> Link: https://lore.kernel.org/r/20200804170604.42662-7-drjones@redhat.com
2020-08-17kvm: x86: Toggling CR4.PKE does not load PDPTEs in PAE modeJim Mattson1-1/+1
See the SDM, volume 3, section 4.4.1: If PAE paging would be in use following an execution of MOV to CR0 or MOV to CR4 (see Section 4.1.1) and the instruction is modifying any of CR0.CD, CR0.NW, CR0.PG, CR4.PAE, CR4.PGE, CR4.PSE, or CR4.SMEP; then the PDPTEs are loaded from the address in CR3. Fixes: b9baba8614890 ("KVM, pkeys: expose CPUID/CR4 to guest") Cc: Huaitong Han <huaitong.han@intel.com> Signed-off-by: Jim Mattson <jmattson@google.com> Reviewed-by: Peter Shier <pshier@google.com> Reviewed-by: Oliver Upton <oupton@google.com> Message-Id: <20200817181655.3716509-1-jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-08-17kvm: x86: Toggling CR4.SMAP does not load PDPTEs in PAE modeJim Mattson1-1/+1
See the SDM, volume 3, section 4.4.1: If PAE paging would be in use following an execution of MOV to CR0 or MOV to CR4 (see Section 4.1.1) and the instruction is modifying any of CR0.CD, CR0.NW, CR0.PG, CR4.PAE, CR4.PGE, CR4.PSE, or CR4.SMEP; then the PDPTEs are loaded from the address in CR3. Fixes: 0be0226f07d14 ("KVM: MMU: fix SMAP virtualization") Cc: Xiao Guangrong <guangrong.xiao@linux.intel.com> Signed-off-by: Jim Mattson <jmattson@google.com> Reviewed-by: Peter Shier <pshier@google.com> Reviewed-by: Oliver Upton <oupton@google.com> Message-Id: <20200817181655.3716509-2-jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-08-17KVM: x86: fix access code passed to gva_to_gpaPaolo Bonzini1-1/+3
The PK bit of the error code is computed dynamically in permission_fault and therefore need not be passed to gva_to_gpa: only the access bits (fetch, user, write) need to be passed down. Not doing so causes a splat in the pku test: WARNING: CPU: 25 PID: 5465 at arch/x86/kvm/mmu.h:197 paging64_walk_addr_generic+0x594/0x750 [kvm] Hardware name: Intel Corporation WilsonCity/WilsonCity, BIOS WLYDCRB1.SYS.0014.D62.2001092233 01/09/2020 RIP: 0010:paging64_walk_addr_generic+0x594/0x750 [kvm] Code: <0f> 0b e9 db fe ff ff 44 8b 43 04 4c 89 6c 24 30 8b 13 41 39 d0 89 RSP: 0018:ff53778fc623fb60 EFLAGS: 00010202 RAX: 0000000000000001 RBX: ff53778fc623fbf0 RCX: 0000000000000007 RDX: 0000000000000001 RSI: 0000000000000002 RDI: ff4501efba818000 RBP: 0000000000000020 R08: 0000000000000005 R09: 00000000004000e7 R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000007 R13: ff4501efba818388 R14: 10000000004000e7 R15: 0000000000000000 FS: 00007f2dcf31a700(0000) GS:ff4501f1c8040000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000000 CR3: 0000001dea475005 CR4: 0000000000763ee0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 PKRU: 55555554 Call Trace: paging64_gva_to_gpa+0x3f/0xb0 [kvm] kvm_fixup_and_inject_pf_error+0x48/0xa0 [kvm] handle_exception_nmi+0x4fc/0x5b0 [kvm_intel] kvm_arch_vcpu_ioctl_run+0x911/0x1c10 [kvm] kvm_vcpu_ioctl+0x23e/0x5d0 [kvm] ksys_ioctl+0x92/0xb0 __x64_sys_ioctl+0x16/0x20 do_syscall_64+0x3e/0xb0 entry_SYSCALL_64_after_hwframe+0x44/0xa9 ---[ end trace d17eb998aee991da ]--- Reported-by: Sean Christopherson <sean.j.christopherson@intel.com> Fixes: 897861479c064 ("KVM: x86: Add helper functions for illegal GPA checking and page fault injection") Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-08-12Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds3-13/+15
Pull more KVM updates from Paolo Bonzini: "PPC: - Improvements and bugfixes for secure VM support, giving reduced startup time and memory hotplug support. - Locking fixes in nested KVM code - Increase number of guests supported by HV KVM to 4094 - Preliminary POWER10 support ARM: - Split the VHE and nVHE hypervisor code bases, build the EL2 code separately, allowing for the VHE code to now be built with instrumentation - Level-based TLB invalidation support - Restructure of the vcpu register storage to accomodate the NV code - Pointer Authentication available for guests on nVHE hosts - Simplification of the system register table parsing - MMU cleanups and fixes - A number of post-32bit cleanups and other fixes MIPS: - compilation fixes x86: - bugfixes - support for the SERIALIZE instruction" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (70 commits) KVM: MIPS/VZ: Fix build error caused by 'kvm_run' cleanup x86/kvm/hyper-v: Synic default SCONTROL MSR needs to be enabled MIPS: KVM: Convert a fallthrough comment to fallthrough MIPS: VZ: Only include loongson_regs.h for CPU_LOONGSON64 x86: Expose SERIALIZE for supported cpuid KVM: x86: Don't attempt to load PDPTRs when 64-bit mode is enabled KVM: arm64: Move S1PTW S2 fault logic out of io_mem_abort() KVM: arm64: Don't skip cache maintenance for read-only memslots KVM: arm64: Handle data and instruction external aborts the same way KVM: arm64: Rename kvm_vcpu_dabt_isextabt() KVM: arm: Add trace name for ARM_NISV KVM: arm64: Ensure that all nVHE hyp code is in .hyp.text KVM: arm64: Substitute RANDOMIZE_BASE for HARDEN_EL2_VECTORS KVM: arm64: Make nVHE ASLR conditional on RANDOMIZE_BASE KVM: PPC: Book3S HV: Rework secure mem slot dropping KVM: PPC: Book3S HV: Move kvmppc_svm_page_out up KVM: PPC: Book3S HV: Migrate hot plugged memory KVM: PPC: Book3S HV: In H_SVM_INIT_DONE, migrate remaining normal-GFNs to secure-GFNs KVM: PPC: Book3S HV: Track the state GFNs associated with secure VMs KVM: PPC: Book3S HV: Disable page merging in H_SVM_INIT_START ...
2020-08-12Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhostLinus Torvalds1-2/+10
Pull virtio updates from Michael Tsirkin: - IRQ bypass support for vdpa and IFC - MLX5 vdpa driver - Endianness fixes for virtio drivers - Misc other fixes * tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost: (71 commits) vdpa/mlx5: fix up endian-ness for mtu vdpa: Fix pointer math bug in vdpasim_get_config() vdpa/mlx5: Fix pointer math in mlx5_vdpa_get_config() vdpa/mlx5: fix memory allocation failure checks vdpa/mlx5: Fix uninitialised variable in core/mr.c vdpa_sim: init iommu lock virtio_config: fix up warnings on parisc vdpa/mlx5: Add VDPA driver for supported mlx5 devices vdpa/mlx5: Add shared memory registration code vdpa/mlx5: Add support library for mlx5 VDPA implementation vdpa/mlx5: Add hardware descriptive header file vdpa: Modify get_vq_state() to return error code net/vdpa: Use struct for set/get vq state vdpa: remove hard coded virtq num vdpasim: support batch updating vhost-vdpa: support IOTLB batching hints vhost-vdpa: support get/set backend features vhost: generialize backend features setting/getting vhost-vdpa: refine ioctl pre-processing vDPA: dont change vq irq after DRIVER_OK ...
2020-08-10x86/kvm/hyper-v: Synic default SCONTROL MSR needs to be enabledJon Doron1-0/+1
Based on an analysis of the HyperV firmwares (Gen1 and Gen2) it seems like the SCONTROL is not being set to the ENABLED state as like we have thought. Also from a test done by Vitaly Kuznetsov, running a nested HyperV it was concluded that the first access to the SCONTROL MSR with a read resulted with the value of 0x1, aka HV_SYNIC_CONTROL_ENABLE. It's important to note that this diverges from the value states in the HyperV TLFS of 0. Signed-off-by: Jon Doron <arilou@gmail.com> Message-Id: <20200717125238.1103096-2-arilou@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-08-09x86: Expose SERIALIZE for supported cpuidPaolo Bonzini1-1/+2
The SERIALIZE instruction is supported by Tntel processors, like Sapphire Rapids. SERIALIZE is a faster serializing instruction which does not modify registers, arithmetic flags or memory, will not cause VM exit. It's availability is indicated by CPUID.(EAX=7,ECX=0):ECX[bit 14]. Expose it in KVM supported CPUID. This way, KVM could pass this information to guests and they can make use of these features accordingly. Signed-off-by: Cathy Zhang <cathy.zhang@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>