summaryrefslogtreecommitdiff
path: root/arch/x86/kvm/x86.c
AgeCommit message (Collapse)AuthorFilesLines
2019-09-18Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds1-69/+128
Pull KVM updates from Paolo Bonzini: "s390: - ioctl hardening - selftests ARM: - ITS translation cache - support for 512 vCPUs - various cleanups and bugfixes PPC: - various minor fixes and preparation x86: - bugfixes all over the place (posted interrupts, SVM, emulation corner cases, blocked INIT) - some IPI optimizations" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (75 commits) KVM: X86: Use IPI shorthands in kvm guest when support KVM: x86: Fix INIT signal handling in various CPU states KVM: VMX: Introduce exit reason for receiving INIT signal on guest-mode KVM: VMX: Stop the preemption timer during vCPU reset KVM: LAPIC: Micro optimize IPI latency kvm: Nested KVM MMUs need PAE root too KVM: x86: set ctxt->have_exception in x86_decode_insn() KVM: x86: always stop emulation on page fault KVM: nVMX: trace nested VM-Enter failures detected by H/W KVM: nVMX: add tracepoint for failed nested VM-Enter x86: KVM: svm: Fix a check in nested_svm_vmrun() KVM: x86: Return to userspace with internal error on unexpected exit reason KVM: x86: Add kvm_emulate_{rd,wr}msr() to consolidate VXM/SVM code KVM: x86: Refactor up kvm_{g,s}et_msr() to simplify callers doc: kvm: Fix return description of KVM_SET_MSRS KVM: X86: Tune PLE Window tracepoint KVM: VMX: Change ple_window type to unsigned int KVM: X86: Remove tailing newline for tracepoints KVM: X86: Trace vcpu_id for vmexit KVM: x86: Manually calculate reserved bits when loading PDPTRS ...
2019-09-14KVM: x86: work around leak of uninitialized stack contentsFuqian Huang1-0/+7
Emulation of VMPTRST can incorrectly inject a page fault when passed an operand that points to an MMIO address. The page fault will use uninitialized kernel stack memory as the CR2 and error code. The right behavior would be to abort the VM with a KVM_EXIT_INTERNAL_ERROR exit to userspace; however, it is not an easy fix, so for now just ensure that the error code and CR2 are zero. Signed-off-by: Fuqian Huang <huangfq.daxian@gmail.com> Cc: stable@vger.kernel.org [add comment] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-09-11KVM: x86: set ctxt->have_exception in x86_decode_insn()Jan Dakinevich1-0/+6
x86_emulate_instruction() takes into account ctxt->have_exception flag during instruction decoding, but in practice this flag is never set in x86_decode_insn(). Fixes: 6ea6e84309ca ("KVM: x86: inject exceptions produced by x86_decode_insn") Cc: stable@vger.kernel.org Cc: Denis Lunev <den@virtuozzo.com> Cc: Roman Kagan <rkagan@virtuozzo.com> Cc: Denis Plotnikov <dplotnikov@virtuozzo.com> Signed-off-by: Jan Dakinevich <jan.dakinevich@virtuozzo.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-09-11KVM: x86: always stop emulation on page faultJan Dakinevich1-1/+3
inject_emulated_exception() returns true if and only if nested page fault happens. However, page fault can come from guest page tables walk, either nested or not nested. In both cases we should stop an attempt to read under RIP and give guest to step over its own page fault handler. This is also visible when an emulated instruction causes a #GP fault and the VMware backdoor is enabled. To handle the VMware backdoor, KVM intercepts #GP faults; with only the next patch applied, x86_emulate_instruction() injects a #GP but returns EMULATE_FAIL instead of EMULATE_DONE. EMULATE_FAIL causes handle_exception_nmi() (or gp_interception() for SVM) to re-inject the original #GP because it thinks emulation failed due to a non-VMware opcode. This patch prevents the issue as x86_emulate_instruction() will return EMULATE_DONE after injecting the #GP. Fixes: 6ea6e84309ca ("KVM: x86: inject exceptions produced by x86_decode_insn") Cc: stable@vger.kernel.org Cc: Denis Lunev <den@virtuozzo.com> Cc: Roman Kagan <rkagan@virtuozzo.com> Cc: Denis Plotnikov <dplotnikov@virtuozzo.com> Signed-off-by: Jan Dakinevich <jan.dakinevich@virtuozzo.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-09-11KVM: nVMX: add tracepoint for failed nested VM-EnterSean Christopherson1-0/+1
Debugging a failed VM-Enter is often like searching for a needle in a haystack, e.g. there are over 80 consistency checks that funnel into the "invalid control field" error code. One way to expedite debug is to run the buggy code as an L1 guest under KVM (and pray that the failing check is detected by KVM). However, extracting useful debug information out of L0 KVM requires attaching a debugger to KVM and/or modifying the source, e.g. to log which check is failing. Make life a little less painful for VMM developers and add a tracepoint for failed VM-Enter consistency checks. Ideally the tracepoint would capture both what check failed and precisely why it failed, but logging why a checked failed is difficult to do in a generic tracepoint without resorting to invasive techniques, e.g. generating a custom string on failure. That being said, for the vast majority of VM-Enter failures the most difficult step is figuring out exactly what to look at, e.g. figuring out which bit was incorrectly set in a control field is usually not too painful once the guilty field as been identified. To reach a happy medium between precision and ease of use, simply log the code that detected a failed check, using a macro to execute the check and log the trace event on failure. This approach enables tracing arbitrary code, e.g. it's not limited to function calls or specific formats of checks, and the changes to the existing code are minimally invasive. A macro with a two-character name is desirable as usage of the macro doesn't result in overly long lines or confusing alignment, while still retaining some amount of readability. I.e. a one-character name is a little too terse, and a three-character name results in the contents being passed to the macro aligning with an indented line when the macro is used an in if-statement, e.g.: if (VCC(nested_vmx_check_long_line_one(...) && nested_vmx_check_long_line_two(...))) return -EINVAL; And that is the story of how the CC(), a.k.a. Consistency Check, macro got its name. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-09-10KVM: x86: Add kvm_emulate_{rd,wr}msr() to consolidate VXM/SVM codeSean Christopherson1-0/+35
Move RDMSR and WRMSR emulation into common x86 code to consolidate nearly identical SVM and VMX code. Note, consolidating RDMSR introduces an extra indirect call, i.e. retpoline, due to reaching {svm,vmx}_get_msr() via kvm_x86_ops, but a guest kernel likely has bigger problems if increasing the latency of RDMSR VM-Exits by ~70 cycles has a measurable impact on overall VM performance. E.g. the only recurring RDMSR VM-Exits (after booting) on my system running Linux 5.2 in the guest are for MSR_IA32_TSC_ADJUST via arch_cpu_idle_enter(). No functional change intended. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-09-10KVM: x86: Refactor up kvm_{g,s}et_msr() to simplify callersSean Christopherson1-51/+50
Refactor the top-level MSR accessors to take/return the index and value directly instead of requiring the caller to dump them into a msr_data struct. No functional change intended. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-09-10KVM: X86: Tune PLE Window tracepointPeter Xu1-1/+1
The PLE window tracepoint triggers even if the window is not changed, and the wording can be a bit confusing too. One example line: kvm_ple_window: vcpu 0: ple_window 4096 (shrink 4096) It easily let people think of "the window now is 4096 which is shrinked", but the truth is the value actually didn't change (4096). Let's only dump this message if the value really changed, and we make the message even simpler like: kvm_ple_window: vcpu 4 old 4096 new 8192 (growed) Signed-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-09-10KVM: x86: Manually calculate reserved bits when loading PDPTRSSean Christopherson1-3/+8
Manually generate the PDPTR reserved bit mask when explicitly loading PDPTRs. The reserved bits that are being tracked by the MMU reflect the current paging mode, which is unlikely to be PAE paging in the vast majority of flows that use load_pdptrs(), e.g. CR0 and CR4 emulation, __set_sregs(), etc... This can cause KVM to incorrectly signal a bad PDPTR, or more likely, miss a reserved bit check and subsequently fail a VM-Enter due to a bad VMCS.GUEST_PDPTR. Add a one off helper to generate the reserved bits instead of sharing code across the MMU's calculations and the PDPTR emulation. The PDPTR reserved bits are basically set in stone, and pushing a helper into the MMU's calculation adds unnecessary complexity without improving readability. Oppurtunistically fix/update the comment for load_pdptrs(). Note, the buggy commit also introduced a deliberate functional change, "Also remove bit 5-6 from rsvd_bits_mask per latest SDM.", which was effectively (and correctly) reverted by commit cd9ae5fe47df ("KVM: x86: Fix page-tables reserved bits"). A bit of SDM archaeology shows that the SDM from late 2008 had a bug (likely a copy+paste error) where it listed bits 6:5 as AVL and A for PDPTEs used for 4k entries but reserved for 2mb entries. I.e. the SDM contradicted itself, and bits 6:5 are and always have been reserved. Fixes: 20c466b56168d ("KVM: Use rsvd_bits_mask in load_pdptrs()") Cc: stable@vger.kernel.org Cc: Nadav Amit <nadav.amit@gmail.com> Reported-by: Doug Reiland <doug.reiland@intel.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Reviewed-by: Peter Xu <peterx@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-08-27KVM: x86: Don't update RIP or do single-step on faulting emulationSean Christopherson1-4/+5
Don't advance RIP or inject a single-step #DB if emulation signals a fault. This logic applies to all state updates that are conditional on clean retirement of the emulation instruction, e.g. updating RFLAGS was previously handled by commit 38827dbd3fb85 ("KVM: x86: Do not update EFLAGS on faulting emulation"). Not advancing RIP is likely a nop, i.e. ctxt->eip isn't updated with ctxt->_eip until emulation "retires" anyways. Skipping #DB injection fixes a bug reported by Andy Lutomirski where a #UD on SYSCALL due to invalid state with EFLAGS.TF=1 would loop indefinitely due to emulation overwriting the #UD with #DB and thus restarting the bad SYSCALL over and over. Cc: Nadav Amit <nadav.amit@gmail.com> Cc: stable@vger.kernel.org Reported-by: Andy Lutomirski <luto@kernel.org> Fixes: 663f4c61b803 ("KVM: x86: handle singlestep during emulation") Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2019-08-22KVM: X86: Add pv tlb shootdown tracepointWanpeng Li1-0/+2
Add pv tlb shootdown tracepoint. Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Wanpeng Li <wanpengli@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-08-22KVM: x86: Unconditionally call x86 ops that are always implementedSean Christopherson1-10/+3
Remove a few stale checks for non-NULL ops now that the ops in question are implemented by both VMX and SVM. Note, this is **not** stable material, the Fixes tags are there purely to show when a particular op was first supported by both VMX and SVM. Fixes: 74f169090b6f ("kvm/svm: Setup MCG_CAP on AMD properly") Fixes: b31c114b82b2 ("KVM: X86: Provide a capability to disable PAUSE intercepts") Fixes: 411b44ba80ab ("svm: Implements update_pi_irte hook to setup posted interrupt") Cc: Krish Sadhukhan <krish.sadhukhan@oracle.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-08-22KVM: x86: Rename access permissions cache member in struct kvm_vcpu_archSean Christopherson1-1/+1
Rename "access" to "mmio_access" to match the other MMIO cache members and to make it more obvious that it's tracking the access permissions for the MMIO cache. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-08-22x86: KVM: add xsetbv to the emulatorVitaly Kuznetsov1-0/+6
To avoid hardcoding xsetbv length to '3' we need to support decoding it in the emulator. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-08-22x86: KVM: clear interrupt shadow on EMULTYPE_SKIPVitaly Kuznetsov1-0/+1
When doing x86_emulate_instruction(EMULTYPE_SKIP) interrupt shadow has to be cleared if and only if the skipping is successful. There are two immediate issues: - In SVM skip_emulated_instruction() we are not zapping interrupt shadow in case kvm_emulate_instruction(EMULTYPE_SKIP) is used to advance RIP (!nrpip_save). - In VMX handle_ept_misconfig() when running as a nested hypervisor we (static_cpu_has(X86_FEATURE_HYPERVISOR) case) forget to clear interrupt shadow. Note that we intentionally don't handle the case when the skipped instruction is supposed to prolong the interrupt shadow ("MOV/POP SS") as skip-emulation of those instructions should not happen under normal circumstances. Suggested-by: Sean Christopherson <sean.j.christopherson@intel.com> Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-08-22x86: kvm: svm: propagate errors from skip_emulated_instruction()Vitaly Kuznetsov1-2/+4
On AMD, kvm_x86_ops->skip_emulated_instruction(vcpu) can, in theory, fail: in !nrips case we call kvm_emulate_instruction(EMULTYPE_SKIP). Currently, we only do printk(KERN_DEBUG) when this happens and this is not ideal. Propagate the error up the stack. On VMX, skip_emulated_instruction() doesn't fail, we have two call sites calling it explicitly: handle_exception_nmi() and handle_task_switch(), we can just ignore the result. On SVM, we also have two explicit call sites: svm_queue_exception() and it seems we don't need to do anything there as we check if RIP was advanced or not. In task_switch_interception(), however, we are better off not proceeding to kvm_task_switch() in case skip_emulated_instruction() failed. Suggested-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-08-22KVM: x86: use Intel speculation bugs and features as derived in generic x86 codePaolo Bonzini1-0/+7
Similar to AMD bits, set the Intel bits from the vendor-independent feature and bug flags, because KVM_GET_SUPPORTED_CPUID does not care about the vendor and they should be set on AMD processors as well. Suggested-by: Jim Mattson <jmattson@google.com> Reviewed-by: Jim Mattson <jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-08-05KVM: Fix leak vCPU's VMCS value into other pCPUWanpeng Li1-0/+16
After commit d73eb57b80b (KVM: Boost vCPUs that are delivering interrupts), a five years old bug is exposed. Running ebizzy benchmark in three 80 vCPUs VMs on one 80 pCPUs Skylake server, a lot of rcu_sched stall warning splatting in the VMs after stress testing: INFO: rcu_sched detected stalls on CPUs/tasks: { 4 41 57 62 77} (detected by 15, t=60004 jiffies, g=899, c=898, q=15073) Call Trace: flush_tlb_mm_range+0x68/0x140 tlb_flush_mmu.part.75+0x37/0xe0 tlb_finish_mmu+0x55/0x60 zap_page_range+0x142/0x190 SyS_madvise+0x3cd/0x9c0 system_call_fastpath+0x1c/0x21 swait_active() sustains to be true before finish_swait() is called in kvm_vcpu_block(), voluntarily preempted vCPUs are taken into account by kvm_vcpu_on_spin() loop greatly increases the probability condition kvm_arch_vcpu_runnable(vcpu) is checked and can be true, when APICv is enabled the yield-candidate vCPU's VMCS RVI field leaks(by vmx_sync_pir_to_irr()) into spinning-on-a-taken-lock vCPU's current VMCS. This patch fixes it by checking conservatively a subset of events. Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Marc Zyngier <Marc.Zyngier@arm.com> Cc: stable@vger.kernel.org Fixes: 98f4a1467 (KVM: add kvm_arch_vcpu_runnable() test to kvm_vcpu_on_spin() loop) Signed-off-by: Wanpeng Li <wanpengli@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-07-24KVM: X86: Boost queue head vCPU to mitigate lock waiter preemptionWanpeng Li1-1/+2
Commit 11752adb (locking/pvqspinlock: Implement hybrid PV queued/unfair locks) introduces hybrid PV queued/unfair locks - queued mode (no starvation) - unfair mode (good performance on not heavily contended lock) The lock waiter goes into the unfair mode especially in VMs with over-commit vCPUs since increaing over-commitment increase the likehood that the queue head vCPU may have been preempted and not actively spinning. However, reschedule queue head vCPU timely to acquire the lock still can get better performance than just depending on lock stealing in over-subscribe scenario. Testing on 80 HT 2 socket Xeon Skylake server, with 80 vCPUs VM 80GB RAM: ebizzy -M vanilla boosting improved 1VM 23520 25040 6% 2VM 8000 13600 70% 3VM 3100 5400 74% The lock holder vCPU yields to the queue head vCPU when unlock, to boost queue head vCPU which is involuntary preemption or the one which is voluntary halt due to fail to acquire the lock after a short spin in the guest. Cc: Waiman Long <longman@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Wanpeng Li <wanpengli@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-07-22KVM: X86: Dynamically allocate user_fpuWanpeng Li1-2/+2
After reverting commit 240c35a3783a (kvm: x86: Use task structs fpu field for user), struct kvm_vcpu is 19456 bytes on my server, PAGE_ALLOC_COSTLY_ORDER(3) is the order at which allocations are deemed costly to service. In serveless scenario, one host can service hundreds/thoudands firecracker/kata-container instances, howerver, new instance will fail to launch after memory is too fragmented to allocate kvm_vcpu struct on host, this was observed in some cloud provider product environments. This patch dynamically allocates user_fpu, kvm_vcpu is 15168 bytes now on my Skylake server. Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Wanpeng Li <wanpengli@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-07-22KVM: X86: Fix fpu state crash in kvm guestWanpeng Li1-3/+6
The idea before commit 240c35a37 (which has just been reverted) was that we have the following FPU states: userspace (QEMU) guest --------------------------------------------------------------------------- processor vcpu->arch.guest_fpu >>> KVM_RUN: kvm_load_guest_fpu vcpu->arch.user_fpu processor >>> preempt out vcpu->arch.user_fpu current->thread.fpu >>> preempt in vcpu->arch.user_fpu processor >>> back to userspace >>> kvm_put_guest_fpu processor vcpu->arch.guest_fpu --------------------------------------------------------------------------- With the new lazy model we want to get the state back to the processor when schedule in from current->thread.fpu. Reported-by: Thomas Lambertz <mail@thomaslambertz.de> Reported-by: anthony <antdev66@gmail.com> Tested-by: anthony <antdev66@gmail.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Cc: Thomas Lambertz <mail@thomaslambertz.de> Cc: anthony <antdev66@gmail.com> Cc: stable@vger.kernel.org Fixes: 5f409e20b (x86/fpu: Defer FPU state load until return to userspace) Signed-off-by: Wanpeng Li <wanpengli@tencent.com> [Add a comment in front of the warning. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-07-22Revert "kvm: x86: Use task structs fpu field for user"Paolo Bonzini1-2/+2
This reverts commit 240c35a3783ab9b3a0afaba0dde7291295680a6b ("kvm: x86: Use task structs fpu field for user", 2018-11-06). The commit is broken and causes QEMU's FPU state to be destroyed when KVM_RUN is preempted. Fixes: 240c35a3783a ("kvm: x86: Use task structs fpu field for user") Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-07-20KVM: LAPIC: Inject timer interrupt via posted interruptWanpeng Li1-0/+6
Dedicated instances are currently disturbed by unnecessary jitter due to the emulated lapic timers firing on the same pCPUs where the vCPUs reside. There is no hardware virtual timer on Intel for guest like ARM, so both programming timer in guest and the emulated timer fires incur vmexits. This patch tries to avoid vmexit when the emulated timer fires, at least in dedicated instance scenario when nohz_full is enabled. In that case, the emulated timers can be offload to the nearest busy housekeeping cpus since APICv has been found for several years in server processors. The guest timer interrupt can then be injected via posted interrupts, which are delivered by the housekeeping cpu once the emulated timer fires. The host should tuned so that vCPUs are placed on isolated physical processors, and with several pCPUs surplus for busy housekeeping. If disabled mwait/hlt/pause vmexits keep the vCPUs in non-root mode, ~3% redis performance benefit can be observed on Skylake server, and the number of external interrupt vmexits drops substantially. Without patch VM-EXIT Samples Samples% Time% Min Time Max Time Avg time EXTERNAL_INTERRUPT 42916 49.43% 39.30% 0.47us 106.09us 0.71us ( +- 1.09% ) While with patch: VM-EXIT Samples Samples% Time% Min Time Max Time Avg time EXTERNAL_INTERRUPT 6871 9.29% 2.96% 0.44us 57.88us 0.72us ( +- 4.02% ) Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Wanpeng Li <wanpengli@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-07-17KVM: LAPIC: Make lapic timer unpinnedWanpeng Li1-5/+1
Commit 61abdbe0bcc2 ("kvm: x86: make lapic hrtimer pinned") pinned the lapic timer to avoid to wait until the next kvm exit for the guest to see KVM_REQ_PENDING_TIMER set. There is another solution to give a kick after setting the KVM_REQ_PENDING_TIMER bit, make lapic timer unpinned will be used in follow up patches. Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Wanpeng Li <wanpengli@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-07-15kvm: x86: some tsc debug cleanupYi Wang1-8/+0
There are some pr_debug in TSC code, which may have been no use, so remove them as Paolo suggested. Signed-off-by: Yi Wang <wang.yi59@zte.com.cn> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-07-13Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds1-62/+167
Pull KVM updates from Paolo Bonzini: "ARM: - support for chained PMU counters in guests - improved SError handling - handle Neoverse N1 erratum #1349291 - allow side-channel mitigation status to be migrated - standardise most AArch64 system register accesses to msr_s/mrs_s - fix host MPIDR corruption on 32bit - selftests ckleanups x86: - PMU event {white,black}listing - ability for the guest to disable host-side interrupt polling - fixes for enlightened VMCS (Hyper-V pv nested virtualization), - new hypercall to yield to IPI target - support for passing cstate MSRs through to the guest - lots of cleanups and optimizations Generic: - Some txt->rST conversions for the documentation" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (128 commits) Documentation: virtual: Add toctree hooks Documentation: kvm: Convert cpuid.txt to .rst Documentation: virtual: Convert paravirt_ops.txt to .rst KVM: x86: Unconditionally enable irqs in guest context KVM: x86: PMU Event Filter kvm: x86: Fix -Wmissing-prototypes warnings KVM: Properly check if "page" is valid in kvm_vcpu_unmap KVM: arm/arm64: Initialise host's MPIDRs by reading the actual register KVM: LAPIC: Retry tune per-vCPU timer_advance_ns if adaptive tuning goes insane kvm: LAPIC: write down valid APIC registers KVM: arm64: Migrate _elx sysreg accessors to msr_s/mrs_s KVM: doc: Add API documentation on the KVM_REG_ARM_WORKAROUNDS register KVM: arm/arm64: Add save/restore support for firmware workaround state arm64: KVM: Propagate full Spectre v2 workaround state to KVM guests KVM: arm/arm64: Support chained PMU counters KVM: arm/arm64: Remove pmc->bitmask KVM: arm/arm64: Re-create event when setting counter value KVM: arm/arm64: Extract duplicated code to own function KVM: arm/arm64: Rename kvm_pmu_{enable/disable}_counter functions KVM: LAPIC: ARBPRI is a reserved register for x2APIC ...
2019-07-11Merge tag 'kvm-arm-for-5.3' of ↵Paolo Bonzini1-4/+1
git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD KVM/arm updates for 5.3 - Add support for chained PMU counters in guests - Improve SError handling - Handle Neoverse N1 erratum #1349291 - Allow side-channel mitigation status to be migrated - Standardise most AArch64 system register accesses to msr_s/mrs_s - Fix host MPIDR corruption on 32bit
2019-07-11KVM: x86: Unconditionally enable irqs in guest contextSean Christopherson1-0/+11
On VMX, KVM currently does not re-enable irqs until after it has exited the guest context. As a result, a tick that fires in the window between VM-Exit and guest_exit_irqoff() will be accounted as system time. While said window is relatively small, it's large enough to be problematic in some configurations, e.g. if VM-Exits are consistently occurring a hair earlier than the tick irq. Intentionally toggle irqs back off so that guest_exit_irqoff() can be used in lieu of guest_exit() in order to avoid the save/restore of flags in guest_exit(). On my Haswell system, "nop; cli; sti" is ~6 cycles, versus ~28 cycles for "pushf; pop <reg>; cli; push <reg>; popf". Fixes: f2485b3e0c6c0 ("KVM: x86: use guest_exit_irqoff") Reported-by: Wei Yang <w90p710@gmail.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-07-11KVM: x86: PMU Event FilterEric Hankland1-0/+5
Some events can provide a guest with information about other guests or the host (e.g. L3 cache stats); providing the capability to restrict access to a "safe" set of events would limit the potential for the PMU to be used in any side channel attacks. This change introduces a new VM ioctl that sets an event filter. If the guest attempts to program a counter for any blacklisted or non-whitelisted event, the kernel counter won't be created, so any RDPMC/RDMSR will show 0 instances of that event. Signed-off-by: Eric Hankland <ehankland@google.com> [Lots of changes. All remaining bugs are probably mine. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-07-08Merge branch 'timers-core-for-linus' of ↵Linus Torvalds1-6/+7
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer updates from Thomas Gleixner: "The timer and timekeeping departement delivers: Core: - The consolidation of the VDSO code into a generic library including the conversion of x86 and ARM64. Conversion of ARM and MIPS are en route through the relevant maintainer trees and should end up in 5.4. This gets rid of the unnecessary different copies of the same code and brings all architectures on the same level of VDSO functionality. - Make the NTP user space interface more robust by restricting the TAI offset to prevent undefined behaviour. Includes a selftest. - Validate user input in the compat settimeofday() syscall to catch invalid values which would be turned into valid values by a multiplication overflow - Consolidate the time accessors - Small fixes, improvements and cleanups all over the place Drivers: - Support for the NXP system counter, TI davinci timer - Move the Microsoft HyperV clocksource/events code into the drivers/clocksource directory so it can be shared between x86 and ARM64. - Overhaul of the Tegra driver - Delay timer support for IXP4xx - Small fixes, improvements and cleanups as usual" * 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (71 commits) time: Validate user input in compat_settimeofday() timer: Document TIMER_PINNED clocksource/drivers: Continue making Hyper-V clocksource ISA agnostic clocksource/drivers: Make Hyper-V clocksource ISA agnostic MAINTAINERS: Fix Andy's surname and the directory entries of VDSO hrtimer: Use a bullet for the returns bullet list arm64: vdso: Fix compilation with clang older than 8 arm64: compat: Fix __arch_get_hw_counter() implementation arm64: Fix __arch_get_hw_counter() implementation lib/vdso: Make delta calculation work correctly MAINTAINERS: Add entry for the generic VDSO library arm64: compat: No need for pre-ARMv7 barriers on an ARMv8 system arm64: vdso: Remove unnecessary asm-offsets.c definitions vdso: Remove superfluous #ifdef __KERNEL__ in vdso/datapage.h clocksource/drivers/davinci: Add support for clocksource clocksource/drivers/davinci: Add support for clockevents clocksource/drivers/tegra: Set up maximum-ticks limit properly clocksource/drivers/tegra: Cycles can't be 0 clocksource/drivers/tegra: Restore base address before cleanup clocksource/drivers/tegra: Add verbose definition for 1MHz constant ...
2019-07-03clocksource/drivers: Continue making Hyper-V clocksource ISA agnosticMichael Kelley1-0/+1
Continue consolidating Hyper-V clock and timer code into an ISA independent Hyper-V clocksource driver. Move the existing clocksource code under drivers/hv and arch/x86 to the new clocksource driver while separating out the ISA dependencies. Update Hyper-V initialization to call initialization and cleanup routines since the Hyper-V synthetic clock is not independently enumerated in ACPI. Update Hyper-V clocksource users in KVM and VDSO to get definitions from the new include file. No behavior is changed and no new functionality is added. Suggested-by: Marc Zyngier <marc.zyngier@arm.com> Signed-off-by: Michael Kelley <mikelley@microsoft.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: "bp@alien8.de" <bp@alien8.de> Cc: "will.deacon@arm.com" <will.deacon@arm.com> Cc: "catalin.marinas@arm.com" <catalin.marinas@arm.com> Cc: "mark.rutland@arm.com" <mark.rutland@arm.com> Cc: "linux-arm-kernel@lists.infradead.org" <linux-arm-kernel@lists.infradead.org> Cc: "gregkh@linuxfoundation.org" <gregkh@linuxfoundation.org> Cc: "linux-hyperv@vger.kernel.org" <linux-hyperv@vger.kernel.org> Cc: "olaf@aepfle.de" <olaf@aepfle.de> Cc: "apw@canonical.com" <apw@canonical.com> Cc: "jasowang@redhat.com" <jasowang@redhat.com> Cc: "marcelo.cerri@canonical.com" <marcelo.cerri@canonical.com> Cc: Sunil Muthuswamy <sunilmut@microsoft.com> Cc: KY Srinivasan <kys@microsoft.com> Cc: "sashal@kernel.org" <sashal@kernel.org> Cc: "vincenzo.frascino@arm.com" <vincenzo.frascino@arm.com> Cc: "linux-arch@vger.kernel.org" <linux-arch@vger.kernel.org> Cc: "linux-mips@vger.kernel.org" <linux-mips@vger.kernel.org> Cc: "linux-kselftest@vger.kernel.org" <linux-kselftest@vger.kernel.org> Cc: "arnd@arndb.de" <arnd@arndb.de> Cc: "linux@armlinux.org.uk" <linux@armlinux.org.uk> Cc: "ralf@linux-mips.org" <ralf@linux-mips.org> Cc: "paul.burton@mips.com" <paul.burton@mips.com> Cc: "daniel.lezcano@linaro.org" <daniel.lezcano@linaro.org> Cc: "salyzyn@android.com" <salyzyn@android.com> Cc: "pcc@google.com" <pcc@google.com> Cc: "shuah@kernel.org" <shuah@kernel.org> Cc: "0x7f454c46@gmail.com" <0x7f454c46@gmail.com> Cc: "linux@rasmusvillemoes.dk" <linux@rasmusvillemoes.dk> Cc: "huw@codeweavers.com" <huw@codeweavers.com> Cc: "sfr@canb.auug.org.au" <sfr@canb.auug.org.au> Cc: "pbonzini@redhat.com" <pbonzini@redhat.com> Cc: "rkrcmar@redhat.com" <rkrcmar@redhat.com> Cc: "kvm@vger.kernel.org" <kvm@vger.kernel.org> Link: https://lkml.kernel.org/r/1561955054-1838-3-git-send-email-mikelley@microsoft.com
2019-07-03Merge branch 'timers/vdso' into timers/coreThomas Gleixner1-4/+1
so the hyper-v clocksource update can be applied.
2019-07-02KVM: x86: degrade WARN to pr_warn_ratelimitedPaolo Bonzini1-3/+3
This warning can be triggered easily by userspace, so it should certainly not cause a panic if panic_on_warn is set. Reported-by: syzbot+c03f30b4f4c46bdf8575@syzkaller.appspotmail.com Suggested-by: Alexander Potapenko <glider@google.com> Acked-by: Alexander Potapenko <glider@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-07-02KVM: X86: Implement PV sched yield hypercallWanpeng Li1-0/+21
The target vCPUs are in runnable state after vcpu_kick and suitable as a yield target. This patch implements the sched yield hypercall. 17% performance increasement of ebizzy benchmark can be observed in an over-subscribe environment. (w/ kvm-pv-tlb disabled, testing TLB flush call-function IPI-many since call-function is not easy to be trigged by userspace workload). Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Cc: Liran Alon <liran.alon@oracle.com> Signed-off-by: Wanpeng Li <wanpengli@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-07-02KVM: nVMX: list VMX MSRs in KVM_GET_MSR_INDEX_LISTPaolo Bonzini1-0/+20
This allows userspace to know which MSRs are supported by the hypervisor. Unfortunately userspace must resort to tricks for everything except MSR_IA32_VMX_VMFUNC (which was just added in the previous patch). One possibility is to use the feature control MSR, which is tied to nested VMX as well and is present on all KVM versions that support feature MSRs. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-06-22timekeeping: Use proper clock specifier names in functionsJason A. Donenfeld1-6/+6
This makes boot uniformly boottime and tai uniformly clocktai, to address the remaining oversights. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Arnd Bergmann <arnd@arndb.de> Link: https://lkml.kernel.org/r/20190621203249.3909-2-Jason@zx2c4.com
2019-06-19treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 499Thomas Gleixner1-4/+1
Based on 1 normalized pattern(s): this work is licensed under the terms of the gnu gpl version 2 see the copying file in the top level directory extracted by the scancode license scanner the SPDX license identifier GPL-2.0-only has been chosen to replace the boilerplate/reference in 35 file(s). Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org> Reviewed-by: Enrico Weigelt <info@metux.net> Reviewed-by: Allison Randal <allison@lohutok.net> Cc: linux-spdx@vger.kernel.org Link: https://lkml.kernel.org/r/20190604081206.797835076@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-18KVM: x86: introduce is_pae_pagingPaolo Bonzini1-4/+4
Checking for 32-bit PAE is quite common around code that fiddles with the PDPTRs. Add a function to compress all checks into a single invocation. Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-06-18KVM: VMX: Handle NMIs, #MCs and async #PFs in common irqs-disabled fnSean Christopherson1-1/+1
Per commit 1b6269db3f833 ("KVM: VMX: Handle NMIs before enabling interrupts and preemption"), NMIs are handled directly in vmx_vcpu_run() to "make sure we handle NMI on the current cpu, and that we don't service maskable interrupts before non-maskable ones". The other exceptions handled by complete_atomic_exit(), e.g. async #PF and #MC, have similar requirements, and are located there to avoid extra VMREADs since VMX bins hardware exceptions and NMIs into a single exit reason. Clean up the code and eliminate the vaguely named complete_atomic_exit() by moving the interrupts-disabled exception and NMI handling into the existing handle_external_intrs() callback, and rename the callback to a more appropriate name. Rename VMexit handlers throughout so that the atomic and non-atomic counterparts have similar names. In addition to improving code readability, this also ensures the NMI handler is run with the host's debug registers loaded in the unlikely event that the user is debugging NMIs. Accuracy of the last_guest_tsc field is also improved when handling NMIs (and #MCs) as the handler will run after updating said field. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> [Naming cleanups. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-06-18KVM: x86: Move kvm_{before,after}_interrupt() calls to vendor codeSean Christopherson1-2/+0
VMX can conditionally call kvm_{before,after}_interrupt() since KVM always uses "ack interrupt on exit" and therefore explicitly handles interrupts as opposed to blindly enabling irqs. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-06-18KVM: x86: move MSR_IA32_POWER_CTL handling to common codePaolo Bonzini1-0/+6
Make it available to AMD hosts as well, just in case someone is trying to use an Intel processor's CPUID setup. Suggested-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-06-18kvm: x86: add host poll control msrsMarcelo Tosatti1-0/+23
Add an MSRs which allows the guest to disable host polling (specifically the cpuidle-haltpoll, when performing polling in the guest, disables host side polling). Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-06-18KVM: x86: Use DR_TRAP_BITS instead of hard-coded 15Liran Alon1-1/+1
Make all code consistent with kvm_deliver_exception_payload() by using appropriate symbolic constant instead of hard-coded number. Reviewed-by: Nikita Leshenko <nikita.leshchenko@oracle.com> Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com> Signed-off-by: Liran Alon <liran.alon@oracle.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-06-13KVM: x86: clean up conditions for asynchronous page fault handlingPaolo Bonzini1-5/+42
Even when asynchronous page fault is disabled, KVM does not want to pause the host if a guest triggers a page fault; instead it will put it into an artificial HLT state that allows running other host processes while allowing interrupt delivery into the guest. However, the way this feature is triggered is a bit confusing. First, it is not used for page faults while a nested guest is running: but this is not an issue since the artificial halt is completely invisible to the guest, either L1 or L2. Second, it is used even if kvm_halt_in_guest() returns true; in this case, the guest probably should not pay the additional latency cost of the artificial halt, and thus we should handle the page fault in a completely synchronous way. By introducing a new function kvm_can_deliver_async_pf, this patch commonizes the code that chooses whether to deliver an async page fault (kvm_arch_async_page_not_present) and the code that chooses whether a page fault should be handled synchronously (kvm_can_do_async_pf). Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-06-05kvm: Convert kvm_lock to a mutexJunaid Shahid1-7/+7
It doesn't seem as if there is any particular need for kvm_lock to be a spinlock, so convert the lock to a mutex so that sleepable functions (in particular cond_resched()) can be called while holding it. Signed-off-by: Junaid Shahid <junaids@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-06-04KVM: X86: Emulate MSR_IA32_MISC_ENABLE MWAIT bitWanpeng Li1-1/+9
MSR IA32_MISC_ENABLE bit 18, according to SDM: | When this bit is set to 0, the MONITOR feature flag is not set (CPUID.01H:ECX[bit 3] = 0). | This indicates that MONITOR/MWAIT are not supported. | | Software attempts to execute MONITOR/MWAIT will cause #UD when this bit is 0. | | When this bit is set to 1 (default), MONITOR/MWAIT are supported (CPUID.01H:ECX[bit 3] = 1). The CPUID.01H:ECX[bit 3] ought to mirror the value of the MSR bit, CPUID.01H:ECX[bit 3] is a better guard than kvm_mwait_in_guest(). kvm_mwait_in_guest() affects the behavior of MONITOR/MWAIT, not its guest visibility. This patch implements toggling of the CPUID bit based on guest writes to the MSR. Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Cc: Sean Christopherson <sean.j.christopherson@intel.com> Cc: Liran Alon <liran.alon@oracle.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Wanpeng Li <wanpengli@tencent.com> [Fixes for backwards compatibility - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-06-04KVM: X86: Provide a capability to disable cstate msr read interceptsWanpeng Li1-1/+4
Allow guest reads CORE cstate when exposing host CPU power management capabilities to the guest. PKG cstate is restricted to avoid a guest to get the whole package information in multi-tenant scenario. Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Cc: Sean Christopherson <sean.j.christopherson@intel.com> Cc: Liran Alon <liran.alon@oracle.com> Signed-off-by: Wanpeng Li <wanpengli@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-06-04kvm: x86: refine kvm_get_arch_capabilities()Xiaoyao Li1-4/+4
1. Using X86_FEATURE_ARCH_CAPABILITIES to enumerate the existence of MSR_IA32_ARCH_CAPABILITIES to avoid using rdmsrl_safe(). 2. Since kvm_get_arch_capabilities() is only used in this file, making it static. Signed-off-by: Xiaoyao Li <xiaoyao.li@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-06-04KVM: Directly return result from kvm_arch_check_processor_compat()Sean Christopherson1-2/+2
Add a wrapper to invoke kvm_arch_check_processor_compat() so that the boilerplate ugliness of checking virtualization support on all CPUs is hidden from the arch specific code. x86's implementation in particular is quite heinous, as it unnecessarily propagates the out-param pattern into kvm_x86_ops. While the x86 specific issue could be resolved solely by changing kvm_x86_ops, make the change for all architectures as returning a value directly is prettier and technically more robust, e.g. s390 doesn't set the out param, which could lead to subtle breakage in the (highly unlikely) scenario where the out-param was not pre-initialized by the caller. Opportunistically annotate svm_check_processor_compat() with __init. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Reviewed-by: Cornelia Huck <cohuck@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-06-04KVM: LAPIC: Optimize timer latency furtherWanpeng Li1-3/+0
Advance lapic timer tries to hidden the hypervisor overhead between the host emulated timer fires and the guest awares the timer is fired. However, it just hidden the time between apic_timer_fn/handle_preemption_timer -> wait_lapic_expire, instead of the real position of vmentry which is mentioned in the orignial commit d0659d946be0 ("KVM: x86: add option to advance tscdeadline hrtimer expiration"). There is 700+ cpu cycles between the end of wait_lapic_expire and before world switch on my haswell desktop. This patch tries to narrow the last gap(wait_lapic_expire -> world switch), it takes the real overhead time between apic_timer_fn/handle_preemption_timer and before world switch into consideration when adaptively tuning timer advancement. The patch can reduce 40% latency (~1600+ cycles to ~1000+ cycles on a haswell desktop) for kvm-unit-tests/tscdeadline_latency when testing busy waits. Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Cc: Sean Christopherson <sean.j.christopherson@intel.com> Cc: Liran Alon <liran.alon@oracle.com> Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Wanpeng Li <wanpengli@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>