summaryrefslogtreecommitdiff
path: root/arch/x86/kvm/x86.c
AgeCommit message (Collapse)AuthorFilesLines
2020-08-04Merge tag 'uninit-macro-v5.9-rc1' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull uninitialized_var() macro removal from Kees Cook: "This is long overdue, and has hidden too many bugs over the years. The series has several "by hand" fixes, and then a trivial treewide replacement. - Clean up non-trivial uses of uninitialized_var() - Update documentation and checkpatch for uninitialized_var() removal - Treewide removal of uninitialized_var()" * tag 'uninit-macro-v5.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: compiler: Remove uninitialized_var() macro treewide: Remove uninitialized_var() usage checkpatch: Remove awareness of uninitialized_var() macro mm/debug_vm_pgtable: Remove uninitialized_var() usage f2fs: Eliminate usage of uninitialized_var() macro media: sur40: Remove uninitialized_var() usage KVM: PPC: Book3S PR: Remove uninitialized_var() usage clk: spear: Remove uninitialized_var() usage clk: st: Remove uninitialized_var() usage spi: davinci: Remove uninitialized_var() usage ide: Remove uninitialized_var() usage rtlwifi: rtl8192cu: Remove uninitialized_var() usage b43: Remove uninitialized_var() usage drbd: Remove uninitialized_var() usage x86/mm/numa: Remove uninitialized_var() usage docs: deprecated.rst: Add uninitialized_var()
2020-07-16treewide: Remove uninitialized_var() usageKees Cook1-1/+1
Using uninitialized_var() is dangerous as it papers over real bugs[1] (or can in the future), and suppresses unrelated compiler warnings (e.g. "unused variable"). If the compiler thinks it is uninitialized, either simply initialize the variable or make compiler changes. In preparation for removing[2] the[3] macro[4], remove all remaining needless uses with the following script: git grep '\buninitialized_var\b' | cut -d: -f1 | sort -u | \ xargs perl -pi -e \ 's/\buninitialized_var\(([^\)]+)\)/\1/g; s:\s*/\* (GCC be quiet|to make compiler happy) \*/$::g;' drivers/video/fbdev/riva/riva_hw.c was manually tweaked to avoid pathological white-space. No outstanding warnings were found building allmodconfig with GCC 9.3.0 for x86_64, i386, arm64, arm, powerpc, powerpc64le, s390x, mips, sparc64, alpha, and m68k. [1] https://lore.kernel.org/lkml/20200603174714.192027-1-glider@google.com/ [2] https://lore.kernel.org/lkml/CA+55aFw+Vbj0i=1TGqCR5vQkCzWJ0QxK6CernOU6eedsudAixw@mail.gmail.com/ [3] https://lore.kernel.org/lkml/CA+55aFwgbgqhbp1fkxvRKEpzyR5J8n1vKT1VZdz9knmPuXhOeg@mail.gmail.com/ [4] https://lore.kernel.org/lkml/CA+55aFz2500WfbKXAx8s67wrm9=yVJu65TpLgN_ybYNv0VEOKA@mail.gmail.com/ Reviewed-by: Leon Romanovsky <leonro@mellanox.com> # drivers/infiniband and mlx4/mlx5 Acked-by: Jason Gunthorpe <jgg@mellanox.com> # IB Acked-by: Kalle Valo <kvalo@codeaurora.org> # wireless drivers Reviewed-by: Chao Yu <yuchao0@huawei.com> # erofs Signed-off-by: Kees Cook <keescook@chromium.org>
2020-07-03KVM: x86: Inject #GP if guest attempts to toggle CR4.LA57 in 64-bit modeSean Christopherson1-0/+2
Inject a #GP on MOV CR4 if CR4.LA57 is toggled in 64-bit mode, which is illegal per Intel's SDM: CR4.LA57 57-bit linear addresses (bit 12 of CR4) ... blah blah blah ... This bit cannot be modified in IA-32e mode. Note, the pseudocode for MOV CR doesn't call out the fault condition, which is likely why the check was missed during initial development. This is arguably an SDM bug and will hopefully be fixed in future release of the SDM. Fixes: fd8cb433734ee ("KVM: MMU: Expose the LA57 feature to VM.") Cc: stable@vger.kernel.org Reported-by: Sebastien Boeuf <sebastien.boeuf@intel.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200703021714.5549-1-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-06-29KVM: X86: Fix async pf caused null-ptr-derefWanpeng Li1-0/+3
Syzbot reported that: CPU: 1 PID: 6780 Comm: syz-executor153 Not tainted 5.7.0-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 RIP: 0010:__apic_accept_irq+0x46/0xb80 Call Trace: kvm_arch_async_page_present+0x7de/0x9e0 kvm_check_async_pf_completion+0x18d/0x400 kvm_arch_vcpu_ioctl_run+0x18bf/0x69f0 kvm_vcpu_ioctl+0x46a/0xe20 ksys_ioctl+0x11a/0x180 __x64_sys_ioctl+0x6f/0xb0 do_syscall_64+0xf6/0x7d0 entry_SYSCALL_64_after_hwframe+0x49/0xb3 The testcase enables APF mechanism in MSR_KVM_ASYNC_PF_EN with ASYNC_PF_INT enabled w/o setting MSR_KVM_ASYNC_PF_INT before, what's worse, interrupt based APF 'page ready' event delivery depends on in kernel lapic, however, we didn't bail out when lapic is not in kernel during guest setting MSR_KVM_ASYNC_PF_EN which causes the null-ptr-deref in host later. This patch fixes it. Reported-by: syzbot+1bf777dfdde86d64b89b@syzkaller.appspotmail.com Fixes: 2635b5c4a0 (KVM: x86: interrupt based APF 'page ready' event delivery) Signed-off-by: Wanpeng Li <wanpengli@tencent.com> Message-Id: <1593426391-8231-1-git-send-email-wanpengli@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-06-23KVM: x86: allow TSC to differ by NTP correction bounds without TSC scalingMarcelo Tosatti1-1/+2
The Linux TSC calibration procedure is subject to small variations (its common to see +-1 kHz difference between reboots on a given CPU, for example). So migrating a guest between two hosts with identical processor can fail, in case of a small variation in calibrated TSC between them. Without TSC scaling, the current kernel interface will either return an error (if user_tsc_khz <= tsc_khz) or enable TSC catchup mode. This change enables the following TSC tolerance check to accept KVM_SET_TSC_KHZ within tsc_tolerance_ppm (which is 250ppm by default). /* * Compute the variation in TSC rate which is acceptable * within the range of tolerance and decide if the * rate being applied is within that bounds of the hardware * rate. If so, no scaling or compensation need be done. */ thresh_lo = adjust_tsc_khz(tsc_khz, -tsc_tolerance_ppm); thresh_hi = adjust_tsc_khz(tsc_khz, tsc_tolerance_ppm); if (user_tsc_khz < thresh_lo || user_tsc_khz > thresh_hi) { pr_debug("kvm: requested TSC rate %u falls outside tolerance [%u,%u]\n", user_tsc_khz, thresh_lo, thresh_hi); use_scaling = 1; } NTP daemon in the guest can correct this difference (NTP can correct upto 500ppm). Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Message-Id: <20200616114741.GA298183@fuller.cnet> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-06-23KVM: X86: Fix MSR range of APIC registers in X2APIC modeXiaoyao Li1-2/+2
Only MSR address range 0x800 through 0x8ff is architecturally reserved and dedicated for accessing APIC registers in x2APIC mode. Fixes: 0105d1a52640 ("KVM: x2apic interface to lapic") Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com> Message-Id: <20200616073307.16440-1-xiaoyao.li@intel.com> Cc: stable@vger.kernel.org Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com> Reviewed-by: Jim Mattson <jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-06-11KVM: x86: do not pass poisoned hva to __kvm_set_memory_regionPaolo Bonzini1-6/+1
__kvm_set_memory_region does not use the hva at all, so trying to catch use-after-delete is pointless and, worse, it fails access_ok now that we apply it to all memslots including private kernel ones. This fixes an AVIC regression. Fixes: 09d952c971a5 ("KVM: check userspace_addr for all memslots") Reported-by: Maxim Levitsky <mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-06-11KVM: async_pf: Inject 'page ready' event only if 'page not present' was ↵Vitaly Kuznetsov1-2/+5
previously injected 'Page not present' event may or may not get injected depending on guest's state. If the event wasn't injected, there is no need to inject the corresponding 'page ready' event as the guest may get confused. E.g. Linux thinks that the corresponding 'page not present' event wasn't delivered *yet* and allocates a 'dummy entry' for it. This entry is never freed. Note, 'wakeup all' events have no corresponding 'page not present' event and always get injected. s390 seems to always be able to inject 'page not present', the change is effectively a nop. Suggested-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20200610175532.779793-2-vkuznets@redhat.com> Fixes: https://bugzilla.kernel.org/show_bug.cgi?id=208081 Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-06-11KVM: x86: respect singlestep when emulating instructionFelipe Franciosi1-1/+1
When userspace configures KVM_GUESTDBG_SINGLESTEP, KVM will manage the presence of X86_EFLAGS_TF via kvm_set/get_rflags on vcpus. The actual rflag bit is therefore hidden from callers. That includes init_emulate_ctxt() which uses the value returned from kvm_get_flags() to set ctxt->tf. As a result, x86_emulate_instruction() will skip a single step, leaving singlestep_rip stale and not returning to userspace. This resolves the issue by observing the vcpu guest_debug configuration alongside ctxt->tf in x86_emulate_instruction(), performing the single step if set. Cc: stable@vger.kernel.org Signed-off-by: Felipe Franciosi <felipe@nutanix.com> Message-Id: <20200519081048.8204-1-felipe@nutanix.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-06-09KVM: x86: Unexport x86_fpu_cache and make it staticSean Christopherson1-2/+1
Make x86_fpu_cache static now that FPU allocation and destruction is handled entirely by common x86 code. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200608180218.20946-1-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-06-08KVM: x86: Fix APIC page invalidation raceEiichi Tsukata1-5/+2
Commit b1394e745b94 ("KVM: x86: fix APIC page invalidation") tried to fix inappropriate APIC page invalidation by re-introducing arch specific kvm_arch_mmu_notifier_invalidate_range() and calling it from kvm_mmu_notifier_invalidate_range_start. However, the patch left a possible race where the VMCS APIC address cache is updated *before* it is unmapped: (Invalidator) kvm_mmu_notifier_invalidate_range_start() (Invalidator) kvm_make_all_cpus_request(kvm, KVM_REQ_APIC_PAGE_RELOAD) (KVM VCPU) vcpu_enter_guest() (KVM VCPU) kvm_vcpu_reload_apic_access_page() (Invalidator) actually unmap page Because of the above race, there can be a mismatch between the host physical address stored in the APIC_ACCESS_PAGE VMCS field and the host physical address stored in the EPT entry for the APIC GPA (0xfee0000). When this happens, the processor will not trap APIC accesses, and will instead show the raw contents of the APIC-access page. Because Windows OS periodically checks for unexpected modifications to the LAPIC register, this will show up as a BSOD crash with BugCheck CRITICAL_STRUCTURE_CORRUPTION (109) we are currently seeing in https://bugzilla.redhat.com/show_bug.cgi?id=1751017. The root cause of the issue is that kvm_arch_mmu_notifier_invalidate_range() cannot guarantee that no additional references are taken to the pages in the range before kvm_mmu_notifier_invalidate_range_end(). Fortunately, this case is supported by the MMU notifier API, as documented in include/linux/mmu_notifier.h: * If the subsystem * can't guarantee that no additional references are taken to * the pages in the range, it has to implement the * invalidate_range() notifier to remove any references taken * after invalidate_range_start(). The fix therefore is to reload the APIC-access page field in the VMCS from kvm_mmu_notifier_invalidate_range() instead of ..._range_start(). Cc: stable@vger.kernel.org Fixes: b1394e745b94 ("KVM: x86: fix APIC page invalidation") Fixes: https://bugzilla.kernel.org/show_bug.cgi?id=197951 Signed-off-by: Eiichi Tsukata <eiichi.tsukata@nutanix.com> Message-Id: <20200606042627.61070-1-eiichi.tsukata@nutanix.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-06-08Revert "KVM: x86: work around leak of uninitialized stack contents"Vitaly Kuznetsov1-7/+0
handle_vmptrst()/handle_vmread() stopped injecting #PF unconditionally and switched to nested_vmx_handle_memory_failure() which just kills the guest with KVM_EXIT_INTERNAL_ERROR in case of MMIO access, zeroing 'exception' in kvm_write_guest_virt_system() is not needed anymore. This reverts commit 541ab2aeb28251bf7135c7961f3a6080eebcc705. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20200605115906.532682-2-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-06-04KVM: x86: minor code refactor and comments fixup around dirty loggingAnthony Yznaga1-55/+49
Consolidate the code and correct the comments to show that the actions taken to update existing mappings to disable or enable dirty logging are not necessary when creating, moving, or deleting a memslot. Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com> Message-Id: <1591128450-11977-4-git-send-email-anthony.yznaga@oracle.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-06-04KVM: x86: avoid unnecessary rmap walks when creating/moving slotsAnthony Yznaga1-1/+1
On large memory guests it has been observed that creating a memslot for a very large range can take noticeable amount of time. Investigation showed that the time is spent walking the rmaps to update existing sptes to remove write access or set/clear dirty bits to support dirty logging. These rmap walks are unnecessary when creating or moving a memslot. A newly created memslot will not have any existing mappings, and the existing mappings of a moved memslot will have been invalidated and flushed. Any mappings established once the new/moved memslot becomes visible will be set using the properties of the new slot. Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com> Message-Id: <1591128450-11977-3-git-send-email-anthony.yznaga@oracle.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-06-04KVM: x86: remove unnecessary rmap walk of read-only memslotsAnthony Yznaga1-4/+2
There's no write access to remove. An existing memslot cannot be updated to set or clear KVM_MEM_READONLY, and any mappings established in a newly created or moved read-only memslot will already be read-only. Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com> Message-Id: <1591128450-11977-2-git-send-email-anthony.yznaga@oracle.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-06-01x86/kvm/hyper-v: Add support for synthetic debugger interfaceJon Doron1-0/+8
Add support for Hyper-V synthetic debugger (syndbg) interface. The syndbg interface is using MSRs to emulate a way to send/recv packets data. The debug transport dll (kdvm/kdnet) will identify if Hyper-V is enabled and if it supports the synthetic debugger interface it will attempt to use it, instead of trying to initialize a network adapter. Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Jon Doron <arilou@gmail.com> Message-Id: <20200529134543.1127440-4-arilou@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-06-01KVM: x86/pmu: Support full width countingLike Xu1-0/+2
Intel CPUs have a new alternative MSR range (starting from MSR_IA32_PMC0) for GP counters that allows writing the full counter width. Enable this range from a new capability bit (IA32_PERF_CAPABILITIES.FW_WRITE[bit 13]). The guest would query CPUID to get the counter width, and sign extends the counter values as needed. The traditional MSRs always limit to 32bit, even though the counter internally is larger (48 or 57 bits). When the new capability is set, use the alternative range which do not have these restrictions. This lowers the overhead of perf stat slightly because it has to do less interrupts to accumulate the counter value. Signed-off-by: Like Xu <like.xu@linux.intel.com> Message-Id: <20200529074347.124619-3-like.xu@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-06-01KVM: x86/pmu: Tweak kvm_pmu_get_msr to pass 'struct msr_data' inWei Wang1-2/+2
Change kvm_pmu_get_msr() to get the msr_data struct, as the host_initiated field from the struct could be used by get_msr. This also makes this API consistent with kvm_pmu_set_msr. No functional changes. Signed-off-by: Wei Wang <wei.w.wang@intel.com> Message-Id: <20200529074347.124619-2-like.xu@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-06-01KVM: x86: announce KVM_FEATURE_ASYNC_PF_INTVitaly Kuznetsov1-0/+1
Introduce new capability to indicate that KVM supports interrupt based delivery of 'page ready' APF events. This includes support for both MSR_KVM_ASYNC_PF_INT and MSR_KVM_ASYNC_PF_ACK. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20200525144125.143875-8-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-06-01KVM: x86: acknowledgment mechanism for async pf page ready notificationsVitaly Kuznetsov1-4/+22
If two page ready notifications happen back to back the second one is not delivered and the only mechanism we currently have is kvm_check_async_pf_completion() check in vcpu_run() loop. The check will only be performed with the next vmexit when it happens and in some cases it may take a while. With interrupt based page ready notification delivery the situation is even worse: unlike exceptions, interrupts are not handled immediately so we must check if the slot is empty. This is slow and unnecessary. Introduce dedicated MSR_KVM_ASYNC_PF_ACK MSR to communicate the fact that the slot is free and host should check its notification queue. Mandate using it for interrupt based 'page ready' APF event delivery. As kvm_check_async_pf_completion() is going away from vcpu_run() we need a way to communicate the fact that vcpu->async_pf.done queue has transitioned from empty to non-empty state. Introduce kvm_arch_async_page_present_queued() and KVM_REQ_APF_READY to do the job. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20200525144125.143875-7-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-06-01KVM: x86: interrupt based APF 'page ready' event deliveryVitaly Kuznetsov1-26/+67
Concerns were expressed around APF delivery via synthetic #PF exception as in some cases such delivery may collide with real page fault. For 'page ready' notifications we can easily switch to using an interrupt instead. Introduce new MSR_KVM_ASYNC_PF_INT mechanism and deprecate the legacy one. One notable difference between the two mechanisms is that interrupt may not get handled immediately so whenever we would like to deliver next event (regardless of its type) we must be sure the guest had read and cleared previous event in the slot. While on it, get rid on 'type 1/type 2' names for APF events in the documentation as they are causing confusion. Use 'page not present' and 'page ready' everywhere instead. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20200525144125.143875-6-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-06-01KVM: rename kvm_arch_can_inject_async_page_present() to ↵Vitaly Kuznetsov1-1/+1
kvm_arch_can_dequeue_async_page_present() An innocent reader of the following x86 KVM code: bool kvm_arch_can_inject_async_page_present(struct kvm_vcpu *vcpu) { if (!(vcpu->arch.apf.msr_val & KVM_ASYNC_PF_ENABLED)) return true; ... may get very confused: if APF mechanism is not enabled, why do we report that we 'can inject async page present'? In reality, upon injection kvm_arch_async_page_present() will check the same condition again and, in case APF is disabled, will just drop the item. This is fine as the guest which deliberately disabled APF doesn't expect to get any APF notifications. Rename kvm_arch_can_inject_async_page_present() to kvm_arch_can_dequeue_async_page_present() to make it clear what we are checking: if the item can be dequeued (meaning either injected or just dropped). On s390 kvm_arch_can_inject_async_page_present() always returns 'true' so the rename doesn't matter much. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20200525144125.143875-4-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-06-01KVM: x86: extend struct kvm_vcpu_pv_apf_data with token infoVitaly Kuznetsov1-4/+13
Currently, APF mechanism relies on the #PF abuse where the token is being passed through CR2. If we switch to using interrupts to deliver page-ready notifications we need a different way to pass the data. Extent the existing 'struct kvm_vcpu_pv_apf_data' with token information for page-ready notifications. While on it, rename 'reason' to 'flags'. This doesn't change the semantics as we only have reasons '1' and '2' and these can be treated as bit flags but KVM_PV_REASON_PAGE_READY is going away with interrupt based delivery making 'reason' name misleading. The newly introduced apf_put_user_ready() temporary puts both flags and token information, this will be changed to put token only when we switch to interrupt based notifications. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20200525144125.143875-3-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-06-01Revert "KVM: async_pf: Fix #DF due to inject "Page not Present" and "Page ↵Vitaly Kuznetsov1-22/+1
Ready" exceptions simultaneously" Commit 9a6e7c39810e (""KVM: async_pf: Fix #DF due to inject "Page not Present" and "Page Ready" exceptions simultaneously") added a protection against 'page ready' notification coming before 'page not present' is delivered. This situation seems to be impossible since commit 2a266f23550b ("KVM MMU: check pending exception before injecting APF) which added 'vcpu->arch.exception.pending' check to kvm_can_do_async_pf. On x86, kvm_arch_async_page_present() has only one call site: kvm_check_async_pf_completion() loop and we only enter the loop when kvm_arch_can_inject_async_page_present(vcpu) which when async pf msr is enabled, translates into kvm_can_do_async_pf(). There is also one problem with the cancellation mechanism. We don't seem to check that the 'page not present' notification we're canceling matches the 'page ready' notification so in theory, we may erroneously drop two valid events. Revert the commit. Reviewed-by: Gavin Shan <gshan@redhat.com> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20200525144125.143875-2-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-06-01KVM: nSVM: implement KVM_GET_NESTED_STATE and KVM_SET_NESTED_STATEPaolo Bonzini1-1/+2
Similar to VMX, the state that is captured through the currently available IOCTLs is a mix of L1 and L2 state, dependent on whether the L2 guest was running at the moment when the process was interrupted to save its state. In particular, the SVM-specific state for nested virtualization includes the L1 saved state (including the interrupt flag), the cached L2 controls, and the GIF. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-28KVM: nSVM: inject exceptions via svm_check_nested_eventsPaolo Bonzini1-11/+2
This allows exceptions injected by the emulator to be properly delivered as vmexits. The code also becomes simpler, because we can just let all L0-intercepted exceptions go through the usual path. In particular, our emulation of the VMX #DB exit qualification is very much simplified, because the vmexit injection path can use kvm_deliver_exception_payload to update DR6. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-28KVM: x86: enable event window in inject_pending_eventPaolo Bonzini1-51/+66
In case an interrupt arrives after nested.check_events but before the call to kvm_cpu_has_injectable_intr, we could end up enabling the interrupt window even if the interrupt is actually going to be a vmexit. This is useless rather than harmful, but it really complicates reasoning about SVM's handling of the VINTR intercept. We'd like to never bother with the VINTR intercept if V_INTR_MASKING=1 && INTERCEPT_INTR=1, because in that case there is no interrupt window and we can just exit the nested guest whenever we want. This patch moves the opening of the interrupt window inside inject_pending_event. This consolidates the check for pending interrupt/NMI/SMI in one place, and makes KVM's usage of immediate exits more consistent, extending it beyond just nested virtualization. There are two functional changes here. They only affect corner cases, but overall they simplify the inject_pending_event. - re-injection of still-pending events will also use req_immediate_exit instead of using interrupt-window intercepts. This should have no impact on performance on Intel since it simply replaces an interrupt-window or NMI-window exit for a preemption-timer exit. On AMD, which has no equivalent of the preemption time, it may incur some overhead but an actual effect on performance should only be visible in pathological cases. - kvm_arch_interrupt_allowed and kvm_vcpu_has_events will return true if an interrupt, NMI or SMI is blocked by nested_run_pending. This makes sense because entering the VM will allow it to make progress and deliver the event. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-27KVM: x86: track manually whether an event has been injectedPaolo Bonzini1-5/+12
Instead of calling kvm_event_needs_reinjection, track its future return value in a variable. This will be useful in the next patch. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-27KVM: x86: Initialize tdp_level during vCPU creationSean Christopherson1-0/+1
Initialize vcpu->arch.tdp_level during vCPU creation to avoid consuming garbage if userspace calls KVM_RUN without first calling KVM_SET_CPUID. Fixes: e93fd3b3e89e9 ("KVM: x86/mmu: Capture TDP level when updating CPUID") Reported-by: syzbot+904752567107eefb728c@syzkaller.appspotmail.com Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200527085400.23759-1-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-27kvm/x86: Remove redundant function implementations彭浩(Richard)1-0/+1
pic_in_kernel(), ioapic_in_kernel() and irqchip_kernel() have the same implementation. Signed-off-by: Peng Hao <richard.peng@oppo.com> Message-Id: <HKAPR02MB4291D5926EA10B8BFE9EA0D3E0B70@HKAPR02MB4291.apcprd02.prod.outlook.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-27KVM: Fix the indentation to match coding styleHaiwei Li1-1/+1
There is a bad indentation in next&queue branch. The patch looks like fixes nothing though it fixes the indentation. Before fixing: if (!handle_fastpath_set_x2apic_icr_irqoff(vcpu, data)) { kvm_skip_emulated_instruction(vcpu); ret = EXIT_FASTPATH_EXIT_HANDLED; } break; case MSR_IA32_TSCDEADLINE: After fixing: if (!handle_fastpath_set_x2apic_icr_irqoff(vcpu, data)) { kvm_skip_emulated_instruction(vcpu); ret = EXIT_FASTPATH_EXIT_HANDLED; } break; case MSR_IA32_TSCDEADLINE: Signed-off-by: Haiwei Li <lihaiwei@tencent.com> Message-Id: <2f78457e-f3a7-3bc9-e237-3132ee87f71e@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-27KVM: x86: Remove superfluous brackets from case statementSean Christopherson1-2/+2
Remove unnecessary brackets from a case statement that unintentionally encapsulates unrelated case statements in the same switch statement. While technically legal and functionally correct syntax, the brackets are visually confusing and potentially dangerous, e.g. the last of the encapsulated case statements has an undocumented fall-through that isn't flagged by compilers due the encapsulation. Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200218234012.7110-2-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-27KVM: x86: allow KVM_STATE_NESTED_MTF_PENDING in kvm_state flagsPaolo Bonzini1-1/+1
The migration functionality was left incomplete in commit 5ef8acbdd687 ("KVM: nVMX: Emulate MTF when performing instruction emulation", 2020-02-23), fix it. Fixes: 5ef8acbdd687 ("KVM: nVMX: Emulate MTF when performing instruction emulation") Cc: stable@vger.kernel.org Reviewed-by: Oliver Upton <oupton@google.com> Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-27Merge branch 'kvm-master' into HEADPaolo Bonzini1-1/+5
Merge AMD fixes before doing more development work.
2020-05-27KVM: x86: don't expose MSR_IA32_UMWAIT_CONTROL unconditionallyMaxim Levitsky1-0/+4
This msr is only available when the host supports WAITPKG feature. This breaks a nested guest, if the L1 hypervisor is set to ignore unknown msrs, because the only other safety check that the kernel does is that it attempts to read the msr and rejects it if it gets an exception. Cc: stable@vger.kernel.org Fixes: 6e3ba4abce ("KVM: vmx: Emulate MSR IA32_UMWAIT_CONTROL") Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20200523161455.3940-3-mlevitsk@redhat.com> Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-15KVM: x86: Fix off-by-one error in kvm_vcpu_ioctl_x86_setup_mceJim Mattson1-1/+1
Bank_num is a one-based count of banks, not a zero-based index. It overflows the allocated space only when strictly greater than KVM_MAX_MCE_BANKS. Fixes: a9e38c3e01ad ("KVM: x86: Catch potential overrun in MCE setup") Signed-off-by: Jue Wang <juew@google.com> Signed-off-by: Jim Mattson <jmattson@google.com> Reviewed-by: Peter Shier <pshier@google.com> Message-Id: <20200511225616.19557-1-jmattson@google.com> Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-15kvm: add halt-polling cpu usage statsDavid Matlack1-0/+2
Two new stats for exposing halt-polling cpu usage: halt_poll_success_ns halt_poll_fail_ns Thus sum of these 2 stats is the total cpu time spent polling. "success" means the VCPU polled until a virtual interrupt was delivered. "fail" means the VCPU had to schedule out (either because the maximum poll time was reached or it needed to yield the CPU). To avoid touching every arch's kvm_vcpu_stat struct, only update and export halt-polling cpu usage stats if we're on x86. Exporting cpu usage as a u64 and in nanoseconds means we will overflow at ~500 years, which seems reasonably large. Signed-off-by: David Matlack <dmatlack@google.com> Signed-off-by: Jon Cargille <jcargill@google.com> Reviewed-by: Jim Mattson <jmattson@google.com> Message-Id: <20200508182240.68440-1-jcargill@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-15KVM: X86: TSCDEADLINE MSR emulation fastpathWanpeng Li1-0/+16
This patch implements a fast path for emulation of writes to the TSCDEADLINE MSR. Besides shortcutting various housekeeping tasks in the vCPU loop, the fast path can also deliver the timer interrupt directly without going through KVM_REQ_PENDING_TIMER because it runs in vCPU context. Tested-by: Haiwei Li <lihaiwei@tencent.com> Cc: Haiwei Li <lihaiwei@tencent.com> Signed-off-by: Wanpeng Li <wanpengli@tencent.com> Message-Id: <1588055009-12677-7-git-send-email-wanpengli@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-15KVM: X86: Introduce more exit_fastpath_completion enum valuesWanpeng Li1-9/+10
Adds a fastpath_t typedef since enum lines are a bit long, and replace EXIT_FASTPATH_SKIP_EMUL_INS with two new exit_fastpath_completion enum values. - EXIT_FASTPATH_EXIT_HANDLED kvm will still go through it's full run loop, but it would skip invoking the exit handler. - EXIT_FASTPATH_REENTER_GUEST complete fastpath, guest can be re-entered without invoking the exit handler or going back to vcpu_run Tested-by: Haiwei Li <lihaiwei@tencent.com> Cc: Haiwei Li <lihaiwei@tencent.com> Signed-off-by: Wanpeng Li <wanpengli@tencent.com> Message-Id: <1588055009-12677-4-git-send-email-wanpengli@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-15KVM: X86: Introduce kvm_vcpu_exit_request() helperWanpeng Li1-2/+8
Introduce kvm_vcpu_exit_request() helper, we need to check some conditions before enter guest again immediately, we skip invoking the exit handler and go through full run loop if complete fastpath but there is stuff preventing we enter guest again immediately. Tested-by: Haiwei Li <lihaiwei@tencent.com> Cc: Haiwei Li <lihaiwei@tencent.com> Signed-off-by: Wanpeng Li <wanpengli@tencent.com> Message-Id: <1588055009-12677-5-git-send-email-wanpengli@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-15KVM: X86: Sanity check on gfn before removalPeter Xu1-0/+4
The index returned by kvm_async_pf_gfn_slot() will be removed when an async pf gfn is going to be removed. However kvm_async_pf_gfn_slot() is not reliable in that it can return the last key it loops over even if the gfn is not found in the async gfn array. It should never happen, but it's still better to sanity check against that to make sure no unexpected gfn will be removed. Signed-off-by: Peter Xu <peterx@redhat.com> Message-Id: <20200416155910.267514-1-peterx@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-15KVM: X86: Force ASYNC_PF_PER_VCPU to be power of twoPeter Xu1-3/+5
Forcing the ASYNC_PF_PER_VCPU to be power of two is much easier to be used rather than calling roundup_pow_of_two() from time to time. Do this by adding a BUILD_BUG_ON() inside the hash function. Another point is that generally async pf does not allow concurrency over ASYNC_PF_PER_VCPU after all (see kvm_setup_async_pf()), so it does not make much sense either to have it not a power of two or some of the entries will definitely be wasted. Signed-off-by: Peter Xu <peterx@redhat.com> Message-Id: <20200416155859.267366-1-peterx@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-15KVM: x86/mmu: Drop KVM's hugepage enums in favor of the kernel's enumsSean Christopherson1-2/+2
Replace KVM's PT_PAGE_TABLE_LEVEL, PT_DIRECTORY_LEVEL and PT_PDPE_LEVEL with the kernel's PG_LEVEL_4K, PG_LEVEL_2M and PG_LEVEL_1G. KVM's enums are borderline impossible to remember and result in code that is visually difficult to audit, e.g. if (!enable_ept) ept_lpage_level = 0; else if (cpu_has_vmx_ept_1g_page()) ept_lpage_level = PT_PDPE_LEVEL; else if (cpu_has_vmx_ept_2m_page()) ept_lpage_level = PT_DIRECTORY_LEVEL; else ept_lpage_level = PT_PAGE_TABLE_LEVEL; versus if (!enable_ept) ept_lpage_level = 0; else if (cpu_has_vmx_ept_1g_page()) ept_lpage_level = PG_LEVEL_1G; else if (cpu_has_vmx_ept_2m_page()) ept_lpage_level = PG_LEVEL_2M; else ept_lpage_level = PG_LEVEL_4K; No functional change intended. Suggested-by: Barret Rhoden <brho@google.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200428005422.4235-4-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-15kvm: x86: Cleanup vcpu->arch.guest_xstate_sizeXiaoyao Li1-2/+0
vcpu->arch.guest_xstate_size lost its only user since commit df1daba7d1cb ("KVM: x86: support XSAVES usage in the host"), so clean it up. Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com> Message-Id: <20200429154312.1411-1-xiaoyao.li@intel.com> Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-13KVM: x86: Save L1 TSC offset in 'struct kvm_vcpu_arch'Sean Christopherson1-5/+4
Save L1's TSC offset in 'struct kvm_vcpu_arch' and drop the kvm_x86_ops hook read_l1_tsc_offset(). This avoids a retpoline (when configured) when reading L1's effective TSC, which is done at least once on every VM-Exit. No functional change intended. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200502043234.12481-2-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-13KVM: x86: Replace late check_nested_events() hack with more precise fixPaolo Bonzini1-24/+12
Add an argument to interrupt_allowed and nmi_allowed, to checking if interrupt injection is blocked. Use the hook to handle the case where an interrupt arrives between check_nested_events() and the injection logic. Drop the retry of check_nested_events() that hack-a-fixed the same condition. Blocking injection is also a bit of a hack, e.g. KVM should do exiting and non-exiting interrupt processing in a single pass, but it's a more precise hack. The old comment is also misleading, e.g. KVM_REQ_EVENT is purely an optimization, setting it on every run loop (which KVM doesn't do) should not affect functionality, only performance. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200423022550.15113-13-sean.j.christopherson@intel.com> [Extend to SVM, add SMI and NMI. Even though NMI and SMI cannot come asynchronously right now, making the fix generic is easy and removes a special case. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-13KVM: x86: WARN on injected+pending exception even in nested caseSean Christopherson1-1/+3
WARN if a pending exception is coincident with an injected exception before calling check_nested_events() so that the WARN will fire even if inject_pending_event() bails early because check_nested_events() detects the conflict. Bailing early isn't problematic (quite the opposite), but suppressing the WARN is undesirable as it could mask a bug elsewhere in KVM. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200423022550.15113-11-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-13KVM: x86: replace is_smm checks with kvm_x86_ops.smi_allowedPaolo Bonzini1-3/+3
Do not hardcode is_smm so that all the architectural conditions for blocking SMIs are listed in a single place. Well, in two places because this introduces some code duplication between Intel and AMD. This ensures that nested SVM obeys GIF in kvm_vcpu_has_events. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-13KVM: x86: Set KVM_REQ_EVENT if run is canceled with req_immediate_exit setSean Christopherson1-0/+2
Re-request KVM_REQ_EVENT if vcpu_enter_guest() bails after processing pending requests and an immediate exit was requested. This fixes a bug where a pending event, e.g. VMX preemption timer, is delayed and/or lost if the exit was deferred due to something other than a higher priority _injected_ event, e.g. due to a pending nested VM-Enter. This bug only affects the !injected case as kvm_x86_ops.cancel_injection() sets KVM_REQ_EVENT to redo the injection, but that's purely serendipitous behavior with respect to the deferred event. Note, emulated preemption timer isn't the only event that can be affected, it simply happens to be the only event where not re-requesting KVM_REQ_EVENT is blatantly visible to the guest. Fixes: f4124500c2c13 ("KVM: nVMX: Fully emulate preemption timer") Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200423022550.15113-4-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-13KVM: nVMX: Open a window for pending nested VMX preemption timerSean Christopherson1-0/+9
Add a kvm_x86_ops hook to detect a nested pending "hypervisor timer" and use it to effectively open a window for servicing the expired timer. Like pending SMIs on VMX, opening a window simply means requesting an immediate exit. This fixes a bug where an expired VMX preemption timer (for L2) will be delayed and/or lost if a pending exception is injected into L2. The pending exception is rightly prioritized by vmx_check_nested_events() and injected into L2, with the preemption timer left pending. Because no window opened, L2 is free to run uninterrupted. Fixes: f4124500c2c13 ("KVM: nVMX: Fully emulate preemption timer") Reported-by: Jim Mattson <jmattson@google.com> Cc: Oliver Upton <oupton@google.com> Cc: Peter Shier <pshier@google.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200423022550.15113-3-sean.j.christopherson@intel.com> [Check it in kvm_vcpu_has_events too, to ensure that the preemption timer is serviced promptly even if the vCPU is halted and L1 is not intercepting HLT. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>