path: root/arch/x86/kvm/x86.c
Age | Commit message | Author | Files / Lines changed
2023-04-13KVM: x86: Clear "has_error_code", not "error_code", for RM exception injectionSean Christopherson1-2/+9
commit 6c41468c7c12d74843bb414fc00307ea8a6318c3 upstream. When injecting an exception into a vCPU in Real Mode, suppress the error code by clearing the flag that tracks whether the error code is valid, not by clearing the error code itself. The "typo" was introduced by a recent fix for SVM's funky Paged Real Mode. Opportunistically hoist the logic above the tracepoint so that the trace is coherent with respect to what is actually injected (this was also the behavior prior to the buggy commit). Fixes: b97f07458373 ("KVM: x86: determine if an exception has an error code only when injecting it.") Cc: stable@vger.kernel.org Cc: Maxim Levitsky <mlevitsk@redhat.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20230322143300.2209476-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
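A minimal C sketch of the intended injection logic (the exception pointer "ex" and the use of is_protmode() are illustrative assumptions, not the exact upstream diff):

    /* When injecting into Real Mode, suppress the error code by marking it
     * invalid; zeroing only the value (the "typo") would still inject an
     * error code where none belongs. */
    if (!is_protmode(vcpu))
            ex->has_error_code = false;   /* not: ex->error_code = 0; */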
2023-03-17KVM: x86: Move guts of kvm_arch_init() to standalone helperSean Christopherson1-2/+13
[ Upstream commit 4f8396b96a9fc672964842fe7adbe8ddca8a3adf ] Move the guts of kvm_arch_init() to a new helper, kvm_x86_vendor_init(), so that VMX can do _all_ arch and vendor initialization before calling kvm_init(). Calling kvm_init() must be the _very_ last step during init, as kvm_init() exposes /dev/kvm to userspace, i.e. allows creating VMs. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20221130230934.1014142-14-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Stable-dep-of: e32b120071ea ("KVM: VMX: Do _all_ initialization before exposing /dev/kvm to userspace") Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-02-25KVM: x86: Fail emulation during EMULTYPE_SKIP on any exceptionSean Christopherson1-1/+3
[ Upstream commit 17122c06b86c9f77f45b86b8e62c3ed440847a59 ] Treat any exception during instruction decode for EMULTYPE_SKIP as a "full" emulation failure, i.e. signal failure instead of queuing the exception. When decoding purely to skip an instruction, KVM and/or the CPU has already done some amount of emulation that cannot be unwound, e.g. on an EPT misconfig VM-Exit KVM has already processed the emulated MMIO. KVM already does this if a #UD is encountered, but not for other exceptions, e.g. if a #PF is encountered during fetch. In SVM's soft-injection use case, queueing the exception is particularly problematic as queueing exceptions while injecting events can put KVM into an infinite loop due to bailing from VM-Enter to service the newly pending exception. E.g. multiple warnings to detect such behavior fire: ------------[ cut here ]------------ WARNING: CPU: 3 PID: 1017 at arch/x86/kvm/x86.c:9873 kvm_arch_vcpu_ioctl_run+0x1de5/0x20a0 [kvm] Modules linked in: kvm_amd ccp kvm irqbypass CPU: 3 PID: 1017 Comm: svm_nested_soft Not tainted 6.0.0-rc1+ #220 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 RIP: 0010:kvm_arch_vcpu_ioctl_run+0x1de5/0x20a0 [kvm] Call Trace: kvm_vcpu_ioctl+0x223/0x6d0 [kvm] __x64_sys_ioctl+0x85/0xc0 do_syscall_64+0x2b/0x50 entry_SYSCALL_64_after_hwframe+0x46/0xb0 ---[ end trace 0000000000000000 ]--- ------------[ cut here ]------------ WARNING: CPU: 3 PID: 1017 at arch/x86/kvm/x86.c:9987 kvm_arch_vcpu_ioctl_run+0x12a3/0x20a0 [kvm] Modules linked in: kvm_amd ccp kvm irqbypass CPU: 3 PID: 1017 Comm: svm_nested_soft Tainted: G W 6.0.0-rc1+ #220 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 RIP: 0010:kvm_arch_vcpu_ioctl_run+0x12a3/0x20a0 [kvm] Call Trace: kvm_vcpu_ioctl+0x223/0x6d0 [kvm] __x64_sys_ioctl+0x85/0xc0 do_syscall_64+0x2b/0x50 entry_SYSCALL_64_after_hwframe+0x46/0xb0 ---[ end trace 0000000000000000 ]--- Fixes: 6ea6e84309ca ("KVM: x86: inject exceptions produced by x86_decode_insn") Signed-off-by: Sean Christopherson <seanjc@google.com> Link: https://lore.kernel.org/r/20220930233632.1725475-1-seanjc@google.com Signed-off-by: Sasha Levin <sashal@kernel.org>
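A hedged sketch of the decode path described above (EMULTYPE_SKIP and x86_decode_insn() are named in the commit message; the surrounding control flow and return value are assumptions):

    r = x86_decode_insn(ctxt, insn, insn_len, emulation_type);
    if (r != EMULATION_OK && (emulation_type & EMULTYPE_SKIP)) {
            /* Decode raised an exception while merely skipping an insn:
             * report full emulation failure instead of queueing it. */
            return 0;
    }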
2023-02-22kvm: initialize all of the kvm_debugregs structure before sending it to userspaceGreg Kroah-Hartman1-2/+1
commit 2c10b61421a28e95a46ab489fd56c0f442ff6952 upstream. When calling the KVM_GET_DEBUGREGS ioctl, on some configurations, there might be some uninitialized portions of the kvm_debugregs structure that could be copied to userspace. Prevent this, as is done in the other kvm ioctls, by setting the whole structure to 0 before copying anything into it. Bonus is that this reduces the lines of code as the explicit flag setting and reserved space zeroing out can be removed. Cc: Sean Christopherson <seanjc@google.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: <x86@kernel.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: stable <stable@kernel.org> Reported-by: Xingyuan Mo <hdthky0@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Message-Id: <20230214103304.3689213-1-gregkh@linuxfoundation.org> Tested-by: Xingyuan Mo <hdthky0@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
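A short sketch of the fix's shape (ioctl plumbing simplified; the fill helper name is hypothetical):

    struct kvm_debugregs dbgregs;

    /* Zero the whole struct first so no uninitialized or padding bytes
     * can leak to userspace. */
    memset(&dbgregs, 0, sizeof(dbgregs));
    fill_debugregs(vcpu, &dbgregs);                    /* hypothetical helper */
    if (copy_to_user(argp, &dbgregs, sizeof(dbgregs)))
            return -EFAULT;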
2023-02-14KVM: x86: Mitigate the cross-thread return address predictions bugTom Lendacky1-11/+32
commit 6f0f2d5ef895d66a3f2b32dd05189ec34afa5a55 upstream. By default, KVM/SVM will intercept attempts by the guest to transition out of C0. However, the KVM_CAP_X86_DISABLE_EXITS capability can be used by a VMM to change this behavior. To mitigate the cross-thread return address predictions bug (X86_BUG_SMT_RSB), a VMM must not be allowed to override the default behavior to intercept C0 transitions. Use a module parameter to control the mitigation on processors that are vulnerable to X86_BUG_SMT_RSB. If the processor is vulnerable to the X86_BUG_SMT_RSB bug and the module parameter is set to mitigate the bug, KVM will not allow the disabling of the HLT, MWAIT and CSTATE exits. Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com> Message-Id: <4019348b5e07148eb4d593380a5f6713b93c9a16.1675956146.git.thomas.lendacky@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
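A rough sketch of the gating described above (only X86_BUG_SMT_RSB and the capability name come from the commit text; the parameter and helper names are illustrative):

    static bool __read_mostly mitigate_smt_rsb;        /* assumed parameter name */
    module_param(mitigate_smt_rsb, bool, 0444);

    /* In the KVM_CAP_X86_DISABLE_EXITS handler: only honor requests to stop
     * intercepting HLT/MWAIT/CSTATE when the mitigation does not apply. */
    if (!mitigate_smt_rsb || !boot_cpu_has_bug(X86_BUG_SMT_RSB))
            kvm_allow_user_disable_exits(kvm, cap->args[0]);   /* hypothetical */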
2022-11-30KVM: x86: fix uninitialized variable use on KVM_REQ_TRIPLE_FAULTPaolo Bonzini1-1/+1
If a triple fault was fixed by kvm_x86_ops.nested_ops->triple_fault (by turning it into a vmexit), there is no need to leave vcpu_enter_guest(). Any vcpu->requests will be caught later before the actual vmentry, and in fact vcpu_enter_guest() was not initializing the "r" variable. Depending on the compiler's whims, this could cause the x86_64/triple_fault_event_test test to fail. Cc: Maxim Levitsky <mlevitsk@redhat.com> Fixes: 92e7d5c83aff ("KVM: x86: allow L1 to not intercept triple fault") Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-11-27Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds1-6/+23
Pull kvm fixes from Paolo Bonzini: "x86: - Fixes for Xen emulation. While nobody should be enabling it in the kernel (the only public users of the feature are the selftests), the bug effectively allows userspace to read arbitrary memory. - Correctness fixes for nested hypervisors that do not intercept INIT or SHUTDOWN on AMD; the subsequent CPU reset can cause a use-after-free when it disables virtualization extensions. While downgrading the panic to a WARN is quite easy, the full fix is a bit more laborious; there are also tests. This is the bulk of the pull request. - Fix race condition due to incorrect mmu_lock use around make_mmu_pages_available(). Generic: - Obey changes to the kvm.halt_poll_ns module parameter in VMs not using KVM_CAP_HALT_POLL, restoring behavior from before the introduction of the capability" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: KVM: Update gfn_to_pfn_cache khva when it moves within the same page KVM: x86/xen: Only do in-kernel acceleration of hypercalls for guest CPL0 KVM: x86/xen: Validate port number in SCHEDOP_poll KVM: x86/mmu: Fix race condition in direct_page_fault KVM: x86: remove exit_int_info warning in svm_handle_exit KVM: selftests: add svm part to triple_fault_test KVM: x86: allow L1 to not intercept triple fault kvm: selftests: add svm nested shutdown test KVM: selftests: move idt_entry to header KVM: x86: forcibly leave nested mode on vCPU reset KVM: x86: add kvm_leave_nested KVM: x86: nSVM: harden svm_free_nested against freeing vmcb02 while still in use KVM: x86: nSVM: leave nested mode on vCPU free KVM: Obey kvm.halt_poll_ns in VMs not using KVM_CAP_HALT_POLL KVM: Avoid re-reading kvm->max_halt_poll_ns during halt-polling KVM: Cap vcpu->halt_poll_ns before halting rather than after
2022-11-17KVM: x86: allow L1 to not intercept triple faultMaxim Levitsky1-5/+6
This is an SVM correctness fix - although a sane L1 would intercept the SHUTDOWN event, it doesn't have to, so we have to honour this. Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20221103141351.50662-8-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-11-17KVM: x86: forcibly leave nested mode on vCPU resetMaxim Levitsky1-0/+10
While not obvious, kvm_vcpu_reset() leaves nested mode by clearing 'vcpu->arch.hflags' but it does so without all the required housekeeping. On SVM, it is possible to have a vCPU reset while in guest mode because, unlike VMX, on SVM INITs are not latched in non-root mode, and in addition L1 doesn't have to intercept triple fault, which should also trigger L1's reset if it happens in L2 while L1 didn't intercept it. If one of the above conditions happens, KVM will continue to use vmcb02 while no longer being in guest mode. Later IA32_EFER will be cleared, which will lead to freeing of the nested guest state, which will (correctly) free the vmcb02; but since KVM still uses it (incorrectly), this will lead to a use-after-free and kernel crash. This issue is assigned CVE-2022-3344. Cc: stable@vger.kernel.org Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20221103141351.50662-5-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-11-17KVM: x86: add kvm_leave_nestedMaxim Levitsky1-1/+7
Add kvm_leave_nested(), which wraps a call to nested_ops->leave_nested() in a function. Cc: stable@vger.kernel.org Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20221103141351.50662-4-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
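The wrapper is essentially a one-liner; a sketch (the nested_ops path follows the commit text):

    static void kvm_leave_nested(struct kvm_vcpu *vcpu)
    {
            kvm_x86_ops.nested_ops->leave_nested(vcpu);
    }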
2022-11-15x86/cpu: Restore AMD's DE_CFG MSR after resumeBorislav Petkov1-1/+1
DE_CFG contains the LFENCE serializing bit, restore it on resume too. This is relevant to older families due to the way they do S3. Unify and correct naming while at it. Fixes: e4d0e84e4907 ("x86/cpu/AMD: Make LFENCE a serializing instruction") Reported-by: Andrew Cooper <Andrew.Cooper3@citrix.com> Reported-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: <stable@kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-11-09KVM: x86/pmu: Limit the maximum number of supported AMD GP countersLike Xu1-0/+3
The AMD PerfMonV2 specification allows for a maximum of 16 GP counters, but currently only 6 pairs of MSRs are accepted by KVM. While AMD64_NUM_COUNTERS_CORE is already equal to 6, increasing without adjusting msrs_to_save_all[] could result in out-of-bounds accesses. Therefore introduce a macro (named KVM_AMD_PMC_MAX_GENERIC) to refer to the number of counters supported by KVM. Signed-off-by: Like Xu <likexu@tencent.com> Reviewed-by: Jim Mattson <jmattson@google.com> Message-Id: <20220919091008.60695-3-likexu@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-11-09KVM: x86/pmu: Limit the maximum number of supported Intel GP countersLike Xu1-5/+7
The Intel Architectural IA32_PMCx MSRs addresses range allows for a maximum of 8 GP counters, and KVM cannot address any more. Introduce a local macro (named KVM_INTEL_PMC_MAX_GENERIC) and use it consistently to refer to the number of counters supported by KVM, thus avoiding possible out-of-bound accesses. Suggested-by: Jim Mattson <jmattson@google.com> Signed-off-by: Like Xu <likexu@tencent.com> Reviewed-by: Jim Mattson <jmattson@google.com> Message-Id: <20220919091008.60695-2-likexu@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-11-09KVM: x86/pmu: Do not speculatively query Intel GP PMCs that don't exist yetLike Xu1-12/+2
The SDM lists an architectural MSR IA32_CORE_CAPABILITIES (0xCF) that limits the theoretical maximum value of the Intel GP PMC MSRs allocated at 0xC1 to 14; likewise the Intel April 2022 SDM adds IA32_OVERCLOCKING_STATUS at 0x195 which limits the number of event selection MSRs to 15 (0x186-0x194). Limiting the maximum number of counters to 14 or 18 based on the currently allocated MSRs is clearly fragile, and it seems likely that Intel will even place PMCs 8-15 at a completely different range of MSR indices. So stop at the maximum number of GP PMCs supported today on Intel processors. There are some machines, like Intel P4 with non Architectural PMU, that may indeed have 18 counters, but those counters are in a completely different MSR address range and are not supported by KVM. Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: stable@vger.kernel.org Fixes: cf05a67b68b8 ("KVM: x86: omit "impossible" pmu MSRs from MSR list") Suggested-by: Jim Mattson <jmattson@google.com> Signed-off-by: Like Xu <likexu@tencent.com> Reviewed-by: Jim Mattson <jmattson@google.com> Message-Id: <20220919091008.60695-1-likexu@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-11-03KVM: x86: Use SRCU to protect zap in __kvm_set_or_clear_apicv_inhibit()Ben Gardon1-0/+3
kvm_zap_gfn_range() must be called in an SRCU read-critical section, but there is no SRCU annotation in __kvm_set_or_clear_apicv_inhibit(). This can lead to the following warning via kvm_arch_vcpu_ioctl_set_guest_debug() if a Shadow MMU is in use (TDP MMU disabled or nesting): [ 1416.659809] ============================= [ 1416.659810] WARNING: suspicious RCU usage [ 1416.659839] 6.1.0-dbg-DEV #1 Tainted: G S I [ 1416.659853] ----------------------------- [ 1416.659854] include/linux/kvm_host.h:954 suspicious rcu_dereference_check() usage! [ 1416.659856] ... [ 1416.659904] dump_stack_lvl+0x84/0xaa [ 1416.659910] dump_stack+0x10/0x15 [ 1416.659913] lockdep_rcu_suspicious+0x11e/0x130 [ 1416.659919] kvm_zap_gfn_range+0x226/0x5e0 [ 1416.659926] ? kvm_make_all_cpus_request_except+0x18b/0x1e0 [ 1416.659935] __kvm_set_or_clear_apicv_inhibit+0xcc/0x100 [ 1416.659940] kvm_arch_vcpu_ioctl_set_guest_debug+0x350/0x390 [ 1416.659946] kvm_vcpu_ioctl+0x2fc/0x620 [ 1416.659955] __se_sys_ioctl+0x77/0xc0 [ 1416.659962] __x64_sys_ioctl+0x1d/0x20 [ 1416.659965] do_syscall_64+0x3d/0x80 [ 1416.659969] entry_SYSCALL_64_after_hwframe+0x63/0xcd Always take the KVM SRCU read lock in __kvm_set_or_clear_apicv_inhibit() to protect the GFN to memslot translation. The SRCU read lock is not technically required when no Shadow MMUs are in use, since the TDP MMU walks the paging structures from the roots and does not need to look up GFN translations in the memslots, but make the SRCU locking unconditional for simplicity. In most cases, the SRCU locking is taken care of in the vCPU run loop, but when called through other ioctls (such as KVM_SET_GUEST_DEBUG) there is no srcu_read_lock. Tested: ran tools/testing/selftests/kvm/x86_64/debug_regs on a DBG build. This patch causes the suspicious RCU warning to disappear. Note that the warning is hit in __kvm_zap_rmaps(), so kvm_memslots_have_rmaps() must return true in order for this to repro (i.e. the TDP MMU must be off or nesting in use.) Reported-by: Greg Thelen <gthelen@google.com> Fixes: 36222b117e36 ("KVM: x86: don't disable APICv memslot when inhibited") Signed-off-by: Ben Gardon <bgardon@google.com> Message-Id: <20221102205359.1260980-1-bgardon@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
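A minimal sketch of the locking change (the zap arguments are placeholders; srcu_read_lock()/srcu_read_unlock() against kvm->srcu is standard KVM practice):

    int idx;

    /* Protect the GFN-to-memslot lookups done while zapping. */
    idx = srcu_read_lock(&kvm->srcu);
    kvm_zap_gfn_range(kvm, gfn_start, gfn_end);   /* gfn range is illustrative */
    srcu_read_unlock(&kvm->srcu, idx);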
2022-10-27KVM: Initialize gfn_to_pfn_cache locks in dedicated helperMichal Luczaj1-5/+7
Move the gfn_to_pfn_cache lock initialization to another helper and call the new helper during VM/vCPU creation. There are race conditions possible due to kvm_gfn_to_pfn_cache_init()'s ability to re-initialize the cache's locks. For example: a race between ioctl(KVM_XEN_HVM_EVTCHN_SEND) and kvm_gfn_to_pfn_cache_init() leads to a corrupted shinfo gpc lock. (thread 1) | (thread 2) | kvm_xen_set_evtchn_fast | read_lock_irqsave(&gpc->lock, ...) | | kvm_gfn_to_pfn_cache_init | rwlock_init(&gpc->lock) read_unlock_irqrestore(&gpc->lock, ...) | Rename "cache_init" and "cache_destroy" to activate+deactivate to avoid implying that the cache really is destroyed/freed. Note, there are more races in the newly named kvm_gpc_activate() that will be addressed separately. Fixes: 982ed0de4753 ("KVM: Reinstate gfn_to_pfn_cache with invalidation support") Cc: stable@vger.kernel.org Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Michal Luczaj <mhal@rbox.co> [sean: call out that this is a bug fix] Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20221013211234.1318131-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-10-27KVM: x86: Exempt pending triple fault from event injection sanity checkSean Christopherson1-1/+14
Exempt pending triple faults, a.k.a. KVM_REQ_TRIPLE_FAULT, when asserting that KVM didn't attempt to queue a new exception during event injection. KVM needs to emulate the injection itself when emulating Real Mode due to lack of unrestricted guest support (VMX) and will queue a triple fault if that emulation fails. Ideally the assertion would more precisely filter out the emulated Real Mode triple fault case, but rmode.vm86_active is buried in vcpu_vmx and can't be queried without a new kvm_x86_ops. And unlike "regular" exceptions, triple fault cannot put the vCPU into an infinite loop; the triple fault will force either an exit to userspace or a nested VM-Exit, and triple fault after nested VM-Exit will force an exit to userspace. I.e. there is no functional issue, so just suppress the warning for triple faults. Opportunistically convert the warning to a one-time thing; when it fires, it fires _a lot_, and it is usually user-triggerable, i.e. can be used to spam the kernel log. Fixes: 7055fb113116 ("KVM: x86: Treat pending TRIPLE_FAULT requests as pending exceptions") Reported-by: kernel test robot <yujie.liu@intel.com> Link: https://lore.kernel.org/r/202209301338.aca913c3-yujie.liu@intel.com Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220930230008.1636044-1-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-10-22KVM: x86: Add compat handler for KVM_X86_SET_MSR_FILTERAlexander Graf1-0/+56
The KVM_X86_SET_MSR_FILTER ioctl contains a pointer in the passed-in struct, which means it has a different struct size depending on whether it gets called from 32bit or 64bit code. This patch introduces compat code that converts from the 32bit struct to its 64bit counterpart, which then gets used going forward internally. With this applied, 32bit QEMU can successfully set MSR bitmaps when running on 64bit kernels. Reported-by: Andrew Randrianasulu <randrianasulu@gmail.com> Fixes: 1a155254ff937 ("KVM: x86: Introduce MSR filtering") Signed-off-by: Alexander Graf <graf@amazon.com> Message-Id: <20221017184541.2658-4-graf@amazon.com> Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-10-22KVM: x86: Copy filter arg outside kvm_vm_ioctl_set_msr_filter()Alexander Graf1-14/+17
In the next patch we want to introduce a second caller to set_msr_filter() which constructs its own filter list on the stack. Refactor the original function so it takes the filter as an argument instead of reading it through copy_from_user(). Signed-off-by: Alexander Graf <graf@amazon.com> Message-Id: <20221017184541.2658-3-graf@amazon.com> Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-10-09Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds1-181/+395
Pull kvm updates from Paolo Bonzini: "The first batch of KVM patches, mostly covering x86. ARM: - Account stage2 page table allocations in memory stats x86: - Account EPT/NPT arm64 page table allocations in memory stats - Tracepoint cleanups/fixes for nested VM-Enter and emulated MSR accesses - Drop eVMCS controls filtering for KVM on Hyper-V, all known versions of Hyper-V now support eVMCS fields associated with features that are enumerated to the guest - Use KVM's sanitized VMCS config as the basis for the values of nested VMX capabilities MSRs - A myriad event/exception fixes and cleanups. Most notably, pending exceptions morph into VM-Exits earlier, as soon as the exception is queued, instead of waiting until the next vmentry. This fixed a longstanding issue where the exceptions would incorrectly become double-faults instead of triggering a vmexit; the common case of page-fault vmexits had a special workaround, but now it's fixed for good - A handful of fixes for memory leaks in error paths - Cleanups for VMREAD trampoline and VMX's VM-Exit assembly flow - Never write to memory from non-sleepable kvm_vcpu_check_block() - Selftests refinements and cleanups - Misc typo cleanups Generic: - remove KVM_REQ_UNHALT" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (94 commits) KVM: remove KVM_REQ_UNHALT KVM: mips, x86: do not rely on KVM_REQ_UNHALT KVM: x86: never write to memory from kvm_vcpu_check_block() KVM: x86: Don't snapshot pending INIT/SIPI prior to checking nested events KVM: nVMX: Make event request on VMXOFF iff INIT/SIPI is pending KVM: nVMX: Make an event request if INIT or SIPI is pending on VM-Enter KVM: SVM: Make an event request if INIT or SIPI is pending when GIF is set KVM: x86: lapic does not have to process INIT if it is blocked KVM: x86: Rename kvm_apic_has_events() to make it INIT/SIPI specific KVM: x86: Rename and expose helper to detect if INIT/SIPI are allowed KVM: nVMX: Make an event request when pending an MTF nested VM-Exit KVM: x86: make vendor code check for all nested events mailmap: Update Oliver's email address KVM: x86: Allow force_emulation_prefix to be written without a reload KVM: selftests: Add an x86-only test to verify nested exception queueing KVM: selftests: Use uapi header to get VMX and SVM exit reasons/codes KVM: x86: Rename inject_pending_events() to kvm_check_and_inject_events() KVM: VMX: Update MTF and ICEBP comments to document KVM's subtle behavior KVM: x86: Treat pending TRIPLE_FAULT requests as pending exceptions KVM: x86: Morph pending exceptions to pending VM-Exits at queue time ...
2022-09-26KVM: remove KVM_REQ_UNHALTPaolo Bonzini1-3/+0
KVM_REQ_UNHALT is now unnecessary because it is replaced by the return value of kvm_vcpu_block/kvm_vcpu_halt. Remove it. No functional change intended. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Acked-by: Marc Zyngier <maz@kernel.org> Message-Id: <20220921003201.1441511-13-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-09-26KVM: mips, x86: do not rely on KVM_REQ_UNHALTPaolo Bonzini1-1/+8
KVM_REQ_UNHALT is a weird request that simply reports the value of kvm_arch_vcpu_runnable() on exit from kvm_vcpu_halt(). Only MIPS and x86 are looking at it, the others just clear it. Check the state of the vCPU directly so that the request is handled as a nop on all architectures. No functional change intended, except for corner cases where an event arrives immediately after a signal becomes pending or after another similar host-side event. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org> Message-Id: <20220921003201.1441511-12-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-09-26KVM: x86: never write to memory from kvm_vcpu_check_block()Paolo Bonzini1-3/+11
kvm_vcpu_check_block() is called while not in TASK_RUNNING, and therefore it cannot sleep. Writing to guest memory is therefore forbidden, but it can happen on AMD processors if kvm_check_nested_events() causes a vmexit. Fortunately, all events that are caught by kvm_check_nested_events() are also recognized by kvm_vcpu_has_events() through vendor callbacks such as kvm_x86_interrupt_allowed() or kvm_x86_ops.nested_ops->has_events(), so remove the call and postpone the actual processing to vcpu_block(). Opportunistically honor the return of kvm_check_nested_events(). KVM punted on the check in kvm_vcpu_running() because the only error path is if vmx_complete_nested_posted_interrupt() fails, in which case KVM exits to userspace with "internal error" i.e. the VM is likely dead anyways so it wasn't worth overloading the return of kvm_vcpu_running(). Add the check mostly so that KVM is consistent with itself; the return of the call via kvm_apic_accept_events()=>kvm_check_nested_events() that immediately follows _is_ checked. Reported-by: Maxim Levitsky <mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> [sean: check and handle return of kvm_check_nested_events()] Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220921003201.1441511-11-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-09-26KVM: x86: lapic does not have to process INIT if it is blockedPaolo Bonzini1-1/+2
Do not return true from kvm_vcpu_has_events() if the vCPU isn't going to immediately process a pending INIT/SIPI. INIT/SIPI shouldn't be treated as wake events if they are blocked. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> [sean: rebase onto refactored INIT/SIPI helpers, massage changelog] Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220921003201.1441511-6-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-09-26KVM: x86: Rename kvm_apic_has_events() to make it INIT/SIPI specificSean Christopherson1-3/+3
Rename kvm_apic_has_events() to kvm_apic_has_pending_init_or_sipi() so that it's more obvious that "events" really just means "INIT or SIPI". Opportunistically clean up a weirdly worded comment that referenced kvm_apic_has_events() instead of kvm_apic_accept_events(). No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220921003201.1441511-5-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-09-26KVM: x86: Rename and expose helper to detect if INIT/SIPI are allowedSean Christopherson1-4/+5
Rename and invert kvm_vcpu_latch_init() to kvm_apic_init_sipi_allowed() so as to match the behavior of {interrupt,nmi,smi}_allowed(), and expose the helper so that it can be used by kvm_vcpu_has_events() to determine whether or not an INIT or SIPI is pending _and_ can be taken immediately. Opportunistically replace usage of the "latch" terminology with "blocked" and/or "allowed", again to align with KVM's terminology used for all other event types. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220921003201.1441511-4-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-09-26KVM: x86: make vendor code check for all nested eventsPaolo Bonzini1-4/+4
Interrupts, NMIs etc. sent while in guest mode are already handled properly by the *_interrupt_allowed callbacks, but other events can cause a vCPU to be runnable that are specific to guest mode. In the case of VMX there are two, the preemption timer and the monitor trap. The VMX preemption timer is already special cased via the hv_timer_pending callback, but the purpose of the callback can be easily extended to MTF or in fact any other event that can occur only in guest mode. Rename the callback and add an MTF check; kvm_arch_vcpu_runnable() now can return true if an MTF is pending, without relying on kvm_vcpu_running()'s call to kvm_check_nested_events(). Until that call is removed, however, the patch introduces no functional change. Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220921003201.1441511-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-09-26KVM: x86: Allow force_emulation_prefix to be written without a reloadSean Christopherson1-3/+4
Allow force_emulation_prefix to be written by privileged userspace without reloading KVM. The param does not have any persistent effects and is trivial to snapshot. Signed-off-by: Sean Christopherson <seanjc@google.com> Link: https://lore.kernel.org/r/20220830231614.3580124-28-seanjc@google.com Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-09-26KVM: x86: Rename inject_pending_events() to kvm_check_and_inject_events()Sean Christopherson1-3/+43
Rename inject_pending_events() to kvm_check_and_inject_events() in order to capture the fact that it handles more than just pending events, and to (mostly) align with kvm_check_nested_events(), which omits the "inject" for brevity. Add a comment above kvm_check_and_inject_events() to provide a high-level synopsis, and to document a virtualization hole (KVM erratum if you will) that exists due to KVM not strictly tracking instruction boundaries with respect to coincident instruction restarts and asynchronous events. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Link: https://lore.kernel.org/r/20220830231614.3580124-25-seanjc@google.com Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-09-26KVM: x86: Treat pending TRIPLE_FAULT requests as pending exceptionsSean Christopherson1-3/+0
Treat pending TRIPLE_FAULTS as pending exceptions. A triple fault is an exception for all intents and purposes, it's just not tracked as such because there's no vector associated with the exception. E.g. if userspace were to set vcpu->request_interrupt_window while running L2 and L2 hit a triple fault, a triple fault nested VM-Exit should be synthesized to L1 before exiting to userspace with KVM_EXIT_IRQ_WINDOW_OPEN. Link: https://lore.kernel.org/all/YoVHAIGcFgJit1qp@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Link: https://lore.kernel.org/r/20220830231614.3580124-23-seanjc@google.com Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-09-26KVM: x86: Morph pending exceptions to pending VM-Exits at queue timeSean Christopherson1-49/+110
Morph pending exceptions to pending VM-Exits (due to interception) when the exception is queued instead of waiting until nested events are checked at VM-Entry. This fixes a longstanding bug where KVM fails to handle an exception that occurs during delivery of a previous exception, KVM (L0) and L1 both want to intercept the exception (e.g. #PF for shadow paging), and KVM determines that the exception is in the guest's domain, i.e. queues the new exception for L2. Deferring the interception check causes KVM to escalate various combinations of injected+pending exceptions to double fault (#DF) without consulting L1's interception desires, and ends up injecting a spurious #DF into L2. KVM has fudged around the issue for #PF by special casing emulated #PF injection for shadow paging, but the underlying issue is not unique to shadow paging in L0, e.g. if KVM is intercepting #PF because the guest has a smaller maxphyaddr and L1 (but not L0) is using shadow paging. Other exceptions are affected as well, e.g. if KVM is intercepting #GP for one of SVM's workarounds or for the VMware backdoor emulation stuff. The other cases have gone unnoticed because the #DF is spurious if and only if L1 resolves the exception, e.g. KVM's goofs go unnoticed if L1 would have injected #DF anyways. The hack-a-fix has also led to ugly code, e.g. bailing from the emulator if #PF injection forced a nested VM-Exit and the emulator finds itself back in L1. Allowing for direct-to-VM-Exit queueing also neatly solves the async #PF in L2 mess; no need to set a magic flag and token, simply queue a #PF nested VM-Exit. Deal with event migration by flagging that a pending exception was queued by userspace and check for interception at the next KVM_RUN, e.g. so that KVM does the right thing regardless of the order in which userspace restores nested state vs. event state. When "getting" events from userspace, simply drop any pending exception that is destined to be intercepted if there is also an injected exception to be migrated. Ideally, KVM would migrate both events, but that would require new ABI, and practically speaking losing the event is unlikely to be noticed, let alone fatal. The injected exception is captured, RIP still points at the original faulting instruction, etc... So either the injection on the target will trigger the same intercepted exception, or the source of the intercepted exception was transient and/or non-deterministic, thus dropping it is ok-ish. Fixes: a04aead144fd ("KVM: nSVM: fix running nested guests when npt=0") Fixes: feaf0c7dc473 ("KVM: nVMX: Do not generate #DF if #PF happens during exception delivery into L2") Cc: Jim Mattson <jmattson@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Link: https://lore.kernel.org/r/20220830231614.3580124-22-seanjc@google.com Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-09-26KVM: x86: Evaluate ability to inject SMI/NMI/IRQ after potential VM-ExitSean Christopherson1-2/+8
Determine whether or not new events can be injected after checking nested events. If a VM-Exit occurred during nested event handling, any previous event that needed re-injection is gone from KVM's perspective; the event is captured in the vmc*12 VM-Exit information, but doesn't exist in terms of what needs to be done for entry to L1. Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Link: https://lore.kernel.org/r/20220830231614.3580124-19-seanjc@google.com Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-09-26KVM: x86: Hoist nested event checks above event injection logicSean Christopherson1-36/+53
Perform nested event checks before re-injecting exceptions/events into L2. If a pending exception causes VM-Exit to L1, re-injecting events into vmcs02 is premature and wasted effort. Take care to ensure events that need to be re-injected are still re-injected if checking for nested events "fails", i.e. if KVM needs to force an immediate entry+exit to complete the to-be-re-injected event. Keep the "can_inject" logic the same for now; it too can be pushed below the nested checks, but is a slightly riskier change (see past bugs about events not being properly purged on nested VM-Exit). Add and/or modify comments to better document the various interactions. Of note is the comment regarding "blocking" previously injected NMIs and IRQs if an exception is pending. The old comment isn't wrong strictly speaking, but it failed to capture the reason why the logic even exists. Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Link: https://lore.kernel.org/r/20220830231614.3580124-18-seanjc@google.com Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-09-26KVM: x86: Use kvm_queue_exception_e() to queue #DFSean Christopherson1-12/+9
Queue #DF by recursing on kvm_multiple_exception() by way of kvm_queue_exception_e() instead of open coding the behavior. This will allow KVM to Just Work when a future commit moves exception interception checks (for L2 => L1) into kvm_multiple_exception(). No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Link: https://lore.kernel.org/r/20220830231614.3580124-17-seanjc@google.com Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-09-26KVM: x86: Make kvm_queued_exception a properly named, visible structSean Christopherson1-42/+38
Move the definition of "struct kvm_queued_exception" out of kvm_vcpu_arch in anticipation of adding a second instance in kvm_vcpu_arch to handle exceptions that occur when vectoring an injected exception and are morphed to VM-Exit instead of leading to #DF. Opportunistically take advantage of the churn to rename "nr" to "vector". No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Link: https://lore.kernel.org/r/20220830231614.3580124-15-seanjc@google.com Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-09-26KVM: x86: Rename kvm_x86_ops.queue_exception to inject_exceptionSean Christopherson1-1/+1
Rename the kvm_x86_ops hook for exception injection to better reflect reality, and to align with pretty much every other related function name in KVM. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Link: https://lore.kernel.org/r/20220830231614.3580124-14-seanjc@google.com Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-09-26KVM: x86: Treat #DBs from the emulator as fault-like (code and DR7.GD=1)Sean Christopherson1-2/+25
Add a dedicated "exception type" for #DBs, as #DBs can be fault-like or trap-like depending on the sub-type of #DB, and effectively defer the decision of what to do with the #DB to the caller. For the emulator's two calls to exception_type(), treat the #DB as fault-like, as the emulator handles only code breakpoint and general detect #DBs, both of which are fault-like. For event injection, which uses exception_type() to determine whether to set EFLAGS.RF=1 on the stack, keep the current behavior of not setting RF=1 for #DBs. Intel and AMD explicitly state RF isn't set on code #DBs, so exempting by failing the "== EXCPT_FAULT" check is correct. The only other fault-like #DB is General Detect, and despite Intel and AMD both strongly implying (through omission) that General Detect #DBs should set RF=1, hardware (multiple generations of both Intel and AMD), in fact does not. Through insider knowledge, extreme foresight, sheer dumb luck, or some combination thereof, KVM correctly handled RF for General Detect #DBs. Fixes: 38827dbd3fb8 ("KVM: x86: Do not update EFLAGS on faulting emulation") Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Link: https://lore.kernel.org/r/20220830231614.3580124-9-seanjc@google.com Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-09-26KVM: x86: Suppress code #DBs on Intel if MOV/POP SS blocking is activeSean Christopherson1-1/+18
Suppress code breakpoints if MOV/POP SS blocking is active and the guest CPU is Intel, i.e. if the guest thinks it's running on an Intel CPU. Intel CPUs inhibit code #DBs when MOV/POP SS blocking is active, whereas AMD (and its descendants) do not. Signed-off-by: Sean Christopherson <seanjc@google.com> Link: https://lore.kernel.org/r/20220830231614.3580124-6-seanjc@google.com Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-09-26KVM: x86: Allow clearing RFLAGS.RF on forced emulation to test code #DBsSean Christopherson1-2/+9
Extend force_emulation_prefix to an 'int' and use bit 1 as a flag to indicate that KVM should clear RFLAGS.RF before emulating, e.g. to allow tests to force emulation of code breakpoints in conjunction with MOV/POP SS blocking, which is impossible without KVM intervention as VMX unconditionally sets RFLAGS.RF on intercepted #UD. Make the behavior controllable so that tests can also test RFLAGS.RF=1 (again in conjunction with code #DBs). Note, clearing RFLAGS.RF won't create an infinite #DB loop as the guest's IRET from the #DB handler will return to the instruction and not the prefix, i.e. the restart won't force emulation. Opportunistically convert the permissions to the preferred octal format. Signed-off-by: Sean Christopherson <seanjc@google.com> Link: https://lore.kernel.org/r/20220830231614.3580124-5-seanjc@google.com Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
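A sketch of how bit 1 could be consumed before emulation (the flag macro below follows the description above but should be treated as an assumption):

    #define KVM_FEP_CLEAR_RFLAGS_RF BIT(1)

    /* Before emulating a forced-emulation instruction, optionally clear RF so
     * tests can exercise code #DBs in combination with MOV/POP SS blocking. */
    if (force_emulation_prefix & KVM_FEP_CLEAR_RFLAGS_RF)
            kvm_set_rflags(vcpu, kvm_get_rflags(vcpu) & ~X86_EFLAGS_RF);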
2022-09-26KVM: x86: Don't check for code breakpoints when emulating on exceptionSean Christopherson1-3/+23
Don't check for code breakpoints during instruction emulation if the emulation was triggered by exception interception. Code breakpoints are the highest priority fault-like exception, and KVM only emulates on exceptions that are fault-like. Thus, if hardware signaled a different exception, then the vCPU has already passed the stage of checking for hardware breakpoints. This is likely a glorified nop in terms of functionality, and is more for clarification; technically it is an optimization. Intel's SDM explicitly states vmcs.GUEST_RFLAGS.RF on exception interception is the same as the value that would have been saved on the stack had the exception not been intercepted, i.e. will be '1' due to all fault-like exceptions setting RF to '1'. AMD says "guest state saved ... is the processor state as of the moment the intercept triggers", but that begs the question, "when does the intercept trigger?". Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Link: https://lore.kernel.org/r/20220830231614.3580124-4-seanjc@google.com Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-09-26KVM: x86: Add missing trace points for RDMSR/WRMSR in emulator pathHou Wenlong1-0/+6
Since the RDMSR/WRMSR emulation uses a separate emulator interface, the trace points for RDMSR/WRMSR can be added in the emulator path just like in the normal path. Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com> Link: https://lore.kernel.org/r/39181a9f777a72d61a4d0bb9f6984ccbd1de2ea3.1661930557.git.houwenlong.hwl@antgroup.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-09-26KVM: x86: Return emulator error if RDMSR/WRMSR emulation failedHou Wenlong1-10/+16
The return value of emulator_{get|set}_msr_with_filter() is confusing, since MSR access errors and emulator errors are mixed. Although KVM_MSR_RET_* doesn't conflict with X86EMUL_IO_NEEDED at present, it is better to convert MSR access errors to emulator errors if an error value is needed. So move the "r < 0" handling for WRMSR emulation into the set helper function, so that only X86EMUL_* is returned by the helper functions. Also add an "r < 0" check in the get helper function; KVM doesn't return -errno today, but assuming that will always hold true is unnecessarily risky. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com> Link: https://lore.kernel.org/r/09b2847fc3bcb8937fb11738f0ccf7be7f61d9dd.1661930557.git.houwenlong.hwl@antgroup.com [sean: wrap changelog less aggressively] Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-09-26KVM: x86: Update trace function for nested VM entry to support VMXMingwei Zhang1-1/+1
Update the trace function for nested VM entry to support VMX. The existing trace function only supports nested SVM, and the information printed out is AMD specific. So, rename trace_kvm_nested_vmrun() to trace_kvm_nested_vmenter(), since 'vmenter' is generic. Add a new field 'isa' to distinguish Intel and AMD; update the output to print out VMX/SVM related naming respectively, e.g., vmcb vs. vmcs, npt vs. ept. Opportunistically update the call site of trace_kvm_nested_vmenter() to use one line per parameter. Signed-off-by: Mingwei Zhang <mizhang@google.com> Link: https://lore.kernel.org/r/20220825225755.907001-2-mizhang@google.com [sean: align indentation, s/update/rename in changelog] Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-09-23KVM: x86: Inject #UD on emulated XSETBV if XSAVES isn't enabledSean Christopherson1-0/+1
Inject #UD when emulating XSETBV if CR4.OSXSAVE is not set. This also covers the "XSAVE not supported" check, as setting CR4.OSXSAVE=1 #GPs if XSAVE is not supported (and userspace gets to keep the pieces if it forces incoherent vCPU state). Add a comment to kvm_emulate_xsetbv() to call out that the CPU checks CR4.OSXSAVE before checking for intercepts. AMD's APM implies that #UD has priority (says that intercepts are checked before #GP exceptions), while Intel's SDM says nothing about interception priority. However, testing on hardware shows that both AMD and Intel CPUs prioritize the #UD over interception. Fixes: 02d4160fbd76 ("x86: KVM: add xsetbv to the emulator") Cc: stable@vger.kernel.org Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220824033057.3576315-4-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
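A simplified sketch of the priority described above (the existing XCR0 validation is elided to a comment; the exact helper flow is an assumption):

    int kvm_emulate_xsetbv(struct kvm_vcpu *vcpu)
    {
            /* #UD for CR4.OSXSAVE=0 takes priority over the intercept. */
            if (!(kvm_read_cr4(vcpu) & X86_CR4_OSXSAVE)) {
                    kvm_queue_exception(vcpu, UD_VECTOR);
                    return 1;
            }
            /* ... existing XCR0 validation (#GP on bad value) goes here ... */
            return kvm_skip_emulated_instruction(vcpu);
    }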
2022-09-23KVM: x86: Reinstate kvm_vcpu_arch.guest_supported_xcr0Sean Christopherson1-7/+2
Reinstate the per-vCPU guest_supported_xcr0 by partially reverting commit 988896bb6182; the implicit assessment that guest_supported_xcr0 is always the same as guest_fpu.fpstate->user_xfeatures was incorrect. kvm_vcpu_after_set_cpuid() isn't the only place that sets user_xfeatures, as user_xfeatures is set to fpu_user_cfg.default_features when guest_fpu is allocated via fpu_alloc_guest_fpstate() => __fpstate_reset(). guest_supported_xcr0 on the other hand is zero-allocated. If userspace never invokes KVM_SET_CPUID2, supported XCR0 will be '0', whereas the allowed user XFEATURES will be non-zero. Practically speaking, the edge case likely doesn't matter as no sane userspace will live migrate a VM without ever doing KVM_SET_CPUID2. The primary motivation is to prepare for KVM intentionally and explicitly setting bits in user_xfeatures that are not set in guest_supported_xcr0. Because KVM_{G,S}ET_XSAVE can be used to save/restore FP+SSE state even if the host doesn't support XSAVE, KVM needs to set the FP+SSE bits in user_xfeatures even if they're not allowed in XCR0, e.g. because XCR0 isn't exposed to the guest. At that point, the simplest fix is to track the two things separately (allowed save/restore vs. allowed XCR0). Fixes: 988896bb6182 ("x86/kvm/fpu: Remove kvm_vcpu_arch.guest_supported_xcr0") Cc: stable@vger.kernel.org Cc: Leonardo Bras <leobras@redhat.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220824033057.3576315-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-09-02KVM: x86: check validity of argument to KVM_SET_MP_STATEPaolo Bonzini1-3/+17
An invalid argument to KVM_SET_MP_STATE has no effect other than making the vCPU fail to run at the next KVM_RUN. Since it is extremely unlikely that any userspace is relying on it, fail with -EINVAL just like for other architectures. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
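A minimal sketch of the validation (the precise set of accepted states is an assumption; the point is rejecting unknown values with -EINVAL):

    switch (mp_state->mp_state) {
    case KVM_MP_STATE_RUNNABLE:
    case KVM_MP_STATE_UNINITIALIZED:
    case KVM_MP_STATE_INIT_RECEIVED:
    case KVM_MP_STATE_SIPI_RECEIVED:
    case KVM_MP_STATE_HALTED:
    case KVM_MP_STATE_AP_RESET_HOLD:
            break;
    default:
            return -EINVAL;   /* previously accepted silently */
    }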
2022-09-02KVM: x86: fix memoryleak in kvm_arch_vcpu_create()Miaohe Lin1-2/+1
When allocating memory for mci_ctl2_banks fails, KVM doesn't release mce_banks, leading to a memory leak. Fix this issue by calling kfree() for it when kcalloc() fails. Fixes: 281b52780b57 ("KVM: x86: Add emulation for MSR_IA32_MCx_CTL2 MSRs.") Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Message-Id: <20220901122300.22298-1-linmiaohe@huawei.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
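A sketch of the corrected error path (allocation sizes and the goto label are illustrative; kfree(NULL) is a no-op, so freeing both pointers is safe):

    vcpu->arch.mce_banks = kcalloc(KVM_MAX_MCE_BANKS * 4, sizeof(u64),
                                   GFP_KERNEL_ACCOUNT);
    vcpu->arch.mci_ctl2_banks = kcalloc(KVM_MAX_MCE_BANKS, sizeof(u64),
                                        GFP_KERNEL_ACCOUNT);
    if (!vcpu->arch.mce_banks || !vcpu->arch.mci_ctl2_banks) {
            kfree(vcpu->arch.mce_banks);        /* previously leaked on this path */
            kfree(vcpu->arch.mci_ctl2_banks);
            goto fail_free_pio_data;            /* hypothetical label */
    }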
2022-09-02KVM: x86: Mask off unsupported and unknown bits of IA32_ARCH_CAPABILITIESJim Mattson1-4/+21
KVM should not claim to virtualize unknown IA32_ARCH_CAPABILITIES bits. When kvm_get_arch_capabilities() was originally written, there were only a few bits defined in this MSR, and KVM could virtualize all of them. However, over the years, several bits have been defined that KVM cannot just blindly pass through to the guest without additional work (such as virtualizing an MSR promised by the IA32_ARCH_CAPABILITIES feature bit). Define a mask of supported IA32_ARCH_CAPABILITIES bits, and mask off any other bits that are set in the hardware MSR. Cc: Paolo Bonzini <pbonzini@redhat.com> Fixes: 5b76a3cff011 ("KVM: VMX: Tell the nested hypervisor to skip L1D flush on vmentry") Signed-off-by: Jim Mattson <jmattson@google.com> Reviewed-by: Vipin Sharma <vipinsh@google.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Message-Id: <20220830174947.2182144-1-jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
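A hedged sketch of the masking (the mask macro name follows the commit's "mask of supported bits" description and should be treated as an assumption; the real function also ORs in synthetic bits, omitted here):

    u64 kvm_get_arch_capabilities(void)
    {
            u64 data = 0;

            if (boot_cpu_has(X86_FEATURE_ARCH_CAPABILITIES))
                    rdmsrl(MSR_IA32_ARCH_CAPABILITIES, data);

            /* Advertise only bits KVM knows how to virtualize. */
            data &= KVM_SUPPORTED_ARCH_CAP;
            return data;
    }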
2022-08-24kvm: x86: Do proper cleanup if kvm_x86_ops->vm_init() failsJunaid Shahid1-1/+7
If vm_init() fails [which can happen, for instance, if a memory allocation fails during avic_vm_init()], we need to cleanup some state in order to avoid resource leaks. Signed-off-by: Junaid Shahid <junaids@google.com> Link: https://lore.kernel.org/r/20220729224329.323378-1-junaids@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2022-08-19kvm: x86: mmu: Always flush TLBs when enabling dirty loggingJunaid Shahid1-0/+44
When A/D bits are not available, KVM uses a software access tracking mechanism, which involves making the SPTEs inaccessible. However, the clear_young() MMU notifier does not flush TLBs. So it is possible that there may still be stale, potentially writable, TLB entries. This is usually fine, but can be problematic when enabling dirty logging, because it currently only does a TLB flush if any SPTEs were modified. But if all SPTEs are in access-tracked state, then there won't be a TLB flush, which means that the guest could still possibly write to memory and not have it reflected in the dirty bitmap. So just unconditionally flush the TLBs when enabling dirty logging. As an alternative, KVM could explicitly check the MMU-Writable bit when write-protecting SPTEs to decide if a flush is needed (instead of checking the Writable bit), but given that a flush almost always happens anyway, just making it unconditional seems simpler. Signed-off-by: Junaid Shahid <junaids@google.com> Message-Id: <20220810224939.2611160-1-junaids@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
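A simplified sketch of the idea (the memslot-flags plumbing is reduced to its essence; a memslot-scoped flush helper could be used instead of the coarse kvm_flush_remote_tlbs()):

    /* When dirty logging is being enabled for the slot, write-protect it and
     * then flush unconditionally, so stale writable TLB entries left by the
     * access-tracking mechanism cannot dodge the dirty bitmap. */
    if (new->flags & KVM_MEM_LOG_DIRTY_PAGES) {
            kvm_mmu_slot_remove_write_access(kvm, new, PG_LEVEL_4K);
            kvm_flush_remote_tlbs(kvm);
    }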