summaryrefslogtreecommitdiff
path: root/arch/x86/kvm
AgeCommit message (Collapse)AuthorFilesLines
2025-02-14KVM: x86/mmu: Only check gfn age in shadow MMU if indirect_shadow_pages > 0James Houghton1-2/+7
When aging SPTEs and the TDP MMU is enabled, process the shadow MMU if and only if the VM has at least one shadow page, as opposed to checking if the VM has rmaps. Checking for rmaps will effectively yield a false positive if the VM ran nested TDP VMs in the past, but is not currently doing so. Signed-off-by: James Houghton <jthoughton@google.com> Acked-by: Yu Zhao <yuzhao@google.com> Link: https://lore.kernel.org/r/20250204004038.1680123-8-jthoughton@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-14KVM: x86/mmu: Skip shadow MMU test_young if TDP MMU reports page as youngJames Houghton1-9/+9
Reorder the processing of the TDP MMU versus the shadow MMU when aging SPTEs, and skip the shadow MMU entirely in the test-only case if the TDP MMU reports that the page is young, i.e. completely avoid taking mmu_lock if the TDP MMU SPTE is young. Swap the order for the test-and-age helper as well for consistency. Signed-off-by: James Houghton <jthoughton@google.com> Acked-by: Yu Zhao <yuzhao@google.com> Link: https://lore.kernel.org/r/20250204004038.1680123-7-jthoughton@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-14KVM: x86/mmu: Age TDP MMU SPTEs without holding mmu_lockSean Christopherson3-13/+34
Walk the TDP MMU in an RCU read-side critical section without holding mmu_lock when harvesting and potentially updating age information on TDP MMU SPTEs. Add a new macro to do RCU-safe walking of TDP MMU roots, and do all SPTE aging with atomic updates; while clobbering Accessed information is ok, KVM must not corrupt other bits, e.g. must not drop a Dirty or Writable bit when making a SPTE young.. If updating a SPTE to mark it for access tracking fails, leave it as is and treat it as if it were young. If the spte is being actively modified, it is most likely young. Acquire and release mmu_lock for write when harvesting age information from the shadow MMU, as the shadow MMU doesn't yet support aging outside of mmu_lock. Suggested-by: Yu Zhao <yuzhao@google.com> Signed-off-by: James Houghton <jthoughton@google.com> Reviewed-by: David Matlack <dmatlack@google.com> Link: https://lore.kernel.org/r/20250204004038.1680123-5-jthoughton@google.com [sean: massage changelog] Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-14KVM: x86/mmu: Always update A/D-disabled SPTEs atomicallySean Christopherson1-4/+6
In anticipation of aging SPTEs outside of mmu_lock, force A/D-disabled SPTEs to be updated atomically, as aging A/D-disabled SPTEs will mark them for access-tracking outside of mmu_lock. Coupled with restoring access- tracked SPTEs in the fast page fault handler, the end result is that A/D-disable SPTEs will be volatile at all times. Reviewed-by: James Houghton <jthoughton@google.com> Link: https://lore.kernel.org/all/Z60bhK96JnKIgqZQ@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-14KVM: x86/mmu: Don't force atomic update if only the Accessed bit is volatileJames Houghton4-28/+26
Don't force SPTE modifications to be done atomically if the only volatile bit in the SPTE is the Accessed bit. KVM and the primary MMU tolerate stale aging state, and the probability of an Accessed bit A/D assist being clobbered *and* affecting again is likely far lower than the probability of consuming stale information due to not flushing TLBs when aging. Rename spte_has_volatile_bits() to spte_needs_atomic_update() to better capture the nature of the helper. Opportunstically do s/write/update on the TDP MMU wrapper, as it's not simply the "write" that needs to be done atomically, it's the entire update, i.e. the entire read-modify-write operation needs to be done atomically so that KVM has an accurate view of the old SPTE. Leave kvm_tdp_mmu_write_spte_atomic() as is. While the name is imperfect, it pairs with kvm_tdp_mmu_write_spte(), which in turn pairs with kvm_tdp_mmu_read_spte(). And renaming all of those isn't obviously a net positive, and would require significant churn. Signed-off-by: James Houghton <jthoughton@google.com> Link: https://lore.kernel.org/r/20250204004038.1680123-6-jthoughton@google.com Co-developed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-14KVM: x86/mmu: Factor out spte atomic bit clearing routineJames Houghton1-6/+9
This new function, tdp_mmu_clear_spte_bits_atomic(), will be used in a follow-up patch to enable lockless Accessed bit clearing. Signed-off-by: James Houghton <jthoughton@google.com> Acked-by: Yu Zhao <yuzhao@google.com> Link: https://lore.kernel.org/r/20250204004038.1680123-4-jthoughton@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-12KVM: SEV: Use long-term pin when registering encrypted memory regionsGe Yang1-7/+8
When registering an encrypted memory region for SEV-MEM/SEV-ES guests, pin the pages with FOLL_TERM so that the pages are migrated out of MIGRATE_CMA/ZONE_MOVABLE. Failure to do so violates the CMA/MOVABLE mechanisms and can result in fragmentation due to unmovable pages, e.g. can make CMA allocations fail. Signed-off-by: Ge Yang <yangge1116@126.com> Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Acked-by: David Hildenbrand <david@redhat.com> Link: https://lore.kernel.org/r/1739241423-14326-1-git-send-email-yangge1116@126.com [sean: massage changelog, make @flags an unsigned int] Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-12KVM: x86: Override TSC_STABLE flag for Xen PV clocks in kvm_guest_time_update()Sean Christopherson1-20/+15
When updating PV clocks, handle the Xen-specific UNSTABLE_TSC override in the main kvm_guest_time_update() by simply clearing PVCLOCK_TSC_STABLE_BIT in the flags of the reference pvclock structure. Expand the comment to (hopefully) make it obvious that Xen clocks need to be processed after all clocks that care about the TSC_STABLE flag. No functional change intended. Cc: Paul Durrant <pdurrant@amazon.com> Cc: David Woodhouse <dwmw@amazon.co.uk> Reviewed-by: Paul Durrant <paul@xen.org> Link: https://lore.kernel.org/r/20250201013827.680235-12-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-12KVM: x86: Setup Hyper-V TSC page before Xen PV clocks (during clock update)Sean Christopherson1-1/+2
When updating paravirtual clocks, setup the Hyper-V TSC page before Xen PV clocks. This will allow dropping xen_pvclock_tsc_unstable in favor of simply clearing PVCLOCK_TSC_STABLE_BIT in the reference flags. Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Reviewed-by: Paul Durrant <paul@xen.org> Link: https://lore.kernel.org/r/20250201013827.680235-11-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-12KVM: x86: Remove per-vCPU "cache" of its reference pvclockSean Christopherson2-16/+19
Remove the per-vCPU "cache" of the reference pvclock and instead cache only the TSC shift+multiplier. All other fields in pvclock are fully recomputed by kvm_guest_time_update(), i.e. aren't actually persisted. In addition to shaving a few bytes, explicitly tracking the TSC shift/mul fields makes it easier to see that those fields are tied to hw_tsc_khz (they exist to avoid having to do expensive math in the common case). And conversely, not tracking the other fields makes it easier to see that things like the version number are pulled from the guest's copy, not from KVM's reference. Reviewed-by: Paul Durrant <paul@xen.org> Link: https://lore.kernel.org/r/20250201013827.680235-10-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-12KVM: x86: Pass reference pvclock as a param to kvm_setup_guest_pvclock()Sean Christopherson1-7/+7
Pass the reference pvclock structure that's used to setup each individual pvclock as a parameter to kvm_setup_guest_pvclock() as a preparatory step toward removing kvm_vcpu_arch.hv_clock. No functional change intended. Reviewed-by: Paul Durrant <paul@xen.org> Link: https://lore.kernel.org/r/20250201013827.680235-9-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-12KVM: x86: Set PVCLOCK_GUEST_STOPPED only for kvmclock, not for Xen PV clockSean Christopherson1-8/+9
Handle "guest stopped" propagation only for kvmclock, as the flag is set if and only if kvmclock is "active", i.e. can only be set for Xen PV clock if kvmclock *and* Xen PV clock are in-use by the guest, which creates very bizarre behavior for the guest. Simply restrict the flag to kvmclock, e.g. instead of trying to handle Xen PV clock, as propagation of PVCLOCK_GUEST_STOPPED was unintentionally added during a refactoring, and while Xen proper defines XEN_PVCLOCK_GUEST_STOPPED, there's no evidence that Xen guests actually support the flag. Check and clear pvclock_set_guest_stopped_request if and only if kvmclock is active to preserve the original behavior, i.e. keep the flag pending if kvmclock happens to be disabled when KVM processes the initial request. Fixes: aa096aa0a05f ("KVM: x86/xen: setup pvclock updates") Cc: Paul Durrant <pdurrant@amazon.com> Cc: David Woodhouse <dwmw@amazon.co.uk> Reviewed-by: Paul Durrant <paul@xen.org> Link: https://lore.kernel.org/r/20250201013827.680235-8-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-12KVM: x86: Don't bleed PVCLOCK_GUEST_STOPPED across PV clocksSean Christopherson1-5/+8
When updating a specific PV clock, make a full copy of KVM's reference copy/cache so that PVCLOCK_GUEST_STOPPED doesn't bleed across clocks. E.g. in the unlikely scenario the guest has enabled both kvmclock and Xen PV clock, a dangling GUEST_STOPPED in kvmclock would bleed into Xen PV clock. Using a local copy of the pvclock structure also sets the stage for eliminating the per-vCPU copy/cache (only the TSC frequency information actually "needs" to be cached/persisted). Fixes: aa096aa0a05f ("KVM: x86/xen: setup pvclock updates") Reviewed-by: Paul Durrant <paul@xen.org> Link: https://lore.kernel.org/r/20250201013827.680235-7-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-12KVM: x86/xen: Use guest's copy of pvclock when starting timerSean Christopherson1-5/+60
Use the guest's copy of its pvclock when starting a Xen timer, as KVM's reference copy may not be up-to-date, i.e. may yield a false positive of sorts. In the unlikely scenario that the guest is starting a Xen timer and has used a Xen pvclock in the past, but has since but turned it "off", then vcpu->arch.hv_clock may be stale, as KVM's reference copy is updated if and only if at least one pvclock is enabled. Furthermore, vcpu->arch.hv_clock is currently used by three different pvclocks: kvmclock, Xen, and Xen compat. While it's extremely unlikely a guest would ever enable multiple pvclocks, effectively sharing KVM's reference clock could yield very weird behavior. Using the guest's active Xen pvclock instead of KVM's reference will allow dropping KVM's reference copy. Fixes: 451a707813ae ("KVM: x86/xen: improve accuracy of Xen timers") Cc: Paul Durrant <pdurrant@amazon.com> Cc: David Woodhouse <dwmw@amazon.co.uk> Reviewed-by: Paul Durrant <paul@xen.org> Link: https://lore.kernel.org/r/20250201013827.680235-6-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-12KVM: x86: Process "guest stopped request" once per guest time updateSean Christopherson1-5/+12
Handle "guest stopped" requests once per guest time update in preparation of restoring KVM's historical behavior of setting PVCLOCK_GUEST_STOPPED for kvmclock and only kvmclock. For now, simply move the code to minimize the probability of an unintentional change in functionally. Note, in practice, all clocks are guaranteed to see the request (or not) even though each PV clock processes the request individual, as KVM holds vcpu->mutex (blocks KVM_KVMCLOCK_CTRL) and it should be impossible for KVM's suspend notifier to run while KVM is handling requests. And because the helper updates the reference flags, all subsequent PV clock updates will pick up PVCLOCK_GUEST_STOPPED. Note #2, once PVCLOCK_GUEST_STOPPED is restricted to kvmclock, the horrific #ifdef will go away. Cc: Paul Durrant <pdurrant@amazon.com> Cc: David Woodhouse <dwmw@amazon.co.uk> Reviewed-by: Paul Durrant <paul@xen.org> Link: https://lore.kernel.org/r/20250201013827.680235-5-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-12KVM: x86: Drop local pvclock_flags variable in kvm_guest_time_update()Sean Christopherson1-5/+2
Drop the local pvclock_flags in kvm_guest_time_update(), the local variable is immediately shoved into the per-vCPU "cache", i.e. the local variable serves no purpose. No functional change intended. Reviewed-by: Paul Durrant <paul@xen.org> Link: https://lore.kernel.org/r/20250201013827.680235-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-12KVM: x86: Eliminate "handling" of impossible errors during SUSPENDSean Christopherson1-13/+7
Drop KVM's handling of kvm_set_guest_paused() failure when reacting to a SUSPEND notification, as kvm_set_guest_paused() only "fails" if the vCPU isn't using kvmclock, and KVM's notifier callback pre-checks that kvmclock is active. I.e. barring some bizarre edge case that shouldn't be treated as an error in the first place, kvm_arch_suspend_notifier() can't fail. Reviewed-by: Paul Durrant <paul@xen.org> Link: https://lore.kernel.org/r/20250201013827.680235-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-12KVM: x86: Don't take kvm->lock when iterating over vCPUs in suspend notifierSean Christopherson1-2/+0
When queueing vCPU PVCLOCK updates in response to SUSPEND or HIBERNATE, don't take kvm->lock as doing so can trigger a largely theoretical deadlock, it is perfectly safe to iterate over the xarray of vCPUs without holding kvm->lock, and kvm->lock doesn't protect kvm_set_guest_paused() in any way (pv_time.active and pvclock_set_guest_stopped_request are protected by vcpu->mutex, not kvm->lock). Reported-by: syzbot+352e553a86e0d75f5120@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/677c0f36.050a0220.3b3668.0014.GAE@google.com Fixes: 7d62874f69d7 ("kvm: x86: implement KVM PM-notifier") Reviewed-by: Paul Durrant <paul@xen.org> Link: https://lore.kernel.org/r/20250201013827.680235-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-12KVM: x86: Defer runtime updates of dynamic CPUID bits until CPUID emulationSean Christopherson8-11/+26
Defer runtime CPUID updates until the next non-faulting CPUID emulation or KVM_GET_CPUID2, which are the only paths in KVM that consume the dynamic entries. Deferring the updates is especially beneficial to nested VM-Enter/VM-Exit, as KVM will almost always detect multiple state changes, not to mention the updates don't need to be realized while L2 is active if CPUID is being intercepted by L1 (CPUID is a mandatory intercept on Intel, but not AMD). Deferring CPUID updates shaves several hundred cycles from nested VMX roundtrips, as measured from L2 executing CPUID in a tight loop: SKX 6850 => 6450 ICX 9000 => 8800 EMR 7900 => 7700 Alternatively, KVM could update only the CPUID leaves that are affected by the state change, e.g. update XSAVE info only if XCR0 or XSS changes, but that adds non-trivial complexity and doesn't solve the underlying problem of nested transitions potentially changing both XCR0 and XSS, on both nested VM-Enter and VM-Exit. Skipping updates entirely if L2 is active and CPUID is being intercepted by L1 could work for the common case. However, simply skipping updates if L2 is active is *very* subtly dangerous and complex. Most KVM updates are triggered by changes to the current vCPU state, which may be L2 state, whereas performing updates only for L1 would requiring detecting changes to L1 state. KVM would need to either track relevant L1 state, or defer runtime CPUID updates until the next nested VM-Exit. The former is ugly and complex, while the latter comes with similar dangers to deferring all CPUID updates, and would only address the nested VM-Enter path. To guard against using stale data, disallow querying dynamic CPUID feature bits, i.e. features that KVM updates at runtime, via a compile-time assertion in guest_cpu_cap_has(). Exempt MWAIT from the rule, as the MISC_ENABLE_NO_MWAIT means that MWAIT is _conditionally_ a dynamic CPUID feature. Note, the rule could be enforced for MWAIT as well, e.g. by querying guest CPUID in kvm_emulate_monitor_mwait, but there's no obvious advtantage to doing so, and allowing MWAIT for guest_cpuid_has() opens up a different can of worms. MONITOR/MWAIT can't be virtualized (for a reasonable definition), and the nature of the MWAIT_NEVER_UD_FAULTS and MISC_ENABLE_NO_MWAIT quirks means checking X86_FEATURE_MWAIT outside of kvm_emulate_monitor_mwait() is wrong for other reasons. Beyond the aforementioned feature bits, the only other dynamic CPUID (sub)leaves are the XSAVE sizes, and similar to MWAIT, consuming those CPUID entries in KVM is all but guaranteed to be a bug. The layout for an actual XSAVE buffer depends on the format (compacted or not) and potentially the features that are actually enabled. E.g. see the logic in fpstate_clear_xstate_component() needed to poke into the guest's effective XSAVE state to clear MPX state on INIT. KVM does consume CPUID.0xD.0.{EAX,EDX} in kvm_check_cpuid() and cpuid_get_supported_xcr0(), but not EBX, which is the only dynamic output register in the leaf. Link: https://lore.kernel.org/r/20241211013302.1347853-6-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-12KVM: x86: Query X86_FEATURE_MWAIT iff userspace owns the CPUID feature bitSean Christopherson1-2/+12
Rework MONITOR/MWAIT emulation to query X86_FEATURE_MWAIT if and only if the MISC_ENABLE_NO_MWAIT quirk is enabled, in which case MWAIT is not a dynamic, KVM-controlled CPUID feature. KVM's funky ABI for that quirk is to emulate MONITOR/MWAIT as nops if userspace sets MWAIT in guest CPUID. For the case where KVM owns the MWAIT feature bit, check MISC_ENABLES itself, i.e. check the actual control, not its reflection in guest CPUID. Avoiding consumption of dynamic CPUID features will allow KVM to defer runtime CPUID updates until kvm_emulate_cpuid(), i.e. until the updates become visible to the guest. Alternatively, KVM could play other games with runtime CPUID updates, e.g. by precisely specifying which feature bits to update, but doing so adds non-trivial complexity and doesn't solve the underlying issue of unnecessary updates causing meaningful overhead for nested virtualization roundtrips. Link: https://lore.kernel.org/r/20241211013302.1347853-5-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-12KVM: x86: Apply TSX_CTRL_CPUID_CLEAR if and only if the vCPU has RTM or HLESean Christopherson1-1/+2
When emulating CPUID, retrieve MSR_IA32_TSX_CTRL.TSX_CTRL_CPUID_CLEAR if and only if RTM and/or HLE feature bits need to be cleared. Getting the MSR value is unnecessary if neither bit is set, and avoiding the lookup saves ~80 cycles for vCPUs without RTM or HLE. Cc: Jim Mattson <jmattson@google.com> Reviewed-by: Jim Mattson <jmattson@google.com> Link: https://lore.kernel.org/r/20241211013302.1347853-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-12KVM: x86: Use for-loop to iterate over XSTATE size entriesSean Christopherson1-15/+14
Rework xstate_required_size() to use a for-loop and continue, to make it more obvious that the xstate_sizes[] lookups are indeed correctly bounded, and to make it (hopefully) easier to understand that the loop is iterating over supported XSAVE features. Link: https://lore.kernel.org/r/20241211013302.1347853-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-12KVM: x86/cpuid: add type suffix to decimal const 48 fix building warningEthan Zhao1-1/+1
The default type of a decimal constant is determined by the magnitude of its value. If the value falls within the range of int, its type is int; otherwise, if it falls within the range of unsigned int, its type is unsigned int. This results in the constant 48 being of type int. In the following min call, g_phys_as = min(g_phys_as, 48); This leads to a building warning/error (CONFIG_KVM_WERROR=y) caused by the mismatch between the types of the two arguments to macro min. By adding the suffix U to explicitly declare the type of the constant, this issue is fixed. Signed-off-by: Ethan Zhao <haifeng.zhao@linux.intel.com> Link: https://lore.kernel.org/r/20250127013837.12983-1-haifeng.zhao@linux.intel.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-12KVM: x86: Clear pv_unhalted on all transitions to KVM_MP_STATE_RUNNABLEJim Mattson3-2/+2
In kvm_set_mp_state(), ensure that vcpu->arch.pv.pv_unhalted is always cleared on a transition to KVM_MP_STATE_RUNNABLE, so that the next HLT instruction will be respected. Fixes: 6aef266c6e17 ("kvm hypervisor : Add a hypercall to KVM hypervisor to support pv-ticketlocks") Fixes: b6b8a1451fc4 ("KVM: nVMX: Rework interception of IRQs and NMIs") Fixes: 38c0b192bd6d ("KVM: SVM: leave halted state on vmexit") Fixes: 1a65105a5aba ("KVM: x86/xen: handle PV spinlocks slowpath") Signed-off-by: Jim Mattson <jmattson@google.com> Link: https://lore.kernel.org/r/20250113200150.487409-3-jmattson@google.com [sean: add Xen PV spinlocks to the list of Fixes, tweak changelog] Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-12KVM: x86: Introduce kvm_set_mp_state()Jim Mattson7-19/+23
Replace all open-coded assignments to vcpu->arch.mp_state with calls to a new helper, kvm_set_mp_state(), to centralize all changes to mp_state. No functional change intended. Signed-off-by: Jim Mattson <jmattson@google.com> Link: https://lore.kernel.org/r/20250113200150.487409-2-jmattson@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-12KVM: x86: Use kvfree_rcu() to free old optimized APIC mapLi RongQing1-8/+1
Use kvfree_rcu() to free the old optimized APIC instead of open coding a rough equivalent via call_rcu() and a callback function. Note, there is a subtle function change as rcu_barrier() doesn't wait on kvfree_rcu(), but does wait on call_rcu(). Not forcing rcu_barrier() to wait is safe and desirable in this case, as KVM doesn't care when an old map is actually freed. In fact, using kvfree_rcu() fixes a largely theoretical use-after-free. Because KVM _doesn't_ do rcu_barrier() to wait for kvm_apic_map_free() to complete, if KVM-the-module is unloaded in the RCU grace period before kvm_apic_map_free() is invoked, KVM's callback could run after module unload. Signed-off-by: Li RongQing <lirongqing@baidu.com> Reviewed-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com> Link: https://lore.kernel.org/r/20250122073456.2950-1-lirongqing@baidu.com [sean: rework changelog, call out rcu_barrier() interaction] Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-12KVM: x86: Wake vCPU for PIC interrupt injection iff a valid IRQ was foundLiam Ni1-1/+1
When updating the emulated PIC IRQ status, set "wakeup_needed" if and only if a new interrupt was found, i.e. if the incoming level is non-zero and an IRQ is being raised. The bug is relatively benign, as KVM will signal a spurious wakeup, e.g. set KVM_REQ_EVENT and kick target vCPUs, but KVM will never actually inject a spurious IRQ as kvm_cpu_has_extint() cares only about the "output" field. Fixes: 7049467b5383 ("KVM: remove isr_ack logic from PIC") Signed-off-by: Liam Ni <zhiguangni01@gmail.com> Link: https://lore.kernel.org/r/CACZJ9cX2R_=qgvLdaqbB_DUJhv08c674b67Ln_Qb9yyVwgE16w@mail.gmail.com [sean: reconstruct patch, rewrite changelog] Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-12KVM: x86: Load DR6 with guest value only before entering .vcpu_run() loopSean Christopherson5-11/+17
Move the conditional loading of hardware DR6 with the guest's DR6 value out of the core .vcpu_run() loop to fix a bug where KVM can load hardware with a stale vcpu->arch.dr6. When the guest accesses a DR and host userspace isn't debugging the guest, KVM disables DR interception and loads the guest's values into hardware on VM-Enter and saves them on VM-Exit. This allows the guest to access DRs at will, e.g. so that a sequence of DR accesses to configure a breakpoint only generates one VM-Exit. For DR0-DR3, the logic/behavior is identical between VMX and SVM, and also identical between KVM_DEBUGREG_BP_ENABLED (userspace debugging the guest) and KVM_DEBUGREG_WONT_EXIT (guest using DRs), and so KVM handles loading DR0-DR3 in common code, _outside_ of the core kvm_x86_ops.vcpu_run() loop. But for DR6, the guest's value doesn't need to be loaded into hardware for KVM_DEBUGREG_BP_ENABLED, and SVM provides a dedicated VMCB field whereas VMX requires software to manually load the guest value, and so loading the guest's value into DR6 is handled by {svm,vmx}_vcpu_run(), i.e. is done _inside_ the core run loop. Unfortunately, saving the guest values on VM-Exit is initiated by common x86, again outside of the core run loop. If the guest modifies DR6 (in hardware, when DR interception is disabled), and then the next VM-Exit is a fastpath VM-Exit, KVM will reload hardware DR6 with vcpu->arch.dr6 and clobber the guest's actual value. The bug shows up primarily with nested VMX because KVM handles the VMX preemption timer in the fastpath, and the window between hardware DR6 being modified (in guest context) and DR6 being read by guest software is orders of magnitude larger in a nested setup. E.g. in non-nested, the VMX preemption timer would need to fire precisely between #DB injection and the #DB handler's read of DR6, whereas with a KVM-on-KVM setup, the window where hardware DR6 is "dirty" extends all the way from L1 writing DR6 to VMRESUME (in L1). L1's view: ========== <L1 disables DR interception> CPU 0/KVM-7289 [023] d.... 2925.640961: kvm_entry: vcpu 0 A: L1 Writes DR6 CPU 0/KVM-7289 [023] d.... 2925.640963: <hack>: Set DRs, DR6 = 0xffff0ff1 B: CPU 0/KVM-7289 [023] d.... 2925.640967: kvm_exit: vcpu 0 reason EXTERNAL_INTERRUPT intr_info 0x800000ec D: L1 reads DR6, arch.dr6 = 0 CPU 0/KVM-7289 [023] d.... 2925.640969: <hack>: Sync DRs, DR6 = 0xffff0ff0 CPU 0/KVM-7289 [023] d.... 2925.640976: kvm_entry: vcpu 0 L2 reads DR6, L1 disables DR interception CPU 0/KVM-7289 [023] d.... 2925.640980: kvm_exit: vcpu 0 reason DR_ACCESS info1 0x0000000000000216 CPU 0/KVM-7289 [023] d.... 2925.640983: kvm_entry: vcpu 0 CPU 0/KVM-7289 [023] d.... 2925.640983: <hack>: Set DRs, DR6 = 0xffff0ff0 L2 detects failure CPU 0/KVM-7289 [023] d.... 2925.640987: kvm_exit: vcpu 0 reason HLT L1 reads DR6 (confirms failure) CPU 0/KVM-7289 [023] d.... 2925.640990: <hack>: Sync DRs, DR6 = 0xffff0ff0 L0's view: ========== L2 reads DR6, arch.dr6 = 0 CPU 23/KVM-5046 [001] d.... 3410.005610: kvm_exit: vcpu 23 reason DR_ACCESS info1 0x0000000000000216 CPU 23/KVM-5046 [001] ..... 3410.005610: kvm_nested_vmexit: vcpu 23 reason DR_ACCESS info1 0x0000000000000216 L2 => L1 nested VM-Exit CPU 23/KVM-5046 [001] ..... 3410.005610: kvm_nested_vmexit_inject: reason: DR_ACCESS ext_inf1: 0x0000000000000216 CPU 23/KVM-5046 [001] d.... 3410.005610: kvm_entry: vcpu 23 CPU 23/KVM-5046 [001] d.... 3410.005611: kvm_exit: vcpu 23 reason VMREAD CPU 23/KVM-5046 [001] d.... 3410.005611: kvm_entry: vcpu 23 CPU 23/KVM-5046 [001] d.... 3410.005612: kvm_exit: vcpu 23 reason VMREAD CPU 23/KVM-5046 [001] d.... 3410.005612: kvm_entry: vcpu 23 L1 writes DR7, L0 disables DR interception CPU 23/KVM-5046 [001] d.... 3410.005612: kvm_exit: vcpu 23 reason DR_ACCESS info1 0x0000000000000007 CPU 23/KVM-5046 [001] d.... 3410.005613: kvm_entry: vcpu 23 L0 writes DR6 = 0 (arch.dr6) CPU 23/KVM-5046 [001] d.... 3410.005613: <hack>: Set DRs, DR6 = 0xffff0ff0 A: <L1 writes DR6 = 1, no interception, arch.dr6 is still '0'> B: CPU 23/KVM-5046 [001] d.... 3410.005614: kvm_exit: vcpu 23 reason PREEMPTION_TIMER CPU 23/KVM-5046 [001] d.... 3410.005614: kvm_entry: vcpu 23 C: L0 writes DR6 = 0 (arch.dr6) CPU 23/KVM-5046 [001] d.... 3410.005614: <hack>: Set DRs, DR6 = 0xffff0ff0 L1 => L2 nested VM-Enter CPU 23/KVM-5046 [001] d.... 3410.005616: kvm_exit: vcpu 23 reason VMRESUME L0 reads DR6, arch.dr6 = 0 Reported-by: John Stultz <jstultz@google.com> Closes: https://lkml.kernel.org/r/CANDhNCq5_F3HfFYABqFGCA1bPd_%2BxgNj-iDQhH4tDk%2Bwi8iZZg%40mail.gmail.com Fixes: 375e28ffc0cf ("KVM: X86: Set host DR6 only on VMX and for KVM_DEBUGREG_WONT_EXIT") Fixes: d67668e9dd76 ("KVM: x86, SVM: isolate vcpu->arch.dr6 from vmcb->save.dr6") Cc: stable@vger.kernel.org Cc: Jim Mattson <jmattson@google.com> Tested-by: John Stultz <jstultz@google.com> Link: https://lore.kernel.org/r/20250125011833.3644371-1-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-12KVM: nSVM: Enter guest mode before initializing nested NPT MMUSean Christopherson2-6/+6
When preparing vmcb02 for nested VMRUN (or state restore), "enter" guest mode prior to initializing the MMU for nested NPT so that guest_mode is set in the MMU's role. KVM's model is that all L2 MMUs are tagged with guest_mode, as the behavior of hypervisor MMUs tends to be significantly different than kernel MMUs. Practically speaking, the bug is relatively benign, as KVM only directly queries role.guest_mode in kvm_mmu_free_guest_mode_roots() and kvm_mmu_page_ad_need_write_protect(), which SVM doesn't use, and in paths that are optimizations (mmu_page_zap_pte() and shadow_mmu_try_split_huge_pages()). And while the role is incorprated into shadow page usage, because nested NPT requires KVM to be using NPT for L1, reusing shadow pages across L1 and L2 is impossible as L1 MMUs will always have direct=1, while L2 MMUs will have direct=0. Hoist the TLB processing and setting of HF_GUEST_MASK to the beginning of the flow instead of forcing guest_mode in the MMU, as nothing in nested_vmcb02_prepare_control() between the old and new locations touches TLB flush requests or HF_GUEST_MASK, i.e. there's no reason to present inconsistent vCPU state to the MMU. Fixes: 69cb877487de ("KVM: nSVM: move MMU setup to nested_prepare_vmcb_control") Cc: stable@vger.kernel.org Reported-by: Yosry Ahmed <yosry.ahmed@linux.dev> Reviewed-by: Yosry Ahmed <yosry.ahmed@linux.dev> Link: https://lore.kernel.org/r/20250130010825.220346-1-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-12KVM: x86: Reject Hyper-V's SEND_IPI hypercalls if local APIC isn't in-kernelSean Christopherson1-1/+5
Advertise support for Hyper-V's SEND_IPI and SEND_IPI_EX hypercalls if and only if the local API is emulated/virtualized by KVM, and explicitly reject said hypercalls if the local APIC is emulated in userspace, i.e. don't rely on userspace to opt-in to KVM_CAP_HYPERV_ENFORCE_CPUID. Rejecting SEND_IPI and SEND_IPI_EX fixes a NULL-pointer dereference if Hyper-V enlightenments are exposed to the guest without an in-kernel local APIC: dump_stack+0xbe/0xfd __kasan_report.cold+0x34/0x84 kasan_report+0x3a/0x50 __apic_accept_irq+0x3a/0x5c0 kvm_hv_send_ipi.isra.0+0x34e/0x820 kvm_hv_hypercall+0x8d9/0x9d0 kvm_emulate_hypercall+0x506/0x7e0 __vmx_handle_exit+0x283/0xb60 vmx_handle_exit+0x1d/0xd0 vcpu_enter_guest+0x16b0/0x24c0 vcpu_run+0xc0/0x550 kvm_arch_vcpu_ioctl_run+0x170/0x6d0 kvm_vcpu_ioctl+0x413/0xb20 __se_sys_ioctl+0x111/0x160 do_syscal1_64+0x30/0x40 entry_SYSCALL_64_after_hwframe+0x67/0xd1 Note, checking the sending vCPU is sufficient, as the per-VM irqchip_mode can't be modified after vCPUs are created, i.e. if one vCPU has an in-kernel local APIC, then all vCPUs have an in-kernel local APIC. Reported-by: Dongjie Zou <zoudongjie@huawei.com> Fixes: 214ff83d4473 ("KVM: x86: hyperv: implement PV IPI send hypercalls") Fixes: 2bc39970e932 ("x86/kvm/hyper-v: Introduce KVM_GET_SUPPORTED_HV_CPUID") Cc: stable@vger.kernel.org Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Link: https://lore.kernel.org/r/20250118003454.2619573-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-12KVM: SVM: Ensure PSP module is initialized if KVM module is built-inSean Christopherson1-0/+10
The kernel's initcall infrastructure lacks the ability to express dependencies between initcalls, whereas the modules infrastructure automatically handles dependencies via symbol loading. Ensure the PSP SEV driver is initialized before proceeding in sev_hardware_setup() if KVM is built-in as the dependency isn't handled by the initcall infrastructure. Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: Ashish Kalra <ashish.kalra@amd.com> Link: https://lore.kernel.org/r/f78ddb64087df27e7bcb1ae0ab53f55aa0804fab.1739226950.git.ashish.kalra@amd.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-11KVM: SEV: Use to_kvm_sev_info() for fetching kvm_sev_info structNikunj A Dadhania2-78/+54
Simplify code by replacing &to_kvm_svm(kvm)->sev_info with to_kvm_sev_info() helper function. Wherever possible, drop the local variable declaration and directly use the helper instead. No functional changes. Signed-off-by: Nikunj A Dadhania <nikunj@amd.com> Reviewed-by: Pavan Kumar Paluri <papaluri@amd.com> Reviewed-by: Pankaj Gupta <pankaj.gupta@amd.com> Link: https://lore.kernel.org/r/20250123055140.144378-1-nikunj@amd.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-11KVM: x86/xen: Only write Xen hypercall page for guest writes to MSRDavid Woodhouse1-1/+7
The Xen hypercall page MSR is write-only. When the guest writes an address to the MSR, the hypervisor populates the referenced page with hypercall functions. There is no reason for the host ever to write to the MSR, and it isn't even readable. Allowing host writes to trigger the hypercall page allows userspace to attack the kernel, as kvm_xen_write_hypercall_page() takes multiple locks and writes to guest memory. E.g. if userspace sets the MSR to MSR_IA32_XSS, KVM's write to MSR_IA32_XSS during vCPU creation will trigger an SRCU violation due to writing guest memory: ============================= WARNING: suspicious RCU usage 6.13.0-rc3 ----------------------------- include/linux/kvm_host.h:1046 suspicious rcu_dereference_check() usage! stack backtrace: CPU: 6 UID: 1000 PID: 1101 Comm: repro Not tainted 6.13.0-rc3 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 Call Trace: <TASK> dump_stack_lvl+0x7f/0x90 lockdep_rcu_suspicious+0x176/0x1c0 kvm_vcpu_gfn_to_memslot+0x259/0x280 kvm_vcpu_write_guest+0x3a/0xa0 kvm_xen_write_hypercall_page+0x268/0x300 kvm_set_msr_common+0xc44/0x1940 vmx_set_msr+0x9db/0x1fc0 kvm_vcpu_reset+0x857/0xb50 kvm_arch_vcpu_create+0x37e/0x4d0 kvm_vm_ioctl+0x669/0x2100 __x64_sys_ioctl+0xc1/0xf0 do_syscall_64+0xc5/0x210 entry_SYSCALL_64_after_hwframe+0x4b/0x53 RIP: 0033:0x7feda371b539 While the MSR index isn't strictly ABI, i.e. can theoretically float to any value, in practice no known VMM sets the MSR index to anything other than 0x40000000 or 0x40000200. Reported-by: syzbot+cdeaeec70992eca2d920@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/679258d4.050a0220.2eae65.000a.GAE@google.com Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Link: https://lore.kernel.org/r/de0437379dfab11e431a23c8ce41a29234c06cbf.camel@infradead.org Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-04KVM: x86/mmu: Ensure NX huge page recovery thread is alive before wakingSean Christopherson1-7/+26
When waking a VM's NX huge page recovery thread, ensure the thread is actually alive before trying to wake it. Now that the thread is spawned on-demand during KVM_RUN, a VM without a recovery thread is reachable via the related module params. BUG: kernel NULL pointer dereference, address: 0000000000000040 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 RIP: 0010:vhost_task_wake+0x5/0x10 Call Trace: <TASK> set_nx_huge_pages+0xcc/0x1e0 [kvm] param_attr_store+0x8a/0xd0 module_attr_store+0x1a/0x30 kernfs_fop_write_iter+0x12f/0x1e0 vfs_write+0x233/0x3e0 ksys_write+0x60/0xd0 do_syscall_64+0x5b/0x160 entry_SYSCALL_64_after_hwframe+0x4b/0x53 RIP: 0033:0x7f3b52710104 </TASK> Modules linked in: kvm_intel kvm CR2: 0000000000000040 Fixes: 931656b9e2ff ("kvm: defer huge page recovery vhost task to later") Cc: stable@vger.kernel.org Cc: Keith Busch <kbusch@kernel.org> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20250124234623.3609069-1-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-02-04KVM: remove kvm_arch_post_init_vmPaolo Bonzini1-6/+1
The only statement in a kvm_arch_post_init_vm implementation can be moved into the x86 kvm_arch_init_vm. Do so and remove all traces from architecture-independent code. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-02-04kvm: x86: SRSO_USER_KERNEL_NO is not synthesizedPaolo Bonzini1-1/+1
SYNTHESIZED_F() generally is used together with setup_force_cpu_cap(), i.e. when it makes sense to present the feature even if cpuid does not have it *and* the VM is not able to see the difference. For example, it can be used when mitigations on the host automatically protect the guest as well. The "SYNTHESIZED_F(SRSO_USER_KERNEL_NO)" line came in as a conflict resolution between the CPUID overhaul from the KVM tree and support for the feature in the x86 tree. Using it right now does not hurt, or make a difference for that matter, because there is no setup_force_cpu_cap(X86_FEATURE_SRSO_USER_KERNEL_NO). However, it is a little less future proof in case such a setup_force_cpu_cap() appears later, for a case where the kernel somehow is not vulnerable but the guest would have to apply the mitigation. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-01-25Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds34-909/+1688
Pull kvm updates from Paolo Bonzini: "Loongarch: - Clear LLBCTL if secondary mmu mapping changes - Add hypercall service support for usermode VMM x86: - Add a comment to kvm_mmu_do_page_fault() to explain why KVM performs a direct call to kvm_tdp_page_fault() when RETPOLINE is enabled - Ensure that all SEV code is compiled out when disabled in Kconfig, even if building with less brilliant compilers - Remove a redundant TLB flush on AMD processors when guest CR4.PGE changes - Use str_enabled_disabled() to replace open coded strings - Drop kvm_x86_ops.hwapic_irr_update() as KVM updates hardware's APICv cache prior to every VM-Enter - Overhaul KVM's CPUID feature infrastructure to track all vCPU capabilities instead of just those where KVM needs to manage state and/or explicitly enable the feature in hardware. Along the way, refactor the code to make it easier to add features, and to make it more self-documenting how KVM is handling each feature - Rework KVM's handling of VM-Exits during event vectoring; this plugs holes where KVM unintentionally puts the vCPU into infinite loops in some scenarios (e.g. if emulation is triggered by the exit), and brings parity between VMX and SVM - Add pending request and interrupt injection information to the kvm_exit and kvm_entry tracepoints respectively - Fix a relatively benign flaw where KVM would end up redoing RDPKRU when loading guest/host PKRU, due to a refactoring of the kernel helpers that didn't account for KVM's pre-checking of the need to do WRPKRU - Make the completion of hypercalls go through the complete_hypercall function pointer argument, no matter if the hypercall exits to userspace or not. Previously, the code assumed that KVM_HC_MAP_GPA_RANGE specifically went to userspace, and all the others did not; the new code need not special case KVM_HC_MAP_GPA_RANGE and in fact does not care at all whether there was an exit to userspace or not - As part of enabling TDX virtual machines, support support separation of private/shared EPT into separate roots. When TDX will be enabled, operations on private pages will need to go through the privileged TDX Module via SEAMCALLs; as a result, they are limited and relatively slow compared to reading a PTE. The patches included in 6.14 allow KVM to keep a mirror of the private EPT in host memory, and define entries in kvm_x86_ops to operate on external page tables such as the TDX private EPT - The recently introduced conversion of the NX-page reclamation kthread to vhost_task moved the task under the main process. The task is created as soon as KVM_CREATE_VM was invoked and this, of course, broke userspace that didn't expect to see any child task of the VM process until it started creating its own userspace threads. In particular crosvm refuses to fork() if procfs shows any child task, so unbreak it by creating the task lazily. This is arguably a userspace bug, as there can be other kinds of legitimate worker tasks and they wouldn't impede fork(); but it's not like userspace has a way to distinguish kernel worker tasks right now. Should they show as "Kthread: 1" in proc/.../status? x86 - Intel: - Fix a bug where KVM updates hardware's APICv cache of the highest ISR bit while L2 is active, while ultimately results in a hardware-accelerated L1 EOI effectively being lost - Honor event priority when emulating Posted Interrupt delivery during nested VM-Enter by queueing KVM_REQ_EVENT instead of immediately handling the interrupt - Rework KVM's processing of the Page-Modification Logging buffer to reap entries in the same order they were created, i.e. to mark gfns dirty in the same order that hardware marked the page/PTE dirty - Misc cleanups Generic: - Cleanup and harden kvm_set_memory_region(); add proper lockdep assertions when setting memory regions and add a dedicated API for setting KVM-internal memory regions. The API can then explicitly disallow all flags for KVM-internal memory regions - Explicitly verify the target vCPU is online in kvm_get_vcpu() to fix a bug where KVM would return a pointer to a vCPU prior to it being fully online, and give kvm_for_each_vcpu() similar treatment to fix a similar flaw - Wait for a vCPU to come online prior to executing a vCPU ioctl, to fix a bug where userspace could coerce KVM into handling the ioctl on a vCPU that isn't yet onlined - Gracefully handle xarray insertion failures; even though such failures are impossible in practice after xa_reserve(), reserving an entry is always followed by xa_store() which does not know (or differentiate) whether there was an xa_reserve() before or not RISC-V: - Zabha, Svvptc, and Ziccrse extension support for guests. None of them require anything in KVM except for detecting them and marking them as supported; Zabha adds byte and halfword atomic operations, while the others are markers for specific operation of the TLB and of LL/SC instructions respectively - Virtualize SBI system suspend extension for Guest/VM - Support firmware counters which can be used by the guests to collect statistics about traps that occur in the host Selftests: - Rework vcpu_get_reg() to return a value instead of using an out-param, and update all affected arch code accordingly - Convert the max_guest_memory_test into a more generic mmu_stress_test. The basic gist of the "conversion" is to have the test do mprotect() on guest memory while vCPUs are accessing said memory, e.g. to verify KVM and mmu_notifiers are working as intended - Play nice with treewrite builds of unsupported architectures, e.g. arm (32-bit), as KVM selftests' Makefile doesn't do anything to ensure the target architecture is actually one KVM selftests supports - Use the kernel's $(ARCH) definition instead of the target triple for arch specific directories, e.g. arm64 instead of aarch64, mainly so as not to be different from the rest of the kernel - Ensure that format strings for logging statements are checked by the compiler even when the logging statement itself is disabled - Attempt to whack the last LLC references/misses mole in the Intel PMU counters test by adding a data load and doing CLFLUSH{OPT} on the data instead of the code being executed. It seems that modern Intel CPUs have learned new code prefetching tricks that bypass the PMU counters - Fix a flaw in the Intel PMU counters test where it asserts that events are counting correctly without actually knowing what the events count given the underlying hardware; this can happen if Intel reuses a formerly microarchitecture-specific event encoding as an architectural event, as was the case for Top-Down Slots" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (151 commits) kvm: defer huge page recovery vhost task to later KVM: x86/mmu: Return RET_PF* instead of 1 in kvm_mmu_page_fault() KVM: Disallow all flags for KVM-internal memslots KVM: x86: Drop double-underscores from __kvm_set_memory_region() KVM: Add a dedicated API for setting KVM-internal memslots KVM: Assert slots_lock is held when setting memory regions KVM: Open code kvm_set_memory_region() into its sole caller (ioctl() API) LoongArch: KVM: Add hypercall service support for usermode VMM LoongArch: KVM: Clear LLBCTL if secondary mmu mapping is changed KVM: SVM: Use str_enabled_disabled() helper in svm_hardware_setup() KVM: VMX: read the PML log in the same order as it was written KVM: VMX: refactor PML terminology KVM: VMX: Fix comment of handle_vmx_instruction() KVM: VMX: Reinstate __exit attribute for vmx_exit() KVM: SVM: Use str_enabled_disabled() helper in sev_hardware_setup() KVM: x86: Avoid double RDPKRU when loading host/guest PKRU KVM: x86: Use LVT_TIMER instead of an open coded literal RISC-V: KVM: Add new exit statstics for redirected traps RISC-V: KVM: Update firmware counters for various events RISC-V: KVM: Redirect instruction access fault trap to guest ...
2025-01-25Merge tag 'hyperv-next-signed-20250123' of ↵Linus Torvalds2-2/+2
git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux Pull hyperv updates from Wei Liu: - Introduce a new set of Hyper-V headers in include/hyperv and replace the old hyperv-tlfs.h with the new headers (Nuno Das Neves) - Fixes for the Hyper-V VTL mode (Roman Kisel) - Fixes for cpu mask usage in Hyper-V code (Michael Kelley) - Document the guest VM hibernation behaviour (Michael Kelley) - Miscellaneous fixes and cleanups (Jacob Pan, John Starks, Naman Jain) * tag 'hyperv-next-signed-20250123' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux: Documentation: hyperv: Add overview of guest VM hibernation hyperv: Do not overlap the hvcall IO areas in hv_vtl_apicid_to_vp_id() hyperv: Do not overlap the hvcall IO areas in get_vtl() hyperv: Enable the hypercall output page for the VTL mode hv_balloon: Fallback to generic_online_page() for non-HV hot added mem Drivers: hv: vmbus: Log on missing offers if any Drivers: hv: vmbus: Wait for boot-time offers during boot and resume uio_hv_generic: Add a check for HV_NIC for send, receive buffers setup iommu/hyper-v: Don't assume cpu_possible_mask is dense Drivers: hv: Don't assume cpu_possible_mask is dense x86/hyperv: Don't assume cpu_possible_mask is dense hyperv: Remove the now unused hyperv-tlfs.h files hyperv: Switch from hyperv-tlfs.h to hyperv/hvhdk.h hyperv: Add new Hyper-V headers in include/hyperv hyperv: Clean up unnecessary #includes hyperv: Move hv_connection_id to hyperv-tlfs.h
2025-01-24kvm: defer huge page recovery vhost task to laterKeith Busch2-6/+19
Some libraries want to ensure they are single threaded before forking, so making the kernel's kvm huge page recovery process a vhost task of the user process breaks those. The minijail library used by crosvm is one such affected application. Defer the task to after the first VM_RUN call, which occurs after the parent process has forked all its jailed processes. This needs to happen only once for the kvm instance, so introduce some general-purpose infrastructure for that, too. It's similar in concept to pthread_once; except it is actually usable, because the callback takes a parameter. Cc: Sean Christopherson <seanjc@google.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Tested-by: Alyssa Ross <hi@alyssa.is> Signed-off-by: Keith Busch <kbusch@kernel.org> Message-ID: <20250123153543.2769928-1-kbusch@meta.com> [Move call_once API to include/linux. - Paolo] Cc: stable@vger.kernel.org Fixes: d96c77bd4eeb ("KVM: x86: switch hugepage recovery thread to vhost_task") Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-01-22Merge tag 'kthread-for-6.14-rc1' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks Pull kthread updates from Frederic Weisbecker: "Kthreads affinity follow either of 4 existing different patterns: 1) Per-CPU kthreads must stay affine to a single CPU and never execute relevant code on any other CPU. This is currently handled by smpboot code which takes care of CPU-hotplug operations. Affinity here is a correctness constraint. 2) Some kthreads _have_ to be affine to a specific set of CPUs and can't run anywhere else. The affinity is set through kthread_bind_mask() and the subsystem takes care by itself to handle CPU-hotplug operations. Affinity here is assumed to be a correctness constraint. 3) Per-node kthreads _prefer_ to be affine to a specific NUMA node. This is not a correctness constraint but merely a preference in terms of memory locality. kswapd and kcompactd both fall into this category. The affinity is set manually like for any other task and CPU-hotplug is supposed to be handled by the relevant subsystem so that the task is properly reaffined whenever a given CPU from the node comes up. Also care should be taken so that the node affinity doesn't cross isolated (nohz_full) cpumask boundaries. 4) Similar to the previous point except kthreads have a _preferred_ affinity different than a node. Both RCU boost kthreads and RCU exp kworkers fall into this category as they refer to "RCU nodes" from a distinctly distributed tree. Currently the preferred affinity patterns (3 and 4) have at least 4 identified users, with more or less success when it comes to handle CPU-hotplug operations and CPU isolation. Each of which do it in its own ad-hoc way. This is an infrastructure proposal to handle this with the following API changes: - kthread_create_on_node() automatically affines the created kthread to its target node unless it has been set as per-cpu or bound with kthread_bind[_mask]() before the first wake-up. - kthread_affine_preferred() is a new function that can be called right after kthread_create_on_node() to specify a preferred affinity different than the specified node. When the preferred affinity can't be applied because the possible targets are offline or isolated (nohz_full), the kthread is affine to the housekeeping CPUs (which means to all online CPUs most of the time or only the non-nohz_full CPUs when nohz_full= is set). kswapd, kcompactd, RCU boost kthreads and RCU exp kworkers have been converted, along with a few old drivers. Summary of the changes: - Consolidate a bunch of ad-hoc implementations of kthread_run_on_cpu() - Introduce task_cpu_fallback_mask() that defines the default last resort affinity of a task to become nohz_full aware - Add some correctness check to ensure kthread_bind() is always called before the first kthread wake up. - Default affine kthread to its preferred node. - Convert kswapd / kcompactd and remove their halfway working ad-hoc affinity implementation - Implement kthreads preferred affinity - Unify kthread worker and kthread API's style - Convert RCU kthreads to the new API and remove the ad-hoc affinity implementation" * tag 'kthread-for-6.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks: kthread: modify kernel-doc function name to match code rcu: Use kthread preferred affinity for RCU exp kworkers treewide: Introduce kthread_run_worker[_on_cpu]() kthread: Unify kthread_create_on_cpu() and kthread_create_worker_on_cpu() automatic format rcu: Use kthread preferred affinity for RCU boost kthread: Implement preferred affinity mm: Create/affine kswapd to its preferred node mm: Create/affine kcompactd to its preferred node kthread: Default affine kthread to its preferred NUMA node kthread: Make sure kthread hasn't started while binding it sched,arm64: Handle CPU isolation on last resort fallback rq selection arm64: Exclude nohz_full CPUs from 32bits el0 support lib: test_objpool: Use kthread_run_on_cpu() kallsyms: Use kthread_run_on_cpu() soc/qman: test: Use kthread_run_on_cpu() arm/bL_switcher: Use kthread_run_on_cpu()
2025-01-21Merge tag 'objtool-core-2025-01-20' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull objtool updates from Ingo Molnar: - Introduce the generic section-based annotation infrastructure a.k.a. ASM_ANNOTATE/ANNOTATE (Peter Zijlstra) - Convert various facilities to ASM_ANNOTATE/ANNOTATE: (Peter Zijlstra) - ANNOTATE_NOENDBR - ANNOTATE_RETPOLINE_SAFE - instrumentation_{begin,end}() - VALIDATE_UNRET_BEGIN - ANNOTATE_IGNORE_ALTERNATIVE - ANNOTATE_INTRA_FUNCTION_CALL - {.UN}REACHABLE - Optimize the annotation-sections parsing code (Peter Zijlstra) - Centralize annotation definitions in <linux/objtool.h> - Unify & simplify the barrier_before_unreachable()/unreachable() definitions (Peter Zijlstra) - Convert unreachable() calls to BUG() in x86 code, as unreachable() has unreliable code generation (Peter Zijlstra) - Remove annotate_reachable() and annotate_unreachable(), as it's unreliable against compiler optimizations (Peter Zijlstra) - Fix non-standard ANNOTATE_REACHABLE annotation order (Peter Zijlstra) - Robustify the annotation code by warning about unknown annotation types (Peter Zijlstra) - Allow arch code to discover jump table size, in preparation of annotated jump table support (Ard Biesheuvel) * tag 'objtool-core-2025-01-20' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/mm: Convert unreachable() to BUG() objtool: Allow arch code to discover jump table size objtool: Warn about unknown annotation types objtool: Fix ANNOTATE_REACHABLE to be a normal annotation objtool: Convert {.UN}REACHABLE to ANNOTATE objtool: Remove annotate_{,un}reachable() loongarch: Use ASM_REACHABLE x86: Convert unreachable() to BUG() unreachable: Unify objtool: Collect more annotations in objtool.h objtool: Collapse annotate sequences objtool: Convert ANNOTATE_INTRA_FUNCTION_CALL to ANNOTATE objtool: Convert ANNOTATE_IGNORE_ALTERNATIVE to ANNOTATE objtool: Convert VALIDATE_UNRET_BEGIN to ANNOTATE objtool: Convert instrumentation_{begin,end}() to ANNOTATE objtool: Convert ANNOTATE_RETPOLINE_SAFE to ANNOTATE objtool: Convert ANNOTATE_NOENDBR to ANNOTATE objtool: Generic annotation infrastructure
2025-01-20Merge branch 'kvm-mirror-page-tables' into HEADPaolo Bonzini9-110/+472
As part of enabling TDX virtual machines, support support separation of private/shared EPT into separate roots. Confidential computing solutions almost invariably have concepts of private and shared memory, but they may different a lot in the details. In SEV, for example, the bit is handled more like a permission bit as far as the page tables are concerned: the private/shared bit is not included in the physical address. For TDX, instead, the bit is more like a physical address bit, with the host mapping private memory in one half of the address space and shared in another. Furthermore, the two halves are mapped by different EPT roots and only the shared half is managed by KVM; the private half (also called Secure EPT in Intel documentation) gets managed by the privileged TDX Module via SEAMCALLs. As a result, the operations that actually change the private half of the EPT are limited and relatively slow compared to reading a PTE. For this reason the design for KVM is to keep a mirror of the private EPT in host memory. This allows KVM to quickly walk the EPT and only perform the slower private EPT operations when it needs to actually modify mid-level private PTEs. There are thus three sets of EPT page tables: external, mirror and direct. In the case of TDX (the only user of this framework) the first two cover private memory, whereas the third manages shared memory: external EPT - Hidden within the TDX module, modified via TDX module calls. mirror EPT - Bookkeeping tree used as an optimization by KVM, not used by the processor. direct EPT - Normal EPT that maps unencrypted shared memory. Managed like the EPT of a normal VM. Modifying external EPT ---------------------- Modifications to the mirrored page tables need to also perform the same operations to the private page tables, which will be handled via kvm_x86_ops. Although this prep series does not interact with the TDX module at all to actually configure the private EPT, it does lay the ground work for doing this. In some ways updating the private EPT is as simple as plumbing PTE modifications through to also call into the TDX module; however, the locking is more complicated because inserting a single PTE cannot anymore be done atomically with a single CMPXCHG. For this reason, the existing FROZEN_SPTE mechanism is used whenever a call to the TDX module updates the private EPT. FROZEN_SPTE acts basically as a spinlock on a PTE. Besides protecting operation of KVM, it limits the set of cases in which the TDX module will encounter contention on its own PTE locks. Zapping external EPT -------------------- While the framework tries to be relatively generic, and to be understandable without knowing TDX much in detail, some requirements of TDX sometimes leak; for example the private page tables also cannot be zapped while the range has anything mapped, so the mirrored/private page tables need to be protected from KVM operations that zap any non-leaf PTEs, for example kvm_mmu_reset_context() or kvm_mmu_zap_all_fast(). For normal VMs, guest memory is zapped for several reasons: user memory getting paged out by the guest, memslots getting deleted, passthrough of devices with non-coherent DMA. Confidential computing adds to these the conversion of memory between shared and privates. These operations must not zap any private memory that is in use by the guest. This is possible because the only zapping that is out of the control of KVM/userspace is paging out userspace memory, which cannot apply to guestmemfd operations. Thus a TDX VM will only zap private memory from memslot deletion and from conversion between private and shared memory which is triggered by the guest. To avoid zapping too much memory, enums are introduced so that operations can choose to target only private or shared memory, and thus only direct or mirror EPT. For example: Memslot deletion - Private and shared MMU notifier based zapping - Shared only Conversion to shared - Private only Conversion to private - Shared only Other cases of zapping will not be supported for KVM, for example APICv update or non-coherent DMA status update; for the latter, TDX will simply require that the CPU supports self-snoop and honor guest PAT unconditionally for shared memory.
2025-01-20Merge branch 'kvm-userspace-hypercall' into HEADPaolo Bonzini3-35/+67
Make the completion of hypercalls go through the complete_hypercall function pointer argument, no matter if the hypercall exits to userspace or not. Previously, the code assumed that KVM_HC_MAP_GPA_RANGE specifically went to userspace, and all the others did not; the new code need not special case KVM_HC_MAP_GPA_RANGE and in fact does not care at all whether there was an exit to userspace or not.
2025-01-20Merge tag 'kvm-x86-misc-6.14' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini26-656/+993
KVM x86 misc changes for 6.14: - Overhaul KVM's CPUID feature infrastructure to track all vCPU capabilities instead of just those where KVM needs to manage state and/or explicitly enable the feature in hardware. Along the way, refactor the code to make it easier to add features, and to make it more self-documenting how KVM is handling each feature. - Rework KVM's handling of VM-Exits during event vectoring; this plugs holes where KVM unintentionally puts the vCPU into infinite loops in some scenarios (e.g. if emulation is triggered by the exit), and brings parity between VMX and SVM. - Add pending request and interrupt injection information to the kvm_exit and kvm_entry tracepoints respectively. - Fix a relatively benign flaw where KVM would end up redoing RDPKRU when loading guest/host PKRU, due to a refactoring of the kernel helpers that didn't account for KVM's pre-checking of the need to do WRPKRU.
2025-01-20Merge tag 'kvm-x86-vmx-6.14' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini7-79/+119
KVM VMX changes for 6.14: - Fix a bug where KVM updates hardware's APICv cache of the highest ISR bit while L2 is active, while ultimately results in a hardware-accelerated L1 EOI effectively being lost. - Honor event priority when emulating Posted Interrupt delivery during nested VM-Enter by queueing KVM_REQ_EVENT instead of immediately handling the interrupt. - Drop kvm_x86_ops.hwapic_irr_update() as KVM updates hardware's APICv cache prior to every VM-Enter. - Rework KVM's processing of the Page-Modification Logging buffer to reap entries in the same order they were created, i.e. to mark gfns dirty in the same order that hardware marked the page/PTE dirty. - Misc cleanups.
2025-01-20Merge tag 'kvm-x86-svm-6.14' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini3-20/+9
KVM SVM changes for 6.14: - Macrofy the SEV=n version of the sev_xxx_guest() helpers so that the code is optimized away when building with less than brilliant compilers. - Remove a now-redundant TLB flush when guest CR4.PGE changes. - Use str_enabled_disabled() to replace open coded strings.
2025-01-20Merge tag 'kvm-x86-mmu-6.14' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini1-0/+4
KVM x86 MMU changes for 6.14: - Add a comment to kvm_mmu_do_page_fault() to explain why KVM performs a direct call to kvm_tdp_page_fault() when RETPOLINE is enabled.
2025-01-20Merge tag 'kvm-memslots-6.14' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini1-3/+4
KVM kvm_set_memory_region() cleanups and hardening for 6.14: - Add proper lockdep assertions when setting memory regions. - Add a dedicated API for setting KVM-internal memory regions. - Explicitly disallow all flags for KVM-internal memory regions.
2025-01-15KVM: x86/mmu: Return RET_PF* instead of 1 in kvm_mmu_page_fault()Yan Zhao2-4/+18
Return RET_PF* (excluding RET_PF_EMULATE/RET_PF_CONTINUE/RET_PF_INVALID) instead of 1 in kvm_mmu_page_fault(). The callers of kvm_mmu_page_fault() are KVM page fault handlers (i.e., npf_interception(), handle_ept_misconfig(), __vmx_handle_ept_violation(), kvm_handle_page_fault()). They either check if the return value is > 0 (as in npf_interception()) or pass it further to vcpu_run() to decide whether to break out of the kernel loop and return to the user when r <= 0. Therefore, returning any positive value is equivalent to returning 1. Warn if r == RET_PF_CONTINUE (which should not be a valid value) to ensure a positive return value. This is a preparation to allow TDX's EPT violation handler to check the RET_PF* value and retry internally for RET_PF_RETRY. No functional changes are intended. Signed-off-by: Yan Zhao <yan.y.zhao@intel.com> Message-ID: <20250113021138.18875-1-yan.y.zhao@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-01-15KVM: x86: Drop double-underscores from __kvm_set_memory_region()Sean Christopherson1-1/+1
Now that there's no outer wrapper for __kvm_set_memory_region() and it's static, drop its double-underscore prefix. No functional change intended. Cc: Tao Su <tao1.su@linux.intel.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com> Acked-by: Christoph Schlameuss <schlameuss@linux.ibm.com> Link: https://lore.kernel.org/r/20250111002022.1230573-5-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>