kernel/linux.git/virt, branch v7.0.13

KVM: Don't WARN if memory is dirtied without a vCPU when the VM is dying

2026-06-19T11:47:58+00:00

commit 8618004d3e897c0f1b71d9a9ab860461289bb89a upstream. When marking a page dirty, complain about not having a running/loaded vCPU if and only if the VM is still alive, i.e. its refcount is non-zero. This will allow fixing a memory leak for x86 SEV-ES guests without hitting what is effectively a false positive on the WARN. For some SEV-ES VM-Exits, KVM keeps a writable mapping of a guest page across an exit to userspace, and typically unmaps the page on the next KVM_RUN. But if userspace never calls KVM_RUN after such an exit, then KVM needs to unmap the page when the vCPU is destroyed, which in turn triggers the WARN about not having a running vCPU. Alternatively, SEV-ES could temporarily load the vCPU to suppress the WARN, as is done in nested_vmx_free_vcpu() (but for completely unrelated reasons; suppressing WARN from nested_put_vmcs12_pages() is pure happenstance). But loading a vCPU during destruction is gross (ideally nVMX code would be cleaned up), risks complicating the SEV-ES code (KVM would need to ensure the temporarily load()+put() only runs when the vCPU isn't already loaded), and is ultimately pointless. The motivation for the WARN is to guard against KVM dirtying guest memory without pushing the corresponding GFN to the active vCPU's dirty ring, e.g. to ensure userspace doesn't miss a dirty page. But for the VM's refcount to reach zero, there can't be _any_ userspace mappings to the dirty ring, as mapping the dirty ring requires doing mmap() on the vCPU FD. I.e. if userspace had a valid mapping for the dirty ring, then the vCPU file and thus the owning VM would still be alive. And so since userspace can't possibly reach the dirty ring, whether or not KVM technically "misses" a push to the dirty ring is irrelevant. Reported-by: Michael Roth Cc: stable@vger.kernel.org Reviewed-by: Michael Roth Signed-off-by: Sean Christopherson Message-ID: <20260501202250.2115252-15-seanjc@google.com> Signed-off-by: Paolo Bonzini Message-ID: <20260529183549.1104619-15-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini Signed-off-by: Greg Kroah-Hartman

KVM: Reject wrapped offset in kvm_reset_dirty_gfn()

2026-05-23T11:09:37+00:00

commit 577a8d3bae0531f0e5ccfac919cd8192f920a804 upstream. kvm_reset_dirty_gfn() guards the gfn range with if (!memslot || (offset + __fls(mask)) >= memslot->npages) return; but offset is u64 and the addition is unchecked. The check can be silently bypassed by a u64 wrap. The dirty ring backing those entries is MAP_SHARED at KVM_DIRTY_LOG_PAGE_OFFSET of the vcpu fd, so the VMM can rewrite the slot and offset fields of any entry between when the kernel pushes them and when KVM_RESET_DIRTY_RINGS consumes them. On reset, kvm_dirty_ring_reset() re-reads the values via READ_ONCE() and feeds them straight back into this check; only the flags handshake is treated as the handover, the slot/offset payload is taken on trust. Crafting two entries entry[i].offset = 0xffffffffffffffc1 entry[i+1].offset = 0 makes the coalescing loop in kvm_dirty_ring_reset() compute delta = (s64)(0 - 0xffffffffffffffc1) = 63 which falls in [0, BITS_PER_LONG), so it folds entry[i+1] into the existing mask by setting bit 63. The trailing kvm_reset_dirty_gfn() call then sees offset = 0xffffffffffffffc1 and __fls(mask) = 63; the sum is 0 in u64 and the bounds check passes. That offset propagates into kvm_arch_mmu_enable_log_dirty_pt_masked() unchanged. On the legacy MMU path -- kvm_memslots_have_rmaps() == true, i.e. shadow paging, any VM that has allocated shadow roots, or a write-tracked slot -- it reaches gfn_to_rmap(), which indexes slot->arch.rmap[0][] with a near-U64_MAX gfn. That is an out-of-bounds load of a kvm_rmap_head, followed by a conditional clear of PT_WRITABLE_MASK in whatever the loaded pointer points at. The path is reachable from any process holding /dev/kvm. Range-check offset on its own first, so the addition cannot wrap. memslot->npages is bounded well below U64_MAX, so once offset < npages holds, offset + __fls(mask) (with __fls(mask) < BITS_PER_LONG) stays in range. Fixes: fb04a1eddb1a ("KVM: X86: Implement ring-based dirty memory tracking") Cc: stable@vger.kernel.org Signed-off-by: Aaron Sacks Link: https://patch.msgid.link/20260512060742.1628959-1-contact@xchglabs.com/ Signed-off-by: Paolo Bonzini Signed-off-by: Greg Kroah-Hartman

Merge tag 'kvm-x86-generic-7.0-rc3' of https://github.com/kvm-x86/linux into HEAD

2026-03-11T17:01:55+00:00

KVM generic changes for 7.0 - Remove a subtle pseudo-overlay of kvm_stats_desc, which, aside from being unnecessary and confusing, triggered compiler warnings due to -Wflex-array-member-not-at-end. - Document that vcpu->mutex is take outside of kvm->slots_lock and kvm->slots_arch_lock, which is intentional and desirable despite being rather unintuitive.

KVM: always define KVM_CAP_SYNC_MMU

2026-02-28T14:31:35+00:00

KVM_CAP_SYNC_MMU is provided by KVM's MMU notifiers, which are now always available. Move the definition from individual architectures to common code. Signed-off-by: Paolo Bonzini

KVM: remove CONFIG_KVM_GENERIC_MMU_NOTIFIER

2026-02-28T14:31:35+00:00

All architectures now use MMU notifier for KVM page table management. Remove the Kconfig symbol and the code that is used when it is disabled. Signed-off-by: Paolo Bonzini

Convert 'alloc_obj' family to use the new default GFP_KERNEL argument

2026-02-22T01:09:51+00:00

This was done entirely with mindless brute force, using git grep -l '\

treewide: Replace kmalloc with kmalloc_obj for non-scalar types

2026-02-21T09:02:28+00:00

This is the result of running the Coccinelle script from scripts/coccinelle/api/kmalloc_objs.cocci. The script is designed to avoid scalar types (which need careful case-by-case checking), and instead replace kmalloc-family calls that allocate struct or union object instances: Single allocations: kmalloc(sizeof(TYPE), ...) are replaced with: kmalloc_obj(TYPE, ...) Array allocations: kmalloc_array(COUNT, sizeof(TYPE), ...) are replaced with: kmalloc_objs(TYPE, COUNT, ...) Flex array allocations: kmalloc(struct_size(PTR, FAM, COUNT), ...) are replaced with: kmalloc_flex(*PTR, FAM, COUNT, ...) (where TYPE may also be *VAR) The resulting allocations no longer return "void *", instead returning "TYPE *". Signed-off-by: Kees Cook

Merge tag 'kvm-x86-pmu-6.20' of https://github.com/kvm-x86/linux into HEAD

2026-02-11T17:45:40+00:00

KVM mediated PMU support for 6.20 Add support for mediated PMUs, where KVM gives the guest full ownership of PMU hardware (contexted switched around the fastpath run loop) and allows direct access to data MSRs and PMCs (restricted by the vPMU model), but intercepts access to control registers, e.g. to enforce event filtering and to prevent the guest from profiling sensitive host state. To keep overall complexity reasonable, mediated PMU usage is all or nothing for a given instance of KVM (controlled via module param). The Mediated PMU is disabled default, partly to maintain backwards compatilibity for existing setup, partly because there are tradeoffs when running with a mediated PMU that may be non-starters for some use cases, e.g. the host loses the ability to profile guests with mediated PMUs, the fastpath run loop is also a blind spot, entry/exit transitions are more expensive, etc. Versus the emulated PMU, where KVM is "just another perf user", the mediated PMU delivers more accurate profiling and monitoring (no risk of contention and thus dropped events), with significantly less overhead (fewer exits and faster emulation/programming of event selectors) E.g. when running Specint-2017 on a single-socket Sapphire Rapids with 56 cores and no-SMT, and using perf from within the guest: Perf command: a. basic-sampling: perf record -F 1000 -e 6-instructions -a --overwrite b. multiplex-sampling: perf record -F 1000 -e 10-instructions -a --overwrite Guest performance overhead: --------------------------------------------------------------------------- | Test case | emulated vPMU | all passthrough | passthrough with | | | | | event filters | --------------------------------------------------------------------------- | basic-sampling | 33.62% | 4.24% | 6.21% | --------------------------------------------------------------------------- | multiplex-sampling | 79.32% | 7.34% | 10.45% | ---------------------------------------------------------------------------

Merge tag 'kvm-x86-apic-6.20' of https://github.com/kvm-x86/linux into HEAD

2026-02-11T17:45:32+00:00

KVM x86 APIC-ish changes for 6.20 - Fix a benign bug where KVM could use the wrong memslots (ignored SMM) when creating a vCPU-specific mapping of guest memory. - Clean up KVM's handling of marking mapped vCPU pages dirty. - Drop a pile of *ancient* sanity checks hidden behind in KVM's unused ASSERT() macro, most of which could be trivially triggered by the guest and/or user, and all of which were useless. - Fold "struct dest_map" into its sole user, "struct rtc_status", to make it more obvious what the weird parameter is used for, and to allow burying the RTC shenanigans behind CONFIG_KVM_IOAPIC=y. - Bury all of ioapic.h and KVM_IRQCHIP_KERNEL behind CONFIG_KVM_IOAPIC=y. - Add a regression test for recent APICv update fixes. - Rework KVM's handling of VMCS updates while L2 is active to temporarily switch to vmcs01 instead of deferring the update until the next nested VM-Exit. The deferred updates approach directly contributed to several bugs, was proving to be a maintenance burden due to the difficulty in auditing the correctness of deferred updates, and was polluting "struct nested_vmx" with a growing pile of booleans. - Handle "hardware APIC ISR", a.k.a. SVI, updates in kvm_apic_update_apicv() to consolidate the updates, and to co-locate SVI updates with the updates for KVM's own cache of ISR information. - Drop a dead function declaration.

Merge tag 'kvm-x86-gmem-6.20' of https://github.com/kvm-x86/linux into HEAD

2026-02-11T17:45:12+00:00

KVM guest_memfd changes for 6.20 - Remove kvm_gmem_populate()'s preparation tracking and half-baked hugepage handling, and instead rely on SNP (the only user of the tracking) to do its own tracking via the RMP. - Retroactively document and enforce (for SNP) that KVM_SEV_SNP_LAUNCH_UPDATE and KVM_TDX_INIT_MEM_REGION require the source page to be 4KiB aligned, to avoid non-trivial complexity for a non-existent usecase (and because in-place conversion simply can't support unaligned sources). - When populating guest_memfd memory, GUP the source page in common code and pass the refcounted page to the vendor callback, instead of letting vendor code do the heavy lifting. Doing so avoids a looming deadlock bug with in-place due an AB-BA conflict betwee mmap_lock and guest_memfd's filemap invalidate lock.