AgeCommit message (Collapse)AuthorFilesLines
2026-06-12KVM: selftests: access_tracking_perf_test: bump number of NUMA nodes to 32Maxim Levitsky1-1/+1
It's rare to find a system that has more than 4 sockets, but a system can have more than 4 NUMA nodes if each socket exposes its chiplets as separate NUMA nodes. In particular, our CI caught a failure in this test on a system with two sockets, each containing an 'AMD EPYC 7601 32-Core Processor'. Bump the limit to 32, just in case. Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Message-ID: <20260612150038.1277394-1-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2026-06-12Merge tag 'kvmarm-7.2' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEADPaolo Bonzini40-470/+651
KVM/arm64 updates for 7.2 * New features: - None. Zilch. Nada. Que dalle. * Fixes and other improvements: - Significant cleanup of the vgic-v5 PPI support which was merged in 7.1. This makes the code more maintainable, and squashes a couple of bugs in the meantime. - Set of fixes for the handling of the MMU in an NV context, particularly VNCR-triggered faults. S1POE support is fixed as well. - Large set of pKVM fixes, mostly addressing recurring issues around hypervisor tracking of donated pages in obscure cases where the donation could fail and leave things in a bizarre state. - Fixes for the so-called "lazy vgic init", which resulted in sleeping operations in non-preemptible sections. This turned out to be far more invasive than initially expected... - Reduce the overhead of L1/L2 context switch by not touching the FP registers. - Fix the way non-implemented page sizes are dealt with when a guest insists on using them for S2 translation. - The usual set of low-impact fixes and cleanups all over the map.
2026-06-12Merge branch 'kvm-single-pdptrs' into HEADPaolo Bonzini8-42/+46
The non-MMU changes/preliminary cleanups from the "split kvm_mmu in three" series[1]. The final outcome is to have a single copy of the PDPTRs (in vcpu->arch) instead of two (in root_mmu and nested_mmu). [1] https://lore.kernel.org/kvm/20260603105814.10236-1-pbonzini@redhat.com/T/#t
2026-06-12KVM: x86/mmu: move pdptrs out of the MMUPaolo Bonzini5-21/+16
PDPTRs are part of the CPU state. A bit unconventionally, they are reached via vcpu->arch.walk_mmu instead of being stored in vcpu->arch directly. That is nice in principle---it would allow TDP shadow paging to have its own PDPTRs---but it is not necessary, because EPT has no PDPTRs and NPT does not cache them. Since kvm_pdptr_read does not otherwise need the MMU, drop the pdptrs from the MMU altogether. There is however something to be careful about, in that PDPTRs are now not stored separately in root_mmu and nested_mmu for L1 and L2 guests. In practice this was already not an issue: - for EPT the VMCS0x has to keep them up to date; and for the purpose of emulation they are always loaded from the VMCS on vmentry/vmexit, thanks to the clearing of dirty and available register bitmaps in vmx_switch_vmcs() - for NPT, VCPU_EXREG_PDPTR is similarly cleared for nNPT, which does not cache the PDPTRs; while for non-nNPT the PDPTRs are loaded together with the load of CR3. Note that page table PDPTRs are not affected, since they are stored in pae_root. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20260530165545.25599-6-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2026-06-12KVM: x86: check that kvm_handle_invpcid is only invoked with shadow pagingPaolo Bonzini1-0/+3
This is true for both Intel and AMD. On Intel, "enable INVPCID" is set unconditionally if supported, but the vmexit is triggered by the "INVLPG exiting" control which is disabled by enable_ept. On AMD, KVM can intercept INVPCID if NPT is enabled but only in order to inject #UD in the guest. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20260530165545.25599-5-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2026-06-12KVM: nSVM: invalidate cached PDPTRs across nested NPT transitionsPaolo Bonzini2-9/+26
When L2 runs under nested NPT and uses PAE paging, KVM's cached PDPTRs in mmu->pdptrs[] can hold stale or wrong values after nested transitions and across migration restore, because both nested_svm_load_cr3() and svm_get_nested_state_pages() only refresh PDPTRs on the !nested_npt path. The user-visible bug is on migration restore of an L2 running with nested NPT and 32-bit PAE paging, if userspace uses KVM_SET_SREGS rather than KVM_SET_SREGS2. In that case, load_pdptrs() leaves VCPU_EXREG_PDPTR marked as available, and kvm_pdptr_read() will use a stale translation that used L1 GPAs instead of L2 nGPAs. svm_get_nested_state_pages() runs on first KVM_RUN but skips the refresh because nested_npt_enabled() is true. The CPU itself reads L2's PDPTRs correctly from memory via L1's NPT, but KVM-side walking of guest PAE page tables uses the bogus cached values. Unlike Intel's GUEST_PDPTR0..3 fields in the VMCS, SVM has no VMCB-cached PDPTR state: the in-memory PDPTEs at the current CR3 are the only source of truth, and svm_cache_reg(VCPU_EXREG_PDPTR) simply reloads them from memory via load_pdptrs(). Clearing the avail bit (and the dirty bit because !avail/dirty is invalid) to force a reload when PDPTRs are needed fixes the bug. Do the same for nested_svm_load_cr3()'s nested_npt branch, so that the invariant "PDPTRs need reloading" is handled similarly for both immediate and deferred loading. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20260530165545.25599-4-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
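The avail/dirty invariant described above can be illustrated with a small userspace sketch. This is not the KVM code: `struct reg_cache`, `REG_PDPTR`, `reload_pdptrs()`, `invalidate_pdptrs()` and `read_pdptr()` are hypothetical names chosen for illustration, and the "guest memory" is just an array passed in by the caller.

```c
#include <stdint.h>

/* Hypothetical register index, standing in for VCPU_EXREG_PDPTR. */
enum { REG_PDPTR = 0 };

struct reg_cache {
    uint32_t avail;      /* bit set: the cached value is valid */
    uint32_t dirty;      /* bit set: the cached value must be written back */
    uint64_t pdptrs[4];  /* cached copy of the four PDPTEs */
    unsigned reloads;    /* number of reloads, for illustration only */
};

/* Stand-in for re-reading the PDPTEs from memory at the current CR3. */
static void reload_pdptrs(struct reg_cache *c, const uint64_t src[4])
{
    for (int i = 0; i < 4; i++)
        c->pdptrs[i] = src[i];
    c->avail |= 1u << REG_PDPTR;
    c->reloads++;
}

/*
 * Force a reload on next use. The dirty bit is cleared too, because
 * "dirty but not available" is not a valid combination.
 */
static void invalidate_pdptrs(struct reg_cache *c)
{
    c->avail &= ~(1u << REG_PDPTR);
    c->dirty &= ~(1u << REG_PDPTR);
}

/* Read through the cache, reloading only when the avail bit is clear. */
static uint64_t read_pdptr(struct reg_cache *c, const uint64_t src[4], int idx)
{
    if (!(c->avail & (1u << REG_PDPTR)))
        reload_pdptrs(c, src);
    return c->pdptrs[idx];
}
```

In this sketch a reader that skips `invalidate_pdptrs()` after the backing memory changes keeps seeing the old value, which mirrors the stale-cache symptom the commit message describes.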
2026-06-12KVM: nVMX: remove unnecessary code in prepare_vmcs02_rarePaolo Bonzini1-11/+0
The early vmwrite of the PDPTRs in prepare_vmcs02_rare() is redundant, because every write it does will be performed by prepare_vmcs02() if it is actually needed. In any case where the emulator or the processor needs the PDPTR, either is_pae_paging() is true on vmentry, or a write of CR0, CR4 or EFER will cause a vmexit to L0. The next vmentry will refresh the PDPTRs in the vmcs02 from vmcs12. In fact, in the original version[1] of what ended up being commit c7554efc8335 ("KVM: nVMX: Copy PDPTRs to/from vmcs12 only when necessary"), the writes in what is now prepare_vmcs02_rare() were removed. When the mega-collection of optimizations was posted[2], the removal of that code got dropped as a rebase goof, so reinstate it. [1] https://lore.kernel.org/all/20190507160640.4812-16-sean.j.christopherson@intel.com [2] https://lore.kernel.org/all/1560445409-17363-31-git-send-email-pbonzini@redhat.com Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20260530165545.25599-3-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2026-06-12KVM: x86: remove nested_mmu from mmu_is_nested()Paolo Bonzini1-1/+1
nested_mmu is always stored into vcpu->arch.walk_mmu at the same time as guest_mmu is stored into vcpu->arch.mmu. But nested_mmu is not even a proper MMU, it is only used for page walking; plus the fact that walk_mmu has to be switched at all is just an implementation detail. In the end what matters here is whether the guest is using nested page tables; vmx/nested.c and svm/nested.c check it to see if they are in nEPT or nNPT context respectively. So switch to checking root_mmu vs. guest_mmu, which is a more cogent test. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20260511150648.685374-2-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20260530165545.25599-2-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2026-06-12Merge branch kvm-arm64/nv-mmu-7.2 into kvmarm-master/nextMarc Zyngier5-39/+74
* kvm-arm64/nv-mmu-7.2: : . : Assorted collection of fixes for NV MMU bugs : : - Correctly plug AT S1E1A handling in the emulation backend : : - Make CPTR_EL2.E0POE depend on FEAT_S1POE : : - Drop the reference on the page if the VNCR translation : races with an MMU notifier : : - Correctly synthesise an SEA if a page table walk fails due : to a guest error : : - Fully invalidate the VNCR TLB and fixmap when translating : for a new VNCR : : - Restart S1 walk when the S2 walk fails due to a race condition : : - Correctly return -EAGAIN when a S1 walk fails : : - Fix block mapping validity check in stage-1 walker for 64kB pages : : - Fix potential NULL dereference when performing an EL2 TLBI targeting : the VNCR page : : - Hold kvm->mmu_lock while initialising the vncr_tlb pointer : . KVM: arm64: nv: Hold kvm->mmu_lock while initialising vcpu->arch.vncr_tlb KVM: arm64: nv: Avoid dereferencing NULL VNCR pseudo-TLB KVM: arm64: Fix block mapping validity check in stage-1 walker KVM: arm64: nv: Restart stage-1 walk if stage-2 desc update fails KVM: arm64: Restart instruction upon race in __kvm_at_s12() KVM: arm64: nv: Inject SEA TTW when desc update can't write to GPA KVM: arm64: nv: Fully update VNCR fixmap state in kvm_translate_vncr() KVM: arm64: Don't leak PFN when kvm_translate_vncr() races MMU notifier arm64: cpufeature: Expose ID_AA64ISAR2_EL1.ATS1A to KVM KVM: arm64: Wire AT S1E1A in the system instruction handling table KVM: arm64: Key CPTR_EL2.E0POE propagation on FEAT_S1POE Signed-off-by: Marc Zyngier <maz@kernel.org>
2026-06-12Merge branch kvm-arm64/misc-7.2 into kvmarm-master/nextMarc Zyngier11-28/+23
* kvm-arm64/misc-7.2: : . : - Check for a valid vcpu pointer upon deactivating traps when handling : a HYP panic in VHE mode : : - Make the __deactivate_fgt() macro use its arguments instead of the : surrounding context : : - Don't bother with initialising TPIDR_EL2 in the hyp stubs, as this : is already taken care of in more obvious places : : - Drop the unused kvm_arch pointer passed to __load_stage2() : : - Return -EOPNOTSUPP when a hypercall fails for some reason, instead of : returning whatever was in the result structure : : - Make the ITS ABI selection helpers return void, which avoids wondering : about the nature of the return code (always 0) : . KVM: arm64: vgic-its: Make ABI commit helpers return void KVM: arm64: Set a Linux errno on SMCCC error in kvm_call_hyp_nvhe() KVM: arm64: Remove @arch from __load_stage2() KVM: arm64: Don't populate TPIDR_EL2 in finalise_el2() KVM: arm64: Fix __deactivate_fgt macro parameter typo KVM: arm64: Guard against NULL vcpu on VHE hyp panic path Signed-off-by: Marc Zyngier <maz@kernel.org>
2026-06-12Merge tag 'kvm-x86-svm-7.2' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini20-98/+525
KVM SVM changes for 7.2 - Add support for virtualizing gPAT (KVM previously just used L1's PAT when running L2). - Fix goofs where KVM mishandles side effects (e.g. single-step and PMC updates) when emulating VMRUN. - Fix a variety of bugs in AVIC's handling of x2APIC MSR interception, most notably where KVM didn't disable interception of IRR, ISR, and TMR regs. - Add support for virtualizing Host-Only/Guest-Only bits in the mediated PMU.
2026-06-12Merge tag 'kvm-x86-vmx-7.2' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini2-18/+26
KVM VMX changes for 7.2 - Fix a largely benign bug where KVM TDX would incorrectly state it could emulate several x2APIC MSRs. - Use the "safe" WRMSR API when proxying LBR MSR writes as the to-be-written value is guest controlled and completely unvalidated.
2026-06-12Merge tag 'kvm-x86-vfio-7.2' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini1-63/+35
KVM VFIO changes for 7.2 Use guard() to clean up various KVM+VFIO flows.
2026-06-12Merge tag 'kvm-x86-sev-7.2' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini12-257/+518
KVM SEV changes for 7.2 - Don't advertise support for unusable VM types, and account for VM types that are disabled by firmware, e.g. to mitigate security vulnerabilities. - Rewrite the SEV {en,de}crypt debug ioctls as they were riddled with bugs and unnecessarily complicated, and add comprehensive tests. - Clean up and deduplicate the SEV page pinning code. - Fix minor goofs related to writing back CPUID information after firmware rejects a CPUID page for an SNP vCPU.
2026-06-12Merge tag 'kvm-x86-selftests-7.2' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini8-45/+63
KVM selftests changes for 7.2 - Randomize the dirty log test's delay when reaping the bitmap on the first pass, as always waiting only 1ms hid a KVM RISC-V bug as the test reaped the bitmap before KVM could build up enough state to hit the bug. - A pile of one-off fixes and cleanups.
2026-06-12Merge tag 'kvm-x86-mmu-7.2' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini7-298/+302
KVM x86 MMU changes for 7.2 - Use the kernel's "enum pg_level" in the TDX APIs instead of the TDX-Module's level definitions (which are 0-based). - Rework the TDX memory APIs to not require/assume that guest memory is backed by "struct page" (in preparation for guest_memfd hugepage support). - Overhaul the TDP MMU => S-EPT code to move as much S-EPT specific logic as possible into the TDX code, and to funnel (almost) all S-EPT updates into a single chokepoint. The motivation is largely to prepare for upcoming Dynamic PAMT support, but the cleanups are nice to have on their own. - Plug a hole in the shadow MMU where KVM fails to recursively zap nested TDP shadow when L1 is tearing its TDP page tables from the bottom up, as KVM's TDP MMU now does.
2026-06-12Merge tag 'kvm-x86-misc-7.2' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini42-543/+1112
KVM misc x86 changes for 7.2 - Handle EXIT_FASTPATH_EXIT_USERSPACE in vendor code to ensure vendor code gets a chance to handle things like reaping the PML buffer. - Ensure KVM's copy of CR0 and CR3 are up-to-date on SVM prior to invoking fastpath handlers. - Update KVM's view of PV async enabling if and only if the MSR write fully succeeds. - Fix a variety of issues where the emulator doesn't honor guest-debug state, and clean up related code along the way. - Synthesize EPT Violation and #NPF "error code" bits when injecting faults into L1 that didn't originate in hardware (in which case the VMCS/VMCB doesn't hold relevant information). - Add support for virtualizing (well, emulating) AMD's flavor of CPL>0 CPUID faulting. - Clean up the GPR APIs so that KVM's use of "raw" is consistent, and fix a variety of minor bugs along the way. - Fix an OOB memory access due to not checking the VP ID when handling a Hyper-V PV TLB flush for L2. - Fix a bug in the mediated PMU's handling of fixed counters that allowed the guest to bypass the PMU event filter. - Allow userspace to return EAGAIN when handling SNP and TDX hypercalls, so that KVM can forward a "retry" status code to the guest, and reserve all unused error codes for future usage. - Misc fixes and cleanups.
2026-06-12Merge tag 'kvm-x86-gmem-7.2' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini4-15/+63
KVM guest_memfd changes for 7.2 - Return -EEXIST instead of -EINVAL if userspace attempts to bind a gmem range to multiple memslots, and fix the test that was supposed to ensure KVM returns -EEXIST. - Treat memslot binding offsets and sizes as unsigned values to fix a bug where KVM interprets a large "offset + size" as a negative value and allows a nonsensical offset. - Use the inode number instead of the page offset for the NUMA interleaving index to fix a bug where the effective index would jump by two for consecutive pages (the caller also adds in the page offset).
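The second item above (treating binding offsets and sizes as unsigned) comes down to validating a range without letting "offset + size" wrap or go negative. A minimal sketch of that kind of check follows; `range_ok()` and its parameters are hypothetical and are not the guest_memfd code.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Return true if [offset, offset + size) lies within a file of
 * file_size bytes. All quantities are unsigned, and the sum
 * offset + size is never computed, so a huge size cannot wrap around
 * and pass the check.
 */
static bool range_ok(uint64_t offset, uint64_t size, uint64_t file_size)
{
    if (size == 0)
        return false;
    if (offset > file_size)
        return false;
    /* file_size - offset cannot underflow thanks to the check above. */
    return size <= file_size - offset;
}
```

Comparing `size` against the remaining space, rather than comparing `offset + size` against the end, is one common way to keep the arithmetic overflow-free.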
2026-06-12Merge branch kvm-arm64/vgic-v5-PPI-fixes into kvmarm-master/nextMarc Zyngier15-199/+132
* kvm-arm64/vgic-v5-PPI-fixes: : . : Substantial cleanup of the vgic-v5 PPI support. From the original : cover letter: : : "With the GICv5 PPI support merged in, it has become obvious that a few : things could be improved, both from the correctness and maintainability : angles." : . KVM: arm64: Fix arch timer interrupts for GICv3-on-GICv5 guests irqchip/gic-v5: Immediately exec priority drop following activate Documentation: KVM: Clarify that PMU_V3_IRQ IntID requirements for GICv5 Documentation: KVM: Fix typos in VGICv5 documentation KVM: arm64: selftests: Improve error handling for GICv5 PPI selftest KVM: arm64: selftests: Cleanup unused vars in GICv5 PPI selftest KVM: arm64: selftests: Add missing GIC CDEN to no-vgic-v5 selftest KVM: arm64: vgic-v5: Atomically assign bits to PPI DVI bitmap KVM: arm64: vgic-v5: Add missing trap handing for NV triage KVM: arm64: vgic-v5: Limit support to 64 PPIs KVM: arm64: vgic: Rationalise per-CPU irq accessor KVM: arm64: vgic-v5: Drop defensive checks from vgic_v5_ppi_queue_irq_unlock() KVM: arm64: vgic: Consolidate vgic_allocate_private_irqs_locked() KVM: arm64: vgic: Constify struct irq_ops usage KVM: arm64: vgic-v5: Drop pointless ARM64_HAS_GICV5_CPUIF check KVM: arm64: vgic-v5: Remove use of __assign_bit() with a constant KVM: arm64: vgic-v5: Move PPI caps into kvm_vgic_global_state KVM: arm64: vgic-v5: Add for_each_visible_v5_ppi() iterator Signed-off-by: Marc Zyngier <maz@kernel.org>
2026-06-12Merge branch kvm-arm64/pkvm-fixes-7.2 into kvmarm-master/nextMarc Zyngier8-24/+100
* kvm-arm64/pkvm-fixes-7.2: : . : Assorted pKVM fixes for 7.2: : : - Ensure that the vcpu memcache is filled in a number of cases (donate, : share, selftest) : : - Fix vmemmap page order handling by resetting it when initialising the : memory pool : : - Don't leak page references on failed memory donation : : - Add sanity-check for refcounted pages when donating/sharing pages : : - Clear __hyp_running_vcpu on state flush : : - Check LR upper bound against a trusted value : : - Assorted fixes for the host-side tracking of the pages shared with : EL2 as a result of some Sashiko testing from Fuad : : - Correctly forward HCR_EL2.VSE from host to guest, so that protected : guests can see SErrors : . KVM: arm64: Roll back partial shares on kvm_share_hyp() failure KVM: arm64: Avoid host/hyp share desync on unshare hypercall failure KVM: arm64: Free hyp-share tracking node when share hypercall fails KVM: arm64: Flush HCR_EL2.VSE to deliver SErrors to pKVM guests KVM: arm64: Bound used_lrs when flushing the pKVM hyp vCPU KVM: arm64: Clear __hyp_running_vcpu when flushing the pKVM hyp vCPU KVM: arm64: Pre-check vcpu memcache for host->guest donate KVM: arm64: Pre-check vcpu memcache for host->guest share KVM: arm64: Seed pkvm_ownership_selftest vcpu memcache KVM: arm64: Add fail-safe for refcounted pages in __pkvm_hyp_donate_host KVM: arm64: Fix __pkvm_init_vm error path KVM: arm64: Reset page order in pKVM hyp_pool Signed-off-by: Marc Zyngier <maz@kernel.org>
2026-06-12Merge branch kvm-arm64/nv-granule-sizes into kvmarm-master/nextMarc Zyngier2-73/+196
* kvm-arm64/nv-granule-sizes: : . : Tidying up of the behaviour when the selected page size is not : implemented, courtesy of Wei-Lin Chang. From the initial cover : letter: : : "This small series fixes the granule size selection for software stage-1 : and stage-2 walks. Previously we treat the guest's TCR/VTCR.TGx as-is : and use the encoded granule size for the walks. However this is : incorrect if the granule sizes are not advertised in the guest's : ID_AA64MMFR0_EL1.TGRAN*. The architecture specifies that when an : unsupported size is programmed in TGx, it must be treated as an : implemented size. Fix this by choosing an available one while : prioritizing PAGE_SIZE." : . KVM: arm64: Fallback to a supported value for unsupported guest TGx KVM: arm64: nv: Use literal granule size in TLBI range calculation KVM: arm64: Factor out TG0/1 decoding of VTCR and TCR KVM: arm64: nv: Rename vtcr_to_walk_info() to setup_s2_walk() Signed-off-by: Marc Zyngier <maz@kernel.org>
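The fallback rule quoted above (use the requested granule if it is advertised, otherwise pick an implemented one, preferring the host page size) might be sketched as follows. The enum values, the bitmask representation and `pick_granule()` are assumptions made for illustration; they do not reflect the architectural TGx encodings or the helper added by the series, and the order of the final fallback scan is a guess.

```c
enum granule {
    GRAN_NONE = -1,
    GRAN_4K   = 0,
    GRAN_16K  = 1,
    GRAN_64K  = 2,
};

/*
 * supported: bitmask with bit (1u << g) set for each granule g the
 *            guest is told is implemented (hypothetical representation).
 * preferred: the granule matching the host PAGE_SIZE.
 */
static enum granule pick_granule(enum granule requested, unsigned supported,
                                 enum granule preferred)
{
    /* An advertised request is honoured as-is. */
    if (requested >= GRAN_4K && requested <= GRAN_64K &&
        (supported & (1u << requested)))
        return requested;

    /* Otherwise prefer the granule matching the host page size. */
    if (preferred >= GRAN_4K && preferred <= GRAN_64K &&
        (supported & (1u << preferred)))
        return preferred;

    /* Last resort: the first implemented granule, smallest first. */
    for (int g = GRAN_4K; g <= GRAN_64K; g++)
        if (supported & (1u << g))
            return (enum granule)g;

    return GRAN_NONE;
}
```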
2026-06-12Merge branch kvm-arm64/nv-fp-elision into kvmarm-master/nextMarc Zyngier3-1/+32
* kvm-arm64/nv-fp-elision: : . : Significantly reduce the overhead of the context switch between L1 and : L2 guests by eliding the save/restore of the FP/SIMD/SVE registers, as : this state is shared between the two guests, and therefore can be left : live. : . KVM: arm64: nv: Don't save/restore FP register during a nested ERET or exception KVM: arm64: nv: Track L2 to L1 exception emulation Signed-off-by: Marc Zyngier <maz@kernel.org>
2026-06-12Merge branch kvm-arm64/no-lazy-vgic-init into kvmarm-master/nextMarc Zyngier7-106/+94
* kvm-arm64/no-lazy-vgic-init: : . : Fix an ugly situation where the vgic lazy init could happen in : non-preemptible contexts such as vcpu reset, resulting in lockdep : splats. : : This requires revamping the way in-kernel emulated devices : (timers, PMU) present their interrupts to the vgic, and : making sure there is no need to init the vgic on the back of that. : . KVM: arm64: vgic-v2: Don't init the vgic on in-kernel interrupt injection KVM: arm64: vgic-v2: Force vgic init on injection outside the run loop KVM: arm64: pmu: Kill the PMU interrupt level cache KVM: arm64: timer: Kill the per-timer irq level cache KVM: arm64: Simplify userspace notification of interrupt state KVM: arm64: timer: Repaint kvm_timer_{should,irq_can}_fire() to kvm_timer_{pending,enabled}() Signed-off-by: Marc Zyngier <maz@kernel.org>
2026-06-12KVM: arm64: vgic-its: Make ABI commit helpers return voidJackie Liu1-12/+9
The return values of vgic_its_set_abi() and vgic_its_commit_v0() are always 0 and do not carry useful error information. Simplify by changing them to void. Suggested-by: Oliver Upton <oupton@kernel.org> Signed-off-by: Jackie Liu <liuyun01@kylinos.cn> Reviewed-by: Oliver Upton <oupton@kernel.org> Reviewed-by: Eric Auger <eric.auger@redhat.com> Link: https://patch.msgid.link/20260604075147.53299-1-liu.yun@linux.dev Signed-off-by: Marc Zyngier <maz@kernel.org>
2026-06-12Merge tag 'kvm-x86-generic-7.2' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini5-18/+18
KVM generic changes for 7.2 - Rename invalidate_begin() to invalidate_start() throughout KVM to follow the kernel's nomenclature, e.g. for mmu_notifiers. - Minor cleanups.
2026-06-12Merge tag 'kvm-s390-master-7.1-4' of https://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux into HEADPaolo Bonzini338-1628/+3548
KVM: s390: A few more misc gmap fixes.
2026-06-11KVM: s390: vsie: Use mmu cache to allocate rmapClaudio Imbrenda3-12/+14
Use kvm_s390_mmu_cache_alloc_rmap() to allocate the rmap in gmap_insert_rmap(), instead of a normal kzalloc_obj() with GFP_ATOMIC. This guarantees forward progress. Fixes: a2c17f9270cc ("KVM: s390: New gmap code") CC: stable@vger.kernel.org # 7.1 Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com> Message-ID: <20260611104850.110313-6-imbrenda@linux.ibm.com>
2026-06-11KVM: s390: vsie: Add missing radix_tree_preload() in _gaccess_shadow_fault()Claudio Imbrenda1-22/+35
Add missing radix_tree_preload() in _gaccess_shadow_fault() to guarantee forward progress. The core of _gaccess_shadow_fault() has been split into ___gaccess_shadow_fault() in order to simplify locking. Fixes: e38c884df921 ("KVM: s390: Switch to new gmap") CC: stable@vger.kernel.org # 7.1 Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com> Message-ID: <20260611104850.110313-5-imbrenda@linux.ibm.com>
2026-06-11KVM: s390: vsie: Fix allocation of struct vsie_rmapClaudio Imbrenda1-1/+1
The allocation size for struct vsie_rmap in kvm_s390_mmu_cache_topup() was wrong due to a copy-paste error. Fix it by using the type name. Fixes: 12f2f61a9e1a ("KVM: s390: KVM page table management functions: allocation") CC: stable@vger.kernel.org # 7.1 Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com> Message-ID: <20260611104850.110313-4-imbrenda@linux.ibm.com>
2026-06-11KVM: s390: Fix unlikely race in try_get_locked_pte()Claudio Imbrenda1-3/+3
Fix an unlikely race in try_get_locked_pte(), which could have happened if puds or pmds get unmapped between the p?dp_get() and p?d_offset() functions. Fixes: 89fa757931dc ("KVM: s390: Avoid potentially sleeping while atomic when zapping pages") CC: stable@vger.kernel.org # 7.1 Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com> Message-ID: <20260611104850.110313-3-imbrenda@linux.ibm.com>
2026-06-11KVM: s390: Silence potential warnings in _gmap_crstep_xchg_atomic()Claudio Imbrenda1-1/+10
While dat_crstep_xchg_atomic() is marked as __must_check, in this particular case the return value should be ignored. Silence potential compiler warnings with a pointless check, and add a comment to explain the situation. Fixes: d1adc098ce08 ("KVM: s390: Fix _gmap_crstep_xchg_atomic()") CC: stable@vger.kernel.org # 7.1 Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com> Message-ID: <20260611104850.110313-2-imbrenda@linux.ibm.com>
2026-06-10Merge tag 'pm-7.1-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pmLinus Torvalds2-2/+1
Pull power management fixes from Rafael Wysocki: "These address some remaining fallout after introducing dynamic EPP support in the amd-pstate driver during the current development cycle: - Restore allowing writing EPP of 0 when in performance mode in the amd-pstate driver which was unnecessarily disallowed by one of the recent updates (Mario Limonciello) - Remove stale documentation of the epp_cached field in struct amd_cpudata that has been dropped recently (Zhan Xusheng)" * tag 'pm-7.1-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: cpufreq/amd-pstate: Fix setting EPP in performance mode cpufreq/amd-pstate: drop stale @epp_cached kdoc
2026-06-10Merge tag 'riscv-for-linux-7.1-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linuxLinus Torvalds5-12/+15
Pull RISC-V fixes from Paul Walmsley: - Fix the implementation of the CFI branch landing pad control prctl()s to return -EINVAL if unknown control bits are set, rather than silently ignoring the request; and add a kselftest for this case - Fix unaligned access performance testing to happen earlier in boot, which fixes a performance regression in the lib/checksum code - Fix a binfmt_elf warning when dumping core (due to missing .core_note_name for CFI registers) * tag 'riscv-for-linux-7.1-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux: riscv: cfi: reject unknown flags in PR_SET_CFI riscv: Fix fast_unaligned_access_speed_key not getting initialized riscv/ptrace: Use USER_REGSET_NOTE_TYPE for REGSET_CFI
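The first fix above follows a common pattern: reject any control bit the implementation does not know about instead of silently dropping it. A minimal sketch, assuming made-up flag names (`CTL_ENABLE`, `CTL_LOCK`) and a hypothetical `set_ctl()` helper rather than the real RISC-V prctl() code:

```c
#include <errno.h>

/* Hypothetical control bits; the real prctl() flags are not shown here. */
#define CTL_ENABLE     (1UL << 0)
#define CTL_LOCK       (1UL << 1)
#define CTL_KNOWN_MASK (CTL_ENABLE | CTL_LOCK)

/*
 * Apply the requested flags, or fail with -EINVAL and leave *state
 * untouched if any bit outside the known mask is set.
 */
static int set_ctl(unsigned long *state, unsigned long flags)
{
    if (flags & ~CTL_KNOWN_MASK)
        return -EINVAL;

    *state = flags;
    return 0;
}
```

Failing early keeps currently unused bits available for future extensions, since no existing caller can come to depend on them being ignored.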
2026-06-10namespace: restrict OPEN_TREE_NAMESPACE/FSMOUNT_NAMESPACE to directoriesJann Horn1-0/+3
open_tree(..., OPEN_TREE_NAMESPACE) and fsmount(..., FSMOUNT_NAMESPACE, ...) currently work on non-directories, like regular files. That's bad for two reasons: - It ends up mounting a regular file over the inherited namespace root, which is a directory; mounting a non-directory over a directory is normally explicitly forbidden, see for example do_move_mount() - It causes setns() on the new namespace to set the cwd to a regular file, which the rest of VFS does not expect Fix it by restricting create_new_namespace() (which is used by both of these flags) to directories. Leave the behavior for OPEN_TREE_CLONE as-is, that seems unproblematic. Fixes: 9b8a0ba68246 ("mount: add OPEN_TREE_NAMESPACE") Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: Jan Kara <jack@suse.cz> Cc: stable@kernel.org Signed-off-by: Jann Horn <jannh@google.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2026-06-10KVM: arm64: nv: Hold kvm->mmu_lock while initialising vcpu->arch.vncr_tlbMarc Zyngier1-2/+14
Sashiko reports that there is a race between initialising vncr_tlb and making use of it, as we don't hold the mmu_lock at this point. Additionally, it identifies a memory leak, should userspace repeatedly invoke the KVM_RUN ioctl after a failure of kvm_arch_vcpu_run_pid_change(), as we assign vncr_tlb blindly on first run, irrespective of prior allocations. Slap the two bugs in one go by taking the kvm->mmu_lock on assigning vncr_tlb, preventing the race for good, and by checking that vncr_tlb is indeed NULL prior to allocation. Reported-by: Sashiko <sashiko-bot@kernel.org> Link: https://lore.kernel.org/r/20260607180815.85FBC1F00893@smtp.kernel.org Reviewed-by: Oliver Upton <oupton@kernel.org> Link: https://patch.msgid.link/20260608081108.2244133-1-maz@kernel.org Signed-off-by: Marc Zyngier <maz@kernel.org>
2026-06-10KVM: arm64: nv: Avoid dereferencing NULL VNCR pseudo-TLBMarc Zyngier1-21/+15
VNCR TLB invalidation occurs from MMU notifiers or TLBI instructions, and either can race against a vcpu not being onlined yet (no pseudo-TLB allocated). Similarly, the TLB might be invalid, and the invalidation should be skipped in this case. Both kvm_invalidate_vncr_ipa() and kvm_invalidate_vncr_va() are expected to perform the same checks, except that the latter doesn't check for the allocation and blindly dereferences the pointer. Solve this by introducing a new iterator built on top of the usual kvm_for_each_vcpu() that checks for both of the above conditions, and convert the two users to it. Reported-by: Hyunwoo Kim <imv4bel@gmail.com> Link: https://lore.kernel.org/r/aiUvSbrWndQeUPc8@v4bel Fixes: 4ffa72ad8f37 ("KVM: arm64: nv: Add S1 TLB invalidation primitive for VNCR_EL2") Cc: stable@vger.kernel.org Reviewed-by: Oliver Upton <oupton@kernel.org> Link: https://patch.msgid.link/20260607175745.297793-1-maz@kernel.org Signed-off-by: Marc Zyngier <maz@kernel.org>
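The "filtering iterator" idea described above can be sketched in plain C. The structures and the macro name `for_each_vcpu_with_valid_tlb()` are hypothetical and only loosely modelled on the description; the real iterator is built on kvm_for_each_vcpu() and is not reproduced here.

```c
#include <stdbool.h>
#include <stddef.h>

struct pseudo_tlb {
    bool valid;
};

struct vcpu {
    struct pseudo_tlb *tlb;   /* NULL until the vcpu has been onlined */
};

/*
 * Visit only the vcpus whose pseudo-TLB is both allocated and valid.
 * The "if (...) continue; else" tail lets the caller attach a normal
 * loop body, so both checks live in one place.
 */
#define for_each_vcpu_with_valid_tlb(i, v, vcpus, n)                   \
    for ((i) = 0; (i) < (n) && ((v) = &(vcpus)[(i)], true); (i)++)     \
        if (!(v)->tlb || !(v)->tlb->valid)                             \
            continue;                                                  \
        else

/* Example user: invalidate every valid pseudo-TLB, return how many. */
static int invalidate_all(struct vcpu *vcpus, size_t n)
{
    struct vcpu *v;
    size_t i;
    int count = 0;

    for_each_vcpu_with_valid_tlb(i, v, vcpus, n) {
        v->tlb->valid = false;
        count++;
    }
    return count;
}
```

Putting the NULL and validity checks inside the iterator means a second user cannot forget one of them, which is the failure mode the commit message attributes to kvm_invalidate_vncr_va().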
2026-06-10Merge tag 'trace-rv-v7.1-rc6-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-traceLinus Torvalds12-83/+263
Pull runtime verifier fixes from Steven Rostedt: - Fix reset ordering on per-task destruction Reset the task before dropping the slot instead of after, which was causing out-of-bound memory accesses. - Fix HA monitor synchronization and cleanup Ensure synchronous cleanup for HA monitors by running timer callbacks in RCU read-side critical sections and using synchronize_rcu() during destruction. - Avoid armed timers after tasks exit Add automatic cleanup for per-task HA monitors to prevent timers from firing after task exit. - Fix memory ordering for DA/HA monitors Fix race conditions during monitor start by using release-acquire semantics for the monitoring flag. - Fix initialization for DA/HA monitors Ensure monitors are not initialized relying on potentially corrupted state like the monitoring flag, that is not reset by all monitor types and may have an unknown state in monitors reusing the storage (per-task). - Fix memory safety in per-task and per-object monitors Prevent use-after-free and out-of-bounds access by synchronizing with in-flight tracepoint probes using tracepoint_synchronize_unregister() before freeing monitor storage or releasing task slots. - Adjust monitors for preemptible tracepoints Fix monitors that relied on tracepoints disabling preemption. Explicitly disable task migration when per-CPU monitors handle events to avoid accessing the wrong state and update the opid monitor logic. - Fix incorrect __user specifier usage Remove __user from a non-pointer variable in the extract_params() helper. - Fix bugs in the rv tool Ensure strings are NUL-terminated, fix substring matching in monitor searches, and improve cleanup and exit status handling. - Fix several bugs in rvgen Fix LTL literal stringification, subparsers' options handling, and suffix stripping in dot2k.
* tag 'trace-rv-v7.1-rc6-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: verification/rvgen: Fix ltl2k writing True as a literal verification/rvgen: Fix options shared among commands verification/rvgen: Fix suffix strip in dot2k tools/rv: Fix cleanup after failed trace setup tools/rv: Fix substring match when listing container monitors tools/rv: Fix substring match bug in monitor name search tools/rv: Ensure monitor name and desc are NUL-terminated rv: Use 0 to check preemption enabled in opid rv: Prevent task migration while handling per-CPU events rv: Ensure synchronous cleanup for HA monitors rv: Add automatic cleanup handlers for per-task HA monitors rv: Do not rely on clean monitor when initialising HA rv: Fix monitor start ordering and memory ordering for monitoring flag rv: Ensure all pending probes terminate on per-obj monitor destroy rv: Prevent in-flight per-task handlers from using invalid slots rv: Reset per-task DA monitors before releasing the slot rv: Fix __user specifier usage in extract_params()
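The "release-acquire semantics for the monitoring flag" item in the pull request above can be illustrated with C11 atomics. This is a userspace analogy, not the kernel code (which uses its own barrier primitives); `struct monitor`, `monitor_start()` and `monitor_handle_event()` are hypothetical names.

```c
#include <stdatomic.h>
#include <stdbool.h>

struct monitor {
    int curr_state;          /* plain data, published via the flag */
    atomic_bool monitoring;  /* set once the monitor is fully initialised */
};

/*
 * Initialise the state first, then publish it with a release store,
 * so the state write cannot be observed after the flag write.
 */
static void monitor_start(struct monitor *m, int initial_state)
{
    m->curr_state = initial_state;
    atomic_store_explicit(&m->monitoring, true, memory_order_release);
}

/*
 * An event handler does an acquire load of the flag. If it observes
 * true, it is also guaranteed to observe the initialised state.
 */
static bool monitor_handle_event(struct monitor *m, int *state_out)
{
    if (!atomic_load_explicit(&m->monitoring, memory_order_acquire))
        return false;

    *state_out = m->curr_state;
    return true;
}
```

With relaxed ordering instead, a handler on another CPU could in principle see the flag set while still reading an uninitialised `curr_state`, which is the kind of start-up race the bullet describes.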
2026-06-10Merge tag 'trace-tools-v7.1-rc7' of ↵Linus Torvalds6-36/+32
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull RTLA fix from Steven Rostedt: - Fix multi-character short option parsing Fix regression in parsing of multiple-character short options (eg -p100 /= -p 100/, -un /= -u -n/) caused by getopt_long() internal state corruption after a refactoring. * tag 'trace-tools-v7.1-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: rtla: Fix parsing of multi-character short options
2026-06-09Merge tag 'mm-hotfixes-stable-2026-06-08-20-51' of ↵Linus Torvalds12-36/+204
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull misc fixes from Andrew Morton: "11 hotfixes. 9 are for MM. 8 are cc:stable and the remaining 3 address post-7.1 issues or aren't considered suitable for backporting. There's a two-patch series "mm/damon/{reclaim,lru_sort}: handle ctx allocation failures" from SeongJae Park which fixes a couple of DAMON -ENOMEM bloopers. The rest are singletons - please see the individual changelogs for details" * tag 'mm-hotfixes-stable-2026-06-08-20-51' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: mm/mincore: handle non-swap entries before !CONFIG_SWAP guard arm64: mm: call pagetable dtor when freeing hot-removed page tables mm/list_lru: drain before clearing xarray entry on reparent mm/huge_memory: use correct flags for device private PMD entry mm/damon/lru_sort: handle ctx allocation failure mm/damon/reclaim: handle ctx allocation failure zram: fix use-after-free in zram_bvec_write_partial() MAINTAINERS: update Baoquan He's email address tools headers UAPI: sync linux/taskstats.h for procacct.c mm/cma_sysfs: skip inactive CMA areas in sysfs ipc/shm: serialize orphan cleanup with shm_nattch updates
2026-06-09Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdmaLinus Torvalds13-22/+103
Pull rdma fixes from Jason Gunthorpe: "Several significant bug fixes of pre-existing issues: - Missing validation on ucap fd types passed from userspace - Missing validation of HW DMA space vs userspace expected sizes in EFA queue setup - DMA corruption when using DMA block sizes >= 4G when setting up MRs in all drivers - Missing validation of CPU IDs when setting up dma handles - Missing validation of IB_MR_REREG_ACCESS when changing writability of a MR - Missing validation of received message/packet size in ISER and SRP" * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: RDMA/srp: bound SRP_RSP sense copy by the received length IB/isert: Reject login PDUs shorter than ISER_HEADERS_LEN RDMA: During rereg_mr ensure that REREG_ACCESS is compatible RDMA/core: Validate cpu_id against nr_cpu_ids in DMAH alloc RDMA/umem: Fix truncation for block sizes >= 4G RDMA/efa: Validate SQ ring size against max LLQ size RDMA/core: Validate the passed in fops for ib_get_ucaps()
2026-06-09KVM: x86/mmu: Recursively zap orphaned nested TDP shadow pages on emulated ↵Sean Christopherson1-1/+1
writes Recursively zap orphaned nested TDP shadow pages when emulating a guest write to a shadowed page table, regardless of whether or not the associated (parent) shadow page will be zapped, e.g. due to detected write-flooding. This plugs a hole where KVM fails to reclaim defunct, unsync shadow pages for select L1 hypervisor patterns. Commit 2de4085cccea ("KVM: x86/MMU: Recursively zap nested TDP SPs when zapping last/only parent") modified KVM to recursively zap synchronized shadow pages (KVM already recursively zaps unsync children) when a child is orphaned. But the fix effectively only applied the logic to kvm_mmu_page_unlink_children(), i.e. only performs the recursive zap when KVM is already zapping a parent SP and processing its children. If L1 zaps SPTEs bottom-up (4KiB => 2MiB => ...), as KVM's TDP MMU does with CONFIG_KVM_PROVE_MMU=n since commit 8ca983631f3c ("KVM: x86/mmu: Zap invalidated TDP MMU roots at 4KiB granularity"), then KVM (as L0) will leak upwards of 4 shadow pages per GiB of L2 guest memory. Over hundreds or thousands of L2 boots, if the VM is "lucky" enough to escape write-flooding detection, i.e. not trigger reclaim of the orphaned shadow pages by dumb luck, then it's possible to end up with tens or even hundreds of thousands of unsync shadow pages and associated rmap entries. Polluting the hash table and rmap entries with a horde of stale entries can eventually degrade L2 guest boot time by an order of magnitude, especially if there is any antagonistic activity in the host, i.e. anything that will contend for mmu_lock and/or needs to walk rmaps. With "top"-down zapping, where "top" is 1GiB or above, then L0 KVM is effectively limited to leaking 4 shadow pages per 256 GiB of memory, as KVM's write flooding detection will kick in on the third write to an L1 TDP PUD, and thus recursively zap the entire 256 GiB range of the parent PGD. I.e. 
even though L1 KVM still recursively zaps 2MiB => 4KiB SPTEs when zapping each 1GiB SPTE, KVM only gets through two of the 1GiB SPTEs before dropping everything. E.g. hacking tracing into L0 KVM's kvm_mmu_track_write(), the top-down zapping of L1's TDP MMU for an L2 with 16GiB of memory leads to: gpa = 107407000, old = 800000010741bd07, new = 8000000000000000, level = 3, flood = 0 gpa = 10741b000, old = 8000000112fb2d07, new = 80000000000001a0, level = 2, flood = 0 gpa = 10741b008, old = 800000012509cd07, new = 80000000000001a0, level = 2, flood = 1 gpa = 10741b010, old = 80000001114b9d07, new = 80000000000001a0, level = 2, flood = 2 gpa = 107407008, old = 8000000112fb5d07, new = 8000000000000000, level = 3, flood = 1 gpa = 112fb5298, old = 8000000106f43d07, new = 80000000000001a0, level = 2, flood = 0 gpa = 112fb52a0, old = 8000000106f4dd07, new = 80000000000001a0, level = 2, flood = 1 gpa = 112fb5ea0, old = 8000000120490d07, new = 80000000000001a0, level = 2, flood = 2 gpa = 107407010, old = 8000000106df2d07, new = 8000000000000000, level = 3, flood = 2 gpa = 107410000, old = 8000000107408d07, new = 8000000000000000, level = 5, flood = 0 gpa = 107408000, old = 8000000107407d07, new = 80000000000001a0, level = 4, flood = 0 Contrast that with a bottom-up zap, which effectively allows all 2MiB SPTEs in L1 to leak their children. 
gpa = 167939000, old = 800000011c8f4d07, new = 8000000000000000, level = 2, flood = 0 gpa = 167939020, old = 8000000104407d07, new = 8000000000000000, level = 2, flood = 1 gpa = 167939028, old = 800000011ed20d07, new = 8000000000000000, level = 2, flood = 2 gpa = 118c70bb0, old = 8000000167ab9d07, new = 8000000000000000, level = 2, flood = 0 gpa = 118c70bb8, old = 8000000163913d07, new = 8000000000000000, level = 2, flood = 1 gpa = 118c70de8, old = 800000011cc9dd07, new = 8000000000000000, level = 2, flood = 2 gpa = 160be7fb0, old = 800000011d322d07, new = 8000000000000000, level = 2, flood = 1 gpa = 160be7fb8, old = 8000000126b1bd07, new = 8000000000000000, level = 2, flood = 2 gpa = 1634ab000, old = 800000010e984d07, new = 8000000000000000, level = 2, flood = 0 gpa = 1634ab008, old = 800000016879fd07, new = 8000000000000000, level = 2, flood = 1 gpa = 1634ab010, old = 800000016879ed07, new = 8000000000000000, level = 2, flood = 2 gpa = 11e3f1e48, old = 8000000168a33d07, new = 8000000000000000, level = 2, flood = 0 gpa = 11e3f1e50, old = 80000001664dcd07, new = 8000000000000000, level = 2, flood = 1 gpa = 1167eacb8, old = 8000000166544d07, new = 8000000000000000, level = 2, flood = 0 gpa = 1167eacc0, old = 800000015c16bd07, new = 8000000000000000, level = 2, flood = 1 gpa = 1689e89b8, old = 800000015f296d07, new = 8000000000000000, level = 2, flood = 0 gpa = 1689e89c0, old = 8000000167ca8d07, new = 8000000000000000, level = 2, flood = 1 gpa = 107b35eb8, old = 8000000161e71d07, new = 8000000000000000, level = 2, flood = 0 gpa = 107b35ec0, old = 8000000118cf3d07, new = 8000000000000000, level = 2, flood = 1 gpa = 118cf2d48, old = 8000000118cf1d07, new = 8000000000000000, level = 2, flood = 0 gpa = 118cf2d50, old = 8000000118cf0d07, new = 8000000000000000, level = 2, flood = 1 gpa = 118dcb770, old = 8000000118dcad07, new = 8000000000000000, level = 2, flood = 0 gpa = 118dcb778, old = 8000000118dc9d07, new = 8000000000000000, level = 2, flood = 1 gpa = 118dc87e8, old 
= 8000000126997d07, new = 8000000000000000, level = 2, flood = 0 gpa = 118dc87f0, old = 8000000126996d07, new = 8000000000000000, level = 2, flood = 1 gpa = 126995148, old = 8000000126994d07, new = 8000000000000000, level = 2, flood = 0 gpa = 126995150, old = 8000000103477d07, new = 8000000000000000, level = 2, flood = 1 gpa = 1034764c8, old = 8000000103475d07, new = 8000000000000000, level = 2, flood = 0 gpa = 1034764d0, old = 8000000103474d07, new = 8000000000000000, level = 2, flood = 1 gpa = 10ea4b788, old = 800000010ea4ad07, new = 8000000000000000, level = 2, flood = 0 gpa = 10ea4b790, old = 800000010ea49d07, new = 8000000000000000, level = 2, flood = 1 gpa = 10ea48928, old = 800000011a5bfd07, new = 8000000000000000, level = 2, flood = 0 gpa = 10ea48930, old = 800000011a5bed07, new = 8000000000000000, level = 2, flood = 1 gpa = 11a5bd0d8, old = 800000011a5bcd07, new = 8000000000000000, level = 2, flood = 0 gpa = 11a5bd0e0, old = 800000011d323d07, new = 8000000000000000, level = 2, flood = 1 gpa = 122ce2b40, old = 800000011fe0bd07, new = 8000000000000000, level = 2, flood = 0 gpa = 122ce2b48, old = 800000010e985d07, new = 8000000000000000, level = 2, flood = 1 gpa = 122ce2b50, old = 8000000161c9dd07, new = 8000000000000000, level = 2, flood = 2 gpa = 16864c000, old = 8000000167939d07, new = 8000000000000000, level = 3, flood = 0 gpa = 16864c008, old = 8000000118c70d07, new = 8000000000000000, level = 3, flood = 1 gpa = 16864c010, old = 80000001688a6d07, new = 8000000000000000, level = 3, flood = 2 gpa = 11c8f7000, old = 80000001608a7d07, new = 8000000000000000, level = 5, flood = 0 gpa = 1608a7000, old = 800000016864cd07, new = 80000000000001a0, level = 4, flood = 0 Note, in the shadow MMU, "level" describes the level a shadow page "points" at, not the level of its associated SPTE. I.e. when write-flooding of 1GiB PUD entries is detected, KVM recursively zaps shadow pages covering 256GiB worth of memory. 
And as shown above, KVM's write-flooding detection operates at all levels, so a single PMD (in L1) can effectively only leak two unsync children (4KiB shadow pages) before it gets recursively zapped. As a result, for the top-down zap, L0 KVM will leak at most 4 unsync shadow pages per 256GiB of L2 memory. The top-down zap also makes it more likely that L1 will self-heal (to some extent), as any shadow pages that are "rediscovered" by future runs of L2 can get reclaimed by a recursive zap, whereas bottom-up zapping orphans shadow pages over and over. Note, in theory, there is some risk of over-zapping, e.g. due to zapping a large branch of the paging tree that L1 is only temporarily removing. In practice, the usage patterns of hypervisors are highly unlikely to trigger false positives. E.g. temporarily changing paging protections is typically done at the leaf, not on a non-leaf entry. And if the L1 hypervisor is updating large swaths of PTEs, e.g. to (temporarily?) remove chunks of memory from L2, then L0 KVM's write-flooding detection will kick in, and the children would be zapped anyways. Fixes: 2de4085cccea ("KVM: x86/MMU: Recursively zap nested TDP SPs when zapping last/only parent") Cc: Yosry Ahmed <yosry@kernel.org> Cc: Jim Mattson <jmattson@google.com> Cc: James Houghton <jthoughton@google.com> Reviewed-by: Jim Mattson <jmattson@google.com> Reviewed-by: Yosry Ahmed <yosry@kernel.org> Link: https://patch.msgid.link/20260605174611.2222504-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-06-08RDMA/srp: bound SRP_RSP sense copy by the received lengthMichael Bommarito1-6/+24
srp_process_rsp() copies sense data from rsp->data + resp_data_len, where resp_data_len is the full 32-bit value supplied by the SRP target and is never checked against the number of bytes actually received (wc->byte_len). The copy length is bounded to SCSI_SENSE_BUFFERSIZE, so at most 96 bytes are copied, but the source offset is not bounded. A malicious or compromised SRP target on the InfiniBand/RoCE fabric that the initiator has logged into can return an SRP_RSP with SRP_RSP_FLAG_SNSVALID set and a large resp_data_len. The receive buffer is allocated at the target-chosen max_ti_iu_len, so the source of the sense copy lands past the bytes actually received; with resp_data_len near 0xFFFFFFFF it is gigabytes past the buffer and the read faults. Copy the sense data only if it has not been truncated, that is, only if the response header, the response data, and the sense region fit within the bytes actually received; otherwise drop the sense and log. The in-tree iSER and NVMe-RDMA receive paths already bound their parse by wc->byte_len; this brings ib_srp into line with them. Fixes: aef9ec39c47f ("IB: Add SCSI RDMA Protocol (SRP) initiator") Link: https://patch.msgid.link/r/20260602220457.2542840-1-michael.bommarito@gmail.com Cc: stable@vger.kernel.org Assisted-by: Claude:claude-opus-4-8 Signed-off-by: Michael Bommarito <michael.bommarito@gmail.com> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
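The bound described in the changelog above can be sketched in plain C. This is a minimal illustration, not the ib_srp code: the helper name `sense_fits` and its parameters (`byte_len` for the bytes actually received, `hdr_len` for the fixed response header size, `sense_len` for the sense region size) are hypothetical. The one point it tries to show is that the sum has to be computed in a wider type, since a 32-bit `resp_data_len` near 0xFFFFFFFF would otherwise wrap.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper (not the ib_srp implementation): returns true only
 * if the response header, the response data, and the sense region all lie
 * within the bytes that were actually received.
 */
static bool sense_fits(uint32_t byte_len, uint32_t hdr_len,
                       uint32_t resp_data_len, uint32_t sense_len)
{
    /* Sum in 64 bits so a huge resp_data_len cannot wrap around. */
    uint64_t end = (uint64_t)hdr_len + resp_data_len + sense_len;

    return end <= byte_len;
}
```

A caller following the changelog would copy the sense data only when such a check passes, and otherwise drop the sense and log.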
2026-06-08IB/isert: Reject login PDUs shorter than ISER_HEADERS_LENMichael Bommarito1-0/+6
In drivers/infiniband/ulp/isert/ib_isert.c, isert_login_recv_done() computes the login request payload length as wc->byte_len minus ISER_HEADERS_LEN with no lower bound, and login_req_len is a signed int. A remote iSER initiator can post a login Send work request carrying fewer than ISER_HEADERS_LEN (76) bytes, so the subtraction underflows and login_req_len becomes negative. isert_rx_login_req() then reads that negative length back into a signed int, takes size = min(rx_buflen, MAX_KEY_VALUE_PAIRS), and because the min() is signed it keeps the negative value; the value is then passed as the memcpy() length and sign-extended to a multi-gigabyte size_t. The copy into the 8192-byte login->req_buf runs far out of bounds and faults, crashing the target node. The login phase precedes iSCSI authentication, so no credentials are required to reach this path. Reject any login PDU shorter than ISER_HEADERS_LEN before the subtraction, mirroring the existing early return on a failed work completion, so login_req_len can never go negative. The upper bound was already safe: a posted login buffer cannot deliver more than ISER_RX_PAYLOAD_SIZE, so the difference stays at or below MAX_KEY_VALUE_PAIRS and the existing min() clamps it; only the missing lower bound needs to be added. Fixes: b8d26b3be8b3 ("iser-target: Add iSCSI Extensions for RDMA (iSER) target driver") Link: https://patch.msgid.link/r/20260602194642.2273217-1-michael.bommarito@gmail.com Cc: stable@vger.kernel.org Assisted-by: Claude:claude-opus-4-8 Signed-off-by: Michael Bommarito <michael.bommarito@gmail.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
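The missing lower bound described above can be illustrated with a small standalone sketch. The helper name `login_payload_len` and the macro `HEADERS_LEN` are hypothetical stand-ins; only the value 76 is taken from the changelog. The idea is to reject a short PDU before the subtraction, so the signed payload length can never go negative and later be sign-extended into a huge copy size.

```c
#include <stdbool.h>

#define HEADERS_LEN 76u /* stand-in for ISER_HEADERS_LEN, value per the changelog */

/*
 * Hypothetical helper (not the ib_isert code): compute the login payload
 * length, refusing any PDU too short to contain the headers.
 * On failure *len is left untouched.
 */
static bool login_payload_len(unsigned int byte_len, int *len)
{
    if (byte_len < HEADERS_LEN)
        return false; /* subtraction below would underflow */

    *len = (int)(byte_len - HEADERS_LEN);
    return true;
}
```

Checking before subtracting, rather than testing for a negative result afterwards, keeps the invariant local: every later consumer of the length can assume it is non-negative.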
2026-06-08RDMA: During rereg_mr ensure that REREG_ACCESS is compatibleJason Gunthorpe7-0/+45
If IB_MR_REREG_ACCESS changes from RO to RW then the umem has to be re-evaluated to ensure it is properly pinned as RW. Since the umem is hidden inside each driver's mr struct add a ib_umem_check_rereg() function that each driver has to call before processing IB_MR_REREG_ACCESS. mlx4 has to retain its duplicate ib_access_writable check because it implements IB_MR_REREG_ACCESS | IB_MR_REREG_TRANS by changing both items in place sequentially while the MR is live, so it will continue to not support this combination. Cc: stable@vger.kernel.org Fixes: b40656aa7d55 ("RDMA/umem: remove FOLL_FORCE usage") Link: https://patch.msgid.link/r/0-v1-06fb1a2d6cf5+107-rereg_access_jgg@nvidia.com Reported-by: Philip Tsukerman <philiptsukerman@gmail.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2026-06-08Documentation: KVM: Synchronize x86 VM typesCarlos López1-0/+2
KVM has reflected KVM_X86_SNP_VM to userspace since 1dfe571c12cf ("KVM: SEV: Add initial SEV-SNP support"), and KVM_X86_TDX_VM since 161d34609f9b ("KVM: TDX: Make TDX VM type supported"). Update the documentation to reflect this fact. Fixes: 1dfe571c12cf ("KVM: SEV: Add initial SEV-SNP support") Fixes: 161d34609f9b ("KVM: TDX: Make TDX VM type supported") Signed-off-by: Carlos López <clopez@suse.de> Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com> Link: https://patch.msgid.link/20260603114504.814647-2-clopez@suse.de [sean: use one tab instead of two] Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-06-08KVM: selftests: Add regression test for mediated PMU fixed counter filter bugSean Christopherson1-0/+6
Add a regression test where KVM would inadvertently ignore PMU event filters on writes that change _some_ bits in FIXED_CTR_CTRL, but not the enable bits for PMCs that are denied to the guest. Link: https://patch.msgid.link/20260603231905.1738487-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-06-08KVM: x86/pmu: Use hardware value when reprogramming for FIXED_CTR_CTRL changesSean Christopherson1-1/+9
When (conditionally) reprogramming fixed counters, use the hardware value of FIXED_CTR_CTRL to detect changes, not the guest's original value. For guests with a mediated PMU, overwriting fixed_ctr_ctrl_hw at the start of reprogramming without actually reacting to changes in fixed_ctr_ctrl_hw can lead to KVM ignoring PMU event filters. E.g. if the guest attempts to enable a fixed PMC that is disallowed, and then toggles a different PMC in a subsequent WRMSR, KVM will update pmu->fixed_ctr_ctrl_hw and reprogram the PMC that is changing, but not the others that are now effectively enabled in pmu->fixed_ctr_ctrl_hw. Note, the perf-based PMU is unaffected, as it doesn't use fixed_ctr_ctrl_hw (which is also why keying off fixed_ctr_ctrl_hw works for both PMUs). Note #2, fixed_ctr_ctrl_hw won't mess up pmc_in_use either, because the latter isn't used by the mediated PMU. Its purpose is solely to release perf events that are no longer being actively used, and the mediated PMU obviously doesn't create perf events. Reported-by: Sashiko <sashiko-bot@kernel.org> Closes: https://lore.kernel.org/all/20260528005419.0228F1F00A3A@smtp.kernel.org Link: https://patch.msgid.link/20260603231905.1738487-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-06-08KVM: x86: hyper-v: Bound the bank index when querying sparse banksHyunwoo Kim1-0/+5
When checking if a VP ID is included in a sparse bank set, explicitly check that the ID can actually be contained in a sparse bank (the TLFS allows for a maximum of 64 banks of 64 vCPUs each). When handling a paravirtual TLB flush for L2, the VP ID is copied verbatim from the enlightened VMCS, without any bounds check, i.e. isn't guaranteed to be under the limit of 4096. Failure to check the bounds of the VP ID leads to an out-of-bounds read when testing the sparse bank, and super strictly speaking could lead to KVM performing an unnecessary TLB flush for an L2 vCPU. ================================================================== BUG: KASAN: use-after-free in hv_is_vp_in_sparse_set+0x85/0x100 [kvm] Read of size 8 at addr ffff88811ba5f598 by task hyperv_evmcs/2802 CPU: 12 UID: 1000 PID: 2802 Comm: hyperv_evmcs Not tainted 7.1.0-rc2 #7 PREEMPT Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 Call Trace: <TASK> dump_stack_lvl+0x51/0x60 print_report+0xcb/0x5d0 kasan_report+0xb4/0xe0 kasan_check_range+0x35/0x1b0 hv_is_vp_in_sparse_set+0x85/0x100 [kvm] kvm_hv_flush_tlb+0xe9e/0x16c0 [kvm] kvm_hv_hypercall+0xe6b/0x1e60 [kvm] vmx_handle_exit+0x485/0x1b60 [kvm_intel] kvm_arch_vcpu_ioctl_run+0x22e3/0x5070 [kvm] kvm_vcpu_ioctl+0x5d0/0x10c0 [kvm] __x64_sys_ioctl+0x129/0x1a0 do_syscall_64+0xb9/0xcf0 entry_SYSCALL_64_after_hwframe+0x4b/0x53 RIP: 0033:0x7f0e62d1a9bf </TASK> The buggy address belongs to the physical page: page: refcount:0 mapcount:0 mapping:0000000000000000 index:0xffffffffffffffff pfn:0x11ba5f flags: 0x4000000000000000(zone=1) raw: 4000000000000000 0000000000000000 00000000ffffffff 0000000000000000 raw: ffffffffffffffff 0000000000000000 00000000ffffffff 0000000000000000 page dumped because: kasan: bad access detected Memory state around the buggy address: ffff88811ba5f480: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ffff88811ba5f500: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff >ffff88811ba5f580: ff ff ff ff ff ff ff ff ff ff ff 
ff ff ff ff ff ^ ffff88811ba5f600: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ffff88811ba5f680: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ================================================================== Disabling lock debugging due to kernel taint Opportunistically add a compile time assertion to ensure the maximum number of sparse banks exactly matches the number of possible bits in the passed in mask. Cc: stable@vger.kernel.org Fixes: c58a318f6090 ("KVM: x86: hyper-v: L2 TLB flush") Signed-off-by: Hyunwoo Kim <imv4bel@gmail.com> Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Link: https://patch.msgid.link/aiQyZIJtO-2Aj_xN@v4bel [sean: add KASAN splat, drop comment, add assert, massage changelog] Signed-off-by: Sean Christopherson <seanjc@google.com>
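The bounds check from the changelog above can be sketched as a standalone function. This is an illustration under stated assumptions, not KVM's hv_is_vp_in_sparse_set(): the names `vp_in_sparse_set`, `MAX_SPARSE_BANKS`, and `VPS_PER_BANK` are hypothetical, the 64x64 limits come from the changelog, and the packed layout (only banks whose bit is set in the mask are present in the array) is assumed here.

```c
#include <stdbool.h>
#include <stdint.h>

#define MAX_SPARSE_BANKS 64u /* per the changelog: at most 64 banks */
#define VPS_PER_BANK     64u /* per the changelog: 64 vCPUs per bank */

/*
 * Hypothetical sketch: test whether vp_id is present in a sparse set.
 * sparse_banks[] is assumed to hold one 64-bit word per bit set in
 * valid_bank_mask, packed in ascending bank order.
 */
static bool vp_in_sparse_set(uint32_t vp_id, uint64_t valid_bank_mask,
                             const uint64_t *sparse_banks)
{
    uint32_t bank = vp_id / VPS_PER_BANK;
    uint32_t idx = 0;
    uint32_t i;

    /* The bound: a VP ID of 4096 or more cannot be in any bank. */
    if (bank >= MAX_SPARSE_BANKS)
        return false;

    if (!(valid_bank_mask & (1ULL << bank)))
        return false;

    /* Packed index = number of valid banks below this one. */
    for (i = 0; i < bank; i++)
        if (valid_bank_mask & (1ULL << i))
            idx++;

    return (sparse_banks[idx] & (1ULL << (vp_id % VPS_PER_BANK))) != 0;
}
```

Without the first check, a bank number of 64 or more would make the `1ULL << bank` shift undefined and could index past the end of the bank array, which matches the out-of-bounds read reported in the splat.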
2026-06-08KVM: guest_memfd: fix NUMA interleave index double-countingMichael S. Tsirkin1-3/+4
kvm_gmem_get_policy() sets the interleave index (the output param that's typically named "ilx") to the full page offset (vm_pgoff + vma offset). But get_vma_policy() adds the page offset on top of the interleave index, and so the offset is counted twice. This causes NUMA interleaving to skip nodes: for order-0 pages the effective index jumps by 2 for each consecutive page. The vm_op.get_policy() implementation should return only a per-file bias in the interleave index (like shmem_get_policy does with inode->i_ino), letting get_vma_policy() add the page-offset component. Fix by setting the output interleave index to the inode number (a la shmem) instead of the full page offset, as the index is intended to be a constant, semi-random value for a given file, e.g. so that interleaving doesn't start at the same node for every file, and so that allocations are round-robined across nodes based on the page offset (the selected node would bounce/skip around if the index isn't constant). Found by Sashiko (sashiko.dev) AI code review. Fixes: ed1ffa810bd6 ("KVM: guest_memfd: Enforce NUMA mempolicy using shared policy") Cc: Sean Christopherson <seanjc@google.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Assisted-by: Claude:claude-opus-4-6 Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: Shivank Garg <shivankg@amd.com> Tested-by: Shivank Garg <shivankg@amd.com> Fixes: 7f3779a3ac3e ("mm/filemap: Add NUMA mempolicy support to filemap_alloc_folio()") Link: https://patch.msgid.link/0eff0a90667b900bee837d06b5db5025e1f304b5.1780501924.git.mst@redhat.com [sean: use reverse fir-tree, massage changelog] Signed-off-by: Sean Christopherson <seanjc@google.com>
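The double-counting described above can be seen in a toy model. This is a deliberately simplified sketch, not the kernel's mempolicy code: the function `pick_node` is hypothetical and assumes the interleave node is simply (index + page offset) modulo the node count, for order-0 pages.

```c
/*
 * Toy model (hypothetical, not kernel code): the caller of get_policy()
 * adds the page offset to the returned interleave index, then the node
 * is chosen round-robin.
 */
static unsigned int pick_node(unsigned long ilx, unsigned long pgoff,
                              unsigned int nr_nodes)
{
    return (unsigned int)((ilx + pgoff) % nr_nodes);
}
```

In this model, returning the page offset itself as the index (the buggy behaviour) gives `pick_node(pgoff, pgoff, 4)` = 0, 2, 0, 2 for offsets 0 to 3, so half of the nodes are skipped. Returning a per-file constant such as an inode number of 5 gives 1, 2, 3, 0, which visits every node in turn.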
2026-06-08Merge tag 'v7.1-p5' of ↵Linus Torvalds1-0/+1
git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6 Pull crypto fix from Herbert Xu: - Fix random config build failure on s390. * tag 'v7.1-p5' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: crypto: s390 - add select CRYPTO_AEAD for aes