| Age | Commit message | Author | Files | Lines |
|
It's rare to find a system that has more than 4 sockets,
but a system can have more than 4 NUMA nodes if each socket
exposes its chiplets as separate NUMA nodes.
In particular, our CI caught a failure in this test on a system with
two sockets, each containing an 'AMD EPYC 7601 32-Core Processor'.
Bump the limit to 32, just in case.
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-ID: <20260612150038.1277394-1-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD
KVM/arm64 updates for 7.2
* New features:
- None. Zilch. Nada. Que dalle.
* Fixes and other improvements:
- Significant cleanup of the vgic-v5 PPI support which was merged in
7.1. This makes the code more maintainable, and squashes a couple
of bugs in the meantime.
- Set of fixes for the handling of the MMU in an NV context,
particularly VNCR-triggered faults. S1POE support is fixed
as well.
- Large set of pKVM fixes, mostly addressing recurring issues
around hypervisor tracking of donated pages in obscure cases
where the donation could fail and leave things in a bizarre
state.
- Fixes for the so-called "lazy vgic init", which resulted in
sleeping operations in non-preemptible sections. This turned
out to be far more invasive than initially expected...
- Reduce the overhead of L1/L2 context switch by not touching
the FP registers.
- Fix the way non-implemented page sizes are dealt with when
a guest insists on using them for S2 translation.
- The usual set of low-impact fixes and cleanups all over the map.
|
|
The non-MMU changes/preliminary cleanups from the "split kvm_mmu in
three" series[1]. The final outcome is to have a single copy of the
PDPTRs (in vcpu->arch) instead of two (in root_mmu and nested_mmu).
[1] https://lore.kernel.org/kvm/20260603105814.10236-1-pbonzini@redhat.com/T/#t
|
|
PDPTRs are part of the CPU state. A bit unconventionally, they are
reached via vcpu->arch.walk_mmu instead of being stored in vcpu->arch
directly. That is nice in principle---it would allow TDP shadow paging
to have its own PDPTRs---but it is not necessary, because EPT has no
PDPTRs and NPT does not cache them.
Since kvm_pdptr_read does not otherwise need the MMU, drop the pdptrs
from the MMU altogether. There is however something to be careful
about, in that PDPTRs are now not stored separately in root_mmu and
nested_mmu for L1 and L2 guests. In practice this was already not
an issue:
- for EPT the VMCS0x has to keep them up to date; and for the purpose
of emulation they are always loaded from the VMCS on vmentry/vmexit,
thanks to the clearing of dirty and available register bitmaps in
vmx_switch_vmcs()
- for NPT, VCPU_EXREG_PDPTR is similarly cleared for nNPT, which does
not cache the PDPTRs; while for non-nNPT the PDPTRs are loaded
together with the load of CR3.
Note that page table PDPTRs are not affected, since they are stored
in pae_root.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20260530165545.25599-6-pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
This is true for both Intel and AMD. On Intel, "enable INVPCID" is
set unconditionally if supported, but the vmexit is triggered by the
"INVLPG exiting" control which is disabled by enable_ept. On AMD, KVM
can intercept INVPCID if NPT is enabled but only in order to inject #UD
in the guest.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20260530165545.25599-5-pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
When L2 runs under nested NPT and uses PAE paging, KVM's cached PDPTRs
in mmu->pdptrs[] can hold stale or wrong values after nested
transitions and across migration restore, because both
nested_svm_load_cr3() and svm_get_nested_state_pages() only refresh
PDPTRs on the !nested_npt path.
The user-visible bug is on migration restore of an L2 running with nested
NPT and 32-bit PAE paging, if userspace uses KVM_SET_SREGS rather than
KVM_SET_SREGS2. In that case, load_pdptrs() leaves VCPU_EXREG_PDPTR
marked as available, and kvm_pdptr_read() will use a stale translation
that used L1 GPAs instead of L2 nGPAs. svm_get_nested_state_pages()
runs on first KVM_RUN but skips the refresh because nested_npt_enabled()
is true. The CPU itself reads L2's PDPTRs correctly from memory via
L1's NPT, but KVM-side walking of guest PAE page tables uses the bogus
cached values.
Unlike Intel's GUEST_PDPTR0..3 fields in the VMCS, SVM has no
VMCB-cached PDPTR state: the in-memory PDPTEs at the current CR3 are
the only source of truth, and svm_cache_reg(VCPU_EXREG_PDPTR) simply
reloads them from memory via load_pdptrs(). Clearing the avail
bit (and the dirty bit, because !avail/dirty is invalid) to force
a reload when the PDPTRs are needed fixes the bug.
Do the same for nested_svm_load_cr3()'s nested_npt branch, so that
the invariant "PDPTRs need reloading" is handled similarly for both
immediate and deferred loading.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20260530165545.25599-4-pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
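The avail/dirty handling described in the commit above can be illustrated with a small userspace sketch. This is not KVM code: `struct reg_cache`, `REG_PDPTR`, `reload_pdptrs()`, `invalidate_pdptrs()` and `pdptr_read()` are hypothetical stand-ins for the register cache, VCPU_EXREG_PDPTR, load_pdptrs() and kvm_pdptr_read(), and the `mem` array plays the role of the in-memory PDPTEs.

```c
#include <stdint.h>

/* Hypothetical register index; stands in for VCPU_EXREG_PDPTR. */
enum { REG_PDPTR = 0 };

/* Hypothetical model of a lazily filled register cache. */
struct reg_cache {
	uint32_t avail;      /* bit set: cached value is valid */
	uint32_t dirty;      /* bit set: cached value must be written back */
	uint64_t pdptrs[4];  /* cached PDPTEs */
	unsigned int reloads; /* how many times memory was re-read */
};

/* Stand-in for re-reading the four PDPTEs from guest memory. */
static void reload_pdptrs(struct reg_cache *c, const uint64_t mem[4])
{
	for (int i = 0; i < 4; i++)
		c->pdptrs[i] = mem[i];
	c->avail |= 1u << REG_PDPTR;
	c->reloads++;
}

/* Drop the cached copy. Dirty is cleared too, as dirty without avail is invalid. */
static void invalidate_pdptrs(struct reg_cache *c)
{
	c->avail &= ~(1u << REG_PDPTR);
	c->dirty &= ~(1u << REG_PDPTR);
}

/* Read one PDPTE, reloading from memory only if the cache is not available. */
static uint64_t pdptr_read(struct reg_cache *c, const uint64_t mem[4], int idx)
{
	if (!(c->avail & (1u << REG_PDPTR)))
		reload_pdptrs(c, mem);
	return c->pdptrs[idx];
}
```

Without the call to `invalidate_pdptrs()`, a reader keeps seeing the old cached entry after the in-memory values change, which is the shape of the stale-PDPTR bug described above.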
|
|
The early vmwrite of the PDPTRs in prepare_vmcs02_rare() is redundant, because
every write it does will be performed by prepare_vmcs02() if it is actually
needed.
In any case where the emulator or the processor need the PDPTR, either
is_pae_paging() is true on vmentry, or a write of CR0, CR4 or EFER will
cause a vmexit to L0. The next vmentry will refresh the PDPTRs in the
vmcs02 from vmcs12.
In fact, in the original version[1] of what ended up being commit
c7554efc8335 ("KVM: nVMX: Copy PDPTRs to/from vmcs12 only when
necessary"), the writes in what is now prepare_vmcs02_rare() were removed.
When the mega-collection of optimizations was posted[2], the removal of
that code got dropped as a rebase goof, so reinstate it.
[1] https://lore.kernel.org/all/20190507160640.4812-16-sean.j.christopherson@intel.com
[2] https://lore.kernel.org/all/1560445409-17363-31-git-send-email-pbonzini@redhat.com
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20260530165545.25599-3-pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
nested_mmu is always stored into vcpu->arch.walk_mmu at the same time as
guest_mmu is stored into vcpu->arch.mmu. But nested_mmu is not even
a proper MMU, it is only used for page walking; plus the fact that
walk_mmu has to be switched at all is just an implementation detail.
In the end what matters here is whether the guest is using nested
page tables; vmx/nested.c and svm/nested.c check it to see if they
are in nEPT or nNPT context respectively. So switch to checking
root_mmu vs. guest_mmu, which is a more cogent test.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20260511150648.685374-2-pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20260530165545.25599-2-pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
* kvm-arm64/nv-mmu-7.2:
: .
: Assorted collection of fixes for NV MMU bugs
:
: - Correctly plug AT S1E1A handling in the emulation backend
:
: - Make CPTR_EL2.E0POE depend on FEAT_S1POE
:
: - Drop the reference on the page if the VNCR translation
: races with an MMU notifier
:
: - Correctly synthesise an SEA if a page table walk fails due
: to a guest error
:
: - Fully invalidate the VNCR TLB and fixmap when translating
: for a new VNCR
:
: - Restart S1 walk when the S2 walk fails due to a race condition
:
: - Correctly return -EAGAIN when a S1 walk fails
:
: - Fix block mapping validity check in stage-1 walker for 64kB pages
:
: - Fix potential NULL dereference when performing an EL2 TLBI targeting
: the VNCR page
:
: - Hold kvm->mmu_lock while initialising the vncr_tlb pointer
: .
KVM: arm64: nv: Hold kvm->mmu_lock while initialising vcpu->arch.vncr_tlb
KVM: arm64: nv: Avoid dereferencing NULL VNCR pseudo-TLB
KVM: arm64: Fix block mapping validity check in stage-1 walker
KVM: arm64: nv: Restart stage-1 walk if stage-2 desc update fails
KVM: arm64: Restart instruction upon race in __kvm_at_s12()
KVM: arm64: nv: Inject SEA TTW when desc update can't write to GPA
KVM: arm64: nv: Fully update VNCR fixmap state in kvm_translate_vncr()
KVM: arm64: Don't leak PFN when kvm_translate_vncr() races MMU notifier
arm64: cpufeature: Expose ID_AA64ISAR2_EL1.ATS1A to KVM
KVM: arm64: Wire AT S1E1A in the system instruction handling table
KVM: arm64: Key CPTR_EL2.E0POE propagation on FEAT_S1POE
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
* kvm-arm64/misc-7.2:
: .
: - Check for a valid vcpu pointer upon deactivating traps when handling
: a HYP panic in VHE mode
:
: - Make the __deactivate_fgt() macro use its arguments instead of the
: surrounding context
:
: - Don't bother with initialising TPIDR_EL2 in the hyp stubs, as this
: is already taken care of in more obvious places
:
: - Drop the unused kvm_arch pointer passed to __load_stage2()
:
: - Return -EOPNOTSUPP when a hypercall fails for some reason, instead of
: returning whatever was in the result structure
:
: - Make the ITS ABI selection helpers return void, which avoids wondering
: about the nature of the return code (always 0)
: .
KVM: arm64: vgic-its: Make ABI commit helpers return void
KVM: arm64: Set a Linux errno on SMCCC error in kvm_call_hyp_nvhe()
KVM: arm64: Remove @arch from __load_stage2()
KVM: arm64: Don't populate TPIDR_EL2 in finalise_el2()
KVM: arm64: Fix __deactivate_fgt macro parameter typo
KVM: arm64: Guard against NULL vcpu on VHE hyp panic path
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
KVM SVM changes for 7.2
- Add support for virtualizing gPAT (KVM previously just used L1's PAT when
running L2).
- Fix goofs where KVM mishandles side effects (e.g. single-step and PMC
updates) when emulating VMRUN.
- Fix a variety of bugs in AVIC's handling of x2APIC MSR interception, most
notably where KVM didn't disable interception of IRR, ISR, and TMR regs.
- Add support for virtualizing Host-Only/Guest-Only bits in the mediated PMU.
|
|
KVM VMX changes for 7.2
- Fix a largely benign bug where KVM TDX would incorrectly state it could
emulate several x2APIC MSRs.
- Use the "safe" WRMSR API when proxying LBR MSR writes as the to-be-written
value is guest controlled and completely unvalidated.
|
|
KVM VFIO changes for 7.2
Use guard() to clean up various KVM+VFIO flows.
|
|
KVM SEV changes for 7.2
- Don't advertise support for unusable VM types, and account for VM types
that are disabled by firmware, e.g. to mitigate security vulnerabilities.
- Rewrite the SEV {en,de}crypt debug ioctls as they were riddled with bugs and
unnecessarily complicated, and add comprehensive tests.
- Clean up and deduplicate the SEV page pinning code.
- Fix minor goofs related to writing back CPUID information after firmware
rejects a CPUID page for an SNP vCPU.
|
|
KVM selftests changes for 7.2
- Randomize the dirty log test's delay when reaping the bitmap on the first
pass, as always waiting only 1ms hid a KVM RISC-V bug as the test reaped the
bitmap before KVM could build up enough state to hit the bug.
- A pile of one-off fixes and cleanups.
|
|
KVM x86 MMU changes for 7.2
- Use the kernel's "enum pg_level" in the TDX APIs instead of the TDX-Module's
level definitions (which are 0-based).
- Rework the TDX memory APIs to not require/assume that guest memory is
backed by "struct page" (in preparation for guest_memfd hugepage support).
- Overhaul the TDP MMU => S-EPT code to move as much S-EPT specific logic as
possible into the TDX code, and to funnel (almost) all S-EPT updates into
a single chokepoint. The motivation is largely to prepare for upcoming
Dynamic PAMT support, but the cleanups are nice to have on their own.
- Plug a hole in the shadow MMU where KVM fails to recursively zap nested TDP
shadow pages when L1 is tearing down its TDP page tables from the bottom up,
as KVM's TDP MMU now does.
|
|
KVM misc x86 changes for 7.2
- Handle EXIT_FASTPATH_EXIT_USERSPACE in vendor code to ensure vendor code
gets a chance to handle things like reaping the PML buffer.
- Ensure KVM's copy of CR0 and CR3 are up-to-date on SVM prior to invoking
fastpath handlers.
- Update KVM's view of PV async enabling if and only if the MSR write fully
succeeds.
- Fix a variety of issues where the emulator doesn't honor guest-debug state,
and clean up related code along the way.
- Synthesize EPT Violation and #NPF "error code" bits when injecting faults
into L1 that didn't originate in hardware (in which case the VMCS/VMCB
doesn't hold relevant information).
- Add support for virtualizing (well, emulating) AMD's flavor of CPL>0 CPUID
faulting.
- Clean up the GPR APIs so that KVM's use of "raw" is consistent, and fix a
variety of minor bugs along the way.
- Fix an OOB memory access due to not checking the VP ID when handling a
Hyper-V PV TLB flush for L2.
- Fix a bug in the mediated PMU's handling of fixed counters that allowed the
guest to bypass the PMU event filter.
- Allow userspace to return EAGAIN when handling SNP and TDX hypercalls, so
that KVM can forward a "retry" status code to the guest, and reserve all
unused error codes for future usage.
- Misc fixes and cleanups.
|
|
KVM guest_memfd changes for 7.2
- Return -EEXIST instead of -EINVAL if userspace attempts to bind a gmem
range to multiple memslots, and fix the test that was supposed to ensure
KVM returns -EEXIST.
- Treat memslot binding offsets and sizes as unsigned values to fix a bug
where KVM interprets a large "offset + size" as a negative value and allows
a nonsensical offset.
- Use the inode number instead of the page offset for the NUMA interleaving
index to fix a bug where the effective index would jump by two for
consecutive pages (the caller also adds in the page offset).
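The unsigned "offset + size" handling mentioned in the guest_memfd list above can be sketched as follows. This is only an illustration of the overflow-safe pattern, not the KVM implementation: `gmem_range_valid()` and its parameters are hypothetical names.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: check that [offset, offset + size) fits inside a
 * file of file_size bytes. All values are unsigned, and the sum
 * offset + size is never computed, so it can neither wrap around nor be
 * misread as a negative number.
 */
static bool gmem_range_valid(uint64_t offset, uint64_t size, uint64_t file_size)
{
	if (offset > file_size)
		return false;
	/* file_size - offset cannot underflow after the check above. */
	return size <= file_size - offset;
}
```

Comparing `size` against the remaining space, rather than comparing the sum against the file size, is what keeps a very large `size` from slipping through.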
|
|
* kvm-arm64/vgic-v5-PPI-fixes:
: .
: Substantial cleanup of the vgic-v5 PPI support. From the original
: cover letter:
:
: "With the GICv5 PPI support merged in, it has become obvious that a few
: things could be improved, both from the correctness and maintainability
: angles."
: .
KVM: arm64: Fix arch timer interrupts for GICv3-on-GICv5 guests
irqchip/gic-v5: Immediately exec priority drop following activate
Documentation: KVM: Clarify that PMU_V3_IRQ IntID requirements for GICv5
Documentation: KVM: Fix typos in VGICv5 documentation
KVM: arm64: selftests: Improve error handling for GICv5 PPI selftest
KVM: arm64: selftests: Cleanup unused vars in GICv5 PPI selftest
KVM: arm64: selftests: Add missing GIC CDEN to no-vgic-v5 selftest
KVM: arm64: vgic-v5: Atomically assign bits to PPI DVI bitmap
KVM: arm64: vgic-v5: Add missing trap handing for NV triage
KVM: arm64: vgic-v5: Limit support to 64 PPIs
KVM: arm64: vgic: Rationalise per-CPU irq accessor
KVM: arm64: vgic-v5: Drop defensive checks from vgic_v5_ppi_queue_irq_unlock()
KVM: arm64: vgic: Consolidate vgic_allocate_private_irqs_locked()
KVM: arm64: vgic: Constify struct irq_ops usage
KVM: arm64: vgic-v5: Drop pointless ARM64_HAS_GICV5_CPUIF check
KVM: arm64: vgic-v5: Remove use of __assign_bit() with a constant
KVM: arm64: vgic-v5: Move PPI caps into kvm_vgic_global_state
KVM: arm64: vgic-v5: Add for_each_visible_v5_ppi() iterator
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
* kvm-arm64/pkvm-fixes-7.2:
: .
: Assorted pKVM fixes for 7.2:
:
: - Ensure that the vcpu memcache is filled in a number of cases (donate,
: share, selftest)
:
: - Fix vmemmap page order handling by resetting it when initialising the
: memory pool
:
: - Don't leak page references on failed memory donation
:
: - Add sanity-check for refcounted pages when donating/sharing pages
:
: - Clear __hyp_running_vcpu on state flush
:
: - Check LR upper bound against a trusted value
:
: - Assorted fixes for the host-side tracking of the pages shared with
: EL2 as a result of some Sashiko testing from Fuad
:
: - Correctly forward HCR_EL2.VSE from host to guest, so that protected
: guests can see SErrors
: .
KVM: arm64: Roll back partial shares on kvm_share_hyp() failure
KVM: arm64: Avoid host/hyp share desync on unshare hypercall failure
KVM: arm64: Free hyp-share tracking node when share hypercall fails
KVM: arm64: Flush HCR_EL2.VSE to deliver SErrors to pKVM guests
KVM: arm64: Bound used_lrs when flushing the pKVM hyp vCPU
KVM: arm64: Clear __hyp_running_vcpu when flushing the pKVM hyp vCPU
KVM: arm64: Pre-check vcpu memcache for host->guest donate
KVM: arm64: Pre-check vcpu memcache for host->guest share
KVM: arm64: Seed pkvm_ownership_selftest vcpu memcache
KVM: arm64: Add fail-safe for refcounted pages in __pkvm_hyp_donate_host
KVM: arm64: Fix __pkvm_init_vm error path
KVM: arm64: Reset page order in pKVM hyp_pool
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
* kvm-arm64/nv-granule-sizes:
: .
: Tidying up of the behaviour when the selected page size is not
: implemented, courtesy of Wei-Lin Chang. From the initial cover
: letter:
:
: "This small series fixes the granule size selection for software stage-1
: and stage-2 walks. Previously we treat the guest's TCR/VTCR.TGx as-is
: and use the encoded granule size for the walks. However this is
: incorrect if the granule sizes are not advertised in the guest's
: ID_AA64MMFR0_EL1.TGRAN*. The architecture specifies that when an
: unsupported size is programmed in TGx, it must be treated as an
: implemented size. Fix this by choosing an available one while
: prioritizing PAGE_SIZE."
: .
KVM: arm64: Fallback to a supported value for unsupported guest TGx
KVM: arm64: nv: Use literal granule size in TLBI range calculation
KVM: arm64: Factor out TG0/1 decoding of VTCR and TCR
KVM: arm64: nv: Rename vtcr_to_walk_info() to setup_s2_walk()
Signed-off-by: Marc Zyngier <maz@kernel.org>
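The fallback described in the cover letter quoted above (use the requested granule if it is advertised, otherwise pick an available one while prioritizing PAGE_SIZE) might look roughly like this. The names `pick_granule()`, `granule_bit()` and the `HAS_*` mask are hypothetical, and the order tried after PAGE_SIZE (4K, 16K, 64K) is an assumption of this sketch, not something the source states.

```c
#include <stddef.h>

/* Granule sizes in bytes. */
enum { SZ_4K = 4096, SZ_16K = 16384, SZ_64K = 65536 };

/* Hypothetical bitmask of the granules advertised to the guest. */
#define HAS_4K  (1u << 0)
#define HAS_16K (1u << 1)
#define HAS_64K (1u << 2)

/* Map a granule size to its bit in the mask; 0 for an unknown size. */
static unsigned int granule_bit(unsigned int size)
{
	switch (size) {
	case SZ_4K:  return HAS_4K;
	case SZ_16K: return HAS_16K;
	case SZ_64K: return HAS_64K;
	default:     return 0;
	}
}

/*
 * Return the granule to walk with: the requested one if advertised,
 * else page_size if advertised, else the first advertised granule in
 * an assumed fixed order. Returns 0 if nothing is advertised.
 */
static unsigned int pick_granule(unsigned int requested,
				 unsigned int supported,
				 unsigned int page_size)
{
	static const unsigned int order[] = { SZ_4K, SZ_16K, SZ_64K };
	size_t i;

	if (supported & granule_bit(requested))
		return requested;
	if (supported & granule_bit(page_size))
		return page_size;
	for (i = 0; i < sizeof(order) / sizeof(order[0]); i++)
		if (supported & granule_bit(order[i]))
			return order[i];
	return 0;
}
```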
|
|
* kvm-arm64/nv-fp-elision:
: .
: Significantly reduce the overhead of the context switch between L1 and
: L2 guests by eliding the save/restore of the FP/SIMD/SVE registers, as
: this state is shared between the two guests, and therefore can be left
: live.
: .
KVM: arm64: nv: Don't save/restore FP register during a nested ERET or exception
KVM: arm64: nv: Track L2 to L1 exception emulation
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
* kvm-arm64/no-lazy-vgic-init:
: .
: Fix an ugly situation where the vgic lazy init could happen in
: non-preemptible contexts such as vcpu reset, resulting in lockdep
: splats.
:
: This requires revamping the way in-kernel emulated devices
: (timers, PMU) present their interrupts to the vgic, and making
: sure there is no need to init the vgic on the back of that.
: .
KVM: arm64: vgic-v2: Don't init the vgic on in-kernel interrupt injection
KVM: arm64: vgic-v2: Force vgic init on injection outside the run loop
KVM: arm64: pmu: Kill the PMU interrupt level cache
KVM: arm64: timer: Kill the per-timer irq level cache
KVM: arm64: Simplify userspace notification of interrupt state
KVM: arm64: timer: Repaint kvm_timer_{should,irq_can}_fire() to kvm_timer_{pending,enabled}()
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
The return values of vgic_its_set_abi() and vgic_its_commit_v0() are always
0 and do not carry useful error information. Simplify by changing them to
void.
Suggested-by: Oliver Upton <oupton@kernel.org>
Signed-off-by: Jackie Liu <liuyun01@kylinos.cn>
Reviewed-by: Oliver Upton <oupton@kernel.org>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Link: https://patch.msgid.link/20260604075147.53299-1-liu.yun@linux.dev
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
KVM generic changes for 7.2
- Rename invalidate_begin() to invalidate_start() throughout KVM to follow
the kernel's nomenclature, e.g. for mmu_notifiers.
- Minor cleanups.
|
|
https://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux into HEAD
KVM: s390: A few more misc gmap fixes.
|
|
Use kvm_s390_mmu_cache_alloc_rmap() to allocate the rmap in
gmap_insert_rmap(), instead of a normal kzalloc_obj() with GFP_ATOMIC.
This guarantees forward progress.
Fixes: a2c17f9270cc ("KVM: s390: New gmap code")
CC: stable@vger.kernel.org # 7.1
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Message-ID: <20260611104850.110313-6-imbrenda@linux.ibm.com>
|
|
Add missing radix_tree_preload() in _gaccess_shadow_fault() to
guarantee forward progress. The core of _gaccess_shadow_fault() has
been split into ___gaccess_shadow_fault() in order to simplify locking.
Fixes: e38c884df921 ("KVM: s390: Switch to new gmap")
CC: stable@vger.kernel.org # 7.1
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Message-ID: <20260611104850.110313-5-imbrenda@linux.ibm.com>
|
|
The allocation size for struct vsie_rmap in kvm_s390_mmu_cache_topup()
was wrong due to a copy-paste error.
Fix it by using the type name.
Fixes: 12f2f61a9e1a ("KVM: s390: KVM page table management functions: allocation")
CC: stable@vger.kernel.org # 7.1
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Message-ID: <20260611104850.110313-4-imbrenda@linux.ibm.com>
|
|
Fix an unlikely race in try_get_locked_pte(), which could have happened
if puds or pmds get unmapped between the p?dp_get() and p?d_offset()
functions.
Fixes: 89fa757931dc ("KVM: s390: Avoid potentially sleeping while atomic when zapping pages")
CC: stable@vger.kernel.org # 7.1
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Message-ID: <20260611104850.110313-3-imbrenda@linux.ibm.com>
|
|
While dat_crstep_xchg_atomic() is marked as __must_check, in this
particular case the return value should be ignored.
Silence potential compiler warnings with a pointless check, and add a
comment to explain the situation.
Fixes: d1adc098ce08 ("KVM: s390: Fix _gmap_crstep_xchg_atomic()")
CC: stable@vger.kernel.org # 7.1
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Message-ID: <20260611104850.110313-2-imbrenda@linux.ibm.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management fixes from Rafael Wysocki:
"These address some remaining fallout after introducing dynamic EPP
support in the amd-pstate driver during the current development cycle:
- Restore allowing writing EPP of 0 when in performance mode in the
amd-pstate driver which was unnecessarily disallowed by one of the
recent updates (Mario Limonciello)
- Remove stale documentation of the epp_cached field in struct
amd_cpudata that has been dropped recently (Zhan Xusheng)"
* tag 'pm-7.1-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
cpufreq/amd-pstate: Fix setting EPP in performance mode
cpufreq/amd-pstate: drop stale @epp_cached kdoc
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux
Pull RISC-V fixes from Paul Walmsley:
- Fix the implementation of the CFI branch landing pad control prctl()s
to return -EINVAL if unknown control bits are set, rather than
silently ignoring the request; and add a kselftest for this case
- Fix unaligned access performance testing to happen earlier in boot,
which fixes a performance regression in the lib/checksum code
- Fix a binfmt_elf warning when dumping core (due to missing
.core_note_name for CFI registers)
* tag 'riscv-for-linux-7.1-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux:
riscv: cfi: reject unknown flags in PR_SET_CFI
riscv: Fix fast_unaligned_access_speed_key not getting initialized
riscv/ptrace: Use USER_REGSET_NOTE_TYPE for REGSET_CFI
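The first fix in the list above (return -EINVAL for unknown control bits instead of silently ignoring them) follows a common flag-validation pattern, sketched below. The `CTRL_*` bits and `set_ctrl()` are hypothetical; the real prctl() constants and kernel code are not reproduced here.

```c
#include <errno.h>

/* Hypothetical control bits, for illustration only. */
#define CTRL_ENABLE (1UL << 0)
#define CTRL_LOCK   (1UL << 1)
#define CTRL_KNOWN  (CTRL_ENABLE | CTRL_LOCK)

/*
 * Apply a control word. Any bit outside CTRL_KNOWN rejects the whole
 * request and leaves the current state untouched.
 */
static int set_ctrl(unsigned long *state, unsigned long arg)
{
	if (arg & ~CTRL_KNOWN)
		return -EINVAL;

	*state = arg;
	return 0;
}
```

Rejecting the request outright lets callers probe for support of new bits, which is not possible when unknown bits are dropped silently.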
|
|
open_tree(..., OPEN_TREE_NAMESPACE) and
fsmount(..., FSMOUNT_NAMESPACE, ...) currently work on non-directories,
like regular files. That's bad for two reasons:
- It ends up mounting a regular file over the inherited namespace root,
which is a directory; mounting a non-directory over a directory is
normally explicitly forbidden, see for example do_move_mount()
- It causes setns() on the new namespace to set the cwd to a regular
file, which the rest of VFS does not expect
Fix it by restricting create_new_namespace() (which is used by both of
these flags) to directories.
Leave the behavior for OPEN_TREE_CLONE as-is, that seems unproblematic.
Fixes: 9b8a0ba68246 ("mount: add OPEN_TREE_NAMESPACE")
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: stable@kernel.org
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Sashiko reports that there is a race between initialising vncr_tlb
and making use of it, as we don't hold the mmu_lock at this point.
Additionally, it identifies a memory leak, should userspace repeatedly
invoke the KVM_RUN ioctl after a failure of kvm_arch_vcpu_run_pid_change(),
as we assign vncr_tlb blindly on first run, irrespective of prior
allocations.
Slap the two bugs in one go by taking the kvm->mmu_lock on assigning
vncr_tlb, preventing the race for good, and by checking that vncr_tlb
is indeed NULL prior to allocation.
Reported-by: Sashiko <sashiko-bot@kernel.org>
Link: https://lore.kernel.org/r/20260607180815.85FBC1F00893@smtp.kernel.org
Reviewed-by: Oliver Upton <oupton@kernel.org>
Link: https://patch.msgid.link/20260608081108.2244133-1-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
VNCR TLB invalidation occurs from MMU notifiers or TLBI instructions,
and either can race against a vcpu not being onlined yet (no pseudo-TLB
allocated). Similarly, the TLB might be invalid, and the invalidation
should be skipped in this case.
Both kvm_invalidate_vncr_ipa() and kvm_invalidate_vncr_va() are
expected to perform the same checks, except that the latter doesn't
check for the allocation and blindly dereferences the pointer.
Solve this by introducing a new iterator built on top of the usual
kvm_for_each_vcpu() that checks for both of the above conditions,
and convert the two users to it.
Reported-by: Hyunwoo Kim <imv4bel@gmail.com>
Link: https://lore.kernel.org/r/aiUvSbrWndQeUPc8@v4bel
Fixes: 4ffa72ad8f37 ("KVM: arm64: nv: Add S1 TLB invalidation primitive for VNCR_EL2")
Cc: stable@vger.kernel.org
Reviewed-by: Oliver Upton <oupton@kernel.org>
Link: https://patch.msgid.link/20260607175745.297793-1-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull runtime verifier fixes from Steven Rostedt:
- Fix reset ordering on per-task destruction
Reset the task before dropping the slot instead of after, which was
causing out-of-bound memory accesses.
- Fix HA monitor synchronization and cleanup
Ensure synchronous cleanup for HA monitors by running timer callbacks
in RCU read-side critical sections and using synchronize_rcu() during
destruction.
- Avoid armed timers after tasks exit
Add automatic cleanup for per-task HA monitors to prevent timers from
firing after task exit.
- Fix memory ordering for DA/HA monitors
Fix race conditions during monitor start by using release-acquire
semantics for the monitoring flag.
- Fix initialization for DA/HA monitors
Ensure monitor initialization does not rely on potentially corrupted
state such as the monitoring flag, which is not reset by all monitor
types and may have an unknown state in monitors reusing the storage
(per-task).
- Fix memory safety in per-task and per-object monitors
Prevent use-after-free and out-of-bounds access by synchronizing with
in-flight tracepoint probes using tracepoint_synchronize_unregister()
before freeing monitor storage or releasing task slots.
- Adjust monitors for preemptible tracepoints
Fix monitors that relied on tracepoints disabling preemption.
Explicitly disable task migration when per-CPU monitors handle events
to avoid accessing the wrong state and update the opid monitor logic.
- Fix incorrect __user specifier usage
Remove __user from a non-pointer variable in the extract_params()
helper.
- Fix bugs in the rv tool
Ensure strings are NUL-terminated, fix substring matching in monitor
searches, and improve cleanup and exit status handling.
- Fix several bugs in rvgen
Fix LTL literal stringification, subparsers' options handling, and
suffix stripping in dot2k.
* tag 'trace-rv-v7.1-rc6-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
verification/rvgen: Fix ltl2k writing True as a literal
verification/rvgen: Fix options shared among commands
verification/rvgen: Fix suffix strip in dot2k
tools/rv: Fix cleanup after failed trace setup
tools/rv: Fix substring match when listing container monitors
tools/rv: Fix substring match bug in monitor name search
tools/rv: Ensure monitor name and desc are NUL-terminated
rv: Use 0 to check preemption enabled in opid
rv: Prevent task migration while handling per-CPU events
rv: Ensure synchronous cleanup for HA monitors
rv: Add automatic cleanup handlers for per-task HA monitors
rv: Do not rely on clean monitor when initialising HA
rv: Fix monitor start ordering and memory ordering for monitoring flag
rv: Ensure all pending probes terminate on per-obj monitor destroy
rv: Prevent in-flight per-task handlers from using invalid slots
rv: Reset per-task DA monitors before releasing the slot
rv: Fix __user specifier usage in extract_params()
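The release-acquire pairing mentioned for the monitoring flag in the list above can be illustrated with C11 atomics. This is a userspace sketch under assumptions: `struct monitor`, `monitor_start()` and `monitor_handle_event()` are hypothetical names, and the kernel uses its own primitives rather than `<stdatomic.h>`.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical monitor: a state word guarded by a "monitoring" flag. */
struct monitor {
	int state;
	atomic_bool monitoring;
};

/*
 * Publish the initial state first, then set the flag with release
 * semantics so the state write cannot be reordered after it.
 */
static void monitor_start(struct monitor *m, int initial)
{
	m->state = initial;
	atomic_store_explicit(&m->monitoring, true, memory_order_release);
}

/*
 * Read the flag with acquire semantics. If it is observed as set, the
 * state written before the release store is guaranteed to be visible.
 */
static bool monitor_handle_event(struct monitor *m, int *seen)
{
	if (!atomic_load_explicit(&m->monitoring, memory_order_acquire))
		return false;

	*seen = m->state;
	return true;
}
```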
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull RTLA fix from Steven Rostedt:
- Fix multi-character short option parsing
Fix regression in parsing of multiple-character short options
(eg -p100 /= -p 100/, -un /= -u -n/) caused by getopt_long()
internal state corruption after a refactoring.
* tag 'trace-tools-v7.1-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
rtla: Fix parsing of multi-character short options
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc fixes from Andrew Morton:
"11 hotfixes. 9 are for MM. 8 are cc:stable and the remaining 3 address
post-7.1 issues or aren't considered suitable for backporting.
There's a two-patch series "mm/damon/{reclaim,lru_sort}: handle ctx
allocation failures" from SeongJae Park which fixes a couple of DAMON
-ENOMEM bloopers. The rest are singletons - please see the individual
changelogs for details"
* tag 'mm-hotfixes-stable-2026-06-08-20-51' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
mm/mincore: handle non-swap entries before !CONFIG_SWAP guard
arm64: mm: call pagetable dtor when freeing hot-removed page tables
mm/list_lru: drain before clearing xarray entry on reparent
mm/huge_memory: use correct flags for device private PMD entry
mm/damon/lru_sort: handle ctx allocation failure
mm/damon/reclaim: handle ctx allocation failure
zram: fix use-after-free in zram_bvec_write_partial()
MAINTAINERS: update Baoquan He's email address
tools headers UAPI: sync linux/taskstats.h for procacct.c
mm/cma_sysfs: skip inactive CMA areas in sysfs
ipc/shm: serialize orphan cleanup with shm_nattch updates
|
|
Pull rdma fixes from Jason Gunthorpe:
"Several significant bug fixes of pre-existing issues:
- Missing validation on ucap fd types passed from userspace
- Missing validation of HW DMA space vs userspace expected sizes in
EFA queue setup
- DMA corruption when using DMA block sizes >= 4G when setting up MRs
in all drivers
- Missing validation of CPU IDs when setting up dma handles
- Missing validation of IB_MR_REREG_ACCESS when changing writability
of a MR
- Missing validation of received message/packet size in ISER and SRP"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma:
RDMA/srp: bound SRP_RSP sense copy by the received length
IB/isert: Reject login PDUs shorter than ISER_HEADERS_LEN
RDMA: During rereg_mr ensure that REREG_ACCESS is compatible
RDMA/core: Validate cpu_id against nr_cpu_ids in DMAH alloc
RDMA/umem: Fix truncation for block sizes >= 4G
RDMA/efa: Validate SQ ring size against max LLQ size
RDMA/core: Validate the passed in fops for ib_get_ucaps()
|
|
writes
Recursively zap orphaned nested TDP shadow pages when emulating a guest
write to a shadowed page table, regardless of whether or not the associated
(parent) shadow page will be zapped, e.g. due to detected write-flooding.
This plugs a hole where KVM fails to reclaim defunct, unsync shadow pages
for select L1 hypervisor patterns. Commit 2de4085cccea ("KVM: x86/MMU:
Recursively zap nested TDP SPs when zapping last/only parent") modified KVM
to recursively zap synchronized shadow pages (KVM already recursively zaps
unsync children) when a child is orphaned. But the fix effectively only
applied the logic to kvm_mmu_page_unlink_children(), i.e. only performs the
recursive zap when KVM is already zapping a parent SP and processing its
children.
If L1 zaps SPTEs bottom-up (4KiB => 2MiB => ...), as KVM's TDP MMU does
with CONFIG_KVM_PROVE_MMU=n since commit 8ca983631f3c ("KVM: x86/mmu: Zap
invalidated TDP MMU roots at 4KiB granularity"), then KVM (as L0) will leak
upwards of 4 shadow pages per GiB of L2 guest memory. Over hundreds or
thousands of L2 boots, if the VM is "lucky" enough to escape write-flooding
detection, i.e. not trigger reclaim of the orphaned shadow pages by dumb
luck, then it's possible to end up with tens or even hundreds of thousands
of unsync shadow pages and associated rmap entries.
Polluting the hash table and rmap entries with a horde of stale entries
can eventually degrade L2 guest boot time by an order of magnitude,
especially if there is any antagonistic activity in the host, i.e. anything
that will contend for mmu_lock and/or needs to walk rmaps.
With "top"-down zapping, where "top" is 1GiB or above, L0 KVM is
effectively limited to leaking 4 shadow pages per 256 GiB of memory, as
KVM's write flooding detection will kick in on the third write to an L1
TDP PUD, and thus recursively zap the entire 256 GiB range of the parent
PGD. I.e. even though L1 KVM still recursively zaps 2MiB => 4KiB SPTEs
when zapping each 1GiB SPTE, KVM only gets through two of the 1GiB SPTEs
before dropping everything. E.g. hacking tracing into L0 KVM's
kvm_mmu_track_write(), the top-down zapping of L1's TDP MMU for an L2 with
16GiB of memory leads to:
gpa = 107407000, old = 800000010741bd07, new = 8000000000000000, level = 3, flood = 0
gpa = 10741b000, old = 8000000112fb2d07, new = 80000000000001a0, level = 2, flood = 0
gpa = 10741b008, old = 800000012509cd07, new = 80000000000001a0, level = 2, flood = 1
gpa = 10741b010, old = 80000001114b9d07, new = 80000000000001a0, level = 2, flood = 2
gpa = 107407008, old = 8000000112fb5d07, new = 8000000000000000, level = 3, flood = 1
gpa = 112fb5298, old = 8000000106f43d07, new = 80000000000001a0, level = 2, flood = 0
gpa = 112fb52a0, old = 8000000106f4dd07, new = 80000000000001a0, level = 2, flood = 1
gpa = 112fb5ea0, old = 8000000120490d07, new = 80000000000001a0, level = 2, flood = 2
gpa = 107407010, old = 8000000106df2d07, new = 8000000000000000, level = 3, flood = 2
gpa = 107410000, old = 8000000107408d07, new = 8000000000000000, level = 5, flood = 0
gpa = 107408000, old = 8000000107407d07, new = 80000000000001a0, level = 4, flood = 0
Contrast that with a bottom-up zap, which effectively allows all 2MiB SPTEs
in L1 to leak their children.
gpa = 167939000, old = 800000011c8f4d07, new = 8000000000000000, level = 2, flood = 0
gpa = 167939020, old = 8000000104407d07, new = 8000000000000000, level = 2, flood = 1
gpa = 167939028, old = 800000011ed20d07, new = 8000000000000000, level = 2, flood = 2
gpa = 118c70bb0, old = 8000000167ab9d07, new = 8000000000000000, level = 2, flood = 0
gpa = 118c70bb8, old = 8000000163913d07, new = 8000000000000000, level = 2, flood = 1
gpa = 118c70de8, old = 800000011cc9dd07, new = 8000000000000000, level = 2, flood = 2
gpa = 160be7fb0, old = 800000011d322d07, new = 8000000000000000, level = 2, flood = 1
gpa = 160be7fb8, old = 8000000126b1bd07, new = 8000000000000000, level = 2, flood = 2
gpa = 1634ab000, old = 800000010e984d07, new = 8000000000000000, level = 2, flood = 0
gpa = 1634ab008, old = 800000016879fd07, new = 8000000000000000, level = 2, flood = 1
gpa = 1634ab010, old = 800000016879ed07, new = 8000000000000000, level = 2, flood = 2
gpa = 11e3f1e48, old = 8000000168a33d07, new = 8000000000000000, level = 2, flood = 0
gpa = 11e3f1e50, old = 80000001664dcd07, new = 8000000000000000, level = 2, flood = 1
gpa = 1167eacb8, old = 8000000166544d07, new = 8000000000000000, level = 2, flood = 0
gpa = 1167eacc0, old = 800000015c16bd07, new = 8000000000000000, level = 2, flood = 1
gpa = 1689e89b8, old = 800000015f296d07, new = 8000000000000000, level = 2, flood = 0
gpa = 1689e89c0, old = 8000000167ca8d07, new = 8000000000000000, level = 2, flood = 1
gpa = 107b35eb8, old = 8000000161e71d07, new = 8000000000000000, level = 2, flood = 0
gpa = 107b35ec0, old = 8000000118cf3d07, new = 8000000000000000, level = 2, flood = 1
gpa = 118cf2d48, old = 8000000118cf1d07, new = 8000000000000000, level = 2, flood = 0
gpa = 118cf2d50, old = 8000000118cf0d07, new = 8000000000000000, level = 2, flood = 1
gpa = 118dcb770, old = 8000000118dcad07, new = 8000000000000000, level = 2, flood = 0
gpa = 118dcb778, old = 8000000118dc9d07, new = 8000000000000000, level = 2, flood = 1
gpa = 118dc87e8, old = 8000000126997d07, new = 8000000000000000, level = 2, flood = 0
gpa = 118dc87f0, old = 8000000126996d07, new = 8000000000000000, level = 2, flood = 1
gpa = 126995148, old = 8000000126994d07, new = 8000000000000000, level = 2, flood = 0
gpa = 126995150, old = 8000000103477d07, new = 8000000000000000, level = 2, flood = 1
gpa = 1034764c8, old = 8000000103475d07, new = 8000000000000000, level = 2, flood = 0
gpa = 1034764d0, old = 8000000103474d07, new = 8000000000000000, level = 2, flood = 1
gpa = 10ea4b788, old = 800000010ea4ad07, new = 8000000000000000, level = 2, flood = 0
gpa = 10ea4b790, old = 800000010ea49d07, new = 8000000000000000, level = 2, flood = 1
gpa = 10ea48928, old = 800000011a5bfd07, new = 8000000000000000, level = 2, flood = 0
gpa = 10ea48930, old = 800000011a5bed07, new = 8000000000000000, level = 2, flood = 1
gpa = 11a5bd0d8, old = 800000011a5bcd07, new = 8000000000000000, level = 2, flood = 0
gpa = 11a5bd0e0, old = 800000011d323d07, new = 8000000000000000, level = 2, flood = 1
gpa = 122ce2b40, old = 800000011fe0bd07, new = 8000000000000000, level = 2, flood = 0
gpa = 122ce2b48, old = 800000010e985d07, new = 8000000000000000, level = 2, flood = 1
gpa = 122ce2b50, old = 8000000161c9dd07, new = 8000000000000000, level = 2, flood = 2
gpa = 16864c000, old = 8000000167939d07, new = 8000000000000000, level = 3, flood = 0
gpa = 16864c008, old = 8000000118c70d07, new = 8000000000000000, level = 3, flood = 1
gpa = 16864c010, old = 80000001688a6d07, new = 8000000000000000, level = 3, flood = 2
gpa = 11c8f7000, old = 80000001608a7d07, new = 8000000000000000, level = 5, flood = 0
gpa = 1608a7000, old = 800000016864cd07, new = 80000000000001a0, level = 4, flood = 0
Note, in the shadow MMU, "level" describes the level a shadow page "points"
at, not the level of its associated SPTE. I.e. when write-flooding of 1GiB
PUD entries is detected, KVM recursively zaps shadow pages covering 256GiB
worth of memory. And as shown above, KVM's write-flooding detection
operates at all levels, so a single PMD (in L1) can effectively only leak
two unsync children (4KiB shadow pages) before it gets recursively zapped.
As a result, for the top-down zap, L0 KVM will leak at most 4 unsync shadow
pages per 256GiB of L2 memory.
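To put the two bounds side by side, here is a back-of-the-envelope sketch in C. It takes the figures quoted above at face value (4 leaked shadow pages per GiB for bottom-up, at most 4 per 256GiB for top-down) and assumes nothing is reclaimed between boots; the function names and the rounding up to whole 256GiB ranges are illustrative assumptions, not anything taken from KVM.

```c
#include <stdint.h>

/* Figure quoted in the changelog; used here as a rough per-boot estimate. */
#define TOY_LEAK_PER_UNIT 4u

/* Bottom-up zap: roughly 4 leaked shadow pages per GiB of L2 memory. */
static uint64_t toy_leak_bottom_up(uint64_t l2_gib, uint64_t boots)
{
    return TOY_LEAK_PER_UNIT * l2_gib * boots;
}

/*
 * Top-down zap: at most 4 leaked shadow pages per 256GiB range.
 * Rounding a partial range up to a full one is an assumption.
 */
static uint64_t toy_leak_top_down(uint64_t l2_gib, uint64_t boots)
{
    uint64_t ranges = (l2_gib + 255) / 256;

    return TOY_LEAK_PER_UNIT * ranges * boots;
}
```

For the 16GiB L2 used in the traces, this gives about 64 leaked pages per boot bottom-up versus at most 4 top-down. Over a thousand boots the bottom-up figure reaches the "tens of thousands" range described earlier.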
The top-down zap also makes it more likely that L1 will self-heal (to some
extent), as any shadow pages that are "rediscovered" by future runs of L2
can get reclaimed by a recursive zap, whereas bottom-up zapping orphans
shadow pages over and over.
Note, in theory, there is some risk of over-zapping, e.g. due to zapping a
large branch of the paging tree that L1 is only temporarily removing. In
practice, the usage patterns of hypervisors are highly unlikely to trigger
false positives. E.g. temporarily changing paging protections is typically
done at the leaf, not on a non-leaf entry. And if the L1 hypervisor is
updating large swaths of PTEs, e.g. to (temporarily?) remove chunks of
memory from L2, then L0 KVM's write-flooding detection will kick in, and
the children would be zapped anyways.
Fixes: 2de4085cccea ("KVM: x86/MMU: Recursively zap nested TDP SPs when zapping last/only parent")
Cc: Yosry Ahmed <yosry@kernel.org>
Cc: Jim Mattson <jmattson@google.com>
Cc: James Houghton <jthoughton@google.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Yosry Ahmed <yosry@kernel.org>
Link: https://patch.msgid.link/20260605174611.2222504-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
srp_process_rsp() copies sense data from rsp->data + resp_data_len,
where resp_data_len is the full 32-bit value supplied by the SRP target
and is never checked against the number of bytes actually received
(wc->byte_len). The copy length is bounded to SCSI_SENSE_BUFFERSIZE, so
at most 96 bytes are copied, but the source offset is not bounded.
A malicious or compromised SRP target on the InfiniBand/RoCE fabric that
the initiator has logged into can return an SRP_RSP with
SRP_RSP_FLAG_SNSVALID set and a large resp_data_len. The receive buffer
is allocated at the target-chosen max_ti_iu_len, so the source of the
sense copy lands past the bytes actually received; with resp_data_len
near 0xFFFFFFFF it is gigabytes past the buffer and the read faults.
Copy the sense data only if it has not been truncated, that is, only if
the response header, the response data, and the sense region fit within
the bytes actually received; otherwise drop the sense and log. The
in-tree iSER and NVMe-RDMA receive paths already bound their parse by
wc->byte_len; this brings ib_srp into line with them.
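A minimal sketch of the bound described above, not the actual ib_srp change. The helper name and the TOY_* constants are hypothetical, and the header size is an assumed value. The point is that the sum is formed in 64 bits, so a resp_data_len near 0xFFFFFFFF cannot wrap around and pass the check.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins; the real definitions live in the driver. */
#define TOY_RSP_HDR_LEN   36u  /* assumed size of the fixed response header */
#define TOY_SENSE_BUFSIZE 96u  /* the 96-byte copy cap mentioned above */

/*
 * Return how many sense bytes may be copied, or 0 if the header, the
 * response data, and the sense region do not all fit in byte_len.
 */
static size_t toy_sense_copy_len(uint32_t byte_len, uint32_t resp_data_len,
                                 uint32_t sense_data_len)
{
    /* 64-bit sum: three 32-bit quantities cannot overflow it. */
    uint64_t end = (uint64_t)TOY_RSP_HDR_LEN + resp_data_len + sense_data_len;

    if (end > byte_len)
        return 0;  /* truncated response: drop the sense data */

    return sense_data_len < TOY_SENSE_BUFSIZE ? sense_data_len
                                              : TOY_SENSE_BUFSIZE;
}
```

A caller would copy only when the result is non-zero, and log the dropped sense otherwise.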
Fixes: aef9ec39c47f ("IB: Add SCSI RDMA Protocol (SRP) initiator")
Link: https://patch.msgid.link/r/20260602220457.2542840-1-michael.bommarito@gmail.com
Cc: stable@vger.kernel.org
Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Michael Bommarito <michael.bommarito@gmail.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
In drivers/infiniband/ulp/isert/ib_isert.c, isert_login_recv_done()
computes the login request payload length as wc->byte_len minus
ISER_HEADERS_LEN with no lower bound, and login_req_len is a signed int.
A remote iSER initiator can post a login Send work request carrying
fewer than ISER_HEADERS_LEN (76) bytes, so the subtraction underflows
and login_req_len becomes negative.
isert_rx_login_req() then reads that negative length back into a signed
int, takes size = min(rx_buflen, MAX_KEY_VALUE_PAIRS), and because the
min() is signed it keeps the negative value; the value is then passed as
the memcpy() length and sign-extended to a multi-gigabyte size_t. The
copy into the 8192-byte login->req_buf runs far out of bounds and
faults, crashing the target node. The login phase precedes iSCSI
authentication, so no credentials are required to reach this path.
Reject any login PDU shorter than ISER_HEADERS_LEN before the
subtraction, mirroring the existing early return on a failed work
completion, so login_req_len can never go negative. The upper bound was
already safe: a posted login buffer cannot deliver more than
ISER_RX_PAYLOAD_SIZE, so the difference stays at or below
MAX_KEY_VALUE_PAIRS and the existing min() clamps it; only the missing
lower bound needs to be added.
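The underflow and the missing lower bound can be reproduced in isolation. This is an illustrative sketch with hypothetical toy_* helpers rather than the isert code; ISER_HEADERS_LEN uses the value quoted above, and MAX_KEY_VALUE_PAIRS is assumed to equal the 8192-byte buffer size.

```c
#include <stddef.h>
#include <stdint.h>

#define ISER_HEADERS_LEN    76    /* value quoted in the text above */
#define MAX_KEY_VALUE_PAIRS 8192  /* assumed to match the 8192-byte req_buf */

/* Unchecked subtraction as described: negative for short PDUs. */
static int toy_login_req_len_unchecked(uint32_t byte_len)
{
    return (int)byte_len - ISER_HEADERS_LEN;
}

/* A signed min() keeps a negative length; the cast mimics memcpy()'s size_t. */
static size_t toy_copy_size(int rx_buflen)
{
    int size = rx_buflen < MAX_KEY_VALUE_PAIRS ? rx_buflen
                                               : MAX_KEY_VALUE_PAIRS;

    return (size_t)size;
}

/* With the lower bound: -1 means "reject the PDU", else a length >= 0. */
static int toy_login_req_len_checked(uint32_t byte_len)
{
    if (byte_len < ISER_HEADERS_LEN)
        return -1;

    return (int)(byte_len - ISER_HEADERS_LEN);
}
```

Feeding a 10-byte PDU through the unchecked path yields a length of -66, which toy_copy_size() turns into a size far larger than the buffer. The checked path rejects the same PDU before any subtraction takes place.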
Fixes: b8d26b3be8b3 ("iser-target: Add iSCSI Extensions for RDMA (iSER) target driver")
Link: https://patch.msgid.link/r/20260602194642.2273217-1-michael.bommarito@gmail.com
Cc: stable@vger.kernel.org
Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Michael Bommarito <michael.bommarito@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
If IB_MR_REREG_ACCESS changes from RO to RW then the umem has to be
re-evaluated to ensure it is properly pinned as RW. Since the umem is
hidden inside each driver's mr struct, add an ib_umem_check_rereg() function
that each driver has to call before processing IB_MR_REREG_ACCESS.
mlx4 has to retain its duplicate ib_access_writable check because it
implements IB_MR_REREG_ACCESS | IB_MR_REREG_TRANS by changing both items
in place sequentially while the MR is live, so it will continue to not
support this combination.
Cc: stable@vger.kernel.org
Fixes: b40656aa7d55 ("RDMA/umem: remove FOLL_FORCE usage")
Link: https://patch.msgid.link/r/0-v1-06fb1a2d6cf5+107-rereg_access_jgg@nvidia.com
Reported-by: Philip Tsukerman <philiptsukerman@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
KVM has reflected KVM_X86_SNP_VM to userspace since 1dfe571c12cf
("KVM: SEV: Add initial SEV-SNP support"), and KVM_X86_TDX_VM since
161d34609f9b ("KVM: TDX: Make TDX VM type supported"). Update the
documentation to reflect this fact.
Fixes: 1dfe571c12cf ("KVM: SEV: Add initial SEV-SNP support")
Fixes: 161d34609f9b ("KVM: TDX: Make TDX VM type supported")
Signed-off-by: Carlos López <clopez@suse.de>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Link: https://patch.msgid.link/20260603114504.814647-2-clopez@suse.de
[sean: use one tab instead of two]
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Add a regression test where KVM would inadvertently ignore PMU event
filters on writes that change _some_ bits in FIXED_CTR_CTRL, but not the
enable bits for PMCs that are denied to the guest.
Link: https://patch.msgid.link/20260603231905.1738487-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
When (conditionally) reprogramming fixed counters, use the hardware value
of FIXED_CTR_CTRL to detect changes, not the guest's original value. For
guests with a mediated PMU, overwriting fixed_ctr_ctrl_hw at the start of
reprogramming without actually reacting to changes in fixed_ctr_ctrl_hw can
lead to KVM ignoring PMU event filters.
E.g. if the guest attempts to enable a fixed PMC that is disallowed, and
then toggles a different PMC in a subsequent WRMSR, KVM will update
pmu->fixed_ctr_ctrl_hw and reprogram the PMC that is changing, but not the
others that are now effectively enabled in pmu->fixed_ctr_ctrl_hw.
Note, the perf-based PMU is unaffected, as it doesn't use fixed_ctr_ctrl_hw
(which is also why keying off fixed_ctr_ctrl_hw works for both PMUs).
Note #2, fixed_ctr_ctrl_hw won't mess up pmc_in_use either, because the
latter isn't used by the mediated PMU. Its purpose is solely to release
perf events that are no longer being actively used, and the mediated PMU
obviously doesn't create perf events.
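The failure mode can be modelled with a toy PMU. This is a deliberately simplified sketch: the struct, the toy_* helpers, and the "reprogramming applies the filter" rule are assumptions made for illustration and do not mirror KVM's data structures. The only hardware fact relied on is that each fixed counter owns a 4-bit field in FIXED_CTR_CTRL.

```c
#include <stdbool.h>
#include <stdint.h>

#define TOY_NR_FIXED   3
#define TOY_FIELD_BITS 4  /* each fixed counter owns a 4-bit control field */

struct toy_pmu {
    uint64_t ctrl_guest;  /* value the guest last wrote */
    uint64_t ctrl_hw;     /* value that would be loaded into hardware */
    uint32_t denied;      /* bit i set: the filter denies fixed counter i */
};

static uint64_t toy_field(int idx)
{
    return 0xfull << (idx * TOY_FIELD_BITS);
}

/* Reprogramming is the only place the filter is applied in this model. */
static void toy_reprogram(struct toy_pmu *pmu, int idx)
{
    if (pmu->denied & (1u << idx))
        pmu->ctrl_hw &= ~toy_field(idx);
}

static void toy_write_ctrl(struct toy_pmu *pmu, uint64_t data,
                           bool diff_against_hw)
{
    uint64_t old = diff_against_hw ? pmu->ctrl_hw : pmu->ctrl_guest;

    pmu->ctrl_guest = data;
    pmu->ctrl_hw = data;  /* overwritten up front, as described above */

    for (int i = 0; i < TOY_NR_FIXED; i++)
        if ((old ^ data) & toy_field(i))
            toy_reprogram(pmu, i);
}

/* Scenario from the text: enable denied counter 0, then toggle counter 1. */
static uint64_t toy_scenario(bool diff_against_hw)
{
    struct toy_pmu pmu = { .denied = 1u << 0 };

    toy_write_ctrl(&pmu, 0x03, diff_against_hw);
    toy_write_ctrl(&pmu, 0x33, diff_against_hw);
    return pmu.ctrl_hw;
}
```

Diffing against the guest's old value leaves the hardware value at 0x33 after the second write, so the denied counter 0 stays enabled. Diffing against the old hardware value reprograms counter 0 as well and ends at 0x30.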
Reported-by: Sashiko <sashiko-bot@kernel.org>
Closes: https://lore.kernel.org/all/20260528005419.0228F1F00A3A@smtp.kernel.org
Link: https://patch.msgid.link/20260603231905.1738487-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
When checking if a VP ID is included in a sparse bank set, explicitly check
that the ID can actually be contained in a sparse bank (the TLFS allows for
a maximum of 64 banks of 64 vCPUs each). When handling a paravirtual TLB
flush for L2, the VP ID is copied verbatim from the enlightened VMCS,
without any bounds check, i.e. isn't guaranteed to be under the limit of
4096.
Failure to check the bounds of the VP ID leads to an out-of-bounds read
when testing the sparse bank, and super strictly speaking could lead to KVM
performing an unnecessary TLB flush for an L2 vCPU.
==================================================================
BUG: KASAN: use-after-free in hv_is_vp_in_sparse_set+0x85/0x100 [kvm]
Read of size 8 at addr ffff88811ba5f598 by task hyperv_evmcs/2802
CPU: 12 UID: 1000 PID: 2802 Comm: hyperv_evmcs Not tainted 7.1.0-rc2 #7 PREEMPT
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
Call Trace:
<TASK>
dump_stack_lvl+0x51/0x60
print_report+0xcb/0x5d0
kasan_report+0xb4/0xe0
kasan_check_range+0x35/0x1b0
hv_is_vp_in_sparse_set+0x85/0x100 [kvm]
kvm_hv_flush_tlb+0xe9e/0x16c0 [kvm]
kvm_hv_hypercall+0xe6b/0x1e60 [kvm]
vmx_handle_exit+0x485/0x1b60 [kvm_intel]
kvm_arch_vcpu_ioctl_run+0x22e3/0x5070 [kvm]
kvm_vcpu_ioctl+0x5d0/0x10c0 [kvm]
__x64_sys_ioctl+0x129/0x1a0
do_syscall_64+0xb9/0xcf0
entry_SYSCALL_64_after_hwframe+0x4b/0x53
RIP: 0033:0x7f0e62d1a9bf
</TASK>
The buggy address belongs to the physical page:
page: refcount:0 mapcount:0 mapping:0000000000000000 index:0xffffffffffffffff pfn:0x11ba5f
flags: 0x4000000000000000(zone=1)
raw: 4000000000000000 0000000000000000 00000000ffffffff 0000000000000000
raw: ffffffffffffffff 0000000000000000 00000000ffffffff 0000000000000000
page dumped because: kasan: bad access detected
Memory state around the buggy address:
ffff88811ba5f480: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
ffff88811ba5f500: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
>ffff88811ba5f580: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
^
ffff88811ba5f600: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
ffff88811ba5f680: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
==================================================================
Disabling lock debugging due to kernel taint
Opportunistically add a compile time assertion to ensure the maximum number
of sparse banks exactly matches the number of possible bits in the passed
in mask.
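A self-contained sketch of a bounds-checked sparse-set lookup, assuming the layout described above (up to 64 banks of 64 VPs, with only the valid banks stored, densely packed). The toy_* names are hypothetical and the code is an illustration of the check, not the KVM implementation.

```c
#include <stdbool.h>
#include <stdint.h>

#define TOY_MAX_BANKS    64u  /* TLFS limit quoted above */
#define TOY_VPS_PER_BANK 64u

/* Compile-time check: one bit in the 64-bit mask per possible bank. */
_Static_assert(TOY_MAX_BANKS == sizeof(uint64_t) * 8,
               "bank count must match the width of valid_bank_mask");

static unsigned toy_popcount64(uint64_t v)
{
    unsigned n = 0;

    for (; v; v &= v - 1)  /* clear the lowest set bit each round */
        n++;
    return n;
}

static bool toy_vp_in_sparse_set(uint32_t vp_id, uint64_t valid_bank_mask,
                                 const uint64_t *sparse_banks)
{
    uint32_t bank = vp_id / TOY_VPS_PER_BANK;
    unsigned idx;

    /* The bounds check: IDs of 4096 and above cannot live in any bank. */
    if (bank >= TOY_MAX_BANKS)
        return false;
    if (!(valid_bank_mask & (1ull << bank)))
        return false;

    /* Banks are packed densely: index = number of valid banks below. */
    idx = toy_popcount64(valid_bank_mask & ((1ull << bank) - 1));
    return sparse_banks[idx] & (1ull << (vp_id % TOY_VPS_PER_BANK));
}
```

Without the first check, a VP ID of 4096 or more would produce a shift count of 64 or greater and an index past the end of the bank array, which corresponds to the out-of-bounds read reported in the splat.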
Cc: stable@vger.kernel.org
Fixes: c58a318f6090 ("KVM: x86: hyper-v: L2 TLB flush")
Signed-off-by: Hyunwoo Kim <imv4bel@gmail.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Link: https://patch.msgid.link/aiQyZIJtO-2Aj_xN@v4bel
[sean: add KASAN splat, drop comment, add assert, massage changelog]
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
kvm_gmem_get_policy() sets the interleave index (the output param that's
typically named "ilx") to the full page offset (vm_pgoff + vma offset).
But get_vma_policy() adds the page offset on top of the interleave index,
and so the offset is counted twice. This causes NUMA interleaving to skip
nodes: for order-0 pages the effective index jumps by 2 for each
consecutive page.
The vm_op.get_policy() implementation should return only a per-file bias in
the interleave index (like shmem_get_policy does with inode->i_ino),
letting get_vma_policy() add the page-offset component.
Fix by setting the output interleave index to the inode number (a la shmem)
instead of the full page offset, as the index is intended to be a constant,
semi-random value for a given file, e.g. so that interleaving doesn't start
at the same node for every file, and so that allocations are round-robined
across nodes based on the page offset (the selected node would bounce/skip
around if the index isn't constant).
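The double counting is easy to see in a simplified model in which the core adds the page offset to whatever index the hook returned and then picks a node round-robin. The toy_* functions below are hypothetical, and the plain modulo stands in for the real node selection, which is an assumption of this sketch.

```c
#include <stdint.h>

/*
 * Simplified model: the caller adds the page offset to the index
 * returned by the get_policy() hook, then selects a node round-robin.
 */
static unsigned toy_interleave_node(uint64_t ilx, uint64_t pgoff,
                                    unsigned nr_nodes)
{
    return (unsigned)((ilx + pgoff) % nr_nodes);
}

/* Buggy hook: returns the page offset itself, so it is counted twice. */
static unsigned toy_node_buggy(uint64_t pgoff, unsigned nr_nodes)
{
    return toy_interleave_node(pgoff, pgoff, nr_nodes);
}

/* Fixed hook: returns a per-file constant, here the inode number. */
static unsigned toy_node_fixed(uint64_t ino, uint64_t pgoff,
                               unsigned nr_nodes)
{
    return toy_interleave_node(ino, pgoff, nr_nodes);
}
```

With four nodes, page offsets 0..3 map to nodes 0, 2, 0, 2 in the buggy variant, so half the nodes are never used. With a constant bias of 7 the same offsets map to nodes 3, 0, 1, 2, which visits every node once.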
Found by Sashiko (sashiko.dev) AI code review.
Fixes: ed1ffa810bd6 ("KVM: guest_memfd: Enforce NUMA mempolicy using shared policy")
Cc: Sean Christopherson <seanjc@google.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Assisted-by: Claude:claude-opus-4-6
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: David Hildenbrand (Arm) <david@kernel.org>
Reviewed-by: Shivank Garg <shivankg@amd.com>
Tested-by: Shivank Garg <shivankg@amd.com>
Fixes: 7f3779a3ac3e ("mm/filemap: Add NUMA mempolicy support to filemap_alloc_folio()")
Link: https://patch.msgid.link/0eff0a90667b900bee837d06b5db5025e1f304b5.1780501924.git.mst@redhat.com
[sean: use reverse fir-tree, massage changelog]
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6
Pull crypto fix from Herbert Xu:
- Fix random config build failure on s390.
* tag 'v7.1-p5' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6:
crypto: s390 - add select CRYPTO_AEAD for aes
|