kernel/linux.git/arch/x86/include/asm, branch v6.1.168

x86/sev: Allow IBPB-on-Entry feature for SNP guests

2026-03-25T10:03:14+00:00

[ Upstream commit 9073428bb204d921ae15326bb7d4558d9d269aab ] The SEV-SNP IBPB-on-Entry feature does not require a guest-side implementation. It was added in Zen5 h/w, after the first SNP Zen implementation, and thus was not accounted for when the initial set of SNP features were added to the kernel. In its abundant precaution, commit 8c29f0165405 ("x86/sev: Add SEV-SNP guest feature negotiation support") included SEV_STATUS' IBPB-on-Entry bit as a reserved bit, thereby masking guests from using the feature. Allow guests to make use of IBPB-on-Entry when supported by the hypervisor, as the bit is now architecturally defined and safe to expose. Fixes: 8c29f0165405 ("x86/sev: Add SEV-SNP guest feature negotiation support") Signed-off-by: Kim Phillips Signed-off-by: Borislav Petkov (AMD) Reviewed-by: Nikunj A Dadhania Reviewed-by: Tom Lendacky Cc: stable@kernel.org Link: https://patch.msgid.link/20260203222405.4065706-2-kim.phillips@amd.com [ No SECURE_AVIC ] Signed-off-by: Sasha Levin Signed-off-by: Greg Kroah-Hartman

x86/efi: defer freeing of boot services memory

2026-03-25T10:02:58+00:00

commit a4b0bf6a40f3c107c67a24fbc614510ef5719980 upstream. efi_free_boot_services() frees memory occupied by EFI_BOOT_SERVICES_CODE and EFI_BOOT_SERVICES_DATA using memblock_free_late(). There are two issue with that: memblock_free_late() should be used for memory allocated with memblock_alloc() while the memory reserved with memblock_reserve() should be freed with free_reserved_area(). More acutely, with CONFIG_DEFERRED_STRUCT_PAGE_INIT=y efi_free_boot_services() is called before deferred initialization of the memory map is complete. Benjamin Herrenschmidt reports that this causes a leak of ~140MB of RAM on EC2 t3a.nano instances which only have 512MB or RAM. If the freed memory resides in the areas that memory map for them is still uninitialized, they won't be actually freed because memblock_free_late() calls memblock_free_pages() and the latter skips uninitialized pages. Using free_reserved_area() at this point is also problematic because __free_page() accesses the buddy of the freed page and that again might end up in uninitialized part of the memory map. Delaying the entire efi_free_boot_services() could be problematic because in addition to freeing boot services memory it updates efi.memmap without any synchronization and that's undesirable late in boot when there is concurrency. More robust approach is to only defer freeing of the EFI boot services memory. Split efi_free_boot_services() in two. First efi_unmap_boot_services() collects ranges that should be freed into an array then efi_free_boot_services() later frees them after deferred init is complete. Link: https://lore.kernel.org/all/ec2aaef14783869b3be6e3c253b2dcbf67dbc12a.camel@kernel.crashing.org Fixes: 916f676f8dc0 ("x86, efi: Retain boot service code until after switching to virtual mode") Cc: Signed-off-by: Mike Rapoport (Microsoft) Reviewed-by: Benjamin Herrenschmidt Signed-off-by: Ard Biesheuvel Signed-off-by: Greg Kroah-Hartman

x86/kfence: fix booting on 32bit non-PAE systems

2026-02-11T12:37:19+00:00

commit 16459fe7e0ca6520a6e8f603de4ccd52b90fd765 upstream. The original patch inverted the PTE unconditionally to avoid L1TF-vulnerable PTEs, but Linux doesn't make this adjustment in 2-level paging. Adjust the logic to use the flip_protnone_guard() helper, which is a nop on 2-level paging but inverts the address bits in all other paging modes. This doesn't matter for the Xen aspect of the original change. Linux no longer supports running 32bit PV under Xen, and Xen doesn't support running any 32bit PV guests without using PAE paging. Link: https://lkml.kernel.org/r/20260126211046.2096622-1-andrew.cooper3@citrix.com Fixes: b505f1944535 ("x86/kfence: avoid writing L1TF-vulnerable PTEs") Reported-by: Ryusuke Konishi Closes: https://lore.kernel.org/lkml/CAKFNMokwjw68ubYQM9WkzOuH51wLznHpEOMSqtMoV1Rn9JV_gw@mail.gmail.com/ Signed-off-by: Andrew Cooper Tested-by: Ryusuke Konishi Tested-by: Borislav Petkov (AMD) Cc: Alexander Potapenko Cc: Marco Elver Cc: Dmitry Vyukov Cc: Thomas Gleixner Cc: Ingo Molnar Cc: Dave Hansen Cc: "H. Peter Anvin" Cc: Jann Horn Cc: Signed-off-by: Andrew Morton Signed-off-by: Greg Kroah-Hartman

x86/kfence: avoid writing L1TF-vulnerable PTEs

2026-02-06T15:44:07+00:00

commit b505f1944535f83d369ae68813e7634d11b990d3 upstream. For native, the choice of PTE is fine. There's real memory backing the non-present PTE. However, for XenPV, Xen complains: (XEN) d1 L1TF-vulnerable L1e 8010000018200066 - Shadowing To explain, some background on XenPV pagetables: Xen PV guests are control their own pagetables; they choose the new PTE value, and use hypercalls to make changes so Xen can audit for safety. In addition to a regular reference count, Xen also maintains a type reference count. e.g. SegDesc (referenced by vGDT/vLDT), Writable (referenced with _PAGE_RW) or L{1..4} (referenced by vCR3 or a lower pagetable level). This is in order to prevent e.g. a page being inserted into the pagetables for which the guest has a writable mapping. For non-present mappings, all other bits become software accessible, and typically contain metadata rather a real frame address. There is nothing that a reference count could sensibly be tied to. As such, even if Xen could recognise the address as currently safe, nothing would prevent that frame from changing owner to another VM in the future. When Xen detects a PV guest writing a L1TF-PTE, it responds by activating shadow paging. This is normally only used for the live phase of migration, and comes with a reasonable overhead. KFENCE only cares about getting #PF to catch wild accesses; it doesn't care about the value for non-present mappings. Use a fully inverted PTE, to avoid hitting the slow path when running under Xen. While adjusting the logic, take the opportunity to skip all actions if the PTE is already in the right state, half the number PVOps callouts, and skip TLB maintenance on a !P -> P transition which benefits non-Xen cases too. Link: https://lkml.kernel.org/r/20260106180426.710013-1-andrew.cooper3@citrix.com Fixes: 1dc0da6e9ec0 ("x86, kfence: enable KFENCE for x86") Signed-off-by: Andrew Cooper Tested-by: Marco Elver Cc: Alexander Potapenko Cc: Marco Elver Cc: Dmitry Vyukov Cc: Thomas Gleixner Cc: Ingo Molnar Cc: Borislav Petkov Cc: Dave Hansen Cc: "H. Peter Anvin" Cc: Jann Horn Cc: Signed-off-by: Andrew Morton Signed-off-by: Greg Kroah-Hartman

KVM: SVM: Don't skip unrelated instruction if INT3/INTO is replaced

2026-01-11T14:19:15+00:00

[ Upstream commit 4da3768e1820cf15cced390242d8789aed34f54d ] When re-injecting a soft interrupt from an INT3, INT0, or (select) INTn instruction, discard the exception and retry the instruction if the code stream is changed (e.g. by a different vCPU) between when the CPU executes the instruction and when KVM decodes the instruction to get the next RIP. As effectively predicted by commit 6ef88d6e36c2 ("KVM: SVM: Re-inject INT3/INTO instead of retrying the instruction"), failure to verify that the correct INTn instruction was decoded can effectively clobber guest state due to decoding the wrong instruction and thus specifying the wrong next RIP. The bug most often manifests as "Oops: int3" panics on static branch checks in Linux guests. Enabling or disabling a static branch in Linux uses the kernel's "text poke" code patching mechanism. To modify code while other CPUs may be executing that code, Linux (temporarily) replaces the first byte of the original instruction with an int3 (opcode 0xcc), then patches in the new code stream except for the first byte, and finally replaces the int3 with the first byte of the new code stream. If a CPU hits the int3, i.e. executes the code while it's being modified, then the guest kernel must look up the RIP to determine how to handle the #BP, e.g. by emulating the new instruction. If the RIP is incorrect, then this lookup fails and the guest kernel panics. The bug reproduces almost instantly by hacking the guest kernel to repeatedly check a static branch[1] while running a drgn script[2] on the host to constantly swap out the memory containing the guest's TSS. [1]: https://gist.github.com/osandov/44d17c51c28c0ac998ea0334edf90b5a [2]: https://gist.github.com/osandov/10e45e45afa29b11e0c7209247afc00b Fixes: 6ef88d6e36c2 ("KVM: SVM: Re-inject INT3/INTO instead of retrying the instruction") Cc: stable@vger.kernel.org Co-developed-by: Sean Christopherson Signed-off-by: Omar Sandoval Link: https://patch.msgid.link/1cc6dcdf36e3add7ee7c8d90ad58414eeb6c3d34.1762278762.git.osandov@fb.com Signed-off-by: Sean Christopherson Signed-off-by: Sasha Levin Signed-off-by: Greg Kroah-Hartman

KVM: x86/mmu: Use EMULTYPE flag to track write #PFs to shadow pages

2026-01-11T14:19:15+00:00

[ Upstream commit 258d985f6eb360c9c7aacd025d0dbc080a59423f ] Use a new EMULTYPE flag, EMULTYPE_WRITE_PF_TO_SP, to track page faults on self-changing writes to shadowed page tables instead of propagating that information to the emulator via a semi-persistent vCPU flag. Using a flag in "struct kvm_vcpu_arch" is confusing, especially as implemented, as it's not at all obvious that clearing the flag only when emulation actually occurs is correct. E.g. if KVM sets the flag and then retries the fault without ever getting to the emulator, the flag will be left set for future calls into the emulator. But because the flag is consumed if and only if both EMULTYPE_PF and EMULTYPE_ALLOW_RETRY_PF are set, and because EMULTYPE_ALLOW_RETRY_PF is deliberately not set for direct MMUs, emulated MMIO, or while L2 is active, KVM avoids false positives on a stale flag since FNAME(page_fault) is guaranteed to be run and refresh the flag before it's ultimately consumed by the tail end of reexecute_instruction(). Signed-off-by: Sean Christopherson Message-Id: <20230202182817.407394-2-seanjc@google.com> Signed-off-by: Paolo Bonzini Stable-dep-of: 4da3768e1820 ("KVM: SVM: Don't skip unrelated instruction if INT3/INTO is replaced") Signed-off-by: Sasha Levin Signed-off-by: Greg Kroah-Hartman

x86/ptrace: Always inline trivial accessors

2026-01-11T14:18:39+00:00

[ Upstream commit 1fe4002cf7f23d70c79bda429ca2a9423ebcfdfa ] A KASAN build bloats these single load/store helpers such that it fails to inline them: vmlinux.o: error: objtool: irqentry_exit+0x5e8: call to instruction_pointer_set() with UACCESS enabled Make sure the compiler isn't allowed to do stupid. Signed-off-by: Peter Zijlstra (Intel) Signed-off-by: Ingo Molnar Link: https://patch.msgid.link/20251031105435.GU4068168@noisy.programming.kicks-ass.net Signed-off-by: Sasha Levin

x86/vdso: Fix output operand size of RDPID

2025-10-15T09:56:25+00:00

[ Upstream commit ac9c408ed19d535289ca59200dd6a44a6a2d6036 ] RDPID instruction outputs to a word-sized register (64-bit on x86_64 and 32-bit on x86_32). Use an unsigned long variable to store the correct size. LSL outputs to 32-bit register, use %k operand prefix to always print the 32-bit name of the register. Use RDPID insn mnemonic while at it as the minimum binutils version of 2.30 supports it. [ bp: Merge two patches touching the same function into a single one. ] Fixes: ffebbaedc861 ("x86/vdso: Introduce helper functions for CPU and node number") Signed-off-by: Uros Bizjak Signed-off-by: Borislav Petkov (AMD) Link: https://lore.kernel.org/20250616095315.230620-1-ubizjak@gmail.com Signed-off-by: Sasha Levin

x86/vmscape: Add conditional IBPB mitigation

2025-09-11T15:19:15+00:00

Commit 2f8f173413f1cbf52660d04df92d0069c4306d25 upstream. VMSCAPE is a vulnerability that exploits insufficient branch predictor isolation between a guest and a userspace hypervisor (like QEMU). Existing mitigations already protect kernel/KVM from a malicious guest. Userspace can additionally be protected by flushing the branch predictors after a VMexit. Since it is the userspace that consumes the poisoned branch predictors, conditionally issue an IBPB after a VMexit and before returning to userspace. Workloads that frequently switch between hypervisor and userspace will incur the most overhead from the new IBPB. This new IBPB is not integrated with the existing IBPB sites. For instance, a task can use the existing speculation control prctl() to get an IBPB at context switch time. With this implementation, the IBPB is doubled up: one at context switch and another before running userspace. The intent is to integrate and optimize these cases post-embargo. [ dhansen: elaborate on suboptimal IBPB solution ] Suggested-by: Dave Hansen Signed-off-by: Pawan Gupta Signed-off-by: Dave Hansen Reviewed-by: Dave Hansen Reviewed-by: Borislav Petkov (AMD) Acked-by: Sean Christopherson Signed-off-by: Borislav Petkov (AMD) Signed-off-by: Greg Kroah-Hartman

x86/vmscape: Enumerate VMSCAPE bug

2025-09-11T15:19:15+00:00

Commit a508cec6e5215a3fbc7e73ae86a5c5602187934d upstream. The VMSCAPE vulnerability may allow a guest to cause Branch Target Injection (BTI) in userspace hypervisors. Kernels (both host and guest) have existing defenses against direct BTI attacks from guests. There are also inter-process BTI mitigations which prevent processes from attacking each other. However, the threat in this case is to a userspace hypervisor within the same process as the attacker. Userspace hypervisors have access to their own sensitive data like disk encryption keys and also typically have access to all guest data. This means guest userspace may use the hypervisor as a confused deputy to attack sensitive guest kernel data. There are no existing mitigations for these attacks. Introduce X86_BUG_VMSCAPE for this vulnerability and set it on affected Intel and AMD CPUs. Signed-off-by: Pawan Gupta Signed-off-by: Dave Hansen Reviewed-by: Borislav Petkov (AMD) Signed-off-by: Borislav Petkov (AMD) Signed-off-by: Greg Kroah-Hartman