kernel/linux.git/arch/x86/include/asm/entry-common.h, branch v7.1

x86/kvm/vmx: Move IRQ/NMI dispatch from KVM into x86 core

2026-05-19T18:25:51+00:00

Move the VMX interrupt dispatch magic into the x86 core code. This isolates KVM from the FRED/IDT decisions and reduces the amount of EXPORT_SYMBOL_FOR_KVM(). Suggested-by: Sean Christopherson Signed-off-by: Peter Zijlstra (Intel) Signed-off-by: Thomas Gleixner Tested-by: "Verma, Vishal L" Tested-by: Zhao Liu Tested-by: Zhao Liu Tested-by: Sean Christopherson Reviewed-by: Binbin Wu Acked-by: Sean Christopherson Link: https://patch.msgid.link/20260508091829.GO3126523@noisy.programming.kicks-ass.net

randomize_kstack: Unify random source across arches

2026-03-25T04:12:03+00:00

Previously different architectures were using random sources of differing strength and cost to decide the random kstack offset. A number of architectures (loongarch, powerpc, s390, x86) were using their timestamp counter, at whatever the frequency happened to be. Other arches (arm64, riscv) were using entropy from the crng via get_random_u16(). There have been concerns that in some cases the timestamp counters may be too weak, because they can be easily guessed or influenced by user space. And get_random_u16() has been shown to be too costly for the level of protection kstack offset randomization provides. So let's use a common, architecture-agnostic source of entropy; a per-cpu prng, seeded at boot-time from the crng. This has a few benefits: - We can remove choose_random_kstack_offset(); That was only there to try to make the timestamp counter value a bit harder to influence from user space [*]. - The architecture code is simplified. All it has to do now is call add_random_kstack_offset() in the syscall path. - The strength of the randomness can be reasoned about independently of the architecture. - Arches previously using get_random_u16() now have much faster syscall paths, see below results. [*] Additionally, this gets rid of some redundant work on s390 and x86. Before this patch, those architectures called choose_random_kstack_offset() under arch_exit_to_user_mode_prepare(), which is also called for exception returns to userspace which were *not* syscalls (e.g. regular interrupts). Getting rid of choose_random_kstack_offset() avoids a small amount of redundant work for the non-syscall cases. In some configurations, add_random_kstack_offset() will now call instrumentable code, so for a couple of arches, I have moved the call a bit later to the first point where instrumentation is allowed. This doesn't impact the efficacy of the mechanism. There have been some claims that a prng may be less strong than the timestamp counter if not regularly reseeded. But the prng has a period of about 2^113. So as long as the prng state remains secret, it should not be possible to guess. If the prng state can be accessed, we have bigger problems. Additionally, we are only consuming 6 bits to randomize the stack, so there are only 64 possible random offsets. I assert that it would be trivial for an attacker to brute force by repeating their attack and waiting for the random stack offset to be the desired one. The prng approach seems entirely proportional to this level of protection. Performance data are provided below. The baseline is v6.18 with rndstack on for each respective arch. (I)/(R) indicate statistically significant improvement/regression. arm64 platform is AWS Graviton3 (m7g.metal). x86_64 platform is AWS Sapphire Rapids (m7i.24xlarge): +-----------------+--------------+---------------+---------------+ | Benchmark | Result Class | per-cpu-prng | per-cpu-prng | | | | arm64 (metal) | x86_64 (VM) | +=================+==============+===============+===============+ | syscall/getpid | mean (ns) | (I) -9.50% | (I) -17.65% | | | p99 (ns) | (I) -59.24% | (I) -24.41% | | | p99.9 (ns) | (I) -59.52% | (I) -28.52% | +-----------------+--------------+---------------+---------------+ | syscall/getppid | mean (ns) | (I) -9.52% | (I) -19.24% | | | p99 (ns) | (I) -59.25% | (I) -25.03% | | | p99.9 (ns) | (I) -59.50% | (I) -28.17% | +-----------------+--------------+---------------+---------------+ | syscall/invalid | mean (ns) | (I) -10.31% | (I) -18.56% | | | p99 (ns) | (I) -60.79% | (I) -20.06% | | | p99.9 (ns) | (I) -61.04% | (I) -25.04% | +-----------------+--------------+---------------+---------------+ I tested an earlier version of this change on x86 bare metal and it showed a smaller but still significant improvement. The bare metal system wasn't available this time around so testing was done in a VM instance. I'm guessing the cost of rdtsc is higher for VMs. Acked-by: Mark Rutland Signed-off-by: Ryan Roberts Link: https://patch.msgid.link/20260303150840.3789438-3-ryan.roberts@arm.com Signed-off-by: Kees Cook

x86/vmscape: Add conditional IBPB mitigation

2025-08-14T17:37:18+00:00

VMSCAPE is a vulnerability that exploits insufficient branch predictor isolation between a guest and a userspace hypervisor (like QEMU). Existing mitigations already protect kernel/KVM from a malicious guest. Userspace can additionally be protected by flushing the branch predictors after a VMexit. Since it is the userspace that consumes the poisoned branch predictors, conditionally issue an IBPB after a VMexit and before returning to userspace. Workloads that frequently switch between hypervisor and userspace will incur the most overhead from the new IBPB. This new IBPB is not integrated with the existing IBPB sites. For instance, a task can use the existing speculation control prctl() to get an IBPB at context switch time. With this implementation, the IBPB is doubled up: one at context switch and another before running userspace. The intent is to integrate and optimize these cases post-embargo. [ dhansen: elaborate on suboptimal IBPB solution ] Suggested-by: Dave Hansen Signed-off-by: Pawan Gupta Signed-off-by: Dave Hansen Reviewed-by: Dave Hansen Reviewed-by: Borislav Petkov (AMD) Acked-by: Sean Christopherson

x86/fpu: Shift fpregs_assert_state_consistent() from arch_exit_work() to its caller

2025-05-04T08:29:25+00:00

If CONFIG_X86_DEBUG_FPU=Y, arch_exit_to_user_mode_prepare() calls arch_exit_work() even if ti_work == 0. There only reason is that we want to call fpregs_assert_state_consistent() if TIF_NEED_FPU_LOAD is not set. This looks confusing. arch_exit_to_user_mode_prepare() can just call fpregs_assert_state_consistent() unconditionally, it depends on CONFIG_X86_DEBUG_FPU and checks TIF_NEED_FPU_LOAD itself. Signed-off-by: Oleg Nesterov Signed-off-by: Ingo Molnar Cc: Chang S . Bae Cc: H. Peter Anvin Cc: Andy Lutomirski Cc: Brian Gerst Cc: Linus Torvalds Cc: Peter Zijlstra Link: https://lore.kernel.org/r/20250503143902.GA9012@redhat.com

x86/entry: Set FRED RSP0 on return to userspace instead of context switch

2024-08-25T17:23:00+00:00

The FRED RSP0 MSR points to the top of the kernel stack for user level event delivery. As this is the task stack it needs to be updated when a task is scheduled in. The update is done at context switch. That means it's also done when switching to kernel threads, which is pointless as those never go out to user space. For KVM threads this means there are two writes to FRED_RSP0 as KVM has to switch to the guest value before VMENTER. Defer the update to the exit to user space path and cache the per CPU FRED_RSP0 value, so redundant writes can be avoided. Provide fred_sync_rsp0() for KVM to keep the cache in sync with the actual MSR value after returning from guest to host mode. [ tglx: Massage change log ] Suggested-by: Sean Christopherson Suggested-by: Thomas Gleixner Signed-off-by: Xin Li (Intel) Signed-off-by: Thomas Gleixner Link: https://lore.kernel.org/all/20240822073906.2176342-4-xin@zytor.com

x86/entry: Test ti_work for zero before processing individual bits

2024-08-25T17:23:00+00:00

In most cases, ti_work values passed to arch_exit_to_user_mode_prepare() are zeros, e.g., 99% in kernel build tests. So an obvious optimization is to test ti_work for zero before processing individual bits in it. Omit the optimization when FPU debugging is enabled, otherwise the FPU consistency check is never executed. Intel 0day tests did not find a perfermance regression with this change. Suggested-by: H. Peter Anvin (Intel) Signed-off-by: Xin Li (Intel) Signed-off-by: Thomas Gleixner Link: https://lore.kernel.org/all/20240822073906.2176342-2-xin@zytor.com

randomize_kstack: Remove non-functional per-arch entropy filtering

2024-06-28T15:54:56+00:00

An unintended consequence of commit 9c573cd31343 ("randomize_kstack: Improve entropy diffusion") was that the per-architecture entropy size filtering reduced how many bits were being added to the mix, rather than how many bits were being used during the offsetting. All architectures fell back to the existing default of 0x3FF (10 bits), which will consume at most 1KiB of stack space. It seems that this is working just fine, so let's avoid the confusion and update everything to use the default. The prior intent of the per-architecture limits were: arm64: capped at 0x1FF (9 bits), 5 bits effective powerpc: uncapped (10 bits), 6 or 7 bits effective riscv: uncapped (10 bits), 6 bits effective x86: capped at 0xFF (8 bits), 5 (x86_64) or 6 (ia32) bits effective s390: capped at 0xFF (8 bits), undocumented effective entropy Current discussion has led to just dropping the original per-architecture filters. The additional entropy appears to be safe for arm64, x86, and s390. Quoting Arnd, "There is no point pretending that 15.75KB is somehow safe to use while 15.00KB is not." Co-developed-by: Yuntao Liu Signed-off-by: Yuntao Liu Fixes: 9c573cd31343 ("randomize_kstack: Improve entropy diffusion") Link: https://lore.kernel.org/r/20240617133721.377540-1-liuyuntao12@huawei.com Reviewed-by: Arnd Bergmann Acked-by: Mark Rutland Acked-by: Heiko Carstens # s390 Link: https://lore.kernel.org/r/20240619214711.work.953-kees@kernel.org Signed-off-by: Kees Cook

x86/bugs: Use ALTERNATIVE() instead of mds_user_clear static key

2024-02-20T00:31:49+00:00

The VERW mitigation at exit-to-user is enabled via a static branch mds_user_clear. This static branch is never toggled after boot, and can be safely replaced with an ALTERNATIVE() which is convenient to use in asm. Switch to ALTERNATIVE() to use the VERW mitigation late in exit-to-user path. Also remove the now redundant VERW in exc_nmi() and arch_exit_to_user_mode(). Signed-off-by: Pawan Gupta Signed-off-by: Dave Hansen Link: https://lore.kernel.org/all/20240213-delay-verw-v8-4-a6216d83edb7%40linux.intel.com

x86/CPU/AMD: Fix the DIV(0) initial fix attempt

2023-08-14T09:02:50+00:00

Initially, it was thought that doing an innocuous division in the #DE handler would take care to prevent any leaking of old data from the divider but by the time the fault is raised, the speculation has already advanced too far and such data could already have been used by younger operations. Therefore, do the innocuous division on every exit to userspace so that userspace doesn't see any potentially old data from integer divisions in kernel space. Do the same before VMRUN too, to protect host data from leaking into the guest too. Fixes: 77245f1c3c64 ("x86/CPU/AMD: Do not leak quotient data after a division by 0") Signed-off-by: Borislav Petkov (AMD) Cc: Link: https://lore.kernel.org/r/20230811213824.10025-1-bp@alien8.de

x86/cpu: Remove unneeded 64-bit dependency in arch_enter_from_user_mode()

2022-11-22T15:11:57+00:00

The check for 64-bit mode when testing X86_FEATURE_XENPV isn't needed, as Xen PV guests are no longer supported in 32-bit mode, see a13f2ef168cb ("x86/xen: remove 32-bit Xen PV guest support"). While at it switch from boot_cpu_has() to cpu_feature_enabled(). Signed-off-by: Juergen Gross Signed-off-by: Borislav Petkov Acked-by: Dave Hansen Link: https://lore.kernel.org/r/20221104072701.20283-3-jgross@suse.com