kernel/linux.git/arch/x86/entry/syscall_64.c, branch linux-7.1.y

randomize_kstack: Unify random source across arches

2026-03-25T04:12:03+00:00

Previously different architectures were using random sources of differing strength and cost to decide the random kstack offset. A number of architectures (loongarch, powerpc, s390, x86) were using their timestamp counter, at whatever the frequency happened to be. Other arches (arm64, riscv) were using entropy from the crng via get_random_u16(). There have been concerns that in some cases the timestamp counters may be too weak, because they can be easily guessed or influenced by user space. And get_random_u16() has been shown to be too costly for the level of protection kstack offset randomization provides. So let's use a common, architecture-agnostic source of entropy; a per-cpu prng, seeded at boot-time from the crng. This has a few benefits: - We can remove choose_random_kstack_offset(); That was only there to try to make the timestamp counter value a bit harder to influence from user space [*]. - The architecture code is simplified. All it has to do now is call add_random_kstack_offset() in the syscall path. - The strength of the randomness can be reasoned about independently of the architecture. - Arches previously using get_random_u16() now have much faster syscall paths, see below results. [*] Additionally, this gets rid of some redundant work on s390 and x86. Before this patch, those architectures called choose_random_kstack_offset() under arch_exit_to_user_mode_prepare(), which is also called for exception returns to userspace which were *not* syscalls (e.g. regular interrupts). Getting rid of choose_random_kstack_offset() avoids a small amount of redundant work for the non-syscall cases. In some configurations, add_random_kstack_offset() will now call instrumentable code, so for a couple of arches, I have moved the call a bit later to the first point where instrumentation is allowed. This doesn't impact the efficacy of the mechanism. There have been some claims that a prng may be less strong than the timestamp counter if not regularly reseeded. But the prng has a period of about 2^113. So as long as the prng state remains secret, it should not be possible to guess. If the prng state can be accessed, we have bigger problems. Additionally, we are only consuming 6 bits to randomize the stack, so there are only 64 possible random offsets. I assert that it would be trivial for an attacker to brute force by repeating their attack and waiting for the random stack offset to be the desired one. The prng approach seems entirely proportional to this level of protection. Performance data are provided below. The baseline is v6.18 with rndstack on for each respective arch. (I)/(R) indicate statistically significant improvement/regression. arm64 platform is AWS Graviton3 (m7g.metal). x86_64 platform is AWS Sapphire Rapids (m7i.24xlarge): +-----------------+--------------+---------------+---------------+ | Benchmark | Result Class | per-cpu-prng | per-cpu-prng | | | | arm64 (metal) | x86_64 (VM) | +=================+==============+===============+===============+ | syscall/getpid | mean (ns) | (I) -9.50% | (I) -17.65% | | | p99 (ns) | (I) -59.24% | (I) -24.41% | | | p99.9 (ns) | (I) -59.52% | (I) -28.52% | +-----------------+--------------+---------------+---------------+ | syscall/getppid | mean (ns) | (I) -9.52% | (I) -19.24% | | | p99 (ns) | (I) -59.25% | (I) -25.03% | | | p99.9 (ns) | (I) -59.50% | (I) -28.17% | +-----------------+--------------+---------------+---------------+ | syscall/invalid | mean (ns) | (I) -10.31% | (I) -18.56% | | | p99 (ns) | (I) -60.79% | (I) -20.06% | | | p99.9 (ns) | (I) -61.04% | (I) -25.04% | +-----------------+--------------+---------------+---------------+ I tested an earlier version of this change on x86 bare metal and it showed a smaller but still significant improvement. The bare metal system wasn't available this time around so testing was done in a VM instance. I'm guessing the cost of rdtsc is higher for VMs. Acked-by: Mark Rutland Signed-off-by: Ryan Roberts Link: https://patch.msgid.link/20260303150840.3789438-3-ryan.roberts@arm.com Signed-off-by: Kees Cook

x86/syscall: Remove stray semicolons

2025-03-19T10:19:18+00:00

No functional change. Suggested-by: Sohil Mehta Signed-off-by: Brian Gerst Signed-off-by: Ingo Molnar Reviewed-by: Sohil Mehta Cc: Andy Lutomirski Cc: H. Peter Anvin Cc: Linus Torvalds Cc: Josh Poimboeuf Link: https://lore.kernel.org/r/20250314151220.862768-7-brgerst@gmail.com

x86/syscall/x32: Move x32 syscall table

2025-03-19T10:19:15+00:00

Since commit: 2e958a8a510d ("x86/entry/x32: Rename __x32_compat_sys_* to __x64_compat_sys_*") the ABI prefix for x32 syscalls is the same as native 64-bit syscalls. Move the x32 syscall table to syscall_64.c No functional changes. Signed-off-by: Brian Gerst Signed-off-by: Ingo Molnar Reviewed-by: Sohil Mehta Cc: Andy Lutomirski Cc: H. Peter Anvin Cc: Linus Torvalds Cc: Josh Poimboeuf Link: https://lore.kernel.org/r/20250314151220.862768-5-brgerst@gmail.com

x86/syscall/64: Move 64-bit syscall dispatch code

2025-03-19T10:19:04+00:00

Move the 64-bit syscall dispatch code to syscall_64.c. No functional changes. Signed-off-by: Brian Gerst Signed-off-by: Ingo Molnar Reviewed-by: Sohil Mehta Cc: Andy Lutomirski Cc: H. Peter Anvin Cc: Linus Torvalds Cc: Josh Poimboeuf Link: https://lore.kernel.org/r/20250314151220.862768-4-brgerst@gmail.com

x86/syscall: Mark exit[_group] syscall handlers __noreturn

2024-06-28T13:23:38+00:00

The direct-call syscall dispatch function doesn't know that the exit() and exit_group() syscall handlers don't return, so the call sites aren't optimized accordingly. Fix that by marking the exit syscall declarations __noreturn. Fixes the following warnings: vmlinux.o: warning: objtool: x64_sys_call+0x2804: __x64_sys_exit() is missing a __noreturn annotation vmlinux.o: warning: objtool: ia32_sys_call+0x29b6: __ia32_sys_exit_group() is missing a __noreturn annotation Fixes: 1e3ad78334a6 ("x86/syscall: Don't force use of indirect calls for system calls") Closes: https://lkml.kernel.org/lkml/6dba9b32-db2c-4e6d-9500-7a08852f17a3@paulmck-laptop Reported-by: Paul E. McKenney Signed-off-by: Josh Poimboeuf Signed-off-by: Borislav Petkov (AMD) Tested-by: Paul E. McKenney Link: https://lore.kernel.org/r/5d8882bc077d8eadcc7fd1740b56dfb781f12288.1719381528.git.jpoimboe@kernel.org

x86/syscall: Don't force use of indirect calls for system calls

2024-04-08T17:27:05+00:00

Make build a switch statement instead, and the compiler can either decide to generate an indirect jump, or - more likely these days due to mitigations - just a series of conditional branches. Yes, the conditional branches also have branch prediction, but the branch prediction is much more controlled, in that it just causes speculatively running the wrong system call (harmless), rather than speculatively running possibly wrong random less controlled code gadgets. This doesn't mitigate other indirect calls, but the system call indirection is the first and most easily triggered case. Signed-off-by: Linus Torvalds Signed-off-by: Daniel Sneddon Signed-off-by: Thomas Gleixner Reviewed-by: Josh Poimboeuf

x86/syscalls: Stop filling syscall arrays with *_sys_ni_syscall

2021-05-20T13:03:59+00:00

This is a follow-up cleanup after switching to the generic syscalltbl.sh. The old x86 specific script skipped non-existing syscalls. So, the generated syscalls_64.h, for example, had a big hole in the syscall numbers 335-423 range. That is why there exists [0 ... __NR_*_syscall_max] = &__*_sys_ni_cyscall. The new script, scripts/syscalltbl.sh automatically fills holes with __SYSCALL(, sys_ni_syscall), hence such ugly code can go away. The designated initializers, '[nr] =' are also unneeded. Also, there is no need to give __NR_*_syscall_max+1 because the array size is implied by the number of syscalls in the generated headers. Hence, there is no need to include , either. Signed-off-by: Masahiro Yamada Signed-off-by: Thomas Gleixner Link: https://lore.kernel.org/r/20210517073815.97426-4-masahiroy@kernel.org

x86/syscalls: Switch to generic syscalltbl.sh

2021-05-20T13:03:58+00:00

Many architectures duplicate similar shell scripts. Convert x86 and UML to use scripts/syscalltbl.sh. The generic script generates seperate headers for x86/64 and x86/x32 syscalls, while the x86 specific script coalesced them into one. Adjust the code accordingly. Signed-off-by: Masahiro Yamada Signed-off-by: Thomas Gleixner Link: https://lore.kernel.org/r/20210517073815.97426-3-masahiroy@kernel.org

x86/entry: Drop asmlinkage from syscalls

2020-03-21T15:03:25+00:00

asmlinkage is no longer required since the syscall ABI is now fully under x86 architecture control. This makes the 32-bit native syscalls a bit more effecient by passing in regs via EAX instead of on the stack. Signed-off-by: Brian Gerst Signed-off-by: Thomas Gleixner Reviewed-by: Dominik Brodowski Reviewed-by: Andy Lutomirski Link: https://lkml.kernel.org/r/20200313195144.164260-18-brgerst@gmail.com

x86/entry: Remove ABI prefixes from functions in syscall tables

2020-03-21T15:03:23+00:00

Move the ABI prefixes to the __SYSCALL_[abi]() macros. This allows removal of the need to strip the prefix for UML. Signed-off-by: Brian Gerst Signed-off-by: Thomas Gleixner Link: https://lkml.kernel.org/r/20200313195144.164260-13-brgerst@gmail.com