summaryrefslogtreecommitdiff
AgeCommit message (Collapse)AuthorFilesLines
2025-05-21tools/nolibc: fall back to sys_clock_gettime() in gettimeofday()Thomas Weißschuh1-1/+14
Newer architectures (like riscv32) do not implement sys_gettimeofday(). In those cases fall back to sys_clock_gettime(). While that does not support the timezone argument of sys_gettimeofday(), specifying this argument invokes undefined behaviour, so it's safe to ignore. Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de> Acked-by: Willy Tarreau <w@1wt.eu> Link: https://lore.kernel.org/r/20250428-nolibc-misc-v2-14-3c043eeab06c@linutronix.de Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
2025-05-21tools/nolibc: add fopen()Thomas Weißschuh2-0/+51
This is used in various selftests and will be handy when integrating those with nolibc. Only the standard POSIX modes are supported. No extensions nor the (noop) "b" from ISO C are accepted. Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de> Acked-by: Willy Tarreau <w@1wt.eu> Link: https://lore.kernel.org/r/20250428-nolibc-misc-v2-13-3c043eeab06c@linutronix.de Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
2025-05-21tools/nolibc: add namespace functionalityThomas Weißschuh5-0/+121
This is used in various selftests and will be handy when integrating those with nolibc. Not all configurations support namespaces, so skip the tests where necessary. Also if the tests are running without privileges. Enable the namespace configuration for those architectures where it is not enabled by default. Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de> Acked-by: Willy Tarreau <w@1wt.eu> Link: https://lore.kernel.org/r/20250428-nolibc-misc-v2-12-3c043eeab06c@linutronix.de Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
2025-05-21tools/nolibc: add difftime()Thomas Weißschuh2-0/+19
This is used in various selftests and will be handy when integrating those with nolibc. Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de> Acked-by: Willy Tarreau <w@1wt.eu> Link: https://lore.kernel.org/r/20250428-nolibc-misc-v2-11-3c043eeab06c@linutronix.de Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
2025-05-21tools/nolibc: add timerfd functionalityThomas Weißschuh4-0/+137
This is used in various selftests and will be handy when integrating those with nolibc. Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de> Acked-by: Willy Tarreau <w@1wt.eu> Link: https://lore.kernel.org/r/20250428-nolibc-misc-v2-10-3c043eeab06c@linutronix.de Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
2025-05-21tools/nolibc: add timer functionsThomas Weißschuh3-0/+138
This is used in various selftests and will be handy when integrating those with nolibc. Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de> Acked-by: Willy Tarreau <w@1wt.eu> Link: https://lore.kernel.org/r/20250428-nolibc-misc-v2-9-3c043eeab06c@linutronix.de Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
2025-05-21tools/nolibc: add clock_getres(), clock_gettime() and clock_settime()Thomas Weißschuh3-0/+99
This is used in various selftests and will be handy when integrating those with nolibc. Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de> Acked-by: Willy Tarreau <w@1wt.eu> Link: https://lore.kernel.org/r/20250428-nolibc-misc-v2-8-3c043eeab06c@linutronix.de Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
2025-05-21tools/nolibc: add support for access() and faccessat()Thomas Weißschuh2-0/+30
This is used in various selftests and will be handy when integrating those with nolibc. Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de> Acked-by: Willy Tarreau <w@1wt.eu> Link: https://lore.kernel.org/r/20250428-nolibc-misc-v2-7-3c043eeab06c@linutronix.de Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
2025-05-21tools/nolibc: add abs() and friendsThomas Weißschuh5-0/+53
This is used in various selftests and will be handy when integrating those with nolibc. Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de> Acked-by: Willy Tarreau <w@1wt.eu> Link: https://lore.kernel.org/r/20250428-nolibc-misc-v2-6-3c043eeab06c@linutronix.de Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
2025-05-21tools/nolibc: add getrandom()Thomas Weißschuh4-0/+58
This is used in various selftests and will be handy when integrating those with nolibc. Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de> Acked-by: Willy Tarreau <w@1wt.eu> Link: https://lore.kernel.org/r/20250428-nolibc-misc-v2-5-3c043eeab06c@linutronix.de Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
2025-05-21tools/nolibc: add mremap()Thomas Weißschuh2-3/+30
This is used in various selftests and will be handy when integrating those with nolibc. Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de> Acked-by: Willy Tarreau <w@1wt.eu> Link: https://lore.kernel.org/r/20250428-nolibc-misc-v2-4-3c043eeab06c@linutronix.de Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
2025-05-21tools/nolibc: add more stat() variantsThomas Weißschuh1-2/+23
Add fstat(), fstatat() and lstat(). All of them use the existing implementation based on statx(). Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de> Acked-by: Willy Tarreau <w@1wt.eu> Link: https://lore.kernel.org/r/20250428-nolibc-misc-v2-3-3c043eeab06c@linutronix.de Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
2025-05-21tools/nolibc: add %m printf formatThomas Weißschuh2-0/+25
The %m format can be used to format the current errno. It is non-standard but supported by other commonly used libcs like glibc and musl, so applications do rely on them. Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de> Acked-by: Willy Tarreau <w@1wt.eu> Link: https://lore.kernel.org/r/20250428-nolibc-misc-v2-2-3c043eeab06c@linutronix.de Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
2025-05-21tools/nolibc: add strstr()Thomas Weißschuh2-0/+23
This is used in various selftests and will be handy when integrating those with nolibc. Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de> Acked-by: Willy Tarreau <w@1wt.eu> Link: https://lore.kernel.org/r/20250428-nolibc-misc-v2-1-3c043eeab06c@linutronix.de Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
2025-05-21tools/nolibc: use poll-related definitions from UAPI headersThomas Weißschuh2-15/+1
The UAPI headers already provide definitions for these symbols. Using them makes the code shorter, more robust and compatible with applications using linux/poll.h directly. Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de> Acked-by: Willy Tarreau <w@1wt.eu> Link: https://lore.kernel.org/r/20250430-poll-v1-2-44b5ceabdeee@linutronix.de Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
2025-05-21tools/nolibc: move poll() to poll.hThomas Weißschuh4-37/+57
This is the location regular userspace expects the definition. Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de> Acked-by: Willy Tarreau <w@1wt.eu> Link: https://lore.kernel.org/r/20250430-poll-v1-1-44b5ceabdeee@linutronix.de Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
2025-05-21tools/nolibc: Add m68k supportDaniel Palmer4-0/+153
Add nolibc support for m68k. Should be helpful for nommu where linking libc can bloat even hello world to the point where you get an OOM just trying to load it. Signed-off-by: Daniel Palmer <daniel@thingy.jp> Link: https://lore.kernel.org/r/20250426224738.284874-1-daniel@0x0f.com Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
2025-05-21selftests/nolibc: always run nolibc header checkThomas Weißschuh1-1/+1
Prevent regressions of issues validates by the header check by always running it together with the nolibc selftests. Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de> Acked-by: Willy Tarreau <w@1wt.eu> Link: https://lore.kernel.org/r/20250424-nolibc-header-check-v1-3-011576b6ed6f@linutronix.de
2025-05-21tools/nolibc: include nolibc.h early from all header filesThomas Weißschuh20-63/+60
Inclusion of any nolibc header file should also bring all other headers. On the other hand it should also be possible to include any nolibc header files in any order. Currently this is implemented by including the catch-all nolibc.h after the headers own definitions. This is problematic if one nolibc header depends on another one. The first header has to include the other one before defining any symbols. That in turn will include the rest of nolibc while the current header has not defined anything yet. If any other part of nolibc depends on definitions from the current header, errors are encountered. This is already the case today. Effectively nolibc can only be included in the order of nolibc.h. Restructure the way "nolibc.h" is included. Move it to the beginning of the header files and before the include guards. Now any header will behave exactly like "nolibc.h" while the include guards prevent any duplicate definitions. Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de> Acked-by: Willy Tarreau <w@1wt.eu> Link: https://lore.kernel.org/r/20250424-nolibc-header-check-v1-2-011576b6ed6f@linutronix.de
2025-05-21tools/nolibc: add target to check header usabilityThomas Weißschuh1-0/+9
Each nolibc header should be valid for inclusion irrespective of any special ordering requirements. Add a new make target, based on the old kbuild "make header_check" target to validate this requirement. For now the check fails, but the following commits will fix the issues. Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de> Acked-by: Willy Tarreau <w@1wt.eu> Link: https://lore.kernel.org/r/20250424-nolibc-header-check-v1-1-011576b6ed6f@linutronix.de
2025-05-21io_uring: fix overflow resched cqe reorderingPavel Begunkov1-0/+1
Leaving the CQ critical section in the middle of a overflow flushing can cause cqe reordering since the cache cq pointers are reset and any new cqe emitters that might get called in between are not going to be forced into io_cqe_cache_refill(). Fixes: eac2ca2d682f9 ("io_uring: check if we need to reschedule during overflow flush") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/90ba817f1a458f091f355f407de1c911d2b93bbf.1747483784.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-21nvme: avoid creating multipath sysfs group under namespace path devicesNilay Shroff1-0/+28
Commit 4dbd2b2ebe4c ("nvme-multipath: Add visibility for round-robin io-policy") introduced the creation of the multipath sysfs group under the NVMe head gendisk device node. However, it also inadvertently added the same sysfs group under each namespace path device which head node refers to and that is incorrect. The multipath sysfs group should only be exposed through the namespace head gendisk node. This is sufficient, as the head device already provides symbolic links to the individual namespace paths it manages. This patch fixes the issue by preventing the creation of the multipath sysfs group under namespace path devices, ensuring it only appears under the head disk node. Fixes: 4dbd2b2ebe4c ("nvme-multipath: Add visibility for round-robin io-policy") Signed-off-by: Nilay Shroff <nilay@linux.ibm.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2025-05-21Merge patch series "coredump: add coredump socket"Christian Brauner8-106/+916
Christian Brauner <brauner@kernel.org> says: Coredumping currently supports two modes: (1) Dumping directly into a file somewhere on the filesystem. (2) Dumping into a pipe connected to a usermode helper process spawned as a child of the system_unbound_wq or kthreadd. For simplicity I'm mostly ignoring (1). There's probably still some users of (1) out there but processing coredumps in this way can be considered adventurous especially in the face of set*id binaries. The most common option should be (2) by now. It works by allowing userspace to put a string into /proc/sys/kernel/core_pattern like: |/usr/lib/systemd/systemd-coredump %P %u %g %s %t %c %h The "|" at the beginning indicates to the kernel that a pipe must be used. The path following the pipe indicator is a path to a binary that will be spawned as a usermode helper process. Any additional parameters pass information about the task that is generating the coredump to the binary that processes the coredump. In the example core_pattern shown above systemd-coredump is spawned as a usermode helper. There's various conceptual consequences of this (non-exhaustive list): - systemd-coredump is spawned with file descriptor number 0 (stdin) connected to the read-end of the pipe. All other file descriptors are closed. That specifically includes 1 (stdout) and 2 (stderr). This has already caused bugs because userspace assumed that this cannot happen (Whether or not this is a sane assumption is irrelevant.). - systemd-coredump will be spawned as a child of system_unbound_wq. So it is not a child of any userspace process and specifically not a child of PID 1. It cannot be waited upon and is in a weird hybrid upcall which are difficult for userspace to control correctly. - systemd-coredump is spawned with full kernel privileges. This necessitates all kinds of weird privilege dropping excercises in userspace to make this safe. - A new usermode helper has to be spawned for each crashing process. This series adds a new mode: (3) Dumping into an AF_UNIX socket. Userspace can set /proc/sys/kernel/core_pattern to: @/path/to/coredump.socket The "@" at the beginning indicates to the kernel that an AF_UNIX coredump socket will be used to process coredumps. The coredump socket must be located in the initial mount namespace. When a task coredumps it opens a client socket in the initial network namespace and connects to the coredump socket. - The coredump server should use SO_PEERPIDFD to get a stable handle on the connected crashing task. The retrieved pidfd will provide a stable reference even if the crashing task gets SIGKILLed while generating the coredump. - By setting core_pipe_limit non-zero userspace can guarantee that the crashing task cannot be reaped behind it's back and thus process all necessary information in /proc/<pid>. The SO_PEERPIDFD can be used to detect whether /proc/<pid> still refers to the same process. The core_pipe_limit isn't used to rate-limit connections to the socket. This can simply be done via AF_UNIX socket directly. - The pidfd for the crashing task will contain information how the task coredumps. The PIDFD_GET_INFO ioctl gained a new flag PIDFD_INFO_COREDUMP which can be used to retreive the coredump information. If the coredump gets a new coredump client connection the kernel guarantees that PIDFD_INFO_COREDUMP information is available. Currently the following information is provided in the new @coredump_mask extension to struct pidfd_info: * PIDFD_COREDUMPED is raised if the task did actually coredump. * PIDFD_COREDUMP_SKIP is raised if the task skipped coredumping (e.g., undumpable). * PIDFD_COREDUMP_USER is raised if this is a regular coredump and doesn't need special care by the coredump server. * PIDFD_COREDUMP_ROOT is raised if the generated coredump should be treated as sensitive and the coredump server should restrict access to the generated coredump to sufficiently privileged users. - The coredump server should mark itself as non-dumpable. - A container coredump server in a separate network namespace can simply bind to another well-know address and systemd-coredump fowards coredumps to the container. - Coredumps could in the future also be handled via per-user/session coredump servers that run only with that users privileges. The coredump server listens on the coredump socket and accepts a new coredump connection. It then retrieves SO_PEERPIDFD for the client, inspects uid/gid and hands the accepted client to the users own coredump handler which runs with the users privileges only (It must of coure pay close attention to not forward crashing suid binaries.). The new coredump socket will allow userspace to not have to rely on usermode helpers for processing coredumps and provides a safer way to handle them instead of relying on super privileged coredumping helpers. This will also be significantly more lightweight since no fork()+exec() for the usermodehelper is required for each crashing process. The coredump server in userspace can just keep a worker pool. * patches from https://lore.kernel.org/20250516-work-coredump-socket-v8-0-664f3caf2516@kernel.org: selftests/coredump: add tests for AF_UNIX coredumps selftests/pidfd: add PIDFD_INFO_COREDUMP infrastructure coredump: validate socket name as it is written coredump: show supported coredump modes pidfs, coredump: add PIDFD_INFO_COREDUMP coredump: add coredump socket coredump: reflow dump helpers a little coredump: massage do_coredump() coredump: massage format_corename() Link: https://lore.kernel.org/20250516-work-coredump-socket-v8-0-664f3caf2516@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-05-21selftests/coredump: add tests for AF_UNIX coredumpsChristian Brauner1-1/+466
Add a simple test for generating coredumps via AF_UNIX sockets. Link: https://lore.kernel.org/20250516-work-coredump-socket-v8-9-664f3caf2516@kernel.org Acked-by: Luca Boccassi <luca.boccassi@gmail.com> Reviewed-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-05-21selftests/pidfd: add PIDFD_INFO_COREDUMP infrastructureChristian Brauner1-0/+22
Add PIDFD_INFO_COREDUMP infrastructure so we can use it in tests. Link: https://lore.kernel.org/20250516-work-coredump-socket-v8-8-664f3caf2516@kernel.org Acked-by: Luca Boccassi <luca.boccassi@gmail.com> Reviewed-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-05-21coredump: validate socket name as it is writtenChristian Brauner1-3/+34
In contrast to other parameters written into /proc/sys/kernel/core_pattern that never fail we can validate enabling the new AF_UNIX support. This is obviously racy as hell but it's always been that way. Link: https://lore.kernel.org/20250516-work-coredump-socket-v8-7-664f3caf2516@kernel.org Acked-by: Luca Boccassi <luca.boccassi@gmail.com> Reviewed-by: Jann Horn <jannh@google.com> Reviewed-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-05-21coredump: show supported coredump modesChristian Brauner1-0/+13
Allow userspace to discover what coredump modes are supported. Link: https://lore.kernel.org/20250516-work-coredump-socket-v8-6-664f3caf2516@kernel.org Acked-by: Luca Boccassi <luca.boccassi@gmail.com> Reviewed-by: Jann Horn <jannh@google.com> Reviewed-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-05-21pidfs, coredump: add PIDFD_INFO_COREDUMPChristian Brauner4-0/+107
Extend the PIDFD_INFO_COREDUMP ioctl() with the new PIDFD_INFO_COREDUMP mask flag. This adds the @coredump_mask field to struct pidfd_info. When a task coredumps the kernel will provide the following information to userspace in @coredump_mask: * PIDFD_COREDUMPED is raised if the task did actually coredump. * PIDFD_COREDUMP_SKIP is raised if the task skipped coredumping (e.g., undumpable). * PIDFD_COREDUMP_USER is raised if this is a regular coredump and doesn't need special care by the coredump server. * PIDFD_COREDUMP_ROOT is raised if the generated coredump should be treated as sensitive and the coredump server should restrict to the generated coredump to sufficiently privileged users. The kernel guarantees that by the time the connection is made the all PIDFD_INFO_COREDUMP info is available. Link: https://lore.kernel.org/20250516-work-coredump-socket-v8-5-664f3caf2516@kernel.org Acked-by: Luca Boccassi <luca.boccassi@gmail.com> Reviewed-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com> Reviewed-by: Jann Horn <jannh@google.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-05-21coredump: add coredump socketChristian Brauner3-17/+177
Coredumping currently supports two modes: (1) Dumping directly into a file somewhere on the filesystem. (2) Dumping into a pipe connected to a usermode helper process spawned as a child of the system_unbound_wq or kthreadd. For simplicity I'm mostly ignoring (1). There's probably still some users of (1) out there but processing coredumps in this way can be considered adventurous especially in the face of set*id binaries. The most common option should be (2) by now. It works by allowing userspace to put a string into /proc/sys/kernel/core_pattern like: |/usr/lib/systemd/systemd-coredump %P %u %g %s %t %c %h The "|" at the beginning indicates to the kernel that a pipe must be used. The path following the pipe indicator is a path to a binary that will be spawned as a usermode helper process. Any additional parameters pass information about the task that is generating the coredump to the binary that processes the coredump. In the example core_pattern shown above systemd-coredump is spawned as a usermode helper. There's various conceptual consequences of this (non-exhaustive list): - systemd-coredump is spawned with file descriptor number 0 (stdin) connected to the read-end of the pipe. All other file descriptors are closed. That specifically includes 1 (stdout) and 2 (stderr). This has already caused bugs because userspace assumed that this cannot happen (Whether or not this is a sane assumption is irrelevant.). - systemd-coredump will be spawned as a child of system_unbound_wq. So it is not a child of any userspace process and specifically not a child of PID 1. It cannot be waited upon and is in a weird hybrid upcall which are difficult for userspace to control correctly. - systemd-coredump is spawned with full kernel privileges. This necessitates all kinds of weird privilege dropping excercises in userspace to make this safe. - A new usermode helper has to be spawned for each crashing process. This series adds a new mode: (3) Dumping into an AF_UNIX socket. Userspace can set /proc/sys/kernel/core_pattern to: @/path/to/coredump.socket The "@" at the beginning indicates to the kernel that an AF_UNIX coredump socket will be used to process coredumps. The coredump socket must be located in the initial mount namespace. When a task coredumps it opens a client socket in the initial network namespace and connects to the coredump socket. - The coredump server uses SO_PEERPIDFD to get a stable handle on the connected crashing task. The retrieved pidfd will provide a stable reference even if the crashing task gets SIGKILLed while generating the coredump. - By setting core_pipe_limit non-zero userspace can guarantee that the crashing task cannot be reaped behind it's back and thus process all necessary information in /proc/<pid>. The SO_PEERPIDFD can be used to detect whether /proc/<pid> still refers to the same process. The core_pipe_limit isn't used to rate-limit connections to the socket. This can simply be done via AF_UNIX sockets directly. - The pidfd for the crashing task will grow new information how the task coredumps. - The coredump server should mark itself as non-dumpable. - A container coredump server in a separate network namespace can simply bind to another well-know address and systemd-coredump fowards coredumps to the container. - Coredumps could in the future also be handled via per-user/session coredump servers that run only with that users privileges. The coredump server listens on the coredump socket and accepts a new coredump connection. It then retrieves SO_PEERPIDFD for the client, inspects uid/gid and hands the accepted client to the users own coredump handler which runs with the users privileges only (It must of coure pay close attention to not forward crashing suid binaries.). The new coredump socket will allow userspace to not have to rely on usermode helpers for processing coredumps and provides a safer way to handle them instead of relying on super privileged coredumping helpers that have and continue to cause significant CVEs. This will also be significantly more lightweight since no fork()+exec() for the usermodehelper is required for each crashing process. The coredump server in userspace can e.g., just keep a worker pool. Link: https://lore.kernel.org/20250516-work-coredump-socket-v8-4-664f3caf2516@kernel.org Acked-by: Luca Boccassi <luca.boccassi@gmail.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com> Reviewed-by: Jann Horn <jannh@google.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-05-21mips/perf: Remove driver-specific throttle supportKan Liang1-2/+1
The throttle support has been added in the generic code. Remove the driver-specific throttle support. Besides the throttle, perf_event_overflow may return true because of event_limit. It already does an inatomic event disable. The pmu->stop is not required either. Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20250520181644.2673067-17-kan.liang@linux.intel.com
2025-05-21xtensa/perf: Remove driver-specific throttle supportKan Liang1-2/+1
The throttle support has been added in the generic code. Remove the driver-specific throttle support. Besides the throttle, perf_event_overflow may return true because of event_limit. It already does an inatomic event disable. The pmu->stop is not required either. Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Max Filippov <jcmvbkbc@gmail.com> Link: https://lore.kernel.org/r/20250520181644.2673067-16-kan.liang@linux.intel.com
2025-05-21sparc/perf: Remove driver-specific throttle supportKan Liang1-2/+1
The throttle support has been added in the generic code. Remove the driver-specific throttle support. Besides the throttle, perf_event_overflow may return true because of event_limit. It already does an inatomic event disable. The pmu->stop is not required either. Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20250520181644.2673067-15-kan.liang@linux.intel.com
2025-05-21loongarch/perf: Remove driver-specific throttle supportKan Liang1-2/+1
The throttle support has been added in the generic code. Remove the driver-specific throttle support. Besides the throttle, perf_event_overflow may return true because of event_limit. It already does an inatomic event disable. The pmu->stop is not required either. Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20250520181644.2673067-14-kan.liang@linux.intel.com
2025-05-21csky/perf: Remove driver-specific throttle supportKan Liang1-2/+1
The throttle support has been added in the generic code. Remove the driver-specific throttle support. Besides the throttle, perf_event_overflow may return true because of event_limit. It already does an inatomic event disable. The pmu->stop is not required either. Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Guo Ren <guoren@kernel.org> Link: https://lore.kernel.org/r/20250520181644.2673067-13-kan.liang@linux.intel.com
2025-05-21arc/perf: Remove driver-specific throttle supportKan Liang1-4/+2
The throttle support has been added in the generic code. Remove the driver-specific throttle support. Besides the throttle, perf_event_overflow may return true because of event_limit. It already does an inatomic event disable. The pmu->stop is not required either. Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Vineet Gupta <vgupta@kernel.org> Link: https://lore.kernel.org/r/20250520181644.2673067-12-kan.liang@linux.intel.com
2025-05-21alpha/perf: Remove driver-specific throttle supportKan Liang1-8/+3
The throttle support has been added in the generic code. Remove the driver-specific throttle support. Besides the throttle, perf_event_overflow may return true because of event_limit. It already does an inatomic event disable. The pmu->stop is not required either. Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20250520181644.2673067-11-kan.liang@linux.intel.com
2025-05-21perf/apple_m1: Remove driver-specific throttle supportKan Liang1-2/+1
The throttle support has been added in the generic code. Remove the driver-specific throttle support. Besides the throttle, perf_event_overflow may return true because of event_limit. It already does an inatomic event disable. The pmu->stop is not required either. Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20250520181644.2673067-10-kan.liang@linux.intel.com
2025-05-21perf/arm: Remove driver-specific throttle supportKan Liang4-10/+5
The throttle support has been added in the generic code. Remove the driver-specific throttle support. Besides the throttle, perf_event_overflow may return true because of event_limit. It already does an inatomic event disable. The pmu->stop is not required either. Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Leo Yan <leo.yan@arm.com> Link: https://lore.kernel.org/r/20250520181644.2673067-9-kan.liang@linux.intel.com
2025-05-21s390/perf: Remove driver-specific throttle supportKan Liang2-6/+1
The throttle support has been added in the generic code. Remove the driver-specific throttle support. Besides the throttle, perf_event_overflow may return true because of event_limit. It already does an inatomic event disable. The pmu->stop is not required either. Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Thomas Richter <tmricht@linux.ibm.com> Link: https://lore.kernel.org/r/20250520181644.2673067-8-kan.liang@linux.intel.com
2025-05-21powerpc/perf: Remove driver-specific throttle supportKan Liang2-6/+3
The throttle support has been added in the generic code. Remove the driver-specific throttle support. Besides the throttle, perf_event_overflow may return true because of event_limit. It already does an inatomic event disable. The pmu->stop is not required either. Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20250520181644.2673067-7-kan.liang@linux.intel.com
2025-05-21perf/x86/zhaoxin: Remove driver-specific throttle supportKan Liang1-2/+1
The throttle support has been added in the generic code. Remove the driver-specific throttle support. Besides the throttle, perf_event_overflow may return true because of event_limit. It already does an inatomic event disable. The pmu->stop is not required either. Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20250520181644.2673067-6-kan.liang@linux.intel.com
2025-05-21perf/x86/amd: Remove driver-specific throttle supportKan Liang2-5/+2
The throttle support has been added in the generic code. Remove the driver-specific throttle support. Besides the throttle, perf_event_overflow may return true because of event_limit. It already does an inatomic event disable. The pmu->stop is not required either. Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Ravi Bangoria <ravi.bangoria@amd.com> Link: https://lore.kernel.org/r/20250520181644.2673067-5-kan.liang@linux.intel.com
2025-05-21perf/x86/intel: Remove driver-specific throttle supportKan Liang5-14/+8
The throttle support has been added in the generic code. Remove the driver-specific throttle support. Besides the throttle, perf_event_overflow may return true because of event_limit. It already does an inatomic event disable. The pmu->stop is not required either. Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20250520181644.2673067-4-kan.liang@linux.intel.com
2025-05-21perf: Only dump the throttle log for the leaderKan Liang1-2/+4
The PERF_RECORD_THROTTLE records are dumped for all throttled events. It's not necessary for group events, which are throttled altogether. Optimize it by only dump the throttle log for the leader. The sample right after the THROTTLE record must be generated by the actual target event. It is good enough for the perf tool to locate the actual target event. Suggested-by: Namhyung Kim <namhyung@kernel.org> Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Namhyung Kim <namhyung@kernel.org> Link: https://lore.kernel.org/r/20250520181644.2673067-3-kan.liang@linux.intel.com
2025-05-21perf: Fix the throttle logic for a groupKan Liang1-20/+46
The current throttle logic doesn't work well with a group, e.g., the following sampling-read case. $ perf record -e "{cycles,cycles}:S" ... $ perf report -D | grep THROTTLE | tail -2 THROTTLE events: 426 ( 9.0%) UNTHROTTLE events: 425 ( 9.0%) $ perf report -D | grep PERF_RECORD_SAMPLE -a4 | tail -n 5 0 1020120874009167 0x74970 [0x68]: PERF_RECORD_SAMPLE(IP, 0x1): ... sample_read: .... group nr 2 ..... id 0000000000000327, value 000000000cbb993a, lost 0 ..... id 0000000000000328, value 00000002211c26df, lost 0 The second cycles event has a much larger value than the first cycles event in the same group. The current throttle logic in the generic code only logs the THROTTLE event. It relies on the specific driver implementation to disable events. For all ARCHs, the implementation is similar. Only the event is disabled, rather than the group. The logic to disable the group should be generic for all ARCHs. Add the logic in the generic code. The following patch will remove the buggy driver-specific implementation. The throttle only happens when an event is overflowed. Stop the entire group when any event in the group triggers the throttle. The MAX_INTERRUPTS is set to all throttle events. The unthrottled could happen in 3 places. - event/group sched. All events in the group are scheduled one by one. All of them will be unthrottled eventually. Nothing needs to be changed. - The perf_adjust_freq_unthr_events for each tick. Needs to restart the group altogether. - The __perf_event_period(). The whole group needs to be restarted altogether as well. With the fix, $ sudo perf report -D | grep PERF_RECORD_SAMPLE -a4 | tail -n 5 0 3573470770332 0x12f5f8 [0x70]: PERF_RECORD_SAMPLE(IP, 0x2): ... sample_read: .... group nr 2 ..... id 0000000000000a28, value 00000004fd3dfd8f, lost 0 ..... id 0000000000000a29, value 00000004fd3dfd8f, lost 0 Suggested-by: "Peter Zijlstra (Intel)" <peterz@infradead.org> Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Namhyung Kim <namhyung@kernel.org> Link: https://lore.kernel.org/r/20250520181644.2673067-2-kan.liang@linux.intel.com
2025-05-21selftests/futex: Fix spelling mistake "unitiliazed" -> "uninitialized"Colin Ian King1-1/+1
There is a spelling mistake in a fail error message. Fix it. Signed-off-by: Colin Ian King <colin.i.king@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Link: https://lore.kernel.org/r/20250520080657.30726-1-colin.i.king@gmail.com
2025-05-21futex: Correct the kernedoc return value for futex_wait_setup().Sebastian Andrzej Siewior1-1/+2
The kerneldoc for futex_wait_setup() states it can return "0" or "<1". This isn't true because the error case is "<0" not less than 1. Document that <0 is returned on error. Drop the possible return values and state possible reasons. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: André Almeida <andrealmeid@igalia.com> Link: https://lore.kernel.org/r/20250517151455.1065363-6-bigeasy@linutronix.de
2025-05-21tools headers: Synchronize prctl.h ABI headerSebastian Andrzej Siewior3-11/+15
The prctl.h ABI header was slightly updated during the development of the interface. In particular the "immutable" parameter became a bit in the option argument. Synchronize prctl.h ABI header again and make use of the definition in the testsuite and "perf bench futex". Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: André Almeida <andrealmeid@igalia.com> Link: https://lore.kernel.org/r/20250517151455.1065363-5-bigeasy@linutronix.de
2025-05-21futex: Use RCU_INIT_POINTER() in futex_mm_init().Sebastian Andrzej Siewior1-8/+1
There is no need for an explicit NULL pointer initialisation plus a comment why it is okay. RCU_INIT_POINTER() can be used for NULL initialisations and it is documented. This has been build tested with gcc version 9.3.0 (Debian 9.3.0-22) on a x86-64 defconfig. Fixes: 094ac8cff7858 ("futex: Relax the rcu_assign_pointer() assignment of mm->futex_phash in futex_mm_init()") Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20250517151455.1065363-4-bigeasy@linutronix.de
2025-05-21selftests/futex: Use TAP output in futex_numa_mpolSebastian Andrzej Siewior1-33/+32
Use TAP output for easier automated testing. Suggested-by: André Almeida <andrealmeid@igalia.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: André Almeida <andrealmeid@igalia.com> Link: https://lore.kernel.org/r/20250517151455.1065363-3-bigeasy@linutronix.de