From a9194f88782afa1386641451a6c76beaa60485a0 Mon Sep 17 00:00:00 2001 From: Christian Brauner Date: Fri, 16 May 2025 13:25:31 +0200 Subject: coredump: add coredump socket Coredumping currently supports two modes: (1) Dumping directly into a file somewhere on the filesystem. (2) Dumping into a pipe connected to a usermode helper process spawned as a child of the system_unbound_wq or kthreadd. For simplicity I'm mostly ignoring (1). There's probably still some users of (1) out there but processing coredumps in this way can be considered adventurous especially in the face of set*id binaries. The most common option should be (2) by now. It works by allowing userspace to put a string into /proc/sys/kernel/core_pattern like: |/usr/lib/systemd/systemd-coredump %P %u %g %s %t %c %h The "|" at the beginning indicates to the kernel that a pipe must be used. The path following the pipe indicator is a path to a binary that will be spawned as a usermode helper process. Any additional parameters pass information about the task that is generating the coredump to the binary that processes the coredump. In the example core_pattern shown above systemd-coredump is spawned as a usermode helper. There's various conceptual consequences of this (non-exhaustive list): - systemd-coredump is spawned with file descriptor number 0 (stdin) connected to the read-end of the pipe. All other file descriptors are closed. That specifically includes 1 (stdout) and 2 (stderr). This has already caused bugs because userspace assumed that this cannot happen (Whether or not this is a sane assumption is irrelevant.). - systemd-coredump will be spawned as a child of system_unbound_wq. So it is not a child of any userspace process and specifically not a child of PID 1. It cannot be waited upon and is in a weird hybrid upcall which are difficult for userspace to control correctly. - systemd-coredump is spawned with full kernel privileges. This necessitates all kinds of weird privilege dropping excercises in userspace to make this safe. - A new usermode helper has to be spawned for each crashing process. This series adds a new mode: (3) Dumping into an AF_UNIX socket. Userspace can set /proc/sys/kernel/core_pattern to: @/path/to/coredump.socket The "@" at the beginning indicates to the kernel that an AF_UNIX coredump socket will be used to process coredumps. The coredump socket must be located in the initial mount namespace. When a task coredumps it opens a client socket in the initial network namespace and connects to the coredump socket. - The coredump server uses SO_PEERPIDFD to get a stable handle on the connected crashing task. The retrieved pidfd will provide a stable reference even if the crashing task gets SIGKILLed while generating the coredump. - By setting core_pipe_limit non-zero userspace can guarantee that the crashing task cannot be reaped behind it's back and thus process all necessary information in /proc/. The SO_PEERPIDFD can be used to detect whether /proc/ still refers to the same process. The core_pipe_limit isn't used to rate-limit connections to the socket. This can simply be done via AF_UNIX sockets directly. - The pidfd for the crashing task will grow new information how the task coredumps. - The coredump server should mark itself as non-dumpable. - A container coredump server in a separate network namespace can simply bind to another well-know address and systemd-coredump fowards coredumps to the container. - Coredumps could in the future also be handled via per-user/session coredump servers that run only with that users privileges. The coredump server listens on the coredump socket and accepts a new coredump connection. It then retrieves SO_PEERPIDFD for the client, inspects uid/gid and hands the accepted client to the users own coredump handler which runs with the users privileges only (It must of coure pay close attention to not forward crashing suid binaries.). The new coredump socket will allow userspace to not have to rely on usermode helpers for processing coredumps and provides a safer way to handle them instead of relying on super privileged coredumping helpers that have and continue to cause significant CVEs. This will also be significantly more lightweight since no fork()+exec() for the usermodehelper is required for each crashing process. The coredump server in userspace can e.g., just keep a worker pool. Link: https://lore.kernel.org/20250516-work-coredump-socket-v8-4-664f3caf2516@kernel.org Acked-by: Luca Boccassi Reviewed-by: Kuniyuki Iwashima Reviewed-by: Alexander Mikhalitsyn Reviewed-by: Jann Horn Signed-off-by: Christian Brauner --- include/linux/net.h | 1 + 1 file changed, 1 insertion(+) (limited to 'include') diff --git a/include/linux/net.h b/include/linux/net.h index 0ff950eecc6b..139c85d0f2ea 100644 --- a/include/linux/net.h +++ b/include/linux/net.h @@ -81,6 +81,7 @@ enum sock_type { #ifndef SOCK_NONBLOCK #define SOCK_NONBLOCK O_NONBLOCK #endif +#define SOCK_COREDUMP O_NOCTTY #endif /* ARCH_HAS_SOCKET_TYPES */ -- cgit v1.2.3 From 1d8db6fd698de1f73b1a7d72aea578fdd18d9a87 Mon Sep 17 00:00:00 2001 From: Christian Brauner Date: Fri, 16 May 2025 13:25:32 +0200 Subject: pidfs, coredump: add PIDFD_INFO_COREDUMP Extend the PIDFD_INFO_COREDUMP ioctl() with the new PIDFD_INFO_COREDUMP mask flag. This adds the @coredump_mask field to struct pidfd_info. When a task coredumps the kernel will provide the following information to userspace in @coredump_mask: * PIDFD_COREDUMPED is raised if the task did actually coredump. * PIDFD_COREDUMP_SKIP is raised if the task skipped coredumping (e.g., undumpable). * PIDFD_COREDUMP_USER is raised if this is a regular coredump and doesn't need special care by the coredump server. * PIDFD_COREDUMP_ROOT is raised if the generated coredump should be treated as sensitive and the coredump server should restrict to the generated coredump to sufficiently privileged users. The kernel guarantees that by the time the connection is made the all PIDFD_INFO_COREDUMP info is available. Link: https://lore.kernel.org/20250516-work-coredump-socket-v8-5-664f3caf2516@kernel.org Acked-by: Luca Boccassi Reviewed-by: Alexander Mikhalitsyn Reviewed-by: Jann Horn Signed-off-by: Christian Brauner --- fs/coredump.c | 31 ++++++++++++++++++++++++++ fs/pidfs.c | 55 ++++++++++++++++++++++++++++++++++++++++++++++ include/linux/pidfs.h | 5 +++++ include/uapi/linux/pidfd.h | 16 ++++++++++++++ 4 files changed, 107 insertions(+) (limited to 'include') diff --git a/fs/coredump.c b/fs/coredump.c index af4aa57353ba..b0ee99992ea5 100644 --- a/fs/coredump.c +++ b/fs/coredump.c @@ -46,7 +46,9 @@ #include #include #include +#include #include +#include #include #include @@ -599,6 +601,8 @@ static int umh_coredump_setup(struct subprocess_info *info, struct cred *new) if (IS_ERR(pidfs_file)) return PTR_ERR(pidfs_file); + pidfs_coredump(cp); + /* * Usermode helpers are childen of either * system_unbound_wq or of kthreadd. So we know that @@ -875,8 +879,31 @@ void do_coredump(const kernel_siginfo_t *siginfo) if (IS_ERR(file)) goto close_fail; + /* + * Set the thread-group leader pid which is used for the + * peer credentials during connect() below. Then + * immediately register it in pidfs... + */ + cprm.pid = task_tgid(current); + retval = pidfs_register_pid(cprm.pid); + if (retval) + goto close_fail; + + /* + * ... and set the coredump information so userspace + * has it available after connect()... + */ + pidfs_coredump(&cprm); + retval = kernel_connect(socket, (struct sockaddr *)(&addr), addr_len, O_NONBLOCK | SOCK_COREDUMP); + + /* + * ... Make sure to only put our reference after connect() took + * its own reference keeping the pidfs entry alive ... + */ + pidfs_put_pid(cprm.pid); + if (retval) { if (retval == -EAGAIN) coredump_report_failure("Coredump socket %s receive queue full", addr.sun_path); @@ -885,6 +912,10 @@ void do_coredump(const kernel_siginfo_t *siginfo) goto close_fail; } + /* ... and validate that @sk_peer_pid matches @cprm.pid. */ + if (WARN_ON_ONCE(unix_peer(socket->sk)->sk_peer_pid != cprm.pid)) + goto close_fail; + cprm.limit = RLIM_INFINITY; cprm.file = no_free_ptr(file); #else diff --git a/fs/pidfs.c b/fs/pidfs.c index 3b39e471840b..c0069fd52dd4 100644 --- a/fs/pidfs.c +++ b/fs/pidfs.c @@ -20,6 +20,7 @@ #include #include #include +#include #include "internal.h" #include "mount.h" @@ -33,6 +34,7 @@ static struct kmem_cache *pidfs_cachep __ro_after_init; struct pidfs_exit_info { __u64 cgroupid; __s32 exit_code; + __u32 coredump_mask; }; struct pidfs_inode { @@ -240,6 +242,22 @@ static inline bool pid_in_current_pidns(const struct pid *pid) return false; } +static __u32 pidfs_coredump_mask(unsigned long mm_flags) +{ + switch (__get_dumpable(mm_flags)) { + case SUID_DUMP_USER: + return PIDFD_COREDUMP_USER; + case SUID_DUMP_ROOT: + return PIDFD_COREDUMP_ROOT; + case SUID_DUMP_DISABLE: + return PIDFD_COREDUMP_SKIP; + default: + WARN_ON_ONCE(true); + } + + return 0; +} + static long pidfd_info(struct file *file, unsigned int cmd, unsigned long arg) { struct pidfd_info __user *uinfo = (struct pidfd_info __user *)arg; @@ -280,6 +298,11 @@ static long pidfd_info(struct file *file, unsigned int cmd, unsigned long arg) } } + if (mask & PIDFD_INFO_COREDUMP) { + kinfo.mask |= PIDFD_INFO_COREDUMP; + kinfo.coredump_mask = READ_ONCE(pidfs_i(inode)->__pei.coredump_mask); + } + task = get_pid_task(pid, PIDTYPE_PID); if (!task) { /* @@ -296,6 +319,13 @@ static long pidfd_info(struct file *file, unsigned int cmd, unsigned long arg) if (!c) return -ESRCH; + if (!(kinfo.mask & PIDFD_INFO_COREDUMP)) { + task_lock(task); + if (task->mm) + kinfo.coredump_mask = pidfs_coredump_mask(task->mm->flags); + task_unlock(task); + } + /* Unconditionally return identifiers and credentials, the rest only on request */ user_ns = current_user_ns(); @@ -559,6 +589,31 @@ void pidfs_exit(struct task_struct *tsk) } } +#ifdef CONFIG_COREDUMP +void pidfs_coredump(const struct coredump_params *cprm) +{ + struct pid *pid = cprm->pid; + struct pidfs_exit_info *exit_info; + struct dentry *dentry; + struct inode *inode; + __u32 coredump_mask = 0; + + dentry = pid->stashed; + if (WARN_ON_ONCE(!dentry)) + return; + + inode = d_inode(dentry); + exit_info = &pidfs_i(inode)->__pei; + /* Note how we were coredumped. */ + coredump_mask = pidfs_coredump_mask(cprm->mm_flags); + /* Note that we actually did coredump. */ + coredump_mask |= PIDFD_COREDUMPED; + /* If coredumping is set to skip we should never end up here. */ + VFS_WARN_ON_ONCE(coredump_mask & PIDFD_COREDUMP_SKIP); + smp_store_release(&exit_info->coredump_mask, coredump_mask); +} +#endif + static struct vfsmount *pidfs_mnt __ro_after_init; /* diff --git a/include/linux/pidfs.h b/include/linux/pidfs.h index 2676890c4d0d..77e7db194914 100644 --- a/include/linux/pidfs.h +++ b/include/linux/pidfs.h @@ -2,11 +2,16 @@ #ifndef _LINUX_PID_FS_H #define _LINUX_PID_FS_H +struct coredump_params; + struct file *pidfs_alloc_file(struct pid *pid, unsigned int flags); void __init pidfs_init(void); void pidfs_add_pid(struct pid *pid); void pidfs_remove_pid(struct pid *pid); void pidfs_exit(struct task_struct *tsk); +#ifdef CONFIG_COREDUMP +void pidfs_coredump(const struct coredump_params *cprm); +#endif extern const struct dentry_operations pidfs_dentry_operations; int pidfs_register_pid(struct pid *pid); void pidfs_get_pid(struct pid *pid); diff --git a/include/uapi/linux/pidfd.h b/include/uapi/linux/pidfd.h index 8c1511edd0e9..c27a4e238e4b 100644 --- a/include/uapi/linux/pidfd.h +++ b/include/uapi/linux/pidfd.h @@ -25,9 +25,23 @@ #define PIDFD_INFO_CREDS (1UL << 1) /* Always returned, even if not requested */ #define PIDFD_INFO_CGROUPID (1UL << 2) /* Always returned if available, even if not requested */ #define PIDFD_INFO_EXIT (1UL << 3) /* Only returned if requested. */ +#define PIDFD_INFO_COREDUMP (1UL << 4) /* Only returned if requested. */ #define PIDFD_INFO_SIZE_VER0 64 /* sizeof first published struct */ +/* + * Values for @coredump_mask in pidfd_info. + * Only valid if PIDFD_INFO_COREDUMP is set in @mask. + * + * Note, the @PIDFD_COREDUMP_ROOT flag indicates that the generated + * coredump should be treated as sensitive and access should only be + * granted to privileged users. + */ +#define PIDFD_COREDUMPED (1U << 0) /* Did crash and... */ +#define PIDFD_COREDUMP_SKIP (1U << 1) /* coredumping generation was skipped. */ +#define PIDFD_COREDUMP_USER (1U << 2) /* coredump was done as the user. */ +#define PIDFD_COREDUMP_ROOT (1U << 3) /* coredump was done as root. */ + /* * The concept of process and threads in userland and the kernel is a confusing * one - within the kernel every thread is a 'task' with its own individual PID, @@ -92,6 +106,8 @@ struct pidfd_info { __u32 fsuid; __u32 fsgid; __s32 exit_code; + __u32 coredump_mask; + __u32 __spare1; }; #define PIDFS_IOCTL_MAGIC 0xFF -- cgit v1.2.3 From 4e83ae6ec87dddac070ba349d3b839589b1bb957 Mon Sep 17 00:00:00 2001 From: Christian Brauner Date: Fri, 23 May 2025 10:47:06 +0200 Subject: mips, net: ensure that SOCK_COREDUMP is defined For historical reasons mips has to override the socket enum values but the defines are all the same. So simply move the ARCH_HAS_SOCKET_TYPES scope. Fixes: a9194f88782a ("coredump: add coredump socket") Suggested-by: Arnd Bergmann Signed-off-by: Christian Brauner --- arch/mips/include/asm/socket.h | 9 --------- include/linux/net.h | 3 +-- 2 files changed, 1 insertion(+), 11 deletions(-) (limited to 'include') diff --git a/arch/mips/include/asm/socket.h b/arch/mips/include/asm/socket.h index 4724a563c5bf..43a09f0dd3ff 100644 --- a/arch/mips/include/asm/socket.h +++ b/arch/mips/include/asm/socket.h @@ -36,15 +36,6 @@ enum sock_type { SOCK_PACKET = 10, }; -#define SOCK_MAX (SOCK_PACKET + 1) -/* Mask which covers at least up to SOCK_MASK-1. The - * * remaining bits are used as flags. */ -#define SOCK_TYPE_MASK 0xf - -/* Flags for socket, socketpair, paccept */ -#define SOCK_CLOEXEC O_CLOEXEC -#define SOCK_NONBLOCK O_NONBLOCK - #define ARCH_HAS_SOCKET_TYPES 1 #endif /* _ASM_SOCKET_H */ diff --git a/include/linux/net.h b/include/linux/net.h index 139c85d0f2ea..f60fff91e1df 100644 --- a/include/linux/net.h +++ b/include/linux/net.h @@ -70,6 +70,7 @@ enum sock_type { SOCK_DCCP = 6, SOCK_PACKET = 10, }; +#endif /* ARCH_HAS_SOCKET_TYPES */ #define SOCK_MAX (SOCK_PACKET + 1) /* Mask which covers at least up to SOCK_MASK-1. The @@ -83,8 +84,6 @@ enum sock_type { #endif #define SOCK_COREDUMP O_NOCTTY -#endif /* ARCH_HAS_SOCKET_TYPES */ - /** * enum sock_shutdown_cmd - Shutdown types * @SHUT_RD: shutdown receptions -- cgit v1.2.3