kernel/linux.git/fs/namespace.c, branch v7.1-rc5

mount: always duplicate mount

2026-04-14T07:30:15+00:00

In the OPEN_TREE_NAMESPACE path vfs_open_tree() resolves a path via filename_lookup() without holding namespace_lock. Between the lookup and create_new_namespace() acquiring namespace_lock via LOCK_MOUNT_EXACT_COPY() another thread can unmount the mount, setting mnt->mnt_ns to NULL. When create_new_namespace() then checks !mnt->mnt_ns it incorrectly takes the swap-and-mntget path that was designed for fsmount()'s detached mounts. This reuses a mount whose mnt_mp_list is in an inconsistent state from the concurrent unmount, causing a general protection fault in __umount_mnt() -> hlist_del_init(&mnt->mnt_mp_list) during namespace teardown. Remove the !mnt->mnt_ns special case entirely. Instead, always duplicate the mount: - For OPEN_TREE_NAMESPACE use __do_loopback() which will properly clone the mount or reject it via may_copy_tree() if it was unmounted in the race window. - For fsmount() use clone_mnt() directly (via the new MOUNT_COPY_NEW flag) since the mount is freshly created by vfs_create_mount() and not in any namespace so __do_loopback()'s IS_MNT_UNBINDABLE, may_copy_tree, and __has_locked_children checks don't apply. Reported-by: syzbot+e4470cc28308f2081ec8@syzkaller.appspotmail.com Signed-off-by: Christian Brauner

move_mount: allow MOVE_MOUNT_BENEATH on the rootfs

2026-03-12T12:34:59+00:00

Allow MOVE_MOUNT_BENEATH to target the caller's rootfs. When the target of a mount-beneath operation is the caller's root mount, verify that: (1) The caller is located at the root of the mount, as enforced by path_mounted() in do_lock_mount(). (2) Propagation from the parent mount would not overmount the target, to avoid propagating beneath the rootfs of other mount namespaces. The root-switching is decomposed into individually atomic, locally-scoped steps: mount-beneath inserts the new root under the old one, chroot(".") switches the caller's root, and umount2(".", MNT_DETACH) removes the old root. Since each step only modifies the caller's own state, this avoids cross-namespace vulnerabilities and inherent fork/unshare/setns races that a chroot_fs_refs()-based approach would have. Userspace can use the following workflow to switch roots: fd_tree = open_tree(-EBADF, "/newroot", OPEN_TREE_CLONE | OPEN_TREE_CLOEXEC); fchdir(fd_tree); move_mount(fd_tree, "", AT_FDCWD, "/", MOVE_MOUNT_BENEATH | MOVE_MOUNT_F_EMPTY_PATH); chroot("."); umount2(".", MNT_DETACH); Link: https://patch.msgid.link/20260224-work-mount-beneath-rootfs-v1-2-8c58bf08488f@kernel.org Signed-off-by: Christian Brauner

move_mount: transfer MNT_LOCKED

2026-03-12T12:34:58+00:00

When performing a mount-beneath operation the target mount can often be locked: unshare(CLONE_NEWUSER | CLONE_NEWNS); mount --beneath -t tmpfs tmpfs /proc will fail because the procfs mount on /proc became locked when the mount namespace was created from the parent mount namespace. Same logic for: unshare(CLONE_NEWUSER | CLONE_NEWNS); mount --beneath -t tmpfs tmpfs / MNT_LOCKED is raised to prevent an unprivileged mount namespace from revealing whatever is under a given mount. To replace the rootfs we need to handle that case though. We can simply transfer the locked mount property from the top mount to the mount beneath. The new mount we mounted beneath the top mount takes over the job of the top mount in protecting the parent mount from being revealed. This leaves us free to allow the top mount to be unmounted. This also works during mount propagation and also works for the non-MOVE_MOUNT_BENEATH case: (1) move_mount(MOVE_MOUNT_BENEATH): @source_mnt->overmount always NULL (2) move_mount(): @source_mnt->overmount maybe !NULL For (1) can_move_mount_beneath() rejects overmounted @source_mnt (We could allow this but whatever it's not really a use-case and it's fugly to move an overmounted mount stack around. What are you even doing? So let's keep that restriction. For (2) we can have @source_mnt overmounted (Someone overmounted us while we locked the target mount.). Both are fine. @source_mnt will be mounted on whatever @q was mounted on and @q will be mounted on the top of the @source_mnt mount stack. Even in such cases we can unlock @q and lock @source_mnt if @q was locked. This effectively makes mount propagation useful in cases where a mount namespace has a locked mount somewhere and we propagate a new mount beneath it but the mount namespace could never get at it because the old top mount remains locked. Again, we just let the newly propagated mount take over the protection and unlock the top mount. Link: https://patch.msgid.link/20260224-work-mount-beneath-rootfs-v1-1-8c58bf08488f@kernel.org Signed-off-by: Christian Brauner

namespace: allow creating empty mount namespaces

2026-03-12T12:33:55+00:00

Add support for creating a mount namespace that contains only a copy of the root mount from the caller's mount namespace, with none of the child mounts. This is useful for containers and sandboxes that want to start with a minimal mount table and populate it from scratch rather than inheriting and then tearing down the full mount tree. Two new flags are introduced: - CLONE_EMPTY_MNTNS for clone3(), using the 64-bit flag space. - UNSHARE_EMPTY_MNTNS for unshare(), reusing the CLONE_PARENT_SETTID bit which has no meaning for unshare. Both flags imply CLONE_NEWNS. For the unshare path, UNSHARE_EMPTY_MNTNS is converted to CLONE_EMPTY_MNTNS in unshare_nsproxy_namespaces() before it reaches copy_mnt_ns(), so the mount namespace code only needs to handle a single flag. In copy_mnt_ns(), when CLONE_EMPTY_MNTNS is set, clone_mnt() is used instead of copy_tree() to clone only the root mount. The caller's root and working directory are both reset to the root dentry of the new mount. The cleanup variables are changed from vfsmount pointers with __free(mntput) to struct path with __free(path_put) because the empty mount namespace path needs to release both mount and dentry references when replacing the caller's root and pwd. In the normal (non-empty) path only the mount component is set, and dput(NULL) is a no-op so path_put remains correct there as well. Link: https://patch.msgid.link/20260306-work-empty-mntns-consolidated-v1-1-6eb30529bbb0@kernel.org Signed-off-by: Christian Brauner

mount: add FSMOUNT_NAMESPACE

2026-03-12T12:33:54+00:00

Add FSMOUNT_NAMESPACE flag to fsmount() that creates a new mount namespace with the newly created filesystem attached to a copy of the real rootfs. This returns a namespace file descriptor instead of an O_PATH mount fd, similar to how OPEN_TREE_NAMESPACE works for open_tree(). This allows creating a new filesystem and immediately placing it in a new mount namespace in a single operation, which is useful for container runtimes and other namespace-based isolation mechanisms. The rootfs mount is created before copying the real rootfs for the new namespace meaning that the mount namespace id for the mount of the root of the namespace is bigger than the child mounted on top of it. We've never explicitly given the guarantee for such ordering and I doubt anyone relies on it. Accepting that lets us avoid copying the mount again and also avoids having to massage may_copy_tree() to grant an exception for fsmount->mnt->mnt_ns being NULL. Link: https://patch.msgid.link/20260122-work-fsmount-namespace-v1-3-5ef0a886e646@kernel.org Signed-off-by: Christian Brauner

mount: simplify __do_loopback()

2026-03-12T12:33:53+00:00

Remove the OPEN_TREE_NAMESPACE flag checking from __do_loopback() and instead have callers pass CL_COPY_MNT_NS_FILE directly in copy_flags. Link: https://patch.msgid.link/20260122-work-fsmount-namespace-v1-2-5ef0a886e646@kernel.org Signed-off-by: Christian Brauner

mount: start iterating from start of rbtree

2026-03-12T12:33:50+00:00

If the root of the namespace has an id that's greater than the child we'd not find it. Handle that case. Link: https://patch.msgid.link/20260122-work-fsmount-namespace-v1-1-5ef0a886e646@kernel.org Signed-off-by: Christian Brauner

Merge tag 'vfs-7.0-rc2.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

2026-02-25T18:34:23+00:00

Pull vfs fixes from Christian Brauner: - Fix an uninitialized variable in file_getattr(). The flags_valid field wasn't initialized before calling vfs_fileattr_get(), triggering KMSAN uninit-value reports in fuse - Fix writeback wakeup and logging timeouts when DETECT_HUNG_TASK is not enabled. sysctl_hung_task_timeout_secs is 0 in that case causing spurious "waiting for writeback completion for more than 1 seconds" warnings - Fix a null-ptr-deref in do_statmount() when the mount is internal - Add missing kernel-doc description for the @private parameter in iomap_readahead() - Fix mount namespace creation to hold namespace_sem across the mount copy in create_new_namespace(). The previous drop-and-reacquire pattern was fragile and failed to clean up mount propagation links if the real rootfs was a shared or dependent mount - Fix /proc mount iteration where m->index wasn't updated when m->show() overflows, causing a restart to repeatedly show the same mount entry in a rapidly expanding mount table - Return EFSCORRUPTED instead of ENOSPC in minix_new_inode() when the inode number is out of range - Fix unshare(2) when CLONE_NEWNS is set and current->fs isn't shared. copy_mnt_ns() received the live fs_struct so if a subsequent namespace creation failed the rollback would leave pwd and root pointing to detached mounts. Always allocate a new fs_struct when CLONE_NEWNS is requested - fserror bug fixes: - Remove the unused fsnotify_sb_error() helper now that all callers have been converted to fserror_report_metadata - Fix a lockdep splat in fserror_report() where igrab() takes inode::i_lock which can be held in IRQ context. Replace igrab() with a direct i_count bump since filesystems should not report inodes that are about to be freed or not yet exposed - Handle error pointer in procfs for try_lookup_noperm() - Fix an integer overflow in ep_loop_check_proc() where recursive calls returning INT_MAX would overflow when +1 is added, breaking the recursion depth check - Fix a misleading break in pidfs * tag 'vfs-7.0-rc2.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: pidfs: avoid misleading break eventpoll: Fix integer overflow in ep_loop_check_proc() proc: Fix pointer error dereference fserror: fix lockdep complaint when igrabbing inode fsnotify: drop unused helper unshare: fix unshare_fs() handling minix: Correct errno in minix_new_inode namespace: fix proc mount iteration mount: hold namespace_sem across copy in create_new_namespace() iomap: Describe @private in iomap_readahead() statmount: Fix the null-ptr-deref in do_statmount() writeback: Fix wakeup and logging timeouts for !DETECT_HUNG_TASK fs: init flags_valid before calling vfs_fileattr_get

Convert 'alloc_obj' family to use the new default GFP_KERNEL argument

2026-02-22T01:09:51+00:00

This was done entirely with mindless brute force, using git grep -l '\

treewide: Replace kmalloc with kmalloc_obj for non-scalar types

2026-02-21T09:02:28+00:00

This is the result of running the Coccinelle script from scripts/coccinelle/api/kmalloc_objs.cocci. The script is designed to avoid scalar types (which need careful case-by-case checking), and instead replace kmalloc-family calls that allocate struct or union object instances: Single allocations: kmalloc(sizeof(TYPE), ...) are replaced with: kmalloc_obj(TYPE, ...) Array allocations: kmalloc_array(COUNT, sizeof(TYPE), ...) are replaced with: kmalloc_objs(TYPE, COUNT, ...) Flex array allocations: kmalloc(struct_size(PTR, FAM, COUNT), ...) are replaced with: kmalloc_flex(*PTR, FAM, COUNT, ...) (where TYPE may also be *VAR) The resulting allocations no longer return "void *", instead returning "TYPE *". Signed-off-by: Kees Cook