kernel/linux.git/fs, branch v6.18.37

virtiofs: fix UAF on submount umount

2026-06-27T10:06:50+00:00

commit 06b41351779e9289e8785694ade9042ae85e41ea upstream. iput() called from fuse_release_end() can Oops if the super block has already been destroyed. Normally this is prevented by waiting for num_waiting to go down to zero before commencing with super block shutdown. This only works, however, for the last submount instance, as the wait counter is per connection, not per superblock. Revert to using synchronous release requests for the auto_submounts case, which is virtiofs only at this time. Reported-by: Aurélien Bombo Reported-by: Zhihao Cheng Cc: Greg Kurz Closes: https://github.com/kata-containers/kata-containers/issues/12589 Fixes: 26e5c67deb2e ("fuse: fix livelock in synchronous file put from fuseblk workers") Cc: stable@vger.kernel.org Reviewed-by: Greg Kurz Signed-off-by: Miklos Szeredi Signed-off-by: Greg Kroah-Hartman

ksmbd: reject non-VALID session in compound request branch

2026-06-27T10:06:50+00:00

commit 609ca17d869d04ba249e32cdcbf13c0b1c66f43c upstream. smb2_check_user_session() takes a shortcut for any operation that is not the first in a COMPOUND request: it reuses work->sess (the session bound by the first operation) and validates only the SessionId, then returns "valid". It never re-checks work->sess->state == SMB2_SESSION_VALID, and a SessionId of 0xFFFFFFFFFFFFFFFF (ULLONG_MAX, the MS-SMB2 related-operation value) skips even the id comparison. The standalone path (ksmbd_session_lookup_all() plus the SESSION_SETUP state machine) does enforce the VALID state; the compound branch bypasses all of it. A SESSION_SETUP carrying only an NTLM Type-1 (NtLmNegotiate) blob publishes a fresh SMB2_SESSION_IN_PROGRESS session whose sess->user is still NULL (->user is assigned later, by ntlm_authenticate()). Used as operation 1 of a COMPOUND with operation 2 = TREE_CONNECT (related, SessionId=ULLONG_MAX, \\host\IPC$), the tree-connect then runs on that IN_PROGRESS session and reaches ksmbd_ipc_tree_connect_request(), which dereferences user_name(sess->user) with sess->user == NULL (transport_ipc.c:687/701/704) -> remote NULL-pointer dereference and a kernel Oops that wedges the ksmbd worker for all clients. Reject any non-first compound operation that lands on a session which is not SMB2_SESSION_VALID, mirroring the validity the standalone lookup path enforces. SESSION_SETUP itself legitimately runs on an IN_PROGRESS session, but it is never carried as a non-first compound operation, so multi-leg authentication is unaffected by this check. Fixes: 5005bcb42191 ("ksmbd: validate session id and tree id in the compound request") Cc: stable@vger.kernel.org Signed-off-by: Gil Portnoy Acked-by: Namjae Jeon Signed-off-by: Steve French Signed-off-by: Greg Kroah-Hartman

mm: introduce VM_MAYBE_GUARD and make visible in /proc/$pid/smaps

2026-06-27T10:06:48+00:00

commit 5dba5cc2e0ffa76f2f6c8922a04469dc9602c396 upstream. Patch series "introduce VM_MAYBE_GUARD and make it sticky", v4. Currently, guard regions are not visible to users except through /proc/$pid/pagemap, with no explicit visibility at the VMA level. This makes the feature less useful, as it isn't entirely apparent which VMAs may have these entries present, especially when performing actions which walk through memory regions such as those performed by CRIU. This series addresses this issue by introducing the VM_MAYBE_GUARD flag which fulfils this role, updating the smaps logic to display an entry for these. The semantics of this flag are that a guard region MAY be present if set (we cannot be sure, as we can't efficiently track whether an MADV_GUARD_REMOVE finally removes all the guard regions in a VMA) - but if not set the VMA definitely does NOT have any guard regions present. It's problematic to establish this flag without further action, because that means that VMAs with guard regions in them become non-mergeable with adjacent VMAs for no especially good reason. To work around this, this series also introduces the concept of 'sticky' VMA flags - that is flags which: a. if set in one VMA and not in another still permit those VMAs to be merged (if otherwise compatible). b. When they are merged, the resultant VMA must have the flag set. The VMA logic is updated to propagate these flags correctly. Additionally, VM_MAYBE_GUARD being an explicit VMA flag allows us to solve an issue with file-backed guard regions - previously these established an anon_vma object for file-backed mappings solely to have vma_needs_copy() correctly propagate guard region mappings to child processes. We introduce a new flag alias VM_COPY_ON_FORK (which currently only specifies VM_MAYBE_GUARD) and update vma_needs_copy() to check explicitly for this flag and to copy page tables if it is present, which resolves this issue. Additionally, we add the ability for allow-listed VMA flags to be atomically writable with only mmap/VMA read locks held. The only flag we allow so far is VM_MAYBE_GUARD, which we carefully ensure does not cause any races by being allowed to do so. This allows us to maintain guard region installation as a read-locked operation and not endure the overhead of obtaining a write lock here. Finally we introduce extensive VMA userland tests to assert that the sticky VMA logic behaves correctly as well as guard region self tests to assert that smaps visibility is correctly implemented. This patch (of 9): Currently, if a user needs to determine if guard regions are present in a range, they have to scan all VMAs (or have knowledge of which ones might have guard regions). Since commit 8e2f2aeb8b48 ("fs/proc/task_mmu: add guard region bit to pagemap") and the related commit a516403787e0 ("fs/proc: extend the PAGEMAP_SCAN ioctl to report guard regions"), users can use either /proc/$pid/pagemap or the PAGEMAP_SCAN functionality to perform this operation at a virtual address level. This is not ideal, and it gives no visibility at a /proc/$pid/smaps level that guard regions exist in ranges. This patch remedies the situation by establishing a new VMA flag, VM_MAYBE_GUARD, to indicate that a VMA may contain guard regions (it is uncertain because we cannot reasonably determine whether a MADV_GUARD_REMOVE call has removed all of the guard regions in a VMA, and additionally VMAs may change across merge/split). We utilise 0x800 for this flag which makes it available to 32-bit architectures also, a flag that was previously used by VM_DENYWRITE, which was removed in commit 8d0920bde5eb ("mm: remove VM_DENYWRITE") and hasn't bee reused yet. We also update the smaps logic and documentation to identify these VMAs. Another major use of this functionality is that we can use it to identify that we ought to copy page tables on fork. We do not actually implement usage of this flag in mm/madvise.c yet as we need to allow some VMA flags to be applied atomically under mmap/VMA read lock in order to avoid the need to acquire a write lock for this purpose. Link: https://lkml.kernel.org/r/cover.1763460113.git.ljs@kernel.org Link: https://lkml.kernel.org/r/cf8ef821eba29b6c5b5e138fffe95d6dcabdedb9.1763460113.git.ljs@kernel.org Signed-off-by: Lorenzo Stoakes Reviewed-by: Pedro Falcato Reviewed-by: Vlastimil Babka Acked-by: David Hildenbrand (Red Hat) Reviewed-by: Lance Yang Cc: Andrei Vagin Cc: Baolin Wang Cc: Barry Song Cc: Dev Jain Cc: Jann Horn Cc: Jonathan Corbet Cc: Liam Howlett Cc: "Masami Hiramatsu (Google)" Cc: Mathieu Desnoyers Cc: Michal Hocko Cc: Mike Rapoport Cc: Nico Pache Cc: Ryan Roberts Cc: Steven Rostedt Cc: Suren Baghdasaryan Cc: Zi Yan Signed-off-by: Andrew Morton Signed-off-by: Ahmed Elaidy Signed-off-by: Greg Kroah-Hartman

Revert "NFSD: Defer sub-object cleanup in export put callbacks"

2026-06-27T10:06:46+00:00

commit 516403d4d85607fdef3ca41d4a56b54e5566fa9a upstream. This reverts commit 48db892356d6cb80f6942885545de4a6dd8d2a29. Commit 48db892356d6 ("NFSD: Defer sub-object cleanup in export put callbacks") moved path_put() and auth_domain_put() out of svc_export_put() and expkey_put() and behind queue_rcu_work() to close a claimed use-after-free in e_show() and c_show() against ex_path and ex_client->name. Discussion in [1] shows neither the diagnosis nor the remedy survives review. The downstream teardown of both sub-objects is already RCU-deferred. auth_domain_put() reaches svcauth_unix_domain_release(), which frees the unix_domain and its ->name through call_rcu(). path_put() reaches dentry_free(), which frees the dentry through call_rcu(), and prepend_path() is already structured to tolerate concurrent dentry teardown. A reader in cache_seq_start_rcu() therefore observes both sub-objects through the next grace period regardless of whether svc_export_put() runs synchronously, so the synchronous form was never unsafe. The crash signature in the report cited by commit 48db892356d6 ("NFSD: Defer sub-object cleanup in export put callbacks") has a different root cause: a /proc/net/rpc cache file held open across network-namespace exit lets cache_destroy_net() free cd->hash_table while a reader is still walking it. The correct fix pins cd->net for the open fd's lifetime and does not require any deferral inside svc_export_put(). Meanwhile, deferring path_put() out of svc_export_put() reintroduces the regression that commit 69d803c40ede ("nfsd: Revert "nfsd: release svc_expkey/svc_export with rcu_work"") repaired: after "exportfs -r" drops the last cache reference, the mount reference held through ex_path lingers in the workqueue, so a subsequent umount fails with EBUSY. Restore the synchronous path_put() and auth_domain_put() in svc_export_put() and expkey_put() and the call_rcu()/kfree_rcu() free of the containing structures. The unrelated fix for ex_uuid/ex_stats from commit 2530766492ec ("nfsd: fix UAF when access ex_uuid or ex_stats") is preserved. Link: https://lore.kernel.org/all/10019b42-4589-4f9f-8d5b-d8197db1ce3c@huawei.com/ [1] Fixes: 48db892356d6 ("NFSD: Defer sub-object cleanup in export put callbacks") Cc: stable@vger.kernel.org Reviewed-by: Jeff Layton Tested-by: Alexandr Alexandrov Signed-off-by: Yang Erkun Signed-off-by: Chuck Lever Signed-off-by: Greg Kroah-Hartman

fuse: re-lock request before replacing page cache folio

2026-06-27T10:06:46+00:00

commit a078484921052d0badd827fcc2770b5cfc1d4120 upstream. fuse_try_move_folio() unlocks the request on entry but does not re-lock it on the success path. This means fuse_chan_abort() can end the request and free the fuse_io_args (eg fuse_readpages_end()) while the subsequent copy chain logic after fuse_try_move_folio() accesses the fuse_io_args, leading to use-after-free issues. Fix this by calling lock_request() before replace_page_cache_folio(). This ensures the request is locked on the success path which will prevent the fuse_io_args from being freed while the later copying logic runs, and also ensures that the ap->folios[i]->mapping is never null since ap->folios[i] will always point to the newfolio after replace_page_cache_folio(). Fixes: ce534fb05292 ("fuse: allow splice to move pages") Cc: stable@vger.kernel.org Reported-by: Lei Lu Signed-off-by: Joanne Koong Signed-off-by: Miklos Szeredi Signed-off-by: Greg Kroah-Hartman

fs/fcntl: fix SOFTIRQ-unsafe lock order in fasync signaling

2026-06-19T11:44:14+00:00

commit 00633c4683828acd5256fa8d5163f440d74bbe71 upstream. A SOFTIRQ-safe to SOFTIRQ-unsafe lock order deadlock can occur in send_sigio() and send_sigurg() when a process group receives a signal. When FASYNC is configured for a process group (PIDTYPE_PGID), both functions use read_lock(&tasklist_lock) to traverse the task list. However, they are frequently called from softirq context: - send_sigio() via input_inject_event -> kill_fasync - send_sigurg() via tcp_check_urg -> sk_send_sigurg (NET_RX_SOFTIRQ) The deadlock is caused by the rwlock writer fairness mechanism: 1. CPU 0 (process context) holds read_lock(&tasklist_lock) in do_wait(). 2. CPU 1 (process context) attempts write_lock(&tasklist_lock) in fork() or exit() and spins, which blocks all new readers. 3. CPU 0 is interrupted by a softirq (e.g., TCP URG packet reception). 4. The softirq calls send_sigurg() and attempts to acquire read_lock(&tasklist_lock), deadlocking because CPU 1 is waiting. Since PID hashing and do_each_pid_task() traversals are already RCU-protected, the read_lock on tasklist_lock is no longer strictly required for safe traversal. Fix this by replacing tasklist_lock with rcu_read_lock(), aligning the process group signaling path with the single-PID path. This also mitigates a potential remote denial of service vector via TCP URG packets. Lockdep splat: ===================================================== WARNING: SOFTIRQ-safe -> SOFTIRQ-unsafe lock order detected [...] Chain exists of: &dev->event_lock --> &f_owner->lock --> tasklist_lock Possible interrupt unsafe locking scenario: CPU0 CPU1 ---- ---- lock(tasklist_lock); local_irq_disable(); lock(&dev->event_lock); lock(&f_owner->lock); lock(&dev->event_lock); *** DEADLOCK *** Reviewed-by: Jeff Layton Signed-off-by: Mingyu Wang <25181214217@stu.xidian.edu.cn> Link: https://patch.msgid.link/20260523135210.590928-1-w15303746062@163.com Signed-off-by: Christian Brauner (Amutable) Signed-off-by: Greg Kroah-Hartman

fuse: limit FUSE_NOTIFY_RETRIEVE to uptodate folios

2026-06-19T11:44:07+00:00

commit 4e3d1b2c48ca6c55f1e9ca7f8dccc76f120f276c upstream. FUSE_NOTIFY_RETRIEVE must be limited to uptodate folios; !uptodate folios can contain uninitialized data. Since FUSE_NOTIFY_RETRIEVE is intended to only return data that is already in the page cache and not wait for data from the FUSE daemon, treat !uptodate folios as if they weren't present. This only has security impact on systems that don't enable automatic zero-initialization of all page allocations via CONFIG_INIT_ON_ALLOC_DEFAULT_ON or init_on_alloc=1. Cc: stable@kernel.org Fixes: 2d45ba381a74 ("fuse: add retrieve request") Signed-off-by: Jann Horn Link: https://patch.msgid.link/20260519-fuse-retrieve-uptodate-v1-1-a7a1912a37f9@google.com Acked-by: Miklos Szeredi Signed-off-by: Christian Brauner (Amutable) Signed-off-by: Greg Kroah-Hartman

fuse: reject fuse_notify() pagecache ops on directories

2026-06-19T11:44:07+00:00

commit 9c954499d43aefac01c5dfb57a82b13d2dcf4b94 upstream. The operations FUSE_NOTIFY_STORE and FUSE_NOTIFY_RETRIEVE allow the FUSE daemon to actively write/read pagecache contents. For directories with FOPEN_CACHE_DIR, the pagecache is used as kernel-internal cache storage, and userspace is not supposed to have direct access to this cache - in particular, fuse_parse_cache() will hit WARN_ON() if the cache contains bogus data. Reject FUSE_NOTIFY_STORE and FUSE_NOTIFY_RETRIEVE on anything other than regular files with -EINVAL. Fixes: 5d7bc7e8680c ("fuse: allow using readdir cache") Cc: stable@vger.kernel.org Signed-off-by: Jann Horn Link: https://patch.msgid.link/20260519-fuse-dir-pagecache-v2-1-5428fa48e175@google.com Acked-by: Miklos Szeredi Signed-off-by: Christian Brauner (Amutable) Signed-off-by: Greg Kroah-Hartman

fs/qnx6: fix pointer arithmetic in directory iteration

2026-06-19T11:44:07+00:00

commit 89c4a1167f3a0a0efd2ec3e1801036d2eb65ae1a upstream. The conversion to qnx6_get_folio() in commit b2aa61556fcf ("qnx6: Convert qnx6_get_page() to qnx6_get_folio()") introduced a regression in directory iteration. The pointer 'de' and the 'limit' address were calculated using byte offsets from a char pointer without scaling by the size of a QNX6 directory entry. This causes the driver to read from incorrect memory offsets, leading to "invalid direntry size" errors and premature termination of directory scans. Fix this by casting 'kaddr' to 'struct qnx6_dir_entry *' before applying the offset and last_entry(...) increments. This allows the compiler to correctly scale the pointer arithmetic by the 32-byte stride of the directory entry structure. Fixes: b2aa61556fcf ("qnx6: Convert qnx6_get_page() to qnx6_get_folio()") Cc: stable@vger.kernel.org Signed-off-by: Arpith Kalaginanavoor Link: https://patch.msgid.link/20260526123858.1683035-1-arpithk@nvidia.com Signed-off-by: Christian Brauner (Amutable) Signed-off-by: Greg Kroah-Hartman

fhandle: fix UAF due to unlocked ->mnt_ns read in may_decode_fh()

2026-06-19T11:44:07+00:00

commit 40ab6644b99685755f740b872c00ef40d9aa870e upstream. may_decode_fh() accesses mount::mnt_ns without holding any locks; that means the mount can concurrently be unmounted, and the mnt_namespace can concurrently be freed after an RCU grace period. This race can happens as follows, assuming that the mount point was created by open_tree(..., OPEN_TREE_CLONE): thread 1 thread 2 RCU __do_sys_open_by_handle_at do_handle_open handle_to_path may_decode_fh is_mounted [mount::mnt_ns access] [mount::mnt_ns access] __do_sys_close fput_close_sync __fput dissolve_on_fput umount_tree class_namespace_excl_destructor namespace_unlock free_mnt_ns mnt_ns_tree_remove call_rcu(mnt_ns_release_rcu) mnt_ns_release_rcu mnt_ns_release kfree [mnt_namespace::user_ns access] **UAF** Fix it by taking rcu_read_lock() around the mount::mnt_ns access, like in __prepend_path(). Additionally, document the semantics of mount::mnt_ns, and use WRITE_ONCE() for writers that can race with lockless readers. This bug is unreachable unless one of the following is set: - CONFIG_PREEMPTION - CONFIG_RCU_STRICT_GRACE_PERIOD because it requires an RCU grace period to happen during a syscall without an explicit preemption. This doesn't seem to have interesting security impact; worst-case, it could leak the result of an integer comparison to userspace (from the level check in cap_capable()), cause an endless loop, or crash the kernel by dereferencing an invalid address. Fixes: 620c266f3949 ("fhandle: relax open_by_handle_at() permission checks") Cc: stable@vger.kernel.org Signed-off-by: Jann Horn Link: https://patch.msgid.link/20260603-vfs-fhandle-uaf-fix-v2-1-d05db76a5084@google.com Signed-off-by: Christian Brauner (Amutable) Signed-off-by: Greg Kroah-Hartman