kernel/linux.git/fs/eventpoll.c, branch v7.0-rc7

Merge tag 'vfs-7.0-rc2.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

2026-02-25T18:34:23+00:00

Pull vfs fixes from Christian Brauner: - Fix an uninitialized variable in file_getattr(). The flags_valid field wasn't initialized before calling vfs_fileattr_get(), triggering KMSAN uninit-value reports in fuse - Fix writeback wakeup and logging timeouts when DETECT_HUNG_TASK is not enabled. sysctl_hung_task_timeout_secs is 0 in that case causing spurious "waiting for writeback completion for more than 1 seconds" warnings - Fix a null-ptr-deref in do_statmount() when the mount is internal - Add missing kernel-doc description for the @private parameter in iomap_readahead() - Fix mount namespace creation to hold namespace_sem across the mount copy in create_new_namespace(). The previous drop-and-reacquire pattern was fragile and failed to clean up mount propagation links if the real rootfs was a shared or dependent mount - Fix /proc mount iteration where m->index wasn't updated when m->show() overflows, causing a restart to repeatedly show the same mount entry in a rapidly expanding mount table - Return EFSCORRUPTED instead of ENOSPC in minix_new_inode() when the inode number is out of range - Fix unshare(2) when CLONE_NEWNS is set and current->fs isn't shared. copy_mnt_ns() received the live fs_struct so if a subsequent namespace creation failed the rollback would leave pwd and root pointing to detached mounts. Always allocate a new fs_struct when CLONE_NEWNS is requested - fserror bug fixes: - Remove the unused fsnotify_sb_error() helper now that all callers have been converted to fserror_report_metadata - Fix a lockdep splat in fserror_report() where igrab() takes inode::i_lock which can be held in IRQ context. Replace igrab() with a direct i_count bump since filesystems should not report inodes that are about to be freed or not yet exposed - Handle error pointer in procfs for try_lookup_noperm() - Fix an integer overflow in ep_loop_check_proc() where recursive calls returning INT_MAX would overflow when +1 is added, breaking the recursion depth check - Fix a misleading break in pidfs * tag 'vfs-7.0-rc2.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: pidfs: avoid misleading break eventpoll: Fix integer overflow in ep_loop_check_proc() proc: Fix pointer error dereference fserror: fix lockdep complaint when igrabbing inode fsnotify: drop unused helper unshare: fix unshare_fs() handling minix: Correct errno in minix_new_inode namespace: fix proc mount iteration mount: hold namespace_sem across copy in create_new_namespace() iomap: Describe @private in iomap_readahead() statmount: Fix the null-ptr-deref in do_statmount() writeback: Fix wakeup and logging timeouts for !DETECT_HUNG_TASK fs: init flags_valid before calling vfs_fileattr_get

eventpoll: Fix integer overflow in ep_loop_check_proc()

2026-02-24T09:21:30+00:00

If a recursive call to ep_loop_check_proc() hits the `result = INT_MAX`, an integer overflow will occur in the calling ep_loop_check_proc() at `result = max(result, ep_loop_check_proc(ep_tovisit, depth + 1) + 1)`, breaking the recursion depth check. Fix it by using a different placeholder value that can't lead to an overflow. Reported-by: Guenter Roeck Fixes: f2e467a48287 ("eventpoll: Fix semi-unbounded recursion") Cc: stable@vger.kernel.org Signed-off-by: Jann Horn Link: https://patch.msgid.link/20260223-epoll-int-overflow-v1-1-452f35132224@google.com Signed-off-by: Christian Brauner

Convert 'alloc_obj' family to use the new default GFP_KERNEL argument

2026-02-22T01:09:51+00:00

This was done entirely with mindless brute force, using git grep -l '\

treewide: Replace kmalloc with kmalloc_obj for non-scalar types

2026-02-21T09:02:28+00:00

This is the result of running the Coccinelle script from scripts/coccinelle/api/kmalloc_objs.cocci. The script is designed to avoid scalar types (which need careful case-by-case checking), and instead replace kmalloc-family calls that allocate struct or union object instances: Single allocations: kmalloc(sizeof(TYPE), ...) are replaced with: kmalloc_obj(TYPE, ...) Array allocations: kmalloc_array(COUNT, sizeof(TYPE), ...) are replaced with: kmalloc_objs(TYPE, COUNT, ...) Flex array allocations: kmalloc(struct_size(PTR, FAM, COUNT), ...) are replaced with: kmalloc_flex(*PTR, FAM, COUNT, ...) (where TYPE may also be *VAR) The resulting allocations no longer return "void *", instead returning "TYPE *". Signed-off-by: Kees Cook

eventpoll: convert do_epoll_create() to FD_PREPARE()

2025-11-28T11:42:32+00:00

Link: https://patch.msgid.link/20251123-work-fd-prepare-v4-12-b6efa1706cfd@kernel.org Signed-off-by: Christian Brauner

eventpoll: Replace rwlock with spinlock

2025-09-05T13:51:24+00:00

The ready event list of an epoll object is protected by read-write semaphore: - The consumer (waiter) acquires the write lock and takes items. - the producer (waker) takes the read lock and adds items. The point of this design is enabling epoll to scale well with large number of producers, as multiple producers can hold the read lock at the same time. Unfortunately, this implementation may cause scheduling priority inversion problem. Suppose the consumer has higher scheduling priority than the producer. The consumer needs to acquire the write lock, but may be blocked by the producer holding the read lock. Since read-write semaphore does not support priority-boosting for the readers (even with CONFIG_PREEMPT_RT=y), we have a case of priority inversion: a higher priority consumer is blocked by a lower priority producer. This problem was reported in [1]. Furthermore, this could also cause stall problem, as described in [2]. Fix this problem by replacing rwlock with spinlock. This reduces the event bandwidth, as the producers now have to contend with each other for the spinlock. According to the benchmark from https://github.com/rouming/test-tools/blob/master/stress-epoll.c: On 12 x86 CPUs: Before After Diff threads events/ms events/ms 8 7162 4956 -31% 16 8733 5383 -38% 32 7968 5572 -30% 64 10652 5739 -46% 128 11236 5931 -47% On 4 riscv CPUs: Before After Diff threads events/ms events/ms 8 2958 2833 -4% 16 3323 3097 -7% 32 3451 3240 -6% 64 3554 3178 -11% 128 3601 3235 -10% Although the numbers look bad, it should be noted that this benchmark creates multiple threads who do nothing except constantly generating new epoll events, thus contention on the spinlock is high. For real workload, the event rate is likely much lower, and the performance drop is not as bad. Using another benchmark (perf bench epoll wait) where spinlock contention is lower, improvement is even observed on x86: On 12 x86 CPUs: Before: Averaged 110279 operations/sec (+- 1.09%), total secs = 8 After: Averaged 114577 operations/sec (+- 2.25%), total secs = 8 On 4 riscv CPUs: Before: Averaged 175767 operations/sec (+- 0.62%), total secs = 8 After: Averaged 167396 operations/sec (+- 0.23%), total secs = 8 In conclusion, no one is likely to be upset over this change. After all, spinlock was used originally for years, and the commit which converted to rwlock didn't mention a real workload, just that the benchmark numbers are nice. This patch is not exactly the revert of commit a218cc491420 ("epoll: use rwlock in order to reduce ep_poll_callback() contention"), because git revert conflicts in some places which are not obvious on the resolution. This patch is intended to be backported, therefore go with the obvious approach: - Replace rwlock_t with spinlock_t one to one - Delete list_add_tail_lockless() and chain_epi_lockless(). These were introduced to allow producers to concurrently add items to the list. But now that spinlock no longer allows producers to touch the event list concurrently, these two functions are not necessary anymore. Fixes: a218cc491420 ("epoll: use rwlock in order to reduce ep_poll_callback() contention") Signed-off-by: Nam Cao Link: https://lore.kernel.org/ec92458ea357ec503c737ead0f10b2c6e4c37d47.1752581388.git.namcao@linutronix.de Tested-by: K Prateek Nayak Cc: stable@vger.kernel.org Reported-by: Frederic Weisbecker Closes: https://lore.kernel.org/linux-rt-users/20210825132754.GA895675@lothringen/ [1] Reported-by: Valentin Schneider Closes: https://lore.kernel.org/linux-rt-users/xhsmhttqvnall.mognet@vschneid.remote.csb/ [2] Signed-off-by: Christian Brauner

Merge tag 'vfs-6.17-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

2025-07-28T18:22:56+00:00

Pull misc VFS updates from Christian Brauner: "This contains the usual selections of misc updates for this cycle. Features: - Add ext4 IOCB_DONTCACHE support This refactors the address_space_operations write_begin() and write_end() callbacks to take const struct kiocb * as their first argument, allowing IOCB flags such as IOCB_DONTCACHE to propagate to the filesystem's buffered I/O path. Ext4 is updated to implement handling of the IOCB_DONTCACHE flag and advertises support via the FOP_DONTCACHE file operation flag. Additionally, the i915 driver's shmem write paths are updated to bypass the legacy write_begin/write_end interface in favor of directly calling write_iter() with a constructed synchronous kiocb. Another i915 change replaces a manual write loop with kernel_write() during GEM shmem object creation. Cleanups: - don't duplicate vfs_open() in kernel_file_open() - proc_fd_getattr(): don't bother with S_ISDIR() check - fs/ecryptfs: replace snprintf with sysfs_emit in show function - vfs: Remove unnecessary list_for_each_entry_safe() from evict_inodes() - filelock: add new locks_wake_up_waiter() helper - fs: Remove three arguments from block_write_end() - VFS: change old_dir and new_dir in struct renamedata to dentrys - netfs: Remove unused declaration netfs_queue_write_request() Fixes: - eventpoll: Fix semi-unbounded recursion - eventpoll: fix sphinx documentation build warning - fs/read_write: Fix spelling typo - fs: annotate data race between poll_schedule_timeout() and pollwake() - fs/pipe: set FMODE_NOWAIT in create_pipe_files() - docs/vfs: update references to i_mutex to i_rwsem - fs/buffer: remove comment about hard sectorsize - fs/buffer: remove the min and max limit checks in __getblk_slow() - fs/libfs: don't assume blocksize <= PAGE_SIZE in generic_check_addressable - fs_context: fix parameter name in infofc() macro - fs: Prevent file descriptor table allocations exceeding INT_MAX" * tag 'vfs-6.17-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (24 commits) netfs: Remove unused declaration netfs_queue_write_request() eventpoll: fix sphinx documentation build warning ext4: support uncached buffered I/O mm/pagemap: add write_begin_get_folio() helper function fs: change write_begin/write_end interface to take struct kiocb * drm/i915: Refactor shmem_pwrite() to use kiocb and write_iter drm/i915: Use kernel_write() in shmem object create eventpoll: Fix semi-unbounded recursion vfs: Remove unnecessary list_for_each_entry_safe() from evict_inodes() fs/libfs: don't assume blocksize <= PAGE_SIZE in generic_check_addressable fs/buffer: remove the min and max limit checks in __getblk_slow() fs: Prevent file descriptor table allocations exceeding INT_MAX fs: Remove three arguments from block_write_end() fs/ecryptfs: replace snprintf with sysfs_emit in show function fs: annotate suspected data race between poll_schedule_timeout() and pollwake() docs/vfs: update references to i_mutex to i_rwsem fs/buffer: remove comment about hard sectorsize fs_context: fix parameter name in infofc() macro VFS: change old_dir and new_dir in struct renamedata to dentrys proc_fd_getattr(): don't bother with S_ISDIR() check ...

eventpoll: fix sphinx documentation build warning

2025-07-23T13:00:13+00:00

Sphinx complains that ep_get_upwards_depth_proc() has a kerneldoc-style comment without documenting its parameters. This is an internal function that was not meant to show up in kernel documentation, so fix the warning by changing the comment to a non-kerneldoc one. Fixes: 22bacca48a17 ("epoll: prevent creating circular epoll structures") Reported-by: Stephen Rothwell Closes: https://lore.kernel.org/r/20250717173655.10ecdce6@canb.auug.org.au Reported-by: kernel test robot Closes: https://lore.kernel.org/oe-kbuild-all/202507171958.aMcW08Cn-lkp@intel.com/ Signed-off-by: Jann Horn Link: https://lore.kernel.org/20250721-epoll-sphinx-fix-v1-1-b695c92bf009@google.com Tested-by: Randy Dunlap Acked-by: Randy Dunlap Signed-off-by: Christian Brauner

eventpoll: Fix semi-unbounded recursion

2025-07-16T12:48:03+00:00

Ensure that epoll instances can never form a graph deeper than EP_MAX_NESTS+1 links. Currently, ep_loop_check_proc() ensures that the graph is loop-free and does some recursion depth checks, but those recursion depth checks don't limit the depth of the resulting tree for two reasons: - They don't look upwards in the tree. - If there are multiple downwards paths of different lengths, only one of the paths is actually considered for the depth check since commit 28d82dc1c4ed ("epoll: limit paths"). Essentially, the current recursion depth check in ep_loop_check_proc() just serves to prevent it from recursing too deeply while checking for loops. A more thorough check is done in reverse_path_check() after the new graph edge has already been created; this checks, among other things, that no paths going upwards from any non-epoll file with a length of more than 5 edges exist. However, this check does not apply to non-epoll files. As a result, it is possible to recurse to a depth of at least roughly 500, tested on v6.15. (I am unsure if deeper recursion is possible; and this may have changed with commit 8c44dac8add7 ("eventpoll: Fix priority inversion problem").) To fix it: 1. In ep_loop_check_proc(), note the subtree depth of each visited node, and use subtree depths for the total depth calculation even when a subtree has already been visited. 2. Add ep_get_upwards_depth_proc() for similarly determining the maximum depth of an upwards walk. 3. In ep_loop_check(), use these values to limit the total path length between epoll nodes to EP_MAX_NESTS edges. Fixes: 22bacca48a17 ("epoll: prevent creating circular epoll structures") Cc: stable@vger.kernel.org Signed-off-by: Jann Horn Link: https://lore.kernel.org/20250711-epoll-recursion-fix-v1-1-fb2457c33292@google.com Signed-off-by: Christian Brauner

Revert "eventpoll: Fix priority inversion problem"

2025-07-12T00:10:32+00:00

This reverts commit 8c44dac8add7503c345c0f6c7962e4863b88ba42. I haven't figured out what the actual bug in this commit is, but I did spend a lot of time chasing it down and eventually succeeded in bisecting it down to this. For some reason, this eventpoll commit ends up causing delays and stuck user space processes, but it only happens on one of my machines, and only during early boot or during the flurry of initial activity when logging in. I must be triggering some very subtle timing issue, but once I figured out the behavior pattern that made it reasonably reliable to trigger, it did bisect right to this, and reverting the commit fixes the problem. Of course, that was only after I had failed at bisecting it several times, and had flailed around blaming both the drm people and the netlink people for the odd problems. The most obvious of which happened at the time of the first graphical login (the most common symptom being that some gnome app aborted due to a 30s timeout, often leading to the whole session then failing if it was some critical component like gnome-shell or similar). Acked-by: Nam Cao Cc: Frederic Weisbecker Cc: Valentin Schneider Cc: Christian Brauner Signed-off-by: Linus Torvalds