kernel/linux.git/fs/eventpoll.c, branch v6.6.142

eventpoll: defer struct eventpoll free to RCU grace period

2026-04-27T13:23:27+00:00

[ Upstream commit 07712db80857d5d09ae08f3df85a708ecfc3b61f ] In certain situations, ep_free() in eventpoll.c will kfree the epi->ep eventpoll struct while it still being used by another concurrent thread. Defer the kfree() to an RCU callback to prevent UAF. Fixes: f2e467a48287 ("eventpoll: Fix semi-unbounded recursion") Signed-off-by: Nicholas Carlini Signed-off-by: Christian Brauner Signed-off-by: Sasha Levin

eventpoll: Fix integer overflow in ep_loop_check_proc()

2026-03-25T10:05:35+00:00

commit fdcfce93073d990ed4b71752e31ad1c1d6e9d58b upstream. If a recursive call to ep_loop_check_proc() hits the `result = INT_MAX`, an integer overflow will occur in the calling ep_loop_check_proc() at `result = max(result, ep_loop_check_proc(ep_tovisit, depth + 1) + 1)`, breaking the recursion depth check. Fix it by using a different placeholder value that can't lead to an overflow. Reported-by: Guenter Roeck Fixes: f2e467a48287 ("eventpoll: Fix semi-unbounded recursion") Cc: stable@vger.kernel.org Signed-off-by: Jann Horn Link: https://patch.msgid.link/20260223-epoll-int-overflow-v1-1-452f35132224@google.com Signed-off-by: Christian Brauner Signed-off-by: Greg Kroah-Hartman

eventpoll: Replace rwlock with spinlock

2025-10-23T14:16:27+00:00

[ Upstream commit 0c43094f8cc9d3d99d835c0ac9c4fe1ccc62babd ] The ready event list of an epoll object is protected by read-write semaphore: - The consumer (waiter) acquires the write lock and takes items. - the producer (waker) takes the read lock and adds items. The point of this design is enabling epoll to scale well with large number of producers, as multiple producers can hold the read lock at the same time. Unfortunately, this implementation may cause scheduling priority inversion problem. Suppose the consumer has higher scheduling priority than the producer. The consumer needs to acquire the write lock, but may be blocked by the producer holding the read lock. Since read-write semaphore does not support priority-boosting for the readers (even with CONFIG_PREEMPT_RT=y), we have a case of priority inversion: a higher priority consumer is blocked by a lower priority producer. This problem was reported in [1]. Furthermore, this could also cause stall problem, as described in [2]. Fix this problem by replacing rwlock with spinlock. This reduces the event bandwidth, as the producers now have to contend with each other for the spinlock. According to the benchmark from https://github.com/rouming/test-tools/blob/master/stress-epoll.c: On 12 x86 CPUs: Before After Diff threads events/ms events/ms 8 7162 4956 -31% 16 8733 5383 -38% 32 7968 5572 -30% 64 10652 5739 -46% 128 11236 5931 -47% On 4 riscv CPUs: Before After Diff threads events/ms events/ms 8 2958 2833 -4% 16 3323 3097 -7% 32 3451 3240 -6% 64 3554 3178 -11% 128 3601 3235 -10% Although the numbers look bad, it should be noted that this benchmark creates multiple threads who do nothing except constantly generating new epoll events, thus contention on the spinlock is high. For real workload, the event rate is likely much lower, and the performance drop is not as bad. Using another benchmark (perf bench epoll wait) where spinlock contention is lower, improvement is even observed on x86: On 12 x86 CPUs: Before: Averaged 110279 operations/sec (+- 1.09%), total secs = 8 After: Averaged 114577 operations/sec (+- 2.25%), total secs = 8 On 4 riscv CPUs: Before: Averaged 175767 operations/sec (+- 0.62%), total secs = 8 After: Averaged 167396 operations/sec (+- 0.23%), total secs = 8 In conclusion, no one is likely to be upset over this change. After all, spinlock was used originally for years, and the commit which converted to rwlock didn't mention a real workload, just that the benchmark numbers are nice. This patch is not exactly the revert of commit a218cc491420 ("epoll: use rwlock in order to reduce ep_poll_callback() contention"), because git revert conflicts in some places which are not obvious on the resolution. This patch is intended to be backported, therefore go with the obvious approach: - Replace rwlock_t with spinlock_t one to one - Delete list_add_tail_lockless() and chain_epi_lockless(). These were introduced to allow producers to concurrently add items to the list. But now that spinlock no longer allows producers to touch the event list concurrently, these two functions are not necessary anymore. Fixes: a218cc491420 ("epoll: use rwlock in order to reduce ep_poll_callback() contention") Signed-off-by: Nam Cao Link: https://lore.kernel.org/ec92458ea357ec503c737ead0f10b2c6e4c37d47.1752581388.git.namcao@linutronix.de Tested-by: K Prateek Nayak Cc: stable@vger.kernel.org Reported-by: Frederic Weisbecker Closes: https://lore.kernel.org/linux-rt-users/20210825132754.GA895675@lothringen/ [1] Reported-by: Valentin Schneider Closes: https://lore.kernel.org/linux-rt-users/xhsmhttqvnall.mognet@vschneid.remote.csb/ [2] Signed-off-by: Christian Brauner Signed-off-by: Sasha Levin Signed-off-by: Greg Kroah-Hartman

epoll: Remove ep_scan_ready_list() in comments

2025-10-23T14:16:27+00:00

[ Upstream commit e6f7958042a7b1dc9a4dfc19fca74217bc0c4865 ] Since commit 443f1a042233 ("lift the calls of ep_send_events_proc() into the callers"), ep_scan_ready_list() has been removed. But there are still several in comments. All of them should be replaced with other caller functions. Signed-off-by: Huang Xiaojia Link: https://lore.kernel.org/r/20240206014353.4191262-1-huangxiaojia2@huawei.com Reviewed-by: Jan Kara Signed-off-by: Christian Brauner Stable-dep-of: 0c43094f8cc9 ("eventpoll: Replace rwlock with spinlock") Signed-off-by: Sasha Levin Signed-off-by: Greg Kroah-Hartman

eventpoll: Fix semi-unbounded recursion

2025-08-28T14:28:12+00:00

commit f2e467a48287c868818085aa35389a224d226732 upstream. Ensure that epoll instances can never form a graph deeper than EP_MAX_NESTS+1 links. Currently, ep_loop_check_proc() ensures that the graph is loop-free and does some recursion depth checks, but those recursion depth checks don't limit the depth of the resulting tree for two reasons: - They don't look upwards in the tree. - If there are multiple downwards paths of different lengths, only one of the paths is actually considered for the depth check since commit 28d82dc1c4ed ("epoll: limit paths"). Essentially, the current recursion depth check in ep_loop_check_proc() just serves to prevent it from recursing too deeply while checking for loops. A more thorough check is done in reverse_path_check() after the new graph edge has already been created; this checks, among other things, that no paths going upwards from any non-epoll file with a length of more than 5 edges exist. However, this check does not apply to non-epoll files. As a result, it is possible to recurse to a depth of at least roughly 500, tested on v6.15. (I am unsure if deeper recursion is possible; and this may have changed with commit 8c44dac8add7 ("eventpoll: Fix priority inversion problem").) To fix it: 1. In ep_loop_check_proc(), note the subtree depth of each visited node, and use subtree depths for the total depth calculation even when a subtree has already been visited. 2. Add ep_get_upwards_depth_proc() for similarly determining the maximum depth of an upwards walk. 3. In ep_loop_check(), use these values to limit the total path length between epoll nodes to EP_MAX_NESTS edges. Fixes: 22bacca48a17 ("epoll: prevent creating circular epoll structures") Cc: stable@vger.kernel.org Signed-off-by: Jann Horn Link: https://lore.kernel.org/20250711-epoll-recursion-fix-v1-1-fb2457c33292@google.com Signed-off-by: Christian Brauner Signed-off-by: Greg Kroah-Hartman

eventpoll: don't decrement ep refcount while still holding the ep mutex

2025-07-17T16:35:07+00:00

commit 8c2e52ebbe885c7eeaabd3b7ddcdc1246fc400d2 upstream. Jann Horn points out that epoll is decrementing the ep refcount and then doing a mutex_unlock(&ep->mtx); afterwards. That's very wrong, because it can lead to a use-after-free. That pattern is actually fine for the very last reference, because the code in question will delay the actual call to "ep_free(ep)" until after it has unlocked the mutex. But it's wrong for the much subtler "next to last" case when somebody *else* may also be dropping their reference and free the ep while we're still using the mutex. Note that this is true even if that other user is also using the same ep mutex: mutexes, unlike spinlocks, can not be used for object ownership, even if they guarantee mutual exclusion. A mutex "unlock" operation is not atomic, and as one user is still accessing the mutex as part of unlocking it, another user can come in and get the now released mutex and free the data structure while the first user is still cleaning up. See our mutex documentation in Documentation/locking/mutex-design.rst, in particular the section [1] about semantics: "mutex_unlock() may access the mutex structure even after it has internally released the lock already - so it's not safe for another context to acquire the mutex and assume that the mutex_unlock() context is not using the structure anymore" So if we drop our ep ref before the mutex unlock, but we weren't the last one, we may then unlock the mutex, another user comes in, drops _their_ reference and releases the 'ep' as it now has no users - all while the mutex_unlock() is still accessing it. Fix this by simply moving the ep refcount dropping to outside the mutex: the refcount itself is atomic, and doesn't need mutex protection (that's the whole _point_ of refcounts: unlike mutexes, they are inherently about object lifetimes). Reported-by: Jann Horn Link: https://docs.kernel.org/locking/mutex-design.html#semantics [1] Cc: Alexander Viro Cc: Christian Brauner Cc: Jan Kara Signed-off-by: Linus Torvalds Signed-off-by: Greg Kroah-Hartman

epoll: Add synchronous wakeup support for ep_poll_callback

2024-12-27T12:58:57+00:00

commit 900bbaae67e980945dec74d36f8afe0de7556d5a upstream. Now, the epoll only use wake_up() interface to wake up task. However, sometimes, there are epoll users which want to use the synchronous wakeup flag to hint the scheduler, such as Android binder driver. So add a wake_up_sync() define, and use the wake_up_sync() when the sync is true in ep_poll_callback(). Co-developed-by: Jing Xia Signed-off-by: Jing Xia Signed-off-by: Xuewen Yan Link: https://lore.kernel.org/r/20240426080548.8203-1-xuewen.yan@unisoc.com Tested-by: Brian Geffon Reviewed-by: Brian Geffon Reported-by: Benoit Lize Signed-off-by: Christian Brauner Cc: Brian Geffon Signed-off-by: Greg Kroah-Hartman

epoll: annotate racy check

2024-12-14T18:59:58+00:00

[ Upstream commit 6474353a5e3d0b2cf610153cea0c61f576a36d0a ] Epoll relies on a racy fastpath check during __fput() in eventpoll_release() to avoid the hit of pointlessly acquiring a semaphore. Annotate that race by using WRITE_ONCE() and READ_ONCE(). Link: https://lore.kernel.org/r/66edfb3c.050a0220.3195df.001a.GAE@google.com Link: https://lore.kernel.org/r/20240925-fungieren-anbauen-79b334b00542@brauner Reviewed-by: Jan Kara Reported-by: syzbot+3b6b32dc50537a49bb4a@syzkaller.appspotmail.com Signed-off-by: Christian Brauner Signed-off-by: Sasha Levin

epoll: be better about file lifetimes

2024-06-12T09:11:30+00:00

[ Upstream commit 4efaa5acf0a1d2b5947f98abb3acf8bfd966422b ] epoll can call out to vfs_poll() with a file pointer that may race with the last 'fput()'. That would make f_count go down to zero, and while the ep->mtx locking means that the resulting file pointer tear-down will be blocked until the poll returns, it means that f_count is already dead, and any use of it won't actually get a reference to the file any more: it's dead regardless. Make sure we have a valid ref on the file pointer before we call down to vfs_poll() from the epoll routines. Link: https://lore.kernel.org/lkml/0000000000002d631f0615918f1e@google.com/ Reported-by: syzbot+045b454ab35fd82a35fb@syzkaller.appspotmail.com Reviewed-by: Jens Axboe Signed-off-by: Linus Torvalds Signed-off-by: Sasha Levin

epoll: simplify ep_alloc()

2023-07-26T12:56:07+00:00

The get_current_user() does not fail, and moving it after kzalloc() can simplify the code a bit. Signed-off-by: Zhen Lei Message-Id: <20230726032135.933-1-thunder.leizhen@huaweicloud.com> Signed-off-by: Christian Brauner