<feed xmlns='http://www.w3.org/2005/Atom'>
<title>kernel/linux.git/fs/eventpoll.c, branch v6.1.168</title>
<subtitle>Linux kernel stable tree (mirror)</subtitle>
<id>https://git.radix-linux.su/kernel/linux.git/atom?h=v6.1.168</id>
<link rel='self' href='https://git.radix-linux.su/kernel/linux.git/atom?h=v6.1.168'/>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/'/>
<updated>2026-03-25T10:02:57+00:00</updated>
<entry>
<title>eventpoll: Fix integer overflow in ep_loop_check_proc()</title>
<updated>2026-03-25T10:02:57+00:00</updated>
<author>
<name>Jann Horn</name>
<email>jannh@google.com</email>
</author>
<published>2026-02-23T19:59:33+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=a1ff0c2e17d900264bccca6446e43cd952f437c6'/>
<id>urn:sha1:a1ff0c2e17d900264bccca6446e43cd952f437c6</id>
<content type='text'>
commit fdcfce93073d990ed4b71752e31ad1c1d6e9d58b upstream.

If a recursive call to ep_loop_check_proc() hits the `result = INT_MAX`,
an integer overflow will occur in the calling ep_loop_check_proc() at
`result = max(result, ep_loop_check_proc(ep_tovisit, depth + 1) + 1)`,
breaking the recursion depth check.

Fix it by using a different placeholder value that can't lead to an
overflow.

Reported-by: Guenter Roeck &lt;linux@roeck-us.net&gt;
Fixes: f2e467a48287 ("eventpoll: Fix semi-unbounded recursion")
Cc: stable@vger.kernel.org
Signed-off-by: Jann Horn &lt;jannh@google.com&gt;
Link: https://patch.msgid.link/20260223-epoll-int-overflow-v1-1-452f35132224@google.com
Signed-off-by: Christian Brauner &lt;brauner@kernel.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</content>
</entry>
<entry>
<title>eventpoll: Replace rwlock with spinlock</title>
<updated>2025-12-06T21:12:36+00:00</updated>
<author>
<name>Nam Cao</name>
<email>namcao@linutronix.de</email>
</author>
<published>2025-11-21T08:12:42+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=26494fa2c8b738405d8dedbd86cab4b8341e2451'/>
<id>urn:sha1:26494fa2c8b738405d8dedbd86cab4b8341e2451</id>
<content type='text'>
[ Upstream commit 0c43094f8cc9d3d99d835c0ac9c4fe1ccc62babd ]

The ready event list of an epoll object is protected by read-write
semaphore:

  - The consumer (waiter) acquires the write lock and takes items.
  - the producer (waker) takes the read lock and adds items.

The point of this design is enabling epoll to scale well with large number
of producers, as multiple producers can hold the read lock at the same
time.

Unfortunately, this implementation may cause scheduling priority inversion
problem. Suppose the consumer has higher scheduling priority than the
producer. The consumer needs to acquire the write lock, but may be blocked
by the producer holding the read lock. Since read-write semaphore does not
support priority-boosting for the readers (even with CONFIG_PREEMPT_RT=y),
we have a case of priority inversion: a higher priority consumer is blocked
by a lower priority producer. This problem was reported in [1].

Furthermore, this could also cause stall problem, as described in [2].

Fix this problem by replacing rwlock with spinlock.

This reduces the event bandwidth, as the producers now have to contend with
each other for the spinlock. According to the benchmark from
https://github.com/rouming/test-tools/blob/master/stress-epoll.c:

    On 12 x86 CPUs:
                  Before     After        Diff
        threads  events/ms  events/ms
              8       7162       4956     -31%
             16       8733       5383     -38%
             32       7968       5572     -30%
             64      10652       5739     -46%
            128      11236       5931     -47%

    On 4 riscv CPUs:
                  Before     After        Diff
        threads  events/ms  events/ms
              8       2958       2833      -4%
             16       3323       3097      -7%
             32       3451       3240      -6%
             64       3554       3178     -11%
            128       3601       3235     -10%

Although the numbers look bad, it should be noted that this benchmark
creates multiple threads who do nothing except constantly generating new
epoll events, thus contention on the spinlock is high. For real workload,
the event rate is likely much lower, and the performance drop is not as
bad.

Using another benchmark (perf bench epoll wait) where spinlock contention
is lower, improvement is even observed on x86:

    On 12 x86 CPUs:
        Before: Averaged 110279 operations/sec (+- 1.09%), total secs = 8
        After:  Averaged 114577 operations/sec (+- 2.25%), total secs = 8

    On 4 riscv CPUs:
        Before: Averaged 175767 operations/sec (+- 0.62%), total secs = 8
        After:  Averaged 167396 operations/sec (+- 0.23%), total secs = 8

In conclusion, no one is likely to be upset over this change. After all,
spinlock was used originally for years, and the commit which converted to
rwlock didn't mention a real workload, just that the benchmark numbers are
nice.

This patch is not exactly the revert of commit a218cc491420 ("epoll: use
rwlock in order to reduce ep_poll_callback() contention"), because git
revert conflicts in some places which are not obvious on the resolution.
This patch is intended to be backported, therefore go with the obvious
approach:

  - Replace rwlock_t with spinlock_t one to one

  - Delete list_add_tail_lockless() and chain_epi_lockless(). These were
    introduced to allow producers to concurrently add items to the list.
    But now that spinlock no longer allows producers to touch the event
    list concurrently, these two functions are not necessary anymore.

Fixes: a218cc491420 ("epoll: use rwlock in order to reduce ep_poll_callback() contention")
Signed-off-by: Nam Cao &lt;namcao@linutronix.de&gt;
Link: https://lore.kernel.org/ec92458ea357ec503c737ead0f10b2c6e4c37d47.1752581388.git.namcao@linutronix.de
Tested-by: K Prateek Nayak &lt;kprateek.nayak@amd.com&gt;
Cc: stable@vger.kernel.org
Reported-by: Frederic Weisbecker &lt;frederic@kernel.org&gt;
Closes: https://lore.kernel.org/linux-rt-users/20210825132754.GA895675@lothringen/ [1]
Reported-by: Valentin Schneider &lt;vschneid@redhat.com&gt;
Closes: https://lore.kernel.org/linux-rt-users/xhsmhttqvnall.mognet@vschneid.remote.csb/ [2]
Signed-off-by: Christian Brauner &lt;brauner@kernel.org&gt;
Signed-off-by: Florian Bezdeka &lt;florian.bezdeka@siemens.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</content>
</entry>
<entry>
<title>eventpoll: Fix semi-unbounded recursion</title>
<updated>2025-08-28T14:25:49+00:00</updated>
<author>
<name>Jann Horn</name>
<email>jannh@google.com</email>
</author>
<published>2025-07-11T16:33:36+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=1b13b033062824495554e836a1ff5f85ccf6b039'/>
<id>urn:sha1:1b13b033062824495554e836a1ff5f85ccf6b039</id>
<content type='text'>
commit f2e467a48287c868818085aa35389a224d226732 upstream.

Ensure that epoll instances can never form a graph deeper than
EP_MAX_NESTS+1 links.

Currently, ep_loop_check_proc() ensures that the graph is loop-free and
does some recursion depth checks, but those recursion depth checks don't
limit the depth of the resulting tree for two reasons:

 - They don't look upwards in the tree.
 - If there are multiple downwards paths of different lengths, only one of
   the paths is actually considered for the depth check since commit
   28d82dc1c4ed ("epoll: limit paths").

Essentially, the current recursion depth check in ep_loop_check_proc() just
serves to prevent it from recursing too deeply while checking for loops.

A more thorough check is done in reverse_path_check() after the new graph
edge has already been created; this checks, among other things, that no
paths going upwards from any non-epoll file with a length of more than 5
edges exist. However, this check does not apply to non-epoll files.

As a result, it is possible to recurse to a depth of at least roughly 500,
tested on v6.15. (I am unsure if deeper recursion is possible; and this may
have changed with commit 8c44dac8add7 ("eventpoll: Fix priority inversion
problem").)

To fix it:

1. In ep_loop_check_proc(), note the subtree depth of each visited node,
and use subtree depths for the total depth calculation even when a subtree
has already been visited.
2. Add ep_get_upwards_depth_proc() for similarly determining the maximum
depth of an upwards walk.
3. In ep_loop_check(), use these values to limit the total path length
between epoll nodes to EP_MAX_NESTS edges.

Fixes: 22bacca48a17 ("epoll: prevent creating circular epoll structures")
Cc: stable@vger.kernel.org
Signed-off-by: Jann Horn &lt;jannh@google.com&gt;
Link: https://lore.kernel.org/20250711-epoll-recursion-fix-v1-1-fb2457c33292@google.com
Signed-off-by: Christian Brauner &lt;brauner@kernel.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</content>
</entry>
<entry>
<title>epoll: Add synchronous wakeup support for ep_poll_callback</title>
<updated>2024-12-27T12:53:01+00:00</updated>
<author>
<name>Xuewen Yan</name>
<email>xuewen.yan@unisoc.com</email>
</author>
<published>2024-04-26T08:05:48+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=400a2ae671fa4a364bb40a2ff3ca468e2277b757'/>
<id>urn:sha1:400a2ae671fa4a364bb40a2ff3ca468e2277b757</id>
<content type='text'>
commit 900bbaae67e980945dec74d36f8afe0de7556d5a upstream.

Now, the epoll only use wake_up() interface to wake up task.
However, sometimes, there are epoll users which want to use
the synchronous wakeup flag to hint the scheduler, such as
Android binder driver.
So add a wake_up_sync() define, and use the wake_up_sync()
when the sync is true in ep_poll_callback().

Co-developed-by: Jing Xia &lt;jing.xia@unisoc.com&gt;
Signed-off-by: Jing Xia &lt;jing.xia@unisoc.com&gt;
Signed-off-by: Xuewen Yan &lt;xuewen.yan@unisoc.com&gt;
Link: https://lore.kernel.org/r/20240426080548.8203-1-xuewen.yan@unisoc.com
Tested-by: Brian Geffon &lt;bgeffon@google.com&gt;
Reviewed-by: Brian Geffon &lt;bgeffon@google.com&gt;
Reported-by: Benoit Lize &lt;lizeb@google.com&gt;
Signed-off-by: Christian Brauner &lt;brauner@kernel.org&gt;
Cc: Brian Geffon &lt;bgeffon@google.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</content>
</entry>
<entry>
<title>epoll: annotate racy check</title>
<updated>2024-12-14T18:54:37+00:00</updated>
<author>
<name>Christian Brauner</name>
<email>brauner@kernel.org</email>
</author>
<published>2024-09-25T09:05:16+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=670fc6d14f1076152553c2805437fd11860562df'/>
<id>urn:sha1:670fc6d14f1076152553c2805437fd11860562df</id>
<content type='text'>
[ Upstream commit 6474353a5e3d0b2cf610153cea0c61f576a36d0a ]

Epoll relies on a racy fastpath check during __fput() in
eventpoll_release() to avoid the hit of pointlessly acquiring a
semaphore. Annotate that race by using WRITE_ONCE() and READ_ONCE().

Link: https://lore.kernel.org/r/66edfb3c.050a0220.3195df.001a.GAE@google.com
Link: https://lore.kernel.org/r/20240925-fungieren-anbauen-79b334b00542@brauner
Reviewed-by: Jan Kara &lt;jack@suse.cz&gt;
Reported-by: syzbot+3b6b32dc50537a49bb4a@syzkaller.appspotmail.com
Signed-off-by: Christian Brauner &lt;brauner@kernel.org&gt;
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</content>
</entry>
<entry>
<title>epoll: be better about file lifetimes</title>
<updated>2024-06-12T09:03:03+00:00</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2024-05-03T20:36:09+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=559214eb4e5c3d05e69428af2fae2691ba1eb784'/>
<id>urn:sha1:559214eb4e5c3d05e69428af2fae2691ba1eb784</id>
<content type='text'>
[ Upstream commit 4efaa5acf0a1d2b5947f98abb3acf8bfd966422b ]

epoll can call out to vfs_poll() with a file pointer that may race with
the last 'fput()'. That would make f_count go down to zero, and while
the ep-&gt;mtx locking means that the resulting file pointer tear-down will
be blocked until the poll returns, it means that f_count is already
dead, and any use of it won't actually get a reference to the file any
more: it's dead regardless.

Make sure we have a valid ref on the file pointer before we call down to
vfs_poll() from the epoll routines.

Link: https://lore.kernel.org/lkml/0000000000002d631f0615918f1e@google.com/
Reported-by: syzbot+045b454ab35fd82a35fb@syzkaller.appspotmail.com
Reviewed-by: Jens Axboe &lt;axboe@kernel.dk&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</content>
</entry>
<entry>
<title>epoll: ep_autoremove_wake_function should use list_del_init_careful</title>
<updated>2023-06-21T14:00:54+00:00</updated>
<author>
<name>Benjamin Segall</name>
<email>bsegall@google.com</email>
</author>
<published>2023-05-30T18:32:28+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=3a340c63c0c34d867ced7045d473535687a17491'/>
<id>urn:sha1:3a340c63c0c34d867ced7045d473535687a17491</id>
<content type='text'>
commit 2192bba03d80f829233bfa34506b428f71e531e7 upstream.

autoremove_wake_function uses list_del_init_careful, so should epoll's
more aggressive variant.  It only doesn't because it was copied from an
older wait.c rather than the most recent.

[bsegall@google.com: add comment]
  Link: https://lkml.kernel.org/r/xm26bki0ulsr.fsf_-_@google.com
Link: https://lkml.kernel.org/r/xm26pm6hvfer.fsf@google.com
Fixes: a16ceb139610 ("epoll: autoremove wakers even more aggressively")
Signed-off-by: Ben Segall &lt;bsegall@google.com&gt;
Cc: Al Viro &lt;viro@zeniv.linux.org.uk&gt;
Cc: Christian Brauner &lt;brauner@kernel.org&gt;
Cc: &lt;stable@vger.kernel.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</content>
</entry>
<entry>
<title>eventpoll: add EPOLL_URING_WAKE poll wakeup flag</title>
<updated>2023-01-04T10:28:47+00:00</updated>
<author>
<name>Jens Axboe</name>
<email>axboe@kernel.dk</email>
</author>
<published>2022-11-20T17:10:53+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=717ed90a661494b895609fc536382dec40f71de6'/>
<id>urn:sha1:717ed90a661494b895609fc536382dec40f71de6</id>
<content type='text'>
[ Upstream commit caf1aeaffc3b09649a56769e559333ae2c4f1802 ]

We can have dependencies between epoll and io_uring. Consider an epoll
context, identified by the epfd file descriptor, and an io_uring file
descriptor identified by iofd. If we add iofd to the epfd context, and
arm a multishot poll request for epfd with iofd, then the multishot
poll request will repeatedly trigger and generate events until terminated
by CQ ring overflow. This isn't a desired behavior.

Add EPOLL_URING so that io_uring can pass it in as part of the poll wakeup
key, and io_uring can check for that to detect a potential recursive
invocation.

Cc: stable@vger.kernel.org # 6.0
Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
Stable-dep-of: 4464853277d0 ("io_uring: pass in EPOLL_URING_WAKE for eventfd signaling and wakeups")
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</content>
</entry>
<entry>
<title>epoll: use try_cmpxchg in list_add_tail_lockless</title>
<updated>2022-09-12T04:55:07+00:00</updated>
<author>
<name>Uros Bizjak</name>
<email>ubizjak@gmail.com</email>
</author>
<published>2022-07-14T17:32:55+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=693fc06e98514c2d5951ead4aca40cf8b21100b1'/>
<id>urn:sha1:693fc06e98514c2d5951ead4aca40cf8b21100b1</id>
<content type='text'>
Use try_cmpxchg instead of cmpxchg (*ptr, old, new) == old in
list_add_tail_lockless.  x86 CMPXCHG instruction returns success in ZF
flag, so this change saves a compare after cmpxchg (and related move
instruction in front of cmpxchg).

No functional change intended.

Link: https://lkml.kernel.org/r/20220714173255.12987-1-ubizjak@gmail.com
Signed-off-by: Uros Bizjak &lt;ubizjak@gmail.com&gt;
Cc: Alexander Viro &lt;viro@zeniv.linux.org.uk&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>epoll: autoremove wakers even more aggressively</title>
<updated>2022-07-18T00:31:40+00:00</updated>
<author>
<name>Benjamin Segall</name>
<email>bsegall@google.com</email>
</author>
<published>2022-06-15T21:24:23+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=a16ceb13961068f7209e34d7984f8e42d2c06159'/>
<id>urn:sha1:a16ceb13961068f7209e34d7984f8e42d2c06159</id>
<content type='text'>
If a process is killed or otherwise exits while having active network
connections and many threads waiting on epoll_wait, the threads will all
be woken immediately, but not removed from ep-&gt;wq.  Then when network
traffic scans ep-&gt;wq in wake_up, every wakeup attempt will fail, and will
not remove the entries from the list.

This means that the cost of the wakeup attempt is far higher than usual,
does not decrease, and this also competes with the dying threads trying to
actually make progress and remove themselves from the wq.

Handle this by removing visited epoll wq entries unconditionally, rather
than only when the wakeup succeeds - the structure of ep_poll means that
the only potential loss is the timed_out-&gt;eavail heuristic, which now can
race and result in a redundant ep_send_events attempt.  (But only when
incoming data and a timeout actually race, not on every timeout)

Shakeel added:

: We are seeing this issue in production with real workloads and it has
: caused hard lockups.  Particularly network heavy workloads with a lot
: of threads in epoll_wait() can easily trigger this issue if they get
: killed (oom-killed in our case).

Link: https://lkml.kernel.org/r/xm26fsjotqda.fsf@google.com
Signed-off-by: Ben Segall &lt;bsegall@google.com&gt;
Tested-by: Shakeel Butt &lt;shakeelb@google.com&gt;
Cc: Alexander Viro &lt;viro@zeniv.linux.org.uk&gt;
Cc: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Cc: Shakeel Butt &lt;shakeelb@google.com&gt;
Cc: Eric Dumazet &lt;edumazet@google.com&gt;
Cc: Roman Penyaev &lt;rpenyaev@suse.de&gt;
Cc: Jason Baron &lt;jbaron@akamai.com&gt;
Cc: Khazhismel Kumykov &lt;khazhy@google.com&gt;
Cc: Heiher &lt;r@hev.cc&gt;
Cc: &lt;stable@kernel.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</content>
</entry>
</feed>
