<feed xmlns='http://www.w3.org/2005/Atom'>
<title>kernel/linux.git/fs/eventpoll.c, branch v6.12.80</title>
<subtitle>Linux kernel stable tree (mirror)</subtitle>
<id>https://git.radix-linux.su/kernel/linux.git/atom?h=v6.12.80</id>
<link rel='self' href='https://git.radix-linux.su/kernel/linux.git/atom?h=v6.12.80'/>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/'/>
<updated>2026-03-13T16:20:31+00:00</updated>
<entry>
<title>eventpoll: Fix integer overflow in ep_loop_check_proc()</title>
<updated>2026-03-13T16:20:31+00:00</updated>
<author>
<name>Jann Horn</name>
<email>jannh@google.com</email>
</author>
<published>2026-02-23T19:59:33+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=029be23f553bdc544680b1fe7d16c4d2a75ee7a0'/>
<id>urn:sha1:029be23f553bdc544680b1fe7d16c4d2a75ee7a0</id>
<content type='text'>
commit fdcfce93073d990ed4b71752e31ad1c1d6e9d58b upstream.

If a recursive call to ep_loop_check_proc() hits the `result = INT_MAX`,
an integer overflow will occur in the calling ep_loop_check_proc() at
`result = max(result, ep_loop_check_proc(ep_tovisit, depth + 1) + 1)`,
breaking the recursion depth check.

Fix it by using a different placeholder value that can't lead to an
overflow.

Reported-by: Guenter Roeck &lt;linux@roeck-us.net&gt;
Fixes: f2e467a48287 ("eventpoll: Fix semi-unbounded recursion")
Cc: stable@vger.kernel.org
Signed-off-by: Jann Horn &lt;jannh@google.com&gt;
Link: https://patch.msgid.link/20260223-epoll-int-overflow-v1-1-452f35132224@google.com
Signed-off-by: Christian Brauner &lt;brauner@kernel.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</content>
</entry>
<entry>
<title>eventpoll: Replace rwlock with spinlock</title>
<updated>2025-10-19T14:33:49+00:00</updated>
<author>
<name>Nam Cao</name>
<email>namcao@linutronix.de</email>
</author>
<published>2025-07-15T12:46:34+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=b99bc5a48ddcc6148957fc0662ffac86740ae9f8'/>
<id>urn:sha1:b99bc5a48ddcc6148957fc0662ffac86740ae9f8</id>
<content type='text'>
commit 0c43094f8cc9d3d99d835c0ac9c4fe1ccc62babd upstream.

The ready event list of an epoll object is protected by read-write
semaphore:

  - The consumer (waiter) acquires the write lock and takes items.
  - the producer (waker) takes the read lock and adds items.

The point of this design is enabling epoll to scale well with large number
of producers, as multiple producers can hold the read lock at the same
time.

Unfortunately, this implementation may cause scheduling priority inversion
problem. Suppose the consumer has higher scheduling priority than the
producer. The consumer needs to acquire the write lock, but may be blocked
by the producer holding the read lock. Since read-write semaphore does not
support priority-boosting for the readers (even with CONFIG_PREEMPT_RT=y),
we have a case of priority inversion: a higher priority consumer is blocked
by a lower priority producer. This problem was reported in [1].

Furthermore, this could also cause stall problem, as described in [2].

Fix this problem by replacing rwlock with spinlock.

This reduces the event bandwidth, as the producers now have to contend with
each other for the spinlock. According to the benchmark from
https://github.com/rouming/test-tools/blob/master/stress-epoll.c:

    On 12 x86 CPUs:
                  Before     After        Diff
        threads  events/ms  events/ms
              8       7162       4956     -31%
             16       8733       5383     -38%
             32       7968       5572     -30%
             64      10652       5739     -46%
            128      11236       5931     -47%

    On 4 riscv CPUs:
                  Before     After        Diff
        threads  events/ms  events/ms
              8       2958       2833      -4%
             16       3323       3097      -7%
             32       3451       3240      -6%
             64       3554       3178     -11%
            128       3601       3235     -10%

Although the numbers look bad, it should be noted that this benchmark
creates multiple threads who do nothing except constantly generating new
epoll events, thus contention on the spinlock is high. For real workload,
the event rate is likely much lower, and the performance drop is not as
bad.

Using another benchmark (perf bench epoll wait) where spinlock contention
is lower, improvement is even observed on x86:

    On 12 x86 CPUs:
        Before: Averaged 110279 operations/sec (+- 1.09%), total secs = 8
        After:  Averaged 114577 operations/sec (+- 2.25%), total secs = 8

    On 4 riscv CPUs:
        Before: Averaged 175767 operations/sec (+- 0.62%), total secs = 8
        After:  Averaged 167396 operations/sec (+- 0.23%), total secs = 8

In conclusion, no one is likely to be upset over this change. After all,
spinlock was used originally for years, and the commit which converted to
rwlock didn't mention a real workload, just that the benchmark numbers are
nice.

This patch is not exactly the revert of commit a218cc491420 ("epoll: use
rwlock in order to reduce ep_poll_callback() contention"), because git
revert conflicts in some places which are not obvious on the resolution.
This patch is intended to be backported, therefore go with the obvious
approach:

  - Replace rwlock_t with spinlock_t one to one

  - Delete list_add_tail_lockless() and chain_epi_lockless(). These were
    introduced to allow producers to concurrently add items to the list.
    But now that spinlock no longer allows producers to touch the event
    list concurrently, these two functions are not necessary anymore.

Fixes: a218cc491420 ("epoll: use rwlock in order to reduce ep_poll_callback() contention")
Signed-off-by: Nam Cao &lt;namcao@linutronix.de&gt;
Link: https://lore.kernel.org/ec92458ea357ec503c737ead0f10b2c6e4c37d47.1752581388.git.namcao@linutronix.de
Tested-by: K Prateek Nayak &lt;kprateek.nayak@amd.com&gt;
Cc: stable@vger.kernel.org
Reported-by: Frederic Weisbecker &lt;frederic@kernel.org&gt;
Closes: https://lore.kernel.org/linux-rt-users/20210825132754.GA895675@lothringen/ [1]
Reported-by: Valentin Schneider &lt;vschneid@redhat.com&gt;
Closes: https://lore.kernel.org/linux-rt-users/xhsmhttqvnall.mognet@vschneid.remote.csb/ [2]
Signed-off-by: Christian Brauner &lt;brauner@kernel.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</content>
</entry>
<entry>
<title>eventpoll: Fix semi-unbounded recursion</title>
<updated>2025-08-20T16:30:15+00:00</updated>
<author>
<name>Jann Horn</name>
<email>jannh@google.com</email>
</author>
<published>2025-07-11T16:33:36+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=7a2125962c42d5336ca0495a9ce4cb38a63e9161'/>
<id>urn:sha1:7a2125962c42d5336ca0495a9ce4cb38a63e9161</id>
<content type='text'>
commit f2e467a48287c868818085aa35389a224d226732 upstream.

Ensure that epoll instances can never form a graph deeper than
EP_MAX_NESTS+1 links.

Currently, ep_loop_check_proc() ensures that the graph is loop-free and
does some recursion depth checks, but those recursion depth checks don't
limit the depth of the resulting tree for two reasons:

 - They don't look upwards in the tree.
 - If there are multiple downwards paths of different lengths, only one of
   the paths is actually considered for the depth check since commit
   28d82dc1c4ed ("epoll: limit paths").

Essentially, the current recursion depth check in ep_loop_check_proc() just
serves to prevent it from recursing too deeply while checking for loops.

A more thorough check is done in reverse_path_check() after the new graph
edge has already been created; this checks, among other things, that no
paths going upwards from any non-epoll file with a length of more than 5
edges exist. However, this check does not apply to non-epoll files.

As a result, it is possible to recurse to a depth of at least roughly 500,
tested on v6.15. (I am unsure if deeper recursion is possible; and this may
have changed with commit 8c44dac8add7 ("eventpoll: Fix priority inversion
problem").)

To fix it:

1. In ep_loop_check_proc(), note the subtree depth of each visited node,
and use subtree depths for the total depth calculation even when a subtree
has already been visited.
2. Add ep_get_upwards_depth_proc() for similarly determining the maximum
depth of an upwards walk.
3. In ep_loop_check(), use these values to limit the total path length
between epoll nodes to EP_MAX_NESTS edges.

Fixes: 22bacca48a17 ("epoll: prevent creating circular epoll structures")
Cc: stable@vger.kernel.org
Signed-off-by: Jann Horn &lt;jannh@google.com&gt;
Link: https://lore.kernel.org/20250711-epoll-recursion-fix-v1-1-fb2457c33292@google.com
Signed-off-by: Christian Brauner &lt;brauner@kernel.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</content>
</entry>
<entry>
<title>eventpoll: don't decrement ep refcount while still holding the ep mutex</title>
<updated>2025-07-17T16:37:01+00:00</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2025-07-09T17:38:29+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=6dee745bd0aec9d399df674256e7b1ecdb615444'/>
<id>urn:sha1:6dee745bd0aec9d399df674256e7b1ecdb615444</id>
<content type='text'>
commit 8c2e52ebbe885c7eeaabd3b7ddcdc1246fc400d2 upstream.

Jann Horn points out that epoll is decrementing the ep refcount and then
doing a

    mutex_unlock(&amp;ep-&gt;mtx);

afterwards. That's very wrong, because it can lead to a use-after-free.

That pattern is actually fine for the very last reference, because the
code in question will delay the actual call to "ep_free(ep)" until after
it has unlocked the mutex.

But it's wrong for the much subtler "next to last" case when somebody
*else* may also be dropping their reference and free the ep while we're
still using the mutex.

Note that this is true even if that other user is also using the same ep
mutex: mutexes, unlike spinlocks, can not be used for object ownership,
even if they guarantee mutual exclusion.

A mutex "unlock" operation is not atomic, and as one user is still
accessing the mutex as part of unlocking it, another user can come in
and get the now released mutex and free the data structure while the
first user is still cleaning up.

See our mutex documentation in Documentation/locking/mutex-design.rst,
in particular the section [1] about semantics:

	"mutex_unlock() may access the mutex structure even after it has
	 internally released the lock already - so it's not safe for
	 another context to acquire the mutex and assume that the
	 mutex_unlock() context is not using the structure anymore"

So if we drop our ep ref before the mutex unlock, but we weren't the
last one, we may then unlock the mutex, another user comes in, drops
_their_ reference and releases the 'ep' as it now has no users - all
while the mutex_unlock() is still accessing it.

Fix this by simply moving the ep refcount dropping to outside the mutex:
the refcount itself is atomic, and doesn't need mutex protection (that's
the whole _point_ of refcounts: unlike mutexes, they are inherently
about object lifetimes).

Reported-by: Jann Horn &lt;jannh@google.com&gt;
Link: https://docs.kernel.org/locking/mutex-design.html#semantics [1]
Cc: Alexander Viro &lt;viro@zeniv.linux.org.uk&gt;
Cc: Christian Brauner &lt;brauner@kernel.org&gt;
Cc: Jan Kara &lt;jack@suse.cz&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</content>
</entry>
<entry>
<title>epoll: Add synchronous wakeup support for ep_poll_callback</title>
<updated>2024-12-27T13:02:20+00:00</updated>
<author>
<name>Xuewen Yan</name>
<email>xuewen.yan@unisoc.com</email>
</author>
<published>2024-04-26T08:05:48+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=2d129beb8dd49f12d23f27ecc92cdf544d1e9472'/>
<id>urn:sha1:2d129beb8dd49f12d23f27ecc92cdf544d1e9472</id>
<content type='text'>
commit 900bbaae67e980945dec74d36f8afe0de7556d5a upstream.

Now, the epoll only use wake_up() interface to wake up task.
However, sometimes, there are epoll users which want to use
the synchronous wakeup flag to hint the scheduler, such as
Android binder driver.
So add a wake_up_sync() define, and use the wake_up_sync()
when the sync is true in ep_poll_callback().

Co-developed-by: Jing Xia &lt;jing.xia@unisoc.com&gt;
Signed-off-by: Jing Xia &lt;jing.xia@unisoc.com&gt;
Signed-off-by: Xuewen Yan &lt;xuewen.yan@unisoc.com&gt;
Link: https://lore.kernel.org/r/20240426080548.8203-1-xuewen.yan@unisoc.com
Tested-by: Brian Geffon &lt;bgeffon@google.com&gt;
Reviewed-by: Brian Geffon &lt;bgeffon@google.com&gt;
Reported-by: Benoit Lize &lt;lizeb@google.com&gt;
Signed-off-by: Christian Brauner &lt;brauner@kernel.org&gt;
Cc: Brian Geffon &lt;bgeffon@google.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</content>
</entry>
<entry>
<title>epoll: annotate racy check</title>
<updated>2024-12-14T19:03:35+00:00</updated>
<author>
<name>Christian Brauner</name>
<email>brauner@kernel.org</email>
</author>
<published>2024-09-25T09:05:16+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=d222934627b9a9fbd943386988c5f923c95f0773'/>
<id>urn:sha1:d222934627b9a9fbd943386988c5f923c95f0773</id>
<content type='text'>
[ Upstream commit 6474353a5e3d0b2cf610153cea0c61f576a36d0a ]

Epoll relies on a racy fastpath check during __fput() in
eventpoll_release() to avoid the hit of pointlessly acquiring a
semaphore. Annotate that race by using WRITE_ONCE() and READ_ONCE().

Link: https://lore.kernel.org/r/66edfb3c.050a0220.3195df.001a.GAE@google.com
Link: https://lore.kernel.org/r/20240925-fungieren-anbauen-79b334b00542@brauner
Reviewed-by: Jan Kara &lt;jack@suse.cz&gt;
Reported-by: syzbot+3b6b32dc50537a49bb4a@syzkaller.appspotmail.com
Signed-off-by: Christian Brauner &lt;brauner@kernel.org&gt;
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</content>
</entry>
<entry>
<title>Merge tag 'pull-stable-struct_fd' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs</title>
<updated>2024-09-23T16:35:36+00:00</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2024-09-23T16:35:36+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=f8ffbc365f703d74ecca8ca787318d05bbee2bf7'/>
<id>urn:sha1:f8ffbc365f703d74ecca8ca787318d05bbee2bf7</id>
<content type='text'>
Pull 'struct fd' updates from Al Viro:
 "Just the 'struct fd' layout change, with conversion to accessor
  helpers"

* tag 'pull-stable-struct_fd' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  add struct fd constructors, get rid of __to_fd()
  struct fd: representation change
  introduce fd_file(), convert all accessors to it.
</content>
</entry>
<entry>
<title>eventpoll: Annotate data-race of busy_poll_usecs</title>
<updated>2024-08-30T06:22:34+00:00</updated>
<author>
<name>Martin Karsten</name>
<email>mkarsten@uwaterloo.ca</email>
</author>
<published>2024-08-06T12:33:01+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=b9ca079dd6b09e08863aa998edf5c47597806c05'/>
<id>urn:sha1:b9ca079dd6b09e08863aa998edf5c47597806c05</id>
<content type='text'>
A struct eventpoll's busy_poll_usecs field can be modified via a user
ioctl at any time. All reads of this field should be annotated with
READ_ONCE.

Fixes: 85455c795c07 ("eventpoll: support busy poll per epoll instance")
Cc: stable@vger.kernel.org
Signed-off-by: Martin Karsten &lt;mkarsten@uwaterloo.ca&gt;
Link: https://lore.kernel.org/r/20240806123301.167557-1-jdamato@fastly.com
Reviewed-by: Joe Damato &lt;jdamato@fastly.com&gt;
Signed-off-by: Christian Brauner &lt;brauner@kernel.org&gt;
</content>
</entry>
<entry>
<title>eventpoll: Don't re-zero eventpoll fields</title>
<updated>2024-08-30T06:22:34+00:00</updated>
<author>
<name>Joe Damato</name>
<email>jdamato@fastly.com</email>
</author>
<published>2024-08-07T10:52:31+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=4f98f380f4662829e394c840062ea677956b57ec'/>
<id>urn:sha1:4f98f380f4662829e394c840062ea677956b57ec</id>
<content type='text'>
Remove redundant and unnecessary code.

ep_alloc uses kzalloc to create struct eventpoll, so there is no need to
set fields to defaults of 0. This was accidentally introduced in commit
85455c795c07 ("eventpoll: support busy poll per epoll instance") and
expanded on in follow-up commits.

Signed-off-by: Joe Damato &lt;jdamato@fastly.com&gt;
Link: https://lore.kernel.org/r/20240807105231.179158-1-jdamato@fastly.com
Reviewed-by: Martin Karsten &lt;mkarsten@uwaterloo.ca&gt;
Reviewed-by: Jan Kara &lt;jack@suse.cz&gt;
Signed-off-by: Christian Brauner &lt;brauner@kernel.org&gt;
</content>
</entry>
<entry>
<title>introduce fd_file(), convert all accessors to it.</title>
<updated>2024-08-13T02:00:43+00:00</updated>
<author>
<name>Al Viro</name>
<email>viro@zeniv.linux.org.uk</email>
</author>
<published>2024-05-31T18:12:01+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=1da91ea87aefe2c25b68c9f96947a9271ba6325d'/>
<id>urn:sha1:1da91ea87aefe2c25b68c9f96947a9271ba6325d</id>
<content type='text'>
	For any changes of struct fd representation we need to
turn existing accesses to fields into calls of wrappers.
Accesses to struct fd::flags are very few (3 in linux/file.h,
1 in net/socket.c, 3 in fs/overlayfs/file.c and 3 more in
explicit initializers).
	Those can be dealt with in the commit converting to
new layout; accesses to struct fd::file are too many for that.
	This commit converts (almost) all of f.file to
fd_file(f).  It's not entirely mechanical ('file' is used as
a member name more than just in struct fd) and it does not
even attempt to distinguish the uses in pointer context from
those in boolean context; the latter will be eventually turned
into a separate helper (fd_empty()).

	NOTE: mass conversion to fd_empty(), tempting as it
might be, is a bad idea; better do that piecewise in commit
that convert from fdget...() to CLASS(...).

[conflicts in fs/fhandle.c, kernel/bpf/syscall.c, mm/memcontrol.c
caught by git; fs/stat.c one got caught by git grep]
[fs/xattr.c conflict]

Reviewed-by: Christian Brauner &lt;brauner@kernel.org&gt;
Signed-off-by: Al Viro &lt;viro@zeniv.linux.org.uk&gt;
</content>
</entry>
</feed>
