<feed xmlns='http://www.w3.org/2005/Atom'>
<title>kernel/linux.git/fs/pipe.c, branch v5.15.209</title>
<subtitle>Linux kernel stable tree (mirror)</subtitle>
<id>https://git.radix-linux.su/kernel/linux.git/atom?h=v5.15.209</id>
<link rel='self' href='https://git.radix-linux.su/kernel/linux.git/atom?h=v5.15.209'/>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/'/>
<updated>2024-04-10T14:19:42+00:00</updated>
<entry>
<title>fs/pipe: Fix lockdep false-positive in watchqueue pipe_write()</title>
<updated>2024-04-10T14:19:42+00:00</updated>
<author>
<name>Jann Horn</name>
<email>jannh@google.com</email>
</author>
<published>2023-11-24T15:08:22+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=de48795233cc07eb0be885688ff469fcb18de330'/>
<id>urn:sha1:de48795233cc07eb0be885688ff469fcb18de330</id>
<content type='text'>
[ Upstream commit 055ca83559912f2cfd91c9441427bac4caf3c74e ]

When you try to splice between a normal pipe and a notification pipe,
get_pipe_info(..., true) fails, so splice() falls back to treating the
notification pipe like a normal pipe - so we end up in
iter_file_splice_write(), which first locks the input pipe, then calls
vfs_iter_write(), which locks the output pipe.

Lockdep complains about that, because we're taking a pipe lock while
already holding another pipe lock.

I think this probably (?) can't actually lead to deadlocks, since you'd
need another way to nest locking a normal pipe into locking a
watch_queue pipe, but the lockdep annotations don't make that clear.

Bail out earlier in pipe_write() for notification pipes, before taking
the pipe lock.

Reported-and-tested-by: &lt;syzbot+011e4ea1da6692cf881c@syzkaller.appspotmail.com&gt;
Closes: https://syzkaller.appspot.com/bug?extid=011e4ea1da6692cf881c
Fixes: c73be61cede5 ("pipe: Add general notification queue support")
Signed-off-by: Jann Horn &lt;jannh@google.com&gt;
Link: https://lore.kernel.org/r/20231124150822.2121798-1-jannh@google.com
Signed-off-by: Christian Brauner &lt;brauner@kernel.org&gt;
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</content>
</entry>
<entry>
<title>pipe: wakeup wr_wait after setting max_usage</title>
<updated>2024-02-23T07:54:33+00:00</updated>
<author>
<name>Lukas Schauer</name>
<email>lukas@schauer.dev</email>
</author>
<published>2023-12-01T10:11:28+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=3efbd114b91525bb095b8ae046382197d92126b9'/>
<id>urn:sha1:3efbd114b91525bb095b8ae046382197d92126b9</id>
<content type='text'>
[ Upstream commit e95aada4cb93d42e25c30a0ef9eb2923d9711d4a ]

Commit c73be61cede5 ("pipe: Add general notification queue support") a
regression was introduced that would lock up resized pipes under certain
conditions. See the reproducer in [1].

The commit resizing the pipe ring size was moved to a different
function, doing that moved the wakeup for pipe-&gt;wr_wait before actually
raising pipe-&gt;max_usage. If a pipe was full before the resize occured it
would result in the wakeup never actually triggering pipe_write.

Set @max_usage and @nr_accounted before waking writers if this isn't a
watch queue.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=212295 [1]
Link: https://lore.kernel.org/r/20231201-orchideen-modewelt-e009de4562c6@brauner
Fixes: c73be61cede5 ("pipe: Add general notification queue support")
Reviewed-by: David Howells &lt;dhowells@redhat.com&gt;
Cc: &lt;stable@vger.kernel.org&gt;
Signed-off-by: Lukas Schauer &lt;lukas@schauer.dev&gt;
[Christian Brauner &lt;brauner@kernel.org&gt;: rewrite to account for watch queues]
Signed-off-by: Christian Brauner &lt;brauner@kernel.org&gt;
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</content>
</entry>
<entry>
<title>fs/pipe: move check to pipe_has_watch_queue()</title>
<updated>2024-02-23T07:54:33+00:00</updated>
<author>
<name>Max Kellermann</name>
<email>max.kellermann@ionos.com</email>
</author>
<published>2023-09-21T07:57:53+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=26bfccac21fcf1cf16c6be6941e0a50cb4f1152d'/>
<id>urn:sha1:26bfccac21fcf1cf16c6be6941e0a50cb4f1152d</id>
<content type='text'>
[ Upstream commit b4bd6b4bac8edd61eb8f7b836969d12c0c6af165 ]

This declutters the code by reducing the number of #ifdefs and makes
the watch_queue checks simpler.  This has no runtime effect; the
machine code is identical.

Signed-off-by: Max Kellermann &lt;max.kellermann@ionos.com&gt;
Message-Id: &lt;20230921075755.1378787-2-max.kellermann@ionos.com&gt;
Reviewed-by: David Howells &lt;dhowells@redhat.com&gt;
Signed-off-by: Christian Brauner &lt;brauner@kernel.org&gt;
Stable-dep-of: e95aada4cb93 ("pipe: wakeup wr_wait after setting max_usage")
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</content>
</entry>
<entry>
<title>pipe: Fix missing lock in pipe_resize_ring()</title>
<updated>2022-06-06T06:43:37+00:00</updated>
<author>
<name>David Howells</name>
<email>dhowells@redhat.com</email>
</author>
<published>2022-05-26T06:34:52+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=cf2fbc56c478a34a68ff1fa6ad08460054dfd499'/>
<id>urn:sha1:cf2fbc56c478a34a68ff1fa6ad08460054dfd499</id>
<content type='text'>
commit 189b0ddc245139af81198d1a3637cac74f96e13a upstream.

pipe_resize_ring() needs to take the pipe-&gt;rd_wait.lock spinlock to
prevent post_one_notification() from trying to insert into the ring
whilst the ring is being replaced.

The occupancy check must be done after the lock is taken, and the lock
must be taken after the new ring is allocated.

The bug can lead to an oops looking something like:

 BUG: KASAN: use-after-free in post_one_notification.isra.0+0x62e/0x840
 Read of size 4 at addr ffff88801cc72a70 by task poc/27196
 ...
 Call Trace:
  post_one_notification.isra.0+0x62e/0x840
  __post_watch_notification+0x3b7/0x650
  key_create_or_update+0xb8b/0xd20
  __do_sys_add_key+0x175/0x340
  __x64_sys_add_key+0xbe/0x140
  do_syscall_64+0x5c/0xc0
  entry_SYSCALL_64_after_hwframe+0x44/0xae

Reported by Selim Enes Karaduman @Enesdex working with Trend Micro Zero
Day Initiative.

Fixes: c73be61cede5 ("pipe: Add general notification queue support")
Reported-by: zdi-disclosures@trendmicro.com # ZDI-CAN-17291
Signed-off-by: David Howells &lt;dhowells@redhat.com&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</content>
</entry>
<entry>
<title>pipe: make poll_usage boolean and annotate its access</title>
<updated>2022-06-06T06:43:37+00:00</updated>
<author>
<name>Kuniyuki Iwashima</name>
<email>kuniyu@amazon.co.jp</email>
</author>
<published>2022-04-29T21:38:01+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=e6acf868ff0e09efd0a4bee109989640ac792ccf'/>
<id>urn:sha1:e6acf868ff0e09efd0a4bee109989640ac792ccf</id>
<content type='text'>
commit f485922d8fe4e44f6d52a5bb95a603b7c65554bb upstream.

Patch series "Fix data-races around epoll reported by KCSAN."

This series suppresses a false positive KCSAN's message and fixes a real
data-race.


This patch (of 2):

pipe_poll() runs locklessly and assigns 1 to poll_usage.  Once poll_usage
is set to 1, it never changes in other places.  However, concurrent writes
of a value trigger KCSAN, so let's make KCSAN happy.

BUG: KCSAN: data-race in pipe_poll / pipe_poll

write to 0xffff8880042f6678 of 4 bytes by task 174 on cpu 3:
 pipe_poll (fs/pipe.c:656)
 ep_item_poll.isra.0 (./include/linux/poll.h:88 fs/eventpoll.c:853)
 do_epoll_wait (fs/eventpoll.c:1692 fs/eventpoll.c:1806 fs/eventpoll.c:2234)
 __x64_sys_epoll_wait (fs/eventpoll.c:2246 fs/eventpoll.c:2241 fs/eventpoll.c:2241)
 do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80)
 entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:113)

write to 0xffff8880042f6678 of 4 bytes by task 177 on cpu 1:
 pipe_poll (fs/pipe.c:656)
 ep_item_poll.isra.0 (./include/linux/poll.h:88 fs/eventpoll.c:853)
 do_epoll_wait (fs/eventpoll.c:1692 fs/eventpoll.c:1806 fs/eventpoll.c:2234)
 __x64_sys_epoll_wait (fs/eventpoll.c:2246 fs/eventpoll.c:2241 fs/eventpoll.c:2241)
 do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80)
 entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:113)

Reported by Kernel Concurrency Sanitizer on:
CPU: 1 PID: 177 Comm: epoll_race Not tainted 5.17.0-58927-gf443e374ae13 #6
Hardware name: Red Hat KVM, BIOS 1.11.0-2.amzn2 04/01/2014

Link: https://lkml.kernel.org/r/20220322002653.33865-1-kuniyu@amazon.co.jp
Link: https://lkml.kernel.org/r/20220322002653.33865-2-kuniyu@amazon.co.jp
Fixes: 3b844826b6c6 ("pipe: avoid unnecessary EPOLLET wakeups under normal loads")
Signed-off-by: Kuniyuki Iwashima &lt;kuniyu@amazon.co.jp&gt;
Cc: Alexander Duyck &lt;alexander.h.duyck@intel.com&gt;
Cc: Al Viro &lt;viro@zeniv.linux.org.uk&gt;
Cc: Davidlohr Bueso &lt;dave@stgolabs.net&gt;
Cc: Kuniyuki Iwashima &lt;kuni1840@gmail.com&gt;
Cc: "Soheil Hassas Yeganeh" &lt;soheil@google.com&gt;
Cc: "Sridhar Samudrala" &lt;sridhar.samudrala@intel.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</content>
</entry>
<entry>
<title>watch_queue: Fix lack of barrier/sync/lock between post and read</title>
<updated>2022-03-16T13:23:44+00:00</updated>
<author>
<name>David Howells</name>
<email>dhowells@redhat.com</email>
</author>
<published>2022-03-11T13:24:36+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=eb38c2e9fc740b5218dc94d2a6f43802d49e811c'/>
<id>urn:sha1:eb38c2e9fc740b5218dc94d2a6f43802d49e811c</id>
<content type='text'>
commit 2ed147f015af2b48f41c6f0b6746aa9ea85c19f3 upstream.

There's nothing to synchronise post_one_notification() versus
pipe_read().  Whilst posting is done under pipe-&gt;rd_wait.lock, the
reader only takes pipe-&gt;mutex which cannot bar notification posting as
that may need to be made from contexts that cannot sleep.

Fix this by setting pipe-&gt;head with a barrier in post_one_notification()
and reading pipe-&gt;head with a barrier in pipe_read().

If that's not sufficient, the rd_wait.lock will need to be taken,
possibly in a -&gt;confirm() op so that it only applies to notifications.
The lock would, however, have to be dropped before copy_page_to_iter()
is invoked.

Fixes: c73be61cede5 ("pipe: Add general notification queue support")
Reported-by: Jann Horn &lt;jannh@google.com&gt;
Signed-off-by: David Howells &lt;dhowells@redhat.com&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</content>
</entry>
<entry>
<title>watch_queue, pipe: Free watchqueue state after clearing pipe ring</title>
<updated>2022-03-16T13:23:44+00:00</updated>
<author>
<name>David Howells</name>
<email>dhowells@redhat.com</email>
</author>
<published>2022-03-11T13:23:38+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=8275b6699c6dda4d4cec748bc2e879cac532d193'/>
<id>urn:sha1:8275b6699c6dda4d4cec748bc2e879cac532d193</id>
<content type='text'>
commit db8facfc9fafacefe8a835416a6b77c838088f8b upstream.

In free_pipe_info(), free the watchqueue state after clearing the pipe
ring as each pipe ring descriptor has a release function, and in the
case of a notification message, this is watch_queue_pipe_buf_release()
which tries to mark the allocation bitmap that was previously released.

Fix this by moving the put of the pipe's ref on the watch queue to after
the ring has been cleared.  We still need to call watch_queue_clear()
before doing that to make sure that the pipe is disconnected from any
notification sources first.

Fixes: c73be61cede5 ("pipe: Add general notification queue support")
Reported-by: Jann Horn &lt;jannh@google.com&gt;
Signed-off-by: David Howells &lt;dhowells@redhat.com&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</content>
</entry>
<entry>
<title>Revert "mm/gup: remove try_get_page(), call try_get_compound_head() directly"</title>
<updated>2021-09-07T18:03:45+00:00</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2021-09-07T18:03:45+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=cd1adf1b63a112d762832e9c64b0a886fbb840d6'/>
<id>urn:sha1:cd1adf1b63a112d762832e9c64b0a886fbb840d6</id>
<content type='text'>
This reverts commit 9857a17f206ff374aea78bccfb687f145368be2e.

That commit was completely broken, and I should have caught on to it
earlier.  But happily, the kernel test robot noticed the breakage fairly
quickly.

The breakage is because "try_get_page()" is about avoiding the page
reference count overflow case, but is otherwise the exact same as a
plain "get_page()".

In contrast, "try_get_compound_head()" is an entirely different beast,
and uses __page_cache_add_speculative() because it's not just about the
page reference count, but also about possibly racing with the underlying
page going away.

So all the commentary about how

 "try_get_page() has fallen a little behind in terms of maintenance,
  try_get_compound_head() handles speculative page references more
  thoroughly"

was just completely wrong: yes, try_get_compound_head() handles
speculative page references, but the point is that try_get_page() does
not, and must not.

So there's no lack of maintainance - there are fundamentally different
semantics.

A speculative page reference would be entirely wrong in "get_page()",
and it's entirely wrong in "try_get_page()".  It's not about
speculation, it's purely about "uhhuh, you can't get this page because
you've tried to increment the reference count too much already".

The reason the kernel test robot noticed this bug was that it hit the
VM_BUG_ON() in __page_cache_add_speculative(), which is all about
verifying that the context of any speculative page access is correct.
But since that isn't what try_get_page() is all about, the VM_BUG_ON()
tests things that are not correct to test for try_get_page().

Reported-by: kernel test robot &lt;oliver.sang@intel.com&gt;
Cc: John Hubbard &lt;jhubbard@nvidia.com&gt;
Cc: Christoph Hellwig &lt;hch@lst.de&gt;
Cc: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm/gup: remove try_get_page(), call try_get_compound_head() directly</title>
<updated>2021-09-03T16:58:11+00:00</updated>
<author>
<name>John Hubbard</name>
<email>jhubbard@nvidia.com</email>
</author>
<published>2021-09-02T21:53:54+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=9857a17f206ff374aea78bccfb687f145368be2e'/>
<id>urn:sha1:9857a17f206ff374aea78bccfb687f145368be2e</id>
<content type='text'>
try_get_page() is very similar to try_get_compound_head(), and in fact
try_get_page() has fallen a little behind in terms of maintenance:
try_get_compound_head() handles speculative page references more
thoroughly.

There are only two try_get_page() callsites, so just call
try_get_compound_head() directly from those, and remove try_get_page()
entirely.

Also, seeing as how this changes try_get_compound_head() into a non-static
function, provide some kerneldoc documentation for it.

Link: https://lkml.kernel.org/r/20210813044133.1536842-4-jhubbard@nvidia.com
Signed-off-by: John Hubbard &lt;jhubbard@nvidia.com&gt;
Reviewed-by: Christoph Hellwig &lt;hch@lst.de&gt;
Cc: Matthew Wilcox &lt;willy@infradead.org&gt;
Cc: Christian Borntraeger &lt;borntraeger@de.ibm.com&gt;
Cc: Heiko Carstens &lt;hca@linux.ibm.com&gt;
Cc: Vasily Gorbik &lt;gor@linux.ibm.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>pipe: do FASYNC notifications for every pipe IO, not just state changes</title>
<updated>2021-08-25T17:27:16+00:00</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2021-08-24T17:39:25+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=fe67f4dd8daa252eb9aa7acb61555f3cc3c1ce4c'/>
<id>urn:sha1:fe67f4dd8daa252eb9aa7acb61555f3cc3c1ce4c</id>
<content type='text'>
It turns out that the SIGIO/FASYNC situation is almost exactly the same
as the EPOLLET case was: user space really wants to be notified after
every operation.

Now, in a perfect world it should be sufficient to only notify user
space on "state transitions" when the IO state changes (ie when a pipe
goes from unreadable to readable, or from unwritable to writable).  User
space should then do as much as possible - fully emptying the buffer or
what not - and we'll notify it again the next time the state changes.

But as with EPOLLET, we have at least one case (stress-ng) where the
kernel sent SIGIO due to the pipe being marked for asynchronous
notification, but the user space signal handler then didn't actually
necessarily read it all before returning (it read more than what was
written, but since there could be multiple writes, it could leave data
pending).

The user space code then expected to get another SIGIO for subsequent
writes - even though the pipe had been readable the whole time - and
would only then read more.

This is arguably a user space bug - and Colin King already fixed the
stress-ng code in question - but the kernel regression rules are clear:
it doesn't matter if kernel people think that user space did something
silly and wrong.  What matters is that it used to work.

So if user space depends on specific historical kernel behavior, it's a
regression when that behavior changes.  It's on us: we were silly to
have that non-optimal historical behavior, and our old kernel behavior
was what user space was tested against.

Because of how the FASYNC notification was tied to wakeup behavior, this
was first broken by commits f467a6a66419 and 1b6b26ae7053 ("pipe: fix
and clarify pipe read/write wakeup logic"), but at the time it seems
nobody noticed.  Probably because the stress-ng problem case ends up
being timing-dependent too.

It was then unwittingly fixed by commit 3a34b13a88ca ("pipe: make pipe
writes always wake up readers") only to be broken again when by commit
3b844826b6c6 ("pipe: avoid unnecessary EPOLLET wakeups under normal
loads").

And at that point the kernel test robot noticed the performance
refression in the stress-ng.sigio.ops_per_sec case.  So the "Fixes" tag
below is somewhat ad hoc, but it matches when the issue was noticed.

Fix it for good (knock wood) by simply making the kill_fasync() case
separate from the wakeup case.  FASYNC is quite rare, and we clearly
shouldn't even try to use the "avoid unnecessary wakeups" logic for it.

Link: https://lore.kernel.org/lkml/20210824151337.GC27667@xsang-OptiPlex-9020/
Fixes: 3b844826b6c6 ("pipe: avoid unnecessary EPOLLET wakeups under normal loads")
Reported-by: kernel test robot &lt;oliver.sang@intel.com&gt;
Tested-by: Oliver Sang &lt;oliver.sang@intel.com&gt;
Cc: Eric Biederman &lt;ebiederm@xmission.com&gt;
Cc: Colin Ian King &lt;colin.king@canonical.com&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
</feed>
