<feed xmlns='http://www.w3.org/2005/Atom'>
<title>kernel/linux.git/fs/file.c, branch v6.6.132</title>
<subtitle>Linux kernel stable tree (mirror)</subtitle>
<id>https://git.radix-linux.su/kernel/linux.git/atom?h=v6.6.132</id>
<link rel='self' href='https://git.radix-linux.su/kernel/linux.git/atom?h=v6.6.132'/>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/'/>
<updated>2025-10-19T14:30:39+00:00</updated>
<entry>
<title>fs: always return zero on success from replace_fd()</title>
<updated>2025-10-19T14:30:39+00:00</updated>
<author>
<name>Thomas Weißschuh</name>
<email>thomas.weissschuh@linutronix.de</email>
</author>
<published>2025-08-05T12:38:08+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=bc7372c05a07f6ecec8f4f6ffe7a7be40179b2f3'/>
<id>urn:sha1:bc7372c05a07f6ecec8f4f6ffe7a7be40179b2f3</id>
<content type='text'>
commit 708c04a5c2b78e22f56e2350de41feba74dfccd9 upstream.

replace_fd() returns the number of the new file descriptor through the
return value of do_dup2(). However its callers never care about the
specific returned number. In fact the caller in receive_fd_replace() treats
any non-zero return value as an error and therefore never calls
__receive_sock() for most file descriptors, which is a bug.

To fix the bug in receive_fd_replace() and to avoid the same issue
happening in future callers, signal success through a plain zero.

Suggested-by: Al Viro &lt;viro@zeniv.linux.org.uk&gt;
Link: https://lore.kernel.org/lkml/20250801220215.GS222315@ZenIV/
Fixes: 173817151b15 ("fs: Expand __receive_fd() to accept existing fd")
Fixes: 42eb0d54c08a ("fs: split receive_fd_replace from __receive_fd")
Cc: stable@vger.kernel.org
Signed-off-by: Thomas Weißschuh &lt;thomas.weissschuh@linutronix.de&gt;
Link: https://lore.kernel.org/20250805-fix-receive_fd_replace-v3-1-b72ba8b34bac@linutronix.de
Signed-off-by: Christian Brauner &lt;brauner@kernel.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</content>
</entry>
<entry>
<title>alloc_fdtable(): change calling conventions.</title>
<updated>2025-08-28T14:28:50+00:00</updated>
<author>
<name>Al Viro</name>
<email>viro@zeniv.linux.org.uk</email>
</author>
<published>2024-08-07T02:14:07+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=40b36d9a612b8c099b639fe5a10059ad3638d9d1'/>
<id>urn:sha1:40b36d9a612b8c099b639fe5a10059ad3638d9d1</id>
<content type='text'>
[ Upstream commit 1d3b4bec3ce55e0c46cdce7d0402dbd6b4af3a3d ]

First of all, tell it how many slots do we want, not which slot
is wanted.  It makes one caller (dup_fd()) more straightforward
and doesn't harm another (expand_fdtable()).

Furthermore, make it return ERR_PTR() on failure rather than
returning NULL.  Simplifies the callers.

Simplify the size calculation, while we are at it - note that we
always have slots_wanted greater than BITS_PER_LONG.  What the
rules boil down to is
	* use the smallest power of two large enough to give us
that many slots
	* on 32bit skip 64 and 128 - the minimal capacity we want
there is 256 slots (i.e. 1Kb fd array).
	* on 64bit don't skip anything, the minimal capacity is
128 - and we'll never be asked for 64 or less.  128 slots means
1Kb fd array, again.
	* on 128bit, if that ever happens, don't skip anything -
we'll never be asked for 128 or less, so the fd array allocation
will be at least 2Kb.

Reviewed-by: Christian Brauner &lt;brauner@kernel.org&gt;
Signed-off-by: Al Viro &lt;viro@zeniv.linux.org.uk&gt;
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</content>
</entry>
<entry>
<title>fs: Prevent file descriptor table allocations exceeding INT_MAX</title>
<updated>2025-08-28T14:28:12+00:00</updated>
<author>
<name>Sasha Levin</name>
<email>sashal@kernel.org</email>
</author>
<published>2025-06-29T07:40:21+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=237e416eb62101f21b28c9e6e564d10efe1ecc6f'/>
<id>urn:sha1:237e416eb62101f21b28c9e6e564d10efe1ecc6f</id>
<content type='text'>
commit 04a2c4b4511d186b0fce685da21085a5d4acd370 upstream.

When sysctl_nr_open is set to a very high value (for example, 1073741816
as set by systemd), processes attempting to use file descriptors near
the limit can trigger massive memory allocation attempts that exceed
INT_MAX, resulting in a WARNING in mm/slub.c:

  WARNING: CPU: 0 PID: 44 at mm/slub.c:5027 __kvmalloc_node_noprof+0x21a/0x288

This happens because kvmalloc_array() and kvmalloc() check if the
requested size exceeds INT_MAX and emit a warning when the allocation is
not flagged with __GFP_NOWARN.

Specifically, when nr_open is set to 1073741816 (0x3ffffff8) and a
process calls dup2(oldfd, 1073741880), the kernel attempts to allocate:
- File descriptor array: 1073741880 * 8 bytes = 8,589,935,040 bytes
- Multiple bitmaps: ~400MB
- Total allocation size: &gt; 8GB (exceeding INT_MAX = 2,147,483,647)

Reproducer:
1. Set /proc/sys/fs/nr_open to 1073741816:
   # echo 1073741816 &gt; /proc/sys/fs/nr_open

2. Run a program that uses a high file descriptor:
   #include &lt;unistd.h&gt;
   #include &lt;sys/resource.h&gt;

   int main() {
       struct rlimit rlim = {1073741824, 1073741824};
       setrlimit(RLIMIT_NOFILE, &amp;rlim);
       dup2(2, 1073741880);  // Triggers the warning
       return 0;
   }

3. Observe WARNING in dmesg at mm/slub.c:5027

systemd commit a8b627a introduced automatic bumping of fs.nr_open to the
maximum possible value. The rationale was that systems with memory
control groups (memcg) no longer need separate file descriptor limits
since memory is properly accounted. However, this change overlooked
that:

1. The kernel's allocation functions still enforce INT_MAX as a maximum
   size regardless of memcg accounting
2. Programs and tests that legitimately test file descriptor limits can
   inadvertently trigger massive allocations
3. The resulting allocations (&gt;8GB) are impractical and will always fail

systemd's algorithm starts with INT_MAX and keeps halving the value
until the kernel accepts it. On most systems, this results in nr_open
being set to 1073741816 (0x3ffffff8), which is just under 1GB of file
descriptors.

While processes rarely use file descriptors near this limit in normal
operation, certain selftests (like
tools/testing/selftests/core/unshare_test.c) and programs that test file
descriptor limits can trigger this issue.

Fix this by adding a check in alloc_fdtable() to ensure the requested
allocation size does not exceed INT_MAX. This causes the operation to
fail with -EMFILE instead of triggering a kernel warning and avoids the
impractical &gt;8GB memory allocation request.

Fixes: 9cfe015aa424 ("get rid of NR_OPEN and introduce a sysctl_nr_open")
Cc: stable@vger.kernel.org
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
Link: https://lore.kernel.org/20250629074021.1038845-1-sashal@kernel.org
Signed-off-by: Christian Brauner &lt;brauner@kernel.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</content>
</entry>
<entry>
<title>fs: consistently deref the files table with rcu_dereference_raw()</title>
<updated>2025-04-25T08:45:10+00:00</updated>
<author>
<name>Mateusz Guzik</name>
<email>mjguzik@gmail.com</email>
</author>
<published>2025-03-13T13:57:25+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=323960a187a9622d9006484ff6635680d3a0118f'/>
<id>urn:sha1:323960a187a9622d9006484ff6635680d3a0118f</id>
<content type='text'>
[ Upstream commit f381640e1bd4f2de7ccafbfe8703d33c3718aad9 ]

... except when the table is known to be only used by one thread.

A file pointer can get installed at any moment despite the -&gt;file_lock
being held since the following:
8a81252b774b53e6 ("fs/file.c: don't acquire files-&gt;file_lock in fd_install()")

Accesses subject to such a race can in principle suffer load tearing.

While here redo the comment in dup_fd -- it only covered a race against
files showing up, still assuming fd_install() takes the lock.

Signed-off-by: Mateusz Guzik &lt;mjguzik@gmail.com&gt;
Link: https://lore.kernel.org/r/20250313135725.1320914-1-mjguzik@gmail.com
Signed-off-by: Christian Brauner &lt;brauner@kernel.org&gt;
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</content>
</entry>
<entry>
<title>fs: fix missing declaration of init_files</title>
<updated>2025-01-23T16:21:13+00:00</updated>
<author>
<name>Zhang Kunbo</name>
<email>zhangkunbo@huawei.com</email>
</author>
<published>2024-12-17T07:18:36+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=62861a5d4dd60e06b4c86d67e35219efc2428833'/>
<id>urn:sha1:62861a5d4dd60e06b4c86d67e35219efc2428833</id>
<content type='text'>
[ Upstream commit 2b2fc0be98a828cf33a88a28e9745e8599fb05cf ]

fs/file.c should include include/linux/init_task.h  for
 declaration of init_files. This fixes the sparse warning:

fs/file.c:501:21: warning: symbol 'init_files' was not declared. Should it be static?

Signed-off-by: Zhang Kunbo &lt;zhangkunbo@huawei.com&gt;
Link: https://lore.kernel.org/r/20241217071836.2634868-1-zhangkunbo@huawei.com
Signed-off-by: Christian Brauner &lt;brauner@kernel.org&gt;
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</content>
</entry>
<entry>
<title>close_range(): fix the logics in descriptor table trimming</title>
<updated>2024-10-10T09:58:00+00:00</updated>
<author>
<name>Al Viro</name>
<email>viro@zeniv.linux.org.uk</email>
</author>
<published>2024-08-16T19:17:00+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=a8023f8b55988da64ce245f6bada754c806a834a'/>
<id>urn:sha1:a8023f8b55988da64ce245f6bada754c806a834a</id>
<content type='text'>
commit 678379e1d4f7443b170939525d3312cfc37bf86b upstream.

Cloning a descriptor table picks the size that would cover all currently
opened files.  That's fine for clone() and unshare(), but for close_range()
there's an additional twist - we clone before we close, and it would be
a shame to have
	close_range(3, ~0U, CLOSE_RANGE_UNSHARE)
leave us with a huge descriptor table when we are not going to keep
anything past stderr, just because some large file descriptor used to
be open before our call has taken it out.

Unfortunately, it had been dealt with in an inherently racy way -
sane_fdtable_size() gets a "don't copy anything past that" argument
(passed via unshare_fd() and dup_fd()), close_range() decides how much
should be trimmed and passes that to unshare_fd().

The problem is, a range that used to extend to the end of descriptor
table back when close_range() had looked at it might very well have stuff
grown after it by the time dup_fd() has allocated a new files_struct
and started to figure out the capacity of fdtable to be attached to that.

That leads to interesting pathological cases; at the very least it's a
QoI issue, since unshare(CLONE_FILES) is atomic in a sense that it takes
a snapshot of descriptor table one might have observed at some point.
Since CLOSE_RANGE_UNSHARE close_range() is supposed to be a combination
of unshare(CLONE_FILES) with plain close_range(), ending up with a
weird state that would never occur with unshare(2) is confusing, to put
it mildly.

It's not hard to get rid of - all it takes is passing both ends of the
range down to sane_fdtable_size().  There we are under -&gt;files_lock,
so the race is trivially avoided.

So we do the following:
	* switch close_files() from calling unshare_fd() to calling
dup_fd().
	* undo the calling convention change done to unshare_fd() in
60997c3d45d9 "close_range: add CLOSE_RANGE_UNSHARE"
	* introduce struct fd_range, pass a pointer to that to dup_fd()
and sane_fdtable_size() instead of "trim everything past that point"
they are currently getting.  NULL means "we are not going to be punching
any holes"; NR_OPEN_MAX is gone.
	* make sane_fdtable_size() use find_last_bit() instead of
open-coding it; it's easier to follow that way.
	* while we are at it, have dup_fd() report errors by returning
ERR_PTR(), no need to use a separate int *errorp argument.

Fixes: 60997c3d45d9 "close_range: add CLOSE_RANGE_UNSHARE"
Cc: stable@vger.kernel.org
Signed-off-by: Al Viro &lt;viro@zeniv.linux.org.uk&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</content>
</entry>
<entry>
<title>fix bitmap corruption on close_range() with CLOSE_RANGE_UNSHARE</title>
<updated>2024-08-29T15:33:14+00:00</updated>
<author>
<name>Al Viro</name>
<email>viro@zeniv.linux.org.uk</email>
</author>
<published>2024-08-03T22:02:00+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=dd72ae8b0fce9c0bbe9582b9b50820f0407f8d8a'/>
<id>urn:sha1:dd72ae8b0fce9c0bbe9582b9b50820f0407f8d8a</id>
<content type='text'>
commit 9a2fa1472083580b6c66bdaf291f591e1170123a upstream.

copy_fd_bitmaps(new, old, count) is expected to copy the first
count/BITS_PER_LONG bits from old-&gt;full_fds_bits[] and fill
the rest with zeroes.  What it does is copying enough words
(BITS_TO_LONGS(count/BITS_PER_LONG)), then memsets the rest.
That works fine, *if* all bits past the cutoff point are
clear.  Otherwise we are risking garbage from the last word
we'd copied.

For most of the callers that is true - expand_fdtable() has
count equal to old-&gt;max_fds, so there's no open descriptors
past count, let alone fully occupied words in -&gt;open_fds[],
which is what bits in -&gt;full_fds_bits[] correspond to.

The other caller (dup_fd()) passes sane_fdtable_size(old_fdt, max_fds),
which is the smallest multiple of BITS_PER_LONG that covers all
opened descriptors below max_fds.  In the common case (copying on
fork()) max_fds is ~0U, so all opened descriptors will be below
it and we are fine, by the same reasons why the call in expand_fdtable()
is safe.

Unfortunately, there is a case where max_fds is less than that
and where we might, indeed, end up with junk in -&gt;full_fds_bits[] -
close_range(from, to, CLOSE_RANGE_UNSHARE) with
	* descriptor table being currently shared
	* 'to' being above the current capacity of descriptor table
	* 'from' being just under some chunk of opened descriptors.
In that case we end up with observably wrong behaviour - e.g. spawn
a child with CLONE_FILES, get all descriptors in range 0..127 open,
then close_range(64, ~0U, CLOSE_RANGE_UNSHARE) and watch dup(0) ending
up with descriptor #128, despite #64 being observably not open.

The minimally invasive fix would be to deal with that in dup_fd().
If this proves to add measurable overhead, we can go that way, but
let's try to fix copy_fd_bitmaps() first.

* new helper: bitmap_copy_and_expand(to, from, bits_to_copy, size).
* make copy_fd_bitmaps() take the bitmap size in words, rather than
bits; it's 'count' argument is always a multiple of BITS_PER_LONG,
so we are not losing any information, and that way we can use the
same helper for all three bitmaps - compiler will see that count
is a multiple of BITS_PER_LONG for the large ones, so it'll generate
plain memcpy()+memset().

Reproducer added to tools/testing/selftests/core/close_range_test.c

Cc: stable@vger.kernel.org
Signed-off-by: Al Viro &lt;viro@zeniv.linux.org.uk&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</content>
</entry>
<entry>
<title>protect the fetch of -&gt;fd[fd] in do_dup2() from mispredictions</title>
<updated>2024-08-11T10:47:25+00:00</updated>
<author>
<name>Al Viro</name>
<email>viro@zeniv.linux.org.uk</email>
</author>
<published>2024-08-01T19:22:22+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=da72e783afd27d9f487836b2e6738146c0edd149'/>
<id>urn:sha1:da72e783afd27d9f487836b2e6738146c0edd149</id>
<content type='text'>
commit 8aa37bde1a7b645816cda8b80df4753ecf172bf1 upstream.

both callers have verified that fd is not greater than -&gt;max_fds;
however, misprediction might end up with
        tofree = fdt-&gt;fd[fd];
being speculatively executed.  That's wrong for the same reasons
why it's wrong in close_fd()/file_close_fd_locked(); the same
solution applies - array_index_nospec(fd, fdt-&gt;max_fds) could differ
from fd only in case of speculative execution on mispredicted path.

Cc: stable@vger.kernel.org
Signed-off-by: Al Viro &lt;viro@zeniv.linux.org.uk&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</content>
</entry>
<entry>
<title>fs/file: fix the check in find_next_fd()</title>
<updated>2024-07-25T07:50:45+00:00</updated>
<author>
<name>Yuntao Wang</name>
<email>yuntao.wang@linux.dev</email>
</author>
<published>2024-05-29T16:06:56+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=bd56b910fa94fbbc7d0fa521de50c77145b47373'/>
<id>urn:sha1:bd56b910fa94fbbc7d0fa521de50c77145b47373</id>
<content type='text'>
[ Upstream commit ed8c7fbdfe117abbef81f65428ba263118ef298a ]

The maximum possible return value of find_next_zero_bit(fdt-&gt;full_fds_bits,
maxbit, bitbit) is maxbit. This return value, multiplied by BITS_PER_LONG,
gives the value of bitbit, which can never be greater than maxfd, it can
only be equal to maxfd at most, so the following check 'if (bitbit &gt; maxfd)'
will never be true.

Moreover, when bitbit equals maxfd, it indicates that there are no unused
fds, and the function can directly return.

Fix this check.

Signed-off-by: Yuntao Wang &lt;yuntao.wang@linux.dev&gt;
Link: https://lore.kernel.org/r/20240529160656.209352-1-yuntao.wang@linux.dev
Reviewed-by: Jan Kara &lt;jack@suse.cz&gt;
Signed-off-by: Christian Brauner &lt;brauner@kernel.org&gt;
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</content>
</entry>
<entry>
<title>Merge tag 'v6.6-vfs.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs</title>
<updated>2023-08-28T17:17:14+00:00</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2023-08-28T17:17:14+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=de16588a7737b12e63ec646d72b45befb2b1f8f7'/>
<id>urn:sha1:de16588a7737b12e63ec646d72b45befb2b1f8f7</id>
<content type='text'>
Pull misc vfs updates from Christian Brauner:
 "This contains the usual miscellaneous features, cleanups, and fixes
  for vfs and individual filesystems.

  Features:

   - Block mode changes on symlinks and rectify our broken semantics

   - Report file modifications via fsnotify() for splice

   - Allow specifying an explicit timeout for the "rootwait" kernel
     command line option. This allows to timeout and reboot instead of
     always waiting indefinitely for the root device to show up

   - Use synchronous fput for the close system call

  Cleanups:

   - Get rid of open-coded lockdep workarounds for async io submitters
     and replace it all with a single consolidated helper

   - Simplify epoll allocation helper

   - Convert simple_write_begin and simple_write_end to use a folio

   - Convert page_cache_pipe_buf_confirm() to use a folio

   - Simplify __range_close to avoid pointless locking

   - Disable per-cpu buffer head cache for isolated cpus

   - Port ecryptfs to kmap_local_page() api

   - Remove redundant initialization of pointer buf in pipe code

   - Unexport the d_genocide() function which is only used within core
     vfs

   - Replace printk(KERN_ERR) and WARN_ON() with WARN()

  Fixes:

   - Fix various kernel-doc issues

   - Fix refcount underflow for eventfds when used as EFD_SEMAPHORE

   - Fix a mainly theoretical issue in devpts

   - Check the return value of __getblk() in reiserfs

   - Fix a racy assert in i_readcount_dec

   - Fix integer conversion issues in various functions

   - Fix LSM security context handling during automounts that prevented
     NFS superblock sharing"

* tag 'v6.6-vfs.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (39 commits)
  cachefiles: use kiocb_{start,end}_write() helpers
  ovl: use kiocb_{start,end}_write() helpers
  aio: use kiocb_{start,end}_write() helpers
  io_uring: use kiocb_{start,end}_write() helpers
  fs: create kiocb_{start,end}_write() helpers
  fs: add kerneldoc to file_{start,end}_write() helpers
  io_uring: rename kiocb_end_write() local helper
  splice: Convert page_cache_pipe_buf_confirm() to use a folio
  libfs: Convert simple_write_begin and simple_write_end to use a folio
  fs/dcache: Replace printk and WARN_ON by WARN
  fs/pipe: remove redundant initialization of pointer buf
  fs: Fix kernel-doc warnings
  devpts: Fix kernel-doc warnings
  doc: idmappings: fix an error and rephrase a paragraph
  init: Add support for rootwait timeout parameter
  vfs: fix up the assert in i_readcount_dec
  fs: Fix one kernel-doc comment
  docs: filesystems: idmappings: clarify from where idmappings are taken
  fs/buffer.c: disable per-CPU buffer_head cache for isolated CPUs
  vfs, security: Fix automount superblock LSM init problem, preventing NFS sb sharing
  ...
</content>
</entry>
</feed>
