<feed xmlns='http://www.w3.org/2005/Atom'>
<title>kernel/linux.git/fs/file.c, branch v6.18.21</title>
<subtitle>Linux kernel stable tree (mirror)</subtitle>
<id>https://git.radix-linux.su/kernel/linux.git/atom?h=v6.18.21</id>
<link rel='self' href='https://git.radix-linux.su/kernel/linux.git/atom?h=v6.18.21'/>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/'/>
<updated>2025-08-11T12:52:25+00:00</updated>
<entry>
<title>fs: always return zero on success from replace_fd()</title>
<updated>2025-08-11T12:52:25+00:00</updated>
<author>
<name>Thomas Weißschuh</name>
<email>thomas.weissschuh@linutronix.de</email>
</author>
<published>2025-08-05T12:38:08+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=708c04a5c2b78e22f56e2350de41feba74dfccd9'/>
<id>urn:sha1:708c04a5c2b78e22f56e2350de41feba74dfccd9</id>
<content type='text'>
replace_fd() returns the number of the new file descriptor through the
return value of do_dup2(). However its callers never care about the
specific returned number. In fact the caller in receive_fd_replace() treats
any non-zero return value as an error and therefore never calls
__receive_sock() for most file descriptors, which is a bug.

To fix the bug in receive_fd_replace() and to avoid the same issue
happening in future callers, signal success through a plain zero.

Suggested-by: Al Viro &lt;viro@zeniv.linux.org.uk&gt;
Link: https://lore.kernel.org/lkml/20250801220215.GS222315@ZenIV/
Fixes: 173817151b15 ("fs: Expand __receive_fd() to accept existing fd")
Fixes: 42eb0d54c08a ("fs: split receive_fd_replace from __receive_fd")
Cc: stable@vger.kernel.org
Signed-off-by: Thomas Weißschuh &lt;thomas.weissschuh@linutronix.de&gt;
Link: https://lore.kernel.org/20250805-fix-receive_fd_replace-v3-1-b72ba8b34bac@linutronix.de
Signed-off-by: Christian Brauner &lt;brauner@kernel.org&gt;
</content>
</entry>
<entry>
<title>Merge tag 'vfs-6.17-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs</title>
<updated>2025-07-28T18:22:56+00:00</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2025-07-28T18:22:56+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=7879d7aff0ffd969fcb1a59e3f87ebb353e47b7f'/>
<id>urn:sha1:7879d7aff0ffd969fcb1a59e3f87ebb353e47b7f</id>
<content type='text'>
Pull misc VFS updates from Christian Brauner:
 "This contains the usual selections of misc updates for this cycle.

  Features:

   - Add ext4 IOCB_DONTCACHE support

     This refactors the address_space_operations write_begin() and
     write_end() callbacks to take const struct kiocb * as their first
     argument, allowing IOCB flags such as IOCB_DONTCACHE to propagate
     to the filesystem's buffered I/O path.

     Ext4 is updated to implement handling of the IOCB_DONTCACHE flag
     and advertises support via the FOP_DONTCACHE file operation flag.

     Additionally, the i915 driver's shmem write paths are updated to
     bypass the legacy write_begin/write_end interface in favor of
     directly calling write_iter() with a constructed synchronous kiocb.
     Another i915 change replaces a manual write loop with
     kernel_write() during GEM shmem object creation.

  Cleanups:

   - don't duplicate vfs_open() in kernel_file_open()

   - proc_fd_getattr(): don't bother with S_ISDIR() check

   - fs/ecryptfs: replace snprintf with sysfs_emit in show function

   - vfs: Remove unnecessary list_for_each_entry_safe() from
     evict_inodes()

   - filelock: add new locks_wake_up_waiter() helper

   - fs: Remove three arguments from block_write_end()

   - VFS: change old_dir and new_dir in struct renamedata to dentrys

   - netfs: Remove unused declaration netfs_queue_write_request()

  Fixes:

   - eventpoll: Fix semi-unbounded recursion

   - eventpoll: fix sphinx documentation build warning

   - fs/read_write: Fix spelling typo

   - fs: annotate data race between poll_schedule_timeout() and
     pollwake()

   - fs/pipe: set FMODE_NOWAIT in create_pipe_files()

   - docs/vfs: update references to i_mutex to i_rwsem

   - fs/buffer: remove comment about hard sectorsize

   - fs/buffer: remove the min and max limit checks in __getblk_slow()

   - fs/libfs: don't assume blocksize &lt;= PAGE_SIZE in
     generic_check_addressable

   - fs_context: fix parameter name in infofc() macro

   - fs: Prevent file descriptor table allocations exceeding INT_MAX"

* tag 'vfs-6.17-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (24 commits)
  netfs: Remove unused declaration netfs_queue_write_request()
  eventpoll: fix sphinx documentation build warning
  ext4: support uncached buffered I/O
  mm/pagemap: add write_begin_get_folio() helper function
  fs: change write_begin/write_end interface to take struct kiocb *
  drm/i915: Refactor shmem_pwrite() to use kiocb and write_iter
  drm/i915: Use kernel_write() in shmem object create
  eventpoll: Fix semi-unbounded recursion
  vfs: Remove unnecessary list_for_each_entry_safe() from evict_inodes()
  fs/libfs: don't assume blocksize &lt;= PAGE_SIZE in generic_check_addressable
  fs/buffer: remove the min and max limit checks in __getblk_slow()
  fs: Prevent file descriptor table allocations exceeding INT_MAX
  fs: Remove three arguments from block_write_end()
  fs/ecryptfs: replace snprintf with sysfs_emit in show function
  fs: annotate suspected data race between poll_schedule_timeout() and pollwake()
  docs/vfs: update references to i_mutex to i_rwsem
  fs/buffer: remove comment about hard sectorsize
  fs_context: fix parameter name in infofc() macro
  VFS: change old_dir and new_dir in struct renamedata to dentrys
  proc_fd_getattr(): don't bother with S_ISDIR() check
  ...
</content>
</entry>
<entry>
<title>fs: Prevent file descriptor table allocations exceeding INT_MAX</title>
<updated>2025-07-08T14:48:12+00:00</updated>
<author>
<name>Sasha Levin</name>
<email>sashal@kernel.org</email>
</author>
<published>2025-06-29T07:40:21+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=04a2c4b4511d186b0fce685da21085a5d4acd370'/>
<id>urn:sha1:04a2c4b4511d186b0fce685da21085a5d4acd370</id>
<content type='text'>
When sysctl_nr_open is set to a very high value (for example, 1073741816
as set by systemd), processes attempting to use file descriptors near
the limit can trigger massive memory allocation attempts that exceed
INT_MAX, resulting in a WARNING in mm/slub.c:

  WARNING: CPU: 0 PID: 44 at mm/slub.c:5027 __kvmalloc_node_noprof+0x21a/0x288

This happens because kvmalloc_array() and kvmalloc() check if the
requested size exceeds INT_MAX and emit a warning when the allocation is
not flagged with __GFP_NOWARN.

Specifically, when nr_open is set to 1073741816 (0x3ffffff8) and a
process calls dup2(oldfd, 1073741880), the kernel attempts to allocate:
- File descriptor array: 1073741880 * 8 bytes = 8,589,935,040 bytes
- Multiple bitmaps: ~400MB
- Total allocation size: &gt; 8GB (exceeding INT_MAX = 2,147,483,647)

Reproducer:
1. Set /proc/sys/fs/nr_open to 1073741816:
   # echo 1073741816 &gt; /proc/sys/fs/nr_open

2. Run a program that uses a high file descriptor:
   #include &lt;unistd.h&gt;
   #include &lt;sys/resource.h&gt;

   int main() {
       struct rlimit rlim = {1073741824, 1073741824};
       setrlimit(RLIMIT_NOFILE, &amp;rlim);
       dup2(2, 1073741880);  // Triggers the warning
       return 0;
   }

3. Observe WARNING in dmesg at mm/slub.c:5027

systemd commit a8b627a introduced automatic bumping of fs.nr_open to the
maximum possible value. The rationale was that systems with memory
control groups (memcg) no longer need separate file descriptor limits
since memory is properly accounted. However, this change overlooked
that:

1. The kernel's allocation functions still enforce INT_MAX as a maximum
   size regardless of memcg accounting
2. Programs and tests that legitimately test file descriptor limits can
   inadvertently trigger massive allocations
3. The resulting allocations (&gt;8GB) are impractical and will always fail

systemd's algorithm starts with INT_MAX and keeps halving the value
until the kernel accepts it. On most systems, this results in nr_open
being set to 1073741816 (0x3ffffff8), which is just under 1GB of file
descriptors.

While processes rarely use file descriptors near this limit in normal
operation, certain selftests (like
tools/testing/selftests/core/unshare_test.c) and programs that test file
descriptor limits can trigger this issue.

Fix this by adding a check in alloc_fdtable() to ensure the requested
allocation size does not exceed INT_MAX. This causes the operation to
fail with -EMFILE instead of triggering a kernel warning and avoids the
impractical &gt;8GB memory allocation request.

Fixes: 9cfe015aa424 ("get rid of NR_OPEN and introduce a sysctl_nr_open")
Cc: stable@vger.kernel.org
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
Link: https://lore.kernel.org/20250629074021.1038845-1-sashal@kernel.org
Signed-off-by: Christian Brauner &lt;brauner@kernel.org&gt;
</content>
</entry>
<entry>
<title>fs: drop assert in file_seek_cur_needs_f_lock</title>
<updated>2025-06-16T07:59:24+00:00</updated>
<author>
<name>Luis Henriques</name>
<email>luis@igalia.com</email>
</author>
<published>2025-06-13T10:11:11+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=dd2d6b7f6f519d078a866a36a625b0297d81c5bc'/>
<id>urn:sha1:dd2d6b7f6f519d078a866a36a625b0297d81c5bc</id>
<content type='text'>
The assert in function file_seek_cur_needs_f_lock() can be triggered very
easily because there are many users of vfs_llseek() (such as overlayfs)
that do their custom locking around llseek instead of relying on
fdget_pos(). Just drop the overzealous assertion.

Fixes: da06e3c51794 ("fs: don't needlessly acquire f_lock")
Suggested-by: Jan Kara &lt;jack@suse.cz&gt;
Suggested-by: Mateusz Guzik &lt;mjguzik@gmail.com&gt;
Signed-off-by: Luis Henriques &lt;luis@igalia.com&gt;
Link: https://lore.kernel.org/20250613101111.17716-1-luis@igalia.com
Signed-off-by: Christian Brauner &lt;brauner@kernel.org&gt;
</content>
</entry>
<entry>
<title>fs: fall back to file_ref_put() for non-last reference</title>
<updated>2025-04-22T16:16:09+00:00</updated>
<author>
<name>Mateusz Guzik</name>
<email>mjguzik@gmail.com</email>
</author>
<published>2025-04-18T12:57:56+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=d1f7256a5a525a44ac6a81d0a8ff931317b2acbf'/>
<id>urn:sha1:d1f7256a5a525a44ac6a81d0a8ff931317b2acbf</id>
<content type='text'>
This reduces the slowdown in face of multiple callers issuing close on
what turns out to not be the last reference.

Signed-off-by: Mateusz Guzik &lt;mjguzik@gmail.com&gt;
Link: https://lore.kernel.org/20250418125756.59677-1-mjguzik@gmail.com
Reviewed-by: Jan Kara &lt;jack@suse.cz&gt;
Reported-by: kernel test robot &lt;oliver.sang@intel.com&gt;
Closes: https://lore.kernel.org/oe-lkp/202504171513.6d6f8a16-lkp@intel.com
Signed-off-by: Christian Brauner &lt;brauner@kernel.org&gt;
</content>
</entry>
<entry>
<title>Merge tag 'vfs-6.15-rc1.file' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs</title>
<updated>2025-03-24T20:19:17+00:00</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2025-03-24T20:19:17+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=912b82dc0b27abc407c831e74fbcbdebfe19997b'/>
<id>urn:sha1:912b82dc0b27abc407c831e74fbcbdebfe19997b</id>
<content type='text'>
Pull vfs file handling updates from Christian Brauner:
 "This contains performance improvements for struct file's new refcount
  mechanism and various other performance work:

   - The stock kernel transitioning the file to no refs held penalizes
     the caller with an extra atomic to block any increments. For cases
     where the file is highly likely to be going away this is easily
     avoidable.

     Add file_ref_put_close() to better handle the common case where
     closing a file descriptor also operates on the last reference and
     build fput_close_sync() and fput_close() on top of it. This brings
     about 1% performance improvement by eliding one atomic in the
     common case.

   - Predict no error in close() since the vast majority of the time
     system call returns 0.

   - Reduce the work done in fdget_pos() by predicting that the file was
     found and by explicitly comparing the reference count to one and
     ignoring the dead zone"

* tag 'vfs-6.15-rc1.file' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  fs: reduce work in fdget_pos()
  fs: use fput_close() in path_openat()
  fs: use fput_close() in filp_close()
  fs: use fput_close_sync() in close()
  file: add fput and file_ref_put routines optimized for use when closing a fd
  fs: predict no error in close()
</content>
</entry>
<entry>
<title>fs: sort out fd allocation vs dup2 race commentary, take 2</title>
<updated>2025-03-20T14:17:56+00:00</updated>
<author>
<name>Mateusz Guzik</name>
<email>mjguzik@gmail.com</email>
</author>
<published>2025-03-20T10:26:37+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=4dec4f91359c456a5eea26817ea151b42953432e'/>
<id>urn:sha1:4dec4f91359c456a5eea26817ea151b42953432e</id>
<content type='text'>
fd_install() has a questionable comment above it.

While it correctly points out a possible race against dup2(), it states:
&gt; We need to detect this and fput() the struct file we are about to
&gt; overwrite in this case.
&gt;
&gt; It should never happen - if we allow dup2() do it, _really_ bad things
&gt; will follow.

I have difficulty parsing the above. The first sentence would suggest
fd_install() tries to detect and recover from the race (it does not),
the next one claims the race needs to be dealt with (it is, by dup2()).

Given that fd_install() does not suffer the burden, this patch removes
the above and instead expands on the race in dup2() commentary.

While here tidy up the docs around fd_install().

Signed-off-by: Mateusz Guzik &lt;mjguzik@gmail.com&gt;
Link: https://lore.kernel.org/r/20250320102637.1924183-1-mjguzik@gmail.com
Signed-off-by: Christian Brauner &lt;brauner@kernel.org&gt;
</content>
</entry>
<entry>
<title>fs: reduce work in fdget_pos()</title>
<updated>2025-03-20T08:45:39+00:00</updated>
<author>
<name>Mateusz Guzik</name>
<email>mjguzik@gmail.com</email>
</author>
<published>2025-03-19T21:58:01+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=5370b43e4bcf60049bfd44db83ba8d2ec43d0152'/>
<id>urn:sha1:5370b43e4bcf60049bfd44db83ba8d2ec43d0152</id>
<content type='text'>
1. predict the file was found
2. explicitly compare the ref to "one", ignoring the dead zone

The latter arguably improves the behavior to begin with. Suppose the
count turned bad -- the previously used ref routine is going to check
for it and return 0, indicating the count does not necessitate taking
-&gt;f_pos_lock. But there very well may be several users.

i.e. not paying for special-casing the dead zone improves semantics.

While here spell out each condition in a dedicated if statement. This
has no effect on generated code.

Sizes are as follows (in bytes; gcc 13, x86-64):
stock:		321
likely(): 	298
likely()+ref:	280

Signed-off-by: Mateusz Guzik &lt;mjguzik@gmail.com&gt;
Link: https://lore.kernel.org/r/20250319215801.1870660-1-mjguzik@gmail.com
Signed-off-by: Christian Brauner &lt;brauner@kernel.org&gt;
</content>
</entry>
<entry>
<title>fs: consistently deref the files table with rcu_dereference_raw()</title>
<updated>2025-03-18T14:34:27+00:00</updated>
<author>
<name>Mateusz Guzik</name>
<email>mjguzik@gmail.com</email>
</author>
<published>2025-03-13T13:57:25+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=f381640e1bd4f2de7ccafbfe8703d33c3718aad9'/>
<id>urn:sha1:f381640e1bd4f2de7ccafbfe8703d33c3718aad9</id>
<content type='text'>
... except when the table is known to be only used by one thread.

A file pointer can get installed at any moment despite the -&gt;file_lock
being held since the following:
8a81252b774b53e6 ("fs/file.c: don't acquire files-&gt;file_lock in fd_install()")

Accesses subject to such a race can in principle suffer load tearing.

While here redo the comment in dup_fd -- it only covered a race against
files showing up, still assuming fd_install() takes the lock.

Signed-off-by: Mateusz Guzik &lt;mjguzik@gmail.com&gt;
Link: https://lore.kernel.org/r/20250313135725.1320914-1-mjguzik@gmail.com
Signed-off-by: Christian Brauner &lt;brauner@kernel.org&gt;
</content>
</entry>
<entry>
<title>fs: use debug-only asserts around fd allocation and install</title>
<updated>2025-03-13T08:54:15+00:00</updated>
<author>
<name>Mateusz Guzik</name>
<email>mjguzik@gmail.com</email>
</author>
<published>2025-03-12T16:19:41+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=dc530c44cd64f9788ec4b2efa04ee7499a04a3f4'/>
<id>urn:sha1:dc530c44cd64f9788ec4b2efa04ee7499a04a3f4</id>
<content type='text'>
This also restores the check which got removed in 52732bb9abc9ee5b
("fs/file.c: remove sanity_check and add likely/unlikely in alloc_fd()")
for performance reasons -- they no longer apply with a debug-only
variant.

Signed-off-by: Mateusz Guzik &lt;mjguzik@gmail.com&gt;
Link: https://lore.kernel.org/r/20250312161941.1261615-1-mjguzik@gmail.com
Reviewed-by: Jan Kara &lt;jack@suse.cz&gt;
Signed-off-by: Christian Brauner &lt;brauner@kernel.org&gt;
</content>
</entry>
</feed>
