<feed xmlns='http://www.w3.org/2005/Atom'>
<title>kernel/linux.git/fs/ceph/dir.c, branch v6.18.21</title>
<subtitle>Linux kernel stable tree (mirror)</subtitle>
<id>https://git.radix-linux.su/kernel/linux.git/atom?h=v6.18.21</id>
<link rel='self' href='https://git.radix-linux.su/kernel/linux.git/atom?h=v6.18.21'/>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/'/>
<updated>2026-03-19T15:08:30+00:00</updated>
<entry>
<title>ceph: fix i_nlink underrun during async unlink</title>
<updated>2026-03-19T15:08:30+00:00</updated>
<author>
<name>Max Kellermann</name>
<email>max.kellermann@ionos.com</email>
</author>
<published>2025-09-05T21:15:30+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=7db008e85a5d17b64bc5390b828bf457ae91a415'/>
<id>urn:sha1:7db008e85a5d17b64bc5390b828bf457ae91a415</id>
<content type='text'>
commit ce0123cbb4a40a2f1bbb815f292b26e96088639f upstream.

During async unlink, we drop the `i_nlink` counter before we receive
the completion (that will eventually update the `i_nlink`) because "we
assume that the unlink will succeed".  That is not a bad idea, but it
races against deletions by other clients (or against the completion of
our own unlink) and can lead to an underrun which emits a WARNING like
this one:

 WARNING: CPU: 85 PID: 25093 at fs/inode.c:407 drop_nlink+0x50/0x68
 Modules linked in:
 CPU: 85 UID: 3221252029 PID: 25093 Comm: php-cgi8.1 Not tainted 6.14.11-cm4all1-ampere #655
 Hardware name: Supermicro ARS-110M-NR/R12SPD-A, BIOS 1.1b 10/17/2023
 pstate: 60400009 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
 pc : drop_nlink+0x50/0x68
 lr : ceph_unlink+0x6c4/0x720
 sp : ffff80012173bc90
 x29: ffff80012173bc90 x28: ffff086d0a45aaf8 x27: ffff0871d0eb5680
 x26: ffff087f2a64a718 x25: 0000020000000180 x24: 0000000061c88647
 x23: 0000000000000002 x22: ffff07ff9236d800 x21: 0000000000001203
 x20: ffff07ff9237b000 x19: ffff088b8296afc0 x18: 00000000f3c93365
 x17: 0000000000070000 x16: ffff08faffcbdfe8 x15: ffff08faffcbdfec
 x14: 0000000000000000 x13: 45445f65645f3037 x12: 34385f6369706f74
 x11: 0000a2653104bb20 x10: ffffd85f26d73290 x9 : ffffd85f25664f94
 x8 : 00000000000000c0 x7 : 0000000000000000 x6 : 0000000000000002
 x5 : 0000000000000081 x4 : 0000000000000481 x3 : 0000000000000000
 x2 : 0000000000000000 x1 : 0000000000000000 x0 : ffff08727d3f91e8
 Call trace:
  drop_nlink+0x50/0x68 (P)
  vfs_unlink+0xb0/0x2e8
  do_unlinkat+0x204/0x288
  __arm64_sys_unlinkat+0x3c/0x80
  invoke_syscall.constprop.0+0x54/0xe8
  do_el0_svc+0xa4/0xc8
  el0_svc+0x18/0x58
  el0t_64_sync_handler+0x104/0x130
  el0t_64_sync+0x154/0x158

In ceph_unlink(), a call to ceph_mdsc_submit_request() submits the
CEPH_MDS_OP_UNLINK to the MDS, but does not wait for completion.

Meanwhile, between this call and the following drop_nlink() call, a
worker thread may process a CEPH_CAP_OP_IMPORT, CEPH_CAP_OP_GRANT or
just a CEPH_MSG_CLIENT_REPLY (the latter of which could be our own
completion).  These will lead to a set_nlink() call, updating the
`i_nlink` counter to the value received from the MDS.  If that new
`i_nlink` value happens to be zero, it is illegal to decrement it
further.  But that is exactly what ceph_unlink() will do then.

The WARNING can be reproduced this way:

1. Force async unlink; only the async code path is affected.  Having
   no real clue about Ceph internals, I was unable to find out why the
   MDS wouldn't give me the "Fxr" capabilities, so I patched
   get_caps_for_async_unlink() to always succeed.

   (Note that the WARNING dump above was found on an unpatched kernel,
   without this kludge - this is not a theoretical bug.)

2. Add a sleep call after ceph_mdsc_submit_request() so the unlink
   completion gets handled by a worker thread before drop_nlink() is
   called.  This guarantees that the `i_nlink` is already zero before
   drop_nlink() runs.

The solution is to skip the counter decrement when it is already zero,
but doing so without a lock is still racy (TOCTOU).  Since
ceph_fill_inode() and handle_cap_grant() both hold the
`ceph_inode_info.i_ceph_lock` spinlock while set_nlink() runs, this
seems like the proper lock to protect the `i_nlink` updates.

I found prior art in NFS and SMB (using `inode.i_lock`) and AFS (using
`afs_vnode.cb_lock`).  All three have the zero check as well.

Cc: stable@vger.kernel.org
Fixes: 2ccb45462aea ("ceph: perform asynchronous unlink if we have sufficient caps")
Signed-off-by: Max Kellermann &lt;max.kellermann@ionos.com&gt;
Reviewed-by: Viacheslav Dubeyko &lt;Slava.Dubeyko@ibm.com&gt;
Signed-off-by: Ilya Dryomov &lt;idryomov@gmail.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</content>
</entry>
<entry>
<title>ceph: add a bunch of missing ceph_path_info initializers</title>
<updated>2026-03-19T15:08:29+00:00</updated>
<author>
<name>Max Kellermann</name>
<email>max.kellermann@ionos.com</email>
</author>
<published>2026-02-24T13:10:29+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=8be8911f590813e6f90bc6407ced1b23e50bc5da'/>
<id>urn:sha1:8be8911f590813e6f90bc6407ced1b23e50bc5da</id>
<content type='text'>
commit 43323a5934b660afae687e8e4e95ac328615a5c4 upstream.

ceph_mdsc_build_path() must be called with a zero-initialized
ceph_path_info parameter, or else the following
ceph_mdsc_free_path_info() may crash.

Example crash (on Linux 6.18.12):

  virt_to_cache: Object is not a Slab page!
  WARNING: CPU: 184 PID: 2871736 at mm/slub.c:6732 kmem_cache_free+0x316/0x400
  [...]
  Call Trace:
   [...]
   ceph_open+0x13d/0x3e0
   do_dentry_open+0x134/0x480
   vfs_open+0x2a/0xe0
   path_openat+0x9a3/0x1160
  [...]
  cache_from_obj: Wrong slab cache. names_cache but object is from ceph_inode_info
  WARNING: CPU: 184 PID: 2871736 at mm/slub.c:6746 kmem_cache_free+0x2dd/0x400
  [...]
  kernel BUG at mm/slub.c:634!
  Oops: invalid opcode: 0000 [#1] SMP NOPTI
  RIP: 0010:__slab_free+0x1a4/0x350

Some of the ceph_mdsc_build_path() callers had initializers, but
others had not, even though they were all added by commit 15f519e9f883
("ceph: fix race condition validating r_parent before applying state").
The ones without initializer are suspectible to random crashes.  (I can
imagine it could even be possible to exploit this bug to elevate
privileges.)

Unfortunately, these Ceph functions are undocumented and its semantics
can only be derived from the code.  I see that ceph_mdsc_build_path()
initializes the structure only on success, but not on error.

Calling ceph_mdsc_free_path_info() after a failed
ceph_mdsc_build_path() call does not even make sense, but that's what
all callers do, and for it to be safe, the structure must be
zero-initialized.  The least intrusive approach to fix this is
therefore to add initializers everywhere.

Cc: stable@vger.kernel.org
Fixes: 15f519e9f883 ("ceph: fix race condition validating r_parent before applying state")
Signed-off-by: Max Kellermann &lt;max.kellermann@ionos.com&gt;
Reviewed-by: Viacheslav Dubeyko &lt;Slava.Dubeyko@ibm.com&gt;
Signed-off-by: Ilya Dryomov &lt;idryomov@gmail.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</content>
</entry>
<entry>
<title>ceph: refactor wake_up_bit() pattern of calling</title>
<updated>2025-10-08T21:30:46+00:00</updated>
<author>
<name>Viacheslav Dubeyko</name>
<email>Slava.Dubeyko@ibm.com</email>
</author>
<published>2025-07-08T19:20:57+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=53db6f25ee47cb1265141d31562604e56146919a'/>
<id>urn:sha1:53db6f25ee47cb1265141d31562604e56146919a</id>
<content type='text'>
The wake_up_bit() is called in ceph_async_unlink_cb(),
wake_async_create_waiters(), and ceph_finish_async_create().
It makes sense to switch on clear_bit() function, because
it makes the code much cleaner and easier to understand.
More important rework is the adding of smp_mb__after_atomic()
memory barrier after the bit modification and before
wake_up_bit() call. It can prevent potential race condition
of accessing the modified bit in other threads. Luckily,
clear_and_wake_up_bit() already implements the required
functionality pattern:

static inline void clear_and_wake_up_bit(int bit, unsigned long *word)
{
	clear_bit_unlock(bit, word);
	/* See wake_up_bit() for which memory barrier you need to use. */
	smp_mb__after_atomic();
	wake_up_bit(word, bit);
}

Signed-off-by: Viacheslav Dubeyko &lt;Slava.Dubeyko@ibm.com&gt;
Reviewed-by: Alex Markuze &lt;amarkuze@redhat.com&gt;
Signed-off-by: Ilya Dryomov &lt;idryomov@gmail.com&gt;
</content>
</entry>
<entry>
<title>ceph: fix race condition validating r_parent before applying state</title>
<updated>2025-09-09T10:57:02+00:00</updated>
<author>
<name>Alex Markuze</name>
<email>amarkuze@redhat.com</email>
</author>
<published>2025-08-12T09:57:38+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=15f519e9f883b316d86e2bb6b767a023aafd9d83'/>
<id>urn:sha1:15f519e9f883b316d86e2bb6b767a023aafd9d83</id>
<content type='text'>
Add validation to ensure the cached parent directory inode matches the
directory info in MDS replies. This prevents client-side race conditions
where concurrent operations (e.g. rename) cause r_parent to become stale
between request initiation and reply processing, which could lead to
applying state changes to incorrect directory inodes.

[ idryomov: folded a kerneldoc fixup and a follow-up fix from Alex to
  move CEPH_CAP_PIN reference when r_parent is updated:

  When the parent directory lock is not held, req-&gt;r_parent can become
  stale and is updated to point to the correct inode.  However, the
  associated CEPH_CAP_PIN reference was not being adjusted.  The
  CEPH_CAP_PIN is a reference on an inode that is tracked for
  accounting purposes.  Moving this pin is important to keep the
  accounting balanced. When the pin was not moved from the old parent
  to the new one, it created two problems: The reference on the old,
  stale parent was never released, causing a reference leak.
  A reference for the new parent was never acquired, creating the risk
  of a reference underflow later in ceph_mdsc_release_request().  This
  patch corrects the logic by releasing the pin from the old parent and
  acquiring it for the new parent when r_parent is switched.  This
  ensures reference accounting stays balanced. ]

Cc: stable@vger.kernel.org
Signed-off-by: Alex Markuze &lt;amarkuze@redhat.com&gt;
Reviewed-by: Viacheslav Dubeyko &lt;Slava.Dubeyko@ibm.com&gt;
Signed-off-by: Ilya Dryomov &lt;idryomov@gmail.com&gt;
</content>
</entry>
<entry>
<title>ceph: fix a race with rename() in ceph_mdsc_build_path()</title>
<updated>2025-06-17T21:58:14+00:00</updated>
<author>
<name>Al Viro</name>
<email>viro@zeniv.linux.org.uk</email>
</author>
<published>2025-02-15T04:37:58+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=0d2da2561bdeb459b6c540c2417a15c1f8732e6a'/>
<id>urn:sha1:0d2da2561bdeb459b6c540c2417a15c1f8732e6a</id>
<content type='text'>
Lift copying the name into callers of ceph_encode_encrypted_dname()
that do not have it already copied; ceph_encode_encrypted_fname()
disappears.

That fixes a UAF in ceph_mdsc_build_path() - while the initial copy
of plaintext into buf is done under -&gt;d_lock, we access the
original name again in ceph_encode_encrypted_fname() and that is
done without any locking.  With ceph_encode_encrypted_dname() using
the stable copy the problem goes away.

Tested-by: Viacheslav Dubeyko &lt;Slava.Dubeyko@ibm.com&gt;
Reviewed-by: Viacheslav Dubeyko &lt;Slava.Dubeyko@ibm.com&gt;
Signed-off-by: Al Viro &lt;viro@zeniv.linux.org.uk&gt;
</content>
</entry>
<entry>
<title>Merge tag 'vfs-6.15-rc1.ceph' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs</title>
<updated>2025-03-24T19:17:13+00:00</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2025-03-24T19:17:13+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=e63046adefc03800f1af76476701606bb148b49c'/>
<id>urn:sha1:e63046adefc03800f1af76476701606bb148b49c</id>
<content type='text'>
Pull vfs ceph updates from Christian Brauner:
 "This contains the work to remove access to page-&gt;index from ceph
  and fixes the test failure observed for ceph with generic/421 by
  refactoring ceph_writepages_start()"

* tag 'vfs-6.15-rc1.ceph' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  fscrypt: Change fscrypt_encrypt_pagecache_blocks() to take a folio
  ceph: Fix error handling in fill_readdir_cache()
  fs: Remove page_mkwrite_check_truncate()
  ceph: Pass a folio to ceph_allocate_page_array()
  ceph: Convert ceph_move_dirty_page_in_page_array() to move_dirty_folio_in_page_array()
  ceph: Remove uses of page from ceph_process_folio_batch()
  ceph: Convert ceph_check_page_before_write() to use a folio
  ceph: Convert writepage_nounlock() to write_folio_nounlock()
  ceph: Convert ceph_readdir_cache_control to store a folio
  ceph: Convert ceph_find_incompatible() to take a folio
  ceph: Use a folio in ceph_page_mkwrite()
  ceph: Remove ceph_writepage()
  ceph: fix generic/421 test failure
  ceph: introduce ceph_submit_write() method
  ceph: introduce ceph_process_folio_batch() method
  ceph: extend ceph_writeback_ctl for ceph_writepages_start() refactoring
</content>
</entry>
<entry>
<title>ceph: Convert ceph_readdir_cache_control to store a folio</title>
<updated>2025-02-28T10:21:30+00:00</updated>
<author>
<name>Matthew Wilcox (Oracle)</name>
<email>willy@infradead.org</email>
</author>
<published>2025-02-17T18:51:12+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=baff9740bc8f1a48aba694ac17d5a026b34982de'/>
<id>urn:sha1:baff9740bc8f1a48aba694ac17d5a026b34982de</id>
<content type='text'>
Pass a folio around instead of a page.  This removes an access to
page-&gt;index and a few hidden calls to compound_head().

Signed-off-by: "Matthew Wilcox (Oracle)" &lt;willy@infradead.org&gt;
Link: https://lore.kernel.org/r/20250217185119.430193-5-willy@infradead.org
Tested-by: Viacheslav Dubeyko &lt;Slava.Dubeyko@ibm.com&gt;
Reviewed-by: Viacheslav Dubeyko &lt;Slava.Dubeyko@ibm.com&gt;
Signed-off-by: Christian Brauner &lt;brauner@kernel.org&gt;
</content>
</entry>
<entry>
<title>ceph: return the correct dentry on mkdir</title>
<updated>2025-02-27T19:00:17+00:00</updated>
<author>
<name>NeilBrown</name>
<email>neilb@suse.de</email>
</author>
<published>2025-02-27T01:32:55+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=3f92c7b57687fb24da46487d91dba7aa58aed75b'/>
<id>urn:sha1:3f92c7b57687fb24da46487d91dba7aa58aed75b</id>
<content type='text'>
ceph already splices the correct dentry (in splice_dentry()) from the
result of mkdir but does nothing more with it.

Now that -&gt;mkdir can return a dentry, return the correct dentry.

Note that previously ceph_mkdir() could call
   ceph_init_inode_acls()
on the inode from the wrong dentry, which would be NULL.  This
is safe as ceph_init_inode_acls() checks for NULL, but is not
strictly correct.  With this patch, the inode for the returned dentry
is passed to ceph_init_inode_acls().

Reviewed-by: Viacheslav Dubeyko &lt;Slava.Dubeyko@ibm.com&gt;
Signed-off-by: NeilBrown &lt;neilb@suse.de&gt;
Link: https://lore.kernel.org/r/20250227013949.536172-4-neilb@suse.de
Signed-off-by: Christian Brauner &lt;brauner@kernel.org&gt;
</content>
</entry>
<entry>
<title>Change inode_operations.mkdir to return struct dentry *</title>
<updated>2025-02-27T19:00:17+00:00</updated>
<author>
<name>NeilBrown</name>
<email>neilb@suse.de</email>
</author>
<published>2025-02-27T01:32:53+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=88d5baf69082e5b410296435008329676b687549'/>
<id>urn:sha1:88d5baf69082e5b410296435008329676b687549</id>
<content type='text'>
Some filesystems, such as NFS, cifs, ceph, and fuse, do not have
complete control of sequencing on the actual filesystem (e.g.  on a
different server) and may find that the inode created for a mkdir
request already exists in the icache and dcache by the time the mkdir
request returns.  For example, if the filesystem is mounted twice the
directory could be visible on the other mount before it is on the
original mount, and a pair of name_to_handle_at(), open_by_handle_at()
calls could instantiate the directory inode with an IS_ROOT() dentry
before the first mkdir returns.

This means that the dentry passed to -&gt;mkdir() may not be the one that
is associated with the inode after the -&gt;mkdir() completes.  Some
callers need to interact with the inode after the -&gt;mkdir completes and
they currently need to perform a lookup in the (rare) case that the
dentry is no longer hashed.

This lookup-after-mkdir requires that the directory remains locked to
avoid races.  Planned future patches to lock the dentry rather than the
directory will mean that this lookup cannot be performed atomically with
the mkdir.

To remove this barrier, this patch changes -&gt;mkdir to return the
resulting dentry if it is different from the one passed in.
Possible returns are:
  NULL - the directory was created and no other dentry was used
  ERR_PTR() - an error occurred
  non-NULL - this other dentry was spliced in

This patch only changes file-systems to return "ERR_PTR(err)" instead of
"err" or equivalent transformations.  Subsequent patches will make
further changes to some file-systems to return a correct dentry.

Not all filesystems reliably result in a positive hashed dentry:

- NFS, cifs, hostfs will sometimes need to perform a lookup of
  the name to get inode information.  Races could result in this
  returning something different. Note that this lookup is
  non-atomic which is what we are trying to avoid.  Placing the
  lookup in filesystem code means it only happens when the filesystem
  has no other option.
- kernfs and tracefs leave the dentry negative and the -&gt;revalidate
  operation ensures that lookup will be called to correctly populate
  the dentry.  This could be fixed but I don't think it is important
  to any of the users of vfs_mkdir() which look at the dentry.

The recommendation to use
    d_drop();d_splice_alias()
is ugly but fits with current practice.  A planned future patch will
change this.

Reviewed-by: Jeff Layton &lt;jlayton@kernel.org&gt;
Reviewed-by: Jan Kara &lt;jack@suse.cz&gt;
Signed-off-by: NeilBrown &lt;neilb@suse.de&gt;
Link: https://lore.kernel.org/r/20250227013949.536172-2-neilb@suse.de
Signed-off-by: Christian Brauner &lt;brauner@kernel.org&gt;
</content>
</entry>
<entry>
<title>ceph_d_revalidate(): propagate stable name down into request encoding</title>
<updated>2025-01-28T00:25:23+00:00</updated>
<author>
<name>Al Viro</name>
<email>viro@zeniv.linux.org.uk</email>
</author>
<published>2025-01-07T15:04:11+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=541795cb0be6e29f79deff50831bc6caa72b5f8e'/>
<id>urn:sha1:541795cb0be6e29f79deff50831bc6caa72b5f8e</id>
<content type='text'>
Currently get_fscrypt_altname() requires -&gt;r_dentry-&gt;d_name to be stable
and it gets that in almost all cases.  The only exception is -&gt;d_revalidate(),
where we have a stable name, but it's passed separately - dentry-&gt;d_name
is not stable there.

Propagate it down to get_fscrypt_altname() as a new field of struct
ceph_mds_request - -&gt;r_dname, to be used instead -&gt;r_dentry-&gt;d_name
when non-NULL.

Reviewed-by: Jeff Layton &lt;jlayton@kernel.org&gt;
Signed-off-by: Al Viro &lt;viro@zeniv.linux.org.uk&gt;
</content>
</entry>
</feed>
