kernel/linux.git/fs/ceph/dir.c, branch v6.6.132

ceph: fix i_nlink underrun during async unlink

2026-03-25T10:05:51+00:00

commit ce0123cbb4a40a2f1bbb815f292b26e96088639f upstream. During async unlink, we drop the `i_nlink` counter before we receive the completion (that will eventually update the `i_nlink`) because "we assume that the unlink will succeed". That is not a bad idea, but it races against deletions by other clients (or against the completion of our own unlink) and can lead to an underrun which emits a WARNING like this one: WARNING: CPU: 85 PID: 25093 at fs/inode.c:407 drop_nlink+0x50/0x68 Modules linked in: CPU: 85 UID: 3221252029 PID: 25093 Comm: php-cgi8.1 Not tainted 6.14.11-cm4all1-ampere #655 Hardware name: Supermicro ARS-110M-NR/R12SPD-A, BIOS 1.1b 10/17/2023 pstate: 60400009 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--) pc : drop_nlink+0x50/0x68 lr : ceph_unlink+0x6c4/0x720 sp : ffff80012173bc90 x29: ffff80012173bc90 x28: ffff086d0a45aaf8 x27: ffff0871d0eb5680 x26: ffff087f2a64a718 x25: 0000020000000180 x24: 0000000061c88647 x23: 0000000000000002 x22: ffff07ff9236d800 x21: 0000000000001203 x20: ffff07ff9237b000 x19: ffff088b8296afc0 x18: 00000000f3c93365 x17: 0000000000070000 x16: ffff08faffcbdfe8 x15: ffff08faffcbdfec x14: 0000000000000000 x13: 45445f65645f3037 x12: 34385f6369706f74 x11: 0000a2653104bb20 x10: ffffd85f26d73290 x9 : ffffd85f25664f94 x8 : 00000000000000c0 x7 : 0000000000000000 x6 : 0000000000000002 x5 : 0000000000000081 x4 : 0000000000000481 x3 : 0000000000000000 x2 : 0000000000000000 x1 : 0000000000000000 x0 : ffff08727d3f91e8 Call trace: drop_nlink+0x50/0x68 (P) vfs_unlink+0xb0/0x2e8 do_unlinkat+0x204/0x288 __arm64_sys_unlinkat+0x3c/0x80 invoke_syscall.constprop.0+0x54/0xe8 do_el0_svc+0xa4/0xc8 el0_svc+0x18/0x58 el0t_64_sync_handler+0x104/0x130 el0t_64_sync+0x154/0x158 In ceph_unlink(), a call to ceph_mdsc_submit_request() submits the CEPH_MDS_OP_UNLINK to the MDS, but does not wait for completion. Meanwhile, between this call and the following drop_nlink() call, a worker thread may process a CEPH_CAP_OP_IMPORT, CEPH_CAP_OP_GRANT or just a CEPH_MSG_CLIENT_REPLY (the latter of which could be our own completion). These will lead to a set_nlink() call, updating the `i_nlink` counter to the value received from the MDS. If that new `i_nlink` value happens to be zero, it is illegal to decrement it further. But that is exactly what ceph_unlink() will do then. The WARNING can be reproduced this way: 1. Force async unlink; only the async code path is affected. Having no real clue about Ceph internals, I was unable to find out why the MDS wouldn't give me the "Fxr" capabilities, so I patched get_caps_for_async_unlink() to always succeed. (Note that the WARNING dump above was found on an unpatched kernel, without this kludge - this is not a theoretical bug.) 2. Add a sleep call after ceph_mdsc_submit_request() so the unlink completion gets handled by a worker thread before drop_nlink() is called. This guarantees that the `i_nlink` is already zero before drop_nlink() runs. The solution is to skip the counter decrement when it is already zero, but doing so without a lock is still racy (TOCTOU). Since ceph_fill_inode() and handle_cap_grant() both hold the `ceph_inode_info.i_ceph_lock` spinlock while set_nlink() runs, this seems like the proper lock to protect the `i_nlink` updates. I found prior art in NFS and SMB (using `inode.i_lock`) and AFS (using `afs_vnode.cb_lock`). All three have the zero check as well. Cc: stable@vger.kernel.org Fixes: 2ccb45462aea ("ceph: perform asynchronous unlink if we have sufficient caps") Signed-off-by: Max Kellermann Reviewed-by: Viacheslav Dubeyko Signed-off-by: Ilya Dryomov Signed-off-by: Greg Kroah-Hartman

ceph: refactor wake_up_bit() pattern of calling

2025-11-24T09:29:50+00:00

[ Upstream commit 53db6f25ee47cb1265141d31562604e56146919a ] The wake_up_bit() is called in ceph_async_unlink_cb(), wake_async_create_waiters(), and ceph_finish_async_create(). It makes sense to switch on clear_bit() function, because it makes the code much cleaner and easier to understand. More important rework is the adding of smp_mb__after_atomic() memory barrier after the bit modification and before wake_up_bit() call. It can prevent potential race condition of accessing the modified bit in other threads. Luckily, clear_and_wake_up_bit() already implements the required functionality pattern: static inline void clear_and_wake_up_bit(int bit, unsigned long *word) { clear_bit_unlock(bit, word); /* See wake_up_bit() for which memory barrier you need to use. */ smp_mb__after_atomic(); wake_up_bit(word, bit); } Signed-off-by: Viacheslav Dubeyko Reviewed-by: Alex Markuze Signed-off-by: Ilya Dryomov Signed-off-by: Sasha Levin

ceph: rename _to_client() to _to_fs_client()

2024-04-27T15:11:29+00:00

[ Upstream commit 5995d90d2d19f337df6a50bcf4699ef053214dac ] We need to covert the inode to ceph_client in the following commit, and will add one new helper for that, here we rename the old helper to _fs_client(). Link: https://tracker.ceph.com/issues/61590 Signed-off-by: Xiubo Li Reviewed-by: Patrick Donnelly Reviewed-by: Milind Changire Signed-off-by: Ilya Dryomov Stable-dep-of: b372e96bd0a3 ("ceph: redirty page before returning AOP_WRITEPAGE_ACTIVATE") Signed-off-by: Sasha Levin

ceph: pass the mdsc to several helpers

2024-04-27T15:11:29+00:00

[ Upstream commit 197b7d792d6aead2e30d4b2c054ffabae2ed73dc ] We will use the 'mdsc' to get the global_id in the following commits. Link: https://tracker.ceph.com/issues/61590 Signed-off-by: Xiubo Li Reviewed-by: Patrick Donnelly Reviewed-by: Milind Changire Signed-off-by: Ilya Dryomov Stable-dep-of: b372e96bd0a3 ("ceph: redirty page before returning AOP_WRITEPAGE_ACTIVATE") Signed-off-by: Sasha Levin

ceph: switch ceph_lookup/atomic_open() to use new fscrypt helper

2023-08-24T09:24:37+00:00

Instead of setting the no-key dentry, use the new fscrypt_prepare_lookup_partial() helper. We still need to mark the directory as incomplete if the directory was just unlocked. In ceph_atomic_open() this fixes a bug where a dentry is incorrectly set with DCACHE_NOKEY_NAME when 'dir' has been evicted but the key is still available (for example, where there's a drop_caches). Signed-off-by: Luís Henriques Reviewed-by: Xiubo Li Reviewed-by: Milind Changire Signed-off-by: Ilya Dryomov

ceph: prevent snapshot creation in encrypted locked directories

2023-08-24T09:24:36+00:00

With snapshot names encryption we can not allow snapshots to be created in locked directories because the names wouldn't be encrypted. This patch forces the directory to be unlocked to allow a snapshot to be created. Signed-off-by: Luís Henriques Reviewed-by: Jeff Layton Reviewed-by: Xiubo Li Reviewed-by: Milind Changire Signed-off-by: Ilya Dryomov

ceph: size handling in MClientRequest, cap updates and inode traces

2023-08-24T09:24:35+00:00

For encrypted inodes, transmit a rounded-up size to the MDS as the normal file size and send the real inode size in fscrypt_file field. Also, fix up creates and truncates to also transmit fscrypt_file. When we get an inode trace from the MDS, grab the fscrypt_file field if the inode is encrypted, and use it to populate the i_size field instead of the regular inode size field. Signed-off-by: Jeff Layton Reviewed-by: Xiubo Li Reviewed-and-tested-by: Luís Henriques Reviewed-by: Milind Changire Signed-off-by: Ilya Dryomov

ceph: mark directory as non-complete after loading key

2023-08-24T09:24:35+00:00

When setting a directory's crypt context, ceph_dir_clear_complete() needs to be called otherwise if it was complete before, any existing (old) dentry will still be valid. This patch adds a wrapper around __fscrypt_prepare_readdir() which will ensure a directory is marked as non-complete if key status changes. [ xiubli: revise commit title per Milind ] Signed-off-by: Luís Henriques Reviewed-by: Xiubo Li Reviewed-by: Milind Changire Signed-off-by: Ilya Dryomov

ceph: add some fscrypt guardrails

2023-08-24T09:24:35+00:00

Add the appropriate calls into fscrypt for various actions, including link, rename, setattr, and the open codepaths. Disable fallocate for encrypted inodes -- hopefully, just for now. If we have an encrypted inode, then the client will need to re-encrypt the contents of the new object. Disable copy offload to or from encrypted inodes. Set i_blkbits to crypto block size for encrypted inodes -- some of the underlying infrastructure for fscrypt relies on i_blkbits being aligned to crypto blocksize. Report STATX_ATTR_ENCRYPTED on encrypted inodes. [ lhenriques: forbid encryption with striped layouts ] Signed-off-by: Jeff Layton Reviewed-by: Xiubo Li Reviewed-and-tested-by: Luís Henriques Reviewed-by: Milind Changire Signed-off-by: Ilya Dryomov

ceph: create symlinks with encrypted and base64-encoded targets

2023-08-24T09:24:35+00:00

When creating symlinks in encrypted directories, encrypt and base64-encode the target with the new inode's key before sending to the MDS. When filling a symlinked inode, base64-decode it into a buffer that we'll keep in ci->i_symlink. When get_link is called, decrypt the buffer into a new one that will hang off i_link. Signed-off-by: Jeff Layton Reviewed-by: Xiubo Li Reviewed-and-tested-by: Luís Henriques Reviewed-by: Milind Changire Signed-off-by: Ilya Dryomov