kernel/linux.git/fs/ceph, branch v6.6.141

ceph: fix a buffer leak in __ceph_setxattr()

2026-05-23T11:03:33+00:00

commit 5d3cc36b4e77a27ce7b686b7c59c7072bcb3fa8e upstream. The old_blob in __ceph_setxattr() can store ci->i_xattrs.prealloc_blob value during the retry. However, it is never called the ceph_buffer_put() for the old_blob object. This patch fixes the issue of the buffer leak. Cc: stable@vger.kernel.org Signed-off-by: Viacheslav Dubeyko Reviewed-by: Alex Markuze Signed-off-by: Ilya Dryomov Signed-off-by: Greg Kroah-Hartman

ceph: only d_add() negative dentries when they are unhashed

2026-05-17T15:13:53+00:00

[ Upstream commit 803447f93d75ab6e40c85e6d12b5630d281d70d6 ] Ceph can call d_add(dentry, NULL) on a negative dentry that is already present in the primary dcache hash. In the current VFS that is not safe. d_add() goes through __d_add() to __d_rehash(), which unconditionally reinserts dentry->d_hash into the hlist_bl bucket. If the dentry is already hashed, reinserting the same node can corrupt the bucket, including creating a self-loop. Once that happens, __d_lookup() can spin forever in the hlist_bl walk, typically looping only on the d_name.hash mismatch check and eventually triggering RCU stall reports like this one: rcu: INFO: rcu_sched self-detected stall on CPU rcu: 87-....: (2100 ticks this GP) idle=3a4c/1/0x4000000000000000 softirq=25003319/25003319 fqs=829 rcu: (t=2101 jiffies g=79058445 q=698988 ncpus=192) CPU: 87 UID: 2952868916 PID: 3933303 Comm: php-cgi8.3 Not tainted 6.18.17-i1-amd #950 NONE Hardware name: Dell Inc. PowerEdge R7615/0G9DHV, BIOS 1.6.6 09/22/2023 RIP: 0010:__d_lookup+0x46/0xb0 Code: c1 e8 07 48 8d 04 c2 48 8b 00 49 89 fc 49 89 f5 48 89 c3 48 83 e3 fe 48 83 f8 01 77 0f eb 2d 0f 1f 44 00 00 48 8b 1b 48 85 db <74> 20 39 6b 18 75 f3 48 8d 7b 78 e8 ba 85 d0 00 4c 39 63 10 74 1f RSP: 0018:ff745a70c8253898 EFLAGS: 00000282 RAX: ff26e470054cb208 RBX: ff26e470054cb208 RCX: 000000006e958966 RDX: ff26e48267340000 RSI: ff745a70c82539b0 RDI: ff26e458f74655c0 RBP: 000000006e958966 R08: 0000000000000180 R09: 9cd08d909b919a89 R10: ff26e458f74655c0 R11: 0000000000000000 R12: ff26e458f74655c0 R13: ff745a70c82539b0 R14: d0d0d0d0d0d0d0d0 R15: 2f2f2f2f2f2f2f2f FS: 00007f5770896980(0000) GS:ff26e482c5d88000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f5764de50c0 CR3: 000000a72abb5001 CR4: 0000000000771ef0 PKRU: 55555554 Call Trace: lookup_fast+0x9f/0x100 walk_component+0x1f/0x150 link_path_walk+0x20e/0x3d0 path_lookupat+0x68/0x180 filename_lookup+0xdc/0x1e0 vfs_statx+0x6c/0x140 vfs_fstatat+0x67/0xa0 __do_sys_newfstatat+0x24/0x60 do_syscall_64+0x6a/0x230 entry_SYSCALL_64_after_hwframe+0x76/0x7e This is reachable with reused cached negative dentries. A Ceph lookup or atomic_open can be handed a negative dentry that is already hashed, and fs/ceph/dir.c then hits one of two paths that incorrectly assume "negative" also means "unhashed": - ceph_finish_lookup(): MDS reply is -ENOENT with no trace -> d_add(dentry, NULL) - ceph_lookup(): local ENOENT fast path for a complete directory with shared caps -> d_add(dentry, NULL) Both paths can therefore re-add an already-hashed negative dentry. Ceph already uses the correct pattern elsewhere: ceph_fill_trace() only calls d_add(dn, NULL) for a negative null-dentry reply when d_unhashed(dn) is true. Fix both fs/ceph/dir.c sites the same way: only call d_add() for a negative dentry when it is actually unhashed. If the negative dentry is already hashed, leave it in place and reuse it as-is. This preserves the existing behavior for unhashed dentries while avoiding d_hash list corruption for reused hashed negatives. Cc: stable@vger.kernel.org Fixes: 2817b000b02c ("ceph: directory operations") Signed-off-by: Max Kellermann Reviewed-by: Viacheslav Dubeyko Signed-off-by: Ilya Dryomov [ kept existing dout() debug call instead of upstream's doutc() form when adding the d_unhashed() guard around d_add() ] Signed-off-by: Sasha Levin Signed-off-by: Greg Kroah-Hartman

ceph: fix memory leaks in ceph_mdsc_build_path()

2026-03-25T10:05:51+00:00

commit 040d159a45ded7f33201421a81df0aa2a86e5a0b upstream. Add __putname() calls to error code paths that did not free the "path" pointer obtained by __getname(). If ownership of this pointer is not passed to the caller via path_info.path, the function must free it before returning. Cc: stable@vger.kernel.org Fixes: 3fd945a79e14 ("ceph: encode encrypted name in ceph_mdsc_build_path and dentry release") Fixes: 550f7ca98ee0 ("ceph: give up on paths longer than PATH_MAX") Signed-off-by: Max Kellermann Reviewed-by: Viacheslav Dubeyko Signed-off-by: Ilya Dryomov Signed-off-by: Greg Kroah-Hartman

ceph: fix i_nlink underrun during async unlink

2026-03-25T10:05:51+00:00

commit ce0123cbb4a40a2f1bbb815f292b26e96088639f upstream. During async unlink, we drop the `i_nlink` counter before we receive the completion (that will eventually update the `i_nlink`) because "we assume that the unlink will succeed". That is not a bad idea, but it races against deletions by other clients (or against the completion of our own unlink) and can lead to an underrun which emits a WARNING like this one: WARNING: CPU: 85 PID: 25093 at fs/inode.c:407 drop_nlink+0x50/0x68 Modules linked in: CPU: 85 UID: 3221252029 PID: 25093 Comm: php-cgi8.1 Not tainted 6.14.11-cm4all1-ampere #655 Hardware name: Supermicro ARS-110M-NR/R12SPD-A, BIOS 1.1b 10/17/2023 pstate: 60400009 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--) pc : drop_nlink+0x50/0x68 lr : ceph_unlink+0x6c4/0x720 sp : ffff80012173bc90 x29: ffff80012173bc90 x28: ffff086d0a45aaf8 x27: ffff0871d0eb5680 x26: ffff087f2a64a718 x25: 0000020000000180 x24: 0000000061c88647 x23: 0000000000000002 x22: ffff07ff9236d800 x21: 0000000000001203 x20: ffff07ff9237b000 x19: ffff088b8296afc0 x18: 00000000f3c93365 x17: 0000000000070000 x16: ffff08faffcbdfe8 x15: ffff08faffcbdfec x14: 0000000000000000 x13: 45445f65645f3037 x12: 34385f6369706f74 x11: 0000a2653104bb20 x10: ffffd85f26d73290 x9 : ffffd85f25664f94 x8 : 00000000000000c0 x7 : 0000000000000000 x6 : 0000000000000002 x5 : 0000000000000081 x4 : 0000000000000481 x3 : 0000000000000000 x2 : 0000000000000000 x1 : 0000000000000000 x0 : ffff08727d3f91e8 Call trace: drop_nlink+0x50/0x68 (P) vfs_unlink+0xb0/0x2e8 do_unlinkat+0x204/0x288 __arm64_sys_unlinkat+0x3c/0x80 invoke_syscall.constprop.0+0x54/0xe8 do_el0_svc+0xa4/0xc8 el0_svc+0x18/0x58 el0t_64_sync_handler+0x104/0x130 el0t_64_sync+0x154/0x158 In ceph_unlink(), a call to ceph_mdsc_submit_request() submits the CEPH_MDS_OP_UNLINK to the MDS, but does not wait for completion. Meanwhile, between this call and the following drop_nlink() call, a worker thread may process a CEPH_CAP_OP_IMPORT, CEPH_CAP_OP_GRANT or just a CEPH_MSG_CLIENT_REPLY (the latter of which could be our own completion). These will lead to a set_nlink() call, updating the `i_nlink` counter to the value received from the MDS. If that new `i_nlink` value happens to be zero, it is illegal to decrement it further. But that is exactly what ceph_unlink() will do then. The WARNING can be reproduced this way: 1. Force async unlink; only the async code path is affected. Having no real clue about Ceph internals, I was unable to find out why the MDS wouldn't give me the "Fxr" capabilities, so I patched get_caps_for_async_unlink() to always succeed. (Note that the WARNING dump above was found on an unpatched kernel, without this kludge - this is not a theoretical bug.) 2. Add a sleep call after ceph_mdsc_submit_request() so the unlink completion gets handled by a worker thread before drop_nlink() is called. This guarantees that the `i_nlink` is already zero before drop_nlink() runs. The solution is to skip the counter decrement when it is already zero, but doing so without a lock is still racy (TOCTOU). Since ceph_fill_inode() and handle_cap_grant() both hold the `ceph_inode_info.i_ceph_lock` spinlock while set_nlink() runs, this seems like the proper lock to protect the `i_nlink` updates. I found prior art in NFS and SMB (using `inode.i_lock`) and AFS (using `afs_vnode.cb_lock`). All three have the zero check as well. Cc: stable@vger.kernel.org Fixes: 2ccb45462aea ("ceph: perform asynchronous unlink if we have sufficient caps") Signed-off-by: Max Kellermann Reviewed-by: Viacheslav Dubeyko Signed-off-by: Ilya Dryomov Signed-off-by: Greg Kroah-Hartman

ceph: supply snapshot context in ceph_zero_partial_object()

2026-03-04T12:21:16+00:00

[ Upstream commit f16bd3fa74a2084ee7e16a8a2be7e7399b970907 ] The ceph_zero_partial_object function was missing proper snapshot context for its OSD write operations, which could lead to data inconsistencies in snapshots. Reproducer: ../src/vstart.sh --new -x --localhost --bluestore ./bin/ceph auth caps client.fs_a mds 'allow rwps fsname=a' mon 'allow r fsname=a' osd 'allow rw tag cephfs data=a' mount -t ceph fs_a@.a=/ /mnt/mycephfs/ -o conf=./ceph.conf dd if=/dev/urandom of=/mnt/mycephfs/foo bs=64K count=1 mkdir /mnt/mycephfs/.snap/snap1 md5sum /mnt/mycephfs/.snap/snap1/foo fallocate -p -o 0 -l 4096 /mnt/mycephfs/foo echo 3 > /proc/sys/vm/drop/caches md5sum /mnt/mycephfs/.snap/snap1/foo # get different md5sum!! Cc: stable@vger.kernel.org Fixes: ad7a60de882ac ("ceph: punch hole support") Signed-off-by: ethanwu Reviewed-by: Viacheslav Dubeyko Tested-by: Viacheslav Dubeyko Signed-off-by: Ilya Dryomov Signed-off-by: Sasha Levin

ceph: supply snapshot context in ceph_uninline_data()

2026-03-04T12:20:56+00:00

[ Upstream commit 305ff6b3a03c230d3c07b61457e961406d979693 ] The ceph_uninline_data function was missing proper snapshot context handling for its OSD write operations. Both CEPH_OSD_OP_CREATE and CEPH_OSD_OP_WRITE requests were passing NULL instead of the appropriate snapshot context, which could lead to unnecessary object clone. Reproducer: ../src/vstart.sh --new -x --localhost --bluestore // turn on cephfs inline data ./bin/ceph fs set a inline_data true --yes-i-really-really-mean-it // allow fs_a client to take snapshot ./bin/ceph auth caps client.fs_a mds 'allow rwps fsname=a' mon 'allow r fsname=a' osd 'allow rw tag cephfs data=a' // mount cephfs with fuse, since kernel cephfs doesn't support inline write ceph-fuse --id fs_a -m 127.0.0.1:40318 --conf ceph.conf -d /mnt/mycephfs/ // bump snapshot seq mkdir /mnt/mycephfs/.snap/snap1 echo "foo" > /mnt/mycephfs/test // umount and mount it again using kernel cephfs client umount /mnt/mycephfs mount -t ceph fs_a@.a=/ /mnt/mycephfs/ -o conf=./ceph.conf echo "bar" >> /mnt/mycephfs/test ./bin/rados listsnaps -p cephfs.a.data $(printf "%x\n" $(stat -c %i /mnt/mycephfs/test)).00000000 will see this object does unnecessary clone 1000000000a.00000000 (seq:2): cloneid snaps size overlap 2 2 4 [] head - 8 but it's expected to see 10000000000.00000000 (seq:2): cloneid snaps size overlap head - 8 since there's no snapshot between these 2 writes clone happened because the first osd request CEPH_OSD_OP_CREATE doesn't pass snap context so object is created with snap seq 0, but later data writeback is equipped with snapshot context. snap.seq(1) > object snap seq(0), so osd does object clone. This fix properly acquiring the snapshot context before performing write operations. Signed-off-by: ethanwu Reviewed-by: Viacheslav Dubeyko Tested-by: Viacheslav Dubeyko Signed-off-by: Ilya Dryomov Signed-off-by: Sasha Levin

ceph: refactor wake_up_bit() pattern of calling

2025-11-24T09:29:50+00:00

[ Upstream commit 53db6f25ee47cb1265141d31562604e56146919a ] The wake_up_bit() is called in ceph_async_unlink_cb(), wake_async_create_waiters(), and ceph_finish_async_create(). It makes sense to switch on clear_bit() function, because it makes the code much cleaner and easier to understand. More important rework is the adding of smp_mb__after_atomic() memory barrier after the bit modification and before wake_up_bit() call. It can prevent potential race condition of accessing the modified bit in other threads. Luckily, clear_and_wake_up_bit() already implements the required functionality pattern: static inline void clear_and_wake_up_bit(int bit, unsigned long *word) { clear_bit_unlock(bit, word); /* See wake_up_bit() for which memory barrier you need to use. */ smp_mb__after_atomic(); wake_up_bit(word, bit); } Signed-off-by: Viacheslav Dubeyko Reviewed-by: Alex Markuze Signed-off-by: Ilya Dryomov Signed-off-by: Sasha Levin

ceph: add checking of wait_for_completion_killable() return value

2025-11-24T09:29:49+00:00

[ Upstream commit b7ed1e29cfe773d648ca09895b92856bd3a2092d ] The Coverity Scan service has detected the calling of wait_for_completion_killable() without checking the return value in ceph_lock_wait_for_completion() [1]. The CID 1636232 defect contains explanation: "If the function returns an error value, the error value may be mistaken for a normal value. In ceph_lock_wait_for_completion(): Value returned from a function is not checked for errors before being used. (CWE-252)". The patch adds the checking of wait_for_completion_killable() return value and return the error code from ceph_lock_wait_for_completion(). [1] https://scan5.scan.coverity.com/#/project-view/64304/10063?selectedIssue=1636232 Signed-off-by: Viacheslav Dubeyko Reviewed-by: Alex Markuze Signed-off-by: Ilya Dryomov Signed-off-by: Sasha Levin

ceph: fix possible integer overflow in ceph_zero_objects()

2025-07-06T09:00:08+00:00

[ Upstream commit 0abd87942e0c93964e93224836944712feba1d91 ] In 'ceph_zero_objects', promote 'object_size' to 'u64' to avoid possible integer overflow. Compile tested only. Found by Linux Verification Center (linuxtesting.org) with SVACE. Signed-off-by: Dmitry Kandybka Reviewed-by: Viacheslav Dubeyko Signed-off-by: Ilya Dryomov Signed-off-by: Sasha Levin

ceph: set superblock s_magic for IMA fsmagic matching

2025-06-27T10:08:47+00:00

commit 72386d5245b249f5a0a8fabb881df7ad947b8ea4 upstream. The CephFS kernel driver forgets to set the filesystem magic signature in its superblock. As a result, IMA policy rules based on fsmagic matching do not apply as intended. This causes a major performance regression in Talos Linux [1] when mounting CephFS volumes, such as when deploying Rook Ceph [2]. Talos Linux ships a hardened kernel with the following IMA policy (irrelevant lines omitted): [...] dont_measure fsmagic=0xc36400 # CEPH_SUPER_MAGIC [...] measure func=FILE_CHECK mask=^MAY_READ euid=0 measure func=FILE_CHECK mask=^MAY_READ uid=0 [...] Currently, IMA compares 0xc36400 == 0x0 for CephFS files, resulting in all files opened with O_RDONLY or O_RDWR getting measured with SHA512 on every open(2): 10 69990c87e8af323d47e2d6ae4... ima-ng sha512: /data/cephfs/test-file Since O_WRONLY is rare, this results in an order of magnitude lower performance than expected for practically all file operations. Properly setting CEPH_SUPER_MAGIC in the CephFS superblock resolves the regression. Tests performed on a 3x replicated Ceph v19.3.0 cluster across three i5-7200U nodes each equipped with one Micron 7400 MAX M.2 disk (BlueStore) and Gigabit ethernet, on Talos Linux v1.10.2: FS-Mark 3.3 Test: 500 Files, Empty Files/s > Higher Is Better 6.12.27-talos . 16.6 |==== +twelho patch . 208.4 |==================================================== FS-Mark 3.3 Test: 500 Files, 1KB Size Files/s > Higher Is Better 6.12.27-talos . 15.6 |======= +twelho patch . 118.6 |==================================================== FS-Mark 3.3 Test: 500 Files, 32 Sub Dirs, 1MB Size Files/s > Higher Is Better 6.12.27-talos . 12.7 |=============== +twelho patch . 44.7 |===================================================== IO500 [3] 2fcd6d6 results (benchmarks within variance omitted): | IO500 benchmark | 6.12.27-talos | +twelho patch | Speedup | |-------------------|----------------|----------------|-----------| | mdtest-easy-write | 0.018524 kIOPS | 1.135027 kIOPS | 6027.33 % | | mdtest-hard-write | 0.018498 kIOPS | 0.973312 kIOPS | 5161.71 % | | ior-easy-read | 0.064727 GiB/s | 0.155324 GiB/s | 139.97 % | | mdtest-hard-read | 0.018246 kIOPS | 0.780800 kIOPS | 4179.29 % | This applies outside of synthetic benchmarks as well, for example, the time to rsync a 55 MiB directory with ~12k of mostly small files drops from an unusable 10m5s to a reasonable 26s (23x the throughput). [1]: https://www.talos.dev/ [2]: https://www.talos.dev/v1.10/kubernetes-guides/configuration/ceph-with-rook/ [3]: https://github.com/IO500/io500 Cc: stable@vger.kernel.org Signed-off-by: Dennis Marttinen Reviewed-by: Viacheslav Dubeyko Signed-off-by: Ilya Dryomov Signed-off-by: Greg Kroah-Hartman