kernel/linux.git/fs/ceph, branch v6.18.33

ceph: fix BUG_ON in __ceph_build_xattrs_blob() due to stale blob size

2026-05-23T11:07:18+00:00

commit 0c22d9511cbde746622f8e4c11aaa63fe76d45f9 upstream. The generic/642 test-case can reproduce the kernel crash: [40243.605254] ------------[ cut here ]------------ [40243.605956] kernel BUG at fs/ceph/xattr.c:918! [40243.607142] Oops: invalid opcode: 0000 [#1] SMP PTI [40243.608067] CPU: 7 UID: 0 PID: 498762 Comm: kworker/7:1 Not tainted 7.0.0-rc7+ #3 PREEMPT(full) [40243.609700] Hardware name: QEMU Ubuntu 25.10 PC v2 (i440FX + PIIX, + 10.1 machine, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014 [40243.611820] Workqueue: ceph-msgr ceph_con_workfn [40243.612715] RIP: 0010:__ceph_build_xattrs_blob+0x1b8/0x1e0 [40243.613731] Code: 0f 84 82 fe ff ff e9 cf 8e 56 ff 48 8d 65 e8 31 c0 5b 41 5c 41 5d 5d 31 d2 31 c9 31 f6 31 ff 45 31 c0 45 31 c9 c3 cc cc cc cc <0f> 0b 4c 8b 62 08 41 8b 85 24 07 00 00 49 83 c4 04 41 89 44 24 fc [40243.616888] RSP: 0018:ffffcc80c4d4b688 EFLAGS: 00010287 [40243.617773] RAX: 0000000000010026 RBX: 0000000000000001 RCX: 0000000000000000 [40243.618928] RDX: ffff8a773798dee0 RSI: 0000000000000000 RDI: 0000000000000000 [40243.620158] RBP: ffffcc80c4d4b6a0 R08: 0000000000000000 R09: 0000000000000000 [40243.621573] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8a75f3b58000 [40243.622907] R13: ffff8a75f3b58000 R14: 0000000000000080 R15: 000000000000bffd [40243.624054] FS: 0000000000000000(0000) GS:ffff8a787d1b4000(0000) knlGS:0000000000000000 [40243.625331] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [40243.626269] CR2: 000072f390b623c0 CR3: 000000011c02a003 CR4: 0000000000372ef0 [40243.627408] Call Trace: [40243.627839] [40243.628188] __prep_cap+0x3fd/0x4a0 [40243.628789] ? do_raw_spin_unlock+0x4e/0xe0 [40243.629474] ceph_check_caps+0x46a/0xc80 [40243.630094] ? __lock_acquire+0x4a2/0x2650 [40243.630773] ? find_held_lock+0x31/0x90 [40243.631347] ? handle_cap_grant+0x79f/0x1060 [40243.632068] ? lock_release+0xd9/0x300 [40243.632696] ? __mutex_unlock_slowpath+0x3e/0x340 [40243.633429] ? lock_release+0xd9/0x300 [40243.634052] handle_cap_grant+0xcf6/0x1060 [40243.634745] ceph_handle_caps+0x122b/0x2110 [40243.635415] mds_dispatch+0x5bd/0x2160 [40243.636034] ? ceph_con_process_message+0x65/0x190 [40243.636828] ? lock_release+0xd9/0x300 [40243.637431] ceph_con_process_message+0x7a/0x190 [40243.638184] ? kfree+0x311/0x4f0 [40243.638749] ? kfree+0x311/0x4f0 [40243.639268] process_message+0x16/0x1a0 [40243.639915] ? sg_free_table+0x39/0x90 [40243.640572] ceph_con_v2_try_read+0xf58/0x2120 [40243.641255] ? lock_acquire+0xc8/0x300 [40243.641863] ceph_con_workfn+0x151/0x820 [40243.642493] process_one_work+0x22f/0x630 [40243.643093] ? process_one_work+0x254/0x630 [40243.643770] worker_thread+0x1e2/0x400 [40243.644332] ? __pfx_worker_thread+0x10/0x10 [40243.645020] kthread+0x109/0x140 [40243.645560] ? __pfx_kthread+0x10/0x10 [40243.646125] ret_from_fork+0x3f8/0x480 [40243.646752] ? __pfx_kthread+0x10/0x10 [40243.647316] ? __pfx_kthread+0x10/0x10 [40243.647919] ret_from_fork_asm+0x1a/0x30 [40243.648556] [40243.648902] Modules linked in: overlay hctr2 libpolyval chacha libchacha adiantum libnh libpoly1305 essiv intel_rapl_msr intel_rapl_common intel_uncore_frequency_common skx_edac_common nfit kvm_intel kvm irqbypass joydev ghash_clmulni_intel aesni_intel rapl input_leds mac_hid psmouse vga16fb serio_raw vgastate floppy i2c_piix4 pata_acpi bochs qemu_fw_cfg i2c_smbus sch_fq_codel rbd dm_crypt msr parport_pc ppdev lp parport efi_pstore [40243.654766] ---[ end trace 0000000000000000 ]--- Commit d93231a6bc8a ("ceph: prevent a client from exceeding the MDS maximum xattr size") moved the required_blob_size computation to before the __build_xattrs() call, introducing a race. __build_xattrs() releases and reacquires i_ceph_lock during execution. In that window, handle_cap_grant() may update i_xattrs.blob with a newer MDS-provided blob and bump i_xattrs.version. When __build_xattrs() detects that index_version < version, it destroys and rebuilds the entire xattr rb-tree from the new blob, potentially increasing count, names_size, and vals_size. The prealloc_blob size check that follows still uses the stale required_blob_size computed before the rebuild, so it passes even when prealloc_blob is too small for the now-larger tree. After __set_xattr() adds one more xattr on top, __ceph_build_xattrs_blob() is called from the cap flush path and hits: BUG_ON(need > ci->i_xattrs.prealloc_blob->alloc_len); Fix this by recomputing required_blob_size after __build_xattrs() returns, using the current tree state. Also re-validate against m_max_xattr_size to fall back to the sync path if the rebuilt tree now exceeds the MDS limit. Cc: stable@vger.kernel.org Fixes: d93231a6bc8a ("ceph: prevent a client from exceeding the MDS maximum xattr size") Signed-off-by: Viacheslav Dubeyko Reviewed-by: Alex Markuze Signed-off-by: Ilya Dryomov Signed-off-by: Greg Kroah-Hartman

ceph: fix a buffer leak in __ceph_setxattr()

2026-05-23T11:07:18+00:00

commit 5d3cc36b4e77a27ce7b686b7c59c7072bcb3fa8e upstream. The old_blob in __ceph_setxattr() can store ci->i_xattrs.prealloc_blob value during the retry. However, it is never called the ceph_buffer_put() for the old_blob object. This patch fixes the issue of the buffer leak. Cc: stable@vger.kernel.org Signed-off-by: Viacheslav Dubeyko Reviewed-by: Alex Markuze Signed-off-by: Ilya Dryomov Signed-off-by: Greg Kroah-Hartman

ceph: fix num_ops off-by-one when crypto allocation fails

2026-05-14T13:30:11+00:00

commit a0d9555bf9eaeba34fe6b6bb86f442fe08ba3842 upstream. move_dirty_folio_in_page_array() may fail if the file is encrypted, the dirty folio is not the first in the batch, and it fails to allocate a bounce buffer to hold the ciphertext. When that happens, ceph_process_folio_batch() simply redirties the folio and flushes the current batch -- it can retry that folio in a future batch. However, if this failed folio is not contiguous with the last folio that did make it into the batch, then ceph_process_folio_batch() has already incremented `ceph_wbc->num_ops`; because it doesn't follow through and add the discontiguous folio to the array, ceph_submit_write() -- which expects that `ceph_wbc->num_ops` accurately reflects the number of contiguous ranges (and therefore the required number of "write extent" ops) in the writeback -- will panic the kernel: BUG_ON(ceph_wbc->op_idx + 1 != req->r_num_ops); This issue can be reproduced on affected kernels by writing to fscrypt-enabled CephFS file(s) with a 4KiB-written/4KiB-skipped/repeat pattern (total filesize should not matter) and gradually increasing the system's memory pressure until a bounce buffer allocation fails. Fix this crash by decrementing `ceph_wbc->num_ops` back to the correct value when move_dirty_folio_in_page_array() fails, but the folio already started counting a new (i.e. still-empty) extent. The defect corrected by this patch has existed since 2022 (see first `Fixes:`), but another bug blocked multi-folio encrypted writeback until recently (see second `Fixes:`). The second commit made it into 6.18.16, 6.19.6, and 7.0-rc1, unmasking the panic in those versions. This patch therefore fixes a regression (panic) introduced by cac190c7674f. Cc: stable@vger.kernel.org Fixes: d55207717ded ("ceph: add encryption support to writepage and writepages") Fixes: cac190c7674f ("ceph: fix write storm on fscrypted files") Signed-off-by: Sam Edwards Reviewed-by: Viacheslav Dubeyko Signed-off-by: Ilya Dryomov Signed-off-by: Sasha Levin

ceph: only d_add() negative dentries when they are unhashed

2026-05-07T04:11:58+00:00

commit 803447f93d75ab6e40c85e6d12b5630d281d70d6 upstream. Ceph can call d_add(dentry, NULL) on a negative dentry that is already present in the primary dcache hash. In the current VFS that is not safe. d_add() goes through __d_add() to __d_rehash(), which unconditionally reinserts dentry->d_hash into the hlist_bl bucket. If the dentry is already hashed, reinserting the same node can corrupt the bucket, including creating a self-loop. Once that happens, __d_lookup() can spin forever in the hlist_bl walk, typically looping only on the d_name.hash mismatch check and eventually triggering RCU stall reports like this one: rcu: INFO: rcu_sched self-detected stall on CPU rcu: 87-....: (2100 ticks this GP) idle=3a4c/1/0x4000000000000000 softirq=25003319/25003319 fqs=829 rcu: (t=2101 jiffies g=79058445 q=698988 ncpus=192) CPU: 87 UID: 2952868916 PID: 3933303 Comm: php-cgi8.3 Not tainted 6.18.17-i1-amd #950 NONE Hardware name: Dell Inc. PowerEdge R7615/0G9DHV, BIOS 1.6.6 09/22/2023 RIP: 0010:__d_lookup+0x46/0xb0 Code: c1 e8 07 48 8d 04 c2 48 8b 00 49 89 fc 49 89 f5 48 89 c3 48 83 e3 fe 48 83 f8 01 77 0f eb 2d 0f 1f 44 00 00 48 8b 1b 48 85 db <74> 20 39 6b 18 75 f3 48 8d 7b 78 e8 ba 85 d0 00 4c 39 63 10 74 1f RSP: 0018:ff745a70c8253898 EFLAGS: 00000282 RAX: ff26e470054cb208 RBX: ff26e470054cb208 RCX: 000000006e958966 RDX: ff26e48267340000 RSI: ff745a70c82539b0 RDI: ff26e458f74655c0 RBP: 000000006e958966 R08: 0000000000000180 R09: 9cd08d909b919a89 R10: ff26e458f74655c0 R11: 0000000000000000 R12: ff26e458f74655c0 R13: ff745a70c82539b0 R14: d0d0d0d0d0d0d0d0 R15: 2f2f2f2f2f2f2f2f FS: 00007f5770896980(0000) GS:ff26e482c5d88000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f5764de50c0 CR3: 000000a72abb5001 CR4: 0000000000771ef0 PKRU: 55555554 Call Trace: lookup_fast+0x9f/0x100 walk_component+0x1f/0x150 link_path_walk+0x20e/0x3d0 path_lookupat+0x68/0x180 filename_lookup+0xdc/0x1e0 vfs_statx+0x6c/0x140 vfs_fstatat+0x67/0xa0 __do_sys_newfstatat+0x24/0x60 do_syscall_64+0x6a/0x230 entry_SYSCALL_64_after_hwframe+0x76/0x7e This is reachable with reused cached negative dentries. A Ceph lookup or atomic_open can be handed a negative dentry that is already hashed, and fs/ceph/dir.c then hits one of two paths that incorrectly assume "negative" also means "unhashed": - ceph_finish_lookup(): MDS reply is -ENOENT with no trace -> d_add(dentry, NULL) - ceph_lookup(): local ENOENT fast path for a complete directory with shared caps -> d_add(dentry, NULL) Both paths can therefore re-add an already-hashed negative dentry. Ceph already uses the correct pattern elsewhere: ceph_fill_trace() only calls d_add(dn, NULL) for a negative null-dentry reply when d_unhashed(dn) is true. Fix both fs/ceph/dir.c sites the same way: only call d_add() for a negative dentry when it is actually unhashed. If the negative dentry is already hashed, leave it in place and reuse it as-is. This preserves the existing behavior for unhashed dentries while avoiding d_hash list corruption for reused hashed negatives. Cc: stable@vger.kernel.org Fixes: 2817b000b02c ("ceph: directory operations") Signed-off-by: Max Kellermann Reviewed-by: Viacheslav Dubeyko Signed-off-by: Ilya Dryomov Signed-off-by: Greg Kroah-Hartman

ceph: fix memory leaks in ceph_mdsc_build_path()

2026-03-19T15:08:30+00:00

commit 040d159a45ded7f33201421a81df0aa2a86e5a0b upstream. Add __putname() calls to error code paths that did not free the "path" pointer obtained by __getname(). If ownership of this pointer is not passed to the caller via path_info.path, the function must free it before returning. Cc: stable@vger.kernel.org Fixes: 3fd945a79e14 ("ceph: encode encrypted name in ceph_mdsc_build_path and dentry release") Fixes: 550f7ca98ee0 ("ceph: give up on paths longer than PATH_MAX") Signed-off-by: Max Kellermann Reviewed-by: Viacheslav Dubeyko Signed-off-by: Ilya Dryomov Signed-off-by: Greg Kroah-Hartman

ceph: do not skip the first folio of the next object in writeback

2026-03-19T15:08:30+00:00

commit 081a0b78ef30f5746cda3e92e28b4d4ae92901d1 upstream. When `ceph_process_folio_batch` encounters a folio past the end of the current object, it should leave it in the batch so that it is picked up in the next iteration. Removing the folio from the batch means that it does not get written back and remains dirty instead. This makes `fsync()` silently skip some of the data, delays capability release, and breaks coherence with `O_DIRECT`. The link below contains instructions for reproducing the bug. Cc: stable@vger.kernel.org Fixes: ce80b76dd327 ("ceph: introduce ceph_process_folio_batch() method") Link: https://tracker.ceph.com/issues/75156 Signed-off-by: Hristo Venev Reviewed-by: Viacheslav Dubeyko Signed-off-by: Ilya Dryomov Signed-off-by: Greg Kroah-Hartman

ceph: fix i_nlink underrun during async unlink

2026-03-19T15:08:30+00:00

commit ce0123cbb4a40a2f1bbb815f292b26e96088639f upstream. During async unlink, we drop the `i_nlink` counter before we receive the completion (that will eventually update the `i_nlink`) because "we assume that the unlink will succeed". That is not a bad idea, but it races against deletions by other clients (or against the completion of our own unlink) and can lead to an underrun which emits a WARNING like this one: WARNING: CPU: 85 PID: 25093 at fs/inode.c:407 drop_nlink+0x50/0x68 Modules linked in: CPU: 85 UID: 3221252029 PID: 25093 Comm: php-cgi8.1 Not tainted 6.14.11-cm4all1-ampere #655 Hardware name: Supermicro ARS-110M-NR/R12SPD-A, BIOS 1.1b 10/17/2023 pstate: 60400009 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--) pc : drop_nlink+0x50/0x68 lr : ceph_unlink+0x6c4/0x720 sp : ffff80012173bc90 x29: ffff80012173bc90 x28: ffff086d0a45aaf8 x27: ffff0871d0eb5680 x26: ffff087f2a64a718 x25: 0000020000000180 x24: 0000000061c88647 x23: 0000000000000002 x22: ffff07ff9236d800 x21: 0000000000001203 x20: ffff07ff9237b000 x19: ffff088b8296afc0 x18: 00000000f3c93365 x17: 0000000000070000 x16: ffff08faffcbdfe8 x15: ffff08faffcbdfec x14: 0000000000000000 x13: 45445f65645f3037 x12: 34385f6369706f74 x11: 0000a2653104bb20 x10: ffffd85f26d73290 x9 : ffffd85f25664f94 x8 : 00000000000000c0 x7 : 0000000000000000 x6 : 0000000000000002 x5 : 0000000000000081 x4 : 0000000000000481 x3 : 0000000000000000 x2 : 0000000000000000 x1 : 0000000000000000 x0 : ffff08727d3f91e8 Call trace: drop_nlink+0x50/0x68 (P) vfs_unlink+0xb0/0x2e8 do_unlinkat+0x204/0x288 __arm64_sys_unlinkat+0x3c/0x80 invoke_syscall.constprop.0+0x54/0xe8 do_el0_svc+0xa4/0xc8 el0_svc+0x18/0x58 el0t_64_sync_handler+0x104/0x130 el0t_64_sync+0x154/0x158 In ceph_unlink(), a call to ceph_mdsc_submit_request() submits the CEPH_MDS_OP_UNLINK to the MDS, but does not wait for completion. Meanwhile, between this call and the following drop_nlink() call, a worker thread may process a CEPH_CAP_OP_IMPORT, CEPH_CAP_OP_GRANT or just a CEPH_MSG_CLIENT_REPLY (the latter of which could be our own completion). These will lead to a set_nlink() call, updating the `i_nlink` counter to the value received from the MDS. If that new `i_nlink` value happens to be zero, it is illegal to decrement it further. But that is exactly what ceph_unlink() will do then. The WARNING can be reproduced this way: 1. Force async unlink; only the async code path is affected. Having no real clue about Ceph internals, I was unable to find out why the MDS wouldn't give me the "Fxr" capabilities, so I patched get_caps_for_async_unlink() to always succeed. (Note that the WARNING dump above was found on an unpatched kernel, without this kludge - this is not a theoretical bug.) 2. Add a sleep call after ceph_mdsc_submit_request() so the unlink completion gets handled by a worker thread before drop_nlink() is called. This guarantees that the `i_nlink` is already zero before drop_nlink() runs. The solution is to skip the counter decrement when it is already zero, but doing so without a lock is still racy (TOCTOU). Since ceph_fill_inode() and handle_cap_grant() both hold the `ceph_inode_info.i_ceph_lock` spinlock while set_nlink() runs, this seems like the proper lock to protect the `i_nlink` updates. I found prior art in NFS and SMB (using `inode.i_lock`) and AFS (using `afs_vnode.cb_lock`). All three have the zero check as well. Cc: stable@vger.kernel.org Fixes: 2ccb45462aea ("ceph: perform asynchronous unlink if we have sufficient caps") Signed-off-by: Max Kellermann Reviewed-by: Viacheslav Dubeyko Signed-off-by: Ilya Dryomov Signed-off-by: Greg Kroah-Hartman

ceph: add a bunch of missing ceph_path_info initializers

2026-03-19T15:08:29+00:00

commit 43323a5934b660afae687e8e4e95ac328615a5c4 upstream. ceph_mdsc_build_path() must be called with a zero-initialized ceph_path_info parameter, or else the following ceph_mdsc_free_path_info() may crash. Example crash (on Linux 6.18.12): virt_to_cache: Object is not a Slab page! WARNING: CPU: 184 PID: 2871736 at mm/slub.c:6732 kmem_cache_free+0x316/0x400 [...] Call Trace: [...] ceph_open+0x13d/0x3e0 do_dentry_open+0x134/0x480 vfs_open+0x2a/0xe0 path_openat+0x9a3/0x1160 [...] cache_from_obj: Wrong slab cache. names_cache but object is from ceph_inode_info WARNING: CPU: 184 PID: 2871736 at mm/slub.c:6746 kmem_cache_free+0x2dd/0x400 [...] kernel BUG at mm/slub.c:634! Oops: invalid opcode: 0000 [#1] SMP NOPTI RIP: 0010:__slab_free+0x1a4/0x350 Some of the ceph_mdsc_build_path() callers had initializers, but others had not, even though they were all added by commit 15f519e9f883 ("ceph: fix race condition validating r_parent before applying state"). The ones without initializer are suspectible to random crashes. (I can imagine it could even be possible to exploit this bug to elevate privileges.) Unfortunately, these Ceph functions are undocumented and its semantics can only be derived from the code. I see that ceph_mdsc_build_path() initializes the structure only on success, but not on error. Calling ceph_mdsc_free_path_info() after a failed ceph_mdsc_build_path() call does not even make sense, but that's what all callers do, and for it to be safe, the structure must be zero-initialized. The least intrusive approach to fix this is therefore to add initializers everywhere. Cc: stable@vger.kernel.org Fixes: 15f519e9f883 ("ceph: fix race condition validating r_parent before applying state") Signed-off-by: Max Kellermann Reviewed-by: Viacheslav Dubeyko Signed-off-by: Ilya Dryomov Signed-off-by: Greg Kroah-Hartman

ceph: fix write storm on fscrypted files

2026-03-04T12:21:28+00:00

[ Upstream commit cac190c7674fea71620d754ffcdaaeed7c551dbc ] CephFS stores file data across multiple RADOS objects. An object is the atomic unit of storage, so the writeback code must clean only folios that belong to the same object with each OSD request. CephFS also supports RAID0-style striping of file contents: if enabled, each object stores multiple unbroken "stripe units" covering different portions of the file; if disabled, a "stripe unit" is simply the whole object. The stripe unit is (usually) reported as the inode's block size. Though the writeback logic could, in principle, lock all dirty folios belonging to the same object, its current design is to lock only a single stripe unit at a time. Ever since this code was first written, it has determined this size by checking the inode's block size. However, the relatively-new fscrypt support needed to reduce the block size for encrypted inodes to the crypto block size (see 'fixes' commit), which causes an unnecessarily high number of write operations (~1024x as many, with 4MiB objects) and correspondingly degraded performance. Fix this (and clarify intent) by using i_layout.stripe_unit directly in ceph_define_write_size() so that encrypted inodes are written back with the same number of operations as if they were unencrypted. This patch depends on the preceding commit ("ceph: do not propagate page array emplacement errors as batch errors") for correctness. While it applies cleanly on its own, applying it alone will introduce a regression. This dependency is only relevant for kernels where ce80b76dd327 ("ceph: introduce ceph_process_folio_batch() method") has been applied; stable kernels without that commit are unaffected. Cc: stable@vger.kernel.org Fixes: 94af0470924c ("ceph: add some fscrypt guardrails") Signed-off-by: Sam Edwards Reviewed-by: Ilya Dryomov Signed-off-by: Ilya Dryomov Signed-off-by: Sasha Levin

ceph: do not propagate page array emplacement errors as batch errors

2026-03-04T12:21:28+00:00

[ Upstream commit 707104682e3c163f7c14cdd6b07a3e95fb374759 ] When fscrypt is enabled, move_dirty_folio_in_page_array() may fail because it needs to allocate bounce buffers to store the encrypted versions of each folio. Each folio beyond the first allocates its bounce buffer with GFP_NOWAIT. Failures are common (and expected) under this allocation mode; they should flush (not abort) the batch. However, ceph_process_folio_batch() uses the same `rc` variable for its own return code and for capturing the return codes of its routine calls; failing to reset `rc` back to 0 results in the error being propagated out to the main writeback loop, which cannot actually tolerate any errors here: once `ceph_wbc.pages` is allocated, it must be passed to ceph_submit_write() to be freed. If it survives until the next iteration (e.g. due to the goto being followed), ceph_allocate_page_array()'s BUG_ON() will oops the worker. Note that this failure mode is currently masked due to another bug (addressed next in this series) that prevents multiple encrypted folios from being selected for the same write. For now, just reset `rc` when redirtying the folio to prevent errors in move_dirty_folio_in_page_array() from propagating. Note that move_dirty_folio_in_page_array() is careful never to return errors on the first folio, so there is no need to check for that. After this change, ceph_process_folio_batch() no longer returns errors; its only remaining failure indicator is `locked_pages == 0`, which the caller already handles correctly. Cc: stable@vger.kernel.org Fixes: ce80b76dd327 ("ceph: introduce ceph_process_folio_batch() method") Signed-off-by: Sam Edwards Reviewed-by: Ilya Dryomov Signed-off-by: Ilya Dryomov Signed-off-by: Sasha Levin