kernel/linux.git/fs/ceph/addr.c, branch linux-7.1.y

ceph: put folios not suitable for writeback

2026-05-11T08:39:22+00:00

The batch holds references to the folios (see `filemap_get_folios`, `folio_batch_release`), so we need to `folio_put` the folios we remove. Tested on v6.18. Cc: stable@vger.kernel.org Link: https://tracker.ceph.com/issues/74156 Signed-off-by: Hristo Venev Reviewed-by: Viacheslav Dubeyko Signed-off-by: Ilya Dryomov

Merge tag 'ceph-for-7.1-rc1' of https://github.com/ceph/ceph-client

2026-04-24T20:47:19+00:00

Pull ceph updates from Ilya Dryomov: "We have a series from Alex which extends CephFS client metrics with support for per-subvolume data I/O performance and latency tracking (metadata operations aren't included) and a good variety of fixes and cleanups across RBD and CephFS" * tag 'ceph-for-7.1-rc1' of https://github.com/ceph/ceph-client: ceph: add subvolume metrics collection and reporting ceph: parse subvolume_id from InodeStat v9 and store in inode ceph: handle InodeStat v8 versioned field in reply parsing libceph: Fix slab-out-of-bounds access in auth message processing rbd: fix null-ptr-deref when device_add_disk() fails crush: cleanup in crush_do_rule() method ceph: clear s_cap_reconnect when ceph_pagelist_encode_32() fails ceph: only d_add() negative dentries when they are unhashed libceph: update outdated comment in ceph_sock_write_space() libceph: Remove obsolete session key alignment logic ceph: fix num_ops off-by-one when crypto allocation fails libceph: Prevent potential null-ptr-deref in ceph_handle_auth_reply()

ceph: add subvolume metrics collection and reporting

2026-04-21T23:40:23+00:00

Add complete infrastructure for per-subvolume I/O metrics collection and reporting to the MDS. This enables administrators to monitor I/O patterns at the subvolume granularity, which is useful for multi-tenant CephFS deployments. This patch adds: - CEPHFS_FEATURE_SUBVOLUME_METRICS feature flag for MDS negotiation - CEPH_SUBVOLUME_ID_NONE constant (0) for unknown/unset state - Red-black tree based metrics tracker for efficient per-subvolume aggregation with kmem_cache for entry allocations - Wire format encoding matching the MDS C++ AggregatedIOMetrics struct - Integration with the existing CLIENT_METRICS message - Recording of I/O operations from file read/write and writeback paths - Debugfs interfaces for monitoring (metrics/subvolumes, metrics/metric_features) Metrics tracked per subvolume include: - Read/write operation counts - Read/write byte counts - Read/write latency sums (for average calculation) The metrics are periodically sent to the MDS as part of the existing metrics reporting infrastructure when the MDS advertises support for the SUBVOLUME_METRICS feature. CEPH_SUBVOLUME_ID_NONE enforces subvolume_id immutability. Following the FUSE client convention, 0 means unknown/unset. Once an inode has a valid (non-zero) subvolume_id, it should not change during the inode's lifetime. Signed-off-by: Alex Markuze Reviewed-by: Viacheslav Dubeyko Signed-off-by: Ilya Dryomov

ceph: fix num_ops off-by-one when crypto allocation fails

2026-04-21T23:40:22+00:00

move_dirty_folio_in_page_array() may fail if the file is encrypted, the dirty folio is not the first in the batch, and it fails to allocate a bounce buffer to hold the ciphertext. When that happens, ceph_process_folio_batch() simply redirties the folio and flushes the current batch -- it can retry that folio in a future batch. However, if this failed folio is not contiguous with the last folio that did make it into the batch, then ceph_process_folio_batch() has already incremented `ceph_wbc->num_ops`; because it doesn't follow through and add the discontiguous folio to the array, ceph_submit_write() -- which expects that `ceph_wbc->num_ops` accurately reflects the number of contiguous ranges (and therefore the required number of "write extent" ops) in the writeback -- will panic the kernel: BUG_ON(ceph_wbc->op_idx + 1 != req->r_num_ops); This issue can be reproduced on affected kernels by writing to fscrypt-enabled CephFS file(s) with a 4KiB-written/4KiB-skipped/repeat pattern (total filesize should not matter) and gradually increasing the system's memory pressure until a bounce buffer allocation fails. Fix this crash by decrementing `ceph_wbc->num_ops` back to the correct value when move_dirty_folio_in_page_array() fails, but the folio already started counting a new (i.e. still-empty) extent. The defect corrected by this patch has existed since 2022 (see first `Fixes:`), but another bug blocked multi-folio encrypted writeback until recently (see second `Fixes:`). The second commit made it into 6.18.16, 6.19.6, and 7.0-rc1, unmasking the panic in those versions. This patch therefore fixes a regression (panic) introduced by cac190c7674f. Cc: stable@vger.kernel.org Fixes: d55207717ded ("ceph: add encryption support to writepage and writepages") Fixes: cac190c7674f ("ceph: fix write storm on fscrypted files") Signed-off-by: Sam Edwards Reviewed-by: Viacheslav Dubeyko Signed-off-by: Ilya Dryomov

folio_batch: rename pagevec.h to folio_batch.h

2026-04-05T20:53:07+00:00

struct pagevec was removed in commit 1e0877d58b1e ("mm: remove struct pagevec"). Rename include/linux/pagevec.h to reflect reality and update includes tree-wide. Add the new filename to MAINTAINERS explicitly, as it no longer matches the "include/linux/page[-_]*" pattern in MEMORY MANAGEMENT - CORE. Link: https://lkml.kernel.org/r/20260225-pagevec_cleanup-v2-3-716868cc2d11@columbia.edu Signed-off-by: Tal Zussman Acked-by: David Hildenbrand (Arm) Reviewed-by: Jan Kara Acked-by: Zi Yan Reviewed-by: Lorenzo Stoakes (Oracle) Cc: Chris Li Cc: Christian Brauner Cc: Matthew Wilcox (Oracle) Signed-off-by: Andrew Morton

ceph: do not skip the first folio of the next object in writeback

2026-03-09T11:34:40+00:00

When `ceph_process_folio_batch` encounters a folio past the end of the current object, it should leave it in the batch so that it is picked up in the next iteration. Removing the folio from the batch means that it does not get written back and remains dirty instead. This makes `fsync()` silently skip some of the data, delays capability release, and breaks coherence with `O_DIRECT`. The link below contains instructions for reproducing the bug. Cc: stable@vger.kernel.org Fixes: ce80b76dd327 ("ceph: introduce ceph_process_folio_batch() method") Link: https://tracker.ceph.com/issues/75156 Signed-off-by: Hristo Venev Reviewed-by: Viacheslav Dubeyko Signed-off-by: Ilya Dryomov

treewide: Replace kmalloc with kmalloc_obj for non-scalar types

2026-02-21T09:02:28+00:00

This is the result of running the Coccinelle script from scripts/coccinelle/api/kmalloc_objs.cocci. The script is designed to avoid scalar types (which need careful case-by-case checking), and instead replace kmalloc-family calls that allocate struct or union object instances: Single allocations: kmalloc(sizeof(TYPE), ...) are replaced with: kmalloc_obj(TYPE, ...) Array allocations: kmalloc_array(COUNT, sizeof(TYPE), ...) are replaced with: kmalloc_objs(TYPE, COUNT, ...) Flex array allocations: kmalloc(struct_size(PTR, FAM, COUNT), ...) are replaced with: kmalloc_flex(*PTR, FAM, COUNT, ...) (where TYPE may also be *VAR) The resulting allocations no longer return "void *", instead returning "TYPE *". Signed-off-by: Kees Cook

ceph: assert loop invariants in ceph_writepages_start()

2026-02-11T18:14:32+00:00

If `locked_pages` is zero, the page array must not be allocated: ceph_process_folio_batch() uses `locked_pages` to decide when to allocate `pages`, and redundant allocations trigger ceph_allocate_page_array()'s BUG_ON(), resulting in a worker oops (and writeback stall) or even a kernel panic. Consequently, the main loop in ceph_writepages_start() assumes that the lifetime of `pages` is confined to a single iteration. This expectation is currently not clear enough, as evidenced by the recent patch which fixed an oops caused by `pages` persisting into the next loop iteration: - "ceph: do not propagate page array emplacement errors as batch errors" Use an explicit BUG_ON() at the top of the loop to assert the loop's preexisting expectation that `pages` is cleaned up by the previous iteration. Because this is closely tied to `locked_pages`, also make it the previous iteration's responsibility to guarantee its reset, and verify with a second new BUG_ON() instead of handling (and masking) failures to do so. This patch does not change invariants, behavior, or failure modes. The added BUG_ON() lines catch conditions that would already trigger oops, but do so earlier for easier debugging and programmer clarity. Signed-off-by: Sam Edwards Reviewed-by: Ilya Dryomov Signed-off-by: Ilya Dryomov

ceph: remove error return from ceph_process_folio_batch()

2026-02-11T16:52:50+00:00

Following an earlier commit, ceph_process_folio_batch() no longer returns errors because the writeback loop cannot handle them. Since this function already indicates failure to lock any pages by leaving `ceph_wbc.locked_pages == 0`, and the writeback loop has no way to handle abandonment of a locked batch, change the return type of ceph_process_folio_batch() to `void` and remove the pathological goto in the writeback loop. The lack of a return code emphasizes that ceph_process_folio_batch() is designed to be abort-free: that is, once it commits a folio for writeback, it will not later abandon it or propagate an error for that folio. Any future changes requiring "abort" logic should follow this invariant by cleaning up its array and resetting ceph_wbc.locked_pages appropriately. Signed-off-by: Sam Edwards Reviewed-by: Ilya Dryomov Signed-off-by: Ilya Dryomov

ceph: fix write storm on fscrypted files

2026-02-11T16:52:50+00:00

CephFS stores file data across multiple RADOS objects. An object is the atomic unit of storage, so the writeback code must clean only folios that belong to the same object with each OSD request. CephFS also supports RAID0-style striping of file contents: if enabled, each object stores multiple unbroken "stripe units" covering different portions of the file; if disabled, a "stripe unit" is simply the whole object. The stripe unit is (usually) reported as the inode's block size. Though the writeback logic could, in principle, lock all dirty folios belonging to the same object, its current design is to lock only a single stripe unit at a time. Ever since this code was first written, it has determined this size by checking the inode's block size. However, the relatively-new fscrypt support needed to reduce the block size for encrypted inodes to the crypto block size (see 'fixes' commit), which causes an unnecessarily high number of write operations (~1024x as many, with 4MiB objects) and correspondingly degraded performance. Fix this (and clarify intent) by using i_layout.stripe_unit directly in ceph_define_write_size() so that encrypted inodes are written back with the same number of operations as if they were unencrypted. This patch depends on the preceding commit ("ceph: do not propagate page array emplacement errors as batch errors") for correctness. While it applies cleanly on its own, applying it alone will introduce a regression. This dependency is only relevant for kernels where ce80b76dd327 ("ceph: introduce ceph_process_folio_batch() method") has been applied; stable kernels without that commit are unaffected. Cc: stable@vger.kernel.org Fixes: 94af0470924c ("ceph: add some fscrypt guardrails") Signed-off-by: Sam Edwards Reviewed-by: Ilya Dryomov Signed-off-by: Ilya Dryomov