kernel/linux.git/fs/ceph/super.c, branch v6.19.11

ceph: add trace points to the MDS client

2025-12-10T10:50:54+00:00

This patch adds trace points to the Ceph filesystem MDS client: - request submission (CEPH_MSG_CLIENT_REQUEST) and completion (CEPH_MSG_CLIENT_REPLY) - capabilities (CEPH_MSG_CLIENT_CAPS) These are the central pieces that are useful for analyzing MDS latency/performance problems from the client's perspective. In the long run, all doutc() calls should be replaced with tracepoints. This way, the Ceph filesystem can be traced at any time (without spamming the kernel log). Additionally, trace points can be used in BPF programs (which can even deference the pointer parameters and extract more values). Signed-off-by: Max Kellermann Reviewed-by: Viacheslav Dubeyko Signed-off-by: Ilya Dryomov

libceph: drop started parameter of __ceph_open_session()

2025-11-26T22:29:11+00:00

With the previous commit revamping the timeout handling, started isn't used anymore. It could be taken into account by adjusting the initial value of the timeout, but there is little point as both callers capture the timestamp shortly before calling __ceph_open_session() -- the only thing of note that happens in the interim is taking client->mount_mutex and that isn't expected to take multiple seconds. Signed-off-by: Ilya Dryomov Reviewed-by: Viacheslav Dubeyko

Merge tag 'ceph-for-6.18-rc1' of https://github.com/ceph/ceph-client

2025-10-10T18:30:19+00:00

Pull ceph updates from Ilya Dryomov: - some messenger improvements (Eric and Max) - address an issue (also affected userspace) of incorrect permissions being granted to users who have access to multiple different CephFS instances within the same cluster (Kotresh) - a bunch of assorted CephFS fixes (Slava) * tag 'ceph-for-6.18-rc1' of https://github.com/ceph/ceph-client: ceph: add bug tracking system info to MAINTAINERS ceph: fix multifs mds auth caps issue ceph: cleanup in ceph_alloc_readdir_reply_buffer() ceph: fix potential NULL dereference issue in ceph_fill_trace() libceph: add empty check to ceph_con_get_out_msg() libceph: pass the message pointer instead of loading con->out_msg libceph: make ceph_con_get_out_msg() return the message pointer ceph: fix potential race condition on operations with CEPH_I_ODIRECT flag ceph: refactor wake_up_bit() pattern of calling ceph: fix potential race condition in ceph_ioctl_lazyio() ceph: fix overflowed constant issue in ceph_do_objects_copy() ceph: fix wrong sizeof argument issue in register_session() ceph: add checking of wait_for_completion_killable() return value ceph: make ceph_start_io_*() killable libceph: Use HMAC-SHA256 library instead of crypto_shash

ceph: fix multifs mds auth caps issue

2025-10-08T21:30:47+00:00

The mds auth caps check should also validate the fsname along with the associated caps. Not doing so would result in applying the mds auth caps of one fs on to the other fs in a multifs ceph cluster. The bug causes multiple issues w.r.t user authentication, following is one such example. Steps to Reproduce (on vstart cluster): 1. Create two file systems in a cluster, say 'fsname1' and 'fsname2' 2. Authorize read only permission to the user 'client.usr' on fs 'fsname1' $ceph fs authorize fsname1 client.usr / r 3. Authorize read and write permission to the same user 'client.usr' on fs 'fsname2' $ceph fs authorize fsname2 client.usr / rw 4. Update the keyring $ceph auth get client.usr >> ./keyring With above permssions for the user 'client.usr', following is the expectation. a. The 'client.usr' should be able to only read the contents and not allowed to create or delete files on file system 'fsname1'. b. The 'client.usr' should be able to read/write on file system 'fsname2'. But, with this bug, the 'client.usr' is allowed to read/write on file system 'fsname1'. See below. 5. Mount the file system 'fsname1' with the user 'client.usr' $sudo bin/mount.ceph usr@.fsname1=/ /kmnt_fsname1_usr/ 6. Try creating a file on file system 'fsname1' with user 'client.usr'. This should fail but passes with this bug. $touch /kmnt_fsname1_usr/file1 7. Mount the file system 'fsname1' with the user 'client.admin' and create a file. $sudo bin/mount.ceph admin@.fsname1=/ /kmnt_fsname1_admin $echo "data" > /kmnt_fsname1_admin/admin_file1 8. Try removing an existing file on file system 'fsname1' with the user 'client.usr'. This shoudn't succeed but succeeds with the bug. $rm -f /kmnt_fsname1_usr/admin_file1 For more information, please take a look at the corresponding mds/fuse patch and tests added by looking into the tracker mentioned below. v2: Fix a possible null dereference in doutc v3: Don't store fsname from mdsmap, validate against ceph_mount_options's fsname and use it v4: Code refactor, better warning message and fix possible compiler warning [ Slava.Dubeyko: "fsname check failed" -> "fsname mismatch" ] Link: https://tracker.ceph.com/issues/72167 Signed-off-by: Kotresh HR Reviewed-by: Viacheslav Dubeyko Signed-off-by: Ilya Dryomov

Merge tag 'vfs-6.18-rc1.workqueue' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

2025-09-29T17:27:17+00:00

Pull vfs workqueue updates from Christian Brauner: "This contains various workqueue changes affecting the filesystem layer. Currently if a user enqueue a work item using schedule_delayed_work() the used wq is "system_wq" (per-cpu wq) while queue_delayed_work() use WORK_CPU_UNBOUND (used when a cpu is not specified). The same applies to schedule_work() that is using system_wq and queue_work(), that makes use again of WORK_CPU_UNBOUND. This replaces the use of system_wq and system_unbound_wq. system_wq is a per-CPU workqueue which isn't very obvious from the name and system_unbound_wq is to be used when locality is not required. So this renames system_wq to system_percpu_wq, and system_unbound_wq to system_dfl_wq. This also adds a new WQ_PERCPU flag to allow the fs subsystem users to explicitly request the use of per-CPU behavior. Both WQ_UNBOUND and WQ_PERCPU flags coexist for one release cycle to allow callers to transition their calls. WQ_UNBOUND will be removed in a next release cycle" * tag 'vfs-6.18-rc1.workqueue' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: fs: WQ_PERCPU added to alloc_workqueue users fs: replace use of system_wq with system_percpu_wq fs: replace use of system_unbound_wq with system_dfl_wq

fs: WQ_PERCPU added to alloc_workqueue users

2025-09-19T14:15:07+00:00

Currently if a user enqueue a work item using schedule_delayed_work() the used wq is "system_wq" (per-cpu wq) while queue_delayed_work() use WORK_CPU_UNBOUND (used when a cpu is not specified). The same applies to schedule_work() that is using system_wq and queue_work(), that makes use again of WORK_CPU_UNBOUND. This lack of consistentcy cannot be addressed without refactoring the API. alloc_workqueue() treats all queues as per-CPU by default, while unbound workqueues must opt-in via WQ_UNBOUND. This default is suboptimal: most workloads benefit from unbound queues, allowing the scheduler to place worker threads where they’re needed and reducing noise when CPUs are isolated. This patch adds a new WQ_PERCPU flag to all the fs subsystem users to explicitly request the use of the per-CPU behavior. Both flags coexist for one release cycle to allow callers to transition their calls. Once migration is complete, WQ_UNBOUND can be removed and unbound will become the implicit default. With the introduction of the WQ_PERCPU flag (equivalent to !WQ_UNBOUND), any alloc_workqueue() caller that doesn’t explicitly specify WQ_UNBOUND must now use WQ_PERCPU. All existing users have been updated accordingly. Suggested-by: Tejun Heo Signed-off-by: Marco Crivellari Link: https://lore.kernel.org/20250916082906.77439-4-marco.crivellari@suse.com Signed-off-by: Christian Brauner

fs: rename generic_delete_inode() and generic_drop_inode()

2025-09-15T14:09:42+00:00

generic_delete_inode() is rather misleading for what the routine is doing. inode_just_drop() should be much clearer. The new naming is inconsistent with generic_drop_inode(), so rename that one as well with inode_ as the suffix. No functional changes. Signed-off-by: Mateusz Guzik Reviewed-by: Jan Kara Signed-off-by: Christian Brauner

new helper: set_default_d_op()

2025-06-11T02:21:16+00:00

... to be used instead of manually assigning to ->s_d_op. All in-tree filesystem converted (and field itself is renamed, so any out-of-tree ones in need of conversion will be caught by compiler). Reviewed-by: Christian Brauner Signed-off-by: Al Viro

ceph: fix variable dereferenced before check in ceph_umount_begin()

2025-06-06T09:08:59+00:00

smatch warnings: fs/ceph/super.c:1042 ceph_umount_begin() warn: variable dereferenced before check 'fsc' (see line 1041) vim +/fsc +1042 fs/ceph/super.c void ceph_umount_begin(struct super_block *sb) { struct ceph_fs_client *fsc = ceph_sb_to_fs_client(sb); doutc(fsc->client, "starting forced umount\n"); ^^^^^^^^^^^ Dereferenced if (!fsc) ^^^^ Checked too late. return; fsc->mount_state = CEPH_MOUNT_SHUTDOWN; __ceph_umount_begin(fsc); } The VFS guarantees that the superblock is still alive when it calls into ceph via ->umount_begin(). Finally, we don't need to check the fsc and it should be valid. This patch simply removes the fsc check. Reported-by: kernel test robot Reported-by: Dan Carpenter Closes: https://lore.kernel.org/r/202503280852.YDB3pxUY-lkp@intel.com/ Signed-off-by: Viacheslav Dubeyko Reviewed by: Alex Markuze Signed-off-by: Ilya Dryomov

ceph: set superblock s_magic for IMA fsmagic matching

2025-06-01T15:53:16+00:00

The CephFS kernel driver forgets to set the filesystem magic signature in its superblock. As a result, IMA policy rules based on fsmagic matching do not apply as intended. This causes a major performance regression in Talos Linux [1] when mounting CephFS volumes, such as when deploying Rook Ceph [2]. Talos Linux ships a hardened kernel with the following IMA policy (irrelevant lines omitted): [...] dont_measure fsmagic=0xc36400 # CEPH_SUPER_MAGIC [...] measure func=FILE_CHECK mask=^MAY_READ euid=0 measure func=FILE_CHECK mask=^MAY_READ uid=0 [...] Currently, IMA compares 0xc36400 == 0x0 for CephFS files, resulting in all files opened with O_RDONLY or O_RDWR getting measured with SHA512 on every open(2): 10 69990c87e8af323d47e2d6ae4... ima-ng sha512: /data/cephfs/test-file Since O_WRONLY is rare, this results in an order of magnitude lower performance than expected for practically all file operations. Properly setting CEPH_SUPER_MAGIC in the CephFS superblock resolves the regression. Tests performed on a 3x replicated Ceph v19.3.0 cluster across three i5-7200U nodes each equipped with one Micron 7400 MAX M.2 disk (BlueStore) and Gigabit ethernet, on Talos Linux v1.10.2: FS-Mark 3.3 Test: 500 Files, Empty Files/s > Higher Is Better 6.12.27-talos . 16.6 |==== +twelho patch . 208.4 |==================================================== FS-Mark 3.3 Test: 500 Files, 1KB Size Files/s > Higher Is Better 6.12.27-talos . 15.6 |======= +twelho patch . 118.6 |==================================================== FS-Mark 3.3 Test: 500 Files, 32 Sub Dirs, 1MB Size Files/s > Higher Is Better 6.12.27-talos . 12.7 |=============== +twelho patch . 44.7 |===================================================== IO500 [3] 2fcd6d6 results (benchmarks within variance omitted): | IO500 benchmark | 6.12.27-talos | +twelho patch | Speedup | |-------------------|----------------|----------------|-----------| | mdtest-easy-write | 0.018524 kIOPS | 1.135027 kIOPS | 6027.33 % | | mdtest-hard-write | 0.018498 kIOPS | 0.973312 kIOPS | 5161.71 % | | ior-easy-read | 0.064727 GiB/s | 0.155324 GiB/s | 139.97 % | | mdtest-hard-read | 0.018246 kIOPS | 0.780800 kIOPS | 4179.29 % | This applies outside of synthetic benchmarks as well, for example, the time to rsync a 55 MiB directory with ~12k of mostly small files drops from an unusable 10m5s to a reasonable 26s (23x the throughput). [1]: https://www.talos.dev/ [2]: https://www.talos.dev/v1.10/kubernetes-guides/configuration/ceph-with-rook/ [3]: https://github.com/IO500/io500 Cc: stable@vger.kernel.org Signed-off-by: Dennis Marttinen Reviewed-by: Viacheslav Dubeyko Signed-off-by: Ilya Dryomov