<feed xmlns='http://www.w3.org/2005/Atom'>
<title>kernel/linux.git/fs/ceph/super.c, branch v6.6.131</title>
<subtitle>Linux kernel stable tree (mirror)</subtitle>
<id>https://git.radix-linux.su/kernel/linux.git/atom?h=v6.6.131</id>
<link rel='self' href='https://git.radix-linux.su/kernel/linux.git/atom?h=v6.6.131'/>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/'/>
<updated>2025-06-27T10:08:47+00:00</updated>
<entry>
<title>ceph: set superblock s_magic for IMA fsmagic matching</title>
<updated>2025-06-27T10:08:47+00:00</updated>
<author>
<name>Dennis Marttinen</name>
<email>twelho@welho.tech</email>
</author>
<published>2025-05-29T17:45:12+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=96707ff5818f29926e4dd3626f56035bf7e45b2b'/>
<id>urn:sha1:96707ff5818f29926e4dd3626f56035bf7e45b2b</id>
<content type='text'>
commit 72386d5245b249f5a0a8fabb881df7ad947b8ea4 upstream.

The CephFS kernel driver forgets to set the filesystem magic signature in
its superblock. As a result, IMA policy rules based on fsmagic matching do
not apply as intended. This causes a major performance regression in Talos
Linux [1] when mounting CephFS volumes, such as when deploying Rook Ceph
[2]. Talos Linux ships a hardened kernel with the following IMA policy
(irrelevant lines omitted):

[...]
dont_measure fsmagic=0xc36400 # CEPH_SUPER_MAGIC
[...]
measure func=FILE_CHECK mask=^MAY_READ euid=0
measure func=FILE_CHECK mask=^MAY_READ uid=0
[...]

Currently, IMA compares 0xc36400 == 0x0 for CephFS files, resulting in all
files opened with O_RDONLY or O_RDWR getting measured with SHA512 on every
open(2):

10 69990c87e8af323d47e2d6ae4... ima-ng sha512:&lt;hash&gt; /data/cephfs/test-file

Since O_WRONLY is rare, this results in an order of magnitude lower
performance than expected for practically all file operations. Properly
setting CEPH_SUPER_MAGIC in the CephFS superblock resolves the regression.

Tests performed on a 3x replicated Ceph v19.3.0 cluster across three
i5-7200U nodes each equipped with one Micron 7400 MAX M.2 disk (BlueStore)
and Gigabit ethernet, on Talos Linux v1.10.2:

FS-Mark 3.3
Test: 500 Files, Empty
Files/s &gt; Higher Is Better
6.12.27-talos . 16.6  |====
+twelho patch . 208.4 |====================================================

FS-Mark 3.3
Test: 500 Files, 1KB Size
Files/s &gt; Higher Is Better
6.12.27-talos . 15.6  |=======
+twelho patch . 118.6 |====================================================

FS-Mark 3.3
Test: 500 Files, 32 Sub Dirs, 1MB Size
Files/s &gt; Higher Is Better
6.12.27-talos . 12.7 |===============
+twelho patch . 44.7 |=====================================================

IO500 [3] 2fcd6d6 results (benchmarks within variance omitted):

| IO500 benchmark   | 6.12.27-talos  | +twelho patch  | Speedup   |
|-------------------|----------------|----------------|-----------|
| mdtest-easy-write | 0.018524 kIOPS | 1.135027 kIOPS | 6027.33 % |
| mdtest-hard-write | 0.018498 kIOPS | 0.973312 kIOPS | 5161.71 % |
| ior-easy-read     | 0.064727 GiB/s | 0.155324 GiB/s | 139.97 %  |
| mdtest-hard-read  | 0.018246 kIOPS | 0.780800 kIOPS | 4179.29 % |

This applies outside of synthetic benchmarks as well, for example, the time
to rsync a 55 MiB directory with ~12k of mostly small files drops from an
unusable 10m5s to a reasonable 26s (23x the throughput).

[1]: https://www.talos.dev/
[2]: https://www.talos.dev/v1.10/kubernetes-guides/configuration/ceph-with-rook/
[3]: https://github.com/IO500/io500

Cc: stable@vger.kernel.org
Signed-off-by: Dennis Marttinen &lt;twelho@welho.tech&gt;
Reviewed-by: Viacheslav Dubeyko &lt;Slava.Dubeyko@ibm.com&gt;
Signed-off-by: Ilya Dryomov &lt;idryomov@gmail.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</content>
</entry>
<entry>
<title>ceph: validate snapdirname option length when mounting</title>
<updated>2024-12-27T12:58:56+00:00</updated>
<author>
<name>Ilya Dryomov</name>
<email>idryomov@gmail.com</email>
</author>
<published>2024-11-20T15:43:51+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=f006f6eaea7b1345d2d9ad579087d3f63ab72c36'/>
<id>urn:sha1:f006f6eaea7b1345d2d9ad579087d3f63ab72c36</id>
<content type='text'>
commit 12eb22a5a609421b380c3c6ca887474fb2089b2c upstream.

It becomes a path component, so it shouldn't exceed NAME_MAX
characters.  This was hardened in commit c152737be22b ("ceph: Use
strscpy() instead of strcpy() in __get_snap_name()"), but no actual
check was put in place.

Cc: stable@vger.kernel.org
Signed-off-by: Ilya Dryomov &lt;idryomov@gmail.com&gt;
Reviewed-by: Alex Markuze &lt;amarkuze@redhat.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</content>
</entry>
<entry>
<title>ceph: extract entity name from device id</title>
<updated>2024-12-09T09:33:06+00:00</updated>
<author>
<name>Patrick Donnelly</name>
<email>pdonnell@redhat.com</email>
</author>
<published>2024-10-13T00:54:11+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=308e0cc36dc409718db0bf55e57c9867367183c4'/>
<id>urn:sha1:308e0cc36dc409718db0bf55e57c9867367183c4</id>
<content type='text'>
commit 955710afcb3bb63e21e186451ed5eba85fa14d0b upstream.

Previously, the "name" in the new device syntax "&lt;name&gt;@&lt;fsid&gt;.&lt;fsname&gt;"
was ignored because (presumably) tests were done using mount.ceph which
also passed the entity name using "-o name=foo". If mounting is done
without the mount.ceph helper, the new device id syntax fails to set
the name properly.

Cc: stable@vger.kernel.org
Link: https://tracker.ceph.com/issues/68516
Signed-off-by: Patrick Donnelly &lt;pdonnell@redhat.com&gt;
Reviewed-by: Ilya Dryomov &lt;idryomov@gmail.com&gt;
Signed-off-by: Ilya Dryomov &lt;idryomov@gmail.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</content>
</entry>
<entry>
<title>ceph: fix incorrect kmalloc size of pagevec mempool</title>
<updated>2024-08-03T06:54:40+00:00</updated>
<author>
<name>ethanwu</name>
<email>ethanwu@synology.com</email>
</author>
<published>2024-07-11T06:47:56+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=da0a3ebf9a96f0c8d60c8278e792228432435147'/>
<id>urn:sha1:da0a3ebf9a96f0c8d60c8278e792228432435147</id>
<content type='text'>
[ Upstream commit 03230edb0bd831662a7c08b6fef66b2a9a817774 ]

The kmalloc size of pagevec mempool is incorrectly calculated.
It misses the size of page pointer and only accounts the number for the array.

Fixes: a0102bda5bc0 ("ceph: move sb-&gt;wb_pagevec_pool to be a global mempool")
Signed-off-by: ethanwu &lt;ethanwu@synology.com&gt;
Reviewed-by: Xiubo Li &lt;xiubli@redhat.com&gt;
Signed-off-by: Ilya Dryomov &lt;idryomov@gmail.com&gt;
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</content>
</entry>
<entry>
<title>ceph: rename _to_client() to _to_fs_client()</title>
<updated>2024-04-27T15:11:29+00:00</updated>
<author>
<name>Xiubo Li</name>
<email>xiubli@redhat.com</email>
</author>
<published>2023-06-12T02:50:38+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=985b9ee8a2cfc702cb9a63ca4f22b0b244297d2c'/>
<id>urn:sha1:985b9ee8a2cfc702cb9a63ca4f22b0b244297d2c</id>
<content type='text'>
[ Upstream commit 5995d90d2d19f337df6a50bcf4699ef053214dac ]

We need to covert the inode to ceph_client in the following commit,
and will add one new helper for that, here we rename the old helper
to _fs_client().

Link: https://tracker.ceph.com/issues/61590
Signed-off-by: Xiubo Li &lt;xiubli@redhat.com&gt;
Reviewed-by: Patrick Donnelly &lt;pdonnell@redhat.com&gt;
Reviewed-by: Milind Changire &lt;mchangir@redhat.com&gt;
Signed-off-by: Ilya Dryomov &lt;idryomov@gmail.com&gt;
Stable-dep-of: b372e96bd0a3 ("ceph: redirty page before returning AOP_WRITEPAGE_ACTIVATE")
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</content>
</entry>
<entry>
<title>ceph: wait for OSD requests' callbacks to finish when unmounting</title>
<updated>2023-08-24T09:24:36+00:00</updated>
<author>
<name>Xiubo Li</name>
<email>xiubli@redhat.com</email>
</author>
<published>2023-05-09T08:38:49+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=1464de9f813e35559ff049f1f1f20f1e2a31d6b8'/>
<id>urn:sha1:1464de9f813e35559ff049f1f1f20f1e2a31d6b8</id>
<content type='text'>
The sync_filesystem() will flush all the dirty buffer and submit the
osd reqs to the osdc and then is blocked to wait for all the reqs to
finish. But the when the reqs' replies come, the reqs will be removed
from osdc just before the req-&gt;r_callback()s are called. Which means
the sync_filesystem() will be woke up by leaving the req-&gt;r_callback()s
are still running.

This will be buggy when the waiter require the req-&gt;r_callback()s to
release some resources before continuing. So we need to make sure the
req-&gt;r_callback()s are called before removing the reqs from the osdc.

WARNING: CPU: 4 PID: 168846 at fs/crypto/keyring.c:242 fscrypt_destroy_keyring+0x7e/0xd0
CPU: 4 PID: 168846 Comm: umount Tainted: G S  6.1.0-rc5-ceph-g72ead199864c #1
Hardware name: Supermicro SYS-5018R-WR/X10SRW-F, BIOS 2.0 12/17/2015
RIP: 0010:fscrypt_destroy_keyring+0x7e/0xd0
RSP: 0018:ffffc9000b277e28 EFLAGS: 00010202
RAX: 0000000000000002 RBX: ffff88810d52ac00 RCX: ffff88810b56aa00
RDX: 0000000080000000 RSI: ffffffff822f3a09 RDI: ffff888108f59000
RBP: ffff8881d394fb88 R08: 0000000000000028 R09: 0000000000000000
R10: 0000000000000001 R11: 11ff4fe6834fcd91 R12: ffff8881d394fc40
R13: ffff888108f59000 R14: ffff8881d394f800 R15: 0000000000000000
FS:  00007fd83f6f1080(0000) GS:ffff88885fd00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f918d417000 CR3: 000000017f89a005 CR4: 00000000003706e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
&lt;TASK&gt;
generic_shutdown_super+0x47/0x120
kill_anon_super+0x14/0x30
ceph_kill_sb+0x36/0x90 [ceph]
deactivate_locked_super+0x29/0x60
cleanup_mnt+0xb8/0x140
task_work_run+0x67/0xb0
exit_to_user_mode_prepare+0x23d/0x240
syscall_exit_to_user_mode+0x25/0x60
do_syscall_64+0x40/0x80
entry_SYSCALL_64_after_hwframe+0x63/0xcd
RIP: 0033:0x7fd83dc39e9b

We need to increase the blocker counter to make sure all the osd
requests' callbacks have been finished just before calling the
kill_anon_super() when unmounting.

Link: https://tracker.ceph.com/issues/58126
Signed-off-by: Xiubo Li &lt;xiubli@redhat.com&gt;
Reviewed-and-tested-by: Luís Henriques &lt;lhenriques@suse.de&gt;
Reviewed-by: Milind Changire &lt;mchangir@redhat.com&gt;
Signed-off-by: Ilya Dryomov &lt;idryomov@gmail.com&gt;
</content>
</entry>
<entry>
<title>ceph: drop messages from MDS when unmounting</title>
<updated>2023-08-24T09:24:36+00:00</updated>
<author>
<name>Xiubo Li</name>
<email>xiubli@redhat.com</email>
</author>
<published>2022-12-21T06:13:51+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=e3dfcab2080dc1f9a4b09cc1327361bc2845bfcd'/>
<id>urn:sha1:e3dfcab2080dc1f9a4b09cc1327361bc2845bfcd</id>
<content type='text'>
When unmounting all the dirty buffers will be flushed and after
the last osd request is finished the last reference of the i_count
will be released. Then it will flush the dirty cap/snap to MDSs,
and the unmounting won't wait the possible acks, which will ihold
the inodes when updating the metadata locally but makes no sense
any more, of this. This will make the evict_inodes() to skip these
inodes.

If encrypt is enabled the kernel generate a warning when removing
the encrypt keys when the skipped inodes still hold the keyring:

WARNING: CPU: 4 PID: 168846 at fs/crypto/keyring.c:242 fscrypt_destroy_keyring+0x7e/0xd0
CPU: 4 PID: 168846 Comm: umount Tainted: G S  6.1.0-rc5-ceph-g72ead199864c #1
Hardware name: Supermicro SYS-5018R-WR/X10SRW-F, BIOS 2.0 12/17/2015
RIP: 0010:fscrypt_destroy_keyring+0x7e/0xd0
RSP: 0018:ffffc9000b277e28 EFLAGS: 00010202
RAX: 0000000000000002 RBX: ffff88810d52ac00 RCX: ffff88810b56aa00
RDX: 0000000080000000 RSI: ffffffff822f3a09 RDI: ffff888108f59000
RBP: ffff8881d394fb88 R08: 0000000000000028 R09: 0000000000000000
R10: 0000000000000001 R11: 11ff4fe6834fcd91 R12: ffff8881d394fc40
R13: ffff888108f59000 R14: ffff8881d394f800 R15: 0000000000000000
FS:  00007fd83f6f1080(0000) GS:ffff88885fd00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f918d417000 CR3: 000000017f89a005 CR4: 00000000003706e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
&lt;TASK&gt;
generic_shutdown_super+0x47/0x120
kill_anon_super+0x14/0x30
ceph_kill_sb+0x36/0x90 [ceph]
deactivate_locked_super+0x29/0x60
cleanup_mnt+0xb8/0x140
task_work_run+0x67/0xb0
exit_to_user_mode_prepare+0x23d/0x240
syscall_exit_to_user_mode+0x25/0x60
do_syscall_64+0x40/0x80
entry_SYSCALL_64_after_hwframe+0x63/0xcd
RIP: 0033:0x7fd83dc39e9b

Later the kernel will crash when iput() the inodes and dereferencing
the "sb-&gt;s_master_keys", which has been released by the
generic_shutdown_super().

Link: https://tracker.ceph.com/issues/59162
Signed-off-by: Xiubo Li &lt;xiubli@redhat.com&gt;
Reviewed-and-tested-by: Luís Henriques &lt;lhenriques@suse.de&gt;
Reviewed-by: Milind Changire &lt;mchangir@redhat.com&gt;
Signed-off-by: Ilya Dryomov &lt;idryomov@gmail.com&gt;
</content>
</entry>
<entry>
<title>ceph: implement -o test_dummy_encryption mount option</title>
<updated>2023-08-22T07:01:48+00:00</updated>
<author>
<name>Jeff Layton</name>
<email>jlayton@kernel.org</email>
</author>
<published>2020-09-08T13:47:40+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=6b5717bd30ab7f35792d20b71211055bdb43e6de'/>
<id>urn:sha1:6b5717bd30ab7f35792d20b71211055bdb43e6de</id>
<content type='text'>
Add support for the test_dummy_encryption mount option. This allows us
to test the encrypted codepaths in ceph without having to manually set
keys, etc.

[ lhenriques: fix potential fsc-&gt;fsc_dummy_enc_policy memory leak in
  ceph_real_mount() ]

Signed-off-by: Jeff Layton &lt;jlayton@kernel.org&gt;
Reviewed-by: Xiubo Li &lt;xiubli@redhat.com&gt;
Reviewed-and-tested-by: Luís Henriques &lt;lhenriques@suse.de&gt;
Reviewed-by: Milind Changire &lt;mchangir@redhat.com&gt;
Signed-off-by: Ilya Dryomov &lt;idryomov@gmail.com&gt;
</content>
</entry>
<entry>
<title>ceph: fscrypt_auth handling for ceph</title>
<updated>2023-08-22T07:01:48+00:00</updated>
<author>
<name>Jeff Layton</name>
<email>jlayton@kernel.org</email>
</author>
<published>2020-07-27T14:16:09+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=2d332d5bc424404911540006a8bb450fbb96b178'/>
<id>urn:sha1:2d332d5bc424404911540006a8bb450fbb96b178</id>
<content type='text'>
Most fscrypt-enabled filesystems store the crypto context in an xattr,
but that's problematic for ceph as xatts are governed by the XATTR cap,
but we really want the crypto context as part of the AUTH cap.

Because of this, the MDS has added two new inode metadata fields:
fscrypt_auth and fscrypt_file. The former is used to hold the crypto
context, and the latter is used to track the real file size.

Parse new fscrypt_auth and fscrypt_file fields in inode traces. For now,
we don't use fscrypt_file, but fscrypt_auth is used to hold the fscrypt
context.

Allow the client to use a setattr request for setting the fscrypt_auth
field. Since this is not a standard setattr request from the VFS, we add
a new field to __ceph_setattr that carries ceph-specific inode attrs.

Have the set_context op do a setattr that sets the fscrypt_auth value,
and get_context just return the contents of that field (since it should
always be available).

Signed-off-by: Jeff Layton &lt;jlayton@kernel.org&gt;
Reviewed-by: Xiubo Li &lt;xiubli@redhat.com&gt;
Reviewed-and-tested-by: Luís Henriques &lt;lhenriques@suse.de&gt;
Reviewed-by: Milind Changire &lt;mchangir@redhat.com&gt;
Signed-off-by: Ilya Dryomov &lt;idryomov@gmail.com&gt;
</content>
</entry>
<entry>
<title>ceph: add new mount option to enable sparse reads</title>
<updated>2023-08-22T07:01:47+00:00</updated>
<author>
<name>Jeff Layton</name>
<email>jlayton@kernel.org</email>
</author>
<published>2022-02-26T11:33:03+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=03bc06c7b0bd8d86b9f17f459acaeb1283ba2700'/>
<id>urn:sha1:03bc06c7b0bd8d86b9f17f459acaeb1283ba2700</id>
<content type='text'>
Add a new mount option that has the client issue sparse reads instead of
normal ones. The callers now preallocate an sparse extent buffer that
the libceph receive code can populate and hand back after the operation
completes.

After a successful sparse read, we can't use the req-&gt;r_result value to
determine the amount of data "read", so instead we set the received
length to be from the end of the last extent in the buffer. Any
interstitial holes will have been filled by the receive code.

[ xiubli: fix a double free on req reported by Ilya ]

Signed-off-by: Jeff Layton &lt;jlayton@kernel.org&gt;
Reviewed-by: Xiubo Li &lt;xiubli@redhat.com&gt;
Reviewed-and-tested-by: Luís Henriques &lt;lhenriques@suse.de&gt;
Reviewed-by: Milind Changire &lt;mchangir@redhat.com&gt;
Signed-off-by: Ilya Dryomov &lt;idryomov@gmail.com&gt;
</content>
</entry>
</feed>
