path: root/fs/ceph
Age  |  Commit message  |  Author  |  Files  |  Lines
2025-11-24  ceph: refactor wake_up_bit() pattern of calling  (Viacheslav Dubeyko)  2 files, -6/+3
[ Upstream commit 53db6f25ee47cb1265141d31562604e56146919a ]

wake_up_bit() is called in ceph_async_unlink_cb(), wake_async_create_waiters(), and ceph_finish_async_create(). It makes sense to switch these call sites to the clear_and_wake_up_bit() helper, because it makes the code much cleaner and easier to understand. The more important part of the rework is the addition of an smp_mb__after_atomic() memory barrier after the bit modification and before the wake_up_bit() call, which prevents a potential race when other threads access the modified bit. Luckily, clear_and_wake_up_bit() already implements the required pattern:

    static inline void clear_and_wake_up_bit(int bit, unsigned long *word)
    {
        clear_bit_unlock(bit, word);
        /* See wake_up_bit() for which memory barrier you need to use. */
        smp_mb__after_atomic();
        wake_up_bit(word, bit);
    }

Signed-off-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
Reviewed-by: Alex Markuze <amarkuze@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
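A minimal before/after sketch of the call-site change this changelog describes; the flags word, bit number and function names are hypothetical, only clear_bit(), wake_up_bit() and clear_and_wake_up_bit() are real kernel helpers:

    #include <linux/bitops.h>
    #include <linux/wait_bit.h>

    /* Before: open-coded clear + wake, with no barrier in between. */
    static void finish_async_op_old(unsigned long *flags, int bit)
    {
            clear_bit(bit, flags);
            wake_up_bit(flags, bit);        /* a waiter may miss the cleared bit */
    }

    /* After: clear_and_wake_up_bit() clears the bit with release semantics,
     * issues smp_mb__after_atomic() and only then wakes the waiters. */
    static void finish_async_op_new(unsigned long *flags, int bit)
    {
            clear_and_wake_up_bit(bit, flags);
    }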
2025-11-24  ceph: add checking of wait_for_completion_killable() return value  (Viacheslav Dubeyko)  1 file, -1/+4
[ Upstream commit b7ed1e29cfe773d648ca09895b92856bd3a2092d ]

The Coverity Scan service has detected a call to wait_for_completion_killable() without checking the return value in ceph_lock_wait_for_completion() [1]. The CID 1636232 defect contains the explanation: "If the function returns an error value, the error value may be mistaken for a normal value. In ceph_lock_wait_for_completion(): Value returned from a function is not checked for errors before being used. (CWE-252)". The patch adds a check of the wait_for_completion_killable() return value and returns the error code from ceph_lock_wait_for_completion().

[1] https://scan5.scan.coverity.com/#/project-view/64304/10063?selectedIssue=1636232

Signed-off-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
Reviewed-by: Alex Markuze <amarkuze@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
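A hedged sketch of the check being added; the wrapper name and completion pointer are illustrative, wait_for_completion_killable() is the real API and returns 0 or -ERESTARTSYS:

    #include <linux/completion.h>

    /* Propagate -ERESTARTSYS (fatal signal) instead of silently treating
     * the interrupted wait as success. */
    static int wait_for_lock_done(struct completion *done)
    {
            int err = wait_for_completion_killable(done);

            if (err)        /* interrupted by a fatal signal */
                    return err;

            return 0;
    }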
2025-07-06  ceph: fix possible integer overflow in ceph_zero_objects()  (Dmitry Kandybka)  1 file, -1/+1
[ Upstream commit 0abd87942e0c93964e93224836944712feba1d91 ] In 'ceph_zero_objects', promote 'object_size' to 'u64' to avoid possible integer overflow. Compile tested only. Found by Linux Verification Center (linuxtesting.org) with SVACE. Signed-off-by: Dmitry Kandybka <d.kandybka@gmail.com> Reviewed-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
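An illustrative sketch of the overflow class being fixed (not the actual ceph_zero_objects() code, and the helper names are made up): multiplying two 32-bit values overflows before the result is widened, unless one operand is promoted to u64 first.

    #include <linux/types.h>

    /* The multiply happens in 32 bits and can overflow before widening. */
    static u64 zeroed_bytes_buggy(u32 object_size, u32 num_objects)
    {
            return object_size * num_objects;
    }

    /* Promoting one operand to u64 makes the whole expression 64-bit. */
    static u64 zeroed_bytes_fixed(u64 object_size, u32 num_objects)
    {
            return object_size * num_objects;
    }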
2025-06-27  ceph: set superblock s_magic for IMA fsmagic matching  (Dennis Marttinen)  1 file, -0/+1
commit 72386d5245b249f5a0a8fabb881df7ad947b8ea4 upstream.

The CephFS kernel driver forgets to set the filesystem magic signature in its superblock. As a result, IMA policy rules based on fsmagic matching do not apply as intended. This causes a major performance regression in Talos Linux [1] when mounting CephFS volumes, such as when deploying Rook Ceph [2]. Talos Linux ships a hardened kernel with the following IMA policy (irrelevant lines omitted):

    [...]
    dont_measure fsmagic=0xc36400 # CEPH_SUPER_MAGIC
    [...]
    measure func=FILE_CHECK mask=^MAY_READ euid=0
    measure func=FILE_CHECK mask=^MAY_READ uid=0
    [...]

Currently, IMA compares 0xc36400 == 0x0 for CephFS files, resulting in all files opened with O_RDONLY or O_RDWR getting measured with SHA512 on every open(2):

    10 69990c87e8af323d47e2d6ae4... ima-ng sha512:<hash> /data/cephfs/test-file

Since O_WRONLY is rare, this results in an order of magnitude lower performance than expected for practically all file operations. Properly setting CEPH_SUPER_MAGIC in the CephFS superblock resolves the regression. Tests performed on a 3x replicated Ceph v19.3.0 cluster across three i5-7200U nodes each equipped with one Micron 7400 MAX M.2 disk (BlueStore) and Gigabit ethernet, on Talos Linux v1.10.2:

    FS-Mark 3.3 Test: 500 Files, Empty  (Files/s > Higher Is Better)
      6.12.27-talos .  16.6 |====
      +twelho patch . 208.4 |====================================================

    FS-Mark 3.3 Test: 500 Files, 1KB Size  (Files/s > Higher Is Better)
      6.12.27-talos .  15.6 |=======
      +twelho patch . 118.6 |====================================================

    FS-Mark 3.3 Test: 500 Files, 32 Sub Dirs, 1MB Size  (Files/s > Higher Is Better)
      6.12.27-talos .  12.7 |===============
      +twelho patch .  44.7 |=====================================================

IO500 [3] 2fcd6d6 results (benchmarks within variance omitted):

    | IO500 benchmark   | 6.12.27-talos  | +twelho patch  | Speedup   |
    |-------------------|----------------|----------------|-----------|
    | mdtest-easy-write | 0.018524 kIOPS | 1.135027 kIOPS | 6027.33 % |
    | mdtest-hard-write | 0.018498 kIOPS | 0.973312 kIOPS | 5161.71 % |
    | ior-easy-read     | 0.064727 GiB/s | 0.155324 GiB/s | 139.97 %  |
    | mdtest-hard-read  | 0.018246 kIOPS | 0.780800 kIOPS | 4179.29 % |

This applies outside of synthetic benchmarks as well, for example, the time to rsync a 55 MiB directory with ~12k of mostly small files drops from an unusable 10m5s to a reasonable 26s (23x the throughput).

[1]: https://www.talos.dev/
[2]: https://www.talos.dev/v1.10/kubernetes-guides/configuration/ceph-with-rook/
[3]: https://github.com/IO500/io500

Cc: stable@vger.kernel.org
Signed-off-by: Dennis Marttinen <twelho@welho.tech>
Reviewed-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
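A hedged sketch of what the one-line fix amounts to: assigning the superblock magic during fill_super so that IMA fsmagic rules can match CephFS files. The surrounding function is illustrative; CEPH_SUPER_MAGIC comes from <linux/magic.h>.

    #include <linux/fs.h>
    #include <linux/magic.h>        /* CEPH_SUPER_MAGIC == 0x00c36400 */

    /* With s_magic set, IMA's "dont_measure fsmagic=0xc36400" rule applies. */
    static void set_cephfs_magic(struct super_block *sb)
    {
            sb->s_magic = CEPH_SUPER_MAGIC;
    }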
2025-05-02  ceph: Fix incorrect flush end position calculation  (David Howells)  1 file, -1/+1
[ Upstream commit f452a2204614fc10e2c3b85904c4bd300c2789dc ] In ceph, in fill_fscrypt_truncate(), the end flush position is calculated by: loff_t lend = orig_pos + CEPH_FSCRYPT_BLOCK_SHIFT - 1; but that's using the block shift not the block size. Fix this to use the block size instead. Fixes: 5c64737d2536 ("ceph: add truncate size handling support for fscrypt") Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
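A short sketch of the off-by-constant: the block size is (1 << shift), so adding the shift instead of the size computes a far-too-small flush range. The constant names and the assumed 4 KiB block below are illustrative, not the actual fs/ceph definitions.

    #include <linux/types.h>

    #define FSCRYPT_BLOCK_SHIFT     12                      /* assumed */
    #define FSCRYPT_BLOCK_SIZE      (1 << FSCRYPT_BLOCK_SHIFT)

    static loff_t flush_end_buggy(loff_t orig_pos)
    {
            return orig_pos + FSCRYPT_BLOCK_SHIFT - 1;      /* adds 12, not 4096 */
    }

    static loff_t flush_end_fixed(loff_t orig_pos)
    {
            return orig_pos + FSCRYPT_BLOCK_SIZE - 1;       /* covers the whole block */
    }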
2025-01-09  ceph: give up on paths longer than PATH_MAX  (Max Kellermann)  1 file, -5/+4
commit 550f7ca98ee028a606aa75705a7e77b1bd11720f upstream. If the full path to be built by ceph_mdsc_build_path() happens to be longer than PATH_MAX, then this function will enter an endless (retry) loop, effectively blocking the whole task. Most of the machine becomes unusable, making this a very simple and effective DoS vulnerability. I cannot imagine why this retry was ever implemented, but it seems rather useless and harmful to me. Let's remove it and fail with ENAMETOOLONG instead. Cc: stable@vger.kernel.org Reported-by: Dario Weißer <dario@cure53.de> Signed-off-by: Max Kellermann <max.kellermann@ionos.com> Reviewed-by: Alex Markuze <amarkuze@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> [idryomov@gmail.com: backport to 6.6: pr_warn() is still in use] Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
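A minimal sketch of the behavioural change: a path that cannot fit in PATH_MAX now fails immediately with -ENAMETOOLONG instead of being rebuilt forever. The helper is hypothetical, not the actual ceph_mdsc_build_path() signature.

    #include <linux/types.h>
    #include <linux/limits.h>
    #include <linux/errno.h>

    /* Previously this condition led to an endless retry loop. */
    static int check_built_path_len(size_t len)
    {
            if (len >= PATH_MAX)
                    return -ENAMETOOLONG;
            return 0;
    }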
2025-01-02  ceph: allocate sparse_ext map only for sparse reads  (Ilya Dryomov)  1 file, -1/+1
[ Upstream commit 18d44c5d062b97b97bb0162d9742440518958dc1 ] If mounted with sparseread option, ceph_direct_read_write() ends up making an unnecessarily allocation for O_DIRECT writes. Fixes: 03bc06c7b0bd ("ceph: add new mount option to enable sparse reads") Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Alex Markuze <amarkuze@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-01-02  ceph: fix memory leak in ceph_direct_read_write()  (Ilya Dryomov)  1 file, -21/+22
[ Upstream commit 66e0c4f91461d17d48071695271c824620bed4ef ] The bvecs array which is allocated in iter_get_bvecs_alloc() is leaked and pages remain pinned if ceph_alloc_sparse_ext_map() fails. There is no need to delay the allocation of sparse_ext map until after the bvecs array is set up, so fix this by moving sparse_ext allocation a bit earlier. Also, make a similar adjustment in __ceph_sync_read() for consistency (a leak of the same kind in __ceph_sync_read() has been addressed differently). Cc: stable@vger.kernel.org Fixes: 03bc06c7b0bd ("ceph: add new mount option to enable sparse reads") Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Alex Markuze <amarkuze@redhat.com> Stable-dep-of: 18d44c5d062b ("ceph: allocate sparse_ext map only for sparse reads") Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-01-02  ceph: try to allocate a smaller extent map for sparse read  (Xiubo Li)  3 files, -3/+23
[ Upstream commit aaefabc4a5f7ae48682c4d2d5d10faaf95c08eb9 ]

In the fscrypt case, and for a smaller read length, we can predict the max count of the extent map. For small read length use cases this could save some memory.

[ idryomov: squash into a single patch to avoid build break, drop redundant variable in ceph_alloc_sparse_ext_map() ]

Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Stable-dep-of: 18d44c5d062b ("ceph: allocate sparse_ext map only for sparse reads")
Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-12-27  ceph: fix memory leaks in __ceph_sync_read()  (Max Kellermann)  1 file, -4/+3
commit d6fd6f8280f0257ba93f16900a0d3d3912f32c79 upstream. In two `break` statements, the call to ceph_release_page_vector() was missing, leaking the allocation from ceph_alloc_page_vector(). Instead of adding the missing ceph_release_page_vector() calls, the Ceph maintainers preferred to transfer page ownership to the `ceph_osd_request` by passing `own_pages=true` to osd_req_op_extent_osd_data_pages(). This requires postponing the ceph_osdc_put_request() call until after the block that accesses the `pages`. Cc: stable@vger.kernel.org Fixes: 03bc06c7b0bd ("ceph: add new mount option to enable sparse reads") Fixes: f0fe1e54cfcf ("ceph: plumb in decryption during reads") Signed-off-by: Max Kellermann <max.kellermann@ionos.com> Reviewed-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-12-27  ceph: improve error handling and short/overflow-read logic in __ceph_sync_read()  (Alex Markuze)  1 file, -15/+14
commit 9abee475803fab6ad59d4f4fc59c6a75374a7d9d upstream.

This patch refines the read logic in __ceph_sync_read() to ensure more predictable and efficient behavior in various edge cases.

- Return early if the requested read length is zero or if the file size (`i_size`) is zero.
- Initialize the index variable (`idx`) where needed and reorder some code to ensure it is always set before use.
- Improve error handling by checking for negative return values earlier.
- Remove redundant encrypted file checks after failures. Only attempt filesystem-level decryption if the read succeeded.
- Simplify leftover calculations to correctly handle cases where the read extends beyond the end of the file or stops short. This can be hit by continuously reading a file while, on another client, we keep truncating and writing new data into it.
- This resolves multiple issues caused by integer and consequent buffer overflow (`pages` array being accessed beyond `num_pages`):
  - https://tracker.ceph.com/issues/67524
  - https://tracker.ceph.com/issues/68980
  - https://tracker.ceph.com/issues/68981

Cc: stable@vger.kernel.org
Fixes: 1065da21e5df ("ceph: stop copying to iter at EOF on sync reads")
Reported-by: Luis Henriques (SUSE) <luis.henriques@linux.dev>
Signed-off-by: Alex Markuze <amarkuze@redhat.com>
Reviewed-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-12-27  ceph: validate snapdirname option length when mounting  (Ilya Dryomov)  1 file, -0/+2
commit 12eb22a5a609421b380c3c6ca887474fb2089b2c upstream. It becomes a path component, so it shouldn't exceed NAME_MAX characters. This was hardened in commit c152737be22b ("ceph: Use strscpy() instead of strcpy() in __get_snap_name()"), but no actual check was put in place. Cc: stable@vger.kernel.org Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Alex Markuze <amarkuze@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
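A hedged sketch of the added bound check at mount-option parse time; the helper and its call site are illustrative, NAME_MAX is the standard kernel limit for a path component.

    #include <linux/limits.h>       /* NAME_MAX */
    #include <linux/errno.h>
    #include <linux/string.h>

    /* snapdirname becomes a path component, so cap it at NAME_MAX. */
    static int validate_snapdirname(const char *name)
    {
            if (strlen(name) > NAME_MAX)
                    return -ENAMETOOLONG;
            return 0;
    }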
2024-12-09  ceph: extract entity name from device id  (Patrick Donnelly)  1 file, -1/+9
commit 955710afcb3bb63e21e186451ed5eba85fa14d0b upstream. Previously, the "name" in the new device syntax "<name>@<fsid>.<fsname>" was ignored because (presumably) tests were done using mount.ceph which also passed the entity name using "-o name=foo". If mounting is done without the mount.ceph helper, the new device id syntax fails to set the name properly. Cc: stable@vger.kernel.org Link: https://tracker.ceph.com/issues/68516 Signed-off-by: Patrick Donnelly <pdonnell@redhat.com> Reviewed-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-10-10  ceph: fix cap ref leak via netfs init_request  (Patrick Donnelly)  1 file, -1/+4
commit ccda9910d8490f4fb067131598e4b2e986faa5a0 upstream.

Log recovered from a user's cluster:

    <7>[ 5413.970692] ceph: get_cap_refs 00000000958c114b ret 1 got Fr
    <7>[ 5413.970695] ceph: start_read 00000000958c114b, no cache cap
    ...
    <7>[ 5473.934609] ceph: my wanted = Fr, used = Fr, dirty -
    <7>[ 5473.934616] ceph: revocation: pAsLsXsFr -> pAsLsXs (revoking Fr)
    <7>[ 5473.934632] ceph: __ceph_caps_issued 00000000958c114b cap 00000000f7784259 issued pAsLsXs
    <7>[ 5473.934638] ceph: check_caps 10000000e68.fffffffffffffffe file_want - used Fr dirty - flushing - issued pAsLsXs revoking Fr retain pAsLsXsFsr AUTHONLY NOINVAL FLUSH_FORCE

The MDS subsequently complains that the kernel client is late releasing caps.

Approximately, a series of changes to this code by commits 49870056005c ("ceph: convert ceph_readpages to ceph_readahead"), 2de160417315 ("netfs: Change ->init_request() to return an error code") and a5c9dc445139 ("ceph: Make ceph_init_request() check caps on readahead") caused subtle resource cleanup to be missed. The main culprit is the change in error handling in 2de160417315, which meant that a failure in init_request() would no longer cause cleanup to be called. That prevented the ceph_put_cap_refs() call which would clean up the leaked cap ref.

Cc: stable@vger.kernel.org
Fixes: a5c9dc445139 ("ceph: Make ceph_init_request() check caps on readahead")
Link: https://tracker.ceph.com/issues/67008
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-10-10  ceph: remove the incorrect Fw reference check when dirtying pages  (Xiubo Li)  1 file, -1/+0
[ Upstream commit c08dfb1b49492c09cf13838c71897493ea3b424e ]

When doing direct-io reads the client will also try to mark pages dirty, but the read path won't hold the Fw caps and there is no case in which it gets the Fw reference.

Fixes: 5dda377cf0a6 ("ceph: set i_head_snapc when getting CEPH_CAP_FILE_WR reference")
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Patrick Donnelly <pdonnell@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-08-03  ceph: fix incorrect kmalloc size of pagevec mempool  (ethanwu)  1 file, -1/+2
[ Upstream commit 03230edb0bd831662a7c08b6fef66b2a9a817774 ]

The kmalloc size of the pagevec mempool is incorrectly calculated: it misses the size of the page pointer and only accounts for the number of elements in the array.

Fixes: a0102bda5bc0 ("ceph: move sb->wb_pagevec_pool to be a global mempool")
Signed-off-by: ethanwu <ethanwu@synology.com>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
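A sketch of the size-bug class under assumed names (the element count and pool name are illustrative, mempool_create_kmalloc_pool() is the real API): each pool element is an array of page pointers, so the allocation size must be count * sizeof(struct page *), not just the count.

    #include <linux/mempool.h>
    #include <linux/mm_types.h>

    #define WB_PAGEVEC_SIZE 32      /* illustrative pages-per-element count */

    static mempool_t *create_wb_pagevec_pool(void)
    {
            /* The buggy form was effectively kmalloc(WB_PAGEVEC_SIZE, ...). */
            return mempool_create_kmalloc_pool(10,
                            WB_PAGEVEC_SIZE * sizeof(struct page *));
    }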
2024-04-27  ceph: redirty page before returning AOP_WRITEPAGE_ACTIVATE  (NeilBrown)  1 file, -1/+3
[ Upstream commit b372e96bd0a32729d55d27f613c8bc80708a82e1 ] The page has been marked clean before writepage is called. If we don't redirty it before postponing the write, it might never get written. Cc: stable@vger.kernel.org Fixes: 503d4fa6ee28 ("ceph: remove reliance on bdi congestion") Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Xiubo Li <xiubli@redhat.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
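A hedged sketch of the pattern: the caller has already marked the page clean, so writepage must redirty it before bailing out with AOP_WRITEPAGE_ACTIVATE, or the unwritten page can be lost. The function shape is illustrative.

    #include <linux/mm.h>
    #include <linux/fs.h>
    #include <linux/writeback.h>

    static int postpone_writepage(struct page *page, struct writeback_control *wbc)
    {
            /* Mark the page dirty again so a later writeback pass retries it. */
            redirty_page_for_writepage(wbc, page);
            return AOP_WRITEPAGE_ACTIVATE;
    }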
2024-04-27  ceph: rename _to_client() to _to_fs_client()  (Xiubo Li)  14 files, -95/+95
[ Upstream commit 5995d90d2d19f337df6a50bcf4699ef053214dac ]

We need to convert the inode to ceph_client in the following commit and will add one new helper for that; here, rename the old helper to _to_fs_client().

Link: https://tracker.ceph.com/issues/61590
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Patrick Donnelly <pdonnell@redhat.com>
Reviewed-by: Milind Changire <mchangir@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Stable-dep-of: b372e96bd0a3 ("ceph: redirty page before returning AOP_WRITEPAGE_ACTIVATE")
Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-04-27  ceph: pass the mdsc to several helpers  (Xiubo Li)  9 files, -36/+51
[ Upstream commit 197b7d792d6aead2e30d4b2c054ffabae2ed73dc ] We will use the 'mdsc' to get the global_id in the following commits. Link: https://tracker.ceph.com/issues/61590 Signed-off-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Patrick Donnelly <pdonnell@redhat.com> Reviewed-by: Milind Changire <mchangir@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Stable-dep-of: b372e96bd0a3 ("ceph: redirty page before returning AOP_WRITEPAGE_ACTIVATE") Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-03-27  ceph: stop copying to iter at EOF on sync reads  (Xiubo Li)  1 file, -10/+13
[ Upstream commit 1065da21e5df9d843d2c5165d5d576be000142a6 ] If EOF is encountered, ceph_sync_read() return value is adjusted down according to i_size, but the "to" iter is advanced by the actual number of bytes read. Then, when retrying, the remainder of the range may be skipped incorrectly. Ensure that the "to" iter is advanced only until EOF. [ idryomov: changelog ] Fixes: c3d8e0b5de48 ("ceph: return the real size read when it hits EOF") Reported-by: Frank Hsiao <frankhsiao@qnap.com> Signed-off-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Ilya Dryomov <idryomov@gmail.com> Tested-by: Frank Hsiao <frankhsiao@qnap.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-03-15  ceph: switch to corrected encoding of max_xattr_size in mdsmap  (Xiubo Li)  1 file, -3/+4
[ Upstream commit 51d31149a88b5c5a8d2d33f06df93f6187a25b4c ] The addition of bal_rank_mask with encoding version 17 was merged into ceph.git in Oct 2022 and made it into v18.2.0 release normally. A few months later, the much delayed addition of max_xattr_size got merged, also with encoding version 17, placed before bal_rank_mask in the encoding -- but it didn't make v18.2.0 release. The way this ended up being resolved on the MDS side is that bal_rank_mask will continue to be encoded in version 17 while max_xattr_size is now encoded in version 18. This does mean that older kernels will misdecode version 17, but this is also true for v18.2.0 and v18.2.1 clients in userspace. The best we can do is backport this adjustment -- see ceph.git commit 78abfeaff27fee343fb664db633de5b221699a73 for details. [ idryomov: changelog ] Cc: stable@vger.kernel.org Link: https://tracker.ceph.com/issues/64440 Fixes: d93231a6bc8a ("ceph: prevent a client from exceeding the MDS maximum xattr size") Signed-off-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Patrick Donnelly <pdonnell@ibm.com> Reviewed-by: Venky Shankar <vshankar@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-02-23  ceph: prevent use-after-free in encode_cap_msg()  (Rishabh Dave)  1 file, -1/+2
commit cda4672da1c26835dcbd7aec2bfed954eda9b5ef upstream.

In fs/ceph/caps.c, in encode_cap_msg(), a "use after free" error was caught by KASAN at this line - 'ceph_buffer_get(arg->xattr_buf);'. This implies that the buffer was freed before the refcount could be incremented here. In the same file, in handle_cap_grant(), the refcount is decremented by this line - 'ceph_buffer_put(ci->i_xattrs.blob);'. It appears that a race occurred and the resource was freed by the latter line before the former line could increment it. encode_cap_msg() is called by __send_cap(), and __send_cap() is called by ceph_check_caps() after calling __prep_cap(). __prep_cap() is where arg->xattr_buf is assigned to ci->i_xattrs.blob. This is the spot where the refcount must be increased to prevent the "use after free" error.

Cc: stable@vger.kernel.org
Link: https://tracker.ceph.com/issues/59259
Signed-off-by: Rishabh Dave <ridave@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
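A simplified sketch of the ref-handling shape the changelog describes; the arg struct and function names below are hypothetical, only ceph_buffer_get()/ceph_buffer_put() are real libceph helpers.

    #include <linux/ceph/buffer.h>

    struct cap_args_sketch {
            struct ceph_buffer *xattr_buf;
    };

    /* Take the reference where the pointer is captured (__prep_cap() per the
     * changelog), so a concurrent ceph_buffer_put() can't free it early... */
    static void prep_cap_sketch(struct cap_args_sketch *arg,
                                struct ceph_buffer *blob)
    {
            arg->xattr_buf = blob ? ceph_buffer_get(blob) : NULL;
    }

    /* ...and drop it once the cap message has been encoded and sent. */
    static void send_cap_done_sketch(struct cap_args_sketch *arg)
    {
            if (arg->xattr_buf)
                    ceph_buffer_put(arg->xattr_buf);
    }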
2024-02-05  ceph: fix invalid pointer access if get_quota_realm return ERR_PTR  (Wenchao Hao)  1 file, -17/+22
[ Upstream commit 0f4cf64eabc6e16cfc2704f1960e82dc79d91c8d ]

This issue is reported by smatch: get_quota_realm() might return ERR_PTR, but we did not handle it. It's not an immediate bug, but we should still address it to avoid potential bugs if get_quota_realm() is changed to return other ERR_PTR values in the future. Set the ceph_snap_realm pointer inside get_quota_realm() to address this issue; the pointer is set to NULL if get_quota_realm() fails to get the struct ceph_snap_realm, so no ERR_PTR can leak out any more.

[ xiubli: minor code style clean up ]

Signed-off-by: Wenchao Hao <haowenchao2@huawei.com>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-02-05  ceph: fix deadlock or deadcode of misusing dget()  (Xiubo Li)  1 file, -6/+3
[ Upstream commit b493ad718b1f0357394d2cdecbf00a44a36fa085 ]

The lock order between a dentry and its parent is incorrect; we should always make sure that the parent takes the lock first. But since this dead code is never used and the parent dir will always be set by the callers, let's just remove it.

Link: https://lore.kernel.org/r/20231116081919.GZ1957730@ZenIV
Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-02-05  ceph: reinitialize mds feature bit even when session in open  (Venky Shankar)  1 file, -1/+1
[ Upstream commit f48e0342a74d7770cdf1d11894bdc3b6d989b29e ]

Following along the same lines as the user-space fix. Right now this isn't really an issue with the ceph kernel driver because of the feature bit lagginess; however, that can change over time (when the new snaprealm info type is ported to the kernel driver) and, depending on the MDS version being upgraded, can cause message decoding issues - so fix that early on.

Link: http://tracker.ceph.com/issues/63188
Signed-off-by: Venky Shankar <vshankar@redhat.com>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-01-26  ceph: select FS_ENCRYPTION_ALGS if FS_ENCRYPTION  (Eric Biggers)  1 file, -0/+1
commit 9c896d6bc3dfef86659a6a1fb25ccdea5dbef6a3 upstream. The kconfig options for filesystems that support FS_ENCRYPTION are supposed to select FS_ENCRYPTION_ALGS. This is needed to ensure that required crypto algorithms get enabled as loadable modules or builtin as is appropriate for the set of enabled filesystems. Do this for CEPH_FS so that there aren't any missing algorithms if someone happens to have CEPH_FS as their only enabled filesystem that supports encryption. Cc: stable@vger.kernel.org Fixes: f061feda6c54 ("ceph: add fscrypt ioctls and ceph.fscrypt.auth vxattr") Signed-off-by: Eric Biggers <ebiggers@google.com> Reviewed-by: Xiubo Li <xiubli@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-10-28  Merge tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs  (Linus Torvalds)  1 file, -1/+1
Pull misc filesystem fixes from Al Viro:
 "Assorted fixes all over the place: literally nothing in common, could have been three separate pull requests.

  All are simple regression fixes, but not for anything from this cycle"

* tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  ceph_wait_on_conflict_unlink(): grab reference before dropping ->d_lock
  io_uring: kiocb_done() should *not* trust ->ki_pos if ->{read,write}_iter() failed
  sparc32: fix a braino in fault handling in csum_and_copy_..._user()
2023-10-28  ceph_wait_on_conflict_unlink(): grab reference before dropping ->d_lock  (Al Viro)  1 file, -1/+1
Use of dget() after we'd dropped ->d_lock is too late - dentry might be gone by that point. Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2023-10-09  ceph: fix type promotion bug on 32bit systems  (Dan Carpenter)  1 file, -1/+1
In this code "ret" is type long and "src_objlen" is unsigned int. The problem is that on 32bit systems, when we do the comparison signed longs are type promoted to unsigned int. So negative error codes from do_splice_direct() are treated as success instead of failure. Cc: stable@vger.kernel.org Fixes: 1b0c3b9f91f0 ("ceph: re-org copy_file_range and fix some error paths") Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org> Reviewed-by: Xiubo Li <xiubli@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
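A self-contained demonstration of the promotion pitfall (userspace C, values chosen for illustration): when long and unsigned int have the same width, a negative long compared against an unsigned int is converted to a huge unsigned value, so the error code looks like a large successful return.

    #include <stdio.h>

    int main(void)
    {
            long ret = -22;                 /* e.g. an -EINVAL style error code */
            unsigned int src_objlen = 4194304;

            /* On a 32-bit target both operands are converted to unsigned long,
             * so -22 becomes 4294967274 and the comparison below is false. */
            if (ret < src_objlen)
                    printf("error (or short copy) detected\n");
            else
                    printf("negative return treated as success\n");

            return 0;
    }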
2023-10-09  ceph: remove unnecessary IS_ERR() check in ceph_fname_to_usr()  (Luis Henriques)  1 file, -1/+1
Before returning, function ceph_fname_to_usr() does a final IS_ERR() check on 'dir':

    if ((dir != fname->dir) && !IS_ERR(dir)) {...}

This check is unnecessary because, if the 'dir' variable has changed to something other than 'fname->dir' (its initial value), that error check has been performed already and, if there was indeed an error, it would have been returned immediately. Besides, this useless IS_ERR() is also confusing static analysis tools.

Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Closes: https://lore.kernel.org/r/202309282202.xZxGdvS3-lkp@intel.com/
Signed-off-by: Luis Henriques <lhenriques@suse.de>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2023-10-09  ceph: fix incorrect revoked caps assert in ceph_fill_file_size()  (Xiubo Li)  1 file, -3/+1
When truncating the inode the MDS will acquire the xlock for the ifile Locker, which will revoke the 'Frwsxl' caps from the clients. But when a client just releases and flushes the 'Fw' caps to the MDS, for example, once the MDS receives the cap flushing msg it assumes the revocation has finished. Then the MDS will continue truncating the inode and issue the truncate notification to all the clients. If a client receives the truncation notification just before it receives the cap flushing ack, it will detect that 'issued | dirty' is still holding the 'Fw' caps.

Cc: stable@vger.kernel.org
Link: https://tracker.ceph.com/issues/56693
Fixes: b0d7c2231015 ("ceph: introduce i_truncate_mutex")
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Milind Changire <mchangir@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2023-09-18  ceph: remove unnecessary check for NULL in parse_longname()  (Luís Henriques)  1 file, -4/+2
Function ceph_get_inode() never returns NULL; instead it returns an ERR_PTR() if something fails. Thus, the check for NULL in parse_longname() is useless and can be dropped. Instead, move there the debug code that does the error checking so that it's only executed if ceph_get_inode() is called. Fixes: dd66df0053ef ("ceph: add support for encrypted snapshot names") Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Signed-off-by: Luís Henriques <lhenriques@suse.de> Reviewed-by: Xiubo Li <xiubli@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2023-09-06  Merge tag 'ceph-for-6.6-rc1' of https://github.com/ceph/ceph-client  (Linus Torvalds)  18 files, -462/+3532
Pull ceph updates from Ilya Dryomov:
 "Mixed with some fixes and cleanups, this brings in reasonably complete fscrypt support to CephFS!

  The list of things which don't work with encryption should be fairly short, mostly around the edges: fallocate (not supported well in CephFS to begin with), copy_file_range (requires re-encryption), non-default striping patterns.

  This was a multi-year effort principally by Jeff Layton with assistance from Xiubo Li, Luís Henriques and others, including several dependent changes in the MDS, netfs helper library and fscrypt framework itself"

* tag 'ceph-for-6.6-rc1' of https://github.com/ceph/ceph-client: (53 commits)
  ceph: make num_fwd and num_retry to __u32
  ceph: make members in struct ceph_mds_request_args_ext a union
  rbd: use list_for_each_entry() helper
  libceph: do not include crypto/algapi.h
  ceph: switch ceph_lookup/atomic_open() to use new fscrypt helper
  ceph: fix updating i_truncate_pagecache_size for fscrypt
  ceph: wait for OSD requests' callbacks to finish when unmounting
  ceph: drop messages from MDS when unmounting
  ceph: update documentation regarding snapshot naming limitations
  ceph: prevent snapshot creation in encrypted locked directories
  ceph: add support for encrypted snapshot names
  ceph: invalidate pages when doing direct/sync writes
  ceph: plumb in decryption during reads
  ceph: add encryption support to writepage and writepages
  ceph: add read/modify/write to ceph_sync_write
  ceph: align data in pages in ceph_sync_write
  ceph: don't use special DIO path for encrypted inodes
  ceph: add truncate size handling support for fscrypt
  ceph: add object version support for sync read
  libceph: allow ceph_osdc_new_request to accept a multi-op read
  ...
2023-08-31  ceph: make num_fwd and num_retry to __u32  (Xiubo Li)  2 files, -87/+106
The num_fwd in MClientRequestForward is int32_t, while the num_fwd in ceph_mds_request_head is __u8. This is buggy: when num_fwd is larger than 256 it will always be truncated to 0 again, and the client couldn't recognize this. Make them __u32 instead. Because old cephs directly copy the raw memory when decoding the request's head, we need to make sure this kclient will be compatible with old cephs. Newer cephs decode the requests depending on the version, which is much simpler and easier to extend with new members.

Link: https://tracker.ceph.com/issues/62145
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Reviewed-by: Milind Changire <mchangir@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2023-08-30  Merge tag 'mm-stable-2023-08-28-18-26' of ↵  (Linus Torvalds)  1 file, -0/+2
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: - Some swap cleanups from Ma Wupeng ("fix WARN_ON in add_to_avail_list") - Peter Xu has a series (mm/gup: Unify hugetlb, speed up thp") which reduces the special-case code for handling hugetlb pages in GUP. It also speeds up GUP handling of transparent hugepages. - Peng Zhang provides some maple tree speedups ("Optimize the fast path of mas_store()"). - Sergey Senozhatsky has improved te performance of zsmalloc during compaction (zsmalloc: small compaction improvements"). - Domenico Cerasuolo has developed additional selftest code for zswap ("selftests: cgroup: add zswap test program"). - xu xin has doe some work on KSM's handling of zero pages. These changes are mainly to enable the user to better understand the effectiveness of KSM's treatment of zero pages ("ksm: support tracking KSM-placed zero-pages"). - Jeff Xu has fixes the behaviour of memfd's MEMFD_NOEXEC_SCOPE_NOEXEC_ENFORCED sysctl ("mm/memfd: fix sysctl MEMFD_NOEXEC_SCOPE_NOEXEC_ENFORCED"). - David Howells has fixed an fscache optimization ("mm, netfs, fscache: Stop read optimisation when folio removed from pagecache"). - Axel Rasmussen has given userfaultfd the ability to simulate memory poisoning ("add UFFDIO_POISON to simulate memory poisoning with UFFD"). - Miaohe Lin has contributed some routine maintenance work on the memory-failure code ("mm: memory-failure: remove unneeded PageHuge() check"). - Peng Zhang has contributed some maintenance work on the maple tree code ("Improve the validation for maple tree and some cleanup"). - Hugh Dickins has optimized the collapsing of shmem or file pages into THPs ("mm: free retracted page table by RCU"). - Jiaqi Yan has a patch series which permits us to use the healthy subpages within a hardware poisoned huge page for general purposes ("Improve hugetlbfs read on HWPOISON hugepages"). - Kemeng Shi has done some maintenance work on the pagetable-check code ("Remove unused parameters in page_table_check"). - More folioification work from Matthew Wilcox ("More filesystem folio conversions for 6.6"), ("Followup folio conversions for zswap"). And from ZhangPeng ("Convert several functions in page_io.c to use a folio"). - page_ext cleanups from Kemeng Shi ("minor cleanups for page_ext"). - Baoquan He has converted some architectures to use the GENERIC_IOREMAP ioremap()/iounmap() code ("mm: ioremap: Convert architectures to take GENERIC_IOREMAP way"). - Anshuman Khandual has optimized arm64 tlb shootdown ("arm64: support batched/deferred tlb shootdown during page reclamation/migration"). - Better maple tree lockdep checking from Liam Howlett ("More strict maple tree lockdep"). Liam also developed some efficiency improvements ("Reduce preallocations for maple tree"). - Cleanup and optimization to the secondary IOMMU TLB invalidation, from Alistair Popple ("Invalidate secondary IOMMU TLB on permission upgrade"). - Ryan Roberts fixes some arm64 MM selftest issues ("selftests/mm fixes for arm64"). - Kemeng Shi provides some maintenance work on the compaction code ("Two minor cleanups for compaction"). - Some reduction in mmap_lock pressure from Matthew Wilcox ("Handle most file-backed faults under the VMA lock"). - Aneesh Kumar contributes code to use the vmemmap optimization for DAX on ppc64, under some circumstances ("Add support for DAX vmemmap optimization for ppc64"). - page-ext cleanups from Kemeng Shi ("add page_ext_data to get client data in page_ext"), ("minor cleanups to page_ext header"). 
- Some zswap cleanups from Johannes Weiner ("mm: zswap: three cleanups"). - kmsan cleanups from ZhangPeng ("minor cleanups for kmsan"). - VMA handling cleanups from Kefeng Wang ("mm: convert to vma_is_initial_heap/stack()"). - DAMON feature work from SeongJae Park ("mm/damon/sysfs-schemes: implement DAMOS tried total bytes file"), ("Extend DAMOS filters for address ranges and DAMON monitoring targets"). - Compaction work from Kemeng Shi ("Fixes and cleanups to compaction"). - Liam Howlett has improved the maple tree node replacement code ("maple_tree: Change replacement strategy"). - ZhangPeng has a general code cleanup - use the K() macro more widely ("cleanup with helper macro K()"). - Aneesh Kumar brings memmap-on-memory to ppc64 ("Add support for memmap on memory feature on ppc64"). - pagealloc cleanups from Kemeng Shi ("Two minor cleanups for pcp list in page_alloc"), ("Two minor cleanups for get pageblock migratetype"). - Vishal Moola introduces a memory descriptor for page table tracking, "struct ptdesc" ("Split ptdesc from struct page"). - memfd selftest maintenance work from Aleksa Sarai ("memfd: cleanups for vm.memfd_noexec"). - MM include file rationalization from Hugh Dickins ("arch: include asm/cacheflush.h in asm/hugetlb.h"). - THP debug output fixes from Hugh Dickins ("mm,thp: fix sloppy text output"). - kmemleak improvements from Xiaolei Wang ("mm/kmemleak: use object_cache instead of kmemleak_initialized"). - More folio-related cleanups from Matthew Wilcox ("Remove _folio_dtor and _folio_order"). - A VMA locking scalability improvement from Suren Baghdasaryan ("Per-VMA lock support for swap and userfaults"). - pagetable handling cleanups from Matthew Wilcox ("New page table range API"). - A batch of swap/thp cleanups from David Hildenbrand ("mm/swap: stop using page->private on tail pages for THP_SWAP + cleanups"). - Cleanups and speedups to the hugetlb fault handling from Matthew Wilcox ("Change calling convention for ->huge_fault"). - Matthew Wilcox has also done some maintenance work on the MM subsystem documentation ("Improve mm documentation"). * tag 'mm-stable-2023-08-28-18-26' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (489 commits) maple_tree: shrink struct maple_tree maple_tree: clean up mas_wr_append() secretmem: convert page_is_secretmem() to folio_is_secretmem() nios2: fix flush_dcache_page() for usage from irq context hugetlb: add documentation for vma_kernel_pagesize() mm: add orphaned kernel-doc to the rst files. mm: fix clean_record_shared_mapping_range kernel-doc mm: fix get_mctgt_type() kernel-doc mm: fix kernel-doc warning from tlb_flush_rmaps() mm: remove enum page_entry_size mm: allow ->huge_fault() to be called without the mmap_lock held mm: move PMD_ORDER to pgtable.h mm: remove checks for pte_index memcg: remove duplication detection for mem_cgroup_uncharge_swap mm/huge_memory: work on folio->swap instead of page->private when splitting folio mm/swap: inline folio_set_swap_entry() and folio_swap_entry() mm/swap: use dedicated entry for swap in folio mm/swap: stop using page->private on tail pages for THP_SWAP selftests/mm: fix WARNING comparing pointer to 0 selftests: cgroup: fix test_kmem_memcg_deletion kernel mem check ...
2023-08-28  Merge tag 'v6.6-vfs.ctime' of ↵  (Linus Torvalds)  5 files, -12/+14
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs timestamp updates from Christian Brauner: "This adds VFS support for multi-grain timestamps and converts tmpfs, xfs, ext4, and btrfs to use them. This carries acks from all relevant filesystems. The VFS always uses coarse-grained timestamps when updating the ctime and mtime after a change. This has the benefit of allowing filesystems to optimize away a lot of metadata updates, down to around 1 per jiffy, even when a file is under heavy writes. Unfortunately, this has always been an issue when we're exporting via NFSv3, which relies on timestamps to validate caches. A lot of changes can happen in a jiffy, so timestamps aren't sufficient to help the client decide to invalidate the cache. Even with NFSv4, a lot of exported filesystems don't properly support a change attribute and are subject to the same problems with timestamp granularity. Other applications have similar issues with timestamps (e.g., backup applications). If we were to always use fine-grained timestamps, that would improve the situation, but that becomes rather expensive, as the underlying filesystem would have to log a lot more metadata updates. This introduces fine-grained timestamps that are used when they are actively queried. This uses the 31st bit of the ctime tv_nsec field to indicate that something has queried the inode for the mtime or ctime. When this flag is set, on the next mtime or ctime update, the kernel will fetch a fine-grained timestamp instead of the usual coarse-grained one. As POSIX generally mandates that when the mtime changes, the ctime must also change the kernel always stores normalized ctime values, so only the first 30 bits of the tv_nsec field are ever used. Filesytems can opt into this behavior by setting the FS_MGTIME flag in the fstype. Filesystems that don't set this flag will continue to use coarse-grained timestamps. Various preparatory changes, fixes and cleanups are included: - Fixup all relevant places where POSIX requires updating ctime together with mtime. This is a wide-range of places and all maintainers provided necessary Acks. - Add new accessors for inode->i_ctime directly and change all callers to rely on them. Plain accesses to inode->i_ctime are now gone and it is accordingly rename to inode->__i_ctime and commented as requiring accessors. - Extend generic_fillattr() to pass in a request mask mirroring in a sense the statx() uapi. This allows callers to pass in a request mask to only get a subset of attributes filled in. - Rework timestamp updates so it's possible to drop the @now parameter the update_time() inode operation and associated helpers. 
- Add inode_update_timestamps() and convert all filesystems to it removing a bunch of open-coding" * tag 'v6.6-vfs.ctime' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (107 commits) btrfs: convert to multigrain timestamps ext4: switch to multigrain timestamps xfs: switch to multigrain timestamps tmpfs: add support for multigrain timestamps fs: add infrastructure for multigrain timestamps fs: drop the timespec64 argument from update_time xfs: have xfs_vn_update_time gets its own timestamp fat: make fat_update_time get its own timestamp fat: remove i_version handling from fat_update_time ubifs: have ubifs_update_time use inode_update_timestamps btrfs: have it use inode_update_timestamps fs: drop the timespec64 arg from generic_update_time fs: pass the request_mask to generic_fillattr fs: remove silly warning from current_time gfs2: fix timestamp handling on quota inodes fs: rename i_ctime field to __i_ctime selinux: convert to ctime accessor functions security: convert to ctime accessor functions apparmor: convert to ctime accessor functions sunrpc: convert to ctime accessor functions ...
2023-08-24  ceph: switch ceph_lookup/atomic_open() to use new fscrypt helper  (Luís Henriques)  2 files, -11/+10
Instead of setting the no-key dentry, use the new fscrypt_prepare_lookup_partial() helper. We still need to mark the directory as incomplete if the directory was just unlocked. In ceph_atomic_open() this fixes a bug where a dentry is incorrectly set with DCACHE_NOKEY_NAME when 'dir' has been evicted but the key is still available (for example, where there's a drop_caches). Signed-off-by: Luís Henriques <lhenriques@suse.de> Reviewed-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Milind Changire <mchangir@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2023-08-24  ceph: fix updating i_truncate_pagecache_size for fscrypt  (Xiubo Li)  2 files, -13/+26
When fscrypt is enabled we will always align the truncate size up to CEPH_FSCRYPT_BLOCK_SIZE, so if we truncate the size within the same block more than once, the later truncates will be skipped and the page cache will not be invalidated for them. This forces invalidating the page caches using the smaller size rather than the real file size. At the same time, add more debug logging and fix the debug log for the truncate code.

Link: https://tracker.ceph.com/issues/58834
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-and-tested-by: Luís Henriques <lhenriques@suse.de>
Reviewed-by: Milind Changire <mchangir@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2023-08-24  ceph: wait for OSD requests' callbacks to finish when unmounting  (Xiubo Li)  3 files, -0/+23
sync_filesystem() will flush all the dirty buffers and submit the osd reqs to the osdc, and is then blocked waiting for all the reqs to finish. But when the reqs' replies come, the reqs are removed from the osdc just before the req->r_callback()s are called, which means sync_filesystem() can be woken up while the req->r_callback()s are still running. This is buggy when the waiter requires the req->r_callback()s to release some resources before continuing. So we need to make sure the req->r_callback()s are called before the reqs are removed from the osdc.

    WARNING: CPU: 4 PID: 168846 at fs/crypto/keyring.c:242 fscrypt_destroy_keyring+0x7e/0xd0
    CPU: 4 PID: 168846 Comm: umount Tainted: G S 6.1.0-rc5-ceph-g72ead199864c #1
    Hardware name: Supermicro SYS-5018R-WR/X10SRW-F, BIOS 2.0 12/17/2015
    RIP: 0010:fscrypt_destroy_keyring+0x7e/0xd0
    RSP: 0018:ffffc9000b277e28 EFLAGS: 00010202
    RAX: 0000000000000002 RBX: ffff88810d52ac00 RCX: ffff88810b56aa00
    RDX: 0000000080000000 RSI: ffffffff822f3a09 RDI: ffff888108f59000
    RBP: ffff8881d394fb88 R08: 0000000000000028 R09: 0000000000000000
    R10: 0000000000000001 R11: 11ff4fe6834fcd91 R12: ffff8881d394fc40
    R13: ffff888108f59000 R14: ffff8881d394f800 R15: 0000000000000000
    FS: 00007fd83f6f1080(0000) GS:ffff88885fd00000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 00007f918d417000 CR3: 000000017f89a005 CR4: 00000000003706e0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    Call Trace:
     <TASK>
     generic_shutdown_super+0x47/0x120
     kill_anon_super+0x14/0x30
     ceph_kill_sb+0x36/0x90 [ceph]
     deactivate_locked_super+0x29/0x60
     cleanup_mnt+0xb8/0x140
     task_work_run+0x67/0xb0
     exit_to_user_mode_prepare+0x23d/0x240
     syscall_exit_to_user_mode+0x25/0x60
     do_syscall_64+0x40/0x80
     entry_SYSCALL_64_after_hwframe+0x63/0xcd
    RIP: 0033:0x7fd83dc39e9b

We need to increase the blocker counter to make sure all the osd requests' callbacks have finished just before calling kill_anon_super() when unmounting.

Link: https://tracker.ceph.com/issues/58126
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-and-tested-by: Luís Henriques <lhenriques@suse.de>
Reviewed-by: Milind Changire <mchangir@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2023-08-24  ceph: drop messages from MDS when unmounting  (Xiubo Li)  7 files, -22/+109
When unmounting, all the dirty buffers will be flushed, and after the last osd request finishes the last reference of the i_count will be released. Then the dirty cap/snap will be flushed to the MDSs, but the unmount won't wait for the possible acks; those acks would ihold the inodes in order to update the metadata locally, which makes no sense any more at this point. This makes evict_inodes() skip these inodes. If encryption is enabled the kernel generates a warning when removing the encrypt keys while the skipped inodes still hold the keyring:

    WARNING: CPU: 4 PID: 168846 at fs/crypto/keyring.c:242 fscrypt_destroy_keyring+0x7e/0xd0
    CPU: 4 PID: 168846 Comm: umount Tainted: G S 6.1.0-rc5-ceph-g72ead199864c #1
    Hardware name: Supermicro SYS-5018R-WR/X10SRW-F, BIOS 2.0 12/17/2015
    RIP: 0010:fscrypt_destroy_keyring+0x7e/0xd0
    RSP: 0018:ffffc9000b277e28 EFLAGS: 00010202
    RAX: 0000000000000002 RBX: ffff88810d52ac00 RCX: ffff88810b56aa00
    RDX: 0000000080000000 RSI: ffffffff822f3a09 RDI: ffff888108f59000
    RBP: ffff8881d394fb88 R08: 0000000000000028 R09: 0000000000000000
    R10: 0000000000000001 R11: 11ff4fe6834fcd91 R12: ffff8881d394fc40
    R13: ffff888108f59000 R14: ffff8881d394f800 R15: 0000000000000000
    FS: 00007fd83f6f1080(0000) GS:ffff88885fd00000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 00007f918d417000 CR3: 000000017f89a005 CR4: 00000000003706e0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    Call Trace:
     <TASK>
     generic_shutdown_super+0x47/0x120
     kill_anon_super+0x14/0x30
     ceph_kill_sb+0x36/0x90 [ceph]
     deactivate_locked_super+0x29/0x60
     cleanup_mnt+0xb8/0x140
     task_work_run+0x67/0xb0
     exit_to_user_mode_prepare+0x23d/0x240
     syscall_exit_to_user_mode+0x25/0x60
     do_syscall_64+0x40/0x80
     entry_SYSCALL_64_after_hwframe+0x63/0xcd
    RIP: 0033:0x7fd83dc39e9b

Later the kernel will crash when iput()ing these inodes and dereferencing "sb->s_master_keys", which has already been released by generic_shutdown_super().

Link: https://tracker.ceph.com/issues/59162
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-and-tested-by: Luís Henriques <lhenriques@suse.de>
Reviewed-by: Milind Changire <mchangir@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2023-08-24  ceph: prevent snapshot creation in encrypted locked directories  (Luís Henriques)  1 file, -0/+5
With snapshot names encryption we can not allow snapshots to be created in locked directories because the names wouldn't be encrypted. This patch forces the directory to be unlocked to allow a snapshot to be created. Signed-off-by: Luís Henriques <lhenriques@suse.de> Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Milind Changire <mchangir@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2023-08-24  ceph: add support for encrypted snapshot names  (Luís Henriques)  3 files, -41/+201
Since filenames in encrypted directories are encrypted and shown as a base64-encoded string when the directory is locked, make snapshot names show a similar behaviour. When creating a snapshot, the .snap directories for every subdirectory will show the snapshot name in the "long format":

    # mkdir .snap/my-snap
    # ls my-dir/.snap/
    _my-snap_1099511627782

Encrypted snapshots will need to be able to handle these by encrypting/decrypting only the snapshot part of the string ('my-snap'). Also, since the MDS prevents snapshot names from being longer than 240 characters, it is necessary to adapt CEPH_NOHASH_NAME_MAX to accommodate this extra limitation.

[ idryomov: drop const on !CONFIG_FS_ENCRYPTION branch too ]

Signed-off-by: Luís Henriques <lhenriques@suse.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Milind Changire <mchangir@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2023-08-24  ceph: invalidate pages when doing direct/sync writes  (Luís Henriques)  1 file, -5/+14
When doing a direct/sync write, we need to invalidate the page cache in the range being written to. If we don't do this, the cache will include invalid data as we just did a write that avoided the page cache. In the event that invalidation fails, just ignore the error. That likely just means that we raced with another task doing a buffered write, in which case we want to leave the page intact anyway. [ jlayton: minor comment update ] Signed-off-by: Luís Henriques <lhenriques@suse.de> Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Milind Changire <mchangir@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
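A hedged sketch of the described invalidation: drop any cached pages covering the written range and deliberately ignore a failure, since it likely means a racing buffered write wants its page kept. The helper shape is illustrative; invalidate_inode_pages2_range() is the existing kernel API.

    #include <linux/fs.h>
    #include <linux/pagemap.h>

    static void drop_written_range(struct address_space *mapping,
                                   loff_t pos, size_t count)
    {
            /* Errors are intentionally ignored, per the changelog. */
            invalidate_inode_pages2_range(mapping, pos >> PAGE_SHIFT,
                                          (pos + count - 1) >> PAGE_SHIFT);
    }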
2023-08-24  ceph: plumb in decryption during reads  (Jeff Layton)  2 files, -34/+125
Force the use of sparse reads when the inode is encrypted, and add the appropriate code to decrypt the extent map after receiving. Note that the crypto block may be smaller than a page, but the reverse cannot be true. Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Xiubo Li <xiubli@redhat.com> Reviewed-and-tested-by: Luís Henriques <lhenriques@suse.de> Reviewed-by: Milind Changire <mchangir@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2023-08-24  ceph: add encryption support to writepage and writepages  (Jeff Layton)  2 files, -21/+98
Allow writepage to issue encrypted writes. Extend out the requested size and offset to cover complete blocks, and then encrypt and write them to the OSDs. Add the appropriate machinery to write back dirty data with encryption. Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Xiubo Li <xiubli@redhat.com> Reviewed-and-tested-by: Luís Henriques <lhenriques@suse.de> Reviewed-by: Milind Changire <mchangir@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2023-08-24  ceph: add read/modify/write to ceph_sync_write  (Jeff Layton)  1 file, -28/+290
When doing a synchronous write on an encrypted inode, we have no guarantee that the caller is writing crypto block-aligned data. When that happens, we must do a read/modify/write cycle. First, expand the range to cover complete blocks. If we had to change the original pos or length, issue a read to fill the first and/or last pages, and fetch the version of the object from the result. We then copy data into the pages as usual, encrypt the result and issue a write prefixed by an assertion that the version hasn't changed. If it has changed then we restart the whole thing again. If there is no object at that position in the file (-ENOENT), we prefix the write on an exclusive create of the object instead. Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Xiubo Li <xiubli@redhat.com> Reviewed-and-tested-by: Luís Henriques <lhenriques@suse.de> Reviewed-by: Milind Changire <mchangir@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2023-08-24  ceph: align data in pages in ceph_sync_write  (Jeff Layton)  1 file, -10/+10
Encrypted files will need to be dealt with in block-sized chunks and once we do that, the way that ceph_sync_write aligns the data in the bounce buffer won't be acceptable. Change it to align the data the same way it would be aligned in the pagecache. Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Xiubo Li <xiubli@redhat.com> Reviewed-and-tested-by: Luís Henriques <lhenriques@suse.de> Reviewed-by: Milind Changire <mchangir@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2023-08-24  ceph: don't use special DIO path for encrypted inodes  (Jeff Layton)  1 file, -2/+4
Eventually I want to merge the synchronous and direct read codepaths, possibly via new netfs infrastructure. For now, the direct path is not crypto-enabled, so use the sync read/write paths instead. Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Xiubo Li <xiubli@redhat.com> Reviewed-and-tested-by: Luís Henriques <lhenriques@suse.de> Reviewed-by: Milind Changire <mchangir@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2023-08-24  ceph: add truncate size handling support for fscrypt  (Xiubo Li)  4 files, -12/+234
This will transfer the encrypted last block contents to the MDS along with the truncate request only when the new size is smaller and not aligned to the fscrypt BLOCK size. When the last block is located in a file hole, the truncate request will only contain the header. The MDS could fail to do the truncate if another client or process has already updated the RADOS object which contains the last block; it will return -EAGAIN and the kclient then needs to retry. The RMW takes around 50ms, and we let it retry 20 times for now.

Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-and-tested-by: Luís Henriques <lhenriques@suse.de>
Reviewed-by: Milind Changire <mchangir@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
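A hedged sketch of the -EAGAIN retry pattern the changelog describes; the retry limit comes from the text, while the helper names and the callback indirection are hypothetical (the real code path issues the truncate with the encrypted last block itself).

    #include <linux/fs.h>
    #include <linux/errno.h>

    #define TRUNC_RETRIES 20        /* per the changelog: each RMW is ~50ms */

    static int truncate_with_retry(struct inode *inode, loff_t size,
                                   int (*do_fscrypt_truncate)(struct inode *, loff_t))
    {
            int tries = 0;
            int ret;

            do {
                    /* Racing updates to the last-block object make the MDS
                     * return -EAGAIN; redo the read/modify/write and retry. */
                    ret = do_fscrypt_truncate(inode, size);
            } while (ret == -EAGAIN && ++tries < TRUNC_RETRIES);

            return ret;
    }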
2023-08-24  ceph: add object version support for sync read  (Xiubo Li)  2 files, -12/+35
Turn the guts of ceph_sync_read into a new helper that takes an inode and an offset instead of a kiocb struct, and make ceph_sync_read call the helper as a wrapper. Make the new helper always return the last object's version. Signed-off-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-and-tested-by: Luís Henriques <lhenriques@suse.de> Reviewed-by: Milind Changire <mchangir@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>