path: root/fs
2025-05-08  xfs: update atomic write limits  (John Garry)  [2 files changed, -7/+47]

Update the limits returned from xfs_get_atomic_write_{min, max, max_opt}(). No reflink support always means no CoW-based atomic writes.

For xfs_get_atomic_write_min(), we support only the blocksize, and that depends on HW or reflink support.

For xfs_get_atomic_write_max(), with no reflink we are limited to the blocksize, and only when there is HW support. Otherwise we are limited to the combined limit in mp->m_atomic_write_unit_max.

For xfs_get_atomic_write_max_opt(), we are ultimately limited by the bdev atomic write limit. If xfs_get_atomic_write_max() does not report > 1x blocksize, then just continue to report 0 as before.

Reviewed-by: Darrick J. Wong <djwong@kernel.org>
[djwong: update comments in the helper functions]
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: John Garry <john.g.garry@oracle.com>
2025-05-08  xfs: add xfs_calc_atomic_write_unit_max()  (John Garry)  [7 files changed, -0/+263]

Now that CoW-based atomic writes are supported, update the max size of an atomic write for the data device. The limit of a CoW-based atomic write will be the limit on the number of log items which can fit into a single transaction.

In addition, the max atomic write size needs to be aligned to the agsize. Limit the size of atomic writes to the greatest power-of-two factor of the agsize so that allocations for an atomic write will always be aligned compatibly with the alignment requirements of the storage.

Function xfs_atomic_write_logitems() is added to find the limit on the number of log items which can fit in a single transaction.

Amend the max atomic write computation to create a new transaction reservation type, and compute the maximum size of an atomic write completion (in fsblocks) based on this new transaction reservation. Initially, tr_atomic_write is a clone of tr_itruncate, which provides a reasonable level of parallelism. In the next patch, we'll add a mount option so that sysadmins can configure their own limits.

[djwong: use a new reservation type for atomic write ioends, refactor group limit calculations]
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
[jpg: rounddown power-of-2 always]
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: John Garry <john.g.garry@oracle.com>
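Editor's note: the alignment rule above reduces to simple bit arithmetic, since the greatest power-of-two factor of a value is its lowest set bit. A minimal userspace sketch with made-up example numbers (illustrative only, not the kernel code):

        #include <stdint.h>
        #include <stdio.h>

        /* Largest power of two dividing x (x must be non-zero): lowest set bit. */
        static uint32_t max_pow2_factor(uint32_t x)
        {
                return x & (~x + 1);
        }

        int main(void)
        {
                uint32_t agsize = 229376;   /* example AG size in FS blocks: 7 * 2^15 */
                uint32_t tr_limit = 65536;  /* example per-transaction completion limit */
                uint32_t unit_max = tr_limit;

                /* Clamp the atomic write unit to a power-of-two factor of the
                 * AG size so allocations stay compatibly aligned. */
                if (unit_max > max_pow2_factor(agsize))
                        unit_max = max_pow2_factor(agsize);

                printf("atomic write unit max: %u blocks\n", unit_max);  /* 32768 */
                return 0;
        }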
2025-05-08  xfs: add xfs_file_dio_write_atomic()  (John Garry)  [1 file changed, -0/+68]

Add xfs_file_dio_write_atomic() for dedicated handling of atomic writes.

Now HW offload will not be required to support atomic writes and is an optional feature. CoW-based atomic writes will be supported with out-of-place writes and atomic extent remapping.

Either mode of operation may be used for an atomic write, depending on the circumstances. The preferred method is HW offload as it will be faster. If HW offload is not available then we always use the CoW-based method. If HW offload is available but not possible to use, then again we use the CoW-based method. HW offload would not be possible when, for example, the write length exceeds the HW offload limit, the write spans multiple extents, or the disk blocks are unaligned.

Apart from the write exceeding the HW offload limit, the other conditions for HW offload usage can only be detected in the iomap handling for the write. As such, we use a fallback method to issue the write if we detect in the ->iomap_begin() handler that HW offload is not possible. The special code -ENOPROTOOPT is returned from ->iomap_begin() to inform the caller that HW offload is not possible.

In other words, atomic writes are supported on any filesystem that can perform out-of-place write remapping atomically (i.e. reflink), up to some fairly large size. If the conditions are right (a single correctly aligned overwrite mapping) then the filesystem will use any available hardware support to avoid the filesystem metadata updates.

Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: John Garry <john.g.garry@oracle.com>
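Editor's note: the choice between the two modes is transparent to applications; a caller simply issues the write with RWF_ATOMIC. A minimal userspace sketch (file name and sizes are placeholders, and the write must respect the limits advertised via statx; the RWF_ATOMIC fallback define only covers older toolchain headers):

        #define _GNU_SOURCE
        #include <fcntl.h>
        #include <stdio.h>
        #include <stdlib.h>
        #include <string.h>
        #include <sys/uio.h>

        #ifndef RWF_ATOMIC
        #define RWF_ATOMIC 0x00000040   /* uapi value, for older headers only */
        #endif

        int main(void)
        {
                void *buf;
                struct iovec iov;
                ssize_t ret;
                int fd = open("testfile", O_RDWR | O_DIRECT);

                if (fd < 0) {
                        perror("open");
                        return 1;
                }
                if (posix_memalign(&buf, 4096, 65536))
                        return 1;
                memset(buf, 0xab, 65536);

                iov.iov_base = buf;
                iov.iov_len = 65536;

                /* All-or-nothing 64 KiB write at offset 0; the filesystem picks
                 * HW offload or the CoW-based path internally. */
                ret = pwritev2(fd, &iov, 1, 0, RWF_ATOMIC);
                if (ret < 0)
                        perror("pwritev2(RWF_ATOMIC)");
                return ret == 65536 ? 0 : 1;
        }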
2025-05-08  xfs: commit CoW-based atomic writes atomically  (John Garry)  [6 files changed, -1/+82]

When completing a CoW-based write, each extent range mapping update is covered by a separate transaction. For a CoW-based atomic write, all mappings must be changed at once, so change to use a single transaction.

Note that there is a limit on the number of log intent items which can fit into a single transaction, but this is being ignored for now since the count of items for a typical atomic write would be much less than is typically supported. A typical atomic write would be expected to be 64KB or less, which means at most 16 extent unmaps, which is quite small.

Reviewed-by: Darrick J. Wong <djwong@kernel.org>
[djwong: add tr_atomic_ioend]
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: John Garry <john.g.garry@oracle.com>
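Editor's note: the "16" above is just the block count of the write. A worked check, assuming 4 KiB filesystem blocks:

        #include <stdio.h>

        /* Worst case: one extent unmap per filesystem block covered by the write. */
        static unsigned int max_extent_unmaps(unsigned int write_len, unsigned int blocksize)
        {
                return write_len / blocksize;
        }

        int main(void)
        {
                /* 64 KiB atomic write on a 4 KiB-block filesystem -> 16 unmaps */
                printf("%u\n", max_extent_unmaps(64 * 1024, 4096));
                return 0;
        }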
2025-05-08  xfs: add large atomic writes checks in xfs_direct_write_iomap_begin()  (John Garry)  [1 file changed, -2/+60]

When large atomic writes (> 1x FS block) are supported, there will be various occasions when HW offload may not be possible. Such instances include:
- an extent mapping which is unaligned wrt the write length
- extent mappings which do not cover the full write, e.g. the write spans sparse or mixed-mapping extents
- a write length greater than HW offload can support
- no hardware support at all

In those cases, we need to fall back to the CoW-based atomic write mode. For this, report the special code -ENOPROTOOPT to inform the caller that the HW offload-based method is not possible.

In addition to the occasions mentioned, if the write covers an unallocated range, we again judge that we need to rely on the CoW-based method when we would need to allocate anything more than 1x block. This is because if we allocate fewer blocks than are required for the write, then again the HW offload-based method would not be possible. So we are taking a pessimistic approach to writes covering unallocated space.

Reviewed-by: Darrick J. Wong <djwong@kernel.org>
[djwong: various cleanups]
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: John Garry <john.g.garry@oracle.com>
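Editor's note: condensed into a standalone model, the fallback policy described above reads roughly as follows (the struct and all names are hypothetical; the real checks live in the XFS ->iomap_begin() handler):

        #include <errno.h>
        #include <stdbool.h>
        #include <stdio.h>

        /* Hypothetical summary of the mapping found for the write. */
        struct wmap {
                bool aligned;                /* extent aligned wrt the write length */
                bool covers_whole_write;     /* one mapping covers the full range */
                bool needs_allocation;       /* write lands over unallocated space */
                unsigned int alloc_blocks;   /* blocks that would have to be allocated */
        };

        /* Return 0 to use HW offload, -ENOPROTOOPT to fall back to the CoW path. */
        static int hw_atomic_check(const struct wmap *m, unsigned int write_blocks,
                                   unsigned int hw_max_blocks)
        {
                if (!hw_max_blocks || write_blocks > hw_max_blocks)
                        return -ENOPROTOOPT;  /* no HW support, or write too long */
                if (m->needs_allocation && m->alloc_blocks > 1)
                        return -ENOPROTOOPT;  /* pessimistic for unallocated ranges */
                if (!m->aligned || !m->covers_whole_write)
                        return -ENOPROTOOPT;  /* sparse, mixed-mapping or unaligned */
                return 0;
        }

        int main(void)
        {
                struct wmap m = { .aligned = true, .covers_whole_write = false };

                printf("%d\n", hw_atomic_check(&m, 16, 32));  /* negative: fall back to CoW */
                return 0;
        }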
2025-05-08  xfs: add xfs_atomic_write_cow_iomap_begin()  (John Garry)  [6 files changed, -1/+159]

For CoW-based atomic writes, reuse the infrastructure for reflink CoW fork support.

Add an ->iomap_begin() callback, xfs_atomic_write_cow_iomap_begin(), to create staging mappings in the CoW fork for atomic write updates.

The general steps in the function are as follows:
- find the extent mapping in the CoW fork for the FS block range being written
- if a partial or full extent is found, proceed to process the found extent
- if no extent is found, map new blocks into the CoW fork
- convert unwritten blocks in the extent if required
- update the iomap extent mapping and return

The bulk of this function is quite similar to the processing in xfs_reflink_allocate_cow(), where we try to find an extent mapping; if none exists, then allocate a new extent in the CoW fork, convert unwritten blocks, and return a mapping.

Performance testing has shown the XFS_ILOCK_EXCL locking to be quite a bottleneck, so this is an area which could be optimised in future.

Christoph Hellwig contributed almost all of the code in xfs_atomic_write_cow_iomap_begin().

Reviewed-by: Darrick J. Wong <djwong@kernel.org>
[djwong: add a new xfs_can_sw_atomic_write to convey intent better]
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: John Garry <john.g.garry@oracle.com>
2025-05-08  xfs: refine atomic write size check in xfs_file_write_iter()  (John Garry)  [3 files changed, -12/+39]

Currently the size of atomic write allowed is fixed at the blocksize.

To start to lift this restriction, partly refactor xfs_report_atomic_write() into helpers - xfs_get_atomic_write_{min, max}() - and use those helpers to find the per-inode atomic write limits and check according to that.

Also add xfs_get_atomic_write_max_opt() to return the optimal limit, and just return 0 since large atomics aren't supported yet.

Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: John Garry <john.g.garry@oracle.com>
2025-05-08  xfs: refactor xfs_reflink_end_cow_extent()  (John Garry)  [1 file changed, -30/+42]

Refactor xfs_reflink_end_cow_extent() into separate parts which process the CoW range and commit the transaction.

This refactoring will be used in future when it is required to commit a range of extents as a single transaction, similar to how it was done pre-commit d6f215f359637.

Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: John Garry <john.g.garry@oracle.com>
2025-05-08  xfs: allow block allocator to take an alignment hint  (John Garry)  [2 files changed, -1/+10]

Add a BMAPI flag to provide a hint to the block allocator to align extents according to the extszhint.

This will be useful for atomic writes to ensure that we are not allocated extents which are unsuitable for atomic writes.

Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: John Garry <john.g.garry@oracle.com>
2025-05-08  xfs: ignore HW which cannot atomic write a single block  (Darrick J. Wong)  [3 files changed, -10/+14]

Currently only HW which can write at least 1x block is supported. For supporting atomic writes > 1x block, a CoW-based method will also be used, and this will not be restricted to HW which can write >= 1x block.

However, for deciding if HW-based atomic writes can be used, we need to start adding checks for write length < HW min, which complicates the code. Indeed, a statx field similar to unit_max_opt would also need to be added for this minimum, which is undesirable. HW which can only write > 1x blocks would be uncommon and quite weird, so let's just not support it.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2025-05-08  xfs: add helpers to compute transaction reservation for finishing intent items  (Darrick J. Wong)  [2 files changed, -31/+152]

In the transaction reservation code, hoist the logic that computes the reservation needed to finish one log intent item into separate helper functions. These will be used in subsequent patches to estimate the number of blocks that an online repair can commit to reaping in the same transaction as the change committing the new data structure.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: John Garry <john.g.garry@oracle.com>
Signed-off-by: John Garry <john.g.garry@oracle.com>
2025-05-08  xfs: add helpers to compute log item overhead  (Darrick J. Wong)  [12 files changed, -3/+88]

Add selected helpers to estimate the transaction reservation required to write various log intent and buffer items to the log. These helpers will be used by the online repair code for more precise estimations of how much work can be done in a single transaction.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: John Garry <john.g.garry@oracle.com>
Signed-off-by: John Garry <john.g.garry@oracle.com>
2025-05-08  xfs: separate out setting buftarg atomic writes limits  (Darrick J. Wong)  [3 files changed, -12/+28]

Separate out setting buftarg atomic writes limits into a dedicated function, xfs_configure_buftarg_atomic_writes(), to keep the specific functionality self-contained.

For naming consistency, rename xfs_setsize_buftarg() -> xfs_configure_buftarg().

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
[jpg: separate out from patch "xfs: ignore HW which ..."]
Signed-off-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2025-05-08  xfs: rename xfs_inode_can_atomicwrite() -> xfs_inode_can_hw_atomic_write()  (John Garry)  [3 files changed, -5/+3]

In future we will want to be able to check if specifically HW offload-based atomic writes are possible, so rename xfs_inode_can_atomicwrite() -> xfs_inode_can_hw_atomicwrite().

Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
[djwong: add an underscore to be consistent with everything else]
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: John Garry <john.g.garry@oracle.com>
2025-05-08  xfs: only call xfs_setsize_buftarg once per buffer target  (Darrick J. Wong)  [2 files changed, -14/+30]

It's silly to call xfs_setsize_buftarg from xfs_alloc_buftarg with the block device LBA size because we don't need to ask the block layer to validate a geometry number that it provided us. Instead, set the preliminary bt_meta_sector* fields to the LBA size in preparation for reading the primary super.

However, we still want to flush and invalidate the pagecache for all three block devices before we start reading metadata from those devices, so call sync_blockdev() per bdev in xfs_alloc_buftarg().

This will enable a subsequent patch to validate hw atomic write geometry against the filesystem geometry.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: John Garry <john.g.garry@oracle.com>
[jpg: call sync_blockdev() from xfs_alloc_buftarg()]
Signed-off-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2025-05-08  fs: add atomic write unit max opt to statx  (John Garry)  [3 files changed, -3/+7]

XFS will be able to support large atomic writes (atomic write > 1x block) in future. This will be achieved by using different operating methods, depending on the size of the write.

Specifically, a new method of operation based on FS atomic extent remapping will be supported in addition to the current HW offload-based method. The FS method will generally perform appreciably slower than the HW-offload method. However, the FS method will typically be able to contribute to achieving a larger atomic write unit max limit.

XFS will support a hybrid mode, where the HW offload method will be used when possible, i.e. HW offload is used when the length of the write is supported, and at other times FS-based atomic writes will be used.

As such, there is an atomic write length at which the user may experience appreciably slower performance. Advertise this limit in a new statx field, stx_atomic_write_unit_max_opt. When zero, it means that there is no such performance boundary.

Masks STATX{_ATTR}_WRITE_ATOMIC can be used to get this new field. This is ok for older kernels which don't support this new field, as they would report 0 in this field (from zeroing in cp_statx()) already. Furthermore, those older kernels don't support large atomic writes - apart from block fops, but there the performance would be consistent for atomic writes in range [unit min, unit max].

Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Acked-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: John Garry <john.g.garry@oracle.com>
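Editor's note: from userspace the three limits can be read with statx(2). A minimal sketch, assuming kernel and glibc headers new enough to define STATX_WRITE_ATOMIC and the stx_atomic_write_unit_* fields (the *_opt field being the one added here):

        #define _GNU_SOURCE
        #include <fcntl.h>
        #include <stdio.h>
        #include <sys/stat.h>

        int main(int argc, char **argv)
        {
                struct statx stx;

                if (argc < 2 || statx(AT_FDCWD, argv[1], 0, STATX_WRITE_ATOMIC, &stx)) {
                        perror("statx");
                        return 1;
                }
                if (!(stx.stx_attributes & STATX_ATTR_WRITE_ATOMIC)) {
                        printf("atomic writes not reported for this file\n");
                        return 0;
                }
                printf("unit_min=%u unit_max=%u unit_max_opt=%u\n",
                       stx.stx_atomic_write_unit_min,
                       stx.stx_atomic_write_unit_max,
                       stx.stx_atomic_write_unit_max_opt);
                return 0;
        }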
2025-05-08  bcachefs: Don't aggressively discard the journal  (Kent Overstreet)  [1 file changed, -3/+4]

We frequently use 'bcachefs list_journal -a' for debugging, as it provides a record of all btree transactions, and a history of what happened. But it's not so useful if we immediately discard journal buckets right after they're no longer dirty.

This tweaks journal reclaim to only discard when we're low on space, keeping the journal mostly un-discarded.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-08  bcachefs: Ensure superblock gets written when we go ERO  (Kent Overstreet)  [1 file changed, -0/+5]

When we go emergency read-only, make sure we do a final write_super() to persist counters and error counts - this can be critical for piecing together what fsck was doing.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-07  bcachefs: Filter out harmless EROFS error messages  (Kent Overstreet)  [1 file changed, -1/+2]

These just indicate that we're shutting down.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-07  bcachefs: journal_shutdown is EROFS, not EIO  (Kent Overstreet)  [1 file changed, -1/+1]

We often filter out EROFS errors to avoid log spew after an emergency shutdown - journal_shutdown is just another emergency shutdown error.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-07  smb: client: Avoid race in open_cached_dir with lease breaks  (Paul Aurich)  [1 file changed, -8/+2]

A pre-existing valid cfid returned from find_or_create_cached_dir might race with a lease break, meaning open_cached_dir doesn't consider it valid, and thinks it's newly-constructed. This leaks a dentry reference if the allocation occurs before the queued lease break work runs.

Avoid the race by holding the cfid_list_lock across find_or_create_cached_dir and the check of its result.

Cc: stable@vger.kernel.org
Reviewed-by: Henrique Carvalho <henrique.carvalho@suse.com>
Signed-off-by: Paul Aurich <paul@darkrain42.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2025-05-07  Merge tag 'erofs-for-6.15-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs  (Linus Torvalds)  [3 files changed, -20/+16]

Pull erofs fixes from Gao Xiang:

- Add a new reviewer, Hongbo Li, for better community development

- Fix an I/O hang out of file-backed mounts

- Address a rare data corruption caused by concurrent I/Os on the same deduplicated compressed data

- Minor cleanup

* tag 'erofs-for-6.15-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs:
  erofs: ensure the extra temporary copy is valid for shortened bvecs
  erofs: remove unused enum type
  fs/erofs/fileio: call erofs_onlinefolio_split() after bio_add_folio()
  MAINTAINERS: erofs: add myself as reviewer
2025-05-07  fs: aio: initialize .ki_write_stream of read-write request  (Ming Lei)  [1 file changed, -0/+1]

AIO needs to initialize .ki_write_stream explicitly for read/write requests, otherwise a random .ki_write_stream value is used, causing -EINVAL to be returned randomly for AIO writes.

Cc: Christoph Hellwig <hch@lst.de>
Cc: Keith Busch <kbusch@kernel.org>
Cc: Kanchan Joshi <joshi.k@samsung.com>
Fixes: c27683da6406 ("block: expose write streams for block device nodes")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20250507133328.3040255-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-07  hfsplus: use bdev_rw_virt in hfsplus_submit_bio  (Christoph Hellwig)  [1 file changed, -37/+9]

Replace the code building a bio from a kernel direct map address and submitting it synchronously with the bdev_rw_virt helper.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Yangtao Li <frank.li@vivo.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20250507120451.4000627-20-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-07  btrfs: use bdev_rw_virt in scrub_one_super  (Christoph Hellwig)  [1 file changed, -8/+2]

Replace the code building a bio from a kernel direct map address and submitting it synchronously with the bdev_rw_virt helper.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: David Sterba <dsterba@suse.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Link: https://lore.kernel.org/r/20250507120451.4000627-19-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-07  xfs: simplify building the bio in xlog_write_iclog  (Christoph Hellwig)  [1 file changed, -26/+6]

Use the bio_add_virt_nofail and bio_add_vmalloc helpers to abstract away the details of the memory allocation.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Link: https://lore.kernel.org/r/20250507120451.4000627-18-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-07  xfs: simplify xfs_rw_bdev  (Christoph Hellwig)  [1 file changed, -18/+12]

Delegate to bdev_rw_virt when operating on non-vmalloc memory and use bio_add_vmalloc_chunk to insulate xfs from the details of adding vmalloc memory to a bio.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Link: https://lore.kernel.org/r/20250507120451.4000627-17-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-07  xfs: simplify xfs_buf_submit_bio  (Christoph Hellwig)  [1 file changed, -35/+8]

Convert the __bio_add_page(..., virt_to_page(), ...) pattern to the bio_add_virt_nofail helper implementing it, and use bio_add_vmalloc to insulate xfs from the details of adding vmalloc memory to a bio.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20250507120451.4000627-16-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-07  zonefs: use bdev_rw_virt in zonefs_read_super  (Christoph Hellwig)  [1 file changed, -22/+12]

Switch zonefs_read_super to allocate the superblock buffer using kmalloc, which falls back to the page allocator for a PAGE_SIZE allocation but gives us a kernel virtual address, and then use bdev_rw_virt to perform the synchronous read into it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20250507120451.4000627-12-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-07  gfs2: use bdev_rw_virt in gfs2_read_super  (Christoph Hellwig)  [1 file changed, -15/+9]

Switch gfs2_read_super to allocate the superblock buffer using kmalloc, which falls back to the page allocator for a PAGE_SIZE allocation but gives us a kernel virtual address, and then use bdev_rw_virt to perform the synchronous read into it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20250507120451.4000627-11-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-07  udf: Make sure i_lenExtents is uptodate on inode eviction  (Jan Kara)  [1 file changed, -1/+1]

UDF maintains the total length of all extents in i_lenExtents. Generally we keep extent lengths (and thus i_lenExtents) block aligned because it makes the file appending logic simpler. However, the standard mandates that the inode size must match the length of all extents, and thus we trim the last extent when closing the file. To catch possible bugs we also verify that i_lenExtents matches i_size when evicting the inode from memory.

Commit b405c1e58b73 ("udf: refactor udf_next_aext() to handle error") however broke the code updating i_lenExtents, and thus udf_evict_inode() ended up spewing lots of errors about incorrectly sized extents although the extents were actually sized properly. Fix the updating of i_lenExtents to silence the errors.

Fixes: b405c1e58b73 ("udf: refactor udf_next_aext() to handle error")
CC: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
2025-05-07  erofs: ensure the extra temporary copy is valid for shortened bvecs  (Gao Xiang)  [1 file changed, -17/+14]

When compressed data deduplication is enabled, multiple logical extents may reference the same compressed physical cluster.

The previous commit 94c43de73521 ("erofs: fix wrong primary bvec selection on deduplicated extents") already avoids using shortened bvecs. However, in such cases, the extra temporary buffers also need to be preserved for later use in z_erofs_fill_other_copies() to prevent data corruption. IOWs, extra temporary buffers have to be retained not only due to varying start relative offsets (`pageofs_out`, as indicated by `pcl->multibases`) but also because of shortened bvecs.

android.hardware.graphics.composer@2.1.so : 270696 bytes
     0:        0..  204185 |  204185 : 628019200.. 628084736 |   65536
  -> 1:   204185..  225536 |   21351 : 544063488.. 544129024 |   65536
     2:   225536..  270696 |   45160 :         0..         0 |       0

com.android.vndk.v28.apex : 93814897 bytes
  ...
     364: 53869896..54095257 |  225361 : 543997952.. 544063488 |  65536
  -> 365: 54095257..54309344 |  214087 : 544063488.. 544129024 |  65536
     366: 54309344..54514557 |  205213 : 544129024.. 544194560 |  65536
  ...

Both 204185 and 54095257 have the same start relative offset of 3481, but the logical page 55 of `android.hardware.graphics.composer@2.1.so` ranges from 225280 to 229632, forming a shortened bvec [225280, 225536) that cannot be used for decompressing the range from 54095257 to 54309344 of `com.android.vndk.v28.apex`.

Since `pcl->multibases` is already meaningless, just mark `be->keepxcpy` on demand for simplicity.

Again, this issue can only lead to data corruption if `-Ededupe` is on.

Fixes: 94c43de73521 ("erofs: fix wrong primary bvec selection on deduplicated extents")
Reviewed-by: Hongbo Li <lihongbo22@huawei.com>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20250506101850.191506-1-hsiangkao@linux.alibaba.com
2025-05-06  kill vfs_submount()  (Al Viro)  [2 files changed, -23/+1]

The last remaining user of vfs_submount() (tracefs) is easy to convert to fs_context_for_submount(); do that and bury that thing, along with SB_SUBMOUNT.

Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Tested-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-05-06  f2fs: drop usage of folio_index  (Kairui Song)  [3 files changed, -5/+5]

folio_index is only needed for mixed usage of page cache and swap cache; for pure page cache usage, the caller can just use folio->index instead.

It can't be a swap cache folio here. Swap mapping may only call into fs through `swap_rw`, but f2fs does not use that method for swap.

Signed-off-by: Kairui Song <kasong@tencent.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org> (maintainer:F2FS FILE SYSTEM)
Cc: Chao Yu <chao@kernel.org> (maintainer:F2FS FILE SYSTEM)
Cc: linux-f2fs-devel@lists.sourceforge.net (open list:F2FS FILE SYSTEM)
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Chao Yu <chao@kernel.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2025-05-06  f2fs: support FAULT_TIMEOUT  (Chao Yu)  [3 files changed, -0/+21]

Support injecting a timeout fault into a function; currently it only supports injecting a timeout into the commit_atomic_write flow, to reproduce an inconsistency bug like the one fixed by commit f098aeba04c9 ("f2fs: fix to avoid atomicity corruption of atomic file").

By default, the new fault type will inject a 1000ms timeout, and the timeout process can be interrupted by SIGKILL.

Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2025-05-06  f2fs: handle error cases of memory donation  (Daeho Jeong)  [3 files changed, -10/+27]

When removing a memory donation, we need to handle some error cases like ENOENT and EACCES (indicating the range has already been donated).

Signed-off-by: Daeho Jeong <daehojeong@google.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2025-05-06  f2fs: fix to bail out in get_new_segment()  (Chao Yu)  [1 file changed, -1/+5]

------------[ cut here ]------------
WARNING: CPU: 3 PID: 579 at fs/f2fs/segment.c:2832 new_curseg+0x5e8/0x6dc
pc : new_curseg+0x5e8/0x6dc
Call trace:
 new_curseg+0x5e8/0x6dc
 f2fs_allocate_data_block+0xa54/0xe28
 do_write_page+0x6c/0x194
 f2fs_do_write_node_page+0x38/0x78
 __write_node_page+0x248/0x6d4
 f2fs_sync_node_pages+0x524/0x72c
 f2fs_write_checkpoint+0x4bc/0x9b0
 __checkpoint_and_complete_reqs+0x80/0x244
 issue_checkpoint_thread+0x8c/0xec
 kthread+0x114/0x1bc
 ret_from_fork+0x10/0x20

get_new_segment() detects an inconsistent status between free_segmap and free_secmap; let's record such an error into the super block, and bail out of get_new_segment() instead of continuing to use the segment.

Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2025-05-06  f2fs: sysfs: export linear_lookup in features directory  (Chao Yu)  [1 file changed, -0/+6]

cat /sys/fs/f2fs/features/linear_lookup
supported

Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2025-05-06  f2fs: sysfs: add encoding_flags entry  (Chao Yu)  [1 file changed, -0/+9]

This patch adds a new sysfs entry /sys/fs/f2fs/<disk>/encoding_flags; it is a read-only entry showing the value of sb.s_encoding_flags in hexadecimal.

============================ ==========
Flag_Name                    Flag_Value
============================ ==========
SB_ENC_STRICT_MODE_FL        0x00000001
SB_ENC_NO_COMPAT_FALLBACK_FL 0x00000002
============================ ==========

case#1
mkfs.f2fs -f -O casefold -C utf8:strict /dev/vda
mount /dev/vda /mnt/f2fs
cat /sys/fs/f2fs/vda/encoding_flags
1

case#2
mkfs.f2fs -f -O casefold -C utf8 /dev/vda
fsck.f2fs --nolinear-lookup=1 /dev/vda
mount /dev/vda /mnt/f2fs
cat /sys/fs/f2fs/vda/encoding_flags
2

Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
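Editor's note: the hexadecimal value read from the new entry can be decoded against the table above with a few lines of C (flag values taken from the commit message; the program name and invocation, e.g. ./decode $(cat /sys/fs/f2fs/vda/encoding_flags), are only an example):

        #include <stdio.h>
        #include <stdlib.h>

        /* Flag values from the table above. */
        #define SB_ENC_STRICT_MODE_FL          0x00000001
        #define SB_ENC_NO_COMPAT_FALLBACK_FL   0x00000002

        int main(int argc, char **argv)
        {
                unsigned long flags = strtoul(argc > 1 ? argv[1] : "0", NULL, 16);

                if (flags & SB_ENC_STRICT_MODE_FL)
                        puts("SB_ENC_STRICT_MODE_FL");
                if (flags & SB_ENC_NO_COMPAT_FALLBACK_FL)
                        puts("SB_ENC_NO_COMPAT_FALLBACK_FL");
                return 0;
        }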
2025-05-06  Merge tag 'for-6.15-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux  (Linus Torvalds)  [5 files changed, -95/+8]

Pull btrfs fixes from David Sterba:

- revert device path canonicalization, this does not work as intended with namespaces and is not reliable in all setups

- fix crash in scrub when checksum tree is not valid, e.g. when mounted with rescue=ignoredatacsums

- fix crash when tracepoint btrfs_prelim_ref_insert is enabled

- other minor fixups:
    - open code folio_index(), meant to be used in MM code
    - use matching type for sizeof in compression allocation

* tag 'for-6.15-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: open code folio_index() in btree_clear_folio_dirty_tag()
  Revert "btrfs: canonicalize the device path before adding it"
  btrfs: avoid NULL pointer dereference if no valid csum tree
  btrfs: handle empty eb->folios in num_extent_folios()
  btrfs: correct the order of prelim_ref arguments in btrfs__prelim_ref
  btrfs: compression: adjust cb->compressed_folios allocation type
2025-05-06  smb3 client: warn when parse contexts returns error on compounded operation  (Steve French)  [1 file changed, -0/+2]

Coverity noticed that the rc from smb2_parse_contexts() was not being checked in the case of compounded operations. Since we don't want to stop parsing the following compounded responses, which are likely valid, we can't easily error out here, but at least print a warning message if the server has a bug causing us to skip parsing the open response contexts.

Addresses-Coverity: 1639191
Acked-by: Paulo Alcantara (Red Hat) <pc@manguebit.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2025-05-06  ksmbd: Fix UAF in __close_file_table_ids  (Sean Heelan)  [1 file changed, -7/+26]

A use-after-free is possible if one thread destroys the file via __ksmbd_close_fd while another thread holds a reference to it. The existing checks on fp->refcount are not sufficient to prevent this.

The fix takes ft->lock around the section which removes the file from the file table. This prevents two threads acquiring the same file pointer via __close_file_table_ids, as well as the other functions which retrieve a file from the IDR and which already use this same lock.

Cc: stable@vger.kernel.org
Signed-off-by: Sean Heelan <seanheelan@gmail.com>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2025-05-06  ksmbd: prevent out-of-bounds stream writes by validating *pos  (Norbert Szetei)  [1 file changed, -0/+7]

ksmbd_vfs_stream_write() did not validate whether the write offset (*pos) was within the bounds of the existing stream data length (v_len). If *pos was greater than or equal to v_len, this could lead to an out-of-bounds memory write.

This patch adds a check to ensure *pos is less than v_len before proceeding. If the condition fails, -EINVAL is returned.

Cc: stable@vger.kernel.org
Signed-off-by: Norbert Szetei <norbert@doyensec.com>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
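Editor's note: the added check is exactly the bound described above. A standalone userspace model (mirrors the description, not the ksmbd code itself):

        #include <errno.h>
        #include <stddef.h>
        #include <stdio.h>

        /* Reject stream writes starting at or beyond the existing stream length. */
        static int check_stream_write_pos(size_t pos, size_t v_len)
        {
                if (pos >= v_len)
                        return -EINVAL;
                return 0;
        }

        int main(void)
        {
                printf("%d\n", check_stream_write_pos(4096, 4096));  /* -EINVAL: rejected */
                printf("%d\n", check_stream_write_pos(100, 4096));   /* 0: allowed */
                return 0;
        }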
2025-05-05  bcachefs: Call bch2_fs_start before getting vfs superblock  (Kent Overstreet)  [1 file changed, -8/+3]

This reverts commit 1fdbe0b184c8 ("bcachefs: Make sure c->vfs_sb is set before starting fs"), which switched up bch2_fs_get_tree() so that we got a superblock before calling bch2_fs_start, so that c->vfs_sb would always be initialized while the filesystem was active.

This turned out not to be necessary, because blk_holder_ops were implemented using our own locking, not vfs locking. And this had the side effect of creating a super_block and doing our full recovery (including potentially fsck) before setting SB_BORN, which causes things like sync calls to hang until our recovery is finished.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-05  mm: remove NR_BOUNCE zone stat  (Christoph Hellwig)  [1 file changed, -2/+1]

The stat is always 0 now, so remove it and hardwire the user visible output to 0.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20250505081138.3435992-8-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-05  bcachefs: fix hung task timeout in journal read  (Kent Overstreet)  [1 file changed, -1/+3]
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-05  bcachefs: Add missing barriers before wake_up_bit()  (Kent Overstreet)  [3 files changed, -1/+10]

wake_up() doesn't require a barrier - but wake_up_bit() does.

This only affected non-x86, and primarily led to lost wakeups after btree node reads.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
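Editor's note: the documented kernel idiom is that the bit update must be ordered before the wakeup, otherwise the waker can observe an empty waitqueue while the sleeper still sees the old bit value. A generic sketch of that idiom (flag and bit names are placeholders, not necessarily what this patch touches):

        /* Waker side: make the bit clear visible before checking for waiters. */
        clear_bit(MY_BIT, &flags);
        smp_mb__after_atomic();   /* order the clear before wake_up_bit()'s waitqueue check */
        wake_up_bit(&flags, MY_BIT);

        /* Sleeper side pairs with this via
         * wait_on_bit(&flags, MY_BIT, TASK_UNINTERRUPTIBLE). */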
2025-05-05  bcachefs: Ensure proper write alignment  (Kent Overstreet)  [1 file changed, -1/+21]

There was a buggy version of bcachefs-tools which picked misaligned bucket sizes when formatting, and we're also about to do dynamic block sizes - which will allow picking logical block size or physical block size of the device per-write, allowing for better compression ratios at the cost of slightly worse write performance (i.e. forcing the device to do RMW or extra buffering).

To account for this, tweak bch2_alloc_sectors_start() to properly align open_buckets to the blocksize of the write we're about to do.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-05  bcachefs: Improve want_cached_ptr()  (Kent Overstreet)  [1 file changed, -2/+3]

If the promote target isn't set, rebalance should still leave a cached copy on the faster device. Fall back to foreground_target if it's set, or allow a cached copy on any device if neither is set.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-05  saner calling conventions for ->d_automount()  (Al Viro)  [5 files changed, -12/+3]

Currently the calling conventions for ->d_automount() instances have an odd wart - the returned new mount to be attached is expected to have refcount 2.

That kludge is intended to make sure that mark_mounts_for_expiry() called before we get around to attaching that new mount to the tree won't decide to take it out. finish_automount() drops the extra reference after it's done with attaching the mount to the tree - or drops the reference twice in case of error. ->d_automount() instances have rather counterintuitive boilerplate in them.

There's a much simpler approach: have mark_mounts_for_expiry() skip the mounts that are yet to be mounted. And to hell with grabbing/dropping those extra references. Makes for simpler correctness analysis, at that...

Reviewed-by: Christian Brauner <brauner@kernel.org>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Paulo Alcantara (Red Hat) <pc@manguebit.com>
Acked-by: David Howells <dhowells@redhat.com>
Tested-by: David Howells <dhowells@redhat.com>
Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>