kernel/linux.git/fs/btrfs/ordered-data.c, branch v7.2-rc1

btrfs: fix false IO failure after falling back to buffered write

2026-06-09T16:22:46+00:00

[BUG] The test case generic/362 will fail with "nodatasum" mount option (*): MOUNT_OPTIONS -- -o nodatasum /dev/mapper/test-scratch1 /mnt/scratch generic/362 0s ... - output mismatch (see /home/adam/xfstests/results//generic/362.out.bad) --- tests/generic/362.out 2024-08-24 15:31:37.200000000 +0930 +++ /home/adam/xfstests/results//generic/362.out.bad 2026-05-27 10:21:17.574771567 +0930 @@ -1,2 +1,3 @@ QA output created by 362 +First write failed: Input/output error Silence is golden ... *: If the test case has been executed before with default data checksum, the failure will not reproduce. Need the following fix to make it reliably reproducible: https://lore.kernel.org/linux-btrfs/20260528111659.87113-1-wqu@suse.com/ [CAUSE] Inside __iomap_dio_rw(), the -EFAULT/-ENOTBLK error is not directly returned. Thus we never got an error pointer from __iomap_dio_rw(). The call chain looks like this: btrfs_direct_write() |- btrfs_dio_write() |- __iomap_dio_rw() | |- iomap_iter() | | |- btrfs_dio_iomap_begin() | | Now an ordered extent is allocated for the 4K write. | | | |- iomi.status = iomap_dio_iter() | | Where iomap_dio_iter() returned -EFAULT. | | | |- ret = iomap_iter() | | |- btrfs_dio_iomap_end() | | | |- btrfs_finish_ordered_extent(uptodate = false) | | | | |- can_finish_ordered_extent() | | | | |- btrfs_mark_ordered_extent_error() | | | | |- mapping_set_error() | | | | Now the address space is marked error. | | | | return -ENOTBLK | | |- return -ENOTBLK | |- if (ret == -ENOTBLK) { ret = 0; } | Now the return value is reset to 0. | Thus no error pointer will be returned. | |- ret = iomap_dio_complete() | Since no byte is submitted, @ret is 0. | |- Fallback to buffered IO | And the buffered write finished without error | |- filemap_fdatawait_range() |- filemap_check_errors() The previous error is recorded, thus an error is returned However the buffered write is properly submitted and finished, the error is from the btrfs_finish_ordered_extent() call with @uptodate = false. [FIX] When a short dio write happened, any range that is submitted will have btrfs_extract_ordered_extent() to be called, thus the submitted range will always have an OE just covering the submitted range. The remaining OE range is never submitted, thus they should be treated as truncated, not an error. So that we can properly reclaim and not insert an unnecessary file extent item, without marking the mapping as error. Extract a helper, btrfs_mark_ordered_extent_truncated(), and utilize that helper to mark the direct IO ordered extent as truncated, so it won't cause failure for the later buffered fallback. [REASON FOR NO FIXES TAG] The bug itself is pretty old, at commit f85781fb505e ("btrfs: switch to iomap for direct IO") we're already passing @uptodate=false finishing the OE. But at that time OE with IOERR won't call mapping_set_error(), so it's not exposed. Later commit d61bec08b904 ("btrfs: mark ordered extent and inode with error if we fail to finish") finally exposed the bug, but that commit is doing a correct job, not the root cause. Anyway the bug is very old, dating back to 5.1x days, thus only CC to stable. CC: stable@vger.kernel.org # 5.15+ Reviewed-by: Boris Burkov Signed-off-by: Qu Wenruo Signed-off-by: David Sterba

btrfs: make more ASSERTs verbose, part 3

2026-06-08T13:53:28+00:00

We have support for optional string to be printed in ASSERT() (added in 19468a623a9109 ("btrfs: enhance ASSERT() to take optional format string")), it's not yet everywhere it could be so add a few more files. Try to finish what was left after 1c094e6ccead7a ("btrfs: make a few more ASSERTs verbose"). Signed-off-by: David Sterba

btrfs: fix silent IO error loss in encoded writes and zoned split

2026-04-07T17:43:50+00:00

can_finish_ordered_extent() and btrfs_finish_ordered_zoned() set BTRFS_ORDERED_IOERR via bare set_bit(). Later, btrfs_mark_ordered_extent_error() in btrfs_finish_one_ordered() uses test_and_set_bit(), finds it already set, and skips mapping_set_error(). The error is never recorded on the inode's address_space, making it invisible to fsync. For encoded writes this causes btrfs receive to silently produce files with zero-filled holes. Fix: replace bare set_bit(BTRFS_ORDERED_IOERR) with btrfs_mark_ordered_extent_error() which pairs test_and_set_bit() with mapping_set_error(), guaranteeing the error is recorded exactly once. Reviewed-by: Qu Wenruo Reviewed-by: Johannes Thumshirn Reviewed-by: Mark Harmstone Signed-off-by: Michal Grzedzicki Signed-off-by: David Sterba

btrfs: output more info when duplicated ordered extent is found

2026-04-07T16:56:02+00:00

During development of a new feature, I triggered that btrfs_panic() inside insert_ordered_extent() and spent quite some unnecessary before noticing I'm passing incorrect flags when creating a new ordered extent. Unfortunately the existing error message is not providing much help. Enhance the output to provide file offset, num bytes and flags of both existing and new ordered extents. Signed-off-by: Qu Wenruo Reviewed-by: David Sterba Signed-off-by: David Sterba

btrfs: check type flags in alloc_ordered_extent()

2026-04-07T16:56:02+00:00

Unlike other flags used in btrfs, BTRFS_ORDERED_* macros are different as they cannot be directly used as flags. They are defined as bit values, thus they should be utilized with bit operations, not directly with logical operations. Unfortunately sometimes I forgot this and passed the incorrect flags to alloc_ordered_extent() and hit weird bugs. Enhance the type checks in alloc_ordered_extent(): - Make sure there is one and only one bit set for exclusive type flags There are four exclusive type flags, REGULAR, NOCOW, PREALLOC and COMPRESSED. So introduce a new macro, BTRFS_ORDERED_EXCLUSIVE_FLAGS, to cover above flags. Add an ASSERT() to check one and only one of those exclusive flags can be set for alloc_ordered_extent(). - Re-order the type bit numbers to the end of the enum This is make it much harder to get a valid false negative. E.g., with the old code BTRFS_ORDERED_REGULAR starts at zero, we can have the following flags passing the bit uniqueness check: * BTRFS_ORDERED_NOCOW Be treated as BTRFS_ORDERED_REGULAR (1 == 1UL << 0). * BTRFS_ORDERED_PREALLOC Be treated as BTRFS_ORDERED_NOCOW (2 == 1UL << 1). * BTRFS_ORDERED_DIRECT Be treated as BTRFS_ORDERED_PREALLOC (4 == 1UL << 2). Now all those types start at 8, passing any of those bit numbers as flags directly will not pass the ASSERT(). - Add a static assert to avoid overflow To make sure all BTRFS_ORDERED_* flags can fit into an unsigned long. Signed-off-by: Qu Wenruo Reviewed-by: David Sterba Signed-off-by: David Sterba

btrfs: remove folio parameter from ordered io related functions

2026-04-07T16:55:57+00:00

Both functions btrfs_finish_ordered_extent() and btrfs_mark_ordered_io_finished() are accepting an optional folio parameter. That @folio is passed into can_finish_ordered_extent(), which later will test and clear the ordered flag for the involved range. However I do not think there is any other call site that can clear ordered flags of an page cache folio and can affect can_finish_ordered_extent(). There are limited *_clear_ordered() callers out of can_finish_ordered_extent() function: - btrfs_migrate_folio() This is completely unrelated, it's just migrating the ordered flag to the new folio. - btrfs_cleanup_ordered_extents() We manually clean the ordered flags of all involved folios, then call btrfs_mark_ordered_io_finished() without a @folio parameter. So it doesn't need and didn't pass a @folio parameter in the first place. - btrfs_writepage_fixup_worker() This function is going to be removed soon, and we should not hit that function anymore. - btrfs_invalidate_folio() This is the real call site we need to bother with. If we already have a bio running, btrfs_finish_ordered_extent() in end_bbio_data_write() will be executed first, as btrfs_invalidate_folio() will wait for the writeback to finish. Thus if there is a running bio, it will not see the range has ordered flags, and just skip to the next range. If there is no bio running, meaning the ordered extent is created but the folio is not yet submitted. In that case btrfs_invalidate_folio() will manually clear the folio ordered range, but then manually finish the ordered extent with btrfs_dec_test_ordered_pending() without bothering the folio ordered flags. Meaning if the OE range with folio ordered flags will be finished manually without the need to call can_finish_ordered_extent(). This means all can_finish_ordered_extent() call sites should get a range that has folio ordered flag set, thus the old "return false" branch should never be triggered. Now we can: - Remove the @folio parameter from involved functions * btrfs_mark_ordered_io_finished() * btrfs_finish_ordered_extent() For call sites passing a @folio into those functions, let them manually clear the ordered flag of involved folios. - Move btrfs_finish_ordered_extent() out of the loop in end_bbio_data_write() We only need to call btrfs_finish_ordered_extent() once per bbio, not per folio. - Add an ASSERT() to make sure all folio ranges have ordered flags It's only for end_bbio_data_write(). And we already have enough safe nets to catch over-accounting of ordered extents. Reviewed-by: Filipe Manana Signed-off-by: Qu Wenruo Signed-off-by: David Sterba

btrfs: remove the btrfs_inode parameter from btrfs_remove_ordered_extent()

2026-04-07T16:55:57+00:00

We already have btrfs_ordered_extent::inode, thus there is no need to pass a btrfs_inode parameter to btrfs_remove_ordered_extent(). Reviewed-by: Filipe Manana Signed-off-by: Qu Wenruo Signed-off-by: David Sterba

btrfs: rename btrfs_ordered_extent::list to csum_list

2026-04-07T16:55:56+00:00

That list head records all pending checksums for that ordered extent. And unlike other lists, we just use the name "list", which can be very confusing for readers. Rename it to "csum_list" which follows the remaining lists, showing the purpose of the list. And since we're here, remove a comment inside btrfs_finish_ordered_zoned() where we have "ASSERT(!list_empty(&ordered->csum_list))" to make sure the OE has pending csums. That comment is only here to make sure we do not call list_first_entry() before checking BTRFS_ORDERED_PREALLOC. But since we already have that bit checked and even have a dedicated ASSERT(), there is no need for that comment anymore. Reviewed-by: Filipe Manana Signed-off-by: Qu Wenruo Signed-off-by: David Sterba

Merge tag 'for-6.19-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

2025-12-04T04:03:46+00:00

Pull btrfs updates from David Sterba: "Features: - shutdown ioctl support (needs CONFIG_BTRFS_EXPERIMENTAL for now): - set filesystem state as being shut down (also named going down in other filesystems), where all active operations return EIO and this cannot be changed until unmount - pending operations are attempted to be finished but error messages may still show up depending on where exactly the shutdown happened - scrub (and device replace) vs suspend/hibernate: - a running scrub will prevent suspend, which can be annoying as suspend is an immediate request and scrub is not critical - filesystem freezing before suspend was not sufficient as the problem was in process freezing - behaviour change: on suspend scrub and device replace are cancelled, where scrub can record the last state and continue from there; the device replace has to be restarted from the beginning - zone stats exported in sysfs, from the perspective of the filesystem this includes active, reclaimable, relocation etc zones Performance: - improvements when processing space reservation tickets by optimizing locking and shrinking critical sections, cumulative improvements in lockstat numbers show +15% Notable fixes: - use vmalloc fallback when allocating bios as high order allocations can happen with wide checksums (like sha256) - scrub will always track the last position of progress so it's not starting from zero after an error Core: - under experimental config, checksum calculations are offloaded to process context, simplifies locking and allows to remove compression write worker kthread(s): - speed improvement in direct IO throughput with buffered IO fallback is +15% when not offloaded but this is more related to internal crypto subsystem improvements - this will be probably default in the future removing the sysfs tunable - (experimental) block size > page size updates: - support more operations when not using large folios (encoded read/write and send) - raid56 - more preparations for fscrypt support Other: - more conversions to auto-cleaned variables - parameter cleanups and removals - extended warning fixes - improved printing of structured values like keys - lots of other cleanups and refactoring" * tag 'for-6.19-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (147 commits) btrfs: remove unnecessary inode key in btrfs_log_all_parents() btrfs: remove redundant zero/NULL initializations in btrfs_alloc_root() btrfs: remaining BTRFS_PATH_AUTO_FREE conversions btrfs: send: do not allocate memory for xattr data when checking it exists btrfs: send: add unlikely to all unexpected overflow checks btrfs: reduce arguments to btrfs_del_inode_ref_in_log() btrfs: remove root argument from btrfs_del_dir_entries_in_log() btrfs: use test_and_set_bit() in btrfs_delayed_delete_inode_ref() btrfs: don't search back for dir inode item in INO_LOOKUP_USER btrfs: don't rewrite ret from inode_permission btrfs: add orig_logical to btrfs_bio for encryption btrfs: disable verity on encrypted inodes btrfs: disable various operations on encrypted inodes btrfs: remove redundant level reset in btrfs_del_items() btrfs: simplify leaf traversal after path release in btrfs_next_old_leaf() btrfs: optimize balance_level() path reference handling btrfs: factor out root promotion logic into promote_child_to_root() btrfs: raid56: remove the "_step" infix btrfs: raid56: enable bs > ps support btrfs: raid56: prepare finish_parity_scrub() to support bs > ps cases ...

btrfs: relax btrfs_inode::ordered_tree_lock IRQ locking context

2025-11-24T21:42:21+00:00

We used IRQ version of spinlock for ordered_tree_lock, as btrfs_finish_ordered_extent() can be called in end_bbio_data_write() which was in IRQ context. However since we're moving all the btrfs_bio::end_io() calls into task context, there is no more need to support IRQ context thus we can relax to regular spin_lock()/spin_unlock() for btrfs_inode::ordered_tree_lock. Signed-off-by: Qu Wenruo Reviewed-by: David Sterba Signed-off-by: David Sterba