Age | Commit message (Collapse) | Author | Files | Lines |
|
[ Upstream commit 1961d20f6fa8903266ed9bd77c691924c22c8f02 ]
When building the free space tree with the block group tree feature
enabled, we can hit an assertion failure like this:
BTRFS info (device loop0 state M): rebuilding free space tree
assertion failed: ret == 0, in fs/btrfs/free-space-tree.c:1102
------------[ cut here ]------------
kernel BUG at fs/btrfs/free-space-tree.c:1102!
Internal error: Oops - BUG: 00000000f2000800 [#1] SMP
Modules linked in:
CPU: 1 UID: 0 PID: 6592 Comm: syz-executor322 Not tainted 6.15.0-rc7-syzkaller-gd7fa1af5b33e #0 PREEMPT
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 05/07/2025
pstate: 60400005 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
pc : populate_free_space_tree+0x514/0x518 fs/btrfs/free-space-tree.c:1102
lr : populate_free_space_tree+0x514/0x518 fs/btrfs/free-space-tree.c:1102
sp : ffff8000a4ce7600
x29: ffff8000a4ce76e0 x28: ffff0000c9bc6000 x27: ffff0000ddfff3d8
x26: ffff0000ddfff378 x25: dfff800000000000 x24: 0000000000000001
x23: ffff8000a4ce7660 x22: ffff70001499cecc x21: ffff0000e1d8c160
x20: ffff0000e1cb7800 x19: ffff0000e1d8c0b0 x18: 00000000ffffffff
x17: ffff800092f39000 x16: ffff80008ad27e48 x15: ffff700011e740c0
x14: 1ffff00011e740c0 x13: 0000000000000004 x12: ffffffffffffffff
x11: ffff700011e740c0 x10: 0000000000ff0100 x9 : 94ef24f55d2dbc00
x8 : 94ef24f55d2dbc00 x7 : 0000000000000001 x6 : 0000000000000001
x5 : ffff8000a4ce6f98 x4 : ffff80008f415ba0 x3 : ffff800080548ef0
x2 : 0000000000000000 x1 : 0000000100000000 x0 : 000000000000003e
Call trace:
populate_free_space_tree+0x514/0x518 fs/btrfs/free-space-tree.c:1102 (P)
btrfs_rebuild_free_space_tree+0x14c/0x54c fs/btrfs/free-space-tree.c:1337
btrfs_start_pre_rw_mount+0xa78/0xe10 fs/btrfs/disk-io.c:3074
btrfs_remount_rw fs/btrfs/super.c:1319 [inline]
btrfs_reconfigure+0x828/0x2418 fs/btrfs/super.c:1543
reconfigure_super+0x1d4/0x6f0 fs/super.c:1083
do_remount fs/namespace.c:3365 [inline]
path_mount+0xb34/0xde0 fs/namespace.c:4200
do_mount fs/namespace.c:4221 [inline]
__do_sys_mount fs/namespace.c:4432 [inline]
__se_sys_mount fs/namespace.c:4409 [inline]
__arm64_sys_mount+0x3e8/0x468 fs/namespace.c:4409
__invoke_syscall arch/arm64/kernel/syscall.c:35 [inline]
invoke_syscall+0x98/0x2b8 arch/arm64/kernel/syscall.c:49
el0_svc_common+0x130/0x23c arch/arm64/kernel/syscall.c:132
do_el0_svc+0x48/0x58 arch/arm64/kernel/syscall.c:151
el0_svc+0x58/0x17c arch/arm64/kernel/entry-common.c:767
el0t_64_sync_handler+0x78/0x108 arch/arm64/kernel/entry-common.c:786
el0t_64_sync+0x198/0x19c arch/arm64/kernel/entry.S:600
Code: f0047182 91178042 528089c3 9771d47b (d4210000)
---[ end trace 0000000000000000 ]---
This happens because we are processing an empty block group, which has
no extents allocated from it, there are no items for this block group,
including the block group item since block group items are stored in a
dedicated tree when using the block group tree feature. It also means
this is the block group with the highest start offset, so there are no
higher keys in the extent root, hence btrfs_search_slot_for_read()
returns 1 (no higher key found).
Fix this by asserting 'ret' is 0 only if the block group tree feature
is not enabled, in which case we should find a block group item for
the block group since it's stored in the extent root and block group
item keys are greater than extent item keys (the value for
BTRFS_BLOCK_GROUP_ITEM_KEY is 192 and for BTRFS_EXTENT_ITEM_KEY and
BTRFS_METADATA_ITEM_KEY the values are 168 and 169 respectively).
In case 'ret' is 1, we just need to add a record to the free space
tree which spans the whole block group, and we can achieve this by
making 'ret == 0' as the while loop's condition.
Reported-by: syzbot+36fae25c35159a763a2a@syzkaller.appspotmail.com
Link: https://lore.kernel.org/linux-btrfs/6841dca8.a00a0220.d4325.0020.GAE@google.com/
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 157501b0469969fc1ba53add5049575aadd79d80 ]
We are setting the parent directory's last_unlink_trans directly which
may result in a concurrent task starting to log the directory not see the
update and therefore can log the directory after we removed a child
directory which had a snapshot within instead of falling back to a
transaction commit. Replaying such a log tree would result in a mount
failure since we can't currently delete snapshots (and subvolumes) during
log replay. This is the type of failure described in commit 1ec9a1ae1e30
("Btrfs: fix unreplayable log after snapshot delete + parent dir fsync").
Fix this by using btrfs_record_snapshot_destroy() which updates the
last_unlink_trans field while holding the inode's log_mutex lock.
Fixes: 44f714dae50a ("Btrfs: improve performance on fsync against new inode after rename/unlink")
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit c466e33e729a0ee017d10d919cba18f503853c60 ]
In case the removed directory had a snapshot that was deleted, we are
propagating its inode's last_unlink_trans to the parent directory after
we removed the entry from the parent directory. This leaves a small race
window where someone can log the parent directory after we removed the
entry and before we updated last_unlink_trans, and as a result if we ever
try to replay such a log tree, we will fail since we will attempt to
remove a snapshot during log replay, which is currently not possible and
results in the log replay (and mount) to fail. This is the type of failure
described in commit 1ec9a1ae1e30 ("Btrfs: fix unreplayable log after
snapshot delete + parent dir fsync").
So fix this by propagating the last_unlink_trans to the parent directory
before we remove the entry from it.
Fixes: 44f714dae50a ("Btrfs: improve performance on fsync against new inode after rename/unlink")
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit bf5bcf9a6fa070ec8a725b08db63fb1318f77366 ]
Instead of recording that a new subvolume was created in a directory after
we add the entry do the directory, record it before adding the entry. This
is to avoid races where after creating the entry and before recording the
new subvolume in the directory (the call to btrfs_record_new_subvolume()),
another task logs the directory, so we end up with a log tree where we
logged a directory that has an entry pointing to a root that was not yet
committed, resulting in an invalid entry if the log is persisted and
replayed later due to a power failure or crash.
Also state this requirement in the function comment for
btrfs_record_new_subvolume(), similar to what we do for the
btrfs_record_unlink_dir() and btrfs_record_snapshot_destroy().
Fixes: 45c4102f0d82 ("btrfs: avoid transaction commit on any fsync after subvolume creation")
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 5f61b961599acbd2bed028d3089105a1f7d224b8 ]
When replaying log trees we use read_one_inode() to get an inode, which is
just a wrapper around btrfs_iget_logging(), which in turn is a wrapper for
btrfs_iget(). But read_one_inode() always returns NULL for any error
that btrfs_iget_logging() / btrfs_iget() may return and this is a problem
because:
1) In many callers of read_one_inode() we convert the NULL into -EIO,
which is not accurate since btrfs_iget() may return -ENOMEM and -ENOENT
for example, besides -EIO and other errors. So during log replay we
may end up reporting a false -EIO, which is confusing since we may
not have had any IO error at all;
2) When replaying directory deletes, at replay_dir_deletes(), we assume
the NULL returned from read_one_inode() means that the inode doesn't
exist and then proceed as if no error had happened. This is wrong
because unless btrfs_iget() returned ERR_PTR(-ENOENT), we had an
actual error and the target inode may exist in the target subvolume
root - this may later result in the log replay code failing at a
later stage (if we are "lucky") or succeed but leaving some
inconsistency in the filesystem.
So fix this by not ignoring errors from btrfs_iget_logging() and as
a consequence remove the read_one_inode() wrapper and just use
btrfs_iget_logging() directly. Also since btrfs_iget_logging() is
supposed to be called only against subvolume roots, just like
read_one_inode() which had a comment about it, add an assertion to
btrfs_iget_logging() to check that the target root corresponds to a
subvolume root.
Fixes: 5d4f98a28c7d ("Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)")
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 54a7081ed168b72a8a2d6ef4ba3a1259705a2926 ]
At __inode_add_ref() when processing extrefs, if we jump into the next
label we have an undefined value of victim_name.len, since we haven't
initialized it before we did the goto. This results in an invalid memory
access in the next iteration of the loop since victim_name.len was not
initialized to the length of the name of the current extref.
Fix this by initializing victim_name.len with the current extref's name
length.
Fixes: e43eec81c516 ("btrfs: use struct qstr instead of name and namelen pairs")
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
replay
[ Upstream commit 6561a40ceced9082f50c374a22d5966cf9fc5f5c ]
During log replay, at __add_inode_ref(), when we are searching for inode
ref keys we totally ignore if btrfs_search_slot() returns an error. This
may make a log replay succeed when there was an actual error and leave
some metadata inconsistency in a subvolume tree. Fix this by checking if
an error was returned from btrfs_search_slot() and if so, return it to
the caller.
Fixes: e02119d5a7b4 ("Btrfs: Add a write ahead tree log to optimize synchronous operations")
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 1e6ed33cabba8f06f532f2e5851a102602823734 ]
If we are rebuilding a free space tree, while modifying the free space
tree we may need to allocate a new metadata block group.
If we end up using multiple transactions for the rebuild, when we call
btrfs_end_transaction() we enter btrfs_create_pending_block_groups()
which calls add_block_group_free_space() to add items to the free space
tree for the block group.
Then later during the free space tree rebuild, at
btrfs_rebuild_free_space_tree(), we may find such new block groups
and call populate_free_space_tree() for them, which fails with -EEXIST
because there are already items in the free space tree. Then we abort the
transaction with -EEXIST at btrfs_rebuild_free_space_tree().
Notice that we say "may find" the new block groups because a new block
group may be inserted in the block groups rbtree, which is being iterated
by the rebuild process, before or after the current node where the rebuild
process is currently at.
Syzbot recently reported such case which produces a trace like the
following:
------------[ cut here ]------------
BTRFS: Transaction aborted (error -17)
WARNING: CPU: 1 PID: 7626 at fs/btrfs/free-space-tree.c:1341 btrfs_rebuild_free_space_tree+0x470/0x54c fs/btrfs/free-space-tree.c:1341
Modules linked in:
CPU: 1 UID: 0 PID: 7626 Comm: syz.2.25 Not tainted 6.15.0-rc7-syzkaller-00085-gd7fa1af5b33e-dirty #0 PREEMPT
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 05/07/2025
pstate: 60400005 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
pc : btrfs_rebuild_free_space_tree+0x470/0x54c fs/btrfs/free-space-tree.c:1341
lr : btrfs_rebuild_free_space_tree+0x470/0x54c fs/btrfs/free-space-tree.c:1341
sp : ffff80009c4f7740
x29: ffff80009c4f77b0 x28: ffff0000d4c3f400 x27: 0000000000000000
x26: dfff800000000000 x25: ffff70001389eee8 x24: 0000000000000003
x23: 1fffe000182b6e7b x22: 0000000000000000 x21: ffff0000c15b73d8
x20: 00000000ffffffef x19: ffff0000c15b7378 x18: 1fffe0003386f276
x17: ffff80008f31e000 x16: ffff80008adbe98c x15: 0000000000000001
x14: 1fffe0001b281550 x13: 0000000000000000 x12: 0000000000000000
x11: ffff60001b281551 x10: 0000000000000003 x9 : 1c8922000a902c00
x8 : 1c8922000a902c00 x7 : ffff800080485878 x6 : 0000000000000000
x5 : 0000000000000001 x4 : 0000000000000001 x3 : ffff80008047843c
x2 : 0000000000000001 x1 : ffff80008b3ebc40 x0 : 0000000000000001
Call trace:
btrfs_rebuild_free_space_tree+0x470/0x54c fs/btrfs/free-space-tree.c:1341 (P)
btrfs_start_pre_rw_mount+0xa78/0xe10 fs/btrfs/disk-io.c:3074
btrfs_remount_rw fs/btrfs/super.c:1319 [inline]
btrfs_reconfigure+0x828/0x2418 fs/btrfs/super.c:1543
reconfigure_super+0x1d4/0x6f0 fs/super.c:1083
do_remount fs/namespace.c:3365 [inline]
path_mount+0xb34/0xde0 fs/namespace.c:4200
do_mount fs/namespace.c:4221 [inline]
__do_sys_mount fs/namespace.c:4432 [inline]
__se_sys_mount fs/namespace.c:4409 [inline]
__arm64_sys_mount+0x3e8/0x468 fs/namespace.c:4409
__invoke_syscall arch/arm64/kernel/syscall.c:35 [inline]
invoke_syscall+0x98/0x2b8 arch/arm64/kernel/syscall.c:49
el0_svc_common+0x130/0x23c arch/arm64/kernel/syscall.c:132
do_el0_svc+0x48/0x58 arch/arm64/kernel/syscall.c:151
el0_svc+0x58/0x17c arch/arm64/kernel/entry-common.c:767
el0t_64_sync_handler+0x78/0x108 arch/arm64/kernel/entry-common.c:786
el0t_64_sync+0x198/0x19c arch/arm64/kernel/entry.S:600
irq event stamp: 330
hardirqs last enabled at (329): [<ffff80008048590c>] raw_spin_rq_unlock_irq kernel/sched/sched.h:1525 [inline]
hardirqs last enabled at (329): [<ffff80008048590c>] finish_lock_switch+0xb0/0x1c0 kernel/sched/core.c:5130
hardirqs last disabled at (330): [<ffff80008adb9e60>] el1_dbg+0x24/0x80 arch/arm64/kernel/entry-common.c:511
softirqs last enabled at (10): [<ffff8000801fbf10>] local_bh_enable+0x10/0x34 include/linux/bottom_half.h:32
softirqs last disabled at (8): [<ffff8000801fbedc>] local_bh_disable+0x10/0x34 include/linux/bottom_half.h:19
---[ end trace 0000000000000000 ]---
Fix this by flagging new block groups which had their free space tree
entries already added and then skip them in the rebuild process. Also,
since the rebuild may be triggered when doing a remount, make sure that
when we clear an existing free space tree that we clear such flag from
every existing block group, otherwise we would skip those block groups
during the rebuild.
Reported-by: syzbot+d0014fb0fc39c5487ae5@syzkaller.appspotmail.com
Link: https://lore.kernel.org/linux-btrfs/68460a54.050a0220.daf97.0af5.GAE@google.com/
Fixes: 882af9f13e83 ("btrfs: handle free space tree rebuild in multiple transactions")
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
commit 2dcf838cf5c2f0f4501edaa1680fcad03618d760 upstream.
In a few places where we call read_one_inode(), if we get a NULL pointer
we end up jumping into an error path, or fallthrough in case of
__add_inode_ref(), where we then do something like this:
iput(&inode->vfs_inode);
which results in an invalid inode pointer that triggers an invalid memory
access, resulting in a crash.
Fix this by making sure we don't do such dereferences.
Fixes: b4c50cbb01a1 ("btrfs: return a btrfs_inode from read_one_inode()")
CC: stable@vger.kernel.org # 6.15+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit ae4477f937569d097ca5dbce92a89ba384b49bc6 upstream.
Each superblock contains a copy of the device item for that device. In a
transaction which drops a chunk but doesn't create any new ones, we were
correctly updating the device item in the chunk tree but not copying
over the new bytes_used value to the superblock.
This can be seen by doing the following:
# dd if=/dev/zero of=test bs=4096 count=2621440
# mkfs.btrfs test
# mount test /root/temp
# cd /root/temp
# for i in {00..10}; do dd if=/dev/zero of=$i bs=4096 count=32768; done
# sync
# rm *
# sync
# btrfs balance start -dusage=0 .
# sync
# cd
# umount /root/temp
# btrfs check test
For btrfs-check to detect this, you will also need my patch at
https://github.com/kdave/btrfs-progs/pull/991.
Change btrfs_remove_dev_extents() so that it adds the devices to the
fs_info->post_commit_list if they're not there already. This causes
btrfs_commit_device_sizes() to be called, which updates the bytes_used
value in the superblock.
Fixes: bbbf7243d62d ("btrfs: combine device update operations during transaction commit")
CC: stable@vger.kernel.org # 5.10+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Mark Harmstone <maharmstone@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 3ca864de852bc91007b32d2a0d48993724f4abad upstream.
We have a race between a rename and directory inode logging that if it
happens and we crash/power fail before the rename completes, the next time
the filesystem is mounted, the log replay code will end up deleting the
file that was being renamed.
This is best explained following a step by step analysis of an interleaving
of steps that lead into this situation.
Consider the initial conditions:
1) We are at transaction N;
2) We have directories A and B created in a past transaction (< N);
3) We have inode X corresponding to a file that has 2 hardlinks, one in
directory A and the other in directory B, so we'll name them as
"A/foo_link1" and "B/foo_link2". Both hard links were persisted in a
past transaction (< N);
4) We have inode Y corresponding to a file that as a single hard link and
is located in directory A, we'll name it as "A/bar". This file was also
persisted in a past transaction (< N).
The steps leading to a file loss are the following and for all of them we
are under transaction N:
1) Link "A/foo_link1" is removed, so inode's X last_unlink_trans field
is updated to N, through btrfs_unlink() -> btrfs_record_unlink_dir();
2) Task A starts a rename for inode Y, with the goal of renaming from
"A/bar" to "A/baz", so we enter btrfs_rename();
3) Task A inserts the new BTRFS_INODE_REF_KEY for inode Y by calling
btrfs_insert_inode_ref();
4) Because the rename happens in the same directory, we don't set the
last_unlink_trans field of directoty A's inode to the current
transaction id, that is, we don't cal btrfs_record_unlink_dir();
5) Task A then removes the entries from directory A (BTRFS_DIR_ITEM_KEY
and BTRFS_DIR_INDEX_KEY items) when calling __btrfs_unlink_inode()
(actually the dir index item is added as a delayed item, but the
effect is the same);
6) Now before task A adds the new entry "A/baz" to directory A by
calling btrfs_add_link(), another task, task B is logging inode X;
7) Task B starts a fsync of inode X and after logging inode X, at
btrfs_log_inode_parent() it calls btrfs_log_all_parents(), since
inode X has a last_unlink_trans value of N, set at in step 1;
8) At btrfs_log_all_parents() we search for all parent directories of
inode X using the commit root, so we find directories A and B and log
them. Bu when logging direct A, we don't have a dir index item for
inode Y anymore, neither the old name "A/bar" nor for the new name
"A/baz" since the rename has deleted the old name but has not yet
inserted the new name - task A hasn't called yet btrfs_add_link() to
do that.
Note that logging directory A doesn't fallback to a transaction
commit because its last_unlink_trans has a lower value than the
current transaction's id (see step 4);
9) Task B finishes logging directories A and B and gets back to
btrfs_sync_file() where it calls btrfs_sync_log() to persist the log
tree;
10) Task B successfully persisted the log tree, btrfs_sync_log() completed
with success, and a power failure happened.
We have a log tree without any directory entry for inode Y, so the
log replay code deletes the entry for inode Y, name "A/bar", from the
subvolume tree since it doesn't exist in the log tree and the log
tree is authorative for its index (we logged a BTRFS_DIR_LOG_INDEX_KEY
item that covers the index range for the dentry that corresponds to
"A/bar").
Since there's no other hard link for inode Y and the log replay code
deletes the name "A/bar", the file is lost.
The issue wouldn't happen if task B synced the log only after task A
called btrfs_log_new_name(), which would update the log with the new name
for inode Y ("A/bar").
Fix this by pinning the log root during renames before removing the old
directory entry, and unpinning after btrfs_log_new_name() is called.
Fixes: 259c4b96d78d ("btrfs: stop doing unnecessary log updates during a rename")
CC: stable@vger.kernel.org # 5.18+
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 1f2889f5594a2bc4c6a52634c4a51b93e785def5 ]
If we fail to allocate an ordered extent for a COW write we end up leaking
a qgroup data reservation since we called btrfs_qgroup_release_data() but
we didn't call btrfs_qgroup_free_refroot() (which would happen when
running the respective data delayed ref created by ordered extent
completion or when finishing the ordered extent in case an error happened).
So make sure we call btrfs_qgroup_free_refroot() if we fail to allocate an
ordered extent for a COW write.
Fixes: 7dbeaad0af7d ("btrfs: change timing for qgroup reserved space for ordered extents to fix reserved space leak")
CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 05a6ec865d091fe8244657df8063f74e704d1711 ]
The unsigned type is a recommended practice (CWE-190, CWE-194) for bit
shifts to avoid problems with potential unwanted sign extensions.
Although there are no such cases in btrfs codebase, follow the
recommendation.
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
Stable-dep-of: 1f2889f5594a ("btrfs: fix qgroup reservation leak on failure to allocate ordered extent")
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 547e836661554dcfa15c212a3821664e85b4191a ]
[BUG]
There is syzbot based reproducer that can crash the kernel, with the
following call trace: (With some debug output added)
DEBUG: rescue=ibadroots parsed
BTRFS: device fsid 14d642db-7b15-43e4-81e6-4b8fac6a25f8 devid 1 transid 8 /dev/loop0 (7:0) scanned by repro (1010)
BTRFS info (device loop0): first mount of filesystem 14d642db-7b15-43e4-81e6-4b8fac6a25f8
BTRFS info (device loop0): using blake2b (blake2b-256-generic) checksum algorithm
BTRFS info (device loop0): using free-space-tree
BTRFS warning (device loop0): checksum verify failed on logical 5312512 mirror 1 wanted 0xb043382657aede36608fd3386d6b001692ff406164733d94e2d9a180412c6003 found 0x810ceb2bacb7f0f9eb2bf3b2b15c02af867cb35ad450898169f3b1f0bd818651 level 0
DEBUG: read tree root path failed for tree csum, ret=-5
BTRFS warning (device loop0): checksum verify failed on logical 5328896 mirror 1 wanted 0x51be4e8b303da58e6340226815b70e3a93592dac3f30dd510c7517454de8567a found 0x51be4e8b303da58e634022a315b70e3a93592dac3f30dd510c7517454de8567a level 0
BTRFS warning (device loop0): checksum verify failed on logical 5292032 mirror 1 wanted 0x1924ccd683be9efc2fa98582ef58760e3848e9043db8649ee382681e220cdee4 found 0x0cb6184f6e8799d9f8cb335dccd1d1832da1071d12290dab3b85b587ecacca6e level 0
process 'repro' launched './file2' with NULL argv: empty string added
DEBUG: no csum root, idatacsums=0 ibadroots=134217728
Oops: general protection fault, probably for non-canonical address 0xdffffc0000000041: 0000 [#1] SMP KASAN NOPTI
KASAN: null-ptr-deref in range [0x0000000000000208-0x000000000000020f]
CPU: 5 UID: 0 PID: 1010 Comm: repro Tainted: G OE 6.15.0-custom+ #249 PREEMPT(full)
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS unknown 02/02/2022
RIP: 0010:btrfs_lookup_csum+0x93/0x3d0 [btrfs]
Call Trace:
<TASK>
btrfs_lookup_bio_sums+0x47a/0xdf0 [btrfs]
btrfs_submit_bbio+0x43e/0x1a80 [btrfs]
submit_one_bio+0xde/0x160 [btrfs]
btrfs_readahead+0x498/0x6a0 [btrfs]
read_pages+0x1c3/0xb20
page_cache_ra_order+0x4b5/0xc20
filemap_get_pages+0x2d3/0x19e0
filemap_read+0x314/0xde0
__kernel_read+0x35b/0x900
bprm_execve+0x62e/0x1140
do_execveat_common.isra.0+0x3fc/0x520
__x64_sys_execveat+0xdc/0x130
do_syscall_64+0x54/0x1d0
entry_SYSCALL_64_after_hwframe+0x76/0x7e
---[ end trace 0000000000000000 ]---
[CAUSE]
Firstly the fs has a corrupted csum tree root, thus to mount the fs we
have to go "ro,rescue=ibadroots" mount option.
Normally with that mount option, a bad csum tree root should set
BTRFS_FS_STATE_NO_DATA_CSUMS flag, so that any future data read will
ignore csum search.
But in this particular case, we have the following call trace that
caused NULL csum root, but not setting BTRFS_FS_STATE_NO_DATA_CSUMS:
load_global_roots_objectid():
ret = btrfs_search_slot();
/* Succeeded */
btrfs_item_key_to_cpu()
found = true;
/* We found the root item for csum tree. */
root = read_tree_root_path();
if (IS_ERR(root)) {
if (!btrfs_test_opt(fs_info, IGNOREBADROOTS))
/*
* Since we have rescue=ibadroots mount option,
* @ret is still 0.
*/
break;
if (!found || ret) {
/* @found is true, @ret is 0, error handling for csum
* tree is skipped.
*/
}
This means we completely skipped to set BTRFS_FS_STATE_NO_DATA_CSUMS if
the csum tree is corrupted, which results unexpected later csum lookup.
[FIX]
If read_tree_root_path() failed, always populate @ret to the error
number.
As at the end of the function, we need @ret to determine if we need to
do the extra error handling for csum tree.
Fixes: abed4aaae4f7 ("btrfs: track the csum, extent, and free space trees in a rb tree")
Reported-by: Zhiyu Zhang <zhiyuzhang999@gmail.com>
Reported-by: Longxing Li <coregee2000@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit a26bf338cdad3643a6e7c3d78a172baadba15c1a ]
Syzbot reported an assertion failure due to an attempt to add a delayed
iput after we have set BTRFS_FS_STATE_NO_DELAYED_IPUT in the fs_info
state:
WARNING: CPU: 0 PID: 65 at fs/btrfs/inode.c:3420 btrfs_add_delayed_iput+0x2f8/0x370 fs/btrfs/inode.c:3420
Modules linked in:
CPU: 0 UID: 0 PID: 65 Comm: kworker/u8:4 Not tainted 6.15.0-next-20250530-syzkaller #0 PREEMPT(full)
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 05/07/2025
Workqueue: btrfs-endio-write btrfs_work_helper
RIP: 0010:btrfs_add_delayed_iput+0x2f8/0x370 fs/btrfs/inode.c:3420
Code: 4e ad 5d (...)
RSP: 0018:ffffc9000213f780 EFLAGS: 00010293
RAX: ffffffff83c635b7 RBX: ffff888058920000 RCX: ffff88801c769e00
RDX: 0000000000000000 RSI: 0000000000000100 RDI: 0000000000000000
RBP: 0000000000000001 R08: ffff888058921b67 R09: 1ffff1100b12436c
R10: dffffc0000000000 R11: ffffed100b12436d R12: 0000000000000001
R13: dffffc0000000000 R14: ffff88807d748000 R15: 0000000000000100
FS: 0000000000000000(0000) GS:ffff888125c53000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00002000000bd038 CR3: 000000006a142000 CR4: 00000000003526f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
<TASK>
btrfs_put_ordered_extent+0x19f/0x470 fs/btrfs/ordered-data.c:635
btrfs_finish_one_ordered+0x11d8/0x1b10 fs/btrfs/inode.c:3312
btrfs_work_helper+0x399/0xc20 fs/btrfs/async-thread.c:312
process_one_work kernel/workqueue.c:3238 [inline]
process_scheduled_works+0xae1/0x17b0 kernel/workqueue.c:3321
worker_thread+0x8a0/0xda0 kernel/workqueue.c:3402
kthread+0x70e/0x8a0 kernel/kthread.c:464
ret_from_fork+0x3fc/0x770 arch/x86/kernel/process.c:148
ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245
</TASK>
This can happen due to a race with the async reclaim worker like this:
1) The async metadata reclaim worker enters shrink_delalloc(), which calls
btrfs_start_delalloc_roots() with an nr_pages argument that has a value
less than LONG_MAX, and that in turn enters start_delalloc_inodes(),
which sets the local variable 'full_flush' to false because
wbc->nr_to_write is less than LONG_MAX;
2) There it finds inode X in a root's delalloc list, grabs a reference for
inode X (with igrab()), and triggers writeback for it with
filemap_fdatawrite_wbc(), which creates an ordered extent for inode X;
3) The unmount sequence starts from another task, we enter close_ctree()
and we flush the workqueue fs_info->endio_write_workers, which waits
for the ordered extent for inode X to complete and when dropping the
last reference of the ordered extent, with btrfs_put_ordered_extent(),
when we call btrfs_add_delayed_iput() we don't add the inode to the
list of delayed iputs because it has a refcount of 2, so we decrement
it to 1 and return;
4) Shortly after at close_ctree() we call btrfs_run_delayed_iputs() which
runs all delayed iputs, and then we set BTRFS_FS_STATE_NO_DELAYED_IPUT
in the fs_info state;
5) The async reclaim worker, after calling filemap_fdatawrite_wbc(), now
calls btrfs_add_delayed_iput() for inode X and there we trigger an
assertion failure since the fs_info state has the flag
BTRFS_FS_STATE_NO_DELAYED_IPUT set.
Fix this by setting BTRFS_FS_STATE_NO_DELAYED_IPUT only after we wait for
the async reclaim workers to finish, after we call cancel_work_sync() for
them at close_ctree(), and by running delayed iputs after wait for the
reclaim workers to finish and before setting the bit.
This race was recently introduced by commit 19e60b2a95f5 ("btrfs: add
extra warning if delayed iput is added when it's not allowed"). Without
the new validation at btrfs_add_delayed_iput(), this described scenario
was safe because close_ctree() later calls btrfs_commit_super(). That
will run any final delayed iputs added by reclaim workers in the window
between the btrfs_run_delayed_iputs() and the the reclaim workers being
shut down.
Reported-by: syzbot+0ed30ad435bf6f5b7a42@syzkaller.appspotmail.com
Link: https://lore.kernel.org/linux-btrfs/6840481c.a00a0220.d4325.000c.GAE@google.com/T/#u
Fixes: 19e60b2a95f5 ("btrfs: add extra warning if delayed iput is added when it's not allowed")
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 41d69d4d78d8b179bf3bcdfc56d28a12b3a608d2 ]
If split_state() returned an error we call extent_io_tree_panic() which
will trigger a BUG() call. However if CONFIG_BUG is disabled, which is an
uncommon and exotic scenario, then we fallthrough and hit a use after free
when calling set_state_bits() since the extent state record which the
local variable 'prealloc' points to was freed by split_state().
So jump to the label 'out' after calling extent_io_tree_panic() and set
the 'prealloc' pointer to NULL since split_state() has already freed it
when it hit an error.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 5e85262e542d6da8898bb8563a724ad98f6fc936 ]
If we fsync a file (or directory) that has no more hard links, because
while a process had a file descriptor open on it, the file's last hard
link was removed and then the process did an fsync against the file
descriptor, after a power failure or crash the file still exists after
replaying the log.
This behaviour is incorrect since once an inode has no more hard links
it's not accessible anymore and we insert an orphan item into its
subvolume's tree so that the deletion of all its items is not missed in
case of a power failure or crash.
So after log replay the file shouldn't exist anymore, which is also the
behaviour on ext4, xfs, f2fs and other filesystems.
Fix this by not ignoring inodes with zero hard links at
btrfs_log_inode_parent() and by committing an inode's delayed inode when
we are not doing a fast fsync (either BTRFS_INODE_COPY_EVERYTHING or
BTRFS_INODE_NEEDS_FULL_SYNC is set in the inode's runtime flags). This
last step is necessary because when removing the last hard link we don't
delete the corresponding ref (or extref) item, instead we record the
change in the inode's delayed inode with the BTRFS_DELAYED_NODE_DEL_IREF
flag, so that when the delayed inode is committed we delete the ref/extref
item from the inode's subvolume tree - otherwise the logging code will log
the last hard link and therefore upon log replay the inode is not deleted.
The base code for a fstests test case that reproduces this bug is the
following:
. ./common/dmflakey
_require_scratch
_require_dm_target flakey
_require_mknod
_scratch_mkfs >>$seqres.full 2>&1 || _fail "mkfs failed"
_require_metadata_journaling $SCRATCH_DEV
_init_flakey
_mount_flakey
touch $SCRATCH_MNT/foo
# Commit the current transaction and persist the file.
_scratch_sync
# A fifo to communicate with a background xfs_io process that will
# fsync the file after we deleted its hard link while it's open by
# xfs_io.
mkfifo $SCRATCH_MNT/fifo
tail -f $SCRATCH_MNT/fifo | \
$XFS_IO_PROG $SCRATCH_MNT/foo >>$seqres.full &
XFS_IO_PID=$!
# Give some time for the xfs_io process to open a file descriptor for
# the file.
sleep 1
# Now while the file is open by the xfs_io process, delete its only
# hard link.
rm -f $SCRATCH_MNT/foo
# Now that it has no more hard links, make the xfs_io process fsync it.
echo "fsync" > $SCRATCH_MNT/fifo
# Terminate the xfs_io process so that we can unmount.
echo "quit" > $SCRATCH_MNT/fifo
wait $XFS_IO_PID
unset XFS_IO_PID
# Simulate a power failure and then mount again the filesystem to
# replay the journal/log.
_flakey_drop_and_remount
# We don't expect the file to exist anymore, since it was fsynced when
# it had no more hard links.
[ -f $SCRATCH_MNT/foo ] && echo "file foo still exists"
_unmount_flakey
# success, all done
echo "Silence is golden"
status=0
exit
A test case for fstests will be submitted soon.
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 3bf179e36da917c5d9bec71c714573ed1649b7c1 ]
If insert_state() state failed it returns an error pointer and we call
extent_io_tree_panic() which will trigger a BUG() call. However if
CONFIG_BUG is disabled, which is an uncommon and exotic scenario, then
we fallthrough and call cache_state() which will dereference the error
pointer, resulting in an invalid memory access.
So jump to the 'out' label after calling extent_io_tree_panic(), it also
makes the code more clear besides dealing with the exotic scenario where
CONFIG_BUG is disabled.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 17a85f520469a1838379de8ad24f63e778f7c277 ]
If we're doing a mmap write against a folio that has i_size somewhere in
the middle and we have multiple sectors in the folio, we may have to
release excess space previously reserved, for the range going from the
rounded up (to sector size) i_size to the folio's end offset. We are
calculating the right amount to release and passing it to
btrfs_delalloc_release_space(), but we are passing the wrong start offset
of that range - we're passing the folio's start offset instead of the
end offset, plus 1, of the range for which we keep the reservation. This
may result in releasing more space then we should and eventually trigger
an underflow of the data space_info's bytes_may_use counter.
So fix this by passing the start offset as 'end + 1' instead of
'page_start' to btrfs_delalloc_release_space().
Fixes: d0b7da88f640 ("Btrfs: btrfs_page_mkwrite: Reserve space in sectorsized units")
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit d3914d6030aa6be2993dfc223d096ff93018c236 ]
If when truncating a block we fail to reserve data space and then we
proceed anyway because we can do a NOCOW write, if we later get an error
when trying to get the folio from the inode's mapping, we end up releasing
data space that we haven't reserved, screwing up the bytes_may_use counter
from the data space_info, eventually resulting in an underflow when all
other reservations done by other tasks are released, if any, or right away
if there are no other reservations at the moment.
This is because when we get an error when trying to grab the block's folio
we call btrfs_delalloc_release_space(), which releases metadata (which we
have reserved) and data (which we haven't reserved).
Fix this by calling btrfs_delalloc_release_space() only if we did reserve
data space, that is, if we aren't falling back to NOCOW, meaning the local
variable @only_release_metadata has a false value, otherwise release only
metadata by calling btrfs_delalloc_release_metadata().
Fixes: 6d4572a9d71d ("btrfs: allow btrfs_truncate_block() to fallback to nocow for data space reservation")
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit f2c19541e421b3235efc515dad88b581f00592ae ]
When the bytenr doesn't match for a metadata tree block, we will report
it as an csum error, which is incorrect and should be reported as a
metadata error instead.
Fixes: a3ddbaebc7c9 ("btrfs: scrub: introduce a helper to verify one metadata block")
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit ec1f3a207cdf314eae4d4ae145f1ffdb829f0652 ]
[BUG]
Since the migration to the new scrub_stripe interface, scrub no longer
updates the device stats when hitting an error, no matter if it's a read
or checksum mismatch error. E.g:
BTRFS info (device dm-2): scrub: started on devid 1
BTRFS error (device dm-2): unable to fixup (regular) error at logical 13631488 on dev /dev/mapper/test-scratch1 physical 13631488
BTRFS warning (device dm-2): checksum error at logical 13631488 on dev /dev/mapper/test-scratch1, physical 13631488, root 5, inode 257, offset 0, length 4096, links 1 (path: file)
BTRFS error (device dm-2): unable to fixup (regular) error at logical 13631488 on dev /dev/mapper/test-scratch1 physical 13631488
BTRFS warning (device dm-2): checksum error at logical 13631488 on dev /dev/mapper/test-scratch1, physical 13631488, root 5, inode 257, offset 0, length 4096, links 1 (path: file)
BTRFS info (device dm-2): scrub: finished on devid 1 with status: 0
Note there is no line showing the device stats error update.
[CAUSE]
In the migration to the new scrub_stripe interface, we no longer call
btrfs_dev_stat_inc_and_print().
[FIX]
- Introduce a new bitmap for metadata generation errors
* A new bitmap
@meta_gen_error_bitmap is introduced to record which blocks have
metadata generation mismatch errors.
* A new counter for that bitmap
@init_nr_meta_gen_errors, is also introduced to store the number of
generation mismatch errors that are found during the initial read.
This is for the error reporting at scrub_stripe_report_errors().
* New dedicated error message for unrepaired generation mismatches
* Update @meta_gen_error_bitmap if a transid mismatch is hit
- Add btrfs_dev_stat_inc_and_print() calls to the following call sites
* scrub_stripe_report_errors()
* scrub_write_endio()
This is only for the write errors.
This means there is a minor behavior change:
- The timing of device stats error message
Since we concentrate the error messages at
scrub_stripe_report_errors(), the device stats error messages will all
show up in one go, after the detailed scrub error messages:
BTRFS error (device dm-2): unable to fixup (regular) error at logical 13631488 on dev /dev/mapper/test-scratch1 physical 13631488
BTRFS warning (device dm-2): checksum error at logical 13631488 on dev /dev/mapper/test-scratch1, physical 13631488, root 5, inode 257, offset 0, length 4096, links 1 (path: file)
BTRFS error (device dm-2): unable to fixup (regular) error at logical 13631488 on dev /dev/mapper/test-scratch1 physical 13631488
BTRFS warning (device dm-2): checksum error at logical 13631488 on dev /dev/mapper/test-scratch1, physical 13631488, root 5, inode 257, offset 0, length 4096, links 1 (path: file)
BTRFS error (device dm-2): bdev /dev/mapper/test-scratch1 errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
BTRFS error (device dm-2): bdev /dev/mapper/test-scratch1 errs: wr 0, rd 0, flush 0, corrupt 2, gen 0
Fixes: e02ee89baa66 ("btrfs: scrub: switch scrub_simple_mirror() to scrub_stripe infrastructure")
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
- fix potential endless loop when discarding a block group when
disabling discard
- reinstate message when setting a large value of mount option 'commit'
- fix a folio leak when async extent submission fails
* tag 'for-6.15-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: add back warning for mount option commit values exceeding 300
btrfs: fix folio leak in submit_one_async_extent()
btrfs: fix discard worker infinite loop after disabling discard
|
|
The Btrfs documentation states that if the commit value is greater than
300 a warning should be issued. The warning was accidentally lost in the
new mount API update.
Fixes: 6941823cc878 ("btrfs: remove old mount API code")
CC: stable@vger.kernel.org # 6.12+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Kyoji Ogasawara <sawara04.o@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
If btrfs_reserve_extent() fails while submitting an async_extent for a
compressed write, then we fail to call free_async_extent_pages() on the
async_extent and leak its folios. A likely cause for such a failure
would be btrfs_reserve_extent() failing to find a large enough
contiguous free extent for the compressed extent.
I was able to reproduce this by:
1. mount with compress-force=zstd:3
2. fallocating most of a filesystem to a big file
3. fragmenting the remaining free space
4. trying to copy in a file which zstd would generate large compressed
extents for (vmlinux worked well for this)
Step 4. hits the memory leak and can be repeated ad nauseam to
eventually exhaust the system memory.
Fix this by detecting the case where we fallback to uncompressed
submission for a compressed async_extent and ensuring that we call
free_async_extent_pages().
Fixes: 131a821a243f ("btrfs: fallback if compressed IO fails for ENOSPC")
CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Co-developed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
If the discard worker is running and there's currently only one block
group, that block group is a data block group, it's in the unused block
groups discard list and is being used (it got an extent allocated from it
after becoming unused), the worker can end up in an infinite loop if a
transaction abort happens or the async discard is disabled (during remount
or unmount for example).
This happens like this:
1) Task A, the discard worker, is at peek_discard_list() and
find_next_block_group() returns block group X;
2) Block group X is in the unused block groups discard list (its discard
index is BTRFS_DISCARD_INDEX_UNUSED) since at some point in the past
it become an unused block group and was added to that list, but then
later it got an extent allocated from it, so its ->used counter is not
zero anymore;
3) The current transaction is aborted by task B and we end up at
__btrfs_handle_fs_error() in the transaction abort path, where we call
btrfs_discard_stop(), which clears BTRFS_FS_DISCARD_RUNNING from
fs_info, and then at __btrfs_handle_fs_error() we set the fs to RO mode
(setting SB_RDONLY in the super block's s_flags field);
4) Task A calls __add_to_discard_list() with the goal of moving the block
group from the unused block groups discard list into another discard
list, but at __add_to_discard_list() we end up doing nothing because
btrfs_run_discard_work() returns false, since the super block has
SB_RDONLY set in its flags and BTRFS_FS_DISCARD_RUNNING is not set
anymore in fs_info->flags. So block group X remains in the unused block
groups discard list;
5) Task A then does a goto into the 'again' label, calls
find_next_block_group() again we gets block group X again. Then it
repeats the previous steps over and over since there are not other
block groups in the discard lists and block group X is never moved
out of the unused block groups discard list since
btrfs_run_discard_work() keeps returning false and therefore
__add_to_discard_list() doesn't move block group X out of that discard
list.
When this happens we can get a soft lockup report like this:
[71.957] watchdog: BUG: soft lockup - CPU#0 stuck for 27s! [kworker/u4:3:97]
[71.957] Modules linked in: xfs af_packet rfkill (...)
[71.957] CPU: 0 UID: 0 PID: 97 Comm: kworker/u4:3 Tainted: G W 6.14.2-1-default #1 openSUSE Tumbleweed 968795ef2b1407352128b466fe887416c33af6fa
[71.957] Tainted: [W]=WARN
[71.957] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-3-gd478f380-rebuilt.opensuse.org 04/01/2014
[71.957] Workqueue: btrfs_discard btrfs_discard_workfn [btrfs]
[71.957] RIP: 0010:btrfs_discard_workfn+0xc4/0x400 [btrfs]
[71.957] Code: c1 01 48 83 (...)
[71.957] RSP: 0018:ffffafaec03efe08 EFLAGS: 00000246
[71.957] RAX: ffff897045500000 RBX: ffff8970413ed8d0 RCX: 0000000000000000
[71.957] RDX: 0000000000000001 RSI: ffff8970413ed8d0 RDI: 0000000a8f1272ad
[71.957] RBP: 0000000a9d61c60e R08: ffff897045500140 R09: 8080808080808080
[71.957] R10: ffff897040276800 R11: fefefefefefefeff R12: ffff8970413ed860
[71.957] R13: ffff897045500000 R14: ffff8970413ed868 R15: 0000000000000000
[71.957] FS: 0000000000000000(0000) GS:ffff89707bc00000(0000) knlGS:0000000000000000
[71.957] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[71.957] CR2: 00005605bcc8d2f0 CR3: 000000010376a001 CR4: 0000000000770ef0
[71.957] PKRU: 55555554
[71.957] Call Trace:
[71.957] <TASK>
[71.957] process_one_work+0x17e/0x330
[71.957] worker_thread+0x2ce/0x3f0
[71.957] ? __pfx_worker_thread+0x10/0x10
[71.957] kthread+0xef/0x220
[71.957] ? __pfx_kthread+0x10/0x10
[71.957] ret_from_fork+0x34/0x50
[71.957] ? __pfx_kthread+0x10/0x10
[71.957] ret_from_fork_asm+0x1a/0x30
[71.957] </TASK>
[71.957] Kernel panic - not syncing: softlockup: hung tasks
[71.987] CPU: 0 UID: 0 PID: 97 Comm: kworker/u4:3 Tainted: G W L 6.14.2-1-default #1 openSUSE Tumbleweed 968795ef2b1407352128b466fe887416c33af6fa
[71.989] Tainted: [W]=WARN, [L]=SOFTLOCKUP
[71.989] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-3-gd478f380-rebuilt.opensuse.org 04/01/2014
[71.991] Workqueue: btrfs_discard btrfs_discard_workfn [btrfs]
[71.992] Call Trace:
[71.993] <IRQ>
[71.994] dump_stack_lvl+0x5a/0x80
[71.994] panic+0x10b/0x2da
[71.995] watchdog_timer_fn.cold+0x9a/0xa1
[71.996] ? __pfx_watchdog_timer_fn+0x10/0x10
[71.997] __hrtimer_run_queues+0x132/0x2a0
[71.997] hrtimer_interrupt+0xff/0x230
[71.998] __sysvec_apic_timer_interrupt+0x55/0x100
[71.999] sysvec_apic_timer_interrupt+0x6c/0x90
[72.000] </IRQ>
[72.000] <TASK>
[72.001] asm_sysvec_apic_timer_interrupt+0x1a/0x20
[72.002] RIP: 0010:btrfs_discard_workfn+0xc4/0x400 [btrfs]
[72.002] Code: c1 01 48 83 (...)
[72.005] RSP: 0018:ffffafaec03efe08 EFLAGS: 00000246
[72.006] RAX: ffff897045500000 RBX: ffff8970413ed8d0 RCX: 0000000000000000
[72.006] RDX: 0000000000000001 RSI: ffff8970413ed8d0 RDI: 0000000a8f1272ad
[72.007] RBP: 0000000a9d61c60e R08: ffff897045500140 R09: 8080808080808080
[72.008] R10: ffff897040276800 R11: fefefefefefefeff R12: ffff8970413ed860
[72.009] R13: ffff897045500000 R14: ffff8970413ed868 R15: 0000000000000000
[72.010] ? btrfs_discard_workfn+0x51/0x400 [btrfs 23b01089228eb964071fb7ca156eee8cd3bf996f]
[72.011] process_one_work+0x17e/0x330
[72.012] worker_thread+0x2ce/0x3f0
[72.013] ? __pfx_worker_thread+0x10/0x10
[72.014] kthread+0xef/0x220
[72.014] ? __pfx_kthread+0x10/0x10
[72.015] ret_from_fork+0x34/0x50
[72.015] ? __pfx_kthread+0x10/0x10
[72.016] ret_from_fork_asm+0x1a/0x30
[72.017] </TASK>
[72.017] Kernel Offset: 0x15000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[72.019] Rebooting in 90 seconds..
So fix this by making sure we move a block group out of the unused block
groups discard list when calling __add_to_discard_list().
Fixes: 2bee7eb8bb81 ("btrfs: discard one region at a time in async discard")
Link: https://bugzilla.suse.com/show_bug.cgi?id=1242012
CC: stable@vger.kernel.org # 5.10+
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Daniel Vacek <neelx@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
- revert device path canonicalization, this does not work as intended
with namespaces and is not reliable in all setups
- fix crash in scrub when checksum tree is not valid, e.g. when mounted
with rescue=ignoredatacsums
- fix crash when tracepoint btrfs_prelim_ref_insert is enabled
- other minor fixups:
- open code folio_index(), meant to be used in MM code
- use matching type for sizeof in compression allocation
* tag 'for-6.15-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: open code folio_index() in btree_clear_folio_dirty_tag()
Revert "btrfs: canonicalize the device path before adding it"
btrfs: avoid NULL pointer dereference if no valid csum tree
btrfs: handle empty eb->folios in num_extent_folios()
btrfs: correct the order of prelim_ref arguments in btrfs__prelim_ref
btrfs: compression: adjust cb->compressed_folios allocation type
|
|
The folio_index() helper is only needed for mixed usage of page cache
and swap cache, for pure page cache usage, the caller can just use
folio->index instead.
It can't be a swap cache folio here. Swap mapping may only call into fs
through 'swap_rw' but btrfs does not use that method for swap.
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Kairui Song <kasong@tencent.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
This reverts commit 7e06de7c83a746e58d4701e013182af133395188.
Commit 7e06de7c83a7 ("btrfs: canonicalize the device path before adding
it") tries to make btrfs to use "/dev/mapper/*" name first, then any
filename inside "/dev/" as the device path.
This is mostly fine when there is only the root namespace involved, but
when multiple namespace are involved, things can easily go wrong for the
d_path() usage.
As d_path() returns a file path that is namespace dependent, the
resulted string may not make any sense in another namespace.
Furthermore, the "/dev/" prefix checks itself is not reliable, one can
still make a valid initramfs without devtmpfs, and fill all needed
device nodes manually.
Overall the userspace has all its might to pass whatever device path for
mount, and we are not going to win the war trying to cover every corner
case.
So just revert that commit, and do no extra d_path() based file path
sanity check.
CC: stable@vger.kernel.org # 6.12+
Link: https://lore.kernel.org/linux-fsdevel/20250115185608.GA2223535@zen.localdomain/
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
[BUG]
When trying read-only scrub on a btrfs with rescue=idatacsums mount
option, it will crash with the following call trace:
BUG: kernel NULL pointer dereference, address: 0000000000000208
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
CPU: 1 UID: 0 PID: 835 Comm: btrfs Tainted: G O 6.15.0-rc3-custom+ #236 PREEMPT(full)
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS unknown 02/02/2022
RIP: 0010:btrfs_lookup_csums_bitmap+0x49/0x480 [btrfs]
Call Trace:
<TASK>
scrub_find_fill_first_stripe+0x35b/0x3d0 [btrfs]
scrub_simple_mirror+0x175/0x290 [btrfs]
scrub_stripe+0x5f7/0x6f0 [btrfs]
scrub_chunk+0x9a/0x150 [btrfs]
scrub_enumerate_chunks+0x333/0x660 [btrfs]
btrfs_scrub_dev+0x23e/0x600 [btrfs]
btrfs_ioctl+0x1dcf/0x2f80 [btrfs]
__x64_sys_ioctl+0x97/0xc0
do_syscall_64+0x4f/0x120
entry_SYSCALL_64_after_hwframe+0x76/0x7e
[CAUSE]
Mount option "rescue=idatacsums" will completely skip loading the csum
tree, so that any data read will not find any data csum thus we will
ignore data checksum verification.
Normally call sites utilizing csum tree will check the fs state flag
NO_DATA_CSUMS bit, but unfortunately scrub does not check that bit at all.
This results in scrub to call btrfs_search_slot() on a NULL pointer
and triggered above crash.
[FIX]
Check both extent and csum tree root before doing any tree search.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
num_extent_folios() unconditionally calls folio_order() on
eb->folios[0]. If that is NULL this will be a segfault. It is reasonable
for it to return 0 as the number of folios in the eb when the first
entry is NULL, so do that instead.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
In preparation for making the kmalloc() family of allocators type aware,
we need to make sure that the returned type from the allocation matches
the type of the variable being assigned. (Before, the allocator would
always return "void *", which can be implicitly cast to any pointer type.)
The assigned type is "struct folio **" but the returned type will be
"struct page **". These are the same allocation size (pointer size), but
the types don't match. Adjust the allocation type to match the assignment.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Kees Cook <kees@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
- fix potential inode leak in iget() after memory allocation failure
- in subpage mode, fix extent buffer bitmap iteration when writing out
dirty sectors
- fix range calculation when falling back to COW for a NOCOW file
* tag 'for-6.15-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: adjust subpage bit start based on sectorsize
btrfs: fix the inode leak in btrfs_iget()
btrfs: fix COW handling in run_delalloc_nocow()
|
|
When running machines with 64k page size and a 16k nodesize we started
seeing tree log corruption in production. This turned out to be because
we were not writing out dirty blocks sometimes, so this in fact affects
all metadata writes.
When writing out a subpage EB we scan the subpage bitmap for a dirty
range. If the range isn't dirty we do
bit_start++;
to move onto the next bit. The problem is the bitmap is based on the
number of sectors that an EB has. So in this case, we have a 64k
pagesize, 16k nodesize, but a 4k sectorsize. This means our bitmap is 4
bits for every node. With a 64k page size we end up with 4 nodes per
page.
To make this easier this is how everything looks
[0 16k 32k 48k ] logical address
[0 4 8 12 ] radix tree offset
[ 64k page ] folio
[ 16k eb ][ 16k eb ][ 16k eb ][ 16k eb ] extent buffers
[ | | | | | | | | | | | | | | | | ] bitmap
Now we use all of our addressing based on fs_info->sectorsize_bits, so
as you can see the above our 16k eb->start turns into radix entry 4.
When we find a dirty range for our eb, we correctly do bit_start +=
sectors_per_node, because if we start at bit 0, the next bit for the
next eb is 4, to correspond to eb->start 16k.
However if our range is clean, we will do bit_start++, which will now
put us offset from our radix tree entries.
In our case, assume that the first time we check the bitmap the block is
not dirty, we increment bit_start so now it == 1, and then we loop
around and check again. This time it is dirty, and we go to find that
start using the following equation
start = folio_start + bit_start * fs_info->sectorsize;
so in the case above, eb->start 0 is now dirty, and we calculate start
as
0 + 1 * fs_info->sectorsize = 4096
4096 >> 12 = 1
Now we're looking up the radix tree for 1, and we won't find an eb.
What's worse is now we're using bit_start == 1, so we do bit_start +=
sectors_per_node, which is now 5. If that eb is dirty we will run into
the same thing, we will look at an offset that is not populated in the
radix tree, and now we're skipping the writeout of dirty extent buffers.
The best fix for this is to not use sectorsize_bits to address nodes,
but that's a larger change. Since this is a fs corruption problem fix
it simply by always using sectors_per_node to increment the start bit.
Fixes: c4aec299fa8f ("btrfs: introduce submit_eb_subpage() to submit a subpage metadata page")
CC: stable@vger.kernel.org # 5.15+
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
[BUG]
There is a bug report that a syzbot reproducer can lead to the following
busy inode at unmount time:
BTRFS info (device loop1): last unmount of filesystem 1680000e-3c1e-4c46-84b6-56bd3909af50
VFS: Busy inodes after unmount of loop1 (btrfs)
------------[ cut here ]------------
kernel BUG at fs/super.c:650!
Oops: invalid opcode: 0000 [#1] SMP KASAN NOPTI
CPU: 0 UID: 0 PID: 48168 Comm: syz-executor Not tainted 6.15.0-rc2-00471-g119009db2674 #2 PREEMPT(full)
Hardware name: QEMU Ubuntu 24.04 PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
RIP: 0010:generic_shutdown_super+0x2e9/0x390 fs/super.c:650
Call Trace:
<TASK>
kill_anon_super+0x3a/0x60 fs/super.c:1237
btrfs_kill_super+0x3b/0x50 fs/btrfs/super.c:2099
deactivate_locked_super+0xbe/0x1a0 fs/super.c:473
deactivate_super fs/super.c:506 [inline]
deactivate_super+0xe2/0x100 fs/super.c:502
cleanup_mnt+0x21f/0x440 fs/namespace.c:1435
task_work_run+0x14d/0x240 kernel/task_work.c:227
resume_user_mode_work include/linux/resume_user_mode.h:50 [inline]
exit_to_user_mode_loop kernel/entry/common.c:114 [inline]
exit_to_user_mode_prepare include/linux/entry-common.h:329 [inline]
__syscall_exit_to_user_mode_work kernel/entry/common.c:207 [inline]
syscall_exit_to_user_mode+0x269/0x290 kernel/entry/common.c:218
do_syscall_64+0xd4/0x250 arch/x86/entry/syscall_64.c:100
entry_SYSCALL_64_after_hwframe+0x77/0x7f
</TASK>
[CAUSE]
When btrfs_alloc_path() failed, btrfs_iget() directly returned without
releasing the inode already allocated by btrfs_iget_locked().
This results the above busy inode and trigger the kernel BUG.
[FIX]
Fix it by calling iget_failed() if btrfs_alloc_path() failed.
If we hit error inside btrfs_read_locked_inode(), it will properly call
iget_failed(), so nothing to worry about.
Although the iget_failed() cleanup inside btrfs_read_locked_inode() is a
break of the normal error handling scheme, let's fix the obvious bug
and backport first, then rework the error handling later.
Reported-by: Penglei Jiang <superman.xpt@gmail.com>
Link: https://lore.kernel.org/linux-btrfs/20250421102425.44431-1-superman.xpt@gmail.com/
Fixes: 7c855e16ab72 ("btrfs: remove conditional path allocation in btrfs_read_locked_inode()")
CC: stable@vger.kernel.org # 6.13+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Penglei Jiang <superman.xpt@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
In run_delalloc_nocow(), when the found btrfs_key's offset > cur_offset,
it indicates a gap between the current processing region and
the next file extent. The original code would directly jump to
the "must_cow" label, which increments the slot and forces a fallback
to COW. This behavior might skip an extent item and result in an
overestimated COW fallback range.
This patch modifies the logic so that when a gap is detected:
- If no COW range is already being recorded (cow_start is unset),
cow_start is set to cur_offset.
- cur_offset is then advanced to the beginning of the next extent.
- Instead of jumping to "must_cow", control flows directly to
"next_slot" so that the same extent item can be reexamined properly.
The change ensures that we accurately account for the extent gap and
avoid accidentally extending the range that needs to fallback to COW.
CC: stable@vger.kernel.org # 6.6+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Dave Chen <davechen@synology.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
- subpage mode fixes:
- access correct object (folio) when looking up bit offset
- fix assertion condition for number of blocks per folio
- fix upper boundary of locking range in hole punch
- zoned fixes:
- fix potential deadlock caught by lockdep when zone reporting and
device freeze run in parallel
- fix zone write pointer mismatch and NULL pointer dereference when
metadata are converted from DUP to RAID1
- fix error handling when reloc inode creation fails
- in tree-checker, unify error code for header level check
- block layer: add helpers to read zone capacity
* tag 'for-6.15-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: zoned: skip reporting zone for new block group
block: introduce zone capacity helper
btrfs: tree-checker: adjust error code for header level check
btrfs: fix invalid inode pointer after failure to create reloc inode
btrfs: zoned: return EIO on RAID1 block group write pointer mismatch
btrfs: fix the ASSERT() inside GET_SUBPAGE_BITMAP()
btrfs: avoid page_lockend underflow in btrfs_punch_hole_lock_range()
btrfs: subpage: access correct object when reading bitmap start in subpage_calc_start_bit()
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
- handle encoded read ioctl returning EAGAIN so it does not mistakenly
free the work structure
- escape subvolume path in mount option list so it cannot be wrongly
parsed when the path contains ","
- remove folio size assertions when writing super block to device with
enabled large folios
* tag 'for-6.15-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: remove folio order ASSERT()s in super block writeback path
btrfs: correctly escape subvol in btrfs_show_options()
btrfs: ioctl: don't free iov when btrfs_encoded_read() returns -EAGAIN
|
|
There is a potential deadlock if we do report zones in an IO context, detailed
in below lockdep report. When one process do a report zones and another process
freezes the block device, the report zones side cannot allocate a tag because
the freeze is already started. This can thus result in new block group creation
to hang forever, blocking the write path.
Thankfully, a new block group should be created on empty zones. So, reporting
the zones is not necessary and we can set the write pointer = 0 and load the
zone capacity from the block layer using bdev_zone_capacity() helper.
======================================================
WARNING: possible circular locking dependency detected
6.14.0-rc1 #252 Not tainted
------------------------------------------------------
modprobe/1110 is trying to acquire lock:
ffff888100ac83e0 ((work_completion)(&(&wb->dwork)->work)){+.+.}-{0:0}, at: __flush_work+0x38f/0xb60
but task is already holding lock:
ffff8881205b6f20 (&q->q_usage_counter(queue)#16){++++}-{0:0}, at: sd_remove+0x85/0x130
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #3 (&q->q_usage_counter(queue)#16){++++}-{0:0}:
blk_queue_enter+0x3d9/0x500
blk_mq_alloc_request+0x47d/0x8e0
scsi_execute_cmd+0x14f/0xb80
sd_zbc_do_report_zones+0x1c1/0x470
sd_zbc_report_zones+0x362/0xd60
blkdev_report_zones+0x1b1/0x2e0
btrfs_get_dev_zones+0x215/0x7e0 [btrfs]
btrfs_load_block_group_zone_info+0x6d2/0x2c10 [btrfs]
btrfs_make_block_group+0x36b/0x870 [btrfs]
btrfs_create_chunk+0x147d/0x2320 [btrfs]
btrfs_chunk_alloc+0x2ce/0xcf0 [btrfs]
start_transaction+0xce6/0x1620 [btrfs]
btrfs_uuid_scan_kthread+0x4ee/0x5b0 [btrfs]
kthread+0x39d/0x750
ret_from_fork+0x30/0x70
ret_from_fork_asm+0x1a/0x30
-> #2 (&fs_info->dev_replace.rwsem){++++}-{4:4}:
down_read+0x9b/0x470
btrfs_map_block+0x2ce/0x2ce0 [btrfs]
btrfs_submit_chunk+0x2d4/0x16c0 [btrfs]
btrfs_submit_bbio+0x16/0x30 [btrfs]
btree_write_cache_pages+0xb5a/0xf90 [btrfs]
do_writepages+0x17f/0x7b0
__writeback_single_inode+0x114/0xb00
writeback_sb_inodes+0x52b/0xe00
wb_writeback+0x1a7/0x800
wb_workfn+0x12a/0xbd0
process_one_work+0x85a/0x1460
worker_thread+0x5e2/0xfc0
kthread+0x39d/0x750
ret_from_fork+0x30/0x70
ret_from_fork_asm+0x1a/0x30
-> #1 (&fs_info->zoned_meta_io_lock){+.+.}-{4:4}:
__mutex_lock+0x1aa/0x1360
btree_write_cache_pages+0x252/0xf90 [btrfs]
do_writepages+0x17f/0x7b0
__writeback_single_inode+0x114/0xb00
writeback_sb_inodes+0x52b/0xe00
wb_writeback+0x1a7/0x800
wb_workfn+0x12a/0xbd0
process_one_work+0x85a/0x1460
worker_thread+0x5e2/0xfc0
kthread+0x39d/0x750
ret_from_fork+0x30/0x70
ret_from_fork_asm+0x1a/0x30
-> #0 ((work_completion)(&(&wb->dwork)->work)){+.+.}-{0:0}:
__lock_acquire+0x2f52/0x5ea0
lock_acquire+0x1b1/0x540
__flush_work+0x3ac/0xb60
wb_shutdown+0x15b/0x1f0
bdi_unregister+0x172/0x5b0
del_gendisk+0x841/0xa20
sd_remove+0x85/0x130
device_release_driver_internal+0x368/0x520
bus_remove_device+0x1f1/0x3f0
device_del+0x3bd/0x9c0
__scsi_remove_device+0x272/0x340
scsi_forget_host+0xf7/0x170
scsi_remove_host+0xd2/0x2a0
sdebug_driver_remove+0x52/0x2f0 [scsi_debug]
device_release_driver_internal+0x368/0x520
bus_remove_device+0x1f1/0x3f0
device_del+0x3bd/0x9c0
device_unregister+0x13/0xa0
sdebug_do_remove_host+0x1fb/0x290 [scsi_debug]
scsi_debug_exit+0x17/0x70 [scsi_debug]
__do_sys_delete_module.isra.0+0x321/0x520
do_syscall_64+0x93/0x180
entry_SYSCALL_64_after_hwframe+0x76/0x7e
other info that might help us debug this:
Chain exists of:
(work_completion)(&(&wb->dwork)->work) --> &fs_info->dev_replace.rwsem --> &q->q_usage_counter(queue)#16
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(&q->q_usage_counter(queue)#16);
lock(&fs_info->dev_replace.rwsem);
lock(&q->q_usage_counter(queue)#16);
lock((work_completion)(&(&wb->dwork)->work));
*** DEADLOCK ***
5 locks held by modprobe/1110:
#0: ffff88811f7bc108 (&dev->mutex){....}-{4:4}, at: device_release_driver_internal+0x8f/0x520
#1: ffff8881022ee0e0 (&shost->scan_mutex){+.+.}-{4:4}, at: scsi_remove_host+0x20/0x2a0
#2: ffff88811b4c4378 (&dev->mutex){....}-{4:4}, at: device_release_driver_internal+0x8f/0x520
#3: ffff8881205b6f20 (&q->q_usage_counter(queue)#16){++++}-{0:0}, at: sd_remove+0x85/0x130
#4: ffffffffa3284360 (rcu_read_lock){....}-{1:3}, at: __flush_work+0xda/0xb60
stack backtrace:
CPU: 0 UID: 0 PID: 1110 Comm: modprobe Not tainted 6.14.0-rc1 #252
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-3.fc41 04/01/2014
Call Trace:
<TASK>
dump_stack_lvl+0x6a/0x90
print_circular_bug.cold+0x1e0/0x274
check_noncircular+0x306/0x3f0
? __pfx_check_noncircular+0x10/0x10
? mark_lock+0xf5/0x1650
? __pfx_check_irq_usage+0x10/0x10
? lockdep_lock+0xca/0x1c0
? __pfx_lockdep_lock+0x10/0x10
__lock_acquire+0x2f52/0x5ea0
? __pfx___lock_acquire+0x10/0x10
? __pfx_mark_lock+0x10/0x10
lock_acquire+0x1b1/0x540
? __flush_work+0x38f/0xb60
? __pfx_lock_acquire+0x10/0x10
? __pfx_lock_release+0x10/0x10
? mark_held_locks+0x94/0xe0
? __flush_work+0x38f/0xb60
__flush_work+0x3ac/0xb60
? __flush_work+0x38f/0xb60
? __pfx_mark_lock+0x10/0x10
? __pfx___flush_work+0x10/0x10
? __pfx_wq_barrier_func+0x10/0x10
? __pfx___might_resched+0x10/0x10
? mark_held_locks+0x94/0xe0
wb_shutdown+0x15b/0x1f0
bdi_unregister+0x172/0x5b0
? __pfx_bdi_unregister+0x10/0x10
? up_write+0x1ba/0x510
del_gendisk+0x841/0xa20
? __pfx_del_gendisk+0x10/0x10
? _raw_spin_unlock_irqrestore+0x35/0x60
? __pm_runtime_resume+0x79/0x110
sd_remove+0x85/0x130
device_release_driver_internal+0x368/0x520
? kobject_put+0x5d/0x4a0
bus_remove_device+0x1f1/0x3f0
device_del+0x3bd/0x9c0
? __pfx_device_del+0x10/0x10
__scsi_remove_device+0x272/0x340
scsi_forget_host+0xf7/0x170
scsi_remove_host+0xd2/0x2a0
sdebug_driver_remove+0x52/0x2f0 [scsi_debug]
? kernfs_remove_by_name_ns+0xc0/0xf0
device_release_driver_internal+0x368/0x520
? kobject_put+0x5d/0x4a0
bus_remove_device+0x1f1/0x3f0
device_del+0x3bd/0x9c0
? __pfx_device_del+0x10/0x10
? __pfx___mutex_unlock_slowpath+0x10/0x10
device_unregister+0x13/0xa0
sdebug_do_remove_host+0x1fb/0x290 [scsi_debug]
scsi_debug_exit+0x17/0x70 [scsi_debug]
__do_sys_delete_module.isra.0+0x321/0x520
? __pfx___do_sys_delete_module.isra.0+0x10/0x10
? __pfx_slab_free_after_rcu_debug+0x10/0x10
? kasan_save_stack+0x2c/0x50
? kasan_record_aux_stack+0xa3/0xb0
? __call_rcu_common.constprop.0+0xc4/0xfb0
? kmem_cache_free+0x3a0/0x590
? __x64_sys_close+0x78/0xd0
do_syscall_64+0x93/0x180
? lock_is_held_type+0xd5/0x130
? __call_rcu_common.constprop.0+0x3c0/0xfb0
? lockdep_hardirqs_on+0x78/0x100
? __call_rcu_common.constprop.0+0x3c0/0xfb0
? __pfx___call_rcu_common.constprop.0+0x10/0x10
? kmem_cache_free+0x3a0/0x590
? lockdep_hardirqs_on_prepare+0x16d/0x400
? do_syscall_64+0x9f/0x180
? lockdep_hardirqs_on+0x78/0x100
? do_syscall_64+0x9f/0x180
? __pfx___x64_sys_openat+0x10/0x10
? lockdep_hardirqs_on_prepare+0x16d/0x400
? do_syscall_64+0x9f/0x180
? lockdep_hardirqs_on+0x78/0x100
? do_syscall_64+0x9f/0x180
entry_SYSCALL_64_after_hwframe+0x76/0x7e
RIP: 0033:0x7f436712b68b
RSP: 002b:00007ffe9f1a8658 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
RAX: ffffffffffffffda RBX: 00005559b367fd80 RCX: 00007f436712b68b
RDX: 0000000000000000 RSI: 0000000000000800 RDI: 00005559b367fde8
RBP: 00007ffe9f1a8680 R08: 1999999999999999 R09: 0000000000000000
R10: 00007f43671a5fe0 R11: 0000000000000206 R12: 0000000000000000
R13: 00007ffe9f1a86b0 R14: 0000000000000000 R15: 0000000000000000
</TASK>
Reported-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
CC: <stable@vger.kernel.org> # 6.13+
Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The whole tree checker returns EUCLEAN, except the one check in
btrfs_verify_level_key(). This was inherited from the function that was
moved from disk-io.c in 2cac5af16537 ("btrfs: move
btrfs_verify_level_key into tree-checker.c") but this should be unified
with the rest.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
If we have a failure at create_reloc_inode(), under the 'out' label we
assign an error pointer to the 'inode' variable and then return a weird
pointer because we return the expression "&inode->vfs_inode":
static noinline_for_stack struct inode *create_reloc_inode(
const struct btrfs_block_group *group)
{
(...)
out:
(...)
if (ret) {
if (inode)
iput(&inode->vfs_inode);
inode = ERR_PTR(ret);
}
return &inode->vfs_inode;
}
This can make us return a pointer that is not an error pointer and make
the caller proceed as if an error didn't happen and later result in an
invalid memory access when dereferencing the inode pointer.
Syzbot reported reported such a case with the following stack trace:
R10: 0000000000000002 R11: 0000000000000246 R12: 0000000000000000
R13: 0000000000000000 R14: 431bde82d7b634db R15: 00007ffc55de5790
</TASK>
BTRFS info (device loop0): relocating block group 6881280 flags data|metadata
Oops: general protection fault, probably for non-canonical address 0xdffffc0000000045: 0000 [#1] SMP KASAN NOPTI
KASAN: null-ptr-deref in range [0x0000000000000228-0x000000000000022f]
CPU: 0 UID: 0 PID: 5332 Comm: syz-executor215 Not tainted 6.14.0-syzkaller-13423-ga8662bcd2ff1 #0 PREEMPT(full)
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2~bpo12+1 04/01/2014
RIP: 0010:relocate_file_extent_cluster+0xe7/0x1750 fs/btrfs/relocation.c:2971
Code: 00 74 08 (...)
RSP: 0018:ffffc9000d3375e0 EFLAGS: 00010203
RAX: 0000000000000045 RBX: 000000000000022c RCX: ffff888000562440
RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8880452db000
RBP: ffffc9000d337870 R08: ffffffff84089251 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: dffffc0000000000
R13: ffffffff9368a020 R14: 0000000000000394 R15: ffff8880452db000
FS: 000055558bc7b380(0000) GS:ffff88808c596000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000055a7a192e740 CR3: 0000000036e2e000 CR4: 0000000000352ef0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
<TASK>
relocate_block_group+0xa1e/0xd50 fs/btrfs/relocation.c:3657
btrfs_relocate_block_group+0x777/0xd80 fs/btrfs/relocation.c:4011
btrfs_relocate_chunk+0x12c/0x3b0 fs/btrfs/volumes.c:3511
__btrfs_balance+0x1a93/0x25e0 fs/btrfs/volumes.c:4292
btrfs_balance+0xbde/0x10c0 fs/btrfs/volumes.c:4669
btrfs_ioctl_balance+0x3f5/0x660 fs/btrfs/ioctl.c:3586
vfs_ioctl fs/ioctl.c:51 [inline]
__do_sys_ioctl fs/ioctl.c:906 [inline]
__se_sys_ioctl+0xf1/0x160 fs/ioctl.c:892
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0xf3/0x230 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7fb4ef537dd9
Code: 28 00 00 (...)
RSP: 002b:00007ffc55de5728 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 00007ffc55de5750 RCX: 00007fb4ef537dd9
RDX: 0000200000000440 RSI: 00000000c4009420 RDI: 0000000000000003
RBP: 0000000000000002 R08: 00007ffc55de54c6 R09: 00007ffc55de5770
R10: 0000000000000002 R11: 0000000000000246 R12: 0000000000000000
R13: 0000000000000000 R14: 431bde82d7b634db R15: 00007ffc55de5790
</TASK>
Modules linked in:
---[ end trace 0000000000000000 ]---
RIP: 0010:relocate_file_extent_cluster+0xe7/0x1750 fs/btrfs/relocation.c:2971
Code: 00 74 08 (...)
RSP: 0018:ffffc9000d3375e0 EFLAGS: 00010203
RAX: 0000000000000045 RBX: 000000000000022c RCX: ffff888000562440
RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8880452db000
RBP: ffffc9000d337870 R08: ffffffff84089251 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: dffffc0000000000
R13: ffffffff9368a020 R14: 0000000000000394 R15: ffff8880452db000
FS: 000055558bc7b380(0000) GS:ffff88808c596000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000055a7a192e740 CR3: 0000000036e2e000 CR4: 0000000000352ef0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
----------------
Code disassembly (best guess):
0: 00 74 08 48 add %dh,0x48(%rax,%rcx,1)
4: 89 df mov %ebx,%edi
6: e8 f8 36 24 fe call 0xfe243703
b: 48 89 9c 24 30 01 00 mov %rbx,0x130(%rsp)
12: 00
13: 4c 89 74 24 28 mov %r14,0x28(%rsp)
18: 4d 8b 76 10 mov 0x10(%r14),%r14
1c: 49 8d 9e 98 fe ff ff lea -0x168(%r14),%rbx
23: 48 89 d8 mov %rbx,%rax
26: 48 c1 e8 03 shr $0x3,%rax
* 2a: 42 80 3c 20 00 cmpb $0x0,(%rax,%r12,1) <-- trapping instruction
2f: 74 08 je 0x39
31: 48 89 df mov %rbx,%rdi
34: e8 ca 36 24 fe call 0xfe243703
39: 4c 8b 3b mov (%rbx),%r15
3c: 48 rex.W
3d: 8b .byte 0x8b
3e: 44 rex.R
3f: 24 .byte 0x24
So fix this by returning the error immediately.
Reported-by: syzbot+7481815bb47ef3e702e2@syzkaller.appspotmail.com
Link: https://lore.kernel.org/linux-btrfs/67f14ee9.050a0220.0a13.023e.GAE@google.com/
Fixes: b204e5c7d4dc ("btrfs: make btrfs_iget() return a btrfs inode instead")
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
There was a bug report about a NULL pointer dereference in
__btrfs_add_free_space_zoned() that ultimately happens because a
conversion from the default metadata profile DUP to a RAID1 profile on two
disks.
The stack trace has the following signature:
BTRFS error (device sdc): zoned: write pointer offset mismatch of zones in raid1 profile
BUG: kernel NULL pointer dereference, address: 0000000000000058
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 0 P4D 0
Oops: Oops: 0000 [#1] PREEMPT SMP NOPTI
RIP: 0010:__btrfs_add_free_space_zoned.isra.0+0x61/0x1a0
RSP: 0018:ffffa236b6f3f6d0 EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff96c8132f3400 RCX: 0000000000000001
RDX: 0000000010000000 RSI: 0000000000000000 RDI: ffff96c8132f3410
RBP: 0000000010000000 R08: 0000000000000003 R09: 0000000000000000
R10: 0000000000000000 R11: 00000000ffffffff R12: 0000000000000000
R13: ffff96c758f65a40 R14: 0000000000000001 R15: 000011aac0000000
FS: 00007fdab1cb2900(0000) GS:ffff96e60ca00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000058 CR3: 00000001a05ae000 CR4: 0000000000350ef0
Call Trace:
<TASK>
? __die_body.cold+0x19/0x27
? page_fault_oops+0x15c/0x2f0
? exc_page_fault+0x7e/0x180
? asm_exc_page_fault+0x26/0x30
? __btrfs_add_free_space_zoned.isra.0+0x61/0x1a0
btrfs_add_free_space_async_trimmed+0x34/0x40
btrfs_add_new_free_space+0x107/0x120
btrfs_make_block_group+0x104/0x2b0
btrfs_create_chunk+0x977/0xf20
btrfs_chunk_alloc+0x174/0x510
? srso_return_thunk+0x5/0x5f
btrfs_inc_block_group_ro+0x1b1/0x230
btrfs_relocate_block_group+0x9e/0x410
btrfs_relocate_chunk+0x3f/0x130
btrfs_balance+0x8ac/0x12b0
? srso_return_thunk+0x5/0x5f
? srso_return_thunk+0x5/0x5f
? __kmalloc_cache_noprof+0x14c/0x3e0
btrfs_ioctl+0x2686/0x2a80
? srso_return_thunk+0x5/0x5f
? ioctl_has_perm.constprop.0.isra.0+0xd2/0x120
__x64_sys_ioctl+0x97/0xc0
do_syscall_64+0x82/0x160
? srso_return_thunk+0x5/0x5f
? __memcg_slab_free_hook+0x11a/0x170
? srso_return_thunk+0x5/0x5f
? kmem_cache_free+0x3f0/0x450
? srso_return_thunk+0x5/0x5f
? srso_return_thunk+0x5/0x5f
? syscall_exit_to_user_mode+0x10/0x210
? srso_return_thunk+0x5/0x5f
? do_syscall_64+0x8e/0x160
? sysfs_emit+0xaf/0xc0
? srso_return_thunk+0x5/0x5f
? srso_return_thunk+0x5/0x5f
? seq_read_iter+0x207/0x460
? srso_return_thunk+0x5/0x5f
? vfs_read+0x29c/0x370
? srso_return_thunk+0x5/0x5f
? srso_return_thunk+0x5/0x5f
? syscall_exit_to_user_mode+0x10/0x210
? srso_return_thunk+0x5/0x5f
? do_syscall_64+0x8e/0x160
? srso_return_thunk+0x5/0x5f
? exc_page_fault+0x7e/0x180
entry_SYSCALL_64_after_hwframe+0x76/0x7e
RIP: 0033:0x7fdab1e0ca6d
RSP: 002b:00007ffeb2b60c80 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007fdab1e0ca6d
RDX: 00007ffeb2b60d80 RSI: 00000000c4009420 RDI: 0000000000000003
RBP: 00007ffeb2b60cd0 R08: 0000000000000000 R09: 0000000000000013
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 00007ffeb2b6343b R14: 00007ffeb2b60d80 R15: 0000000000000001
</TASK>
CR2: 0000000000000058
---[ end trace 0000000000000000 ]---
The 1st line is the most interesting here:
BTRFS error (device sdc): zoned: write pointer offset mismatch of zones in raid1 profile
When a RAID1 block-group is created and a write pointer mismatch between
the disks in the RAID set is detected, btrfs sets the alloc_offset to the
length of the block group marking it as full. Afterwards the code expects
that a balance operation will evacuate the data in this block-group and
repair the problems.
But before this is possible, the new space of this block-group will be
accounted in the free space cache. But in __btrfs_add_free_space_zoned()
it is being checked if it is a initial creation of a block group and if
not a reclaim decision will be made. But the decision if a block-group's
free space accounting is done for an initial creation depends on if the
size of the added free space is the whole length of the block-group and
the allocation offset is 0.
But as btrfs_load_block_group_zone_info() sets the allocation offset to
the zone capacity (i.e. marking the block-group as full) this initial
decision is not met, and the space_info pointer in the 'struct
btrfs_block_group' has not yet been assigned.
Fail creation of the block group and rely on manual user intervention to
re-balance the filesystem.
Afterwards the filesystem can be unmounted, mounted in degraded mode and
the missing device can be removed after a full balance of the filesystem.
Reported-by: 西木野羰基 <yanqiyu01@gmail.com>
Link: https://lore.kernel.org/linux-btrfs/CAB_b4sBhDe3tscz=duVyhc9hNE+gu=B8CrgLO152uMyanR8BEA@mail.gmail.com/
Fixes: b1934cd60695 ("btrfs: zoned: handle broken write pointer on zones")
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
After enabling large data folios for tests, I hit the ASSERT() inside
GET_SUBPAGE_BITMAP() where blocks_per_folio matches BITS_PER_LONG.
The ASSERT() itself is only based on the original subpage fs block size,
where we have at most 16 blocks per page, thus
"ASSERT(blocks_per_folio < BITS_PER_LONG)".
However the experimental large data folio support will set the max folio
order according to the BITS_PER_LONG, so we can have a case where a large
folio contains exactly BITS_PER_LONG blocks.
So the ASSERT() is too strict, change it to
"ASSERT(blocks_per_folio <= BITS_PER_LONG)" to avoid the false alert.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
[BUG]
When running btrfs/004 with 4K fs block size and 64K page size,
sometimes fsstress workload can take 100% CPU for a while, but not long
enough to trigger a 120s hang warning.
[CAUSE]
When such 100% CPU usage happens, btrfs_punch_hole_lock_range() is
always in the call trace.
One example when this problem happens, the function
btrfs_punch_hole_lock_range() got the following parameters:
lock_start = 4096, lockend = 20469
Then we calculate @page_lockstart by rounding up lock_start to page
boundary, which is 64K (page size is 64K).
For @page_lockend, we round down the value towards page boundary, which
result 0. Then since we need to pass an inclusive end to
filemap_range_has_page(), we subtract 1 from the rounded down value,
resulting in (u64)-1.
In the above case, the range is inside the same page, and we do not even
need to call filemap_range_has_page(), not to mention to call it with
(u64)-1 at the end.
This behavior will cause btrfs_punch_hole_lock_range() to busy loop
waiting for irrelevant range to have its pages dropped.
[FIX]
Calculate @page_lockend by just rounding down @lockend, without
decreasing the value by one. So @page_lockend will no longer overflow.
Then exit early if @page_lockend is no larger than @page_lockstart.
As it means either the range is inside the same page, or the two pages
are adjacent already.
Finally only decrease @page_lockend when calling filemap_range_has_page().
Fixes: 0528476b6ac7 ("btrfs: fix the filemap_range_has_page() call in btrfs_punch_hole_lock_range()")
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
subpage_calc_start_bit()
Inside the macro, subpage_calc_start_bit(), we need to calculate the
offset to the beginning of the folio.
But we're using offset_in_page(), on systems with 4K page size and 4K fs
block size, this means we will always return offset 0 for a large folio,
causing all kinds of errors.
Fix it by using offset_in_folio() instead.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux
Pull CRC cleanups from Eric Biggers:
"Finish cleaning up the CRC kconfig options by removing the remaining
unnecessary prompts and an unnecessary 'default y', removing
CONFIG_LIBCRC32C, and documenting all the CRC library options"
* tag 'crc-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux:
lib/crc: remove CONFIG_LIBCRC32C
lib/crc: document all the CRC library kconfig options
lib/crc: remove unnecessary prompt for CONFIG_CRC_ITU_T
lib/crc: remove unnecessary prompt for CONFIG_CRC_T10DIF
lib/crc: remove unnecessary prompt for CONFIG_CRC16
lib/crc: remove unnecessary prompt for CONFIG_CRC_CCITT
lib/crc: remove unnecessary prompt for CONFIG_CRC32 and drop 'default y'
|
|
timer_delete[_sync]() replaces del_timer[_sync](). Convert the whole tree
over and remove the historical wrapper inlines.
Conversion was done with coccinelle plus manual fixups where necessary.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
Now that LIBCRC32C does nothing besides select CRC32, make every option
that selects LIBCRC32C instead select CRC32 directly. Then remove
LIBCRC32C.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Martin K. Petersen" <martin.petersen@oracle.com>
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Link: https://lore.kernel.org/r/20250401221600.24878-8-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull non-MM updates from Andrew Morton:
- The series "powerpc/crash: use generic crashkernel reservation" from
Sourabh Jain changes powerpc's kexec code to use more of the generic
layers.
- The series "get_maintainer: report subsystem status separately" from
Vlastimil Babka makes some long-requested improvements to the
get_maintainer output.
- The series "ucount: Simplify refcounting with rcuref_t" from
Sebastian Siewior cleans up and optimizing the refcounting in the
ucount code.
- The series "reboot: support runtime configuration of emergency
hw_protection action" from Ahmad Fatoum improves the ability for a
driver to perform an emergency system shutdown or reboot.
- The series "Converge on using secs_to_jiffies() part two" from Easwar
Hariharan performs further migrations from msecs_to_jiffies() to
secs_to_jiffies().
- The series "lib/interval_tree: add some test cases and cleanup" from
Wei Yang permits more userspace testing of kernel library code, adds
some more tests and performs some cleanups.
- The series "hung_task: Dump the blocking task stacktrace" from Masami
Hiramatsu arranges for the hung_task detector to dump the stack of
the blocking task and not just that of the blocked task.
- The series "resource: Split and use DEFINE_RES*() macros" from Andy
Shevchenko provides some cleanups to the resource definition macros.
- Plus the usual shower of singleton patches - please see the
individual changelogs for details.
* tag 'mm-nonmm-stable-2025-03-30-18-23' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (77 commits)
mailmap: consolidate email addresses of Alexander Sverdlin
fs/procfs: fix the comment above proc_pid_wchan()
relay: use kasprintf() instead of fixed buffer formatting
resource: replace open coded variant of DEFINE_RES()
resource: replace open coded variants of DEFINE_RES_*_NAMED()
resource: replace open coded variant of DEFINE_RES_NAMED_DESC()
resource: split DEFINE_RES_NAMED_DESC() out of DEFINE_RES_NAMED()
samples: add hung_task detector mutex blocking sample
hung_task: show the blocker task if the task is hung on mutex
kexec_core: accept unaccepted kexec segments' destination addresses
watchdog/perf: optimize bytes copied and remove manual NUL-termination
lib/interval_tree: fix the comment of interval_tree_span_iter_next_gap()
lib/interval_tree: skip the check before go to the right subtree
lib/interval_tree: add test case for span iteration
lib/interval_tree: add test case for interval_tree_iter_xxx() helpers
lib/rbtree: add random seed
lib/rbtree: split tests
lib/rbtree: enable userland test suite for rbtree related data structure
checkpatch: describe --min-conf-desc-length
scripts/gdb/symbols: determine KASLR offset on s390
...
|
|
[BUG]
There is a syzbot report that the ASSERT() inside write_dev_supers() got
triggered:
assertion failed: folio_order(folio) == 0, in fs/btrfs/disk-io.c:3858
------------[ cut here ]------------
kernel BUG at fs/btrfs/disk-io.c:3858!
Oops: invalid opcode: 0000 [#1] SMP KASAN NOPTI
CPU: 0 UID: 0 PID: 6730 Comm: syz-executor378 Not tainted 6.14.0-syzkaller-03565-gf6e0150b2003 #0 PREEMPT(full)
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2~bpo12+1 04/01/2014
RIP: 0010:write_dev_supers fs/btrfs/disk-io.c:3858 [inline]
RIP: 0010:write_all_supers+0x400f/0x4090 fs/btrfs/disk-io.c:4155
Call Trace:
<TASK>
btrfs_commit_transaction+0x1eda/0x3750 fs/btrfs/transaction.c:2528
btrfs_quota_enable+0xfcc/0x21a0 fs/btrfs/qgroup.c:1226
btrfs_ioctl_quota_ctl+0x144/0x1c0 fs/btrfs/ioctl.c:3677
vfs_ioctl fs/ioctl.c:51 [inline]
__do_sys_ioctl fs/ioctl.c:906 [inline]
__se_sys_ioctl+0xf1/0x160 fs/ioctl.c:892
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0xf3/0x230 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7f5ad1f20289
</TASK>
---[ end trace 0000000000000000 ]---
[CAUSE]
Since commit f93ee0df5139 ("btrfs: convert super block writes to folio
in write_dev_supers()") and commit c94b7349b859 ("btrfs: convert super
block writes to folio in wait_dev_supers()"), the super block writeback
path is converted to use folio.
Since the original code is using page based interfaces, we have an
"ASSERT(folio_order(folio) == 0);" added to make sure everything is not
changed.
But the folio here is not from any btrfs inode, but from the block
device, and we have no control on the folio order in bdev, the device
can choose whatever folio size they want/need.
E.g. the bdev may even have a block size of multiple pages.
So the ASSERT() is triggered.
[FIX]
The super block writeback path has taken larger folios into
consideration, so there is no need for the ASSERT().
And since commit bc00965dbff7 ("btrfs: count super block write errors in
device instead of tracking folio error state"), the wait path no longer
checks the folio status but only wait for the folio writeback to finish,
there is nothing requiring the ASSERT() either.
So we can remove both ASSERT()s safely now.
Reported-by: syzbot+34122898a11ab689518a@syzkaller.appspotmail.com
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|