| Age | Commit message (Collapse) | Author | Files | Lines |
|
[ Upstream commit e9e3b22ddfa760762b696ac6417c8d6edd182e49 ]
[BUG]
For the following write sequence with 64K page size and 4K fs block size,
it will lead to file extent items to be inserted without any data
checksum:
mkfs.btrfs -s 4k -f $dev > /dev/null
mount $dev $mnt
xfs_io -f -c "pwrite 0 16k" -c "pwrite 32k 4k" -c pwrite "60k 64K" \
-c "truncate 16k" $mnt/foobar
umount $mnt
This will result the following 2 file extent items to be inserted (extra
trace point added to insert_ordered_extent_file_extent()):
btrfs_finish_one_ordered: root=5 ino=257 file_off=61440 num_bytes=4096 csum_bytes=0
btrfs_finish_one_ordered: root=5 ino=257 file_off=0 num_bytes=16384 csum_bytes=16384
Note for file offset 60K, we're inserting a file extent without any
data checksum.
Also note that range [32K, 36K) didn't reach
insert_ordered_extent_file_extent(), which is the correct behavior as
that OE is fully truncated, should not result any file extent.
Although file extent at 60K will be later dropped by btrfs_truncate(),
if the transaction got committed after file extent inserted but before
the file extent dropping, we will have a small window where we have a
file extent beyond EOF and without any data checksum.
That will cause "btrfs check" to report error.
[CAUSE]
The sequence happens like this:
- Buffered write dirtied the page cache and updated isize
Now the inode size is 64K, with the following page cache layout:
0 16K 32K 48K 64K
|/////////////| |//| |//|
- Truncate the inode to 16K
Which will trigger writeback through:
btrfs_setsize()
|- truncate_setsize()
| Now the inode size is set to 16K
|
|- btrfs_truncate()
|- btrfs_wait_ordered_range() for [16K, u64(-1)]
|- btrfs_fdatawrite_range() for [16K, u64(-1)}
|- extent_writepage() for folio 0
|- writepage_delalloc()
| Generated OE for [0, 16K), [32K, 36K] and [60K, 64K)
|
|- extent_writepage_io()
Then inside extent_writepage_io(), the dirty fs blocks are handled
differently:
- Submit write for range [0, 16K)
As they are still inside the inode size (16K).
- Mark OE [32K, 36K) as truncated
Since we only call btrfs_lookup_first_ordered_range() once, which
returned the first OE after file offset 16K.
- Mark all OEs inside range [16K, 64K) as finished
Which will mark OE ranges [32K, 36K) and [60K, 64K) as finished.
For OE [32K, 36K) since it's already marked as truncated, and its
truncated length is 0, no file extent will be inserted.
For OE [60K, 64K) it has never been submitted thus has no data
checksum, and we insert the file extent as usual.
This is the root cause of file extent at 60K to be inserted without
any data checksum.
- Clear dirty flags for range [16K, 64K)
It is the function btrfs_folio_clear_dirty() which searches and clears
any dirty blocks inside that range.
[FIX]
The bug itself was introduced a long time ago, way before subpage and
large folio support.
At that time, fs block size must match page size, thus the range
[cur, end) is just one fs block.
But later with subpage and large folios, the same range [cur, end)
can have multiple blocks and ordered extents.
Later commit 18de34daa7c6 ("btrfs: truncate ordered extent when skipping
writeback past i_size") was fixing a bug related to subpage/large
folios, but it's still utilizing the old range [cur, end), meaning only
the first OE will be marked as truncated.
The proper fix here is to make EOF handling block-by-block, not trying
to handle the whole range to @end.
By this we always locate and truncate the OE for every dirty block.
CC: stable@vger.kernel.org # 5.15+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 46a23908598f4b8e61483f04ea9f471b2affc58a ]
Instead of repeating the expression "start + len" multiple times, store it
in a variable and use it where needed.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Anand Jain <asj@kernel.org>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Stable-dep-of: e9e3b22ddfa7 ("btrfs: fix beyond-EOF write handling")
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 18de34daa7c62c830be533aace6b7c271e8e95cf ]
While running test case btrfs/192 from fstests with support for large
folios (needs CONFIG_BTRFS_EXPERIMENTAL=y) I ended up getting very sporadic
btrfs check failures reporting that csum items were missing. Looking into
the issue it turned out that btrfs check searches for csum items of a file
extent item with a range that spans beyond the i_size of a file and we
don't have any, because the kernel's writeback code skips submitting bios
for ranges beyond eof. It's not expected however to find a file extent item
that crosses the rounded up (by the sector size) i_size value, but there is
a short time window where we can end up with a transaction commit leaving
this small inconsistency between the i_size and the last file extent item.
Example btrfs check output when this happens:
$ btrfs check /dev/sdc
Opening filesystem to check...
Checking filesystem on /dev/sdc
UUID: 69642c61-5efb-4367-aa31-cdfd4067f713
[1/8] checking log skipped (none written)
[2/8] checking root items
[3/8] checking extents
[4/8] checking free space tree
[5/8] checking fs roots
root 5 inode 332 errors 1000, some csum missing
ERROR: errors found in fs roots
(...)
Looking at a tree dump of the fs tree (root 5) for inode 332 we have:
$ btrfs inspect-internal dump-tree -t 5 /dev/sdc
(...)
item 28 key (332 INODE_ITEM 0) itemoff 2006 itemsize 160
generation 17 transid 19 size 610969 nbytes 86016
block group 0 mode 100666 links 1 uid 0 gid 0 rdev 0
sequence 11 flags 0x0(none)
atime 1759851068.391327881 (2025-10-07 16:31:08)
ctime 1759851068.410098267 (2025-10-07 16:31:08)
mtime 1759851068.410098267 (2025-10-07 16:31:08)
otime 1759851068.391327881 (2025-10-07 16:31:08)
item 29 key (332 INODE_REF 340) itemoff 1993 itemsize 13
index 2 namelen 3 name: f1f
item 30 key (332 EXTENT_DATA 589824) itemoff 1940 itemsize 53
generation 19 type 1 (regular)
extent data disk byte 21745664 nr 65536
extent data offset 0 nr 65536 ram 65536
extent compression 0 (none)
(...)
We can see that the file extent item for file offset 589824 has a length of
64K and its number of bytes is 64K. Looking at the inode item we see that
its i_size is 610969 bytes which falls within the range of that file extent
item [589824, 655360[.
Looking into the csum tree:
$ btrfs inspect-internal dump-tree /dev/sdc
(...)
item 15 key (EXTENT_CSUM EXTENT_CSUM 21565440) itemoff 991 itemsize 200
range start 21565440 end 21770240 length 204800
item 16 key (EXTENT_CSUM EXTENT_CSUM 1104576512) itemoff 983 itemsize 8
range start 1104576512 end 1104584704 length 8192
(..)
We see that the csum item number 15 covers the first 24K of the file extent
item - it ends at offset 21770240 and the extent's disk_bytenr is 21745664,
so we have:
21770240 - 21745664 = 24K
We see that the next csum item (number 16) is completely outside the range,
so the remaining 40K of the extent doesn't have csum items in the tree.
If we round up the i_size to the sector size, we get:
round_up(610969, 4096) = 614400
If we subtract from that the file offset for the extent item we get:
614400 - 589824 = 24K
So the missing 40K corresponds to the end of the file extent item's range
minus the rounded up i_size:
655360 - 614400 = 40K
Normally we don't expect a file extent item to span over the rounded up
i_size of an inode, since when truncating, doing hole punching and other
operations that trim a file extent item, the number of bytes is adjusted.
There is however a short time window where the kernel can end up,
temporarily,persisting an inode with an i_size that falls in the middle of
the last file extent item and the file extent item was not yet trimmed (its
number of bytes reduced so that it doesn't cross i_size rounded up by the
sector size).
The steps (in the kernel) that lead to such scenario are the following:
1) We have inode I as an empty file, no allocated extents, i_size is 0;
2) A buffered write is done for file range [589824, 655360[ (length of
64K) and the i_size is updated to 655360. Note that we got a single
large folio for the range (64K);
3) A truncate operation starts that reduces the inode's i_size down to
610969 bytes. The truncate sets the inode's new i_size at
btrfs_setsize() by calling truncate_setsize() and before calling
btrfs_truncate();
4) At btrfs_truncate() we trigger writeback for the range starting at
610304 (which is the new i_size rounded down to the sector size) and
ending at (u64)-1;
5) During the writeback, at extent_write_cache_pages(), we get from the
call to filemap_get_folios_tag(), the 64K folio that starts at file
offset 589824 since it contains the start offset of the writeback
range (610304);
6) At writepage_delalloc() we find the whole range of the folio is dirty
and therefore we run delalloc for that 64K range ([589824, 655360[),
reserving a 64K extent, creating an ordered extent, etc;
7) At extent_writepage_io() we submit IO only for subrange [589824, 614400[
because the inode's i_size is 610969 bytes (rounded up by sector size
is 614400). There, in the while loop we intentionally skip IO beyond
i_size to avoid any unnecessay work and just call
btrfs_mark_ordered_io_finished() for the range [614400, 655360[ (which
has a 40K length);
8) Once the IO finishes we finish the ordered extent by ending up at
btrfs_finish_one_ordered(), join transaction N, insert a file extent
item in the inode's subvolume tree for file offset 589824 with a number
of bytes of 64K, and update the inode's delayed inode item or directly
the inode item with a call to btrfs_update_inode_fallback(), which
results in storing the new i_size of 610969 bytes;
9) Transaction N is committed either by the transaction kthread or some
other task committed it (in response to a sync or fsync for example).
At this point we have inode I persisted with an i_size of 610969 bytes
and file extent item that starts at file offset 589824 and has a number
of bytes of 64K, ending at an offset of 655360 which is beyond the
i_size rounded up to the sector size (614400).
--> So after a crash or power failure here, the btrfs check program
reports that error about missing checksum items for this inode, as
it tries to lookup for checksums covering the whole range of the
extent;
10) Only after transaction N is committed that at btrfs_truncate() the
call to btrfs_start_transaction() starts a new transaction, N + 1,
instead of joining transaction N. And it's with transaction N + 1 that
it calls btrfs_truncate_inode_items() which updates the file extent
item at file offset 589824 to reduce its number of bytes from 64K down
to 24K, so that the file extent item's range ends at the i_size
rounded up to the sector size (614400 bytes).
Fix this by truncating the ordered extent at extent_writepage_io() when we
skip writeback because the current offset in the folio is beyond i_size.
This ensures we don't ever persist a file extent item with a number of
bytes beyond the rounded up (by sector size) value of the i_size.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Anand Jain <asj@kernel.org>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Stable-dep-of: e9e3b22ddfa7 ("btrfs: fix beyond-EOF write handling")
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 7893cc12251f6f19e7689a4cf3ba803bddbd8437 ]
Sheng Yong reported [1] that Android APEX images didn't work with commit
072a7c7cdbea ("erofs: don't bother with s_stack_depth increasing for
now") because "EROFS-formatted APEX file images can be stored within an
EROFS-formatted Android system partition."
In response, I sent a quick fat-fingered [PATCH v3] to address the
report. Unfortunately, the updated condition was incorrect:
if (erofs_is_fileio_mode(sbi)) {
- sb->s_stack_depth =
- file_inode(sbi->dif0.file)->i_sb->s_stack_depth + 1;
- if (sb->s_stack_depth > FILESYSTEM_MAX_STACK_DEPTH) {
- erofs_err(sb, "maximum fs stacking depth exceeded");
+ inode = file_inode(sbi->dif0.file);
+ if ((inode->i_sb->s_op == &erofs_sops && !sb->s_bdev) ||
+ inode->i_sb->s_stack_depth) {
The condition `!sb->s_bdev` is always true for all file-backed EROFS
mounts, making the check effectively a no-op.
The real fix tested and confirmed by Sheng Yong [2] at that time was
[PATCH v3 RESEND], which correctly ensures the following EROFS^2 setup
works:
EROFS (on a block device) + EROFS (file-backed mount)
But sadly I screwed it up again by upstreaming the outdated [PATCH v3].
This patch applies the same logic as the delta between the upstream
[PATCH v3] and the real fix [PATCH v3 RESEND].
Reported-by: Sheng Yong <shengyong1@xiaomi.com>
Closes: https://lore.kernel.org/r/3acec686-4020-4609-aee4-5dae7b9b0093@gmail.com [1]
Fixes: 072a7c7cdbea ("erofs: don't bother with s_stack_depth increasing for now")
Link: https://lore.kernel.org/r/243f57b8-246f-47e7-9fb1-27a771e8e9e8@gmail.com [2]
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 072a7c7cdbea4f91df854ee2bb216256cd619f2a ]
Previously, commit d53cd891f0e4 ("erofs: limit the level of fs stacking
for file-backed mounts") bumped `s_stack_depth` by one to avoid kernel
stack overflow when stacking an unlimited number of EROFS on top of
each other.
This fix breaks composefs mounts, which need EROFS+ovl^2 sometimes
(and such setups are already used in production for quite a long time).
One way to fix this regression is to bump FILESYSTEM_MAX_STACK_DEPTH
from 2 to 3, but proving that this is safe in general is a high bar.
After a long discussion on GitHub issues [1] about possible solutions,
one conclusion is that there is no need to support nesting file-backed
EROFS mounts on stacked filesystems, because there is always the option
to use loopback devices as a fallback.
As a quick fix for the composefs regression for this cycle, instead of
bumping `s_stack_depth` for file backed EROFS mounts, we disallow
nesting file-backed EROFS over EROFS and over filesystems with
`s_stack_depth` > 0.
This works for all known file-backed mount use cases (composefs,
containerd, and Android APEX for some Android vendors), and the fix is
self-contained.
Essentially, we are allowing one extra unaccounted fs stacking level of
EROFS below stacking filesystems, but EROFS can only be used in the read
path (i.e. overlayfs lower layers), which typically has much lower stack
usage than the write path.
We can consider increasing FILESYSTEM_MAX_STACK_DEPTH later, after more
stack usage analysis or using alternative approaches, such as splitting
the `s_stack_depth` limitation according to different combinations of
stacking.
Fixes: d53cd891f0e4 ("erofs: limit the level of fs stacking for file-backed mounts")
Reported-and-tested-by: Dusty Mabe <dusty@dustymabe.com>
Reported-by: Timothée Ravier <tim@siosm.fr>
Closes: https://github.com/coreos/fedora-coreos-tracker/issues/2087 [1]
Reported-by: "Alekséi Naidénov" <an@digitaltide.io>
Closes: https://lore.kernel.org/r/CAFHtUiYv4+=+JP_-JjARWjo6OwcvBj1wtYN=z0QXwCpec9sXtg@mail.gmail.com
Acked-by: Amir Goldstein <amir73il@gmail.com>
Acked-by: Alexander Larsson <alexl@redhat.com>
Reviewed-and-tested-by: Sheng Yong <shengyong1@xiaomi.com>
Reviewed-by: Zhiguo Niu <zhiguo.niu@unisoc.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 530e3d4af566ca44807d79359b90794dea24c4f3 ]
Coverity reported a NULL pointer dereference issue (CID 1666756) in
do_abort_log_replay(). When btrfs_alloc_path() fails in
replay_one_buffer(), wc->subvol_path is NULL, but btrfs_abort_log_replay()
calls do_abort_log_replay() which unconditionally dereferences
wc->subvol_path when attempting to print debug information. Fix this by
adding a NULL check before dereferencing wc->subvol_path in
do_abort_log_replay().
Fixes: 2753e4917624 ("btrfs: dump detailed info and specific messages on log replay failures")
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Suchit Karunakaran <suchitkarunakaran@gmail.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 30bcf4e824aa37d305502f52e1527c7b1eabef3d ]
[BUG]
Since the introduction of btrfs bs < ps support, v1 cache was never on
the plan due to its hard coded PAGE_SIZE usage, and the future plan to
properly deprecate it.
However for bs < ps cases, even if 'nospace_cache,clear_cache' mount
option is specified, it's never respected and free space tree is always
enabled:
mkfs.btrfs -f -O ^bgt,fst $dev
mount $dev $mnt -o clear_cache,nospace_cache
umount $mnt
btrfs ins dump-super $dev
...
compat_ro_flags 0x3
( FREE_SPACE_TREE |
FREE_SPACE_TREE_VALID )
...
This means a different behavior compared to bs >= ps cases.
[CAUSE]
The forcing usage of v2 space cache is done inside
btrfs_set_free_space_cache_settings(), however it never checks if we're
even using space cache but always enabling v2 cache.
[FIX]
Instead unconditionally enable v2 cache, only forcing v2 cache if the
old v1 cache is required.
Now v2 space cache can be properly disabled on bs < ps cases:
mkfs.btrfs -f -O ^bgt,fst $dev
mount $dev $mnt -o clear_cache,nospace_cache
umount $mnt
btrfs ins dump-super $dev
...
compat_ro_flags 0x0
...
Fixes: 9f73f1aef98b ("btrfs: force v2 space cache usage for subpage mount")
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 8731f2c50b0b1d2b58ed5b9671ef2c4bdc2f8347 ]
In btrfs_read_locked_inode() we are calling btrfs_init_file_extent_tree()
while holding a path with a read locked leaf from a subvolume tree, and
btrfs_init_file_extent_tree() may do a GFP_KERNEL allocation, which can
trigger reclaim.
This can create a circular lock dependency which lockdep warns about with
the following splat:
[6.1433] ======================================================
[6.1574] WARNING: possible circular locking dependency detected
[6.1583] 6.18.0+ #4 Tainted: G U
[6.1591] ------------------------------------------------------
[6.1599] kswapd0/117 is trying to acquire lock:
[6.1606] ffff8d9b6333c5b8 (&delayed_node->mutex){+.+.}-{3:3}, at: __btrfs_release_delayed_node.part.0+0x39/0x2f0
[6.1625]
but task is already holding lock:
[6.1633] ffffffffa4ab8ce0 (fs_reclaim){+.+.}-{0:0}, at: balance_pgdat+0x195/0xc60
[6.1646]
which lock already depends on the new lock.
[6.1657]
the existing dependency chain (in reverse order) is:
[6.1667]
-> #2 (fs_reclaim){+.+.}-{0:0}:
[6.1677] fs_reclaim_acquire+0x9d/0xd0
[6.1685] __kmalloc_cache_noprof+0x59/0x750
[6.1694] btrfs_init_file_extent_tree+0x90/0x100
[6.1702] btrfs_read_locked_inode+0xc3/0x6b0
[6.1710] btrfs_iget+0xbb/0xf0
[6.1716] btrfs_lookup_dentry+0x3c5/0x8e0
[6.1724] btrfs_lookup+0x12/0x30
[6.1731] lookup_open.isra.0+0x1aa/0x6a0
[6.1739] path_openat+0x5f7/0xc60
[6.1746] do_filp_open+0xd6/0x180
[6.1753] do_sys_openat2+0x8b/0xe0
[6.1760] __x64_sys_openat+0x54/0xa0
[6.1768] do_syscall_64+0x97/0x3e0
[6.1776] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[6.1784]
-> #1 (btrfs-tree-00){++++}-{3:3}:
[6.1794] lock_release+0x127/0x2a0
[6.1801] up_read+0x1b/0x30
[6.1808] btrfs_search_slot+0x8e0/0xff0
[6.1817] btrfs_lookup_inode+0x52/0xd0
[6.1825] __btrfs_update_delayed_inode+0x73/0x520
[6.1833] btrfs_commit_inode_delayed_inode+0x11a/0x120
[6.1842] btrfs_log_inode+0x608/0x1aa0
[6.1849] btrfs_log_inode_parent+0x249/0xf80
[6.1857] btrfs_log_dentry_safe+0x3e/0x60
[6.1865] btrfs_sync_file+0x431/0x690
[6.1872] do_fsync+0x39/0x80
[6.1879] __x64_sys_fsync+0x13/0x20
[6.1887] do_syscall_64+0x97/0x3e0
[6.1894] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[6.1903]
-> #0 (&delayed_node->mutex){+.+.}-{3:3}:
[6.1913] __lock_acquire+0x15e9/0x2820
[6.1920] lock_acquire+0xc9/0x2d0
[6.1927] __mutex_lock+0xcc/0x10a0
[6.1934] __btrfs_release_delayed_node.part.0+0x39/0x2f0
[6.1944] btrfs_evict_inode+0x20b/0x4b0
[6.1952] evict+0x15a/0x2f0
[6.1958] prune_icache_sb+0x91/0xd0
[6.1966] super_cache_scan+0x150/0x1d0
[6.1974] do_shrink_slab+0x155/0x6f0
[6.1981] shrink_slab+0x48e/0x890
[6.1988] shrink_one+0x11a/0x1f0
[6.1995] shrink_node+0xbfd/0x1320
[6.1002] balance_pgdat+0x67f/0xc60
[6.1321] kswapd+0x1dc/0x3e0
[6.1643] kthread+0xff/0x240
[6.1965] ret_from_fork+0x223/0x280
[6.1287] ret_from_fork_asm+0x1a/0x30
[6.1616]
other info that might help us debug this:
[6.1561] Chain exists of:
&delayed_node->mutex --> btrfs-tree-00 --> fs_reclaim
[6.1503] Possible unsafe locking scenario:
[6.1110] CPU0 CPU1
[6.1411] ---- ----
[6.1707] lock(fs_reclaim);
[6.1998] lock(btrfs-tree-00);
[6.1291] lock(fs_reclaim);
[6.1581] lock(&delayed_node->mutex);
[6.1874]
*** DEADLOCK ***
[6.1716] 2 locks held by kswapd0/117:
[6.1999] #0: ffffffffa4ab8ce0 (fs_reclaim){+.+.}-{0:0}, at: balance_pgdat+0x195/0xc60
[6.1294] #1: ffff8d998344b0e0 (&type->s_umount_key#40){++++}- {3:3}, at: super_cache_scan+0x37/0x1d0
[6.1596]
stack backtrace:
[6.1183] CPU: 11 UID: 0 PID: 117 Comm: kswapd0 Tainted: G U 6.18.0+ #4 PREEMPT(lazy)
[6.1185] Tainted: [U]=USER
[6.1186] Hardware name: ASUS System Product Name/PRIME B560M-A AC, BIOS 2001 02/01/2023
[6.1187] Call Trace:
[6.1187] <TASK>
[6.1189] dump_stack_lvl+0x6e/0xa0
[6.1192] print_circular_bug.cold+0x17a/0x1c0
[6.1194] check_noncircular+0x175/0x190
[6.1197] __lock_acquire+0x15e9/0x2820
[6.1200] lock_acquire+0xc9/0x2d0
[6.1201] ? __btrfs_release_delayed_node.part.0+0x39/0x2f0
[6.1204] __mutex_lock+0xcc/0x10a0
[6.1206] ? __btrfs_release_delayed_node.part.0+0x39/0x2f0
[6.1208] ? __btrfs_release_delayed_node.part.0+0x39/0x2f0
[6.1211] ? __btrfs_release_delayed_node.part.0+0x39/0x2f0
[6.1213] __btrfs_release_delayed_node.part.0+0x39/0x2f0
[6.1215] btrfs_evict_inode+0x20b/0x4b0
[6.1217] ? lock_acquire+0xc9/0x2d0
[6.1220] evict+0x15a/0x2f0
[6.1222] prune_icache_sb+0x91/0xd0
[6.1224] super_cache_scan+0x150/0x1d0
[6.1226] do_shrink_slab+0x155/0x6f0
[6.1228] shrink_slab+0x48e/0x890
[6.1229] ? shrink_slab+0x2d2/0x890
[6.1231] shrink_one+0x11a/0x1f0
[6.1234] shrink_node+0xbfd/0x1320
[6.1236] ? shrink_node+0xa2d/0x1320
[6.1236] ? shrink_node+0xbd3/0x1320
[6.1239] ? balance_pgdat+0x67f/0xc60
[6.1239] balance_pgdat+0x67f/0xc60
[6.1241] ? finish_task_switch.isra.0+0xc4/0x2a0
[6.1246] kswapd+0x1dc/0x3e0
[6.1247] ? __pfx_autoremove_wake_function+0x10/0x10
[6.1249] ? __pfx_kswapd+0x10/0x10
[6.1250] kthread+0xff/0x240
[6.1251] ? __pfx_kthread+0x10/0x10
[6.1253] ret_from_fork+0x223/0x280
[6.1255] ? __pfx_kthread+0x10/0x10
[6.1257] ret_from_fork_asm+0x1a/0x30
[6.1260] </TASK>
This is because:
1) The fsync task is holding an inode's delayed node mutex (for a
directory) while calling __btrfs_update_delayed_inode() and that needs
to do a search on the subvolume's btree (therefore read lock some
extent buffers);
2) The lookup task, at btrfs_lookup(), triggered reclaim with the
GFP_KERNEL allocation done by btrfs_init_file_extent_tree() while
holding a read lock on a subvolume leaf;
3) The reclaim triggered kswapd which is doing inode eviction for the
directory inode the fsync task is using as an argument to
btrfs_commit_inode_delayed_inode() - but in that call chain we are
trying to read lock the same leaf that the lookup task is holding
while calling btrfs_init_file_extent_tree() and doing the GFP_KERNEL
allocation.
Fix this by calling btrfs_init_file_extent_tree() after we don't need the
path anymore and release it in btrfs_read_locked_inode().
Reported-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Link: https://lore.kernel.org/linux-btrfs/6e55113a22347c3925458a5d840a18401a38b276.camel@linux.intel.com/
Fixes: 8679d2687c35 ("btrfs: initialize inode::file_extent_tree after i_mode has been set")
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 570ad253a3455a520f03c2136af8714bc780186d ]
The read result collection for buffered reads seems to run ahead of the
completion of subrequests under some circumstances, as can be seen in the
following log snippet:
9p_client_res: client 18446612686390831168 response P9_TREAD tag 0 err 0
...
netfs_sreq: R=00001b55[1] DOWN TERM f=192 s=0 5fb2/5fb2 s=5 e=0
...
netfs_collect_folio: R=00001b55 ix=00004 r=4000-5000 t=4000/5fb2
netfs_folio: i=157f3 ix=00004-00004 read-done
netfs_folio: i=157f3 ix=00004-00004 read-unlock
netfs_collect_folio: R=00001b55 ix=00005 r=5000-5fb2 t=5000/5fb2
netfs_folio: i=157f3 ix=00005-00005 read-done
netfs_folio: i=157f3 ix=00005-00005 read-unlock
...
netfs_collect_stream: R=00001b55[0:] cto=5fb2 frn=ffffffff
netfs_collect_state: R=00001b55 col=5fb2 cln=6000 n=c
netfs_collect_stream: R=00001b55[0:] cto=5fb2 frn=ffffffff
netfs_collect_state: R=00001b55 col=5fb2 cln=6000 n=8
...
netfs_sreq: R=00001b55[2] ZERO SUBMT f=000 s=5fb2 0/4e s=0 e=0
netfs_sreq: R=00001b55[2] ZERO TERM f=102 s=5fb2 4e/4e s=5 e=0
The 'cto=5fb2' indicates the collected file pos we've collected results to
so far - but we still have 0x4e more bytes to go - so we shouldn't have
collected folio ix=00005 yet. The 'ZERO' subreq that clears the tail
happens after we unlock the folio, allowing the application to see the
uncleared tail through mmap.
The problem is that netfs_read_unlock_folios() will unlock a folio in which
the amount of read results collected hits EOF position - but the ZERO
subreq lies beyond that and so happens after.
Fix this by changing the end check to always be the end of the folio and
never the end of the file.
In the future, I should look at clearing to the end of the folio here rather
than adding a ZERO subreq to do this. On the other hand, the ZERO subreq can
run in parallel with an async READ subreq. Further, the ZERO subreq may still
be necessary to, say, handle extents in a ceph file that don't have any
backing store and are thus implicitly all zeros.
This can be reproduced by creating a file, the size of which doesn't align
to a page boundary, e.g. 24998 (0x5fb2) bytes and then doing something
like:
xfs_io -c "mmap -r 0 0x6000" -c "madvise -d 0 0x6000" \
-c "mread -v 0 0x6000" /xfstest.test/x
The last 0x4e bytes should all be 00, but if the tail hasn't been cleared
yet, you may see rubbish there. This can be reproduced with kafs by
modifying the kernel to disable the call to netfs_read_subreq_progress()
and to stop afs_issue_read() from doing the async call for NETFS_READAHEAD.
Reproduction can be made easier by inserting an mdelay(100) in
netfs_issue_read() for the ZERO-subreq case.
AFS and CIFS are normally unlikely to show this as they dispatch READ ops
asynchronously, which allows the ZERO-subreq to finish first. 9P's READ op is
completely synchronous, so the ZERO-subreq will always happen after. It isn't
seen all the time, though, because the collection may be done in a worker
thread.
Reported-by: Christian Schoenebeck <linux_oss@crudebyte.com>
Link: https://lore.kernel.org/r/8622834.T7Z3S40VBb@weasel/
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://patch.msgid.link/938162.1766233900@warthog.procyon.org.uk
Fixes: e2d46f2ec332 ("netfs: Change the read result collector to only use one work item")
Tested-by: Christian Schoenebeck <linux_oss@crudebyte.com>
Acked-by: Dominique Martinet <asmadeus@codewreck.org>
Suggested-by: Dominique Martinet <asmadeus@codewreck.org>
cc: Dominique Martinet <asmadeus@codewreck.org>
cc: Christian Schoenebeck <linux_oss@crudebyte.com>
cc: v9fs@lists.linux.dev
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 83f59076a1ae6f5c6845d6f7ed3a1a373d883684 ]
Previously, btrfs_get_or_create_delayed_node() set the delayed_node's
refcount before acquiring the root->delayed_nodes lock.
Commit e8513c012de7 ("btrfs: implement ref_tracker for delayed_nodes")
moved refcount_set inside the critical section, which means there is
no longer a memory barrier between setting the refcount and setting
btrfs_inode->delayed_node.
Without that barrier, the stores to node->refs and
btrfs_inode->delayed_node may become visible out of order. Another
thread can then read btrfs_inode->delayed_node and attempt to
increment a refcount that hasn't been set yet, leading to a
refcounting bug and a use-after-free warning.
The fix is to move refcount_set back to where it was to take
advantage of the implicit memory barrier provided by lock
acquisition.
Because the allocations now happen outside of the lock's critical
section, they can use GFP_NOFS instead of GFP_ATOMIC.
Reported-by: kernel test robot <oliver.sang@intel.com>
Closes: https://lore.kernel.org/oe-lkp/202511262228.6dda231e-lkp@intel.com
Fixes: e8513c012de7 ("btrfs: implement ref_tracker for delayed_nodes")
Tested-by: kernel test robot <oliver.sang@intel.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Leo Martins <loemra.dev@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 68d4b3fa18d72b7f649e83012e7e08f1881f6b75 ]
[BUG]
There is a bug that if a subvolume has multi-level parent qgroups, and
is able to do a quick inherit, only the direct parent qgroup got
updated:
mkfs.btrfs -f -O quota $dev
mount $dev $mnt
btrfs subv create $mnt/subv1
btrfs qgroup create 1/100 $mnt
btrfs qgroup create 2/100 $mnt
btrfs qgroup assign 1/100 2/100 $mnt
btrfs qgroup assign 0/256 1/100 $mnt
btrfs qgroup show -p --sync $mnt
Qgroupid Referenced Exclusive Parent Path
-------- ---------- --------- ------ ----
0/5 16.00KiB 16.00KiB - <toplevel>
0/256 16.00KiB 16.00KiB 1/100 subv1
1/100 16.00KiB 16.00KiB 2/100 2/100<1 member qgroup>
2/100 16.00KiB 16.00KiB - <0 member qgroups>
btrfs subv snap -i 1/100 $mnt/subv1 $mnt/snap1
btrfs qgroup show -p --sync $mnt
Qgroupid Referenced Exclusive Parent Path
-------- ---------- --------- ------ ----
0/5 16.00KiB 16.00KiB - <toplevel>
0/256 16.00KiB 16.00KiB 1/100 subv1
0/257 16.00KiB 16.00KiB 1/100 snap1
1/100 32.00KiB 32.00KiB 2/100 2/100<1 member qgroup>
2/100 16.00KiB 16.00KiB - <0 member qgroups>
# Note that 2/100 is not updated, and qgroup numbers are inconsistent
umount $mnt
[CAUSE]
If the snapshot source subvolume belongs to a parent qgroup, and the new
snapshot target is also added to the new same parent qgroup, we allow a
quick update without marking qgroup inconsistent.
But that quick update only update the parent qgroup, without checking if
there is any more parent qgroups.
[FIX]
Iterate through all parent qgroups during the quick inherit.
Reported-by: Boris Burkov <boris@bur.io>
Fixes: b20fe56cd285 ("btrfs: qgroup: allow quick inherit if snapshot is created and added to the same parent")
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 7ee19a59a75e3d5b9ec00499b86af8e2a46fbe86 ]
qgroup_snapshot_quick_inherit() detects conditions where the snapshot
destination would land in the same parent qgroup as the snapshot source
subvolume. In this case we can avoid costly qgroup calculations and just
add the nodesize of the new snapshot to the parent.
However, in the case of squotas this is actually a double count, and
also an undercount for deeper qgroup nestings.
The following annotated script shows the issue:
btrfs quota enable --simple "$mnt"
# Create 2-level qgroup hierarchy
btrfs qgroup create 2/100 "$mnt" # Q2 (level 2)
btrfs qgroup create 1/100 "$mnt" # Q1 (level 1)
btrfs qgroup assign 1/100 2/100 "$mnt"
# Create base subvolume
btrfs subvolume create "$mnt/base" >/dev/null
base_id=$(btrfs subvolume show "$mnt/base" | grep 'Subvolume ID:' | awk '{print $3}')
# Create intermediate snapshot and add to Q1
btrfs subvolume snapshot "$mnt/base" "$mnt/intermediate" >/dev/null
inter_id=$(btrfs subvolume show "$mnt/intermediate" | grep 'Subvolume ID:' | awk '{print $3}')
btrfs qgroup assign "0/$inter_id" 1/100 "$mnt"
# Create working snapshot with --inherit (auto-adds to Q1)
# src=intermediate (in only Q1)
# dst=snap (inheriting only into Q1)
# This double counts the 16k nodesize of the snapshot in Q1, and
# undercounts it in Q2.
btrfs subvolume snapshot -i 1/100 "$mnt/intermediate" "$mnt/snap" >/dev/null
snap_id=$(btrfs subvolume show "$mnt/snap" | grep 'Subvolume ID:' | awk '{print $3}')
# Fully complete snapshot creation
sync
# Delete working snapshot
# Q1 and Q2 will lose the full snap usage
btrfs subvolume delete "$mnt/snap" >/dev/null
# Delete intermediate and remove from Q1
# Q1 and Q2 will lose the full intermediate usage
btrfs qgroup remove "0/$inter_id" 1/100 "$mnt"
btrfs subvolume delete "$mnt/intermediate" >/dev/null
# Q1 should be at 0, but still has 16k. Q2 is "correct" at 0 (for now...)
# Trigger cleaner, wait for deletions
mount -o remount,sync=1 "$mnt"
btrfs subvolume sync "$mnt" "$snap_id"
btrfs subvolume sync "$mnt" "$inter_id"
# Remove Q1 from Q2
# Frees 16k more from Q2, underflowing it to 16EiB
btrfs qgroup remove 1/100 2/100 "$mnt"
# And show the bad state:
btrfs qgroup show -pc "$mnt"
Qgroupid Referenced Exclusive Parent Child Path
-------- ---------- --------- ------ ----- ----
0/5 16.00KiB 16.00KiB - - <toplevel>
0/256 16.00KiB 16.00KiB - - base
1/100 16.00KiB 16.00KiB - - <0 member qgroups>
2/100 16.00EiB 16.00EiB - - <0 member qgroups>
Fix this by simply not doing this quick inheritance with squotas.
I suspect that it is also wrong in normal qgroups to not recurse up the
qgroup tree in the quick inherit case, though other consistency checks
will likely fix it anyway.
Fixes: b20fe56cd285 ("btrfs: qgroup: allow quick inherit if snapshot is created and added to the same parent")
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit a1237c203f1757480dc2f3b930608ee00072d3cc ]
This was reported by the KUnit tests in the later patches.
See MS-ERREF 2.3.1 STATUS_NO_DATA_DETECTED. Keep it consistent with the
value in the documentation.
Signed-off-by: ChenXiaoSong <chenxiaosong@kylinos.cn>
Acked-by: Paulo Alcantara (Red Hat) <pc@manguebit.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit b2b50fca34da5ec231008edba798ddf92986bd7f ]
This was reported by the KUnit tests in the later patches.
See MS-ERREF 2.3.1 STATUS_DEVICE_DOOR_OPEN. Keep it consistent with the
value in the documentation.
Signed-off-by: ChenXiaoSong <chenxiaosong@kylinos.cn>
Acked-by: Paulo Alcantara (Red Hat) <pc@manguebit.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 9f99caa8950a76f560a90074e3a4b93cfa8b3d84 ]
This was reported by the KUnit tests in the later patches.
See MS-ERREF 2.3.1 STATUS_UNABLE_TO_FREE_VM. Keep it consistent with the
value in the documentation.
Signed-off-by: ChenXiaoSong <chenxiaosong@kylinos.cn>
Acked-by: Paulo Alcantara (Red Hat) <pc@manguebit.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit a2a8fc27dd668e7562b5326b5ed2f1604cb1e2e9 ]
When automounting, the fs_context should be fixed up to use the cred
from the parent filesystem, since the operation is just extending the
namespace. Authorisation to enter that namespace will already have been
provided by the preceding lookup.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 2e47c3cc64b44b0b06cd68c2801db92ff143f2b2 ]
We have observed an NFSv4 client receiving a LOCK reply with a status of
NFS4ERR_OLD_STATEID and subsequently retrying the LOCK request with an
earlier seqid value in the stateid. As this was for a new lockowner,
that would imply that nfs_set_open_stateid_locked() had updated the open
stateid seqid with an earlier value.
Looking at nfs_set_open_stateid_locked(), if the incoming seqid is out
of sequence, the task will sleep on the state->waitq for up to 5
seconds. If the task waits for the full 5 seconds, then after finishing
the wait it'll update the open stateid seqid with whatever value the
incoming seqid has. If there are multiple waiters in this scenario,
then the last one to perform said update may not be the one with the
highest seqid.
Add a check to ensure that the seqid can only be incremented, and add a
tracepoint to indicate when old seqids are skipped.
Signed-off-by: Scott Mayhew <smayhew@redhat.com>
Reviewed-by: Benjamin Coddington <bcodding@hammerspace.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
commit 7ba0b6461bc4edb3005ea6e00cdae189bcf908a5 upstream.
After rename exchanging (either with the rename exchange operation or
regular renames in multiple non-atomic steps) two inodes and at least
one of them is a directory, we can end up with a log tree that contains
only of the inodes and after a power failure that can result in an attempt
to delete the other inode when it should not because it was not deleted
before the power failure. In some case that delete attempt fails when
the target inode is a directory that contains a subvolume inside it, since
the log replay code is not prepared to deal with directory entries that
point to root items (only inode items).
1) We have directories "dir1" (inode A) and "dir2" (inode B) under the
same parent directory;
2) We have a file (inode C) under directory "dir1" (inode A);
3) We have a subvolume inside directory "dir2" (inode B);
4) All these inodes were persisted in a past transaction and we are
currently at transaction N;
5) We rename the file (inode C), so at btrfs_log_new_name() we update
inode C's last_unlink_trans to N;
6) We get a rename exchange for "dir1" (inode A) and "dir2" (inode B),
so after the exchange "dir1" is inode B and "dir2" is inode A.
During the rename exchange we call btrfs_log_new_name() for inodes
A and B, but because they are directories, we don't update their
last_unlink_trans to N;
7) An fsync against the file (inode C) is done, and because its inode
has a last_unlink_trans with a value of N we log its parent directory
(inode A) (through btrfs_log_all_parents(), called from
btrfs_log_inode_parent()).
8) So we end up with inode B not logged, which now has the old name
of inode A. At copy_inode_items_to_log(), when logging inode A, we
did not check if we had any conflicting inode to log because inode
A has a generation lower than the current transaction (created in
a past transaction);
9) After a power failure, when replaying the log tree, since we find that
inode A has a new name that conflicts with the name of inode B in the
fs tree, we attempt to delete inode B... this is wrong since that
directory was never deleted before the power failure, and because there
is a subvolume inside that directory, attempting to delete it will fail
since replay_dir_deletes() and btrfs_unlink_inode() are not prepared
to deal with dir items that point to roots instead of inodes.
When that happens the mount fails and we get a stack trace like the
following:
[87.2314] BTRFS info (device dm-0): start tree-log replay
[87.2318] BTRFS critical (device dm-0): failed to delete reference to subvol, root 5 inode 256 parent 259
[87.2332] ------------[ cut here ]------------
[87.2338] BTRFS: Transaction aborted (error -2)
[87.2346] WARNING: CPU: 1 PID: 638968 at fs/btrfs/inode.c:4345 __btrfs_unlink_inode+0x416/0x440 [btrfs]
[87.2368] Modules linked in: btrfs loop dm_thin_pool (...)
[87.2470] CPU: 1 UID: 0 PID: 638968 Comm: mount Tainted: G W 6.18.0-rc7-btrfs-next-218+ #2 PREEMPT(full)
[87.2489] Tainted: [W]=WARN
[87.2494] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-0-gea1b7a073390-prebuilt.qemu.org 04/01/2014
[87.2514] RIP: 0010:__btrfs_unlink_inode+0x416/0x440 [btrfs]
[87.2538] Code: c0 89 04 24 (...)
[87.2568] RSP: 0018:ffffc0e741f4b9b8 EFLAGS: 00010286
[87.2574] RAX: 0000000000000000 RBX: ffff9d3ec8a6cf60 RCX: 0000000000000000
[87.2582] RDX: 0000000000000002 RSI: ffffffff84ab45a1 RDI: 00000000ffffffff
[87.2591] RBP: ffff9d3ec8a6ef20 R08: 0000000000000000 R09: ffffc0e741f4b840
[87.2599] R10: ffff9d45dc1fffa8 R11: 0000000000000003 R12: ffff9d3ee26d77e0
[87.2608] R13: ffffc0e741f4ba98 R14: ffff9d4458040800 R15: ffff9d44b6b7ca10
[87.2618] FS: 00007f7b9603a840(0000) GS:ffff9d4658982000(0000) knlGS:0000000000000000
[87.2629] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[87.2637] CR2: 00007ffc9ec33b98 CR3: 000000011273e003 CR4: 0000000000370ef0
[87.2648] Call Trace:
[87.2651] <TASK>
[87.2654] btrfs_unlink_inode+0x15/0x40 [btrfs]
[87.2661] unlink_inode_for_log_replay+0x27/0xf0 [btrfs]
[87.2669] check_item_in_log+0x1ea/0x2c0 [btrfs]
[87.2676] replay_dir_deletes+0x16b/0x380 [btrfs]
[87.2684] fixup_inode_link_count+0x34b/0x370 [btrfs]
[87.2696] fixup_inode_link_counts+0x41/0x160 [btrfs]
[87.2703] btrfs_recover_log_trees+0x1ff/0x7c0 [btrfs]
[87.2711] ? __pfx_replay_one_buffer+0x10/0x10 [btrfs]
[87.2719] open_ctree+0x10bb/0x15f0 [btrfs]
[87.2726] btrfs_get_tree.cold+0xb/0x16c [btrfs]
[87.2734] ? fscontext_read+0x15c/0x180
[87.2740] ? rw_verify_area+0x50/0x180
[87.2746] vfs_get_tree+0x25/0xd0
[87.2750] vfs_cmd_create+0x59/0xe0
[87.2755] __do_sys_fsconfig+0x4f6/0x6b0
[87.2760] do_syscall_64+0x50/0x1220
[87.2764] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[87.2770] RIP: 0033:0x7f7b9625f4aa
[87.2775] Code: 73 01 c3 48 (...)
[87.2803] RSP: 002b:00007ffc9ec35b08 EFLAGS: 00000246 ORIG_RAX: 00000000000001af
[87.2817] RAX: ffffffffffffffda RBX: 0000558bfa91ac20 RCX: 00007f7b9625f4aa
[87.2829] RDX: 0000000000000000 RSI: 0000000000000006 RDI: 0000000000000003
[87.2842] RBP: 0000558bfa91b120 R08: 0000000000000000 R09: 0000000000000000
[87.2854] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[87.2864] R13: 00007f7b963f1580 R14: 00007f7b963f326c R15: 00007f7b963d8a23
[87.2877] </TASK>
[87.2882] ---[ end trace 0000000000000000 ]---
[87.2891] BTRFS: error (device dm-0 state A) in __btrfs_unlink_inode:4345: errno=-2 No such entry
[87.2904] BTRFS: error (device dm-0 state EAO) in do_abort_log_replay:191: errno=-2 No such entry
[87.2915] BTRFS critical (device dm-0 state EAO): log tree (for root 5) leaf currently being processed (slot 7 key (258 12 257)):
[87.2929] BTRFS info (device dm-0 state EAO): leaf 30736384 gen 10 total ptrs 7 free space 15712 owner 18446744073709551610
[87.2929] BTRFS info (device dm-0 state EAO): refs 3 lock_owner 0 current 638968
[87.2929] item 0 key (257 INODE_ITEM 0) itemoff 16123 itemsize 160
[87.2929] inode generation 9 transid 10 size 0 nbytes 0
[87.2929] block group 0 mode 40755 links 1 uid 0 gid 0
[87.2929] rdev 0 sequence 7 flags 0x0
[87.2929] atime 1765464494.678070921
[87.2929] ctime 1765464494.686606513
[87.2929] mtime 1765464494.686606513
[87.2929] otime 1765464494.678070921
[87.2929] item 1 key (257 INODE_REF 256) itemoff 16109 itemsize 14
[87.2929] index 4 name_len 4
[87.2929] item 2 key (257 DIR_LOG_INDEX 2) itemoff 16101 itemsize 8
[87.2929] dir log end 2
[87.2929] item 3 key (257 DIR_LOG_INDEX 3) itemoff 16093 itemsize 8
[87.2929] dir log end 18446744073709551615
[87.2930] item 4 key (257 DIR_INDEX 3) itemoff 16060 itemsize 33
[87.2930] location key (258 1 0) type 1
[87.2930] transid 10 data_len 0 name_len 3
[87.2930] item 5 key (258 INODE_ITEM 0) itemoff 15900 itemsize 160
[87.2930] inode generation 9 transid 10 size 0 nbytes 0
[87.2930] block group 0 mode 100644 links 1 uid 0 gid 0
[87.2930] rdev 0 sequence 2 flags 0x0
[87.2930] atime 1765464494.678456467
[87.2930] ctime 1765464494.686606513
[87.2930] mtime 1765464494.678456467
[87.2930] otime 1765464494.678456467
[87.2930] item 6 key (258 INODE_REF 257) itemoff 15887 itemsize 13
[87.2930] index 3 name_len 3
[87.2930] BTRFS critical (device dm-0 state EAO): log replay failed in unlink_inode_for_log_replay:1045 for root 5, stage 3, with error -2: failed to unlink inode 256 parent dir 259 name subvol root 5
[87.2963] BTRFS: error (device dm-0 state EAO) in btrfs_recover_log_trees:7743: errno=-2 No such entry
[87.2981] BTRFS: error (device dm-0 state EAO) in btrfs_replay_log:2083: errno=-2 No such entry (Failed to recover log tr
So fix this by changing copy_inode_items_to_log() to always detect if
there are conflicting inodes for the ref/extref of the inode being logged
even if the inode was created in a past transaction.
A test case for fstests will follow soon.
CC: stable@vger.kernel.org # 6.1+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit c6c209ceb87f64a6ceebe61761951dcbbf4a0baa upstream.
I haven't found an NFSERR_EAGAIN in RFCs 1094, 1813, 7530, or 8881.
None of these RFCs have an NFS status code that match the numeric
value "11".
Based on the meaning of the EAGAIN errno, I presume the use of this
status in NFSD means NFS4ERR_DELAY. So replace the one usage of
nfserr_eagain, and remove it from NFSD's NFS status conversion
tables.
As far as I can tell, NFSERR_EAGAIN has existed since the pre-git
era, but was not actually used by any code until commit f4e44b393389
("NFSD: delay unmount source's export after inter-server copy
completed."), at which time it become possible for NFSD to return
a status code of 11 (which is not valid NFS protocol).
Fixes: f4e44b393389 ("NFSD: delay unmount source's export after inter-server copy completed.")
Cc: stable@vger.kernel.org
Reviewed-by: NeilBrown <neil@brown.name>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 0b88bfa42e5468baff71909c2f324a495318532b upstream.
When the NFSD instance doesn't to startup, the net ref data memory is
not properly reclaimed, which triggers the memory leak issue reported
by syzbot [1].
To avoid the problem reported in [1], the net ref data memory reclamation
action is moved outside of nfsd_net_up when the net is shutdown.
[1]
unreferenced object 0xffff88812a39dfc0 (size 64):
backtrace (crc a2262fc6):
percpu_ref_init+0x94/0x1e0 lib/percpu-refcount.c:76
nfsd_create_serv+0xbe/0x260 fs/nfsd/nfssvc.c:605
nfsd_nl_listener_set_doit+0x62/0xb00 fs/nfsd/nfsctl.c:1882
genl_family_rcv_msg_doit+0x11e/0x190 net/netlink/genetlink.c:1115
genl_family_rcv_msg net/netlink/genetlink.c:1195 [inline]
genl_rcv_msg+0x2fd/0x440 net/netlink/genetlink.c:1210
BUG: memory leak
Reported-by: syzbot+6ee3b889bdeada0a6226@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=6ee3b889bdeada0a6226
Fixes: 39972494e318 ("nfsd: update percpu_ref to manage references on nfsd_net")
Cc: stable@vger.kernel.org
Signed-off-by: Edward Adam Davis <eadavis@qq.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit d0424066fcd294977f310964bed6f2a487fa4515 upstream.
If we are trying to unlock the filesystem via an administrative
interface and nfsd isn't running, it crashes the server. This
happens currently because nfsd4_revoke_states() access state
structures (eg., conf_id_hashtbl) that has been freed as a part
of the server shutdown.
[ 59.465072] Call trace:
[ 59.465308] nfsd4_revoke_states+0x1b4/0x898 [nfsd] (P)
[ 59.465830] write_unlock_fs+0x258/0x440 [nfsd]
[ 59.466278] nfsctl_transaction_write+0xb0/0x120 [nfsd]
[ 59.466780] vfs_write+0x1f0/0x938
[ 59.467088] ksys_write+0xfc/0x1f8
[ 59.467395] __arm64_sys_write+0x74/0xb8
[ 59.467746] invoke_syscall.constprop.0+0xdc/0x1e8
[ 59.468177] do_el0_svc+0x154/0x1d8
[ 59.468489] el0_svc+0x40/0xe0
[ 59.468767] el0t_64_sync_handler+0xa0/0xe8
[ 59.469138] el0t_64_sync+0x1ac/0x1b0
Ensure this can't happen by taking the nfsd_mutex and checking that
the server is still up, and then holding the mutex across the call to
nfsd4_revoke_states().
Reviewed-by: NeilBrown <neil@brown.name>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Fixes: 1ac3629bf0125 ("nfsd: prepare for supporting admin-revocation of state")
Cc: stable@vger.kernel.org
Signed-off-by: Olga Kornievskaia <okorniev@redhat.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit fb321998de7639f1954430674475e469fb529d9c upstream.
The loop in nfsd4_revoke_states() stops one too early because
the end value given is CLIENT_HASH_MASK where it should be
CLIENT_HASH_SIZE.
This means that an admin request to drop all locks for a filesystem will
miss locks held by clients which hash to the maximum possible hash value.
Fixes: 1ac3629bf012 ("nfsd: prepare for supporting admin-revocation of state")
Cc: stable@vger.kernel.org
Signed-off-by: NeilBrown <neil@brown.name>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 2857bd59feb63fcf40fe4baf55401baea6b4feb4 upstream.
Writing to v4_end_grace can race with server shutdown and result in
memory being accessed after it was freed - reclaim_str_hashtbl in
particularly.
We cannot hold nfsd_mutex across the nfsd4_end_grace() call as that is
held while client_tracking_op->init() is called and that can wait for
an upcall to nfsdcltrack which can write to v4_end_grace, resulting in a
deadlock.
nfsd4_end_grace() is also called by the landromat work queue and this
doesn't require locking as server shutdown will stop the work and wait
for it before freeing anything that nfsd4_end_grace() might access.
However, we must be sure that writing to v4_end_grace doesn't restart
the work item after shutdown has already waited for it. For this we
add a new flag protected with nn->client_lock. It is set only while it
is safe to make client tracking calls, and v4_end_grace only schedules
work while the flag is set with the spinlock held.
So this patch adds a nfsd_net field "client_tracking_active" which is
set as described. Another field "grace_end_forced", is set when
v4_end_grace is written. After this is set, and providing
client_tracking_active is set, the laundromat is scheduled.
This "grace_end_forced" field bypasses other checks for whether the
grace period has finished.
This resolves a race which can result in use-after-free.
Reported-by: Li Lingfeng <lilingfeng3@huawei.com>
Closes: https://lore.kernel.org/linux-nfs/20250623030015.2353515-1-neil@brown.name/T/#t
Fixes: 7f5ef2e900d9 ("nfsd: add a v4_end_grace file to /proc/fs/nfsd")
Cc: stable@vger.kernel.org
Signed-off-by: NeilBrown <neil@brown.name>
Tested-by: Li Lingfeng <lilingfeng3@huawei.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit e901c7fce59e72d9f3c92733c379849c4034ac50 upstream.
Commit abc02e5602f7 ("NFSD: Support write delegations in LAYOUTGET")
added NFSD_MAY_OWNER_OVERRIDE to the access flags passed from
nfsd4_layoutget() to fh_verify(). This causes LAYOUTGET to fail for
executable-only files, and causes xfstests generic/126 to fail on
pNFS SCSI.
To allow read access to executable-only files, what we really want is:
1. The "permissions" portion of the access flags (the lower 6 bits)
must be exactly NFSD_MAY_READ
2. The "hints" portion of the access flags (the upper 26 bits) can
contain any combination of NFSD_MAY_OWNER_OVERRIDE and
NFSD_MAY_READ_IF_EXEC
Fixes: abc02e5602f7 ("NFSD: Support write delegations in LAYOUTGET")
Cc: stable@vger.kernel.org # v6.6+
Signed-off-by: Scott Mayhew <smayhew@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: NeilBrown <neil@brown.name>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 3af870aedbff10bfed220e280b57a405e972229f upstream.
Commit f2060bdc21d7 ("nfs/localio: add refcounting for each iocb IO
associated with NFS pgio header") inadvertantly reintroduced the same
potential for __put_cred() triggering BUG_ON(cred == current->cred) that
commit 992203a1fba5 ("nfs/localio: restore creds before releasing pageio
data") fixed.
Fix this by saving and restoring the cred around each {read,write}_iter
call within the respective for loop of nfs_local_call_{read,write} using
scoped_with_creds().
NOTE: this fix started by first reverting the following commits:
94afb627dfc2 ("nfs: use credential guards in nfs_local_call_read()")
bff3c841f7bd ("nfs: use credential guards in nfs_local_call_write()")
1d18101a644e ("Merge tag 'kernel-6.19-rc1.cred' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs")
followed by narrowly fixing the cred lifetime issue by using
scoped_with_creds(). In doing so, this commit's changes appear more
extensive than they really are (as evidenced by comparing to v6.18's
fs/nfs/localio.c).
Reported-by: Zorro Lang <zlang@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Acked-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Link: https://lore.kernel.org/linux-next/20251205111942.4150b06f@canb.auug.org.au/
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 4012d78562193ef5eb613bad4b0c0fa187637cfe upstream.
erofs readahead could fail with ENOMEM under the memory pressure because
it tries to alloc_page with GFP_NOWAIT | GFP_NORETRY, while GFP_KERNEL
for a regular read. And if readahead fails (with non-uptodate folios),
the original request will then fall back to synchronous read, and
`.read_folio()` should return appropriate errnos.
However, in scenarios where readahead and read operations compete,
read operation could return an unintended EIO because of an incorrect
error propagation.
To resolve this, this patch modifies the behavior so that, when the
PCL is for read(which means pcl.besteffort is true), it attempts actual
decompression instead of propagating the privios error except initial EIO.
- Page size: 4K
- The original size of FileA: 16K
- Compress-ratio per PCL: 50% (Uncompressed 8K -> Compressed 4K)
[page0, page1] [page2, page3]
[PCL0]---------[PCL1]
- functions declaration:
. pread(fd, buf, count, offset)
. readahead(fd, offset, count)
- Thread A tries to read the last 4K
- Thread B tries to do readahead 8K from 4K
- RA, besteffort == false
- R, besteffort == true
<process A> <process B>
pread(FileA, buf, 4K, 12K)
do readahead(page3) // failed with ENOMEM
wait_lock(page3)
if (!uptodate(page3))
goto do_read
readahead(FileA, 4K, 8K)
// Here create PCL-chain like below:
// [null, page1] [page2, null]
// [PCL0:RA]-----[PCL1:RA]
...
do read(page3) // found [PCL1:RA] and add page3 into it,
// and then, change PCL1 from RA to R
...
// Now, PCL-chain is as below:
// [null, page1] [page2, page3]
// [PCL0:RA]-----[PCL1:R]
// try to decompress PCL-chain...
z_erofs_decompress_queue
err = 0;
// failed with ENOMEM, so page 1
// only for RA will not be uptodated.
// it's okay.
err = decompress([PCL0:RA], err)
// However, ENOMEM propagated to next
// PCL, even though PCL is not only
// for RA but also for R. As a result,
// it just failed with ENOMEM without
// trying any decompression, so page2
// and page3 will not be uptodated.
** BUG HERE ** --> err = decompress([PCL1:R], err)
return err as ENOMEM
...
wait_lock(page3)
if (!uptodate(page3))
return EIO <-- Return an unexpected EIO!
...
Fixes: 2349d2fa02db ("erofs: sunset unneeded NOFAILs")
Cc: stable@vger.kernel.org
Reviewed-by: Jaewook Kim <jw5454.kim@samsung.com>
Reviewed-by: Sungjong Seo <sj1557.seo@samsung.com>
Signed-off-by: Junbeom Yeom <junbeom.yeom@samsung.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 1f941b2c23fd34c6f3b76d36f9d0a2528fa92b8f upstream.
In error path, call drop_client() to drop the reference
obtained by get_nfsdfs_clp().
Fixes: 78599c42ae3c ("nfsd4: add file to display list of client's opens")
Cc: stable@vger.kernel.org
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Haoxiang Li <lihaoxiang@isrc.iscas.ac.cn>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 8f9e967830ff32ab7756f530a36adf74a9f12b76 upstream.
When finalizing timestamps that have never been updated and preparing to
release the delegation lease, the notify_change() call can trigger a
delegation break, and fail to update the timestamps. When this happens,
there will be messages like this in dmesg:
[ 2709.375785] Unable to update timestamps on inode 00:39:263: -11
Since this code is going to release the lease just after updating the
timestamps, breaking the delegation is undesirable. Fix this by setting
ATTR_DELEG in ia_valid, in order to avoid the delegation break.
Fixes: e5e9b24ab8fa ("nfsd: freeze c/mtime updates with outstanding WRITE_ATTRS delegation")
Cc: stable@vger.kernel.org
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 8072e34e1387d03102b788677d491e2bcceef6f5 upstream.
nfsd4_add_rdaccess_to_wrdeleg() unconditionally overwrites
fp->fi_fds[O_RDONLY] with a newly acquired nfsd_file. However, if
the client already has a SHARE_ACCESS_READ open from a previous OPEN
operation, this action overwrites the existing pointer without
releasing its reference, orphaning the previous reference.
Additionally, the function originally stored the same nfsd_file
pointer in both fp->fi_fds[O_RDONLY] and fp->fi_rdeleg_file with
only a single reference. When put_deleg_file() runs, it clears
fi_rdeleg_file and calls nfs4_file_put_access() to release the file.
However, nfs4_file_put_access() only releases fi_fds[O_RDONLY] when
the fi_access[O_RDONLY] counter drops to zero. If another READ open
exists on the file, the counter remains elevated and the nfsd_file
reference from the delegation is never released. This potentially
causes open conflicts on that file.
Then, on server shutdown, these leaks cause __nfsd_file_cache_purge()
to encounter files with an elevated reference count that cannot be
cleaned up, ultimately triggering a BUG() in kmem_cache_destroy()
because there are still nfsd_file objects allocated in that cache.
Fixes: e7a8ebc305f2 ("NFSD: Offer write delegation for OPEN with OPEN4_SHARE_ACCESS_WRITE")
Cc: stable@vger.kernel.org
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit a49a2a1baa0c553c3548a1c414b6a3c005a8deba upstream.
Usage of vfs_test_lock() is somewhat confused. Documentation suggests
it is given a "lock" but this is not the case. It is given a struct
file_lock which contains some details of the sort of lock it should be
looking for.
In particular passing a "file_lock" containing fl_lmops or fl_ops is
meaningless and possibly confusing.
This is particularly problematic in lockd. nlmsvc_testlock() receives
an initialised "file_lock" from xdr-decode, including manager ops and an
owner. It then mistakenly passes this to vfs_test_lock() which might
replace the owner and the ops. This can lead to confusion when freeing
the lock.
The primary role of the 'struct file_lock' passed to vfs_test_lock() is
to report a conflicting lock that was found, so it makes more sense for
nlmsvc_testlock() to pass "conflock", which it uses for returning the
conflicting lock.
With this change, freeing of the lock is not confused and code in
__nlm4svc_proc_test() and __nlmsvc_proc_test() can be simplified.
Documentation for vfs_test_lock() is improved to reflect its real
purpose, and a WARN_ON_ONCE() is added to avoid a similar problem in the
future.
Reported-by: Olga Kornievskaia <okorniev@redhat.com>
Closes: https://lore.kernel.org/all/20251021130506.45065-1-okorniev@redhat.com
Signed-off-by: NeilBrown <neil@brown.name>
Fixes: 20fa19027286 ("nfs: add export operations")
Cc: stable@vger.kernel.org
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit e3e8e176ca4876e6212582022ad80835dddc9de4 upstream.
Mike noted that when NFSD responds to an NFS_FILE_SYNC WRITE, it
does not also persist file time stamps. To wit, Section 18.32.3
of RFC 8881 mandates:
> The client specifies with the stable parameter the method of how
> the data is to be processed by the server. If stable is
> FILE_SYNC4, the server MUST commit the data written plus all file
> system metadata to stable storage before returning results. This
> corresponds to the NFSv2 protocol semantics. Any other behavior
> constitutes a protocol violation. If stable is DATA_SYNC4, then
> the server MUST commit all of the data to stable storage and
> enough of the metadata to retrieve the data before returning.
Commit 3f3503adb332 ("NFSD: Use vfs_iocb_iter_write()") replaced:
- flags |= RWF_SYNC;
with:
+ kiocb.ki_flags |= IOCB_DSYNC;
which appears to be correct given:
if (flags & RWF_SYNC)
kiocb_flags |= IOCB_DSYNC;
in kiocb_set_rw_flags(). However the author of that commit did not
appreciate that the previous line in kiocb_set_rw_flags() results
in IOCB_SYNC also being set:
kiocb_flags |= (__force int) (flags & RWF_SUPPORTED);
RWF_SUPPORTED contains RWF_SYNC, and RWF_SYNC is the same bit as
IOCB_SYNC. Reviewers at the time did not catch the omission.
Reported-by: Mike Snitzer <snitzer@kernel.org>
Closes: https://lore.kernel.org/linux-nfs/20251018005431.3403-1-cel@kernel.org/T/#t
Fixes: 3f3503adb332 ("NFSD: Use vfs_iocb_iter_write()")
Cc: stable@vger.kernel.org
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: NeilBrown <neil@brown.name>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 68f6bd128e75a032432eda9d16676ed2969a1096 upstream.
When reading a compressed file, we may read several pages in addition to
the one requested. The current code will overwrite pages in the page
cache with the data from disc which can definitely result in changes
that have been made being lost.
For example if we have four consecutie pages ABCD in the file compressed
into a single extent, on first access, we'll bring in ABCD. Then we
write to page B. Memory pressure results in the eviction of ACD.
When we attempt to write to page C, we will overwrite the data in page
B with the data currently on disk.
I haven't investigated the decompression code to check whether it's
OK to overwrite a clean page or whether it might be possible to see
corrupt data. Out of an abundance of caution, decline to overwrite
uptodate pages, not just dirty pages.
Fixes: 4342306f0f0d (fs/ntfs3: Add file operations and implementation)
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: stable@vger.kernel.org
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 0c56693b06a68476ba113db6347e7897475f9e4c ]
In get_file_all_info(), if vfs_getattr() fails, the function returns
immediately without freeing the allocated filename, leading to a memory
leak.
Fix this by freeing the filename before returning in this error case.
Fixes: 5614c8c487f6a ("ksmbd: replace generic_fillattr with vfs_getattr")
Signed-off-by: Zilin Guan <zilin@seu.edu.cn>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
commit 6e0d7f7f4a43ac8868e98c87ecf48805aa8c24dd upstream.
Fix a possible reference count leak of payload pages during
fuse argument copies.
[Joanne: simplified error cleanup]
Fixes: c090c8abae4b ("fuse: Add io-uring sqe commit and fetch support")
Cc: stable@vger.kernel.org # v6.14
Signed-off-by: Cheng Ding <cding@ddn.com>
Signed-off-by: Bernd Schubert <bschubert@ddn.com>
Reviewed-by: Joanne Koong <joannelkoong@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit bd5603eaae0aabf527bfb3ce1bb07e979ce5bd50 upstream.
Commit e26ee4efbc79 ("fuse: allocate ff->release_args only if release is
needed") skips allocating ff->release_args if the server does not
implement open. However in doing so, fuse_prepare_release() now skips
grabbing the reference on the inode, which makes it possible for an
inode to be evicted from the dcache while there are inflight readahead
requests. This causes a deadlock if the server triggers reclaim while
servicing the readahead request and reclaim attempts to evict the inode
of the file being read ahead. Since the folio is locked during
readahead, when reclaim evicts the fuse inode and fuse_evict_inode()
attempts to remove all folios associated with the inode from the page
cache (truncate_inode_pages_range()), reclaim will block forever waiting
for the lock since readahead cannot relinquish the lock because it is
itself blocked in reclaim:
>>> stack_trace(1504735)
folio_wait_bit_common (mm/filemap.c:1308:4)
folio_lock (./include/linux/pagemap.h:1052:3)
truncate_inode_pages_range (mm/truncate.c:336:10)
fuse_evict_inode (fs/fuse/inode.c:161:2)
evict (fs/inode.c:704:3)
dentry_unlink_inode (fs/dcache.c:412:3)
__dentry_kill (fs/dcache.c:615:3)
shrink_kill (fs/dcache.c:1060:12)
shrink_dentry_list (fs/dcache.c:1087:3)
prune_dcache_sb (fs/dcache.c:1168:2)
super_cache_scan (fs/super.c:221:10)
do_shrink_slab (mm/shrinker.c:435:9)
shrink_slab (mm/shrinker.c:626:10)
shrink_node (mm/vmscan.c:5951:2)
shrink_zones (mm/vmscan.c:6195:3)
do_try_to_free_pages (mm/vmscan.c:6257:3)
do_swap_page (mm/memory.c:4136:11)
handle_pte_fault (mm/memory.c:5562:10)
handle_mm_fault (mm/memory.c:5870:9)
do_user_addr_fault (arch/x86/mm/fault.c:1338:10)
handle_page_fault (arch/x86/mm/fault.c:1481:3)
exc_page_fault (arch/x86/mm/fault.c:1539:2)
asm_exc_page_fault+0x22/0x27
Fix this deadlock by allocating ff->release_args and grabbing the
reference on the inode when preparing the file for release even if the
server does not implement open. The inode reference will be dropped when
the last reference on the fuse file is dropped (see fuse_file_put() ->
fuse_release_end()).
Fixes: e26ee4efbc79 ("fuse: allocate ff->release_args only if release is needed")
Cc: stable@vger.kernel.org
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Reported-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 95c39eef7c2b666026c69ab5b30471da94ea2874 upstream.
When a request is terminated before it has been committed, the request
is not removed from the queue's list. This leaves a dangling list entry
that leads to list corruption and use-after-free issues.
Remove the request from the queue's list for terminated non-committed
requests.
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Fixes: c090c8abae4b ("fuse: Add io-uring sqe commit and fetch support")
Cc: stable@vger.kernel.org
Reviewed-by: Bernd Schubert <bschubert@ddn.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
transaction
commit 266273eaf4d99475f1ae57f687b3e42bc71ec6f0 upstream.
We can't log a conflicting inode if it's a directory and it was moved
from one parent directory to another parent directory in the current
transaction, as this can result an attempt to have a directory with
two hard links during log replay, one for the old parent directory and
another for the new parent directory.
The following scenario triggers that issue:
1) We have directories "dir1" and "dir2" created in a past transaction.
Directory "dir1" has inode A as its parent directory;
2) We move "dir1" to some other directory;
3) We create a file with the name "dir1" in directory inode A;
4) We fsync the new file. This results in logging the inode of the new file
and the inode for the directory "dir1" that was previously moved in the
current transaction. So the log tree has the INODE_REF item for the
new location of "dir1";
5) We move the new file to some other directory. This results in updating
the log tree to included the new INODE_REF for the new location of the
file and removes the INODE_REF for the old location. This happens
during the rename when we call btrfs_log_new_name();
6) We fsync the file, and that persists the log tree changes done in the
previous step (btrfs_log_new_name() only updates the log tree in
memory);
7) We have a power failure;
8) Next time the fs is mounted, log replay happens and when processing
the inode for directory "dir1" we find a new INODE_REF and add that
link, but we don't remove the old link of the inode since we have
not logged the old parent directory of the directory inode "dir1".
As a result after log replay finishes when we trigger writeback of the
subvolume tree's extent buffers, the tree check will detect that we have
a directory a hard link count of 2 and we get a mount failure.
The errors and stack traces reported in dmesg/syslog are like this:
[ 3845.729764] BTRFS info (device dm-0): start tree-log replay
[ 3845.730304] page: refcount:3 mapcount:0 mapping:000000005c8a3027 index:0x1d00 pfn:0x11510c
[ 3845.731236] memcg:ffff9264c02f4e00
[ 3845.731751] aops:btree_aops [btrfs] ino:1
[ 3845.732300] flags: 0x17fffc00000400a(uptodate|private|writeback|node=0|zone=2|lastcpupid=0x1ffff)
[ 3845.733346] raw: 017fffc00000400a 0000000000000000 dead000000000122 ffff9264d978aea8
[ 3845.734265] raw: 0000000000001d00 ffff92650e6d4738 00000003ffffffff ffff9264c02f4e00
[ 3845.735305] page dumped because: eb page dump
[ 3845.735981] BTRFS critical (device dm-0): corrupt leaf: root=5 block=30408704 slot=6 ino=257, invalid nlink: has 2 expect no more than 1 for dir
[ 3845.737786] BTRFS info (device dm-0): leaf 30408704 gen 10 total ptrs 17 free space 14881 owner 5
[ 3845.737789] BTRFS info (device dm-0): refs 4 lock_owner 0 current 30701
[ 3845.737792] item 0 key (256 INODE_ITEM 0) itemoff 16123 itemsize 160
[ 3845.737794] inode generation 3 transid 9 size 16 nbytes 16384
[ 3845.737795] block group 0 mode 40755 links 1 uid 0 gid 0
[ 3845.737797] rdev 0 sequence 2 flags 0x0
[ 3845.737798] atime 1764259517.0
[ 3845.737800] ctime 1764259517.572889464
[ 3845.737801] mtime 1764259517.572889464
[ 3845.737802] otime 1764259517.0
[ 3845.737803] item 1 key (256 INODE_REF 256) itemoff 16111 itemsize 12
[ 3845.737805] index 0 name_len 2
[ 3845.737807] item 2 key (256 DIR_ITEM 2363071922) itemoff 16077 itemsize 34
[ 3845.737808] location key (257 1 0) type 2
[ 3845.737810] transid 9 data_len 0 name_len 4
[ 3845.737811] item 3 key (256 DIR_ITEM 2676584006) itemoff 16043 itemsize 34
[ 3845.737813] location key (258 1 0) type 2
[ 3845.737814] transid 9 data_len 0 name_len 4
[ 3845.737815] item 4 key (256 DIR_INDEX 2) itemoff 16009 itemsize 34
[ 3845.737816] location key (257 1 0) type 2
[ 3845.737818] transid 9 data_len 0 name_len 4
[ 3845.737819] item 5 key (256 DIR_INDEX 3) itemoff 15975 itemsize 34
[ 3845.737820] location key (258 1 0) type 2
[ 3845.737821] transid 9 data_len 0 name_len 4
[ 3845.737822] item 6 key (257 INODE_ITEM 0) itemoff 15815 itemsize 160
[ 3845.737824] inode generation 9 transid 10 size 6 nbytes 0
[ 3845.737825] block group 0 mode 40755 links 2 uid 0 gid 0
[ 3845.737826] rdev 0 sequence 1 flags 0x0
[ 3845.737827] atime 1764259517.572889464
[ 3845.737828] ctime 1764259517.572889464
[ 3845.737830] mtime 1764259517.572889464
[ 3845.737831] otime 1764259517.572889464
[ 3845.737832] item 7 key (257 INODE_REF 256) itemoff 15801 itemsize 14
[ 3845.737833] index 2 name_len 4
[ 3845.737834] item 8 key (257 INODE_REF 258) itemoff 15787 itemsize 14
[ 3845.737836] index 2 name_len 4
[ 3845.737837] item 9 key (257 DIR_ITEM 2507850652) itemoff 15754 itemsize 33
[ 3845.737838] location key (259 1 0) type 1
[ 3845.737839] transid 10 data_len 0 name_len 3
[ 3845.737840] item 10 key (257 DIR_INDEX 2) itemoff 15721 itemsize 33
[ 3845.737842] location key (259 1 0) type 1
[ 3845.737843] transid 10 data_len 0 name_len 3
[ 3845.737844] item 11 key (258 INODE_ITEM 0) itemoff 15561 itemsize 160
[ 3845.737846] inode generation 9 transid 10 size 8 nbytes 0
[ 3845.737847] block group 0 mode 40755 links 1 uid 0 gid 0
[ 3845.737848] rdev 0 sequence 1 flags 0x0
[ 3845.737849] atime 1764259517.572889464
[ 3845.737850] ctime 1764259517.572889464
[ 3845.737851] mtime 1764259517.572889464
[ 3845.737852] otime 1764259517.572889464
[ 3845.737853] item 12 key (258 INODE_REF 256) itemoff 15547 itemsize 14
[ 3845.737855] index 3 name_len 4
[ 3845.737856] item 13 key (258 DIR_ITEM 1843588421) itemoff 15513 itemsize 34
[ 3845.737857] location key (257 1 0) type 2
[ 3845.737858] transid 10 data_len 0 name_len 4
[ 3845.737860] item 14 key (258 DIR_INDEX 2) itemoff 15479 itemsize 34
[ 3845.737861] location key (257 1 0) type 2
[ 3845.737862] transid 10 data_len 0 name_len 4
[ 3845.737863] item 15 key (259 INODE_ITEM 0) itemoff 15319 itemsize 160
[ 3845.737865] inode generation 10 transid 10 size 0 nbytes 0
[ 3845.737866] block group 0 mode 100600 links 1 uid 0 gid 0
[ 3845.737867] rdev 0 sequence 2 flags 0x0
[ 3845.737868] atime 1764259517.580874966
[ 3845.737869] ctime 1764259517.586121869
[ 3845.737870] mtime 1764259517.580874966
[ 3845.737872] otime 1764259517.580874966
[ 3845.737873] item 16 key (259 INODE_REF 257) itemoff 15306 itemsize 13
[ 3845.737874] index 2 name_len 3
[ 3845.737875] BTRFS error (device dm-0): block=30408704 write time tree block corruption detected
[ 3845.739448] ------------[ cut here ]------------
[ 3845.740092] WARNING: CPU: 5 PID: 30701 at fs/btrfs/disk-io.c:335 btree_csum_one_bio+0x25a/0x270 [btrfs]
[ 3845.741439] Modules linked in: btrfs dm_flakey crc32c_cryptoapi (...)
[ 3845.750626] CPU: 5 UID: 0 PID: 30701 Comm: mount Tainted: G W 6.18.0-rc6-btrfs-next-218+ #1 PREEMPT(full)
[ 3845.752414] Tainted: [W]=WARN
[ 3845.752828] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-0-gea1b7a073390-prebuilt.qemu.org 04/01/2014
[ 3845.754499] RIP: 0010:btree_csum_one_bio+0x25a/0x270 [btrfs]
[ 3845.755460] Code: 31 f6 48 89 (...)
[ 3845.758685] RSP: 0018:ffffa8d9c5677678 EFLAGS: 00010246
[ 3845.759450] RAX: 0000000000000000 RBX: ffff92650e6d4738 RCX: 0000000000000000
[ 3845.760309] RDX: 0000000000000000 RSI: ffffffff9aab45b9 RDI: ffff9264c4748000
[ 3845.761239] RBP: ffff9264d4324000 R08: 0000000000000000 R09: ffffa8d9c5677468
[ 3845.762607] R10: ffff926bdc1fffa8 R11: 0000000000000003 R12: ffffa8d9c5677680
[ 3845.764099] R13: 0000000000004000 R14: ffff9264dd624000 R15: ffff9264d978aba8
[ 3845.765094] FS: 00007f751fa5a840(0000) GS:ffff926c42a82000(0000) knlGS:0000000000000000
[ 3845.766226] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 3845.766970] CR2: 0000558df1815380 CR3: 000000010ed88003 CR4: 0000000000370ef0
[ 3845.768009] Call Trace:
[ 3845.768392] <TASK>
[ 3845.768714] btrfs_submit_bbio+0x6ee/0x7f0 [btrfs]
[ 3845.769640] ? write_one_eb+0x28e/0x340 [btrfs]
[ 3845.770588] btree_write_cache_pages+0x2f0/0x550 [btrfs]
[ 3845.771286] ? alloc_extent_state+0x19/0x100 [btrfs]
[ 3845.771967] ? merge_next_state+0x1a/0x90 [btrfs]
[ 3845.772586] ? set_extent_bit+0x233/0x8b0 [btrfs]
[ 3845.773198] ? xas_load+0x9/0xc0
[ 3845.773589] ? xas_find+0x14d/0x1a0
[ 3845.773969] do_writepages+0xc6/0x160
[ 3845.774367] filemap_fdatawrite_wbc+0x48/0x60
[ 3845.775003] __filemap_fdatawrite_range+0x5b/0x80
[ 3845.775902] btrfs_write_marked_extents+0x61/0x170 [btrfs]
[ 3845.776707] btrfs_write_and_wait_transaction+0x4e/0xc0 [btrfs]
[ 3845.777379] ? _raw_spin_unlock_irqrestore+0x23/0x40
[ 3845.777923] btrfs_commit_transaction+0x5ea/0xd20 [btrfs]
[ 3845.778551] ? _raw_spin_unlock+0x15/0x30
[ 3845.778986] ? release_extent_buffer+0x34/0x160 [btrfs]
[ 3845.779659] btrfs_recover_log_trees+0x7a3/0x7c0 [btrfs]
[ 3845.780416] ? __pfx_replay_one_buffer+0x10/0x10 [btrfs]
[ 3845.781499] open_ctree+0x10bb/0x15f0 [btrfs]
[ 3845.782194] btrfs_get_tree.cold+0xb/0x16c [btrfs]
[ 3845.782764] ? fscontext_read+0x15c/0x180
[ 3845.783202] ? rw_verify_area+0x50/0x180
[ 3845.783667] vfs_get_tree+0x25/0xd0
[ 3845.784047] vfs_cmd_create+0x59/0xe0
[ 3845.784458] __do_sys_fsconfig+0x4f6/0x6b0
[ 3845.784914] do_syscall_64+0x50/0x1220
[ 3845.785340] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 3845.785980] RIP: 0033:0x7f751fc7f4aa
[ 3845.786759] Code: 73 01 c3 48 (...)
[ 3845.789951] RSP: 002b:00007ffcdba45dc8 EFLAGS: 00000246 ORIG_RAX: 00000000000001af
[ 3845.791402] RAX: ffffffffffffffda RBX: 000055ccc8291c20 RCX: 00007f751fc7f4aa
[ 3845.792688] RDX: 0000000000000000 RSI: 0000000000000006 RDI: 0000000000000003
[ 3845.794308] RBP: 000055ccc8292120 R08: 0000000000000000 R09: 0000000000000000
[ 3845.795829] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[ 3845.797183] R13: 00007f751fe11580 R14: 00007f751fe1326c R15: 00007f751fdf8a23
[ 3845.798633] </TASK>
[ 3845.799067] ---[ end trace 0000000000000000 ]---
[ 3845.800215] BTRFS: error (device dm-0) in btrfs_commit_transaction:2553: errno=-5 IO failure (Error while writing out transaction)
[ 3845.801860] BTRFS warning (device dm-0 state E): Skipping commit of aborted transaction.
[ 3845.802815] BTRFS error (device dm-0 state EA): Transaction aborted (error -5)
[ 3845.803728] BTRFS: error (device dm-0 state EA) in cleanup_transaction:2036: errno=-5 IO failure
[ 3845.805374] BTRFS: error (device dm-0 state EA) in btrfs_replay_log:2083: errno=-5 IO failure (Failed to recover log tree)
[ 3845.807919] BTRFS error (device dm-0 state EA): open_ctree failed: -5
Fix this by never logging a conflicting inode that is a directory and was
moved in the current transaction (its last_unlink_trans equals the current
transaction) and instead fallback to a transaction commit.
A test case for fstests will follow soon.
Reported-by: Vyacheslav Kovalevsky <slva.kovalevskiy.2014@gmail.com>
Link: https://lore.kernel.org/linux-btrfs/7bbc9419-5c56-450a-b5a0-efeae7457113@gmail.com/
CC: stable@vger.kernel.org # 6.1+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit ebae102897e760e9e6bc625f701dd666b2163bd1 upstream.
Clang is not happy about set but (in some cases) unused variable:
fs/nfsd/export.c:1027:17: error: variable 'inode' set but not used [-Werror,-Wunused-but-set-variable]
since it's used as a parameter to dprintk() which might be configured
a no-op. To avoid uglifying code with the specific ifdeffery just mark
the variable __maybe_unused.
The commit [1], which introduced this behaviour, is quite old and hence
the Fixes tag points to the first of the Git era.
Link: https://git.kernel.org/pub/scm/linux/kernel/git/history/history.git/commit/?id=0431923fb7a1 [1]
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Cc: stable@vger.kernel.org
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit ad3cbbb0c1892c48919727fcb8dec5965da8bacb upstream.
>From RFC 8881:
5.8.1.14. Attribute 75: suppattr_exclcreat
> The bit vector that would set all REQUIRED and RECOMMENDED
> attributes that are supported by the EXCLUSIVE4_1 method of file
> creation via the OPEN operation. The scope of this attribute
> applies to all objects with a matching fsid.
There's nothing in RFC 8881 that states that suppattr_exclcreat is
or is not allowed to contain bits for attributes that are clear in
the reported supported_attrs bitmask. But it doesn't make sense for
an NFS server to indicate that it /doesn't/ implement an attribute,
but then also indicate that clients /are/ allowed to set that
attribute using OPEN(create) with EXCLUSIVE4_1.
The FATTR4_WORD2_TIME_DELEG attributes are also not to be allowed
for OPEN(create) with EXCLUSIVE4_1. It doesn't make sense to set
a delegated timestamp on a new file.
Fixes: 7e13f4f8d27d ("nfsd: handle delegated timestamps in SETATTR")
Cc: stable@vger.kernel.org
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 913f7cf77bf14c13cfea70e89bcb6d0b22239562 upstream.
An NFSv4 client that sets an ACL with a named principal during file
creation retrieves the ACL afterwards, and finds that it is only a
default ACL (based on the mode bits) and not the ACL that was
requested during file creation. This violates RFC 8881 section
6.4.1.3: "the ACL attribute is set as given".
The issue occurs in nfsd_create_setattr(), which calls
nfsd_attrs_valid() to determine whether to call nfsd_setattr().
However, nfsd_attrs_valid() checks only for iattr changes and
security labels, but not POSIX ACLs. When only an ACL is present,
the function returns false, nfsd_setattr() is skipped, and the
POSIX ACL is never applied to the inode.
Subsequently, when the client retrieves the ACL, the server finds
no POSIX ACL on the inode and returns one generated from the file's
mode bits rather than returning the originally-specified ACL.
Reported-by: Aurélien Couderc <aurelien.couderc2002@gmail.com>
Fixes: c0cbe70742f4 ("NFSD: add posix ACLs to struct nfsd_attrs")
Cc: Roland Mainz <roland.mainz@nrubsig.org>
Cc: stable@vger.kernel.org
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 27d17641cacfedd816789b75d342430f6b912bd2 upstream.
>From RFC 8881:
5.8.1.14. Attribute 75: suppattr_exclcreat
> The bit vector that would set all REQUIRED and RECOMMENDED
> attributes that are supported by the EXCLUSIVE4_1 method of file
> creation via the OPEN operation. The scope of this attribute
> applies to all objects with a matching fsid.
There's nothing in RFC 8881 that states that suppattr_exclcreat is
or is not allowed to contain bits for attributes that are clear in
the reported supported_attrs bitmask. But it doesn't make sense for
an NFS server to indicate that it /doesn't/ implement an attribute,
but then also indicate that clients /are/ allowed to set that
attribute using OPEN(create) with EXCLUSIVE4_1.
Ensure that the SECURITY_LABEL and ACL bits are not set in the
suppattr_exclcreat bitmask when they are also not set in the
supported_attrs bitmask.
Fixes: 8c18f2052e75 ("nfsd41: SUPPATTR_EXCLCREAT attribute")
Cc: stable@vger.kernel.org
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 635bc4def026a24e071436f4f356ea08c0eed6ff upstream.
inotify/fanotify do not allow users with no read access to a file to
subscribe to events (e.g. IN_ACCESS/IN_MODIFY), but they do allow the
same user to subscribe for watching events on children when the user
has access to the parent directory (e.g. /dev).
Users with no read access to a file but with read access to its parent
directory can still stat the file and see if it was accessed/modified
via atime/mtime change.
The same is not true for special files (e.g. /dev/null). Users will not
generally observe atime/mtime changes when other users read/write to
special files, only when someone sets atime/mtime via utimensat().
Align fsnotify events with this stat behavior and do not generate
ACCESS/MODIFY events to parent watchers on read/write of special files.
The events are still generated to parent watchers on utimensat(). This
closes some side-channels that could be possibly used for information
exfiltration [1].
[1] https://snee.la/pdf/pubs/file-notification-attacks.pdf
Reported-by: Sudheendra Raghav Neela <sneela@tugraz.at>
CC: stable@vger.kernel.org
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 982d2616a2906113e433fdc0cfcc122f8d1bb60a upstream.
Garbage collection assumes all zones contain the full amount of blocks.
Mkfs already ensures this happens, but make the kernel check it as well
to avoid getting into trouble due to fuzzers or mkfs bugs.
Fixes: 2167eaabe2fa ("xfs: define the zoned on-disk format")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Cc: stable@vger.kernel.org # v6.15
Signed-off-by: Carlos Maiolino <cem@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 5990fd756943836978ad184aac980e2b36ab7e01 upstream.
The xchk_setup_xattr_buf function can allocate a new value buffer, which
means that any reference to ab->value before the call could become a
dangling pointer. Fix this by moving an assignment to after the buffer
setup.
Cc: stable@vger.kernel.org # v6.10
Fixes: e47dcf113ae348 ("xfs: repair extended attributes")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit dc68c0f601691010dd5ae53442f8523f41a53131 upstream.
The grofs code for zoned RT subvolums already tries to check for zone
alignment, but gets it wrong by using the old instead of the new mount
structure.
Fixes: 01b71e64bb87 ("xfs: support growfs on zoned file systems")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Cc: stable@vger.kernel.org # v6.15
Signed-off-by: Carlos Maiolino <cem@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit f06725052098d7b1133ac3846d693c383dc427a2 upstream.
gcc 14.2 warns about:
xfs_attr_item.c: In function ‘xfs_attr_recover_work’:
xfs_attr_item.c:785:9: warning: ‘ip’ may be used uninitialized [-Wmaybe-uninitialized]
785 | xfs_trans_ijoin(tp, ip, 0);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~
xfs_attr_item.c:740:42: note: ‘ip’ was declared here
740 | struct xfs_inode *ip;
| ^~
I think this is bogus since xfs_attri_recover_work either returns a real
pointer having initialized ip or an ERR_PTR having not touched it, but
the tools are smarter than me so let's just null-init the variable
anyway.
Cc: stable@vger.kernel.org # v6.8
Fixes: e70fb328d52772 ("xfs: recreate work items when recovering intent items")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit fc40459de82543b565ebc839dca8f7987f16f62e upstream.
xfs_buf_item_get_format() may allocate memory for bip->bli_formats,
free the memory in the error path.
Fixes: c3d5f0c2fb85 ("xfs: complain if anyone tries to create a too-large buffer log item")
Cc: stable@vger.kernel.org
Signed-off-by: Haoxiang Li <lihaoxiang@isrc.iscas.ac.cn>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 039bef30e320827bac8990c9f29d2a68cd8adb5f upstream.
syzbot reported a kernel BUG in ocfs2_find_victim_chain() because the
`cl_next_free_rec` field of the allocation chain list (next free slot in
the chain list) is 0, triggring the BUG_ON(!cl->cl_next_free_rec)
condition in ocfs2_find_victim_chain() and panicking the kernel.
To fix this, an if condition is introduced in ocfs2_claim_suballoc_bits(),
just before calling ocfs2_find_victim_chain(), the code block in it being
executed when either of the following conditions is true:
1. `cl_next_free_rec` is equal to 0, indicating that there are no free
chains in the allocation chain list
2. `cl_next_free_rec` is greater than `cl_count` (the total number of
chains in the allocation chain list)
Either of them being true is indicative of the fact that there are no
chains left for usage.
This is addressed using ocfs2_error(), which prints
the error log for debugging purposes, rather than panicking the kernel.
Link: https://lkml.kernel.org/r/20251201130711.143900-1-activprithvi@gmail.com
Signed-off-by: Prithvi Tambewagh <activprithvi@gmail.com>
Reported-by: syzbot+96d38c6e1655c1420a72@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=96d38c6e1655c1420a72
Tested-by: syzbot+96d38c6e1655c1420a72@syzkaller.appspotmail.com
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Jun Piao <piaojun@huawei.com>
Cc: Heming Zhao <heming.zhao@suse.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 01fba45deaddcce0d0b01c411435d1acf6feab7b upstream.
With below scripts, it will trigger panic in f2fs:
mkfs.f2fs -f /dev/vdd
mount /dev/vdd /mnt/f2fs
touch /mnt/f2fs/foo
sync
echo 111 >> /mnt/f2fs/foo
f2fs_io fsync /mnt/f2fs/foo
f2fs_io shutdown 2 /mnt/f2fs
umount /mnt/f2fs
mount -o ro,norecovery /dev/vdd /mnt/f2fs
or
mount -o ro,disable_roll_forward /dev/vdd /mnt/f2fs
F2FS-fs (vdd): f2fs_recover_fsync_data: recovery fsync data, check_only: 0
F2FS-fs (vdd): Mounted with checkpoint version = 7f5c361f
F2FS-fs (vdd): Stopped filesystem due to reason: 0
F2FS-fs (vdd): f2fs_recover_fsync_data: recovery fsync data, check_only: 1
Filesystem f2fs get_tree() didn't set fc->root, returned 1
------------[ cut here ]------------
kernel BUG at fs/super.c:1761!
Oops: invalid opcode: 0000 [#1] SMP PTI
CPU: 3 UID: 0 PID: 722 Comm: mount Not tainted 6.18.0-rc2+ #721 PREEMPT(voluntary)
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
RIP: 0010:vfs_get_tree.cold+0x18/0x1a
Call Trace:
<TASK>
fc_mount+0x13/0xa0
path_mount+0x34e/0xc50
__x64_sys_mount+0x121/0x150
do_syscall_64+0x84/0x800
entry_SYSCALL_64_after_hwframe+0x76/0x7e
RIP: 0033:0x7fa6cc126cfe
The root cause is we missed to handle error number returned from
f2fs_recover_fsync_data() when mounting image w/ ro,norecovery or
ro,disable_roll_forward mount option, result in returning a positive
error number to vfs_get_tree(), fix it.
Cc: stable@kernel.org
Fixes: 6781eabba1bd ("f2fs: give -EINVAL for norecovery and rw mount")
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 37345eae9deaa2e4f372eeb98f6594cd0ee0916e upstream.
w/ LFS mode, in get_left_section_blocks(), we should not account the
blocks which were used before and now are invalided, otherwise those
blocks will be counted as freed one in has_curseg_enough_space(), result
in missing to trigger GC in time.
Cc: stable@kernel.org
Fixes: 249ad438e1d9 ("f2fs: add a method for calculating the remaining blocks in the current segment in LFS mode.")
Fixes: bf34c93d2645 ("f2fs: check curseg space before foreground GC")
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|