| Age | Commit message (Collapse) | Author | Files | Lines |
|
commit fda9522ed6afaec45cabc198d8492270c394c7bc upstream.
When a compound request such as READ + QUERY_INFO(Security) is received,
and the first command (READ) consumes most of the response buffer,
ksmbd could write beyond the allocated buffer while building a security
descriptor.
The root cause was that smb2_get_info_sec() checked buffer space using
ppntsd_size from xattr, while build_sec_desc() often synthesized a
significantly larger descriptor from POSIX ACLs.
This patch introduces smb_acl_sec_desc_scratch_len() to accurately
compute the final descriptor size beforehand, performs proper buffer
checking with smb2_calc_max_out_buf_len(), and uses exact-sized
allocation + iov pinning.
Cc: stable@vger.kernel.org
Fixes: e2b76ab8b5c9 ("ksmbd: add support for read compound")
Signed-off-by: Asim Viladi Oglu Manizada <manizada@pm.me>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 23b5df09c27aec13962b30d32a4167ebdd043f8e ]
When updating ->i_size, make sure to always update ->i_blocks as well
until we query new allocation size from the server.
generic/694 was failing because smb3_simple_falloc() was missing the
update of ->i_blocks after calling cifs_setsize(). So, fix this by
updating ->i_blocks directly in cifs_setsize(), so all places that
call it doesn't need to worry about updating ->i_blocks later.
Reported-by: Shyam Prasad N <sprasad@microsoft.com>
Closes: https://lore.kernel.org/r/CANT5p=rqgRwaADB=b_PhJkqXjtfq3SFv41SSTXSVEHnuh871pA@mail.gmail.com
Signed-off-by: Paulo Alcantara (Red Hat) <pc@manguebit.org>
Cc: David Howells <dhowells@redhat.com>
Cc: linux-cifs@vger.kernel.org
Signed-off-by: Steve French <stfrench@microsoft.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit b17b79ff896305fd74980a5f72afec370ee88ca4 ]
[BUG]
When recovering relocation at mount time, merge_reloc_root() and
btrfs_drop_snapshot() both use BUG_ON(level == 0) to guard against
an impossible state: a non-zero drop_progress combined with a zero
drop_level in a root_item, which can be triggered:
------------[ cut here ]------------
kernel BUG at fs/btrfs/relocation.c:1545!
Oops: invalid opcode: 0000 [#1] SMP KASAN NOPTI
CPU: 1 UID: 0 PID: 283 ... Tainted: 6.18.0+ #16 PREEMPT(voluntary)
Tainted: [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
Hardware name: QEMU Ubuntu 24.04 PC v2, BIOS 1.16.3-debian-1.16.3-2
RIP: 0010:merge_reloc_root+0x1266/0x1650 fs/btrfs/relocation.c:1545
Code: ffff0000 00004589 d7e9acfa ffffe8a1 79bafebe 02000000
Call Trace:
merge_reloc_roots+0x295/0x890 fs/btrfs/relocation.c:1861
btrfs_recover_relocation+0xd6e/0x11d0 fs/btrfs/relocation.c:4195
btrfs_start_pre_rw_mount+0xa4d/0x1810 fs/btrfs/disk-io.c:3130
open_ctree+0x5824/0x5fe0 fs/btrfs/disk-io.c:3640
btrfs_fill_super fs/btrfs/super.c:987 [inline]
btrfs_get_tree_super fs/btrfs/super.c:1951 [inline]
btrfs_get_tree_subvol fs/btrfs/super.c:2094 [inline]
btrfs_get_tree+0x111c/0x2190 fs/btrfs/super.c:2128
vfs_get_tree+0x9a/0x370 fs/super.c:1758
fc_mount fs/namespace.c:1199 [inline]
do_new_mount_fc fs/namespace.c:3642 [inline]
do_new_mount fs/namespace.c:3718 [inline]
path_mount+0x5b8/0x1ea0 fs/namespace.c:4028
do_mount fs/namespace.c:4041 [inline]
__do_sys_mount fs/namespace.c:4229 [inline]
__se_sys_mount fs/namespace.c:4206 [inline]
__x64_sys_mount+0x282/0x320 fs/namespace.c:4206
...
RIP: 0033:0x7f969c9a8fde
Code: 0f1f4000 48c7c2b0 fffffff7 d8648902 b8ffffff ffc3660f
---[ end trace 0000000000000000 ]---
The bug is reproducible on 7.0.0-rc2-next-20260310 with our dynamic
metadata fuzzing tool that corrupts btrfs metadata at runtime.
[CAUSE]
A non-zero drop_progress.objectid means an interrupted
btrfs_drop_snapshot() left a resume point on disk, and in that case
drop_level must be greater than 0 because the checkpoint is only
saved at internal node levels.
Although this invariant is enforced when the kernel writes the root
item, it is not validated when the root item is read back from disk.
That allows on-disk corruption to provide an invalid state with
drop_progress.objectid != 0 and drop_level == 0.
When relocation recovery later processes such a root item,
merge_reloc_root() reads drop_level and hits BUG_ON(level == 0). The
same invalid metadata can also trigger the corresponding BUG_ON() in
btrfs_drop_snapshot().
[FIX]
Fix this by validating the root_item invariant in tree-checker when
reading root items from disk: if drop_progress.objectid is non-zero,
drop_level must also be non-zero. Reject such malformed metadata with
-EUCLEAN before it reaches merge_reloc_root() or btrfs_drop_snapshot()
and triggers the BUG_ON.
After the fix, the same corruption is correctly rejected by tree-checker
and the BUG_ON is no longer triggered.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: ZhengYuan Huang <gality369@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit f9a4e3015db1aeafbef407650eb8555445ca943e ]
Currently our qgroup ioctls don't reserve any space, they just do a
transaction join, which does not reserve any space, neither for the quota
tree updates nor for the delayed refs generated when updating the quota
tree. The quota root uses the global block reserve, which is fine most of
the time since we don't expect a lot of updates to the quota root, or to
be too close to -ENOSPC such that other critical metadata updates need to
resort to the global reserve.
However this is not optimal, as not reserving proper space may result in a
transaction abort due to not reserving space for delayed refs and then
abusing the use of the global block reserve.
For example, the following reproducer (which is unlikely to model any
real world use case, but just to illustrate the problem), triggers such a
transaction abort due to -ENOSPC when running delayed refs:
$ cat test.sh
#!/bin/bash
DEV=/dev/nullb0
MNT=/mnt/nullb0
umount $DEV &> /dev/null
# Limit device to 1G so that it's much faster to reproduce the issue.
mkfs.btrfs -f -b 1G $DEV
mount -o commit=600 $DEV $MNT
fallocate -l 800M $MNT/filler
btrfs quota enable $MNT
for ((i = 1; i <= 400000; i++)); do
btrfs qgroup create 1/$i $MNT
done
umount $MNT
When running this, we can see in dmesg/syslog that a transaction abort
happened:
[436.490] BTRFS error (device nullb0): failed to run delayed ref for logical 30408704 num_bytes 16384 type 176 action 1 ref_mod 1: -28
[436.493] ------------[ cut here ]------------
[436.494] BTRFS: Transaction aborted (error -28)
[436.495] WARNING: fs/btrfs/extent-tree.c:2247 at btrfs_run_delayed_refs+0xd9/0x110 [btrfs], CPU#4: umount/2495372
[436.497] Modules linked in: btrfs loop (...)
[436.508] CPU: 4 UID: 0 PID: 2495372 Comm: umount Tainted: G W 6.19.0-rc8-btrfs-next-225+ #1 PREEMPT(full)
[436.510] Tainted: [W]=WARN
[436.511] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-0-gea1b7a073390-prebuilt.qemu.org 04/01/2014
[436.513] RIP: 0010:btrfs_run_delayed_refs+0xdf/0x110 [btrfs]
[436.514] Code: 0f 82 ea (...)
[436.518] RSP: 0018:ffffd511850b7d78 EFLAGS: 00010292
[436.519] RAX: 00000000ffffffe4 RBX: ffff8f120dad37e0 RCX: 0000000002040001
[436.520] RDX: 0000000000000002 RSI: 00000000ffffffe4 RDI: ffffffffc090fd80
[436.522] RBP: 0000000000000000 R08: 0000000000000001 R09: ffffffffc04d1867
[436.523] R10: ffff8f18dc1fffa8 R11: 0000000000000003 R12: ffff8f173aa89400
[436.524] R13: 0000000000000000 R14: ffff8f173aa89400 R15: 0000000000000000
[436.526] FS: 00007fe59045d840(0000) GS:ffff8f192e22e000(0000) knlGS:0000000000000000
[436.527] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[436.528] CR2: 00007fe5905ff2b0 CR3: 000000060710a002 CR4: 0000000000370ef0
[436.530] Call Trace:
[436.530] <TASK>
[436.530] btrfs_commit_transaction+0x73/0xc00 [btrfs]
[436.531] ? btrfs_attach_transaction_barrier+0x1e/0x70 [btrfs]
[436.532] sync_filesystem+0x7a/0x90
[436.533] generic_shutdown_super+0x28/0x180
[436.533] kill_anon_super+0x12/0x40
[436.534] btrfs_kill_super+0x12/0x20 [btrfs]
[436.534] deactivate_locked_super+0x2f/0xb0
[436.534] cleanup_mnt+0xea/0x180
[436.535] task_work_run+0x58/0xa0
[436.535] exit_to_user_mode_loop+0xed/0x480
[436.536] ? __x64_sys_umount+0x68/0x80
[436.536] do_syscall_64+0x2a5/0xf20
[436.537] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[436.537] RIP: 0033:0x7fe5906b6217
[436.538] Code: 0d 00 f7 (...)
[436.540] RSP: 002b:00007ffcd87a61f8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
[436.541] RAX: 0000000000000000 RBX: 00005618b9ecadc8 RCX: 00007fe5906b6217
[436.541] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00005618b9ecb100
[436.542] RBP: 0000000000000000 R08: 00007ffcd87a4fe0 R09: 00000000ffffffff
[436.544] R10: 0000000000000103 R11: 0000000000000246 R12: 00007fe59081626c
[436.544] R13: 00005618b9ecb100 R14: 0000000000000000 R15: 00005618b9ecacc0
[436.545] </TASK>
[436.545] ---[ end trace 0000000000000000 ]---
Fix this by changing the qgroup ioctls to use start transaction instead of
joining so that proper space is reserved for the delayed refs generated
for the updates to the quota root. This way we don't get any transaction
abort.
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 77603ab10429fe713a03345553ca8dbbfb1d91c6 ]
Shin'ichiro reported sporadic hangs when running generic/013 in our CI
system. When enabling lockdep, there is a lockdep splat when calling
btrfs_get_dev_zone_info_all_devices() in the mount path that can be
triggered by i.e. generic/013:
======================================================
WARNING: possible circular locking dependency detected
7.0.0-rc1+ #355 Not tainted
------------------------------------------------------
mount/1043 is trying to acquire lock:
ffff8881020b5470 (&vblk->vdev_mutex){+.+.}-{4:4}, at: virtblk_report_zones+0xda/0x430
but task is already holding lock:
ffff888102a738e0 (&fs_devs->device_list_mutex){+.+.}-{4:4}, at: btrfs_get_dev_zone_info_all_devices+0x45/0x90
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #4 (&fs_devs->device_list_mutex){+.+.}-{4:4}:
__mutex_lock+0xa3/0x1360
btrfs_create_pending_block_groups+0x1f4/0x9d0
__btrfs_end_transaction+0x3e/0x2e0
btrfs_zoned_reserve_data_reloc_bg+0x2f8/0x390
open_ctree+0x1934/0x23db
btrfs_get_tree.cold+0x105/0x26c
vfs_get_tree+0x28/0xb0
__do_sys_fsconfig+0x324/0x680
do_syscall_64+0x92/0x4f0
entry_SYSCALL_64_after_hwframe+0x76/0x7e
-> #3 (btrfs_trans_num_extwriters){++++}-{0:0}:
join_transaction+0xc2/0x5c0
start_transaction+0x17c/0xbc0
btrfs_zoned_reserve_data_reloc_bg+0x2b4/0x390
open_ctree+0x1934/0x23db
btrfs_get_tree.cold+0x105/0x26c
vfs_get_tree+0x28/0xb0
__do_sys_fsconfig+0x324/0x680
do_syscall_64+0x92/0x4f0
entry_SYSCALL_64_after_hwframe+0x76/0x7e
-> #2 (btrfs_trans_num_writers){++++}-{0:0}:
lock_release+0x163/0x4b0
__btrfs_end_transaction+0x1c7/0x2e0
btrfs_dirty_inode+0x6f/0xd0
touch_atime+0xe5/0x2c0
btrfs_file_mmap_prepare+0x65/0x90
__mmap_region+0x4b9/0xf00
mmap_region+0xf7/0x120
do_mmap+0x43d/0x610
vm_mmap_pgoff+0xd6/0x190
ksys_mmap_pgoff+0x7e/0xc0
do_syscall_64+0x92/0x4f0
entry_SYSCALL_64_after_hwframe+0x76/0x7e
-> #1 (&mm->mmap_lock){++++}-{4:4}:
__might_fault+0x68/0xa0
_copy_to_user+0x22/0x70
blkdev_copy_zone_to_user+0x22/0x40
virtblk_report_zones+0x282/0x430
blkdev_report_zones_ioctl+0xfd/0x130
blkdev_ioctl+0x20f/0x2c0
__x64_sys_ioctl+0x86/0xd0
do_syscall_64+0x92/0x4f0
entry_SYSCALL_64_after_hwframe+0x76/0x7e
-> #0 (&vblk->vdev_mutex){+.+.}-{4:4}:
__lock_acquire+0x1522/0x2680
lock_acquire+0xd5/0x2f0
__mutex_lock+0xa3/0x1360
virtblk_report_zones+0xda/0x430
blkdev_report_zones_cached+0x162/0x190
btrfs_get_dev_zones+0xdc/0x2e0
btrfs_get_dev_zone_info+0x219/0xe80
btrfs_get_dev_zone_info_all_devices+0x62/0x90
open_ctree+0x1200/0x23db
btrfs_get_tree.cold+0x105/0x26c
vfs_get_tree+0x28/0xb0
__do_sys_fsconfig+0x324/0x680
do_syscall_64+0x92/0x4f0
entry_SYSCALL_64_after_hwframe+0x76/0x7e
other info that might help us debug this:
Chain exists of:
&vblk->vdev_mutex --> btrfs_trans_num_extwriters --> &fs_devs->device_list_mutex
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(&fs_devs->device_list_mutex);
lock(btrfs_trans_num_extwriters);
lock(&fs_devs->device_list_mutex);
lock(&vblk->vdev_mutex);
*** DEADLOCK ***
3 locks held by mount/1043:
#0: ffff88811063e878 (&fc->uapi_mutex){+.+.}-{4:4}, at: __do_sys_fsconfig+0x2ae/0x680
#1: ffff88810cb9f0e8 (&type->s_umount_key#31/1){+.+.}-{4:4}, at: alloc_super+0xc0/0x3e0
#2: ffff888102a738e0 (&fs_devs->device_list_mutex){+.+.}-{4:4}, at: btrfs_get_dev_zone_info_all_devices+0x45/0x90
stack backtrace:
CPU: 2 UID: 0 PID: 1043 Comm: mount Not tainted 7.0.0-rc1+ #355 PREEMPT(full)
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.17.0-9.fc43 06/10/2025
Call Trace:
<TASK>
dump_stack_lvl+0x5b/0x80
print_circular_bug.cold+0x18d/0x1d8
check_noncircular+0x10d/0x130
__lock_acquire+0x1522/0x2680
? vmap_small_pages_range_noflush+0x3ef/0x820
lock_acquire+0xd5/0x2f0
? virtblk_report_zones+0xda/0x430
? lock_is_held_type+0xcd/0x130
__mutex_lock+0xa3/0x1360
? virtblk_report_zones+0xda/0x430
? virtblk_report_zones+0xda/0x430
? __pfx_copy_zone_info_cb+0x10/0x10
? virtblk_report_zones+0xda/0x430
virtblk_report_zones+0xda/0x430
? __pfx_copy_zone_info_cb+0x10/0x10
blkdev_report_zones_cached+0x162/0x190
? __pfx_copy_zone_info_cb+0x10/0x10
btrfs_get_dev_zones+0xdc/0x2e0
btrfs_get_dev_zone_info+0x219/0xe80
btrfs_get_dev_zone_info_all_devices+0x62/0x90
open_ctree+0x1200/0x23db
btrfs_get_tree.cold+0x105/0x26c
? rcu_is_watching+0x18/0x50
vfs_get_tree+0x28/0xb0
__do_sys_fsconfig+0x324/0x680
do_syscall_64+0x92/0x4f0
entry_SYSCALL_64_after_hwframe+0x76/0x7e
RIP: 0033:0x7f615e27a40e
RSP: 002b:00007fff11b18fb8 EFLAGS: 00000246 ORIG_RAX: 00000000000001af
RAX: ffffffffffffffda RBX: 000055572e92ab10 RCX: 00007f615e27a40e
RDX: 0000000000000000 RSI: 0000000000000006 RDI: 0000000000000003
RBP: 00007fff11b19100 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 000055572e92bc40 R14: 00007f615e3faa60 R15: 000055572e92bd08
</TASK>
Don't hold the device_list_mutex while calling into
btrfs_get_dev_zone_info() in btrfs_get_dev_zone_info_all_devices() to
mitigate the issue. This is safe, as no other thread can touch the device
list at the moment of execution.
Reported-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit b854e1c4eff3473b6d3a9ae74129ac5c48bc0b61 ]
When inactivating an inode with node-format extended attributes,
xfs_attr3_node_inactive() invalidates all child leaf/node blocks via
xfs_trans_binval(), but intentionally does not remove the corresponding
entries from their parent node blocks. The implicit assumption is that
xfs_attr_inactive() will truncate the entire attr fork to zero extents
afterwards, so log recovery will never reach the root node and follow
those stale pointers.
However, if a log shutdown occurs after the leaf/node block cancellations
commit but before the attr bmap truncation commits, this assumption
breaks. Recovery replays the attr bmap intact (the inode still has
attr fork extents), but suppresses replay of all cancelled leaf/node
blocks, maybe leaving them as stale data on disk. On the next mount,
xlog_recover_process_iunlinks() retries inactivation and attempts to
read the root node via the attr bmap. If the root node was not replayed,
reading the unreplayed root block triggers a metadata verification
failure immediately; if it was replayed, following its child pointers
to unreplayed child blocks triggers the same failure:
XFS (pmem0): Metadata corruption detected at
xfs_da3_node_read_verify+0x53/0x220, xfs_da3_node block 0x78
XFS (pmem0): Unmount and run xfs_repair
XFS (pmem0): First 128 bytes of corrupted metadata buffer:
00000000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00000020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00000030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00000040: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00000050: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00000060: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00000070: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
XFS (pmem0): metadata I/O error in "xfs_da_read_buf+0x104/0x190" at daddr 0x78 len 8 error 117
Fix this in two places:
In xfs_attr3_node_inactive(), after calling xfs_trans_binval() on a
child block, immediately remove the entry that references it from the
parent node in the same transaction. This eliminates the window where
the parent holds a pointer to a cancelled block. Once all children are
removed, the now-empty root node is converted to a leaf block within the
same transaction. This node-to-leaf conversion is necessary for crash
safety. If the system shutdown after the empty node is written to the
log but before the second-phase bmap truncation commits, log recovery
will attempt to verify the root block on disk. xfs_da3_node_verify()
does not permit a node block with count == 0; such a block will fail
verification and trigger a metadata corruption shutdown. on the other
hand, leaf blocks are allowed to have this transient state.
In xfs_attr_inactive(), split the attr fork truncation into two explicit
phases. First, truncate all extents beyond the root block (the child
extents whose parent references have already been removed above).
Second, invalidate the root block and truncate the attr bmap to zero in
a single transaction. The two operations in the second phase must be
atomic: as long as the attr bmap has any non-zero length, recovery can
follow it to the root block, so the root block invalidation must commit
together with the bmap-to-zero truncation.
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Cc: stable@vger.kernel.org
Signed-off-by: Long Li <leo.lilong@huawei.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit e65bb55d7f8c2041c8fdb73cd29b0b4cad4ed847 ]
Factor out wrapper xfs_attr3_leaf_init function, which exported for
external use.
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Long Li <leo.lilong@huawei.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
Stable-dep-of: b854e1c4eff3 ("xfs: close crash window in attr dabtree inactivation")
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit ce4e789cf3561c9fac73cc24445bfed9ea0c514b ]
Factor out wrapper xfs_attr3_node_entry_remove function, which
exported for external use.
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Long Li <leo.lilong@huawei.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
Stable-dep-of: b854e1c4eff3 ("xfs: close crash window in attr dabtree inactivation")
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit e942498385bf80f4d6d075b47174035545eb6a2e upstream.
The assertion functions properly because we currently only truncate the
attr to a zero size. Any other new size of the attr is not preempted.
Make this assertion is specific to the datafork, preparing for
subsequent patches to truncate the attribute to a non-zero size.
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Long Li <leo.lilong@huawei.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 519b76ac0b31d86b45784735d4ef964e8efdc56b ]
Now, only EXT4_KUNIT_TESTS=Y testcase will be compiled in 'mballoc.c'.
To solve this issue, the ext4 test code needs to be decoupled. The ext4
test module is compiled into a separate module.
Reported-by: ChenXiaoSong <chenxiaosong@kylinos.cn>
Closes: https://patchwork.kernel.org/project/cifs-client/patch/20260118091313.1988168-2-chenxiaosong.chenxiaosong@linux.dev/
Fixes: 7c9fa399a369 ("ext4: add first unit test for ext4_mb_new_blocks_simple in mballoc")
Signed-off-by: Ye Bin <yebin10@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20260314075258.1317579-3-yebin@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 49504a512587147dd6da3b4b08832ccc157b97dc ]
Introduce EXPORT_SYMBOL_FOR_EXT4_TEST() helper for kuint test.
Signed-off-by: Ye Bin <yebin10@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20260314075258.1317579-2-yebin@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Stable-dep-of: 519b76ac0b31 ("ext4: fix mballoc-test.c is not compiled when EXT4_KUNIT_TESTS=M")
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 0e764b9d46071668969410ec5429be0e2f38c6d3 ]
The netfs_io_stream::front member is meant to point to the subrequest
currently being collected on a stream, but it isn't actually used this way
by direct write (which mostly ignores it). However, there's a tracepoint
which looks at it. Further, stream->front is actually redundant with
stream->subrequests.next.
Fix the potential problem in the direct code by just removing the member
and using stream->subrequests.next instead, thereby also simplifying the
code.
Fixes: a0b4c7a49137 ("netfs: Fix unbuffered/DIO writes to dispatch subrequests in strict sequence")
Reported-by: Paulo Alcantara <pc@manguebit.org>
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://patch.msgid.link/4158599.1774426817@warthog.procyon.org.uk
Reviewed-by: Paulo Alcantara (Red Hat) <pc@manguebit.org>
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 1c37d896b12dfd0d4c96e310b0033c6676933917 ]
Whenever we get an error updating the device stats item for a device in
btrfs_run_dev_stats() we allow the loop to go to the next device, and if
updating the stats item for the next device succeeds, we end up losing
the error we had from the previous device.
Fix this by breaking out of the loop once we get an error and make sure
it's returned to the caller. Since we are in the transaction commit path
(and in the critical section actually), returning the error will result
in a transaction abort.
Fixes: 733f4fbbc108 ("Btrfs: read device stats on mount, write modified ones during commit")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit a4376d9a5d4c9610e69def3fc0b32c86a7ab7a41 ]
When create_space_info_sub_group() allocates elements of
space_info->sub_group[], kobject_init_and_add() is called for each
element via btrfs_sysfs_add_space_info_type(). However, when
check_removing_space_info() frees these elements, it does not call
btrfs_sysfs_remove_space_info() on them. As a result, kobject_put() is
not called and the associated kobj->name objects are leaked.
This memory leak is reproduced by running the blktests test case
zbd/009 on kernels built with CONFIG_DEBUG_KMEMLEAK. The kmemleak
feature reports the following error:
unreferenced object 0xffff888112877d40 (size 16):
comm "mount", pid 1244, jiffies 4294996972
hex dump (first 16 bytes):
64 61 74 61 2d 72 65 6c 6f 63 00 c4 c6 a7 cb 7f data-reloc......
backtrace (crc 53ffde4d):
__kmalloc_node_track_caller_noprof+0x619/0x870
kstrdup+0x42/0xc0
kobject_set_name_vargs+0x44/0x110
kobject_init_and_add+0xcf/0x150
btrfs_sysfs_add_space_info_type+0xfc/0x210 [btrfs]
create_space_info_sub_group.constprop.0+0xfb/0x1b0 [btrfs]
create_space_info+0x211/0x320 [btrfs]
btrfs_init_space_info+0x15a/0x1b0 [btrfs]
open_ctree+0x33c7/0x4a50 [btrfs]
btrfs_get_tree.cold+0x9f/0x1ee [btrfs]
vfs_get_tree+0x87/0x2f0
vfs_cmd_create+0xbd/0x280
__do_sys_fsconfig+0x3df/0x990
do_syscall_64+0x136/0x1540
entry_SYSCALL_64_after_hwframe+0x76/0x7e
To avoid the leak, call btrfs_sysfs_remove_space_info() instead of
kfree() for the elements.
Fixes: f92ee31e031c ("btrfs: introduce btrfs_space_info sub-group")
Link: https://lore.kernel.org/linux-block/b9488881-f18d-4f47-91a5-3c9bf63955a5@wdc.com/
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit b52fe51f724385b3ed81e37e510a4a33107e8161 ]
Fix the superblock offset mismatch error message in
btrfs_validate_super(): we changed it so that it considers all the
superblocks, but the message still assumes we're only looking at the
first one.
The change from %u to %llu is because we're changing from a constant to
a u64.
Fixes: 069ec957c35e ("btrfs: Refactor btrfs_check_super_valid")
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Mark Harmstone <mark@harmstone.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 7e57523490cd2efb52b1ea97f2e0a74c0fb634cd ]
Under certain circumstances, all the remaining subrequests from a read
request will get abandoned during retry. The abandonment process expects
the 'subreq' variable to be set to the place to start abandonment from, but
it doesn't always have a useful value (it will be uninitialised on the
first pass through the loop and it may point to a deleted subrequest on
later passes).
Fix the first jump to "abandon:" to set subreq to the start of the first
subrequest expected to need retry (which, in this abandonment case, turned
out unexpectedly to no longer have NEED_RETRY set).
Also clear the subreq pointer after discarding superfluous retryable
subrequests to cause an oops if we do try to access it.
Fixes: ee4cdf7ba857 ("netfs: Speed up buffered reading")
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://patch.msgid.link/3775287.1773848338@warthog.procyon.org.uk
Reviewed-by: Paulo Alcantara (Red Hat) <pc@manguebit.org>
cc: Paulo Alcantara <pc@manguebit.org>
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit e9075e420a1eb3b52c60f3b95893a55e77419ce8 ]
When a write subrequest is marked NETFS_SREQ_NEED_RETRY, the retry path
in netfs_unbuffered_write() unconditionally calls stream->prepare_write()
without checking if it is NULL.
Filesystems such as 9P do not set the prepare_write operation, so
stream->prepare_write remains NULL. When get_user_pages() fails with
-EFAULT and the subrequest is flagged for retry, this results in a NULL
pointer dereference at fs/netfs/direct_write.c:189.
Fix this by mirroring the pattern already used in write_retry.c: if
stream->prepare_write is NULL, skip renegotiation and directly reissue
the subrequest via netfs_reissue_write(), which handles iterator reset,
IN_PROGRESS flag, stats update and reissue internally.
Fixes: a0b4c7a49137 ("netfs: Fix unbuffered/DIO writes to dispatch subrequests in strict sequence")
Reported-by: syzbot+7227db0fbac9f348dba0@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=7227db0fbac9f348dba0
Signed-off-by: Deepanshu Kartikey <Kartikey406@gmail.com>
Link: https://patch.msgid.link/20260307043947.347092-1-kartikey406@gmail.com
Tested-by: syzbot+7227db0fbac9f348dba0@syzkaller.appspotmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 67e467a11f62ff64ad219dc6aa5459e132c79d14 ]
When a process crashes and the kernel writes a core dump to a 9P
filesystem, __kernel_write() creates an ITER_KVEC iterator. This
iterator reaches netfs_limit_iter() via netfs_unbuffered_write(), which
only handles ITER_FOLIOQ, ITER_BVEC and ITER_XARRAY iterator types,
hitting the BUG() for any other type.
Fix this by adding netfs_limit_kvec() following the same pattern as
netfs_limit_bvec(), since both kvec and bvec are simple segment arrays
with pointer and length fields. Dispatch it from netfs_limit_iter() when
the iterator type is ITER_KVEC.
Fixes: cae932d3aee5 ("netfs: Add func to calculate pagecount/size-limited span of an iterator")
Reported-by: syzbot+9c058f0d63475adc97fd@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=9c058f0d63475adc97fd
Tested-by: syzbot+9c058f0d63475adc97fd@syzkaller.appspotmail.com
Signed-off-by: Deepanshu Kartikey <Kartikey406@gmail.com>
Link: https://patch.msgid.link/20260307090041.359870-1-kartikey406@gmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 48623ec358c1c600fa1e38368746f933e0f1a617 ]
smb_grant_oplock() has two issues in the oplock publication sequence:
1) opinfo is linked into ci->m_op_list (via opinfo_add) before
add_lease_global_list() is called. If add_lease_global_list()
fails (kmalloc returns NULL), the error path frees the opinfo
via __free_opinfo() while it is still linked in ci->m_op_list.
Concurrent m_op_list readers (opinfo_get_list, or direct iteration
in smb_break_all_levII_oplock) dereference the freed node.
2) opinfo->o_fp is assigned after add_lease_global_list() publishes
the opinfo on the global lease list. A concurrent
find_same_lease_key() can walk the lease list and dereference
opinfo->o_fp->f_ci while o_fp is still NULL.
Fix by restructuring the publication sequence to eliminate post-publish
failure:
- Set opinfo->o_fp before any list publication (fixes NULL deref).
- Preallocate lease_table via alloc_lease_table() before opinfo_add()
so add_lease_global_list() becomes infallible after publication.
- Keep the original m_op_list publication order (opinfo_add before
lease list) so concurrent opens via same_client_has_lease() and
opinfo_get_list() still see the in-flight grant.
- Use opinfo_put() instead of __free_opinfo() on err_out so that
the RCU-deferred free path is used.
This also requires splitting add_lease_global_list() to take a
preallocated lease_table and changing its return type from int to void,
since it can no longer fail.
Fixes: 1dfd062caa16 ("ksmbd: fix use-after-free by using call_rcu() for oplock_info")
Cc: stable@vger.kernel.org
Signed-off-by: Werner Kasselman <werner@verivus.com>
Reviewed-by: ChenXiaoSong <chenxiaosong@kylinos.cn>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
[ adapted kmalloc_obj() macro to kmalloc(sizeof()) ]
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 9ee29d20aab228adfb02ca93f87fb53c56c2f3af upstream.
While reviewing recent ext4 patch[1], Sashiko raised the following
concern[2]:
> If the filesystem is initially mounted with the discard option,
> deleting files will populate sbi->s_discard_list and queue
> s_discard_work. If it is then remounted with nodiscard, the
> EXT4_MOUNT_DISCARD flag is cleared, but the pending s_discard_work is
> neither cancelled nor flushed.
[1] https://lore.kernel.org/r/20260319094545.19291-1-qiang.zhang@linux.dev/
[2] https://sashiko.dev/#/patchset/20260319094545.19291-1-qiang.zhang%40linux.dev
The concern was valid, but it had nothing to do with the patch[1].
One of the problems with Sashiko in its current (early) form is that
it will detect pre-existing issues and report it as a problem with the
patch that it is reviewing.
In practice, it would be hard to hit deliberately (unless you are a
malicious syzkaller fuzzer), since it would involve mounting the file
system with -o discard, and then deleting a large number of files,
remounting the file system with -o nodiscard, and then immediately
unmounting the file system before the queued discard work has a change
to drain on its own.
Fix it because it's a real bug, and to avoid Sashiko from raising this
concern when analyzing future patches to mballoc.c.
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Fixes: 55cdd0af2bc5 ("ext4: get discard out of jbd2 commit kthread contex")
Cc: stable@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit ec0a7500d8eace5b4f305fa0c594dd148f0e8d29 upstream.
During code review, Joseph found that ext4_fc_replay_inode() calls
ext4_get_fc_inode_loc() to get the inode location, which holds a
reference to iloc.bh that must be released via brelse().
However, several error paths jump to the 'out' label without
releasing iloc.bh:
- ext4_handle_dirty_metadata() failure
- sync_dirty_buffer() failure
- ext4_mark_inode_used() failure
- ext4_iget() failure
Fix this by introducing an 'out_brelse' label placed just before
the existing 'out' label to ensure iloc.bh is always released.
Additionally, make ext4_fc_replay_inode() propagate errors
properly instead of always returning 0.
Reported-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Fixes: 8016e29f4362 ("ext4: fast commit recovery path")
Signed-off-by: Baokun Li <libaokun@linux.alibaba.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20260323060836.3452660-1-libaokun@linux.alibaba.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit bb81702370fad22c06ca12b6e1648754dbc37e0f upstream.
Commit 4865c768b563 ("ext4: always allocate blocks only from groups
inode can use") restricts what blocks will be allocated for indirect
block based files to block numbers that fit within 32-bit block
numbers.
However, when using a review bot running on the latest Gemini LLM to
check this commit when backporting into an LTS based kernel, it raised
this concern:
If ac->ac_g_ex.fe_group is >= ngroups (for instance, if the goal
group was populated via stream allocation from s_mb_last_groups),
then start will be >= ngroups.
Does this allow allocating blocks beyond the 32-bit limit for
indirect block mapped files? The commit message mentions that
ext4_mb_scan_groups_linear() takes care to not select unsupported
groups. However, its loop uses group = *start, and the very first
iteration will call ext4_mb_scan_group() with this unsupported
group because next_linear_group() is only called at the end of the
iteration.
After reviewing the code paths involved and considering the LLM
review, I determined that this can happen when there is a file system
where some files/directories are extent-mapped and others are
indirect-block mapped. To address this, add a safety clamp in
ext4_mb_scan_groups().
Fixes: 4865c768b563 ("ext4: always allocate blocks only from groups inode can use")
Cc: Jan Kara <jack@suse.cz>
Reviewed-by: Baokun Li <libaokun@linux.alibaba.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Link: https://patch.msgid.link/20260326045834.1175822-1-tytso@mit.edu
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 496bb99b7e66f48b178126626f47e9ba79e2d0fa upstream.
Use the kvfree() in the RCU read critical section can trigger
the following warnings:
EXT4-fs (vdb): unmounting filesystem cd983e5b-3c83-4f5a-a136-17b00eb9d018.
WARNING: suspicious RCU usage
./include/linux/rcupdate.h:409 Illegal context switch in RCU read-side critical section!
other info that might help us debug this:
rcu_scheduler_active = 2, debug_locks = 1
Call Trace:
<TASK>
dump_stack_lvl+0xbb/0xd0
dump_stack+0x14/0x20
lockdep_rcu_suspicious+0x15a/0x1b0
__might_resched+0x375/0x4d0
? put_object.part.0+0x2c/0x50
__might_sleep+0x108/0x160
vfree+0x58/0x910
? ext4_group_desc_free+0x27/0x270
kvfree+0x23/0x40
ext4_group_desc_free+0x111/0x270
ext4_put_super+0x3c8/0xd40
generic_shutdown_super+0x14c/0x4a0
? __pfx_shrinker_free+0x10/0x10
kill_block_super+0x40/0x90
ext4_kill_sb+0x6d/0xb0
deactivate_locked_super+0xb4/0x180
deactivate_super+0x7e/0xa0
cleanup_mnt+0x296/0x3e0
__cleanup_mnt+0x16/0x20
task_work_run+0x157/0x250
? __pfx_task_work_run+0x10/0x10
? exit_to_user_mode_loop+0x6a/0x550
exit_to_user_mode_loop+0x102/0x550
do_syscall_64+0x44a/0x500
entry_SYSCALL_64_after_hwframe+0x77/0x7f
</TASK>
BUG: sleeping function called from invalid context at mm/vmalloc.c:3441
in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 556, name: umount
preempt_count: 1, expected: 0
CPU: 3 UID: 0 PID: 556 Comm: umount
Call Trace:
<TASK>
dump_stack_lvl+0xbb/0xd0
dump_stack+0x14/0x20
__might_resched+0x275/0x4d0
? put_object.part.0+0x2c/0x50
__might_sleep+0x108/0x160
vfree+0x58/0x910
? ext4_group_desc_free+0x27/0x270
kvfree+0x23/0x40
ext4_group_desc_free+0x111/0x270
ext4_put_super+0x3c8/0xd40
generic_shutdown_super+0x14c/0x4a0
? __pfx_shrinker_free+0x10/0x10
kill_block_super+0x40/0x90
ext4_kill_sb+0x6d/0xb0
deactivate_locked_super+0xb4/0x180
deactivate_super+0x7e/0xa0
cleanup_mnt+0x296/0x3e0
__cleanup_mnt+0x16/0x20
task_work_run+0x157/0x250
? __pfx_task_work_run+0x10/0x10
? exit_to_user_mode_loop+0x6a/0x550
exit_to_user_mode_loop+0x102/0x550
do_syscall_64+0x44a/0x500
entry_SYSCALL_64_after_hwframe+0x77/0x7f
The above scenarios occur in initialization failures and teardown
paths, there are no parallel operations on the resources released
by kvfree(), this commit therefore remove rcu_read_lock/unlock() and
use rcu_access_pointer() instead of rcu_dereference() operations.
Fixes: 7c990728b99e ("ext4: fix potential race between s_flex_groups online resizing and access")
Fixes: df3da4ea5a0f ("ext4: fix potential race between s_group_info online resizing and access")
Signed-off-by: Zqiang <qiang.zhang@linux.dev>
Reviewed-by: Baokun Li <libaokun@linux.alibaba.com>
Link: https://patch.msgid.link/20260319094545.19291-1-qiang.zhang@linux.dev
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit d15e4b0a418537aafa56b2cb80d44add83e83697 upstream.
Commit b98535d09179 ("ext4: fix bug_on in start_this_handle during umount
filesystem") moved ext4_unregister_sysfs() before flushing s_sb_upd_work
to prevent new error work from being queued via /proc/fs/ext4/xx/mb_groups
reads during unmount. However, this introduced a use-after-free because
update_super_work calls ext4_notify_error_sysfs() -> sysfs_notify() which
accesses the kobject's kernfs_node after it has been freed by kobject_del()
in ext4_unregister_sysfs():
update_super_work ext4_put_super
----------------- --------------
ext4_unregister_sysfs(sb)
kobject_del(&sbi->s_kobj)
__kobject_del()
sysfs_remove_dir()
kobj->sd = NULL
sysfs_put(sd)
kernfs_put() // RCU free
ext4_notify_error_sysfs(sbi)
sysfs_notify(&sbi->s_kobj)
kn = kobj->sd // stale pointer
kernfs_get(kn) // UAF on freed kernfs_node
ext4_journal_destroy()
flush_work(&sbi->s_sb_upd_work)
Instead of reordering the teardown sequence, fix this by making
ext4_notify_error_sysfs() detect that sysfs has already been torn down
by checking s_kobj.state_in_sysfs, and skipping the sysfs_notify() call
in that case. A dedicated mutex (s_error_notify_mutex) serializes
ext4_notify_error_sysfs() against kobject_del() in ext4_unregister_sysfs()
to prevent TOCTOU races where the kobject could be deleted between the
state_in_sysfs check and the sysfs_notify() call.
Fixes: b98535d09179 ("ext4: fix bug_on in start_this_handle during umount filesystem")
Cc: Jiayuan Chen <jiayuan.chen@linux.dev>
Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jiayuan Chen <jiayuan.chen@shopee.com>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20260319120336.157873-1-jiayuan.chen@linux.dev
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 3822743dc20386d9897e999dbb990befa3a5b3f8 upstream.
bigalloc with s_first_data_block != 0 is not supported, reject mounting
it.
Signed-off-by: Helen Koike <koike@igalia.com>
Suggested-by: Theodore Ts'o <tytso@mit.edu>
Reported-by: syzbot+b73703b873a33d8eb8f6@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=b73703b873a33d8eb8f6
Link: https://patch.msgid.link/20260317142325.135074-1-koike@igalia.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 46066e3a06647c5b186cc6334409722622d05c44 upstream.
There's issue as follows:
...
EXT4-fs (mmcblk0p1): Delayed block allocation failed for inode 206 at logical offset 0 with max blocks 1 with error 117
EXT4-fs (mmcblk0p1): This should not happen!! Data will be lost
EXT4-fs (mmcblk0p1): Delayed block allocation failed for inode 206 at logical offset 0 with max blocks 1 with error 117
EXT4-fs (mmcblk0p1): This should not happen!! Data will be lost
EXT4-fs (mmcblk0p1): Delayed block allocation failed for inode 206 at logical offset 0 with max blocks 1 with error 117
EXT4-fs (mmcblk0p1): This should not happen!! Data will be lost
EXT4-fs (mmcblk0p1): Delayed block allocation failed for inode 206 at logical offset 0 with max blocks 1 with error 117
EXT4-fs (mmcblk0p1): This should not happen!! Data will be lost
EXT4-fs (mmcblk0p1): Delayed block allocation failed for inode 2243 at logical offset 0 with max blocks 1 with error 117
EXT4-fs (mmcblk0p1): This should not happen!! Data will be lost
EXT4-fs (mmcblk0p1): Delayed block allocation failed for inode 2239 at logical offset 0 with max blocks 1 with error 117
EXT4-fs (mmcblk0p1): This should not happen!! Data will be lost
EXT4-fs (mmcblk0p1): error count since last fsck: 1
EXT4-fs (mmcblk0p1): initial error at time 1765597433: ext4_mb_generate_buddy:760
EXT4-fs (mmcblk0p1): last error at time 1765597433: ext4_mb_generate_buddy:760
...
According to the log analysis, blocks are always requested from the
corrupted block group. This may happen as follows:
ext4_mb_find_by_goal
ext4_mb_load_buddy
ext4_mb_load_buddy_gfp
ext4_mb_init_cache
ext4_read_block_bitmap_nowait
ext4_wait_block_bitmap
ext4_validate_block_bitmap
if (!grp || EXT4_MB_GRP_BBITMAP_CORRUPT(grp))
return -EFSCORRUPTED; // There's no logs.
if (err)
return err; // Will return error
ext4_lock_group(ac->ac_sb, group);
if (unlikely(EXT4_MB_GRP_BBITMAP_CORRUPT(e4b->bd_info))) // Unreachable
goto out;
After commit 9008a58e5dce ("ext4: make the bitmap read routines return
real error codes") merged, Commit 163a203ddb36 ("ext4: mark block group
as corrupt on block bitmap error") is no real solution for allocating
blocks from corrupted block groups. This is because if
'EXT4_MB_GRP_BBITMAP_CORRUPT(e4b->bd_info)' is true, then
'ext4_mb_load_buddy()' may return an error. This means that the block
allocation will fail.
Therefore, check block group if corrupted when ext4_mb_load_buddy()
returns error.
Fixes: 163a203ddb36 ("ext4: mark block group as corrupt on block bitmap error")
Fixes: 9008a58e5dce ("ext4: make the bitmap read routines return real error codes")
Signed-off-by: Ye Bin <yebin10@huawei.com>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20260302134619.3145520-1-yebin@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 5422fe71d26d42af6c454ca9527faaad4e677d6c upstream.
On the mkdir/mknod path, when mapping logical blocks to physical blocks,
if inserting a new extent into the extent tree fails (in this example,
because the file system disabled the huge file feature when marking the
inode as dirty), ext4_ext_map_blocks() only calls ext4_free_blocks() to
reclaim the physical block without deleting the corresponding data in
the extent tree. This causes subsequent mkdir operations to reference
the previously reclaimed physical block number again, even though this
physical block is already being used by the xattr block. Therefore, a
situation arises where both the directory and xattr are using the same
buffer head block in memory simultaneously.
The above causes ext4_xattr_block_set() to enter an infinite loop about
"inserted" and cannot release the inode lock, ultimately leading to the
143s blocking problem mentioned in [1].
If the metadata is corrupted, then trying to remove some extent space
can do even more harm. Also in case EXT4_GET_BLOCKS_DELALLOC_RESERVE
was passed, remove space wrongly update quota information.
Jan Kara suggests distinguishing between two cases:
1) The error is ENOSPC or EDQUOT - in this case the filesystem is fully
consistent and we must maintain its consistency including all the
accounting. However these errors can happen only early before we've
inserted the extent into the extent tree. So current code works correctly
for this case.
2) Some other error - this means metadata is corrupted. We should strive to
do as few modifications as possible to limit damage. So I'd just skip
freeing of allocated blocks.
[1]
INFO: task syz.0.17:5995 blocked for more than 143 seconds.
Call Trace:
inode_lock_nested include/linux/fs.h:1073 [inline]
__start_dirop fs/namei.c:2923 [inline]
start_dirop fs/namei.c:2934 [inline]
Reported-by: syzbot+512459401510e2a9a39f@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=1659aaaaa8d9d11265d7
Tested-by: syzbot+1659aaaaa8d9d11265d7@syzkaller.appspotmail.com
Reported-by: syzbot+1659aaaaa8d9d11265d7@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=512459401510e2a9a39f
Tested-by: syzbot+1659aaaaa8d9d11265d7@syzkaller.appspotmail.com
Signed-off-by: Edward Adam Davis <eadavis@qq.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Tested-by: syzbot+512459401510e2a9a39f@syzkaller.appspotmail.com
Link: https://patch.msgid.link/tencent_43696283A68450B761D76866C6F360E36705@qq.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 2acb5c12ebd860f30e4faf67e6cc8c44ddfe5fe8 upstream.
ext4_ext_correct_indexes() walks up the extent tree correcting
index entries when the first extent in a leaf is modified. Before
accessing path[k].p_idx->ei_block, there is no validation that
p_idx falls within the valid range of index entries for that
level.
If the on-disk extent header contains a corrupted or crafted
eh_entries value, p_idx can point past the end of the allocated
buffer, causing a slab-out-of-bounds read.
Fix this by validating path[k].p_idx against EXT_LAST_INDEX() at
both access sites: before the while loop and inside it. Return
-EFSCORRUPTED if the index pointer is out of range, consistent
with how other bounds violations are handled in the ext4 extent
tree code.
Reported-by: syzbot+04c4e65cab786a2e5b7e@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=04c4e65cab786a2e5b7e
Signed-off-by: Tejas Bharambe <tejas.bharambe@outlook.com>
Link: https://patch.msgid.link/JH0PR06MB66326016F9B6AD24097D232B897CA@JH0PR06MB6632.apcprd06.prod.outlook.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 73bf12adbea10b13647864cd1c62410d19e21086 upstream.
The commit aa373cf55099 ("writeback: stop background/kupdate works from
livelocking other works") introduced an issue where unmounting a filesystem
in a multi-logical-partition scenario could lead to batch file data loss.
This problem was not fixed until the commit d92109891f21 ("fs/writeback:
bail out if there is no more inodes for IO and queued once"). It took
considerable time to identify the root cause. Additionally, in actual
production environments, we frequently encountered file data loss after
normal system reboots. Therefore, we are adding a check in the inode
release flow to verify whether all dirty pages have been flushed to disk,
in order to determine whether the data loss is caused by a logic issue in
the filesystem code.
Signed-off-by: Ye Bin <yebin10@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20260303012242.3206465-1-yebin@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 1aec30021edd410b986c156f195f3d23959a9d11 upstream.
ext4_inode_attach_jinode() publishes ei->jinode to concurrent users.
It used to set ei->jinode before jbd2_journal_init_jbd_inode(),
allowing a reader to observe a non-NULL jinode with i_vfs_inode
still unset.
The fast commit flush path can then pass this jinode to
jbd2_wait_inode_data(), which dereferences i_vfs_inode->i_mapping and
may crash.
Below is the crash I observe:
```
BUG: unable to handle page fault for address: 000000010beb47f4
PGD 110e51067 P4D 110e51067 PUD 0
Oops: Oops: 0000 [#1] SMP NOPTI
CPU: 1 UID: 0 PID: 4850 Comm: fc_fsync_bench_ Not tainted 6.18.0-00764-g795a690c06a5 #1 PREEMPT(voluntary)
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.17.0-2-2 04/01/2014
RIP: 0010:xas_find_marked+0x3d/0x2e0
Code: e0 03 48 83 f8 02 0f 84 f0 01 00 00 48 8b 47 08 48 89 c3 48 39 c6 0f 82 fd 01 00 00 48 85 c9 74 3d 48 83 f9 03 77 63 4c 8b 0f <49> 8b 71 08 48 c7 47 18 00 00 00 00 48 89 f1 83 e1 03 48 83 f9 02
RSP: 0018:ffffbbee806e7bf0 EFLAGS: 00010246
RAX: 000000000010beb4 RBX: 000000000010beb4 RCX: 0000000000000003
RDX: 0000000000000001 RSI: 0000002000300000 RDI: ffffbbee806e7c10
RBP: 0000000000000001 R08: 0000002000300000 R09: 000000010beb47ec
R10: ffff9ea494590090 R11: 0000000000000000 R12: 0000002000300000
R13: ffffbbee806e7c90 R14: ffff9ea494513788 R15: ffffbbee806e7c88
FS: 00007fc2f9e3e6c0(0000) GS:ffff9ea6b1444000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000000010beb47f4 CR3: 0000000119ac5000 CR4: 0000000000750ef0
PKRU: 55555554
Call Trace:
<TASK>
filemap_get_folios_tag+0x87/0x2a0
__filemap_fdatawait_range+0x5f/0xd0
? srso_alias_return_thunk+0x5/0xfbef5
? __schedule+0x3e7/0x10c0
? srso_alias_return_thunk+0x5/0xfbef5
? srso_alias_return_thunk+0x5/0xfbef5
? srso_alias_return_thunk+0x5/0xfbef5
? preempt_count_sub+0x5f/0x80
? srso_alias_return_thunk+0x5/0xfbef5
? cap_safe_nice+0x37/0x70
? srso_alias_return_thunk+0x5/0xfbef5
? preempt_count_sub+0x5f/0x80
? srso_alias_return_thunk+0x5/0xfbef5
filemap_fdatawait_range_keep_errors+0x12/0x40
ext4_fc_commit+0x697/0x8b0
? ext4_file_write_iter+0x64b/0x950
? srso_alias_return_thunk+0x5/0xfbef5
? preempt_count_sub+0x5f/0x80
? srso_alias_return_thunk+0x5/0xfbef5
? vfs_write+0x356/0x480
? srso_alias_return_thunk+0x5/0xfbef5
? preempt_count_sub+0x5f/0x80
ext4_sync_file+0xf7/0x370
do_fsync+0x3b/0x80
? syscall_trace_enter+0x108/0x1d0
__x64_sys_fdatasync+0x16/0x20
do_syscall_64+0x62/0x2c0
entry_SYSCALL_64_after_hwframe+0x76/0x7e
...
```
Fix this by initializing the jbd2_inode first.
Use smp_wmb() and WRITE_ONCE() to publish ei->jinode after
initialization. Readers use READ_ONCE() to fetch the pointer.
Fixes: a361293f5fede ("jbd2: Fix oops in jbd2_journal_file_inode()")
Cc: stable@vger.kernel.org
Signed-off-by: Li Chen <me@linux.beauty>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20260225082617.147957-1-me@linux.beauty
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 356227096eb66e41b23caf7045e6304877322edf upstream.
Replace BUG_ON() with proper error handling when inline data size
exceeds PAGE_SIZE. This prevents kernel panic and allows the system to
continue running while properly reporting the filesystem corruption.
The error is logged via ext4_error_inode(), the buffer head is released
to prevent memory leak, and -EFSCORRUPTED is returned to indicate
filesystem corruption.
Signed-off-by: Yuto Ohnuki <ytohnuki@amazon.com>
Link: https://patch.msgid.link/20260223123345.14838-2-ytohnuki@amazon.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit bd060afa7cc3e0ad30afa9ecc544a78638498555 upstream.
recently_deleted() checks whether inode has been used in the near past.
However this can give false positive result when inode table is not
initialized yet and we are in fact comparing to random garbage (or stale
itable block of a filesystem before mkfs). Ultimately this results in
uninitialized inodes being skipped during inode allocation and possibly
they are never initialized and thus e2fsck complains. Verify if the
inode has been initialized before checking for dtime.
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20260216164848.3074-3-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 1308255bbf8452762f89f44f7447ce137ecdbcff upstream.
When inode metadata is changed, we sometimes just call
ext4_mark_inode_dirty() to track modified metadata. This copies inode
metadata into block buffer which is enough when we are journalling
metadata. However when we are running in nojournal mode we currently
fail to write the dirtied inode buffer during fsync(2) because the inode
is not marked as dirty. Use explicit ext4_write_inode() call to make
sure the inode table buffer is written to the disk. This is a band aid
solution but proper solution requires a much larger rewrite including
changes in metadata bh tracking infrastructure.
Reported-by: Free Ekanayaka <free.ekanayaka@gmail.com>
Link: https://lore.kernel.org/all/87il8nhxdm.fsf@x1.mail-host-address-is-not-set/
CC: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20260216164848.3074-4-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 84e21e3fb8fd99ea460eb7274584750d11cf3e9f upstream.
Commit '5f920d5d6083 ("ext4: verify fast symlink length")' causes the
generic/475 test to fail during orphan cleanup of zero-length symlinks.
generic/475 84s ... _check_generic_filesystem: filesystem on /dev/vde is inconsistent
The fsck reports are provided below:
Deleted inode 9686 has zero dtime.
Deleted inode 158230 has zero dtime.
...
Inode bitmap differences: -9686 -158230
Orphan file (inode 12) block 13 is not clean.
Failed to initialize orphan file.
In ext4_symlink(), a newly created symlink can be added to the orphan
list due to ENOSPC. Its data has not been initialized, and its size is
zero. Therefore, we need to disregard the length check of the symbolic
link when cleaning up orphan inodes. Instead, we should ensure that the
nlink count is zero.
Fixes: 5f920d5d6083 ("ext4: verify fast symlink length")
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20260131091156.1733648-1-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit f4a2b42e78914ff15630e71289adc589c3a8eb45 upstream.
There are cases where ext4_bio_write_page() gets called for a page which
has no buffers to submit. This happens e.g. when the part of the file is
actually a hole, when we cannot allocate blocks due to being called from
jbd2, or in data=journal mode when checkpointing writes the buffers
earlier. In these cases we just return from ext4_bio_write_page()
however if the page didn't need redirtying, we will leave stale DIRTY
and/or TOWRITE tags in xarray because those get cleared only in
__folio_start_writeback(). As a result we can leave these tags set in
mappings even after a final sync on filesystem that's getting remounted
read-only or that's being frozen. Various assertions can then get upset
when writeback is started on such filesystems (Gerald reported assertion
in ext4_journal_check_start() firing).
Fix the problem by cycling the page through writeback state even if we
decide nothing needs to be written for it so that xarray tags get
properly updated. This is slightly silly (we could update the xarray
tags directly) but I don't think a special helper messing with xarray
tags is really worth it in this relatively rare corner case.
Reported-by: Gerald Yang <gerald.yang@canonical.com>
Link: https://lore.kernel.org/all/20260128074515.2028982-1-gerald.yang@canonical.com
Fixes: dff4ac75eeee ("ext4: move keep_towrite handling to ext4_bio_write_page()")
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20260205092223.21287-2-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit ed9356a30e59c7cc3198e7fc46cfedf3767b9b17 upstream.
Add a check in ext4_setattr() to convert files from inline data storage
to extent-based storage when truncate() grows the file size beyond the
inline capacity. This prevents the filesystem from entering an
inconsistent state where the inline data flag is set but the file size
exceeds what can be stored inline.
Without this fix, the following sequence causes a kernel BUG_ON():
1. Mount filesystem with inode that has inline flag set and small size
2. truncate(file, 50MB) - grows size but inline flag remains set
3. sendfile() attempts to write data
4. ext4_write_inline_data() hits BUG_ON(write_size > inline_capacity)
The crash occurs because ext4_write_inline_data() expects inline storage
to accommodate the write, but the actual inline capacity (~60 bytes for
i_block + ~96 bytes for xattrs) is far smaller than the file size and
write request.
The fix checks if the new size from setattr exceeds the inode's actual
inline capacity (EXT4_I(inode)->i_inline_size) and converts the file to
extent-based storage before proceeding with the size change.
This addresses the root cause by ensuring the inline data flag and file
size remain consistent during truncate operations.
Reported-by: syzbot+7de5fe447862fc37576f@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=7de5fe447862fc37576f
Tested-by: syzbot+7de5fe447862fc37576f@syzkaller.appspotmail.com
Signed-off-by: Deepanshu Kartikey <Kartikey406@gmail.com>
Link: https://patch.msgid.link/20260207043607.1175976-1-kartikey406@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit b1d682f1990c19fb1d5b97d13266210457092bcd upstream.
Fix an issue arising when ext4 features has_journal, ea_inode, and encrypt
are activated simultaneously, leading to ENOSPC when creating an encrypted
file.
Fix by passing XATTR_CREATE flag to xattr_set_handle function if a handle
is specified, i.e., when the function is called in the control flow of
creating a new inode. This aligns the number of jbd2 credits set_handle
checks for with the number allocated for creating a new inode.
ext4_set_context must not be called with a non-null handle (fs_data) if
fscrypt context xattr is not guaranteed to not exist yet. The only other
usage of this function currently is when handling the ioctl
FS_IOC_SET_ENCRYPTION_POLICY, which calls it with fs_data=NULL.
Fixes: c1a5d5f6ab21eb7e ("ext4: improve journal credit handling in set xattr paths")
Co-developed-by: Anthony Durrer <anthonydev@fastmail.com>
Signed-off-by: Anthony Durrer <anthonydev@fastmail.com>
Signed-off-by: Simon Weber <simon.weber.39@gmail.com>
Reviewed-by: Eric Biggers <ebiggers@kernel.org>
Link: https://patch.msgid.link/20260207100148.724275-4-simon.weber.39@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit e31c53a8060e134111ed095783fee0aa0c43b080 upstream.
The xfile/xmbuf shmem file descriptions are no longer as detailed as
they were when online fsck was first merged, because moving to static
strings in commit 60382993a2e180 ("xfs: get rid of the
xchk_xfile_*_descr calls") removed a memory allocation and hence a
source of failure.
However this makes encoding the description in the tracepoints sort of a
waste of memory. David Laight also points out that file_path doesn't
zero the whole buffer which causes exposure of stale trace bytes, and
Steven Rostedt wonders why we're not using a dynamic array for the file
path.
I don't think this is worth fixing, so let's just rip it out.
Cc: rostedt@goodmis.org
Cc: david.laight.linux@gmail.com
Link: https://lore.kernel.org/linux-xfs/20260323172204.work.979-kees@kernel.org/
Cc: stable@vger.kernel.org # v6.11
Fixes: 19ebc8f84ea12e ("xfs: fix file_path handling in tracepoints")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 70685c291ef82269180758130394ecdc4496b52c upstream.
xlog_recovery_iget* never set @ip to a valid pointer if they return
an error, so this irele will walk off a dangling pointer. Fix that.
Cc: stable@vger.kernel.org # v6.10
Fixes: ae673f534a3097 ("xfs: record inode generation in xattr update log intent items")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Long Li <leo.lilong@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit d72f2084e30966097c8eae762e31986a33c3c0ae upstream.
The ri_total checks for SET/REPLACE operations are hardcoded to 3,
but xfs_attri_item_size() only emits a value iovec when value_len > 0,
so ri_total is 2 when value_len == 0.
For PPTR_SET/PPTR_REMOVE/PPTR_REPLACE, value_len is validated by
xfs_attri_validate() to be exactly sizeof(struct xfs_parent_rec) and
is never zero, so their hardcoded checks remain correct.
This problem may cause log recovery failures. The following script can be
used to reproduce the problem:
#!/bin/bash
mkfs.xfs -f /dev/sda
mount /dev/sda /mnt/test/
touch /mnt/test/file
for i in {1..200}; do
attr -s "user.attr_$i" -V "value_$i" /mnt/test/file > /dev/null
done
echo 1 > /sys/fs/xfs/debug/larp
echo 1 > /sys/fs/xfs/sda/errortag/larp
attr -s "user.zero" -V "" /mnt/test/file
echo 0 > /sys/fs/xfs/sda/errortag/larp
umount /mnt/test
mount /dev/sda /mnt/test/ # mount failed
Fix this by deriving the expected count dynamically as "2 + !!value_len"
for SET/REPLACE operations.
Cc: stable@vger.kernel.org # v6.9
Fixes: ad206ae50eca ("xfs: check opcode and iovec count match in xlog_recover_attri_commit_pass2")
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Long Li <leo.lilong@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 268378b6ad20569af0d1957992de1c8b16c6e900 upstream.
xchk_quota_item can return early after calling xchk_fblock_process_error.
When that helper returns false, the function returned immediately without
dropping dq->q_qlock, which can leave the dquot lock held and risk lock
leaks or deadlocks in later quota operations.
Fix this by unlocking dq->q_qlock before the early return.
Signed-off-by: hongao <hongao@uniontech.com>
Fixes: 7d1f0e167a067e ("xfs: check the ondisk space mapping behind a dquot")
Cc: <stable@vger.kernel.org> # v6.8
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 79ef34ec0554ec04bdbafafbc9836423734e1bd6 upstream.
After xfsaild_push_item() calls iop_push(), the log item may have been
freed if the AIL lock was dropped during the push. Background inode
reclaim or the dquot shrinker can free the log item while the AIL lock
is not held, and the tracepoints in the switch statement dereference
the log item after iop_push() returns.
Fix this by capturing the log item type, flags, and LSN before calling
xfsaild_push_item(), and introducing a new xfs_ail_push_class trace
event class that takes these pre-captured values and the ailp pointer
instead of the log item pointer.
Reported-by: syzbot+652af2b3c5569c4ab63c@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=652af2b3c5569c4ab63c
Fixes: 90c60e164012 ("xfs: xfs_iflush() is no longer necessary")
Cc: stable@vger.kernel.org # v5.9
Signed-off-by: Yuto Ohnuki <ytohnuki@amazon.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 394d70b86fae9fe865e7e6d9540b7696f73aa9b6 upstream.
In xfs_inode_item_push() and xfs_qm_dquot_logitem_push(), the AIL lock
is dropped to perform buffer IO. Once the cluster buffer no longer
protects the log item from reclaim, the log item may be freed by
background reclaim or the dquot shrinker. The subsequent spin_lock()
call dereferences lip->li_ailp, which is a use-after-free.
Fix this by saving the ailp pointer in a local variable while the AIL
lock is held and the log item is guaranteed to be valid.
Reported-by: syzbot+652af2b3c5569c4ab63c@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=652af2b3c5569c4ab63c
Fixes: 90c60e164012 ("xfs: xfs_iflush() is no longer necessary")
Cc: stable@vger.kernel.org # v5.9
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Yuto Ohnuki <ytohnuki@amazon.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 4f24a767e3d64a5f58c595b5c29b6063a201f1e3 upstream.
The unmount sequence in xfs_unmount_flush_inodes() pushed the AIL while
background reclaim and inodegc are still running. This is broken
independently of any use-after-free issues - background reclaim and
inodegc should not be running while the AIL is being pushed during
unmount, as inodegc can dirty and insert inodes into the AIL during the
flush, and background reclaim can race to abort and free dirty inodes.
Reorder xfs_unmount_flush_inodes() to stop inodegc and cancel background
reclaim before pushing the AIL. Stop inodegc before cancelling
m_reclaim_work because the inodegc worker can re-queue m_reclaim_work
via xfs_inodegc_set_reclaimable.
Reported-by: syzbot+652af2b3c5569c4ab63c@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=652af2b3c5569c4ab63c
Fixes: 90c60e164012 ("xfs: xfs_iflush() is no longer necessary")
Cc: stable@vger.kernel.org # v5.9
Signed-off-by: Yuto Ohnuki <ytohnuki@amazon.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit bd71fb3fea9945987053968f028a948997cba8cc upstream.
Commit aa35dd5cbc06 ("iomap: fix invalid folio access after
folio_end_read()") partially addressed invalid folio access for folios
without an ifs attached, but it did not handle the case where
1 << inode->i_blkbits matches the folio size but is different from the
granularity used for the IO, which means IO can be submitted for less
than the full folio for the !ifs case.
In this case, the condition:
if (*bytes_submitted == folio_len)
ctx->cur_folio = NULL;
in iomap_read_folio_iter() will not invalidate ctx->cur_folio, and
iomap_read_end() will still be called on the folio even though the IO
helper owns it and will finish the read on it.
Fix this by unconditionally invalidating ctx->cur_folio for the !ifs
case.
Reported-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/linux-fsdevel/b3dfe271-4e3d-4922-b618-e73731242bca@wdc.com/
Fixes: b2f35ac4146d ("iomap: add caller-provided callbacks for read and readahead")
Cc: stable@vger.kernel.org
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Link: https://patch.msgid.link/20260317203935.830549-1-joannelkoong@gmail.com
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit bac3190a8e79beff6ed221975e0c9b1b5f2a21da upstream.
This patch targets two internal state machine invariants in checkpoint.c
residing inside functions that natively return integer error codes.
- In jbd2_cleanup_journal_tail(): A blocknr of 0 indicates a severely
corrupted journal superblock. Replaced the J_ASSERT with a WARN_ON_ONCE
and a graceful journal abort, returning -EFSCORRUPTED.
- In jbd2_log_do_checkpoint(): Replaced the J_ASSERT_BH checking for
an unexpected buffer_jwrite state. If the warning triggers, we
explicitly drop the just-taken get_bh() reference and call __flush_batch()
to safely clean up any previously queued buffers in the j_chkpt_bhs array,
preventing a memory leak before returning -EFSCORRUPTED.
Signed-off-by: Milos Nikic <nikic.milos@gmail.com>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Baokun Li <libaokun@linux.alibaba.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20260311041548.159424-1-nikic.milos@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 53a7c171e9dd833f0a96b545adcb89bd57387239 upstream.
The implicit FILEID_INO32_GEN encoder was changed to be explicit,
so we need to fix the detection.
When mounting overlayfs with upperdir and lowerdir on different ext4
filesystems, the expected kmsg log is:
overlayfs: "xino" feature enabled using 32 upper inode bits.
But instead, since the regressing commit, the kmsg log was:
overlayfs: "xino" feature enabled using 2 upper inode bits.
Fixes: e21fc2038c1b9 ("exportfs: make ->encode_fh() a mandatory method for NFS export")
Cc: stable@vger.kernel.org # v6.7+
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 1f6ee9be92f8df85a8c9a5a78c20fd39c0c21a95 upstream.
Commit 7d6899fb69d25 ("ovl: fsync after metadata copy-up") was done to
fix durability of overlayfs copy up on an upper filesystem which does
not enforce ordering on storing of metadata changes (e.g. ubifs).
In an earlier revision of the regressing commit by Lei Lv, the metadata
fsync behavior was opt-in via a new "fsync=strict" mount option.
We were hoping that the opt-in mount option could be avoided, so the
change was only made to depend on metacopy=off, in the hope of not
hurting performance of metadata heavy workloads, which are more likely
to be using metacopy=on.
This hope was proven wrong by a performance regression report from Google
COS workload after upgrade to kernel 6.12.
This is an adaptation of Lei's original "fsync=strict" mount option
to the existing upstream code.
The new mount option is mutually exclusive with the "volatile" mount
option, so the latter is now an alias to the "fsync=volatile" mount
option.
Reported-by: Chenglong Tang <chenglongtang@google.com>
Closes: https://lore.kernel.org/linux-unionfs/CAOdxtTadAFH01Vui1FvWfcmQ8jH1O45owTzUcpYbNvBxnLeM7Q@mail.gmail.com/
Link: https://lore.kernel.org/linux-unionfs/CAOQ4uxgKC1SgjMWre=fUb00v8rxtd6sQi-S+dxR8oDzAuiGu8g@mail.gmail.com/
Fixes: 7d6899fb69d25 ("ovl: fsync after metadata copy-up")
Depends: 50e638beb67e0 ("ovl: Use str_on_off() helper in ovl_show_options()")
Cc: stable@vger.kernel.org # v6.12+
Signed-off-by: Fei Lv <feilv@asrmicro.com>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 76f9377cd2ab7a9220c25d33940d9ca20d368172 upstream.
Add a SB_I_NO_DATA_INTEGRITY superblock flag for filesystems that cannot
guarantee data persistence on sync (eg fuse). For superblocks with this
flag set, sync kicks off writeback of dirty inodes but does not wait
for the flusher threads to complete the writeback.
This replaces the per-inode AS_NO_DATA_INTEGRITY mapping flag added in
commit f9a49aa302a0 ("fs/writeback: skip AS_NO_DATA_INTEGRITY mappings
in wait_sb_inodes()"). The flag belongs at the superblock level because
data integrity is a filesystem-wide property, not a per-inode one.
Having this flag at the superblock level also allows us to skip having
to iterate every dirty inode in wait_sb_inodes() only to skip each inode
individually.
Prior to this commit, mappings with no data integrity guarantees skipped
waiting on writeback completion but still waited on the flusher threads
to finish initiating the writeback. Waiting on the flusher threads is
unnecessary. This commit kicks off writeback but does not wait on the
flusher threads. This change properly addresses a recent report [1] for
a suspend-to-RAM hang seen on fuse-overlayfs that was caused by waiting
on the flusher threads to finish:
Workqueue: pm_fs_sync pm_fs_sync_work_fn
Call Trace:
<TASK>
__schedule+0x457/0x1720
schedule+0x27/0xd0
wb_wait_for_completion+0x97/0xe0
sync_inodes_sb+0xf8/0x2e0
__iterate_supers+0xdc/0x160
ksys_sync+0x43/0xb0
pm_fs_sync_work_fn+0x17/0xa0
process_one_work+0x193/0x350
worker_thread+0x1a1/0x310
kthread+0xfc/0x240
ret_from_fork+0x243/0x280
ret_from_fork_asm+0x1a/0x30
</TASK>
On fuse this is problematic because there are paths that may cause the
flusher thread to block (eg if systemd freezes the user session cgroups
first, which freezes the fuse daemon, before invoking the kernel
suspend. The kernel suspend triggers ->write_node() which on fuse issues
a synchronous setattr request, which cannot be processed since the
daemon is frozen. Or if the daemon is buggy and cannot properly complete
writeback, initiating writeback on a dirty folio already under writeback
leads to writeback_get_folio() -> folio_prepare_writeback() ->
unconditional wait on writeback to finish, which will cause a hang).
This commit restores fuse to its prior behavior before tmp folios were
removed, where sync was essentially a no-op.
[1] https://lore.kernel.org/linux-fsdevel/CAJnrk1a-asuvfrbKXbEwwDSctvemF+6zfhdnuzO65Pt8HsFSRw@mail.gmail.com/T/#m632c4648e9cafc4239299887109ebd880ac6c5c1
Fixes: 0c58a97f919c ("fuse: remove tmp folio for writebacks and internal rb tree")
Reported-by: John <therealgraysky@proton.me>
Cc: stable@vger.kernel.org
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Link: https://patch.msgid.link/20260320005145.2483161-2-joannelkoong@gmail.com
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: David Hildenbrand (Arm) <david@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit c23df30915f83e7257c8625b690a1cece94142a0 upstream.
The bio completion path in the process context (e.g. dm-verity)
will directly call into decompression rather than trigger another
workqueue context for minimal scheduling latencies, which can
then call vm_map_ram() with GFP_KERNEL.
Due to insufficient memory, vm_map_ram() may generate memory
swapping I/O, which can cause submit_bio_wait to deadlock
in some scenarios.
Trimmed down the call stack, as follows:
f2fs_submit_read_io
submit_bio //bio_list is initialized.
mmc_blk_mq_recovery
z_erofs_endio
vm_map_ram
__pte_alloc_kernel
__alloc_pages_direct_reclaim
shrink_folio_list
__swap_writepage
submit_bio_wait //bio_list is non-NULL, hang!!!
Use memalloc_noio_{save,restore}() to wrap up this path.
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Signed-off-by: Jiucheng Xu <jiucheng.xu@amlogic.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|