| Age | Commit message (Collapse) | Author | Files | Lines |
|
[ Upstream commit 1c37d896b12dfd0d4c96e310b0033c6676933917 ]
Whenever we get an error updating the device stats item for a device in
btrfs_run_dev_stats() we allow the loop to go to the next device, and if
updating the stats item for the next device succeeds, we end up losing
the error we had from the previous device.
Fix this by breaking out of the loop once we get an error and make sure
it's returned to the caller. Since we are in the transaction commit path
(and in the critical section actually), returning the error will result
in a transaction abort.
Fixes: 733f4fbbc108 ("Btrfs: read device stats on mount, write modified ones during commit")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit a4376d9a5d4c9610e69def3fc0b32c86a7ab7a41 ]
When create_space_info_sub_group() allocates elements of
space_info->sub_group[], kobject_init_and_add() is called for each
element via btrfs_sysfs_add_space_info_type(). However, when
check_removing_space_info() frees these elements, it does not call
btrfs_sysfs_remove_space_info() on them. As a result, kobject_put() is
not called and the associated kobj->name objects are leaked.
This memory leak is reproduced by running the blktests test case
zbd/009 on kernels built with CONFIG_DEBUG_KMEMLEAK. The kmemleak
feature reports the following error:
unreferenced object 0xffff888112877d40 (size 16):
comm "mount", pid 1244, jiffies 4294996972
hex dump (first 16 bytes):
64 61 74 61 2d 72 65 6c 6f 63 00 c4 c6 a7 cb 7f data-reloc......
backtrace (crc 53ffde4d):
__kmalloc_node_track_caller_noprof+0x619/0x870
kstrdup+0x42/0xc0
kobject_set_name_vargs+0x44/0x110
kobject_init_and_add+0xcf/0x150
btrfs_sysfs_add_space_info_type+0xfc/0x210 [btrfs]
create_space_info_sub_group.constprop.0+0xfb/0x1b0 [btrfs]
create_space_info+0x211/0x320 [btrfs]
btrfs_init_space_info+0x15a/0x1b0 [btrfs]
open_ctree+0x33c7/0x4a50 [btrfs]
btrfs_get_tree.cold+0x9f/0x1ee [btrfs]
vfs_get_tree+0x87/0x2f0
vfs_cmd_create+0xbd/0x280
__do_sys_fsconfig+0x3df/0x990
do_syscall_64+0x136/0x1540
entry_SYSCALL_64_after_hwframe+0x76/0x7e
To avoid the leak, call btrfs_sysfs_remove_space_info() instead of
kfree() for the elements.
Fixes: f92ee31e031c ("btrfs: introduce btrfs_space_info sub-group")
Link: https://lore.kernel.org/linux-block/b9488881-f18d-4f47-91a5-3c9bf63955a5@wdc.com/
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit b52fe51f724385b3ed81e37e510a4a33107e8161 ]
Fix the superblock offset mismatch error message in
btrfs_validate_super(): we changed it so that it considers all the
superblocks, but the message still assumes we're only looking at the
first one.
The change from %u to %llu is because we're changing from a constant to
a u64.
Fixes: 069ec957c35e ("btrfs: Refactor btrfs_check_super_valid")
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Mark Harmstone <mark@harmstone.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
It's actually a stable-only issue from backporting 9e2f9d34dd12
("erofs: handle overlapped pclusters out of crafted images properly")
We missed to update `oldpage` after `pcl->compressed_bvecs[nr].page`
is updated, so that the following cmpxchg() will fail; the original
upstream commit doesn't behave like this due to new features and
refactoring.
This backport issue only impacts some specific crafted images and
normal filesystems won't be impacted at all.
Fixes: 1bf7e414cac3 ("erofs: handle overlapped pclusters out of crafted images properly") # 6.6.y
Closes: https://syzkaller.appspot.com/bug?extid=b6353e35ae2bab997538
Reported-and-tested-by: syzbot+b6353e35ae2bab997538@syzkaller.appspotmail.com [1]
[1] https://lore.kernel.org/r/69c3b299.a70a0220.234938.004b.GAE@google.com
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 394d70b86fae9fe865e7e6d9540b7696f73aa9b6 upstream.
In xfs_inode_item_push() and xfs_qm_dquot_logitem_push(), the AIL lock
is dropped to perform buffer IO. Once the cluster buffer no longer
protects the log item from reclaim, the log item may be freed by
background reclaim or the dquot shrinker. The subsequent spin_lock()
call dereferences lip->li_ailp, which is a use-after-free.
Fix this by saving the ailp pointer in a local variable while the AIL
lock is held and the log item is guaranteed to be valid.
Reported-by: syzbot+652af2b3c5569c4ab63c@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=652af2b3c5569c4ab63c
Fixes: 90c60e164012 ("xfs: xfs_iflush() is no longer necessary")
Cc: stable@vger.kernel.org # v5.9
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Yuto Ohnuki <ytohnuki@amazon.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 79ef34ec0554ec04bdbafafbc9836423734e1bd6 upstream.
After xfsaild_push_item() calls iop_push(), the log item may have been
freed if the AIL lock was dropped during the push. Background inode
reclaim or the dquot shrinker can free the log item while the AIL lock
is not held, and the tracepoints in the switch statement dereference
the log item after iop_push() returns.
Fix this by capturing the log item type, flags, and LSN before calling
xfsaild_push_item(), and introducing a new xfs_ail_push_class trace
event class that takes these pre-captured values and the ailp pointer
instead of the log item pointer.
Reported-by: syzbot+652af2b3c5569c4ab63c@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=652af2b3c5569c4ab63c
Fixes: 90c60e164012 ("xfs: xfs_iflush() is no longer necessary")
Cc: stable@vger.kernel.org # v5.9
Signed-off-by: Yuto Ohnuki <ytohnuki@amazon.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit a71874379ec8c6e788a61d71b3ad014a8d9a5c08 ]
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
[ Only switch to CLASS(fd) in v6.6.y for fd_empty() was introduced in commit
88a2f6468d01 ("struct fd: representation change") in 6.12. ]
Signed-off-by: Alva Lan <alvalan9@foxmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 28c4d9bc0708956c1a736a9e49fee71b65deee81 ]
In gdlm_put_lock(), there is a small window of time in which the
DFL_UNMOUNT flag has been set but the lockspace hasn't been released,
yet. In that window, dlm may still call gdlm_ast() and gdlm_bast().
To prevent it from dereferencing freed glock objects, only free the
glock if the lockspace has actually been released.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: Andrew Price <anprice@redhat.com>
[ Minor context change fixed. ]
Signed-off-by: Robert Garcia <rob_garcia@163.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 309b44ed684496ed3f9c5715d10b899338623512 ]
smb2_lock() has three error handling issues after list_del() detaches
smb_lock from lock_list at no_check_cl:
1) If vfs_lock_file() returns an unexpected error in the non-UNLOCK
path, goto out leaks smb_lock and its flock because the out:
handler only iterates lock_list and rollback_list, neither of
which contains the detached smb_lock.
2) If vfs_lock_file() returns -ENOENT in the UNLOCK path, goto out
leaks smb_lock and flock for the same reason. The error code
returned to the dispatcher is also stale.
3) In the rollback path, smb_flock_init() can return NULL on
allocation failure. The result is dereferenced unconditionally,
causing a kernel NULL pointer dereference. Add a NULL check to
prevent the crash and clean up the bookkeeping; the VFS lock
itself cannot be rolled back without the allocation and will be
released at file or connection teardown.
Fix cases 1 and 2 by hoisting the locks_free_lock()/kfree() to before
the if(!rc) check in the UNLOCK branch so all exit paths share one
free site, and by freeing smb_lock and flock before goto out in the
non-UNLOCK branch. Propagate the correct error code in both cases.
Fix case 3 by wrapping the VFS unlock in an if(rlock) guard and adding
a NULL check for locks_free_lock(rlock) in the shared cleanup.
Found via call-graph analysis using sqry.
Fixes: e2f34481b24d ("cifsd: add server-side procedures for SMB3")
Cc: stable@vger.kernel.org
Suggested-by: ChenXiaoSong <chenxiaosong@kylinos.cn>
Signed-off-by: Werner Kasselman <werner@verivus.com>
Reviewed-by: ChenXiaoSong <chenxiaosong@kylinos.cn>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
[ adapted rlock->c.flc_type to rlock->fl_type ]
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 48623ec358c1c600fa1e38368746f933e0f1a617 ]
smb_grant_oplock() has two issues in the oplock publication sequence:
1) opinfo is linked into ci->m_op_list (via opinfo_add) before
add_lease_global_list() is called. If add_lease_global_list()
fails (kmalloc returns NULL), the error path frees the opinfo
via __free_opinfo() while it is still linked in ci->m_op_list.
Concurrent m_op_list readers (opinfo_get_list, or direct iteration
in smb_break_all_levII_oplock) dereference the freed node.
2) opinfo->o_fp is assigned after add_lease_global_list() publishes
the opinfo on the global lease list. A concurrent
find_same_lease_key() can walk the lease list and dereference
opinfo->o_fp->f_ci while o_fp is still NULL.
Fix by restructuring the publication sequence to eliminate post-publish
failure:
- Set opinfo->o_fp before any list publication (fixes NULL deref).
- Preallocate lease_table via alloc_lease_table() before opinfo_add()
so add_lease_global_list() becomes infallible after publication.
- Keep the original m_op_list publication order (opinfo_add before
lease list) so concurrent opens via same_client_has_lease() and
opinfo_get_list() still see the in-flight grant.
- Use opinfo_put() instead of __free_opinfo() on err_out so that
the RCU-deferred free path is used.
This also requires splitting add_lease_global_list() to take a
preallocated lease_table and changing its return type from int to void,
since it can no longer fail.
Fixes: 1dfd062caa16 ("ksmbd: fix use-after-free by using call_rcu() for oplock_info")
Cc: stable@vger.kernel.org
Signed-off-by: Werner Kasselman <werner@verivus.com>
Reviewed-by: ChenXiaoSong <chenxiaosong@kylinos.cn>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
[ replaced kmalloc_obj() and KSMBD_DEFAULT_GFP with kmalloc(sizeof(), GFP_KERNEL) ]
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 9ee29d20aab228adfb02ca93f87fb53c56c2f3af upstream.
While reviewing recent ext4 patch[1], Sashiko raised the following
concern[2]:
> If the filesystem is initially mounted with the discard option,
> deleting files will populate sbi->s_discard_list and queue
> s_discard_work. If it is then remounted with nodiscard, the
> EXT4_MOUNT_DISCARD flag is cleared, but the pending s_discard_work is
> neither cancelled nor flushed.
[1] https://lore.kernel.org/r/20260319094545.19291-1-qiang.zhang@linux.dev/
[2] https://sashiko.dev/#/patchset/20260319094545.19291-1-qiang.zhang%40linux.dev
The concern was valid, but it had nothing to do with the patch[1].
One of the problems with Sashiko in its current (early) form is that
it will detect pre-existing issues and report it as a problem with the
patch that it is reviewing.
In practice, it would be hard to hit deliberately (unless you are a
malicious syzkaller fuzzer), since it would involve mounting the file
system with -o discard, and then deleting a large number of files,
remounting the file system with -o nodiscard, and then immediately
unmounting the file system before the queued discard work has a change
to drain on its own.
Fix it because it's a real bug, and to avoid Sashiko from raising this
concern when analyzing future patches to mballoc.c.
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Fixes: 55cdd0af2bc5 ("ext4: get discard out of jbd2 commit kthread contex")
Cc: stable@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit ec0a7500d8eace5b4f305fa0c594dd148f0e8d29 upstream.
During code review, Joseph found that ext4_fc_replay_inode() calls
ext4_get_fc_inode_loc() to get the inode location, which holds a
reference to iloc.bh that must be released via brelse().
However, several error paths jump to the 'out' label without
releasing iloc.bh:
- ext4_handle_dirty_metadata() failure
- sync_dirty_buffer() failure
- ext4_mark_inode_used() failure
- ext4_iget() failure
Fix this by introducing an 'out_brelse' label placed just before
the existing 'out' label to ensure iloc.bh is always released.
Additionally, make ext4_fc_replay_inode() propagate errors
properly instead of always returning 0.
Reported-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Fixes: 8016e29f4362 ("ext4: fast commit recovery path")
Signed-off-by: Baokun Li <libaokun@linux.alibaba.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20260323060836.3452660-1-libaokun@linux.alibaba.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 496bb99b7e66f48b178126626f47e9ba79e2d0fa upstream.
Use the kvfree() in the RCU read critical section can trigger
the following warnings:
EXT4-fs (vdb): unmounting filesystem cd983e5b-3c83-4f5a-a136-17b00eb9d018.
WARNING: suspicious RCU usage
./include/linux/rcupdate.h:409 Illegal context switch in RCU read-side critical section!
other info that might help us debug this:
rcu_scheduler_active = 2, debug_locks = 1
Call Trace:
<TASK>
dump_stack_lvl+0xbb/0xd0
dump_stack+0x14/0x20
lockdep_rcu_suspicious+0x15a/0x1b0
__might_resched+0x375/0x4d0
? put_object.part.0+0x2c/0x50
__might_sleep+0x108/0x160
vfree+0x58/0x910
? ext4_group_desc_free+0x27/0x270
kvfree+0x23/0x40
ext4_group_desc_free+0x111/0x270
ext4_put_super+0x3c8/0xd40
generic_shutdown_super+0x14c/0x4a0
? __pfx_shrinker_free+0x10/0x10
kill_block_super+0x40/0x90
ext4_kill_sb+0x6d/0xb0
deactivate_locked_super+0xb4/0x180
deactivate_super+0x7e/0xa0
cleanup_mnt+0x296/0x3e0
__cleanup_mnt+0x16/0x20
task_work_run+0x157/0x250
? __pfx_task_work_run+0x10/0x10
? exit_to_user_mode_loop+0x6a/0x550
exit_to_user_mode_loop+0x102/0x550
do_syscall_64+0x44a/0x500
entry_SYSCALL_64_after_hwframe+0x77/0x7f
</TASK>
BUG: sleeping function called from invalid context at mm/vmalloc.c:3441
in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 556, name: umount
preempt_count: 1, expected: 0
CPU: 3 UID: 0 PID: 556 Comm: umount
Call Trace:
<TASK>
dump_stack_lvl+0xbb/0xd0
dump_stack+0x14/0x20
__might_resched+0x275/0x4d0
? put_object.part.0+0x2c/0x50
__might_sleep+0x108/0x160
vfree+0x58/0x910
? ext4_group_desc_free+0x27/0x270
kvfree+0x23/0x40
ext4_group_desc_free+0x111/0x270
ext4_put_super+0x3c8/0xd40
generic_shutdown_super+0x14c/0x4a0
? __pfx_shrinker_free+0x10/0x10
kill_block_super+0x40/0x90
ext4_kill_sb+0x6d/0xb0
deactivate_locked_super+0xb4/0x180
deactivate_super+0x7e/0xa0
cleanup_mnt+0x296/0x3e0
__cleanup_mnt+0x16/0x20
task_work_run+0x157/0x250
? __pfx_task_work_run+0x10/0x10
? exit_to_user_mode_loop+0x6a/0x550
exit_to_user_mode_loop+0x102/0x550
do_syscall_64+0x44a/0x500
entry_SYSCALL_64_after_hwframe+0x77/0x7f
The above scenarios occur in initialization failures and teardown
paths, there are no parallel operations on the resources released
by kvfree(), this commit therefore remove rcu_read_lock/unlock() and
use rcu_access_pointer() instead of rcu_dereference() operations.
Fixes: 7c990728b99e ("ext4: fix potential race between s_flex_groups online resizing and access")
Fixes: df3da4ea5a0f ("ext4: fix potential race between s_group_info online resizing and access")
Signed-off-by: Zqiang <qiang.zhang@linux.dev>
Reviewed-by: Baokun Li <libaokun@linux.alibaba.com>
Link: https://patch.msgid.link/20260319094545.19291-1-qiang.zhang@linux.dev
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit d15e4b0a418537aafa56b2cb80d44add83e83697 upstream.
Commit b98535d09179 ("ext4: fix bug_on in start_this_handle during umount
filesystem") moved ext4_unregister_sysfs() before flushing s_sb_upd_work
to prevent new error work from being queued via /proc/fs/ext4/xx/mb_groups
reads during unmount. However, this introduced a use-after-free because
update_super_work calls ext4_notify_error_sysfs() -> sysfs_notify() which
accesses the kobject's kernfs_node after it has been freed by kobject_del()
in ext4_unregister_sysfs():
update_super_work ext4_put_super
----------------- --------------
ext4_unregister_sysfs(sb)
kobject_del(&sbi->s_kobj)
__kobject_del()
sysfs_remove_dir()
kobj->sd = NULL
sysfs_put(sd)
kernfs_put() // RCU free
ext4_notify_error_sysfs(sbi)
sysfs_notify(&sbi->s_kobj)
kn = kobj->sd // stale pointer
kernfs_get(kn) // UAF on freed kernfs_node
ext4_journal_destroy()
flush_work(&sbi->s_sb_upd_work)
Instead of reordering the teardown sequence, fix this by making
ext4_notify_error_sysfs() detect that sysfs has already been torn down
by checking s_kobj.state_in_sysfs, and skipping the sysfs_notify() call
in that case. A dedicated mutex (s_error_notify_mutex) serializes
ext4_notify_error_sysfs() against kobject_del() in ext4_unregister_sysfs()
to prevent TOCTOU races where the kobject could be deleted between the
state_in_sysfs check and the sysfs_notify() call.
Fixes: b98535d09179 ("ext4: fix bug_on in start_this_handle during umount filesystem")
Cc: Jiayuan Chen <jiayuan.chen@linux.dev>
Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jiayuan Chen <jiayuan.chen@shopee.com>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20260319120336.157873-1-jiayuan.chen@linux.dev
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 3822743dc20386d9897e999dbb990befa3a5b3f8 upstream.
bigalloc with s_first_data_block != 0 is not supported, reject mounting
it.
Signed-off-by: Helen Koike <koike@igalia.com>
Suggested-by: Theodore Ts'o <tytso@mit.edu>
Reported-by: syzbot+b73703b873a33d8eb8f6@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=b73703b873a33d8eb8f6
Link: https://patch.msgid.link/20260317142325.135074-1-koike@igalia.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 46066e3a06647c5b186cc6334409722622d05c44 upstream.
There's issue as follows:
...
EXT4-fs (mmcblk0p1): Delayed block allocation failed for inode 206 at logical offset 0 with max blocks 1 with error 117
EXT4-fs (mmcblk0p1): This should not happen!! Data will be lost
EXT4-fs (mmcblk0p1): Delayed block allocation failed for inode 206 at logical offset 0 with max blocks 1 with error 117
EXT4-fs (mmcblk0p1): This should not happen!! Data will be lost
EXT4-fs (mmcblk0p1): Delayed block allocation failed for inode 206 at logical offset 0 with max blocks 1 with error 117
EXT4-fs (mmcblk0p1): This should not happen!! Data will be lost
EXT4-fs (mmcblk0p1): Delayed block allocation failed for inode 206 at logical offset 0 with max blocks 1 with error 117
EXT4-fs (mmcblk0p1): This should not happen!! Data will be lost
EXT4-fs (mmcblk0p1): Delayed block allocation failed for inode 2243 at logical offset 0 with max blocks 1 with error 117
EXT4-fs (mmcblk0p1): This should not happen!! Data will be lost
EXT4-fs (mmcblk0p1): Delayed block allocation failed for inode 2239 at logical offset 0 with max blocks 1 with error 117
EXT4-fs (mmcblk0p1): This should not happen!! Data will be lost
EXT4-fs (mmcblk0p1): error count since last fsck: 1
EXT4-fs (mmcblk0p1): initial error at time 1765597433: ext4_mb_generate_buddy:760
EXT4-fs (mmcblk0p1): last error at time 1765597433: ext4_mb_generate_buddy:760
...
According to the log analysis, blocks are always requested from the
corrupted block group. This may happen as follows:
ext4_mb_find_by_goal
ext4_mb_load_buddy
ext4_mb_load_buddy_gfp
ext4_mb_init_cache
ext4_read_block_bitmap_nowait
ext4_wait_block_bitmap
ext4_validate_block_bitmap
if (!grp || EXT4_MB_GRP_BBITMAP_CORRUPT(grp))
return -EFSCORRUPTED; // There's no logs.
if (err)
return err; // Will return error
ext4_lock_group(ac->ac_sb, group);
if (unlikely(EXT4_MB_GRP_BBITMAP_CORRUPT(e4b->bd_info))) // Unreachable
goto out;
After commit 9008a58e5dce ("ext4: make the bitmap read routines return
real error codes") merged, Commit 163a203ddb36 ("ext4: mark block group
as corrupt on block bitmap error") is no real solution for allocating
blocks from corrupted block groups. This is because if
'EXT4_MB_GRP_BBITMAP_CORRUPT(e4b->bd_info)' is true, then
'ext4_mb_load_buddy()' may return an error. This means that the block
allocation will fail.
Therefore, check block group if corrupted when ext4_mb_load_buddy()
returns error.
Fixes: 163a203ddb36 ("ext4: mark block group as corrupt on block bitmap error")
Fixes: 9008a58e5dce ("ext4: make the bitmap read routines return real error codes")
Signed-off-by: Ye Bin <yebin10@huawei.com>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20260302134619.3145520-1-yebin@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 5422fe71d26d42af6c454ca9527faaad4e677d6c upstream.
On the mkdir/mknod path, when mapping logical blocks to physical blocks,
if inserting a new extent into the extent tree fails (in this example,
because the file system disabled the huge file feature when marking the
inode as dirty), ext4_ext_map_blocks() only calls ext4_free_blocks() to
reclaim the physical block without deleting the corresponding data in
the extent tree. This causes subsequent mkdir operations to reference
the previously reclaimed physical block number again, even though this
physical block is already being used by the xattr block. Therefore, a
situation arises where both the directory and xattr are using the same
buffer head block in memory simultaneously.
The above causes ext4_xattr_block_set() to enter an infinite loop about
"inserted" and cannot release the inode lock, ultimately leading to the
143s blocking problem mentioned in [1].
If the metadata is corrupted, then trying to remove some extent space
can do even more harm. Also in case EXT4_GET_BLOCKS_DELALLOC_RESERVE
was passed, remove space wrongly update quota information.
Jan Kara suggests distinguishing between two cases:
1) The error is ENOSPC or EDQUOT - in this case the filesystem is fully
consistent and we must maintain its consistency including all the
accounting. However these errors can happen only early before we've
inserted the extent into the extent tree. So current code works correctly
for this case.
2) Some other error - this means metadata is corrupted. We should strive to
do as few modifications as possible to limit damage. So I'd just skip
freeing of allocated blocks.
[1]
INFO: task syz.0.17:5995 blocked for more than 143 seconds.
Call Trace:
inode_lock_nested include/linux/fs.h:1073 [inline]
__start_dirop fs/namei.c:2923 [inline]
start_dirop fs/namei.c:2934 [inline]
Reported-by: syzbot+512459401510e2a9a39f@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=1659aaaaa8d9d11265d7
Tested-by: syzbot+1659aaaaa8d9d11265d7@syzkaller.appspotmail.com
Reported-by: syzbot+1659aaaaa8d9d11265d7@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=512459401510e2a9a39f
Tested-by: syzbot+1659aaaaa8d9d11265d7@syzkaller.appspotmail.com
Signed-off-by: Edward Adam Davis <eadavis@qq.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Tested-by: syzbot+512459401510e2a9a39f@syzkaller.appspotmail.com
Link: https://patch.msgid.link/tencent_43696283A68450B761D76866C6F360E36705@qq.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 356227096eb66e41b23caf7045e6304877322edf upstream.
Replace BUG_ON() with proper error handling when inline data size
exceeds PAGE_SIZE. This prevents kernel panic and allows the system to
continue running while properly reporting the filesystem corruption.
The error is logged via ext4_error_inode(), the buffer head is released
to prevent memory leak, and -EFSCORRUPTED is returned to indicate
filesystem corruption.
Signed-off-by: Yuto Ohnuki <ytohnuki@amazon.com>
Link: https://patch.msgid.link/20260223123345.14838-2-ytohnuki@amazon.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit bd060afa7cc3e0ad30afa9ecc544a78638498555 upstream.
recently_deleted() checks whether inode has been used in the near past.
However this can give false positive result when inode table is not
initialized yet and we are in fact comparing to random garbage (or stale
itable block of a filesystem before mkfs). Ultimately this results in
uninitialized inodes being skipped during inode allocation and possibly
they are never initialized and thus e2fsck complains. Verify if the
inode has been initialized before checking for dtime.
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20260216164848.3074-3-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 1308255bbf8452762f89f44f7447ce137ecdbcff upstream.
When inode metadata is changed, we sometimes just call
ext4_mark_inode_dirty() to track modified metadata. This copies inode
metadata into block buffer which is enough when we are journalling
metadata. However when we are running in nojournal mode we currently
fail to write the dirtied inode buffer during fsync(2) because the inode
is not marked as dirty. Use explicit ext4_write_inode() call to make
sure the inode table buffer is written to the disk. This is a band aid
solution but proper solution requires a much larger rewrite including
changes in metadata bh tracking infrastructure.
Reported-by: Free Ekanayaka <free.ekanayaka@gmail.com>
Link: https://lore.kernel.org/all/87il8nhxdm.fsf@x1.mail-host-address-is-not-set/
CC: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20260216164848.3074-4-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit f4a2b42e78914ff15630e71289adc589c3a8eb45 upstream.
There are cases where ext4_bio_write_page() gets called for a page which
has no buffers to submit. This happens e.g. when the part of the file is
actually a hole, when we cannot allocate blocks due to being called from
jbd2, or in data=journal mode when checkpointing writes the buffers
earlier. In these cases we just return from ext4_bio_write_page()
however if the page didn't need redirtying, we will leave stale DIRTY
and/or TOWRITE tags in xarray because those get cleared only in
__folio_start_writeback(). As a result we can leave these tags set in
mappings even after a final sync on filesystem that's getting remounted
read-only or that's being frozen. Various assertions can then get upset
when writeback is started on such filesystems (Gerald reported assertion
in ext4_journal_check_start() firing).
Fix the problem by cycling the page through writeback state even if we
decide nothing needs to be written for it so that xarray tags get
properly updated. This is slightly silly (we could update the xarray
tags directly) but I don't think a special helper messing with xarray
tags is really worth it in this relatively rare corner case.
Reported-by: Gerald Yang <gerald.yang@canonical.com>
Link: https://lore.kernel.org/all/20260128074515.2028982-1-gerald.yang@canonical.com
Fixes: dff4ac75eeee ("ext4: move keep_towrite handling to ext4_bio_write_page()")
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20260205092223.21287-2-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit ed9356a30e59c7cc3198e7fc46cfedf3767b9b17 upstream.
Add a check in ext4_setattr() to convert files from inline data storage
to extent-based storage when truncate() grows the file size beyond the
inline capacity. This prevents the filesystem from entering an
inconsistent state where the inline data flag is set but the file size
exceeds what can be stored inline.
Without this fix, the following sequence causes a kernel BUG_ON():
1. Mount filesystem with inode that has inline flag set and small size
2. truncate(file, 50MB) - grows size but inline flag remains set
3. sendfile() attempts to write data
4. ext4_write_inline_data() hits BUG_ON(write_size > inline_capacity)
The crash occurs because ext4_write_inline_data() expects inline storage
to accommodate the write, but the actual inline capacity (~60 bytes for
i_block + ~96 bytes for xattrs) is far smaller than the file size and
write request.
The fix checks if the new size from setattr exceeds the inode's actual
inline capacity (EXT4_I(inode)->i_inline_size) and converts the file to
extent-based storage before proceeding with the size change.
This addresses the root cause by ensuring the inline data flag and file
size remain consistent during truncate operations.
Reported-by: syzbot+7de5fe447862fc37576f@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=7de5fe447862fc37576f
Tested-by: syzbot+7de5fe447862fc37576f@syzkaller.appspotmail.com
Signed-off-by: Deepanshu Kartikey <Kartikey406@gmail.com>
Link: https://patch.msgid.link/20260207043607.1175976-1-kartikey406@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit b1d682f1990c19fb1d5b97d13266210457092bcd upstream.
Fix an issue arising when ext4 features has_journal, ea_inode, and encrypt
are activated simultaneously, leading to ENOSPC when creating an encrypted
file.
Fix by passing XATTR_CREATE flag to xattr_set_handle function if a handle
is specified, i.e., when the function is called in the control flow of
creating a new inode. This aligns the number of jbd2 credits set_handle
checks for with the number allocated for creating a new inode.
ext4_set_context must not be called with a non-null handle (fs_data) if
fscrypt context xattr is not guaranteed to not exist yet. The only other
usage of this function currently is when handling the ioctl
FS_IOC_SET_ENCRYPTION_POLICY, which calls it with fs_data=NULL.
Fixes: c1a5d5f6ab21eb7e ("ext4: improve journal credit handling in set xattr paths")
Co-developed-by: Anthony Durrer <anthonydev@fastmail.com>
Signed-off-by: Anthony Durrer <anthonydev@fastmail.com>
Signed-off-by: Simon Weber <simon.weber.39@gmail.com>
Reviewed-by: Eric Biggers <ebiggers@kernel.org>
Link: https://patch.msgid.link/20260207100148.724275-4-simon.weber.39@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit d72f2084e30966097c8eae762e31986a33c3c0ae upstream.
The ri_total checks for SET/REPLACE operations are hardcoded to 3,
but xfs_attri_item_size() only emits a value iovec when value_len > 0,
so ri_total is 2 when value_len == 0.
For PPTR_SET/PPTR_REMOVE/PPTR_REPLACE, value_len is validated by
xfs_attri_validate() to be exactly sizeof(struct xfs_parent_rec) and
is never zero, so their hardcoded checks remain correct.
This problem may cause log recovery failures. The following script can be
used to reproduce the problem:
#!/bin/bash
mkfs.xfs -f /dev/sda
mount /dev/sda /mnt/test/
touch /mnt/test/file
for i in {1..200}; do
attr -s "user.attr_$i" -V "value_$i" /mnt/test/file > /dev/null
done
echo 1 > /sys/fs/xfs/debug/larp
echo 1 > /sys/fs/xfs/sda/errortag/larp
attr -s "user.zero" -V "" /mnt/test/file
echo 0 > /sys/fs/xfs/sda/errortag/larp
umount /mnt/test
mount /dev/sda /mnt/test/ # mount failed
Fix this by deriving the expected count dynamically as "2 + !!value_len"
for SET/REPLACE operations.
Cc: stable@vger.kernel.org # v6.9
Fixes: ad206ae50eca ("xfs: check opcode and iovec count match in xlog_recover_attri_commit_pass2")
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Long Li <leo.lilong@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 4f24a767e3d64a5f58c595b5c29b6063a201f1e3 upstream.
The unmount sequence in xfs_unmount_flush_inodes() pushed the AIL while
background reclaim and inodegc are still running. This is broken
independently of any use-after-free issues - background reclaim and
inodegc should not be running while the AIL is being pushed during
unmount, as inodegc can dirty and insert inodes into the AIL during the
flush, and background reclaim can race to abort and free dirty inodes.
Reorder xfs_unmount_flush_inodes() to stop inodegc and cancel background
reclaim before pushing the AIL. Stop inodegc before cancelling
m_reclaim_work because the inodegc worker can re-queue m_reclaim_work
via xfs_inodegc_set_reclaimable.
Reported-by: syzbot+652af2b3c5569c4ab63c@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=652af2b3c5569c4ab63c
Fixes: 90c60e164012 ("xfs: xfs_iflush() is no longer necessary")
Cc: stable@vger.kernel.org # v5.9
Signed-off-by: Yuto Ohnuki <ytohnuki@amazon.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit bac3190a8e79beff6ed221975e0c9b1b5f2a21da upstream.
This patch targets two internal state machine invariants in checkpoint.c
residing inside functions that natively return integer error codes.
- In jbd2_cleanup_journal_tail(): A blocknr of 0 indicates a severely
corrupted journal superblock. Replaced the J_ASSERT with a WARN_ON_ONCE
and a graceful journal abort, returning -EFSCORRUPTED.
- In jbd2_log_do_checkpoint(): Replaced the J_ASSERT_BH checking for
an unexpected buffer_jwrite state. If the warning triggers, we
explicitly drop the just-taken get_bh() reference and call __flush_batch()
to safely clean up any previously queued buffers in the j_chkpt_bhs array,
preventing a memory leak before returning -EFSCORRUPTED.
Signed-off-by: Milos Nikic <nikic.milos@gmail.com>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Baokun Li <libaokun@linux.alibaba.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20260311041548.159424-1-nikic.milos@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit c23df30915f83e7257c8625b690a1cece94142a0 upstream.
The bio completion path in the process context (e.g. dm-verity)
will directly call into decompression rather than trigger another
workqueue context for minimal scheduling latencies, which can
then call vm_map_ram() with GFP_KERNEL.
Due to insufficient memory, vm_map_ram() may generate memory
swapping I/O, which can cause submit_bio_wait to deadlock
in some scenarios.
Trimmed down the call stack, as follows:
f2fs_submit_read_io
submit_bio //bio_list is initialized.
mmc_blk_mq_recovery
z_erofs_endio
vm_map_ram
__pte_alloc_kernel
__alloc_pages_direct_reclaim
shrink_folio_list
__swap_writepage
submit_bio_wait //bio_list is non-NULL, hang!!!
Use memalloc_noio_{save,restore}() to wrap up this path.
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Signed-off-by: Jiucheng Xu <jiucheng.xu@amlogic.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 9bbb19d21ded7d78645506f20d8c44895e3d0fb9 upstream.
When a multichannel session binding request fails (e.g. wrong password),
the error path unconditionally sets sess->state = SMB2_SESSION_EXPIRED.
However, during binding, sess points to the target session looked up via
ksmbd_session_lookup_slowpath() -- which belongs to another connection's
user. This allows a remote attacker to invalidate any active session by
simply sending a binding request with a wrong password (DoS).
Fix this by skipping session expiration when the failed request was
a binding attempt, since the session does not belong to the current
connection. The reference taken by ksmbd_session_lookup_slowpath() is
still correctly released via ksmbd_user_session_put().
Cc: stable@vger.kernel.org
Signed-off-by: Hyunwoo Kim <imv4bel@gmail.com>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit beef2634f81f1c086208191f7228bce1d366493d upstream.
When a compound request consists of QUERY_DIRECTORY + QUERY_INFO
(FILE_ALL_INFORMATION) and the first command consumes nearly the entire
max_trans_size, get_file_all_info() would blindly call smbConvertToUTF16()
with PATH_MAX, causing out-of-bounds write beyond the response buffer.
In get_file_all_info(), there was a missing validation check for
the client-provided OutputBufferLength before copying the filename into
FileName field of the smb2_file_all_info structure.
If the filename length exceeds the available buffer space, it could lead to
potential buffer overflows or memory corruption during smbConvertToUTF16
conversion. This calculating the actual free buffer size using
smb2_calc_max_out_buf_len() and returning -EINVAL if the buffer is
insufficient and updating smbConvertToUTF16 to use the actual filename
length (clamped by PATH_MAX) to ensure a safe copy operation.
Cc: stable@vger.kernel.org
Fixes: e2b76ab8b5c9 ("ksmbd: add support for read compound")
Reported-by: Asim Viladi Oglu Manizada <manizada@pm.me>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 0e55f63dd08f09651d39e1b709a91705a8a0ddcb upstream.
After this commit (e2b76ab8b5c9 "ksmbd: add support for read compound"),
response buffer management was changed to use dynamic iov array.
In the new design, smb2_calc_max_out_buf_len() expects the second
argument (hdr2_len) to be the offset of ->Buffer field in the
response structure, not a hardcoded magic number.
Fix the remaining call sites to use the correct offsetof() value.
Cc: stable@vger.kernel.org
Fixes: e2b76ab8b5c9 ("ksmbd: add support for read compound")
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 5131fa077f9bb386a1b901bf5b247041f0ec8f80 ]
We have recently observed a number of subvolumes with broken dentries.
ls-ing the parent dir looks like:
drwxrwxrwt 1 root root 16 Jan 23 16:49 .
drwxr-xr-x 1 root root 24 Jan 23 16:48 ..
d????????? ? ? ? ? ? broken_subvol
and similarly stat-ing the file fails.
In this state, deleting the subvol fails with ENOENT, but attempting to
create a new file or subvol over it errors out with EEXIST and even
aborts the fs. Which leaves us a bit stuck.
dmesg contains a single notable error message reading:
"could not do orphan cleanup -2"
2 is ENOENT and the error comes from the failure handling path of
btrfs_orphan_cleanup(), with the stack leading back up to
btrfs_lookup().
btrfs_lookup
btrfs_lookup_dentry
btrfs_orphan_cleanup // prints that message and returns -ENOENT
After some detailed inspection of the internal state, it became clear
that:
- there are no orphan items for the subvol
- the subvol is otherwise healthy looking, it is not half-deleted or
anything, there is no drop progress, etc.
- the subvol was created a while ago and does the meaningful first
btrfs_orphan_cleanup() call that sets BTRFS_ROOT_ORPHAN_CLEANUP much
later.
- after btrfs_orphan_cleanup() fails, btrfs_lookup_dentry() returns -ENOENT,
which results in a negative dentry for the subvolume via
d_splice_alias(NULL, dentry), leading to the observed behavior. The
bug can be mitigated by dropping the dentry cache, at which point we
can successfully delete the subvolume if we want.
i.e.,
btrfs_lookup()
btrfs_lookup_dentry()
if (!sb_rdonly(inode->vfs_inode)->vfs_inode)
btrfs_orphan_cleanup(sub_root)
test_and_set_bit(BTRFS_ROOT_ORPHAN_CLEANUP)
btrfs_search_slot() // finds orphan item for inode N
...
prints "could not do orphan cleanup -2"
if (inode == ERR_PTR(-ENOENT))
inode = NULL;
return d_splice_alias(NULL, dentry) // NEGATIVE DENTRY for valid subvolume
btrfs_orphan_cleanup() does test_and_set_bit(BTRFS_ROOT_ORPHAN_CLEANUP)
on the root when it runs, so it cannot run more than once on a given
root, so something else must run concurrently. However, the obvious
routes to deleting an orphan when nlinks goes to 0 should not be able to
run without first doing a lookup into the subvolume, which should run
btrfs_orphan_cleanup() and set the bit.
The final important observation is that create_subvol() calls
d_instantiate_new() but does not set BTRFS_ROOT_ORPHAN_CLEANUP, so if
the dentry cache gets dropped, the next lookup into the subvolume will
make a real call into btrfs_orphan_cleanup() for the first time. This
opens up the possibility of concurrently deleting the inode/orphan items
but most typical evict() paths will be holding a reference on the parent
dentry (child dentry holds parent->d_lockref.count via dget in
d_alloc(), released in __dentry_kill()) and prevent the parent from
being removed from the dentry cache.
The one exception is delayed iputs. Ordered extent creation calls
igrab() on the inode. If the file is unlinked and closed while those
refs are held, iput() in __dentry_kill() decrements i_count but does
not trigger eviction (i_count > 0). The child dentry is freed and the
subvol dentry's d_lockref.count drops to 0, making it evictable while
the inode is still alive.
Since there are two races (the race between writeback and unlink and
the race between lookup and delayed iputs), and there are too many moving
parts, the following three diagrams show the complete picture.
(Only the second and third are races)
Phase 1:
Create Subvol in dentry cache without BTRFS_ROOT_ORPHAN_CLEANUP set
btrfs_mksubvol()
lookup_one_len()
__lookup_slow()
d_alloc_parallel()
__d_alloc() // d_lockref.count = 1
create_subvol(dentry)
// doesn't touch the bit..
d_instantiate_new(dentry, inode) // dentry in cache with d_lockref.count == 1
Phase 2:
Create a delayed iput for a file in the subvol but leave the subvol in
state where its dentry can be evicted (d_lockref.count == 0)
T1 (task) T2 (writeback) T3 (OE workqueue)
write() // dirty pages
btrfs_writepages()
btrfs_run_delalloc_range()
cow_file_range()
btrfs_alloc_ordered_extent()
igrab() // i_count: 1 -> 2
btrfs_unlink_inode()
btrfs_orphan_add()
close()
__fput()
dput()
finish_dput()
__dentry_kill()
dentry_unlink_inode()
iput() // 2 -> 1
--parent->d_lockref.count // 1 -> 0; evictable
finish_ordered_fn()
btrfs_finish_ordered_io()
btrfs_put_ordered_extent()
btrfs_add_delayed_iput()
Phase 3:
Once the delayed iput is pending and the subvol dentry is evictable,
the shrinker can free it, causing the next lookup to go through
btrfs_lookup() and call btrfs_orphan_cleanup() for the first time.
If the cleaner kthread processes the delayed iput concurrently, the
two race:
T1 (shrinker) T2 (cleaner kthread) T3 (lookup)
super_cache_scan()
prune_dcache_sb()
__dentry_kill()
// subvol dentry freed
btrfs_run_delayed_iputs()
iput() // i_count -> 0
evict() // sets I_FREEING
btrfs_evict_inode()
// truncation loop
btrfs_lookup()
btrfs_lookup_dentry()
btrfs_orphan_cleanup()
// first call (bit never set)
btrfs_iget()
// blocks on I_FREEING
btrfs_orphan_del()
// inode freed
// returns -ENOENT
btrfs_del_orphan_item()
// -ENOENT
// "could not do orphan cleanup -2"
d_splice_alias(NULL, dentry)
// negative dentry for valid subvol
The most straightforward fix is to ensure the invariant that a dentry
for a subvolume can exist if and only if that subvolume has
BTRFS_ROOT_ORPHAN_CLEANUP set on its root (and is known to have no
orphans or ran btrfs_orphan_cleanup()).
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit b425e4d0eb321a1116ddbf39636333181675d8f4 ]
parse_durable_handle_context() unconditionally assigns dh_info->fp->conn
to the current connection when handling a DURABLE_REQ_V2 context with
SMB2_FLAGS_REPLAY_OPERATION. ksmbd_lookup_fd_cguid() does not filter by
fp->conn, so it returns file handles that are already actively connected.
The unconditional overwrite replaces fp->conn, and when the overwriting
connection is subsequently freed, __ksmbd_close_fd() dereferences the
stale fp->conn via spin_lock(&fp->conn->llist_lock), causing a
use-after-free.
KASAN report:
[ 7.349357] ==================================================================
[ 7.349607] BUG: KASAN: slab-use-after-free in _raw_spin_lock+0x75/0xe0
[ 7.349811] Write of size 4 at addr ffff8881056ac18c by task kworker/1:2/108
[ 7.350010]
[ 7.350064] CPU: 1 UID: 0 PID: 108 Comm: kworker/1:2 Not tainted 7.0.0-rc3+ #58 PREEMPTLAZY
[ 7.350068] Hardware name: QEMU Ubuntu 24.04 PC v2 (i440FX + PIIX, arch_caps fix, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
[ 7.350070] Workqueue: ksmbd-io handle_ksmbd_work
[ 7.350083] Call Trace:
[ 7.350087] <TASK>
[ 7.350087] dump_stack_lvl+0x64/0x80
[ 7.350094] print_report+0xce/0x660
[ 7.350100] ? __pfx__raw_spin_lock_irqsave+0x10/0x10
[ 7.350101] ? __pfx___mod_timer+0x10/0x10
[ 7.350106] ? _raw_spin_lock+0x75/0xe0
[ 7.350108] kasan_report+0xce/0x100
[ 7.350109] ? _raw_spin_lock+0x75/0xe0
[ 7.350114] kasan_check_range+0x105/0x1b0
[ 7.350116] _raw_spin_lock+0x75/0xe0
[ 7.350118] ? __pfx__raw_spin_lock+0x10/0x10
[ 7.350119] ? __call_rcu_common.constprop.0+0x25e/0x780
[ 7.350125] ? close_id_del_oplock+0x2cc/0x4e0
[ 7.350128] __ksmbd_close_fd+0x27f/0xaf0
[ 7.350131] ksmbd_close_fd+0x135/0x1b0
[ 7.350133] smb2_close+0xb19/0x15b0
[ 7.350142] ? __pfx_smb2_close+0x10/0x10
[ 7.350143] ? xas_load+0x18/0x270
[ 7.350146] ? _raw_spin_lock+0x84/0xe0
[ 7.350148] ? __pfx__raw_spin_lock+0x10/0x10
[ 7.350150] ? _raw_spin_unlock+0xe/0x30
[ 7.350151] ? ksmbd_smb2_check_message+0xeb2/0x24c0
[ 7.350153] ? ksmbd_tree_conn_lookup+0xcd/0xf0
[ 7.350154] handle_ksmbd_work+0x40f/0x1080
[ 7.350156] process_one_work+0x5fa/0xef0
[ 7.350162] ? assign_work+0x122/0x3e0
[ 7.350163] worker_thread+0x54b/0xf70
[ 7.350165] ? __pfx_worker_thread+0x10/0x10
[ 7.350166] kthread+0x346/0x470
[ 7.350170] ? recalc_sigpending+0x19b/0x230
[ 7.350176] ? __pfx_kthread+0x10/0x10
[ 7.350178] ret_from_fork+0x4fb/0x6c0
[ 7.350183] ? __pfx_ret_from_fork+0x10/0x10
[ 7.350185] ? __switch_to+0x36c/0xbe0
[ 7.350188] ? __pfx_kthread+0x10/0x10
[ 7.350190] ret_from_fork_asm+0x1a/0x30
[ 7.350197] </TASK>
[ 7.350197]
[ 7.355160] Allocated by task 123:
[ 7.355261] kasan_save_stack+0x33/0x60
[ 7.355373] kasan_save_track+0x14/0x30
[ 7.355484] __kasan_kmalloc+0x8f/0xa0
[ 7.355593] ksmbd_conn_alloc+0x44/0x6d0
[ 7.355711] ksmbd_kthread_fn+0x243/0xd70
[ 7.355839] kthread+0x346/0x470
[ 7.355942] ret_from_fork+0x4fb/0x6c0
[ 7.356051] ret_from_fork_asm+0x1a/0x30
[ 7.356164]
[ 7.356214] Freed by task 134:
[ 7.356305] kasan_save_stack+0x33/0x60
[ 7.356416] kasan_save_track+0x14/0x30
[ 7.356527] kasan_save_free_info+0x3b/0x60
[ 7.356646] __kasan_slab_free+0x43/0x70
[ 7.356761] kfree+0x1ca/0x430
[ 7.356862] ksmbd_tcp_disconnect+0x59/0xe0
[ 7.356993] ksmbd_conn_handler_loop+0x77e/0xd40
[ 7.357138] kthread+0x346/0x470
[ 7.357240] ret_from_fork+0x4fb/0x6c0
[ 7.357350] ret_from_fork_asm+0x1a/0x30
[ 7.357463]
[ 7.357513] The buggy address belongs to the object at ffff8881056ac000
[ 7.357513] which belongs to the cache kmalloc-1k of size 1024
[ 7.357857] The buggy address is located 396 bytes inside of
[ 7.357857] freed 1024-byte region [ffff8881056ac000, ffff8881056ac400)
Fix by removing the unconditional fp->conn assignment and rejecting the
replay when fp->conn is non-NULL. This is consistent with
ksmbd_lookup_durable_fd(), which also rejects file handles with a
non-NULL fp->conn. For disconnected file handles (fp->conn == NULL),
ksmbd_reopen_durable_fd() handles setting fp->conn.
Fixes: c8efcc786146 ("ksmbd: add support for durable handles v1/v2")
Signed-off-by: Hyunwoo Kim <imv4bel@gmail.com>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit c33615f995aee80657b9fdfbc4ee7f49c2bd733d ]
smb2_get_ksmbd_tcon() reuses work->tcon in compound requests without
validating tcon->t_state. ksmbd_tree_conn_lookup() checks t_state ==
TREE_CONNECTED on the initial lookup path, but the compound reuse path
bypasses this check entirely.
If a prior command in the compound (SMB2_TREE_DISCONNECT) sets t_state
to TREE_DISCONNECTED and frees share_conf via ksmbd_share_config_put(),
subsequent commands dereference the freed share_conf through
work->tcon->share_conf.
KASAN report:
[ 4.144653] ==================================================================
[ 4.145059] BUG: KASAN: slab-use-after-free in smb2_write+0xc74/0xe70
[ 4.145415] Read of size 4 at addr ffff88810430c194 by task kworker/1:1/44
[ 4.145772]
[ 4.145867] CPU: 1 UID: 0 PID: 44 Comm: kworker/1:1 Not tainted 7.0.0-rc3+ #60 PREEMPTLAZY
[ 4.145871] Hardware name: QEMU Ubuntu 24.04 PC v2 (i440FX + PIIX, arch_caps fix, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
[ 4.145875] Workqueue: ksmbd-io handle_ksmbd_work
[ 4.145888] Call Trace:
[ 4.145892] <TASK>
[ 4.145894] dump_stack_lvl+0x64/0x80
[ 4.145910] print_report+0xce/0x660
[ 4.145919] ? __pfx__raw_spin_lock_irqsave+0x10/0x10
[ 4.145928] ? smb2_write+0xc74/0xe70
[ 4.145931] kasan_report+0xce/0x100
[ 4.145934] ? smb2_write+0xc74/0xe70
[ 4.145937] smb2_write+0xc74/0xe70
[ 4.145939] ? __pfx_smb2_write+0x10/0x10
[ 4.145942] ? _raw_spin_unlock+0xe/0x30
[ 4.145945] ? ksmbd_smb2_check_message+0xeb2/0x24c0
[ 4.145948] ? smb2_tree_disconnect+0x31c/0x480
[ 4.145951] handle_ksmbd_work+0x40f/0x1080
[ 4.145953] process_one_work+0x5fa/0xef0
[ 4.145962] ? assign_work+0x122/0x3e0
[ 4.145964] worker_thread+0x54b/0xf70
[ 4.145967] ? __pfx_worker_thread+0x10/0x10
[ 4.145970] kthread+0x346/0x470
[ 4.145976] ? recalc_sigpending+0x19b/0x230
[ 4.145980] ? __pfx_kthread+0x10/0x10
[ 4.145984] ret_from_fork+0x4fb/0x6c0
[ 4.145992] ? __pfx_ret_from_fork+0x10/0x10
[ 4.145995] ? __switch_to+0x36c/0xbe0
[ 4.145999] ? __pfx_kthread+0x10/0x10
[ 4.146003] ret_from_fork_asm+0x1a/0x30
[ 4.146013] </TASK>
[ 4.146014]
[ 4.149858] Allocated by task 44:
[ 4.149953] kasan_save_stack+0x33/0x60
[ 4.150061] kasan_save_track+0x14/0x30
[ 4.150169] __kasan_kmalloc+0x8f/0xa0
[ 4.150274] ksmbd_share_config_get+0x1dd/0xdd0
[ 4.150401] ksmbd_tree_conn_connect+0x7e/0x600
[ 4.150529] smb2_tree_connect+0x2e6/0x1000
[ 4.150645] handle_ksmbd_work+0x40f/0x1080
[ 4.150761] process_one_work+0x5fa/0xef0
[ 4.150873] worker_thread+0x54b/0xf70
[ 4.150978] kthread+0x346/0x470
[ 4.151071] ret_from_fork+0x4fb/0x6c0
[ 4.151176] ret_from_fork_asm+0x1a/0x30
[ 4.151286]
[ 4.151332] Freed by task 44:
[ 4.151418] kasan_save_stack+0x33/0x60
[ 4.151526] kasan_save_track+0x14/0x30
[ 4.151634] kasan_save_free_info+0x3b/0x60
[ 4.151751] __kasan_slab_free+0x43/0x70
[ 4.151861] kfree+0x1ca/0x430
[ 4.151952] __ksmbd_tree_conn_disconnect+0xc8/0x190
[ 4.152088] smb2_tree_disconnect+0x1cd/0x480
[ 4.152211] handle_ksmbd_work+0x40f/0x1080
[ 4.152326] process_one_work+0x5fa/0xef0
[ 4.152438] worker_thread+0x54b/0xf70
[ 4.152545] kthread+0x346/0x470
[ 4.152638] ret_from_fork+0x4fb/0x6c0
[ 4.152743] ret_from_fork_asm+0x1a/0x30
[ 4.152853]
[ 4.152900] The buggy address belongs to the object at ffff88810430c180
[ 4.152900] which belongs to the cache kmalloc-96 of size 96
[ 4.153226] The buggy address is located 20 bytes inside of
[ 4.153226] freed 96-byte region [ffff88810430c180, ffff88810430c1e0)
[ 4.153549]
[ 4.153596] The buggy address belongs to the physical page:
[ 4.153750] page: refcount:0 mapcount:0 mapping:0000000000000000 index:0xffff88810430ce80 pfn:0x10430c
[ 4.154000] flags: 0x100000000000200(workingset|node=0|zone=2)
[ 4.154160] page_type: f5(slab)
[ 4.154251] raw: 0100000000000200 ffff888100041280 ffff888100040110 ffff888100040110
[ 4.154461] raw: ffff88810430ce80 0000000800200009 00000000f5000000 0000000000000000
[ 4.154668] page dumped because: kasan: bad access detected
[ 4.154820]
[ 4.154866] Memory state around the buggy address:
[ 4.155002] ffff88810430c080: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 4.155196] ffff88810430c100: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 4.155391] >ffff88810430c180: fa fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc
[ 4.155587] ^
[ 4.155693] ffff88810430c200: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 4.155891] ffff88810430c280: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 4.156087] ==================================================================
Add the same t_state validation to the compound reuse path, consistent
with ksmbd_tree_conn_lookup().
Fixes: 5005bcb42191 ("ksmbd: validate session id and tree id in the compound request")
Signed-off-by: Hyunwoo Kim <imv4bel@gmail.com>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit fc1cd1f18c34f91e78362f9629ab9fd43b9dcab9 ]
Fix tree-checker error message to report "invalid root drop_level"
instead of the misleading "invalid root level".
Fixes: 259ee7754b67 ("btrfs: tree-checker: Add ROOT_ITEM check")
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: ZhengYuan Huang <gality369@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 9573a365ff9ff45da9222d3fe63695ce562beb24 ]
If we log the parent directory of a conflicting inode, we are not logging
the new dentries of the directory, so when we finish we have the parent
directory's inode marked as logged but we did not log its new dentries.
As a consequence if the parent directory is explicitly fsynced later and
it does not have any new changes since we logged it, the fsync is a no-op
and after a power failure the new dentries are missing.
Example scenario:
$ mkdir foo
$ sync
$rmdir foo
$ mkdir dir1
$ mkdir dir2
# A file with the same name and parent as the directory we just deleted
# and was persisted in a past transaction. So the deleted directory's
# inode is a conflicting inode of this new file's inode.
$ touch foo
$ ln foo dir2/link
# The fsync on dir2 will log the parent directory (".") because the
# conflicting inode (deleted directory) does not exists anymore, but it
# it does not log its new dentries (dir1).
$ xfs_io -c "fsync" dir2
# This fsync on the parent directory is no-op, since the previous fsync
# logged it (but without logging its new dentries).
$ xfs_io -c "fsync" .
<power failure>
# After log replay dir1 is missing.
Fix this by ensuring we log new dir dentries whenever we log the parent
directory of a no longer existing conflicting inode.
A test case for fstests will follow soon.
Reported-by: Vyacheslav Kovalevsky <slava.kovalevskiy.2014@gmail.com>
Link: https://lore.kernel.org/linux-btrfs/182055fa-e9ce-4089-9f5f-4b8a23e8dd91@gmail.com/
Fixes: a3baaf0d786e ("Btrfs: fix fsync after succession of renames and unlink/rmdir")
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 5133b61aaf437e5f25b1b396b14242a6bb0508e2 ]
The NFSv4.0 replay cache uses a fixed 112-byte inline buffer
(rp_ibuf[NFSD4_REPLAY_ISIZE]) to store encoded operation responses.
This size was calculated based on OPEN responses and does not account
for LOCK denied responses, which include the conflicting lock owner as
a variable-length field up to 1024 bytes (NFS4_OPAQUE_LIMIT).
When a LOCK operation is denied due to a conflict with an existing lock
that has a large owner, nfsd4_encode_operation() copies the full encoded
response into the undersized replay buffer via read_bytes_from_xdr_buf()
with no bounds check. This results in a slab-out-of-bounds write of up
to 944 bytes past the end of the buffer, corrupting adjacent heap memory.
This can be triggered remotely by an unauthenticated attacker with two
cooperating NFSv4.0 clients: one sets a lock with a large owner string,
then the other requests a conflicting lock to provoke the denial.
We could fix this by increasing NFSD4_REPLAY_ISIZE to allow for a full
opaque, but that would increase the size of every stateowner, when most
lockowners are not that large.
Instead, fix this by checking the encoded response length against
NFSD4_REPLAY_ISIZE before copying into the replay buffer. If the
response is too large, set rp_buflen to 0 to skip caching the replay
payload. The status is still cached, and the client already received the
correct response on the original request.
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Cc: stable@kernel.org
Reported-by: Nicholas Carlini <npc@anthropic.com>
Tested-by: Nicholas Carlini <npc@anthropic.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
[ replaced `op_status_offset + XDR_UNIT` with existing `post_err_offset` variable ]
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 2d1ababdedd4ba38867c2500eb7f95af5ddeeef7 ]
If we attempt to create several files with names that result in the same
hash, we have to pack them in same dir item and that has a limit inherent
to the leaf size. However if we reach that limit, we trigger a transaction
abort and turns the filesystem into RO mode. This allows for a malicious
user to disrupt a system, without the need to have administration
privileges/capabilities.
Reproducer:
$ cat exploit-hash-collisions.sh
#!/bin/bash
DEV=/dev/sdi
MNT=/mnt/sdi
# Use smallest node size to make the test faster and require fewer file
# names that result in hash collision.
mkfs.btrfs -f --nodesize 4K $DEV
mount $DEV $MNT
# List of names that result in the same crc32c hash for btrfs.
declare -a names=(
'foobar'
'%a8tYkxfGMLWRGr55QSeQc4PBNH9PCLIvR6jZnkDtUUru1t@RouaUe_L:@xGkbO3nCwvLNYeK9vhE628gss:T$yZjZ5l-Nbd6CbC$M=hqE-ujhJICXyIxBvYrIU9-TDC'
'AQci3EUB%shMsg-N%frgU:02ByLs=IPJU0OpgiWit5nexSyxZDncY6WB:=zKZuk5Zy0DD$Ua78%MelgBuMqaHGyKsJUFf9s=UW80PcJmKctb46KveLSiUtNmqrMiL9-Y0I_l5Fnam04CGIg=8@U:Z'
'CvVqJpJzueKcuA$wqwePfyu7VxuWNN3ho$p0zi2H8QFYK$7YlEqOhhb%:hHgjhIjW5vnqWHKNP4'
'ET:vk@rFU4tsvMB0$C_p=xQHaYZjvoF%-BTc%wkFW8yaDAPcCYoR%x$FH5O:'
'HwTon%v7SGSP4FE08jBwwiu5aot2CFKXHTeEAa@38fUcNGOWvE@Mz6WBeDH_VooaZ6AgsXPkVGwy9l@@ZbNXabUU9csiWrrOp0MWUdfi$EZ3w9GkIqtz7I_eOsByOkBOO'
'Ij%2VlFGXSuPvxJGf5UWy6O@1svxGha%b@=%wjkq:CIgE6u7eJOjmQY5qTtxE2Rjbis9@us'
'KBkjG5%9R8K9sOG8UTnAYjxLNAvBmvV5vz3IiZaPmKuLYO03-6asI9lJ_j4@6Xo$KZicaLWJ3Pv8XEwVeUPMwbHYWwbx0pYvNlGMO9F:ZhHAwyctnGy%_eujl%WPd4U2BI7qooOSr85J-C2V$LfY'
'NcRfDfuUQ2=zP8K3CCF5dFcpfiOm6mwenShsAb_F%n6GAGC7fT2JFFn:c35X-3aYwoq7jNX5$ZJ6hI3wnZs$7KgGi7wjulffhHNUxAT0fRRLF39vJ@NvaEMxsMO'
'Oj42AQAEzRoTxa5OuSKIr=A_lwGMy132v4g3Pdq1GvUG9874YseIFQ6QU'
'Ono7avN5GjC:_6dBJ_'
'WHmN2gnmaN-9dVDy4aWo:yNGFzz8qsJyJhWEWcud7$QzN2D9R0efIWWEdu5kwWr73NZm4=@CoCDxrrZnRITr-kGtU_cfW2:%2_am'
'WiFnuTEhAG9FEC6zopQmj-A-$LDQ0T3WULz%ox3UZAPybSV6v1Z$b4L_XBi4M4BMBtJZpz93r9xafpB77r:lbwvitWRyo$odnAUYlYMmU4RvgnNd--e=I5hiEjGLETTtaScWlQp8mYsBovZwM2k'
'XKyH=OsOAF3p%uziGF_ZVr$ivrvhVgD@1u%5RtrV-gl_vqAwHkK@x7YwlxX3qT6WKKQ%PR56NrUBU2dOAOAdzr2=5nJuKPM-T-$ZpQfCL7phxQbUcb:BZOTPaFExc-qK-gDRCDW2'
'd3uUR6OFEwZr%ns1XH_@tbxA@cCPmbBRLdyh7p6V45H$P2$F%w0RqrD3M0g8aGvWpoTFMiBdOTJXjD:JF7=h9a_43xBywYAP%r$SPZi%zDg%ql-KvkdUCtF9OLaQlxmd'
'ePTpbnit%hyNm@WELlpKzNZYOzOTf8EQ$sEfkMy1VOfIUu3coyvIr13-Y7Sv5v-Ivax2Go_GQRFMU1b3362nktT9WOJf3SpT%z8sZmM3gvYQBDgmKI%%RM-G7hyrhgYflOw%z::ZRcv5O:lDCFm'
'evqk743Y@dvZAiG5J05L_ROFV@$2%rVWJ2%3nxV72-W7$e$-SK3tuSHA2mBt$qloC5jwNx33GmQUjD%akhBPu=VJ5g$xhlZiaFtTrjeeM5x7dt4cHpX0cZkmfImndYzGmvwQG:$euFYmXn$_2rA9mKZ'
'gkgUtnihWXsZQTEkrMAWIxir09k3t7jk_IK25t1:cy1XWN0GGqC%FrySdcmU7M8MuPO_ppkLw3=Dfr0UuBAL4%GFk2$Ma10V1jDRGJje%Xx9EV2ERaWKtjpwiZwh0gCSJsj5UL7CR8RtW5opCVFKGGy8Cky'
'hNgsG_8lNRik3PvphqPm0yEH3P%%fYG:kQLY=6O-61Wa6nrV_WVGR6TLB09vHOv%g4VQRP8Gzx7VXUY1qvZyS'
'isA7JVzN12xCxVPJZ_qoLm-pTBuhjjHMvV7o=F:EaClfYNyFGlsfw-Kf%uxdqW-kwk1sPl2vhbjyHU1A6$hz'
'kiJ_fgcdZFDiOptjgH5PN9-PSyLO4fbk_:u5_2tz35lV_iXiJ6cx7pwjTtKy-XGaQ5IefmpJ4N_ZqGsqCsKuqOOBgf9LkUdffHet@Wu'
'lvwtxyhE9:%Q3UxeHiViUyNzJsy:fm38pg_b6s25JvdhOAT=1s0$pG25x=LZ2rlHTszj=gN6M4zHZYr_qrB49i=pA--@WqWLIuX7o1S_SfS@2FSiUZN'
'rC24cw3UBDZ=5qJBUMs9e$=S4Y94ni%Z8639vnrGp=0Hv4z3dNFL0fBLmQ40=EYIY:Z=SLc@QLMSt2zsss2ZXrP7j4='
'uwGl2s-fFrf@GqS=DQqq2I0LJSsOmM%xzTjS:lzXguE3wChdMoHYtLRKPvfaPOZF2fER@j53evbKa7R%A7r4%YEkD=kicJe@SFiGtXHbKe4gCgPAYbnVn'
'UG37U6KKua2bgc:IHzRs7BnB6FD:2Mt5Cc5NdlsW%$1tyvnfz7S27FvNkroXwAW:mBZLA1@qa9WnDbHCDmQmfPMC9z-Eq6QT0jhhPpqyymaD:R02ghwYo%yx7SAaaq-:x33LYpei$5g8DMl3C'
'y2vjek0FE1PDJC0qpfnN:x8k2wCFZ9xiUF2ege=JnP98R%wxjKkdfEiLWvQzmnW'
'8-HCSgH5B%K7P8_jaVtQhBXpBk:pE-$P7ts58U0J@iR9YZntMPl7j$s62yAJO@_9eanFPS54b=UTw$94C-t=HLxT8n6o9P=QnIxq-f1=Ne2dvhe6WbjEQtc'
'YPPh:IFt2mtR6XWSmjHptXL_hbSYu8bMw-JP8@PNyaFkdNFsk$M=xfL6LDKCDM-mSyGA_2MBwZ8Dr4=R1D%7-mCaaKGxb990jzaagRktDTyp'
'9hD2ApKa_t_7x-a@GCG28kY:7$M@5udI1myQ$x5udtggvagmCQcq9QXWRC5hoB0o-_zHQUqZI5rMcz_kbMgvN5jr63LeYA4Cj-c6F5Ugmx6DgVf@2Jqm%MafecpgooqreJ53P-QTS'
)
# Now create files with all those names in the same parent directory.
# It should not fail since a 4K leaf has enough space for them.
for name in "${names[@]}"; do
touch $MNT/$name
done
# Now add one more file name that causes a crc32c hash collision.
# This should fail, but it should not turn the filesystem into RO mode
# (which could be exploited by malicious users) due to a transaction
# abort.
touch $MNT/'W6tIm-VK2@BGC@IBfcgg6j_p:pxp_QUqtWpGD5Ok_GmijKOJJt'
# Check that we are able to create another file, with a name that does not cause
# a crc32c hash collision.
echo -n "hello world" > $MNT/baz
# Unmount and mount again, verify file baz exists and with the right content.
umount $MNT
mount $DEV $MNT
echo "File baz content: $(cat $MNT/baz)"
umount $MNT
When running the reproducer:
$ ./exploit-hash-collisions.sh
(...)
touch: cannot touch '/mnt/sdi/W6tIm-VK2@BGC@IBfcgg6j_p:pxp_QUqtWpGD5Ok_GmijKOJJt': Value too large for defined data type
./exploit-hash-collisions.sh: line 57: /mnt/sdi/baz: Read-only file system
cat: /mnt/sdi/baz: No such file or directory
File baz content:
And the transaction abort stack trace in dmesg/syslog:
$ dmesg
(...)
[758240.509761] ------------[ cut here ]------------
[758240.510668] BTRFS: Transaction aborted (error -75)
[758240.511577] WARNING: fs/btrfs/inode.c:6854 at btrfs_create_new_inode+0x805/0xb50 [btrfs], CPU#6: touch/888644
[758240.513513] Modules linked in: btrfs dm_zero (...)
[758240.523221] CPU: 6 UID: 0 PID: 888644 Comm: touch Tainted: G W 6.19.0-rc8-btrfs-next-225+ #1 PREEMPT(full)
[758240.524621] Tainted: [W]=WARN
[758240.525037] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-0-gea1b7a073390-prebuilt.qemu.org 04/01/2014
[758240.526331] RIP: 0010:btrfs_create_new_inode+0x80b/0xb50 [btrfs]
[758240.527093] Code: 0f 82 cf (...)
[758240.529211] RSP: 0018:ffffce64418fbb48 EFLAGS: 00010292
[758240.529935] RAX: 00000000ffffffd3 RBX: 0000000000000000 RCX: 00000000ffffffb5
[758240.531040] RDX: 0000000d04f33e06 RSI: 00000000ffffffb5 RDI: ffffffffc0919dd0
[758240.531920] RBP: ffffce64418fbc10 R08: 0000000000000000 R09: 00000000ffffffb5
[758240.532928] R10: 0000000000000000 R11: ffff8e52c0000000 R12: ffff8e53eee7d0f0
[758240.533818] R13: ffff8e57f70932a0 R14: ffff8e5417629568 R15: 0000000000000000
[758240.534664] FS: 00007f1959a2a740(0000) GS:ffff8e5b27cae000(0000) knlGS:0000000000000000
[758240.535821] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[758240.536644] CR2: 00007f1959b10ce0 CR3: 000000012a2cc005 CR4: 0000000000370ef0
[758240.537517] Call Trace:
[758240.537828] <TASK>
[758240.538099] btrfs_create_common+0xbf/0x140 [btrfs]
[758240.538760] path_openat+0x111a/0x15b0
[758240.539252] do_filp_open+0xc2/0x170
[758240.539699] ? preempt_count_add+0x47/0xa0
[758240.540200] ? __virt_addr_valid+0xe4/0x1a0
[758240.540800] ? __check_object_size+0x1b3/0x230
[758240.541661] ? alloc_fd+0x118/0x180
[758240.542315] do_sys_openat2+0x70/0xd0
[758240.543012] __x64_sys_openat+0x50/0xa0
[758240.543723] do_syscall_64+0x50/0xf20
[758240.544462] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[758240.545397] RIP: 0033:0x7f1959abc687
[758240.546019] Code: 48 89 fa (...)
[758240.548522] RSP: 002b:00007ffe16ff8690 EFLAGS: 00000202 ORIG_RAX: 0000000000000101
[758240.566278] RAX: ffffffffffffffda RBX: 00007f1959a2a740 RCX: 00007f1959abc687
[758240.567068] RDX: 0000000000000941 RSI: 00007ffe16ffa333 RDI: ffffffffffffff9c
[758240.567860] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
[758240.568707] R10: 00000000000001b6 R11: 0000000000000202 R12: 0000561eec7c4b90
[758240.569712] R13: 0000561eec7c311f R14: 00007ffe16ffa333 R15: 0000000000000000
[758240.570758] </TASK>
[758240.571040] ---[ end trace 0000000000000000 ]---
[758240.571681] BTRFS: error (device sdi state A) in btrfs_create_new_inode:6854: errno=-75 unknown
[758240.572899] BTRFS info (device sdi state EA): forced readonly
Fix this by checking for hash collision, and if the adding a new name is
possible, early in btrfs_create_new_inode() before we do any tree updates,
so that we don't need to abort the transaction if we cannot add the new
name due to the leaf size limit.
A test case for fstests will be sent soon.
Fixes: caae78e03234 ("btrfs: move common inode creation code into btrfs_create_new_inode()")
CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 87f2c46003fce4d739138aab4af1942b1afdadac ]
If the set received ioctl fails due to an item overflow when attempting to
add the BTRFS_UUID_KEY_RECEIVED_SUBVOL we have to abort the transaction
since we did some metadata updates before.
This means that if a user calls this ioctl with the same received UUID
field for a lot of subvolumes, we will hit the overflow, trigger the
transaction abort and turn the filesystem into RO mode. A malicious user
could exploit this, and this ioctl does not even requires that a user
has admin privileges (CAP_SYS_ADMIN), only that he/she owns the subvolume.
Fix this by doing an early check for item overflow before starting a
transaction. This is also race safe because we are holding the subvol_sem
semaphore in exclusive (write) mode.
A test case for fstests will follow soon.
Fixes: dd5f9615fc5c ("Btrfs: maintain subvolume items in the UUID tree")
CC: stable@vger.kernel.org # 3.12+
Reviewed-by: Anand Jain <asj@kernel.org>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[ adapted BTRFS_PATH_AUTO_FREE macro to manual btrfs_free_path calls ]
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit e1b18b959025e6b5dbad668f391f65d34b39595a ]
Currently a user can trigger a transaction abort by snapshotting a
previously received snapshot a bunch of times until we reach a
BTRFS_UUID_KEY_RECEIVED_SUBVOL item overflow (the maximum item size we
can store in a leaf). This is very likely not common in practice, but
if it happens, it turns the filesystem into RO mode. The snapshot, send
and set_received_subvol and subvol_setflags (used by receive) don't
require CAP_SYS_ADMIN, just inode_owner_or_capable(). A malicious user
could use this to turn a filesystem into RO mode and disrupt a system.
Reproducer script:
$ cat test.sh
#!/bin/bash
DEV=/dev/sdi
MNT=/mnt/sdi
# Use smallest node size to make the test faster.
mkfs.btrfs -f --nodesize 4K $DEV
mount $DEV $MNT
# Create a subvolume and set it to RO so that it can be used for send.
btrfs subvolume create $MNT/sv
touch $MNT/sv/foo
btrfs property set $MNT/sv ro true
# Send and receive the subvolume into snaps/sv.
mkdir $MNT/snaps
btrfs send $MNT/sv | btrfs receive $MNT/snaps
# Now snapshot the received subvolume, which has a received_uuid, a
# lot of times to trigger the leaf overflow.
total=500
for ((i = 1; i <= $total; i++)); do
echo -ne "\rCreating snapshot $i/$total"
btrfs subvolume snapshot -r $MNT/snaps/sv $MNT/snaps/sv_$i > /dev/null
done
echo
umount $MNT
When running the test:
$ ./test.sh
(...)
Create subvolume '/mnt/sdi/sv'
At subvol /mnt/sdi/sv
At subvol sv
Creating snapshot 496/500ERROR: Could not create subvolume: Value too large for defined data type
Creating snapshot 497/500ERROR: Could not create subvolume: Read-only file system
Creating snapshot 498/500ERROR: Could not create subvolume: Read-only file system
Creating snapshot 499/500ERROR: Could not create subvolume: Read-only file system
Creating snapshot 500/500ERROR: Could not create subvolume: Read-only file system
And in dmesg/syslog:
$ dmesg
(...)
[251067.627338] BTRFS warning (device sdi): insert uuid item failed -75 (0x4628b21c4ac8d898, 0x2598bee2b1515c91) type 252!
[251067.629212] ------------[ cut here ]------------
[251067.630033] BTRFS: Transaction aborted (error -75)
[251067.630871] WARNING: fs/btrfs/transaction.c:1907 at create_pending_snapshot.cold+0x52/0x465 [btrfs], CPU#10: btrfs/615235
[251067.632851] Modules linked in: btrfs dm_zero (...)
[251067.644071] CPU: 10 UID: 0 PID: 615235 Comm: btrfs Tainted: G W 6.19.0-rc8-btrfs-next-225+ #1 PREEMPT(full)
[251067.646165] Tainted: [W]=WARN
[251067.646733] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-0-gea1b7a073390-prebuilt.qemu.org 04/01/2014
[251067.648735] RIP: 0010:create_pending_snapshot.cold+0x55/0x465 [btrfs]
[251067.649984] Code: f0 48 0f (...)
[251067.653313] RSP: 0018:ffffce644908fae8 EFLAGS: 00010292
[251067.653987] RAX: 00000000ffffff01 RBX: ffff8e5639e63a80 RCX: 00000000ffffffd3
[251067.655042] RDX: ffff8e53faa76b00 RSI: 00000000ffffffb5 RDI: ffffffffc0919750
[251067.656077] RBP: ffffce644908fbd8 R08: 0000000000000000 R09: ffffce644908f820
[251067.657068] R10: ffff8e5adc1fffa8 R11: 0000000000000003 R12: ffff8e53c0431bd0
[251067.658050] R13: ffff8e5414593600 R14: ffff8e55efafd000 R15: 00000000ffffffb5
[251067.659019] FS: 00007f2a4944b3c0(0000) GS:ffff8e5b27dae000(0000) knlGS:0000000000000000
[251067.660115] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[251067.660943] CR2: 00007ffc5aa57898 CR3: 00000005813a2003 CR4: 0000000000370ef0
[251067.661972] Call Trace:
[251067.662292] <TASK>
[251067.662653] create_pending_snapshots+0x97/0xc0 [btrfs]
[251067.663413] btrfs_commit_transaction+0x26e/0xc00 [btrfs]
[251067.664257] ? btrfs_qgroup_convert_reserved_meta+0x35/0x390 [btrfs]
[251067.665238] ? _raw_spin_unlock+0x15/0x30
[251067.665837] ? record_root_in_trans+0xa2/0xd0 [btrfs]
[251067.666531] btrfs_mksubvol+0x330/0x580 [btrfs]
[251067.667145] btrfs_mksnapshot+0x74/0xa0 [btrfs]
[251067.667827] __btrfs_ioctl_snap_create+0x194/0x1d0 [btrfs]
[251067.668595] btrfs_ioctl_snap_create_v2+0x107/0x130 [btrfs]
[251067.669479] btrfs_ioctl+0x1580/0x2690 [btrfs]
[251067.670093] ? count_memcg_events+0x6d/0x180
[251067.670849] ? handle_mm_fault+0x1a0/0x2a0
[251067.671652] __x64_sys_ioctl+0x92/0xe0
[251067.672406] do_syscall_64+0x50/0xf20
[251067.673129] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[251067.674096] RIP: 0033:0x7f2a495648db
[251067.674812] Code: 00 48 89 (...)
[251067.678227] RSP: 002b:00007ffc5aa57840 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[251067.679691] RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007f2a495648db
[251067.681145] RDX: 00007ffc5aa588b0 RSI: 0000000050009417 RDI: 0000000000000004
[251067.682511] RBP: 0000000000000002 R08: 0000000000000000 R09: 0000000000000000
[251067.683842] R10: 000000000000000a R11: 0000000000000246 R12: 00007ffc5aa59910
[251067.685176] R13: 00007ffc5aa588b0 R14: 0000000000000004 R15: 0000000000000006
[251067.686524] </TASK>
[251067.686972] ---[ end trace 0000000000000000 ]---
[251067.687890] BTRFS: error (device sdi state A) in create_pending_snapshot:1907: errno=-75 unknown
[251067.689049] BTRFS info (device sdi state EA): forced readonly
[251067.689054] BTRFS warning (device sdi state EA): Skipping commit of aborted transaction.
[251067.690119] BTRFS: error (device sdi state EA) in cleanup_transaction:2043: errno=-75 unknown
[251067.702028] BTRFS info (device sdi state EA): last unmount of filesystem 46dc3975-30a2-4a69-a18f-418b859cccda
Fix this by ignoring -EOVERFLOW errors from btrfs_uuid_tree_add() in the
snapshot creation code when attempting to add the
BTRFS_UUID_KEY_RECEIVED_SUBVOL item. This is OK because it's not critical
and we are still able to delete the snapshot, as snapshot/subvolume
deletion ignores if a BTRFS_UUID_KEY_RECEIVED_SUBVOL is missing (see
inode.c:btrfs_delete_subvolume()). As for send/receive, we can still do
send/receive operations since it always peeks the first root ID in the
existing BTRFS_UUID_KEY_RECEIVED_SUBVOL (it could peek any since all
snapshots have the same content), and even if the key is missing, it
falls back to searching by BTRFS_UUID_KEY_SUBVOL key.
A test case for fstests will be sent soon.
Fixes: dd5f9615fc5c ("Btrfs: maintain subvolume items in the UUID tree")
CC: stable@vger.kernel.org # 3.12+
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[ adapted error check condition to omit unlikely() wrapper ]
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 282343cf8a4a5a3603b1cb0e17a7083e4a593b03 upstream.
When a multichannel SMB2_SESSION_SETUP request with
SMB2_SESSION_REQ_FLAG_BINDING fails ksmbd sets conn->binding = true
but never clears it on the error path. This leaves the connection in
a binding state where all subsequent ksmbd_session_lookup_all() calls
fall back to the global sessions table. This fix it by clearing
conn->binding = false in the error path.
Cc: stable@vger.kernel.org
Reported-by: Hyunwoo Kim <imv4bel@gmail.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 12b4c5d98cd7ca46d5035a57bcd995df614c14e1 upstream.
Customer reported that some of their krb5 mounts were failing against
a single server as the client was trying to mount the shares with
wrong credentials. It turned out the client was reusing SMB session
from first mount to try mounting the other shares, even though a
different username= option had been specified to the other mounts.
By using username mount option along with sec=krb5 to search for
principals from keytab is supported by cifs.upcall(8) since
cifs-utils-4.8. So fix this by matching username mount option in
match_session() even with Kerberos.
For example, the second mount below should fail with -ENOKEY as there
is no 'foobar' principal in keytab (/etc/krb5.keytab). The client
ends up reusing SMB session from first mount to perform the second
one, which is wrong.
```
$ ktutil
ktutil: add_entry -password -p testuser -k 1 -e aes256-cts
Password for testuser@ZELDA.TEST:
ktutil: write_kt /etc/krb5.keytab
ktutil: quit
$ klist -ke
Keytab name: FILE:/etc/krb5.keytab
KVNO Principal
---- ----------------------------------------------------------------
1 testuser@ZELDA.TEST (aes256-cts-hmac-sha1-96)
$ mount.cifs //w22-root2/scratch /mnt/1 -o sec=krb5,username=testuser
$ mount.cifs //w22-root2/scratch /mnt/2 -o sec=krb5,username=foobar
$ mount -t cifs | grep -Po 'username=\K\w+'
testuser
testuser
```
Reported-by: Oscar Santos <ossantos@redhat.com>
Signed-off-by: Paulo Alcantara (Red Hat) <pc@manguebit.org>
Cc: David Howells <dhowells@redhat.com>
Cc: linux-cifs@vger.kernel.org
Cc: stable@vger.kernel.org
Signed-off-by: Steve French <stfrench@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit e7fcf179b82d3a3730fd8615da01b087cc654d0b upstream.
The /proc/fs/nfs/exports proc entry is created at module init
and persists for the module's lifetime. exports_proc_open()
captures the caller's current network namespace and stores
its svc_export_cache in seq->private, but takes no reference
on the namespace. If the namespace is subsequently torn down
(e.g. container destruction after the opener does setns() to a
different namespace), nfsd_net_exit() calls nfsd_export_shutdown()
which frees the cache. Subsequent reads on the still-open fd
dereference the freed cache_detail, walking a freed hash table.
Hold a reference on the struct net for the lifetime of the open
file descriptor. This prevents nfsd_net_exit() from running --
and thus prevents nfsd_export_shutdown() from freeing the cache
-- while any exports fd is open. cache_detail already stores
its net pointer (cd->net, set by cache_create_net()), so
exports_release() can retrieve it without additional per-file
storage.
Reported-by: Misbah Anjum N <misanjum@linux.ibm.com>
Closes: https://lore.kernel.org/linux-nfs/dcd371d3a95815a84ba7de52cef447b8@linux.ibm.com/
Fixes: 96d851c4d28d ("nfsd: use proper net while reading "exports" file")
Cc: stable@vger.kernel.org
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: NeilBrown <neil@brown.name>
Tested-by: Olga Kornievskaia <okorniev@redhat.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 90f601b497d76f40fa66795c3ecf625b6aced9fd ]
bm_register_write() opens an executable file using open_exec(), which
internally calls do_open_execat() and denies write access on the file to
avoid modification while it is being executed.
However, when an error occurs, bm_register_write() closes the file using
filp_close() directly. This does not restore the write permission, which
may cause subsequent write operations on the same file to fail.
Fix this by calling exe_file_allow_write_access() before filp_close() to
restore the write permission properly.
Fixes: e7850f4d844e ("binfmt_misc: fix possible deadlock in bm_register_write")
Signed-off-by: Zilin Guan <zilin@seu.edu.cn>
Link: https://patch.msgid.link/20251105022923.1813587-1-zilin@seu.edu.cn
Signed-off-by: Christian Brauner <brauner@kernel.org>
[ Use allow_write_access() instead of exe_file_allow_write_access()
according to commit 0357ef03c94ef
("fs: don't block write during exec on pre-content watched files"). ]
Signed-off-by: Robert Garcia <rob_garcia@163.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 773704c1ef96a8b70d0d186ab725f50548de82c4 ]
w/ below testcase, it will cause inconsistence in between SIT and SSA.
create_null_blk 512 2 1024 1024
mkfs.f2fs -m /dev/nullb0
mount /dev/nullb0 /mnt/f2fs/
touch /mnt/f2fs/file
f2fs_io pinfile set /mnt/f2fs/file
fallocate -l 4GiB /mnt/f2fs/file
F2FS-fs (nullb0): Inconsistent segment (0) type [1, 0] in SSA and SIT
CPU: 5 UID: 0 PID: 2398 Comm: fallocate Tainted: G O 6.13.0-rc1 #84
Tainted: [O]=OOT_MODULE
Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
Call Trace:
<TASK>
dump_stack_lvl+0xb3/0xd0
dump_stack+0x14/0x20
f2fs_handle_critical_error+0x18c/0x220 [f2fs]
f2fs_stop_checkpoint+0x38/0x50 [f2fs]
do_garbage_collect+0x674/0x6e0 [f2fs]
f2fs_gc_range+0x12b/0x230 [f2fs]
f2fs_allocate_pinning_section+0x5c/0x150 [f2fs]
f2fs_expand_inode_data+0x1cc/0x3c0 [f2fs]
f2fs_fallocate+0x3c3/0x410 [f2fs]
vfs_fallocate+0x15f/0x4b0
__x64_sys_fallocate+0x4a/0x80
x64_sys_call+0x15e8/0x1b80
do_syscall_64+0x68/0x130
entry_SYSCALL_64_after_hwframe+0x67/0x6f
RIP: 0033:0x7f9dba5197ca
F2FS-fs (nullb0): Stopped filesystem due to reason: 4
The reason is f2fs_gc_range() may try to migrate block in curseg, however,
its SSA block is not uptodate due to the last summary block data is still
in cache of curseg.
In this patch, we add a condition in f2fs_gc_range() to check whether
section is opened or not, and skip block migration for opened section.
Fixes: 9703d69d9d15 ("f2fs: support file pinning for zoned devices")
Reviewed-by: Daeho Jeong <daehojeong@google.com>
Cc: Daeho Jeong <daehojeong@google.com>
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
[ Minor conflict resolved. ]
Signed-off-by: Li hongliang <1468888505@139.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 4e159150a9a56d66d247f4b5510bed46fe58aa1c ]
[BUG]
There is an internal report that over 1000 processes are
waiting at the io_schedule_timeout() of balance_dirty_pages(), causing
a system hang and trigger a kernel coredump.
The kernel is v6.4 kernel based, but the root problem still applies to
any upstream kernel before v6.18.
[CAUSE]
>From Jan Kara for his wisdom on the dirty page balance behavior first.
This cgroup dirty limit was what was actually playing the role here
because the cgroup had only a small amount of memory and so the dirty
limit for it was something like 16MB.
Dirty throttling is responsible for enforcing that nobody can dirty
(significantly) more dirty memory than there's dirty limit. Thus when
a task is dirtying pages it periodically enters into balance_dirty_pages()
and we let it sleep there to slow down the dirtying.
When the system is over dirty limit already (either globally or within
a cgroup of the running task), we will not let the task exit from
balance_dirty_pages() until the number of dirty pages drops below the
limit.
So in this particular case, as I already mentioned, there was a cgroup
with relatively small amount of memory and as a result with dirty limit
set at 16MB. A task from that cgroup has dirtied about 28MB worth of
pages in btrfs btree inode and these were practically the only dirty
pages in that cgroup.
So that means the only way to reduce the dirty pages of that cgroup is
to writeback the dirty pages of btrfs btree inode, and only after that
those processes can exit balance_dirty_pages().
Now back to the btrfs part, btree_writepages() is responsible for
writing back dirty btree inode pages.
The problem here is, there is a btrfs internal threshold that if the
btree inode's dirty bytes are below the 32M threshold, it will not
do any writeback.
This behavior is to batch as much metadata as possible so we won't write
back those tree blocks and then later re-COW them again for another
modification.
This internal 32MiB is higher than the existing dirty page size (28MiB),
meaning no writeback will happen, causing a deadlock between btrfs and
cgroup:
- Btrfs doesn't want to write back btree inode until more dirty pages
- Cgroup/MM doesn't want more dirty pages for btrfs btree inode
Thus any process touching that btree inode is put into sleep until
the number of dirty pages is reduced.
Thanks Jan Kara a lot for the analysis of the root cause.
[ENHANCEMENT]
Since kernel commit b55102826d7d ("btrfs: set AS_KERNEL_FILE on the
btree_inode"), btrfs btree inode pages will only be charged to the root
cgroup which should have a much larger limit than btrfs' 32MiB
threshold.
So it should not affect newer kernels.
But for all current LTS kernels, they are all affected by this problem,
and backporting the whole AS_KERNEL_FILE may not be a good idea.
Even for newer kernels I still think it's a good idea to get
rid of the internal threshold at btree_writepages(), since for most cases
cgroup/MM has a better view of full system memory usage than btrfs' fixed
threshold.
For internal callers using btrfs_btree_balance_dirty() since that
function is already doing internal threshold check, we don't need to
bother them.
But for external callers of btree_writepages(), just respect their
requests and write back whatever they want, ignoring the internal
btrfs threshold to avoid such deadlock on btree inode dirty page
balancing.
CC: stable@vger.kernel.org
CC: Jan Kara <jack@suse.cz>
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[ The context change is due to the commit 41044b41ad2c
("btrfs: add helper to get fs_info from struct inode pointer")
in v6.9 and the commit c66f2afc7148
("btrfs: remove pointless writepages callback wrapper")
in v6.10 which are irrelevant to the logic of this patch. ]
Signed-off-by: Rahul Sharma <black.hawk@163.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 7fd8720dff2d9c70cf5a1a13b7513af01952ec02 upstream.
Since commit 222f2c7c6d14 ("iomap: always run error completions in user
context"), read error completions are deferred to s_dio_done_wq. This
means the workqueue also needs to be allocated for async reads.
Fixes: 222f2c7c6d14 ("iomap: always run error completions in user context")
Reported-by: syzbot+a2b9a4ed0d61b1efb3f5@syzkaller.appspotmail.com
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20251124140013.902853-1-hch@lst.de
Tested-by: syzbot+a2b9a4ed0d61b1efb3f5@syzkaller.appspotmail.com
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Chen Yu <xnguchen@sina.cn>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 4865c768b563deff1b6a6384e74a62f143427b42 ]
For filesystems with more than 2^32 blocks inodes using indirect block
based format cannot use blocks beyond the 32-bit limit.
ext4_mb_scan_groups_linear() takes care to not select these unsupported
groups for such inodes however other functions selecting groups for
allocation don't. So far this is harmless because the other selection
functions are used only with mb_optimize_scan and this is currently
disabled for inodes with indirect blocks however in the following patch
we want to enable mb_optimize_scan regardless of inode format.
Reviewed-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Acked-by: Pedro Falcato <pfalcato@suse.de>
Cc: stable@kernel.org
Link: https://patch.msgid.link/20260114182836.14120-3-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
[ Drop a few hunks not needed in older trees ]
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 857bf9056291a16785ae3be1d291026b2437fc48 ]
Ben Coddington reports seeing a hang in the following stack trace:
0 [ffffd0b50e1774e0] __schedule at ffffffff9ca05415
1 [ffffd0b50e177548] schedule at ffffffff9ca05717
2 [ffffd0b50e177558] bit_wait at ffffffff9ca061e1
3 [ffffd0b50e177568] __wait_on_bit at ffffffff9ca05cfb
4 [ffffd0b50e1775c8] out_of_line_wait_on_bit at ffffffff9ca05ea5
5 [ffffd0b50e177618] pnfs_roc at ffffffffc154207b [nfsv4]
6 [ffffd0b50e1776b8] _nfs4_proc_delegreturn at ffffffffc1506586 [nfsv4]
7 [ffffd0b50e177788] nfs4_proc_delegreturn at ffffffffc1507480 [nfsv4]
8 [ffffd0b50e1777f8] nfs_do_return_delegation at ffffffffc1523e41 [nfsv4]
9 [ffffd0b50e177838] nfs_inode_set_delegation at ffffffffc1524a75 [nfsv4]
10 [ffffd0b50e177888] nfs4_process_delegation at ffffffffc14f41dd [nfsv4]
11 [ffffd0b50e1778a0] _nfs4_opendata_to_nfs4_state at ffffffffc1503edf [nfsv4]
12 [ffffd0b50e1778c0] _nfs4_open_and_get_state at ffffffffc1504e56 [nfsv4]
13 [ffffd0b50e177978] _nfs4_do_open at ffffffffc15051b8 [nfsv4]
14 [ffffd0b50e1779f8] nfs4_do_open at ffffffffc150559c [nfsv4]
15 [ffffd0b50e177a80] nfs4_atomic_open at ffffffffc15057fb [nfsv4]
16 [ffffd0b50e177ad0] nfs4_file_open at ffffffffc15219be [nfsv4]
17 [ffffd0b50e177b78] do_dentry_open at ffffffff9c09e6ea
18 [ffffd0b50e177ba8] vfs_open at ffffffff9c0a082e
19 [ffffd0b50e177bd0] dentry_open at ffffffff9c0a0935
The issue is that the delegreturn is being asked to wait for a layout
return that cannot complete because a state recovery was initiated. The
state recovery cannot complete until the open() finishes processing the
delegations it was given.
The solution is to propagate the existing flags that indicate a
non-blocking call to the function pnfs_roc(), so that it knows not to
wait in this situation.
Reported-by: Benjamin Coddington <bcodding@hammerspace.com>
Fixes: 29ade5db1293 ("pNFS: Wait on outstanding layoutreturns to complete in pnfs_roc()")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
[ Minor conflict resolved. ]
Signed-off-by: Li hongliang <1468888505@139.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit cce0be6eb4971456b703aaeafd571650d314bcca ]
Wang Zhaolong reports a deadlock involving NFSv4.1 state recovery
waiting on kthreadd, which is attempting to reclaim memory by calling
nfs_release_folio(). The latter cannot make progress due to state
recovery being needed.
It seems that the only safe thing to do here is to kick off a writeback
of the folio, without waiting for completion, or else kicking off an
asynchronous commit.
Reported-by: Wang Zhaolong <wangzhaolong@huaweicloud.com>
Fixes: 96780ca55e3c ("NFS: fix up nfs_release_folio() to try to release the page")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
[ Minor conflict resolved. ]
Signed-off-by: Li hongliang <1468888505@139.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit fada32ed6dbc748f447c8d050a961b75d946055a ]
nfs_folio_length is unsafe to use without having the folio locked and a
check for a NULL ->f_mapping that protects against truncations and can
lead to kernel crashes. E.g. when running xfstests generic/065 with
all nfs trace points enabled.
Follow the model of the XFS trace points and pass in an explŃ–cit offset
and length. This has the additional benefit that these values can
be more accurate as some of the users touch partial folio ranges.
Fixes: eb5654b3b89d ("NFS: Enable tracing of nfs_invalidate_folio() and nfs_launder_folio()")
Reported-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
[ Minor conflict resolved. ]
Signed-off-by: Li hongliang <1468888505@139.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|