summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)AuthorFilesLines
3 daysext4: handle wraparound when searching for blocks for indirect mapped blocksTheodore Ts'o1-0/+2
[ Upstream commit bb81702370fad22c06ca12b6e1648754dbc37e0f ] Commit 4865c768b563 ("ext4: always allocate blocks only from groups inode can use") restricts what blocks will be allocated for indirect block based files to block numbers that fit within 32-bit block numbers. However, when using a review bot running on the latest Gemini LLM to check this commit when backporting into an LTS based kernel, it raised this concern: If ac->ac_g_ex.fe_group is >= ngroups (for instance, if the goal group was populated via stream allocation from s_mb_last_groups), then start will be >= ngroups. Does this allow allocating blocks beyond the 32-bit limit for indirect block mapped files? The commit message mentions that ext4_mb_scan_groups_linear() takes care to not select unsupported groups. However, its loop uses group = *start, and the very first iteration will call ext4_mb_scan_group() with this unsupported group because next_linear_group() is only called at the end of the iteration. After reviewing the code paths involved and considering the LLM review, I determined that this can happen when there is a file system where some files/directories are extent-mapped and others are indirect-block mapped. To address this, add a safety clamp in ext4_mb_scan_groups(). Fixes: 4865c768b563 ("ext4: always allocate blocks only from groups inode can use") Cc: Jan Kara <jack@suse.cz> Reviewed-by: Baokun Li <libaokun@linux.alibaba.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Link: https://patch.msgid.link/20260326045834.1175822-1-tytso@mit.edu Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
3 daysext4: publish jinode after initializationLi Chen2-6/+13
[ Upstream commit 1aec30021edd410b986c156f195f3d23959a9d11 ] ext4_inode_attach_jinode() publishes ei->jinode to concurrent users. It used to set ei->jinode before jbd2_journal_init_jbd_inode(), allowing a reader to observe a non-NULL jinode with i_vfs_inode still unset. The fast commit flush path can then pass this jinode to jbd2_wait_inode_data(), which dereferences i_vfs_inode->i_mapping and may crash. Below is the crash I observe: ``` BUG: unable to handle page fault for address: 000000010beb47f4 PGD 110e51067 P4D 110e51067 PUD 0 Oops: Oops: 0000 [#1] SMP NOPTI CPU: 1 UID: 0 PID: 4850 Comm: fc_fsync_bench_ Not tainted 6.18.0-00764-g795a690c06a5 #1 PREEMPT(voluntary) Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.17.0-2-2 04/01/2014 RIP: 0010:xas_find_marked+0x3d/0x2e0 Code: e0 03 48 83 f8 02 0f 84 f0 01 00 00 48 8b 47 08 48 89 c3 48 39 c6 0f 82 fd 01 00 00 48 85 c9 74 3d 48 83 f9 03 77 63 4c 8b 0f <49> 8b 71 08 48 c7 47 18 00 00 00 00 48 89 f1 83 e1 03 48 83 f9 02 RSP: 0018:ffffbbee806e7bf0 EFLAGS: 00010246 RAX: 000000000010beb4 RBX: 000000000010beb4 RCX: 0000000000000003 RDX: 0000000000000001 RSI: 0000002000300000 RDI: ffffbbee806e7c10 RBP: 0000000000000001 R08: 0000002000300000 R09: 000000010beb47ec R10: ffff9ea494590090 R11: 0000000000000000 R12: 0000002000300000 R13: ffffbbee806e7c90 R14: ffff9ea494513788 R15: ffffbbee806e7c88 FS: 00007fc2f9e3e6c0(0000) GS:ffff9ea6b1444000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000000010beb47f4 CR3: 0000000119ac5000 CR4: 0000000000750ef0 PKRU: 55555554 Call Trace: <TASK> filemap_get_folios_tag+0x87/0x2a0 __filemap_fdatawait_range+0x5f/0xd0 ? srso_alias_return_thunk+0x5/0xfbef5 ? __schedule+0x3e7/0x10c0 ? srso_alias_return_thunk+0x5/0xfbef5 ? srso_alias_return_thunk+0x5/0xfbef5 ? srso_alias_return_thunk+0x5/0xfbef5 ? preempt_count_sub+0x5f/0x80 ? srso_alias_return_thunk+0x5/0xfbef5 ? cap_safe_nice+0x37/0x70 ? srso_alias_return_thunk+0x5/0xfbef5 ? preempt_count_sub+0x5f/0x80 ? srso_alias_return_thunk+0x5/0xfbef5 filemap_fdatawait_range_keep_errors+0x12/0x40 ext4_fc_commit+0x697/0x8b0 ? ext4_file_write_iter+0x64b/0x950 ? srso_alias_return_thunk+0x5/0xfbef5 ? preempt_count_sub+0x5f/0x80 ? srso_alias_return_thunk+0x5/0xfbef5 ? vfs_write+0x356/0x480 ? srso_alias_return_thunk+0x5/0xfbef5 ? preempt_count_sub+0x5f/0x80 ext4_sync_file+0xf7/0x370 do_fsync+0x3b/0x80 ? syscall_trace_enter+0x108/0x1d0 __x64_sys_fdatasync+0x16/0x20 do_syscall_64+0x62/0x2c0 entry_SYSCALL_64_after_hwframe+0x76/0x7e ... ``` Fix this by initializing the jbd2_inode first. Use smp_wmb() and WRITE_ONCE() to publish ei->jinode after initialization. Readers use READ_ONCE() to fetch the pointer. Fixes: a361293f5fede ("jbd2: Fix oops in jbd2_journal_file_inode()") Cc: stable@vger.kernel.org Signed-off-by: Li Chen <me@linux.beauty> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20260225082617.147957-1-me@linux.beauty Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org [ adapted READ_ONCE(ei->jinode) to use pos->jinode ] Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
3 daysgfs2: Validate i_depth for exhash directoriesAndrew Price2-4/+8
[ Upstream commit 557c024ca7250bb65ae60f16c02074106c2f197b ] A fuzzer test introduced corruption that ends up with a depth of 0 in dir_e_read(), causing an undefined shift by 32 at: index = hash >> (32 - dip->i_depth); As calculated in an open-coded way in dir_make_exhash(), the minimum depth for an exhash directory is ilog2(sdp->sd_hash_ptrs) and 0 is invalid as sdp->sd_hash_ptrs is fixed as sdp->bsize / 16 at mount time. So we can avoid the undefined behaviour by checking for depth values lower than the minimum in gfs2_dinode_in(). Values greater than the maximum are already being checked for there. Also switch the calculation in dir_make_exhash() to use ilog2() to clarify how the depth is calculated. Tested with the syzkaller repro.c and xfstests '-g quick'. Reported-by: syzbot+4708579bb230a0582a57@syzkaller.appspotmail.com Signed-off-by: Andrew Price <anprice@redhat.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
3 daysgfs2: Improve gfs2_consist_inode() usageAndrew Price3-40/+53
[ Upstream commit 10398ef57aa189153406c110f5957145030f08fe ] gfs2_consist_inode() logs an error message with the source file and line number. When we jump before calling it, the line number becomes less useful as it no longer relates to the source of the error. To aid troubleshooting, replace the gotos with the gfs2_consist_inode() calls so that the error messages are more informative. Signed-off-by: Andrew Price <anprice@redhat.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Ruohan Lan <ruohanlan@aliyun.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
3 daysbtrfs: do not free data reservation in fallback from inline due to -ENOSPCFilipe Manana1-1/+5
[ Upstream commit f8da41de0bff9eb1d774a7253da0c9f637c4470a ] If we fail to create an inline extent due to -ENOSPC, we will attempt to go through the normal COW path, reserve an extent, create an ordered extent, etc. However we were always freeing the reserved qgroup data, which is wrong since we will use data. Fix this by freeing the reserved qgroup data in __cow_file_range_inline() only if we are not doing the fallback (ret is <= 0). Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
3 daysbtrfs: fix the qgroup data free range for inline data extentsQu Wenruo1-1/+1
[ Upstream commit 0bb067ca64e35536f1f5d9ef6aaafc40f4833623 ] Inside function __cow_file_range_inline() since the inlined data no longer take any data space, we need to free up the reserved space. However the code is still using the old page size == sector size assumption, and will not handle subpage case well. Thankfully it is not going to cause any problems because we have two extra safe nets: - Inline data extents creation is disabled for sector size < page size cases for now But it won't stay that for long. - btrfs_qgroup_free_data() will only clear ranges which have been already reserved So even if we pass a range larger than what we need, it should still be fine, especially there is only reserved space for a single block at file offset 0 of an inline data extent. But just for the sake of consistency, fix the call site to use sectorsize instead of page size. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Stable-dep-of: f8da41de0bff ("btrfs: do not free data reservation in fallback from inline due to -ENOSPC") Signed-off-by: Sasha Levin <sashal@kernel.org>
3 daysbtrfs: reject root items with drop_progress and zero drop_levelZhengYuan Huang1-0/+17
[ Upstream commit b17b79ff896305fd74980a5f72afec370ee88ca4 ] [BUG] When recovering relocation at mount time, merge_reloc_root() and btrfs_drop_snapshot() both use BUG_ON(level == 0) to guard against an impossible state: a non-zero drop_progress combined with a zero drop_level in a root_item, which can be triggered: ------------[ cut here ]------------ kernel BUG at fs/btrfs/relocation.c:1545! Oops: invalid opcode: 0000 [#1] SMP KASAN NOPTI CPU: 1 UID: 0 PID: 283 ... Tainted: 6.18.0+ #16 PREEMPT(voluntary) Tainted: [O]=OOT_MODULE, [E]=UNSIGNED_MODULE Hardware name: QEMU Ubuntu 24.04 PC v2, BIOS 1.16.3-debian-1.16.3-2 RIP: 0010:merge_reloc_root+0x1266/0x1650 fs/btrfs/relocation.c:1545 Code: ffff0000 00004589 d7e9acfa ffffe8a1 79bafebe 02000000 Call Trace: merge_reloc_roots+0x295/0x890 fs/btrfs/relocation.c:1861 btrfs_recover_relocation+0xd6e/0x11d0 fs/btrfs/relocation.c:4195 btrfs_start_pre_rw_mount+0xa4d/0x1810 fs/btrfs/disk-io.c:3130 open_ctree+0x5824/0x5fe0 fs/btrfs/disk-io.c:3640 btrfs_fill_super fs/btrfs/super.c:987 [inline] btrfs_get_tree_super fs/btrfs/super.c:1951 [inline] btrfs_get_tree_subvol fs/btrfs/super.c:2094 [inline] btrfs_get_tree+0x111c/0x2190 fs/btrfs/super.c:2128 vfs_get_tree+0x9a/0x370 fs/super.c:1758 fc_mount fs/namespace.c:1199 [inline] do_new_mount_fc fs/namespace.c:3642 [inline] do_new_mount fs/namespace.c:3718 [inline] path_mount+0x5b8/0x1ea0 fs/namespace.c:4028 do_mount fs/namespace.c:4041 [inline] __do_sys_mount fs/namespace.c:4229 [inline] __se_sys_mount fs/namespace.c:4206 [inline] __x64_sys_mount+0x282/0x320 fs/namespace.c:4206 ... RIP: 0033:0x7f969c9a8fde Code: 0f1f4000 48c7c2b0 fffffff7 d8648902 b8ffffff ffc3660f ---[ end trace 0000000000000000 ]--- The bug is reproducible on 7.0.0-rc2-next-20260310 with our dynamic metadata fuzzing tool that corrupts btrfs metadata at runtime. [CAUSE] A non-zero drop_progress.objectid means an interrupted btrfs_drop_snapshot() left a resume point on disk, and in that case drop_level must be greater than 0 because the checkpoint is only saved at internal node levels. Although this invariant is enforced when the kernel writes the root item, it is not validated when the root item is read back from disk. That allows on-disk corruption to provide an invalid state with drop_progress.objectid != 0 and drop_level == 0. When relocation recovery later processes such a root item, merge_reloc_root() reads drop_level and hits BUG_ON(level == 0). The same invalid metadata can also trigger the corresponding BUG_ON() in btrfs_drop_snapshot(). [FIX] Fix this by validating the root_item invariant in tree-checker when reading root items from disk: if drop_progress.objectid is non-zero, drop_level must also be non-zero. Reject such malformed metadata with -EUCLEAN before it reaches merge_reloc_root() or btrfs_drop_snapshot() and triggers the BUG_ON. After the fix, the same corruption is correctly rejected by tree-checker and the BUG_ON is no longer triggered. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: ZhengYuan Huang <gality369@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
3 daysbtrfs: don't take device_list_mutex when querying zone infoJohannes Thumshirn1-2/+4
[ Upstream commit 77603ab10429fe713a03345553ca8dbbfb1d91c6 ] Shin'ichiro reported sporadic hangs when running generic/013 in our CI system. When enabling lockdep, there is a lockdep splat when calling btrfs_get_dev_zone_info_all_devices() in the mount path that can be triggered by i.e. generic/013: ====================================================== WARNING: possible circular locking dependency detected 7.0.0-rc1+ #355 Not tainted ------------------------------------------------------ mount/1043 is trying to acquire lock: ffff8881020b5470 (&vblk->vdev_mutex){+.+.}-{4:4}, at: virtblk_report_zones+0xda/0x430 but task is already holding lock: ffff888102a738e0 (&fs_devs->device_list_mutex){+.+.}-{4:4}, at: btrfs_get_dev_zone_info_all_devices+0x45/0x90 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #4 (&fs_devs->device_list_mutex){+.+.}-{4:4}: __mutex_lock+0xa3/0x1360 btrfs_create_pending_block_groups+0x1f4/0x9d0 __btrfs_end_transaction+0x3e/0x2e0 btrfs_zoned_reserve_data_reloc_bg+0x2f8/0x390 open_ctree+0x1934/0x23db btrfs_get_tree.cold+0x105/0x26c vfs_get_tree+0x28/0xb0 __do_sys_fsconfig+0x324/0x680 do_syscall_64+0x92/0x4f0 entry_SYSCALL_64_after_hwframe+0x76/0x7e -> #3 (btrfs_trans_num_extwriters){++++}-{0:0}: join_transaction+0xc2/0x5c0 start_transaction+0x17c/0xbc0 btrfs_zoned_reserve_data_reloc_bg+0x2b4/0x390 open_ctree+0x1934/0x23db btrfs_get_tree.cold+0x105/0x26c vfs_get_tree+0x28/0xb0 __do_sys_fsconfig+0x324/0x680 do_syscall_64+0x92/0x4f0 entry_SYSCALL_64_after_hwframe+0x76/0x7e -> #2 (btrfs_trans_num_writers){++++}-{0:0}: lock_release+0x163/0x4b0 __btrfs_end_transaction+0x1c7/0x2e0 btrfs_dirty_inode+0x6f/0xd0 touch_atime+0xe5/0x2c0 btrfs_file_mmap_prepare+0x65/0x90 __mmap_region+0x4b9/0xf00 mmap_region+0xf7/0x120 do_mmap+0x43d/0x610 vm_mmap_pgoff+0xd6/0x190 ksys_mmap_pgoff+0x7e/0xc0 do_syscall_64+0x92/0x4f0 entry_SYSCALL_64_after_hwframe+0x76/0x7e -> #1 (&mm->mmap_lock){++++}-{4:4}: __might_fault+0x68/0xa0 _copy_to_user+0x22/0x70 blkdev_copy_zone_to_user+0x22/0x40 virtblk_report_zones+0x282/0x430 blkdev_report_zones_ioctl+0xfd/0x130 blkdev_ioctl+0x20f/0x2c0 __x64_sys_ioctl+0x86/0xd0 do_syscall_64+0x92/0x4f0 entry_SYSCALL_64_after_hwframe+0x76/0x7e -> #0 (&vblk->vdev_mutex){+.+.}-{4:4}: __lock_acquire+0x1522/0x2680 lock_acquire+0xd5/0x2f0 __mutex_lock+0xa3/0x1360 virtblk_report_zones+0xda/0x430 blkdev_report_zones_cached+0x162/0x190 btrfs_get_dev_zones+0xdc/0x2e0 btrfs_get_dev_zone_info+0x219/0xe80 btrfs_get_dev_zone_info_all_devices+0x62/0x90 open_ctree+0x1200/0x23db btrfs_get_tree.cold+0x105/0x26c vfs_get_tree+0x28/0xb0 __do_sys_fsconfig+0x324/0x680 do_syscall_64+0x92/0x4f0 entry_SYSCALL_64_after_hwframe+0x76/0x7e other info that might help us debug this: Chain exists of: &vblk->vdev_mutex --> btrfs_trans_num_extwriters --> &fs_devs->device_list_mutex Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(&fs_devs->device_list_mutex); lock(btrfs_trans_num_extwriters); lock(&fs_devs->device_list_mutex); lock(&vblk->vdev_mutex); *** DEADLOCK *** 3 locks held by mount/1043: #0: ffff88811063e878 (&fc->uapi_mutex){+.+.}-{4:4}, at: __do_sys_fsconfig+0x2ae/0x680 #1: ffff88810cb9f0e8 (&type->s_umount_key#31/1){+.+.}-{4:4}, at: alloc_super+0xc0/0x3e0 #2: ffff888102a738e0 (&fs_devs->device_list_mutex){+.+.}-{4:4}, at: btrfs_get_dev_zone_info_all_devices+0x45/0x90 stack backtrace: CPU: 2 UID: 0 PID: 1043 Comm: mount Not tainted 7.0.0-rc1+ #355 PREEMPT(full) Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.17.0-9.fc43 06/10/2025 Call Trace: <TASK> dump_stack_lvl+0x5b/0x80 print_circular_bug.cold+0x18d/0x1d8 check_noncircular+0x10d/0x130 __lock_acquire+0x1522/0x2680 ? vmap_small_pages_range_noflush+0x3ef/0x820 lock_acquire+0xd5/0x2f0 ? virtblk_report_zones+0xda/0x430 ? lock_is_held_type+0xcd/0x130 __mutex_lock+0xa3/0x1360 ? virtblk_report_zones+0xda/0x430 ? virtblk_report_zones+0xda/0x430 ? __pfx_copy_zone_info_cb+0x10/0x10 ? virtblk_report_zones+0xda/0x430 virtblk_report_zones+0xda/0x430 ? __pfx_copy_zone_info_cb+0x10/0x10 blkdev_report_zones_cached+0x162/0x190 ? __pfx_copy_zone_info_cb+0x10/0x10 btrfs_get_dev_zones+0xdc/0x2e0 btrfs_get_dev_zone_info+0x219/0xe80 btrfs_get_dev_zone_info_all_devices+0x62/0x90 open_ctree+0x1200/0x23db btrfs_get_tree.cold+0x105/0x26c ? rcu_is_watching+0x18/0x50 vfs_get_tree+0x28/0xb0 __do_sys_fsconfig+0x324/0x680 do_syscall_64+0x92/0x4f0 entry_SYSCALL_64_after_hwframe+0x76/0x7e RIP: 0033:0x7f615e27a40e RSP: 002b:00007fff11b18fb8 EFLAGS: 00000246 ORIG_RAX: 00000000000001af RAX: ffffffffffffffda RBX: 000055572e92ab10 RCX: 00007f615e27a40e RDX: 0000000000000000 RSI: 0000000000000006 RDI: 0000000000000003 RBP: 00007fff11b19100 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000 R13: 000055572e92bc40 R14: 00007f615e3faa60 R15: 000055572e92bd08 </TASK> Don't hold the device_list_mutex while calling into btrfs_get_dev_zone_info() in btrfs_get_dev_zone_info_all_devices() to mitigate the issue. This is safe, as no other thread can touch the device list at the moment of execution. Reported-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
8 daysxattr: switch to CLASS(fd)Al Viro1-17/+10
[ Upstream commit a71874379ec8c6e788a61d71b3ad014a8d9a5c08 ] Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Link: https://lore.kernel.org/all/20241002012230.4174585-1-viro@zeniv.linux.org.uk/ [ Neither `fd_file` nor `fd_empty` are available in 6.6.y, so the changes to the check are dropped. Kept the minor formatting change. ] Signed-off-by: Tomasz Kramkowski <tomasz@kramkow.ski> Tested-by: Barry K. Nathan <barryn@pobox.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
8 daysRevert "xattr: switch to CLASS(fd)"Tomasz Kramkowski1-6/+21
This reverts commit 5a1e865e51063d6c56f673ec8ad4b6604321b455 which is commit a71874379ec8c6e788a61d71b3ad014a8d9a5c08 upstream. A backporting mistake erroneously removed file descriptor checks for `fgetxattr`, `flistxattr`, `fremovexattr`, and `fsetxattr` which lead to kernel panics when those functions were called from userspace with a file descriptor which did not reference an open file. Reported-by: Brad Spengler <spender@grsecurity.net> Closes: https://x.com/spendergrsec/status/2040049852793450561 Cc: Alva Lan <alvalan9@foxmail.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Tomasz Kramkowski <tomasz@kramkow.ski> Tested-by: Barry K. Nathan <barryn@pobox.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
12 daysbtrfs: fix lost error when running device stats on multiple devices fsFilipe Manana1-2/+3
[ Upstream commit 1c37d896b12dfd0d4c96e310b0033c6676933917 ] Whenever we get an error updating the device stats item for a device in btrfs_run_dev_stats() we allow the loop to go to the next device, and if updating the stats item for the next device succeeds, we end up losing the error we had from the previous device. Fix this by breaking out of the loop once we get an error and make sure it's returned to the caller. Since we are in the transaction commit path (and in the critical section actually), returning the error will result in a transaction abort. Fixes: 733f4fbbc108 ("Btrfs: read device stats on mount, write modified ones during commit") Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
12 daysbtrfs: fix leak of kobject name for sub-group space_infoShin'ichiro Kawasaki1-1/+1
[ Upstream commit a4376d9a5d4c9610e69def3fc0b32c86a7ab7a41 ] When create_space_info_sub_group() allocates elements of space_info->sub_group[], kobject_init_and_add() is called for each element via btrfs_sysfs_add_space_info_type(). However, when check_removing_space_info() frees these elements, it does not call btrfs_sysfs_remove_space_info() on them. As a result, kobject_put() is not called and the associated kobj->name objects are leaked. This memory leak is reproduced by running the blktests test case zbd/009 on kernels built with CONFIG_DEBUG_KMEMLEAK. The kmemleak feature reports the following error: unreferenced object 0xffff888112877d40 (size 16): comm "mount", pid 1244, jiffies 4294996972 hex dump (first 16 bytes): 64 61 74 61 2d 72 65 6c 6f 63 00 c4 c6 a7 cb 7f data-reloc...... backtrace (crc 53ffde4d): __kmalloc_node_track_caller_noprof+0x619/0x870 kstrdup+0x42/0xc0 kobject_set_name_vargs+0x44/0x110 kobject_init_and_add+0xcf/0x150 btrfs_sysfs_add_space_info_type+0xfc/0x210 [btrfs] create_space_info_sub_group.constprop.0+0xfb/0x1b0 [btrfs] create_space_info+0x211/0x320 [btrfs] btrfs_init_space_info+0x15a/0x1b0 [btrfs] open_ctree+0x33c7/0x4a50 [btrfs] btrfs_get_tree.cold+0x9f/0x1ee [btrfs] vfs_get_tree+0x87/0x2f0 vfs_cmd_create+0xbd/0x280 __do_sys_fsconfig+0x3df/0x990 do_syscall_64+0x136/0x1540 entry_SYSCALL_64_after_hwframe+0x76/0x7e To avoid the leak, call btrfs_sysfs_remove_space_info() instead of kfree() for the elements. Fixes: f92ee31e031c ("btrfs: introduce btrfs_space_info sub-group") Link: https://lore.kernel.org/linux-block/b9488881-f18d-4f47-91a5-3c9bf63955a5@wdc.com/ Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
12 daysbtrfs: fix super block offset in error message in btrfs_validate_super()Mark Harmstone1-2/+2
[ Upstream commit b52fe51f724385b3ed81e37e510a4a33107e8161 ] Fix the superblock offset mismatch error message in btrfs_validate_super(): we changed it so that it considers all the superblocks, but the message still assumes we're only looking at the first one. The change from %u to %llu is because we're changing from a constant to a u64. Fixes: 069ec957c35e ("btrfs: Refactor btrfs_check_super_valid") Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Mark Harmstone <mark@harmstone.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
12 dayserofs: fix "BUG: Bad page state in z_erofs_do_read_page"Gao Xiang1-0/+1
It's actually a stable-only issue from backporting 9e2f9d34dd12 ("erofs: handle overlapped pclusters out of crafted images properly") We missed to update `oldpage` after `pcl->compressed_bvecs[nr].page` is updated, so that the following cmpxchg() will fail; the original upstream commit doesn't behave like this due to new features and refactoring. This backport issue only impacts some specific crafted images and normal filesystems won't be impacted at all. Fixes: 1bf7e414cac3 ("erofs: handle overlapped pclusters out of crafted images properly") # 6.6.y Closes: https://syzkaller.appspot.com/bug?extid=b6353e35ae2bab997538 Reported-and-tested-by: syzbot+b6353e35ae2bab997538@syzkaller.appspotmail.com [1] [1] https://lore.kernel.org/r/69c3b299.a70a0220.234938.004b.GAE@google.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
12 daysxfs: save ailp before dropping the AIL lock in push callbacksYuto Ohnuki2-4/+14
commit 394d70b86fae9fe865e7e6d9540b7696f73aa9b6 upstream. In xfs_inode_item_push() and xfs_qm_dquot_logitem_push(), the AIL lock is dropped to perform buffer IO. Once the cluster buffer no longer protects the log item from reclaim, the log item may be freed by background reclaim or the dquot shrinker. The subsequent spin_lock() call dereferences lip->li_ailp, which is a use-after-free. Fix this by saving the ailp pointer in a local variable while the AIL lock is held and the log item is guaranteed to be valid. Reported-by: syzbot+652af2b3c5569c4ab63c@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=652af2b3c5569c4ab63c Fixes: 90c60e164012 ("xfs: xfs_iflush() is no longer necessary") Cc: stable@vger.kernel.org # v5.9 Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Yuto Ohnuki <ytohnuki@amazon.com> Signed-off-by: Carlos Maiolino <cem@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
12 daysxfs: avoid dereferencing log items after push callbacksYuto Ohnuki3-11/+52
commit 79ef34ec0554ec04bdbafafbc9836423734e1bd6 upstream. After xfsaild_push_item() calls iop_push(), the log item may have been freed if the AIL lock was dropped during the push. Background inode reclaim or the dquot shrinker can free the log item while the AIL lock is not held, and the tracepoints in the switch statement dereference the log item after iop_push() returns. Fix this by capturing the log item type, flags, and LSN before calling xfsaild_push_item(), and introducing a new xfs_ail_push_class trace event class that takes these pre-captured values and the ailp pointer instead of the log item pointer. Reported-by: syzbot+652af2b3c5569c4ab63c@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=652af2b3c5569c4ab63c Fixes: 90c60e164012 ("xfs: xfs_iflush() is no longer necessary") Cc: stable@vger.kernel.org # v5.9 Signed-off-by: Yuto Ohnuki <ytohnuki@amazon.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
12 daysxattr: switch to CLASS(fd)Al Viro1-21/+6
[ Upstream commit a71874379ec8c6e788a61d71b3ad014a8d9a5c08 ] Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> [ Only switch to CLASS(fd) in v6.6.y for fd_empty() was introduced in commit 88a2f6468d01 ("struct fd: representation change") in 6.12. ] Signed-off-by: Alva Lan <alvalan9@foxmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
12 daysgfs2: Fix unlikely race in gdlm_put_lockAndreas Gruenbacher1-5/+5
[ Upstream commit 28c4d9bc0708956c1a736a9e49fee71b65deee81 ] In gdlm_put_lock(), there is a small window of time in which the DFL_UNMOUNT flag has been set but the lockspace hasn't been released, yet. In that window, dlm may still call gdlm_ast() and gdlm_bast(). To prevent it from dereferencing freed glock objects, only free the glock if the lockspace has actually been released. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Reviewed-by: Andrew Price <anprice@redhat.com> [ Minor context change fixed. ] Signed-off-by: Robert Garcia <rob_garcia@163.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
12 daysksmbd: fix memory leaks and NULL deref in smb2_lock()Werner Kasselman1-9/+18
[ Upstream commit 309b44ed684496ed3f9c5715d10b899338623512 ] smb2_lock() has three error handling issues after list_del() detaches smb_lock from lock_list at no_check_cl: 1) If vfs_lock_file() returns an unexpected error in the non-UNLOCK path, goto out leaks smb_lock and its flock because the out: handler only iterates lock_list and rollback_list, neither of which contains the detached smb_lock. 2) If vfs_lock_file() returns -ENOENT in the UNLOCK path, goto out leaks smb_lock and flock for the same reason. The error code returned to the dispatcher is also stale. 3) In the rollback path, smb_flock_init() can return NULL on allocation failure. The result is dereferenced unconditionally, causing a kernel NULL pointer dereference. Add a NULL check to prevent the crash and clean up the bookkeeping; the VFS lock itself cannot be rolled back without the allocation and will be released at file or connection teardown. Fix cases 1 and 2 by hoisting the locks_free_lock()/kfree() to before the if(!rc) check in the UNLOCK branch so all exit paths share one free site, and by freeing smb_lock and flock before goto out in the non-UNLOCK branch. Propagate the correct error code in both cases. Fix case 3 by wrapping the VFS unlock in an if(rlock) guard and adding a NULL check for locks_free_lock(rlock) in the shared cleanup. Found via call-graph analysis using sqry. Fixes: e2f34481b24d ("cifsd: add server-side procedures for SMB3") Cc: stable@vger.kernel.org Suggested-by: ChenXiaoSong <chenxiaosong@kylinos.cn> Signed-off-by: Werner Kasselman <werner@verivus.com> Reviewed-by: ChenXiaoSong <chenxiaosong@kylinos.cn> Acked-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com> [ adapted rlock->c.flc_type to rlock->fl_type ] Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
12 daysksmbd: fix use-after-free and NULL deref in smb_grant_oplock()Werner Kasselman1-27/+45
[ Upstream commit 48623ec358c1c600fa1e38368746f933e0f1a617 ] smb_grant_oplock() has two issues in the oplock publication sequence: 1) opinfo is linked into ci->m_op_list (via opinfo_add) before add_lease_global_list() is called. If add_lease_global_list() fails (kmalloc returns NULL), the error path frees the opinfo via __free_opinfo() while it is still linked in ci->m_op_list. Concurrent m_op_list readers (opinfo_get_list, or direct iteration in smb_break_all_levII_oplock) dereference the freed node. 2) opinfo->o_fp is assigned after add_lease_global_list() publishes the opinfo on the global lease list. A concurrent find_same_lease_key() can walk the lease list and dereference opinfo->o_fp->f_ci while o_fp is still NULL. Fix by restructuring the publication sequence to eliminate post-publish failure: - Set opinfo->o_fp before any list publication (fixes NULL deref). - Preallocate lease_table via alloc_lease_table() before opinfo_add() so add_lease_global_list() becomes infallible after publication. - Keep the original m_op_list publication order (opinfo_add before lease list) so concurrent opens via same_client_has_lease() and opinfo_get_list() still see the in-flight grant. - Use opinfo_put() instead of __free_opinfo() on err_out so that the RCU-deferred free path is used. This also requires splitting add_lease_global_list() to take a preallocated lease_table and changing its return type from int to void, since it can no longer fail. Fixes: 1dfd062caa16 ("ksmbd: fix use-after-free by using call_rcu() for oplock_info") Cc: stable@vger.kernel.org Signed-off-by: Werner Kasselman <werner@verivus.com> Reviewed-by: ChenXiaoSong <chenxiaosong@kylinos.cn> Acked-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com> [ replaced kmalloc_obj() and KSMBD_DEFAULT_GFP with kmalloc(sizeof(), GFP_KERNEL) ] Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
12 daysext4: always drain queued discard work in ext4_mb_release()Theodore Ts'o1-7/+5
commit 9ee29d20aab228adfb02ca93f87fb53c56c2f3af upstream. While reviewing recent ext4 patch[1], Sashiko raised the following concern[2]: > If the filesystem is initially mounted with the discard option, > deleting files will populate sbi->s_discard_list and queue > s_discard_work. If it is then remounted with nodiscard, the > EXT4_MOUNT_DISCARD flag is cleared, but the pending s_discard_work is > neither cancelled nor flushed. [1] https://lore.kernel.org/r/20260319094545.19291-1-qiang.zhang@linux.dev/ [2] https://sashiko.dev/#/patchset/20260319094545.19291-1-qiang.zhang%40linux.dev The concern was valid, but it had nothing to do with the patch[1]. One of the problems with Sashiko in its current (early) form is that it will detect pre-existing issues and report it as a problem with the patch that it is reviewing. In practice, it would be hard to hit deliberately (unless you are a malicious syzkaller fuzzer), since it would involve mounting the file system with -o discard, and then deleting a large number of files, remounting the file system with -o nodiscard, and then immediately unmounting the file system before the queued discard work has a change to drain on its own. Fix it because it's a real bug, and to avoid Sashiko from raising this concern when analyzing future patches to mballoc.c. Signed-off-by: Theodore Ts'o <tytso@mit.edu> Fixes: 55cdd0af2bc5 ("ext4: get discard out of jbd2 commit kthread contex") Cc: stable@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
12 daysext4: fix iloc.bh leak in ext4_fc_replay_inode() error pathsBaokun Li1-5/+8
commit ec0a7500d8eace5b4f305fa0c594dd148f0e8d29 upstream. During code review, Joseph found that ext4_fc_replay_inode() calls ext4_get_fc_inode_loc() to get the inode location, which holds a reference to iloc.bh that must be released via brelse(). However, several error paths jump to the 'out' label without releasing iloc.bh: - ext4_handle_dirty_metadata() failure - sync_dirty_buffer() failure - ext4_mark_inode_used() failure - ext4_iget() failure Fix this by introducing an 'out_brelse' label placed just before the existing 'out' label to ensure iloc.bh is always released. Additionally, make ext4_fc_replay_inode() propagate errors properly instead of always returning 0. Reported-by: Joseph Qi <joseph.qi@linux.alibaba.com> Fixes: 8016e29f4362 ("ext4: fast commit recovery path") Signed-off-by: Baokun Li <libaokun@linux.alibaba.com> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20260323060836.3452660-1-libaokun@linux.alibaba.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
12 daysext4: fix the might_sleep() warnings in kvfree()Zqiang2-13/+5
commit 496bb99b7e66f48b178126626f47e9ba79e2d0fa upstream. Use the kvfree() in the RCU read critical section can trigger the following warnings: EXT4-fs (vdb): unmounting filesystem cd983e5b-3c83-4f5a-a136-17b00eb9d018. WARNING: suspicious RCU usage ./include/linux/rcupdate.h:409 Illegal context switch in RCU read-side critical section! other info that might help us debug this: rcu_scheduler_active = 2, debug_locks = 1 Call Trace: <TASK> dump_stack_lvl+0xbb/0xd0 dump_stack+0x14/0x20 lockdep_rcu_suspicious+0x15a/0x1b0 __might_resched+0x375/0x4d0 ? put_object.part.0+0x2c/0x50 __might_sleep+0x108/0x160 vfree+0x58/0x910 ? ext4_group_desc_free+0x27/0x270 kvfree+0x23/0x40 ext4_group_desc_free+0x111/0x270 ext4_put_super+0x3c8/0xd40 generic_shutdown_super+0x14c/0x4a0 ? __pfx_shrinker_free+0x10/0x10 kill_block_super+0x40/0x90 ext4_kill_sb+0x6d/0xb0 deactivate_locked_super+0xb4/0x180 deactivate_super+0x7e/0xa0 cleanup_mnt+0x296/0x3e0 __cleanup_mnt+0x16/0x20 task_work_run+0x157/0x250 ? __pfx_task_work_run+0x10/0x10 ? exit_to_user_mode_loop+0x6a/0x550 exit_to_user_mode_loop+0x102/0x550 do_syscall_64+0x44a/0x500 entry_SYSCALL_64_after_hwframe+0x77/0x7f </TASK> BUG: sleeping function called from invalid context at mm/vmalloc.c:3441 in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 556, name: umount preempt_count: 1, expected: 0 CPU: 3 UID: 0 PID: 556 Comm: umount Call Trace: <TASK> dump_stack_lvl+0xbb/0xd0 dump_stack+0x14/0x20 __might_resched+0x275/0x4d0 ? put_object.part.0+0x2c/0x50 __might_sleep+0x108/0x160 vfree+0x58/0x910 ? ext4_group_desc_free+0x27/0x270 kvfree+0x23/0x40 ext4_group_desc_free+0x111/0x270 ext4_put_super+0x3c8/0xd40 generic_shutdown_super+0x14c/0x4a0 ? __pfx_shrinker_free+0x10/0x10 kill_block_super+0x40/0x90 ext4_kill_sb+0x6d/0xb0 deactivate_locked_super+0xb4/0x180 deactivate_super+0x7e/0xa0 cleanup_mnt+0x296/0x3e0 __cleanup_mnt+0x16/0x20 task_work_run+0x157/0x250 ? __pfx_task_work_run+0x10/0x10 ? exit_to_user_mode_loop+0x6a/0x550 exit_to_user_mode_loop+0x102/0x550 do_syscall_64+0x44a/0x500 entry_SYSCALL_64_after_hwframe+0x77/0x7f The above scenarios occur in initialization failures and teardown paths, there are no parallel operations on the resources released by kvfree(), this commit therefore remove rcu_read_lock/unlock() and use rcu_access_pointer() instead of rcu_dereference() operations. Fixes: 7c990728b99e ("ext4: fix potential race between s_flex_groups online resizing and access") Fixes: df3da4ea5a0f ("ext4: fix potential race between s_group_info online resizing and access") Signed-off-by: Zqiang <qiang.zhang@linux.dev> Reviewed-by: Baokun Li <libaokun@linux.alibaba.com> Link: https://patch.msgid.link/20260319094545.19291-1-qiang.zhang@linux.dev Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
12 daysext4: fix use-after-free in update_super_work when racing with umountJiayuan Chen3-1/+11
commit d15e4b0a418537aafa56b2cb80d44add83e83697 upstream. Commit b98535d09179 ("ext4: fix bug_on in start_this_handle during umount filesystem") moved ext4_unregister_sysfs() before flushing s_sb_upd_work to prevent new error work from being queued via /proc/fs/ext4/xx/mb_groups reads during unmount. However, this introduced a use-after-free because update_super_work calls ext4_notify_error_sysfs() -> sysfs_notify() which accesses the kobject's kernfs_node after it has been freed by kobject_del() in ext4_unregister_sysfs(): update_super_work ext4_put_super ----------------- -------------- ext4_unregister_sysfs(sb) kobject_del(&sbi->s_kobj) __kobject_del() sysfs_remove_dir() kobj->sd = NULL sysfs_put(sd) kernfs_put() // RCU free ext4_notify_error_sysfs(sbi) sysfs_notify(&sbi->s_kobj) kn = kobj->sd // stale pointer kernfs_get(kn) // UAF on freed kernfs_node ext4_journal_destroy() flush_work(&sbi->s_sb_upd_work) Instead of reordering the teardown sequence, fix this by making ext4_notify_error_sysfs() detect that sysfs has already been torn down by checking s_kobj.state_in_sysfs, and skipping the sysfs_notify() call in that case. A dedicated mutex (s_error_notify_mutex) serializes ext4_notify_error_sysfs() against kobject_del() in ext4_unregister_sysfs() to prevent TOCTOU races where the kobject could be deleted between the state_in_sysfs check and the sysfs_notify() call. Fixes: b98535d09179 ("ext4: fix bug_on in start_this_handle during umount filesystem") Cc: Jiayuan Chen <jiayuan.chen@linux.dev> Suggested-by: Jan Kara <jack@suse.cz> Signed-off-by: Jiayuan Chen <jiayuan.chen@shopee.com> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20260319120336.157873-1-jiayuan.chen@linux.dev Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
12 daysext4: reject mount if bigalloc with s_first_data_block != 0Helen Koike1-0/+7
commit 3822743dc20386d9897e999dbb990befa3a5b3f8 upstream. bigalloc with s_first_data_block != 0 is not supported, reject mounting it. Signed-off-by: Helen Koike <koike@igalia.com> Suggested-by: Theodore Ts'o <tytso@mit.edu> Reported-by: syzbot+b73703b873a33d8eb8f6@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=b73703b873a33d8eb8f6 Link: https://patch.msgid.link/20260317142325.135074-1-koike@igalia.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
12 daysext4: avoid allocate block from corrupted group in ext4_mb_find_by_goal()Ye Bin1-1/+5
commit 46066e3a06647c5b186cc6334409722622d05c44 upstream. There's issue as follows: ... EXT4-fs (mmcblk0p1): Delayed block allocation failed for inode 206 at logical offset 0 with max blocks 1 with error 117 EXT4-fs (mmcblk0p1): This should not happen!! Data will be lost EXT4-fs (mmcblk0p1): Delayed block allocation failed for inode 206 at logical offset 0 with max blocks 1 with error 117 EXT4-fs (mmcblk0p1): This should not happen!! Data will be lost EXT4-fs (mmcblk0p1): Delayed block allocation failed for inode 206 at logical offset 0 with max blocks 1 with error 117 EXT4-fs (mmcblk0p1): This should not happen!! Data will be lost EXT4-fs (mmcblk0p1): Delayed block allocation failed for inode 206 at logical offset 0 with max blocks 1 with error 117 EXT4-fs (mmcblk0p1): This should not happen!! Data will be lost EXT4-fs (mmcblk0p1): Delayed block allocation failed for inode 2243 at logical offset 0 with max blocks 1 with error 117 EXT4-fs (mmcblk0p1): This should not happen!! Data will be lost EXT4-fs (mmcblk0p1): Delayed block allocation failed for inode 2239 at logical offset 0 with max blocks 1 with error 117 EXT4-fs (mmcblk0p1): This should not happen!! Data will be lost EXT4-fs (mmcblk0p1): error count since last fsck: 1 EXT4-fs (mmcblk0p1): initial error at time 1765597433: ext4_mb_generate_buddy:760 EXT4-fs (mmcblk0p1): last error at time 1765597433: ext4_mb_generate_buddy:760 ... According to the log analysis, blocks are always requested from the corrupted block group. This may happen as follows: ext4_mb_find_by_goal ext4_mb_load_buddy ext4_mb_load_buddy_gfp ext4_mb_init_cache ext4_read_block_bitmap_nowait ext4_wait_block_bitmap ext4_validate_block_bitmap if (!grp || EXT4_MB_GRP_BBITMAP_CORRUPT(grp)) return -EFSCORRUPTED; // There's no logs. if (err) return err; // Will return error ext4_lock_group(ac->ac_sb, group); if (unlikely(EXT4_MB_GRP_BBITMAP_CORRUPT(e4b->bd_info))) // Unreachable goto out; After commit 9008a58e5dce ("ext4: make the bitmap read routines return real error codes") merged, Commit 163a203ddb36 ("ext4: mark block group as corrupt on block bitmap error") is no real solution for allocating blocks from corrupted block groups. This is because if 'EXT4_MB_GRP_BBITMAP_CORRUPT(e4b->bd_info)' is true, then 'ext4_mb_load_buddy()' may return an error. This means that the block allocation will fail. Therefore, check block group if corrupted when ext4_mb_load_buddy() returns error. Fixes: 163a203ddb36 ("ext4: mark block group as corrupt on block bitmap error") Fixes: 9008a58e5dce ("ext4: make the bitmap read routines return real error codes") Signed-off-by: Ye Bin <yebin10@huawei.com> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Andreas Dilger <adilger@dilger.ca> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20260302134619.3145520-1-yebin@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
12 daysext4: avoid infinite loops caused by residual dataEdward Adam Davis1-2/+6
commit 5422fe71d26d42af6c454ca9527faaad4e677d6c upstream. On the mkdir/mknod path, when mapping logical blocks to physical blocks, if inserting a new extent into the extent tree fails (in this example, because the file system disabled the huge file feature when marking the inode as dirty), ext4_ext_map_blocks() only calls ext4_free_blocks() to reclaim the physical block without deleting the corresponding data in the extent tree. This causes subsequent mkdir operations to reference the previously reclaimed physical block number again, even though this physical block is already being used by the xattr block. Therefore, a situation arises where both the directory and xattr are using the same buffer head block in memory simultaneously. The above causes ext4_xattr_block_set() to enter an infinite loop about "inserted" and cannot release the inode lock, ultimately leading to the 143s blocking problem mentioned in [1]. If the metadata is corrupted, then trying to remove some extent space can do even more harm. Also in case EXT4_GET_BLOCKS_DELALLOC_RESERVE was passed, remove space wrongly update quota information. Jan Kara suggests distinguishing between two cases: 1) The error is ENOSPC or EDQUOT - in this case the filesystem is fully consistent and we must maintain its consistency including all the accounting. However these errors can happen only early before we've inserted the extent into the extent tree. So current code works correctly for this case. 2) Some other error - this means metadata is corrupted. We should strive to do as few modifications as possible to limit damage. So I'd just skip freeing of allocated blocks. [1] INFO: task syz.0.17:5995 blocked for more than 143 seconds. Call Trace: inode_lock_nested include/linux/fs.h:1073 [inline] __start_dirop fs/namei.c:2923 [inline] start_dirop fs/namei.c:2934 [inline] Reported-by: syzbot+512459401510e2a9a39f@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=1659aaaaa8d9d11265d7 Tested-by: syzbot+1659aaaaa8d9d11265d7@syzkaller.appspotmail.com Reported-by: syzbot+1659aaaaa8d9d11265d7@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=512459401510e2a9a39f Tested-by: syzbot+1659aaaaa8d9d11265d7@syzkaller.appspotmail.com Signed-off-by: Edward Adam Davis <eadavis@qq.com> Reviewed-by: Jan Kara <jack@suse.cz> Tested-by: syzbot+512459401510e2a9a39f@syzkaller.appspotmail.com Link: https://patch.msgid.link/tencent_43696283A68450B761D76866C6F360E36705@qq.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
12 daysext4: replace BUG_ON with proper error handling in ext4_read_inline_folioYuto Ohnuki1-1/+9
commit 356227096eb66e41b23caf7045e6304877322edf upstream. Replace BUG_ON() with proper error handling when inline data size exceeds PAGE_SIZE. This prevents kernel panic and allows the system to continue running while properly reporting the filesystem corruption. The error is logged via ext4_error_inode(), the buffer head is released to prevent memory leak, and -EFSCORRUPTED is returned to indicate filesystem corruption. Signed-off-by: Yuto Ohnuki <ytohnuki@amazon.com> Link: https://patch.msgid.link/20260223123345.14838-2-ytohnuki@amazon.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
12 daysext4: make recently_deleted() properly work with lazy itable initializationJan Kara1-0/+6
commit bd060afa7cc3e0ad30afa9ecc544a78638498555 upstream. recently_deleted() checks whether inode has been used in the near past. However this can give false positive result when inode table is not initialized yet and we are in fact comparing to random garbage (or stale itable block of a filesystem before mkfs). Ultimately this results in uninitialized inodes being skipped during inode allocation and possibly they are never initialized and thus e2fsck complains. Verify if the inode has been initialized before checking for dtime. Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Link: https://patch.msgid.link/20260216164848.3074-3-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
12 daysext4: fix fsync(2) for nojournal modeJan Kara1-2/+14
commit 1308255bbf8452762f89f44f7447ce137ecdbcff upstream. When inode metadata is changed, we sometimes just call ext4_mark_inode_dirty() to track modified metadata. This copies inode metadata into block buffer which is enough when we are journalling metadata. However when we are running in nojournal mode we currently fail to write the dirtied inode buffer during fsync(2) because the inode is not marked as dirty. Use explicit ext4_write_inode() call to make sure the inode table buffer is written to the disk. This is a band aid solution but proper solution requires a much larger rewrite including changes in metadata bh tracking infrastructure. Reported-by: Free Ekanayaka <free.ekanayaka@gmail.com> Link: https://lore.kernel.org/all/87il8nhxdm.fsf@x1.mail-host-address-is-not-set/ CC: stable@vger.kernel.org Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Link: https://patch.msgid.link/20260216164848.3074-4-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
12 daysext4: fix stale xarray tags after writebackJan Kara1-2/+8
commit f4a2b42e78914ff15630e71289adc589c3a8eb45 upstream. There are cases where ext4_bio_write_page() gets called for a page which has no buffers to submit. This happens e.g. when the part of the file is actually a hole, when we cannot allocate blocks due to being called from jbd2, or in data=journal mode when checkpointing writes the buffers earlier. In these cases we just return from ext4_bio_write_page() however if the page didn't need redirtying, we will leave stale DIRTY and/or TOWRITE tags in xarray because those get cleared only in __folio_start_writeback(). As a result we can leave these tags set in mappings even after a final sync on filesystem that's getting remounted read-only or that's being frozen. Various assertions can then get upset when writeback is started on such filesystems (Gerald reported assertion in ext4_journal_check_start() firing). Fix the problem by cycling the page through writeback state even if we decide nothing needs to be written for it so that xarray tags get properly updated. This is slightly silly (we could update the xarray tags directly) but I don't think a special helper messing with xarray tags is really worth it in this relatively rare corner case. Reported-by: Gerald Yang <gerald.yang@canonical.com> Link: https://lore.kernel.org/all/20260128074515.2028982-1-gerald.yang@canonical.com Fixes: dff4ac75eeee ("ext4: move keep_towrite handling to ext4_bio_write_page()") Signed-off-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20260205092223.21287-2-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
12 daysext4: convert inline data to extents when truncate exceeds inline sizeDeepanshu Kartikey1-0/+12
commit ed9356a30e59c7cc3198e7fc46cfedf3767b9b17 upstream. Add a check in ext4_setattr() to convert files from inline data storage to extent-based storage when truncate() grows the file size beyond the inline capacity. This prevents the filesystem from entering an inconsistent state where the inline data flag is set but the file size exceeds what can be stored inline. Without this fix, the following sequence causes a kernel BUG_ON(): 1. Mount filesystem with inode that has inline flag set and small size 2. truncate(file, 50MB) - grows size but inline flag remains set 3. sendfile() attempts to write data 4. ext4_write_inline_data() hits BUG_ON(write_size > inline_capacity) The crash occurs because ext4_write_inline_data() expects inline storage to accommodate the write, but the actual inline capacity (~60 bytes for i_block + ~96 bytes for xattrs) is far smaller than the file size and write request. The fix checks if the new size from setattr exceeds the inode's actual inline capacity (EXT4_I(inode)->i_inline_size) and converts the file to extent-based storage before proceeding with the size change. This addresses the root cause by ensuring the inline data flag and file size remain consistent during truncate operations. Reported-by: syzbot+7de5fe447862fc37576f@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=7de5fe447862fc37576f Tested-by: syzbot+7de5fe447862fc37576f@syzkaller.appspotmail.com Signed-off-by: Deepanshu Kartikey <Kartikey406@gmail.com> Link: https://patch.msgid.link/20260207043607.1175976-1-kartikey406@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
12 daysext4: fix journal credit check when setting fscrypt contextSimon Weber1-1/+8
commit b1d682f1990c19fb1d5b97d13266210457092bcd upstream. Fix an issue arising when ext4 features has_journal, ea_inode, and encrypt are activated simultaneously, leading to ENOSPC when creating an encrypted file. Fix by passing XATTR_CREATE flag to xattr_set_handle function if a handle is specified, i.e., when the function is called in the control flow of creating a new inode. This aligns the number of jbd2 credits set_handle checks for with the number allocated for creating a new inode. ext4_set_context must not be called with a non-null handle (fs_data) if fscrypt context xattr is not guaranteed to not exist yet. The only other usage of this function currently is when handling the ioctl FS_IOC_SET_ENCRYPTION_POLICY, which calls it with fs_data=NULL. Fixes: c1a5d5f6ab21eb7e ("ext4: improve journal credit handling in set xattr paths") Co-developed-by: Anthony Durrer <anthonydev@fastmail.com> Signed-off-by: Anthony Durrer <anthonydev@fastmail.com> Signed-off-by: Simon Weber <simon.weber.39@gmail.com> Reviewed-by: Eric Biggers <ebiggers@kernel.org> Link: https://patch.msgid.link/20260207100148.724275-4-simon.weber.39@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
12 daysxfs: fix ri_total validation in xlog_recover_attri_commit_pass2Long Li1-2/+2
commit d72f2084e30966097c8eae762e31986a33c3c0ae upstream. The ri_total checks for SET/REPLACE operations are hardcoded to 3, but xfs_attri_item_size() only emits a value iovec when value_len > 0, so ri_total is 2 when value_len == 0. For PPTR_SET/PPTR_REMOVE/PPTR_REPLACE, value_len is validated by xfs_attri_validate() to be exactly sizeof(struct xfs_parent_rec) and is never zero, so their hardcoded checks remain correct. This problem may cause log recovery failures. The following script can be used to reproduce the problem: #!/bin/bash mkfs.xfs -f /dev/sda mount /dev/sda /mnt/test/ touch /mnt/test/file for i in {1..200}; do attr -s "user.attr_$i" -V "value_$i" /mnt/test/file > /dev/null done echo 1 > /sys/fs/xfs/debug/larp echo 1 > /sys/fs/xfs/sda/errortag/larp attr -s "user.zero" -V "" /mnt/test/file echo 0 > /sys/fs/xfs/sda/errortag/larp umount /mnt/test mount /dev/sda /mnt/test/ # mount failed Fix this by deriving the expected count dynamically as "2 + !!value_len" for SET/REPLACE operations. Cc: stable@vger.kernel.org # v6.9 Fixes: ad206ae50eca ("xfs: check opcode and iovec count match in xlog_recover_attri_commit_pass2") Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Long Li <leo.lilong@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Carlos Maiolino <cem@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
12 daysxfs: stop reclaim before pushing AIL during unmountYuto Ohnuki1-3/+4
commit 4f24a767e3d64a5f58c595b5c29b6063a201f1e3 upstream. The unmount sequence in xfs_unmount_flush_inodes() pushed the AIL while background reclaim and inodegc are still running. This is broken independently of any use-after-free issues - background reclaim and inodegc should not be running while the AIL is being pushed during unmount, as inodegc can dirty and insert inodes into the AIL during the flush, and background reclaim can race to abort and free dirty inodes. Reorder xfs_unmount_flush_inodes() to stop inodegc and cancel background reclaim before pushing the AIL. Stop inodegc before cancelling m_reclaim_work because the inodegc worker can re-queue m_reclaim_work via xfs_inodegc_set_reclaimable. Reported-by: syzbot+652af2b3c5569c4ab63c@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=652af2b3c5569c4ab63c Fixes: 90c60e164012 ("xfs: xfs_iflush() is no longer necessary") Cc: stable@vger.kernel.org # v5.9 Signed-off-by: Yuto Ohnuki <ytohnuki@amazon.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
12 daysjbd2: gracefully abort on checkpointing state corruptionsMilos Nikic1-2/+13
commit bac3190a8e79beff6ed221975e0c9b1b5f2a21da upstream. This patch targets two internal state machine invariants in checkpoint.c residing inside functions that natively return integer error codes. - In jbd2_cleanup_journal_tail(): A blocknr of 0 indicates a severely corrupted journal superblock. Replaced the J_ASSERT with a WARN_ON_ONCE and a graceful journal abort, returning -EFSCORRUPTED. - In jbd2_log_do_checkpoint(): Replaced the J_ASSERT_BH checking for an unexpected buffer_jwrite state. If the warning triggers, we explicitly drop the just-taken get_bh() reference and call __flush_batch() to safely clean up any previously queued buffers in the j_chkpt_bhs array, preventing a memory leak before returning -EFSCORRUPTED. Signed-off-by: Milos Nikic <nikic.milos@gmail.com> Reviewed-by: Andreas Dilger <adilger@dilger.ca> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Baokun Li <libaokun@linux.alibaba.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20260311041548.159424-1-nikic.milos@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
12 dayserofs: add GFP_NOIO in the bio completion if neededJiucheng Xu1-0/+3
commit c23df30915f83e7257c8625b690a1cece94142a0 upstream. The bio completion path in the process context (e.g. dm-verity) will directly call into decompression rather than trigger another workqueue context for minimal scheduling latencies, which can then call vm_map_ram() with GFP_KERNEL. Due to insufficient memory, vm_map_ram() may generate memory swapping I/O, which can cause submit_bio_wait to deadlock in some scenarios. Trimmed down the call stack, as follows: f2fs_submit_read_io submit_bio //bio_list is initialized. mmc_blk_mq_recovery z_erofs_endio vm_map_ram __pte_alloc_kernel __alloc_pages_direct_reclaim shrink_folio_list __swap_writepage submit_bio_wait //bio_list is non-NULL, hang!!! Use memalloc_noio_{save,restore}() to wrap up this path. Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Signed-off-by: Jiucheng Xu <jiucheng.xu@amlogic.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
12 daysksmbd: do not expire session on binding failureHyunwoo Kim1-2/+8
commit 9bbb19d21ded7d78645506f20d8c44895e3d0fb9 upstream. When a multichannel session binding request fails (e.g. wrong password), the error path unconditionally sets sess->state = SMB2_SESSION_EXPIRED. However, during binding, sess points to the target session looked up via ksmbd_session_lookup_slowpath() -- which belongs to another connection's user. This allows a remote attacker to invalidate any active session by simply sending a binding request with a wrong password (DoS). Fix this by skipping session expiration when the failed request was a binding attempt, since the session does not belong to the current connection. The reference taken by ksmbd_session_lookup_slowpath() is still correctly released via ksmbd_user_session_put(). Cc: stable@vger.kernel.org Signed-off-by: Hyunwoo Kim <imv4bel@gmail.com> Acked-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
12 daysksmbd: fix potencial OOB in get_file_all_info() for compound requestsNamjae Jeon1-2/+14
commit beef2634f81f1c086208191f7228bce1d366493d upstream. When a compound request consists of QUERY_DIRECTORY + QUERY_INFO (FILE_ALL_INFORMATION) and the first command consumes nearly the entire max_trans_size, get_file_all_info() would blindly call smbConvertToUTF16() with PATH_MAX, causing out-of-bounds write beyond the response buffer. In get_file_all_info(), there was a missing validation check for the client-provided OutputBufferLength before copying the filename into FileName field of the smb2_file_all_info structure. If the filename length exceeds the available buffer space, it could lead to potential buffer overflows or memory corruption during smbConvertToUTF16 conversion. This calculating the actual free buffer size using smb2_calc_max_out_buf_len() and returning -EINVAL if the buffer is insufficient and updating smbConvertToUTF16 to use the actual filename length (clamped by PATH_MAX) to ensure a safe copy operation. Cc: stable@vger.kernel.org Fixes: e2b76ab8b5c9 ("ksmbd: add support for read compound") Reported-by: Asim Viladi Oglu Manizada <manizada@pm.me> Signed-off-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
12 daysksmbd: replace hardcoded hdr2_len with offsetof() in smb2_calc_max_out_buf_len()Namjae Jeon1-8/+12
commit 0e55f63dd08f09651d39e1b709a91705a8a0ddcb upstream. After this commit (e2b76ab8b5c9 "ksmbd: add support for read compound"), response buffer management was changed to use dynamic iov array. In the new design, smb2_calc_max_out_buf_len() expects the second argument (hdr2_len) to be the offset of ->Buffer field in the response structure, not a hardcoded magic number. Fix the remaining call sites to use the correct offsetof() value. Cc: stable@vger.kernel.org Fixes: e2b76ab8b5c9 ("ksmbd: add support for read compound") Signed-off-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
12 daysbtrfs: set BTRFS_ROOT_ORPHAN_CLEANUP during subvol createBoris Burkov1-0/+7
[ Upstream commit 5131fa077f9bb386a1b901bf5b247041f0ec8f80 ] We have recently observed a number of subvolumes with broken dentries. ls-ing the parent dir looks like: drwxrwxrwt 1 root root 16 Jan 23 16:49 . drwxr-xr-x 1 root root 24 Jan 23 16:48 .. d????????? ? ? ? ? ? broken_subvol and similarly stat-ing the file fails. In this state, deleting the subvol fails with ENOENT, but attempting to create a new file or subvol over it errors out with EEXIST and even aborts the fs. Which leaves us a bit stuck. dmesg contains a single notable error message reading: "could not do orphan cleanup -2" 2 is ENOENT and the error comes from the failure handling path of btrfs_orphan_cleanup(), with the stack leading back up to btrfs_lookup(). btrfs_lookup btrfs_lookup_dentry btrfs_orphan_cleanup // prints that message and returns -ENOENT After some detailed inspection of the internal state, it became clear that: - there are no orphan items for the subvol - the subvol is otherwise healthy looking, it is not half-deleted or anything, there is no drop progress, etc. - the subvol was created a while ago and does the meaningful first btrfs_orphan_cleanup() call that sets BTRFS_ROOT_ORPHAN_CLEANUP much later. - after btrfs_orphan_cleanup() fails, btrfs_lookup_dentry() returns -ENOENT, which results in a negative dentry for the subvolume via d_splice_alias(NULL, dentry), leading to the observed behavior. The bug can be mitigated by dropping the dentry cache, at which point we can successfully delete the subvolume if we want. i.e., btrfs_lookup() btrfs_lookup_dentry() if (!sb_rdonly(inode->vfs_inode)->vfs_inode) btrfs_orphan_cleanup(sub_root) test_and_set_bit(BTRFS_ROOT_ORPHAN_CLEANUP) btrfs_search_slot() // finds orphan item for inode N ... prints "could not do orphan cleanup -2" if (inode == ERR_PTR(-ENOENT)) inode = NULL; return d_splice_alias(NULL, dentry) // NEGATIVE DENTRY for valid subvolume btrfs_orphan_cleanup() does test_and_set_bit(BTRFS_ROOT_ORPHAN_CLEANUP) on the root when it runs, so it cannot run more than once on a given root, so something else must run concurrently. However, the obvious routes to deleting an orphan when nlinks goes to 0 should not be able to run without first doing a lookup into the subvolume, which should run btrfs_orphan_cleanup() and set the bit. The final important observation is that create_subvol() calls d_instantiate_new() but does not set BTRFS_ROOT_ORPHAN_CLEANUP, so if the dentry cache gets dropped, the next lookup into the subvolume will make a real call into btrfs_orphan_cleanup() for the first time. This opens up the possibility of concurrently deleting the inode/orphan items but most typical evict() paths will be holding a reference on the parent dentry (child dentry holds parent->d_lockref.count via dget in d_alloc(), released in __dentry_kill()) and prevent the parent from being removed from the dentry cache. The one exception is delayed iputs. Ordered extent creation calls igrab() on the inode. If the file is unlinked and closed while those refs are held, iput() in __dentry_kill() decrements i_count but does not trigger eviction (i_count > 0). The child dentry is freed and the subvol dentry's d_lockref.count drops to 0, making it evictable while the inode is still alive. Since there are two races (the race between writeback and unlink and the race between lookup and delayed iputs), and there are too many moving parts, the following three diagrams show the complete picture. (Only the second and third are races) Phase 1: Create Subvol in dentry cache without BTRFS_ROOT_ORPHAN_CLEANUP set btrfs_mksubvol() lookup_one_len() __lookup_slow() d_alloc_parallel() __d_alloc() // d_lockref.count = 1 create_subvol(dentry) // doesn't touch the bit.. d_instantiate_new(dentry, inode) // dentry in cache with d_lockref.count == 1 Phase 2: Create a delayed iput for a file in the subvol but leave the subvol in state where its dentry can be evicted (d_lockref.count == 0) T1 (task) T2 (writeback) T3 (OE workqueue) write() // dirty pages btrfs_writepages() btrfs_run_delalloc_range() cow_file_range() btrfs_alloc_ordered_extent() igrab() // i_count: 1 -> 2 btrfs_unlink_inode() btrfs_orphan_add() close() __fput() dput() finish_dput() __dentry_kill() dentry_unlink_inode() iput() // 2 -> 1 --parent->d_lockref.count // 1 -> 0; evictable finish_ordered_fn() btrfs_finish_ordered_io() btrfs_put_ordered_extent() btrfs_add_delayed_iput() Phase 3: Once the delayed iput is pending and the subvol dentry is evictable, the shrinker can free it, causing the next lookup to go through btrfs_lookup() and call btrfs_orphan_cleanup() for the first time. If the cleaner kthread processes the delayed iput concurrently, the two race: T1 (shrinker) T2 (cleaner kthread) T3 (lookup) super_cache_scan() prune_dcache_sb() __dentry_kill() // subvol dentry freed btrfs_run_delayed_iputs() iput() // i_count -> 0 evict() // sets I_FREEING btrfs_evict_inode() // truncation loop btrfs_lookup() btrfs_lookup_dentry() btrfs_orphan_cleanup() // first call (bit never set) btrfs_iget() // blocks on I_FREEING btrfs_orphan_del() // inode freed // returns -ENOENT btrfs_del_orphan_item() // -ENOENT // "could not do orphan cleanup -2" d_splice_alias(NULL, dentry) // negative dentry for valid subvol The most straightforward fix is to ensure the invariant that a dentry for a subvolume can exist if and only if that subvolume has BTRFS_ROOT_ORPHAN_CLEANUP set on its root (and is known to have no orphans or ran btrfs_orphan_cleanup()). Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-25ksmbd: fix use-after-free in durable v2 replay of active file handlesHyunwoo Kim1-1/+5
[ Upstream commit b425e4d0eb321a1116ddbf39636333181675d8f4 ] parse_durable_handle_context() unconditionally assigns dh_info->fp->conn to the current connection when handling a DURABLE_REQ_V2 context with SMB2_FLAGS_REPLAY_OPERATION. ksmbd_lookup_fd_cguid() does not filter by fp->conn, so it returns file handles that are already actively connected. The unconditional overwrite replaces fp->conn, and when the overwriting connection is subsequently freed, __ksmbd_close_fd() dereferences the stale fp->conn via spin_lock(&fp->conn->llist_lock), causing a use-after-free. KASAN report: [ 7.349357] ================================================================== [ 7.349607] BUG: KASAN: slab-use-after-free in _raw_spin_lock+0x75/0xe0 [ 7.349811] Write of size 4 at addr ffff8881056ac18c by task kworker/1:2/108 [ 7.350010] [ 7.350064] CPU: 1 UID: 0 PID: 108 Comm: kworker/1:2 Not tainted 7.0.0-rc3+ #58 PREEMPTLAZY [ 7.350068] Hardware name: QEMU Ubuntu 24.04 PC v2 (i440FX + PIIX, arch_caps fix, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014 [ 7.350070] Workqueue: ksmbd-io handle_ksmbd_work [ 7.350083] Call Trace: [ 7.350087] <TASK> [ 7.350087] dump_stack_lvl+0x64/0x80 [ 7.350094] print_report+0xce/0x660 [ 7.350100] ? __pfx__raw_spin_lock_irqsave+0x10/0x10 [ 7.350101] ? __pfx___mod_timer+0x10/0x10 [ 7.350106] ? _raw_spin_lock+0x75/0xe0 [ 7.350108] kasan_report+0xce/0x100 [ 7.350109] ? _raw_spin_lock+0x75/0xe0 [ 7.350114] kasan_check_range+0x105/0x1b0 [ 7.350116] _raw_spin_lock+0x75/0xe0 [ 7.350118] ? __pfx__raw_spin_lock+0x10/0x10 [ 7.350119] ? __call_rcu_common.constprop.0+0x25e/0x780 [ 7.350125] ? close_id_del_oplock+0x2cc/0x4e0 [ 7.350128] __ksmbd_close_fd+0x27f/0xaf0 [ 7.350131] ksmbd_close_fd+0x135/0x1b0 [ 7.350133] smb2_close+0xb19/0x15b0 [ 7.350142] ? __pfx_smb2_close+0x10/0x10 [ 7.350143] ? xas_load+0x18/0x270 [ 7.350146] ? _raw_spin_lock+0x84/0xe0 [ 7.350148] ? __pfx__raw_spin_lock+0x10/0x10 [ 7.350150] ? _raw_spin_unlock+0xe/0x30 [ 7.350151] ? ksmbd_smb2_check_message+0xeb2/0x24c0 [ 7.350153] ? ksmbd_tree_conn_lookup+0xcd/0xf0 [ 7.350154] handle_ksmbd_work+0x40f/0x1080 [ 7.350156] process_one_work+0x5fa/0xef0 [ 7.350162] ? assign_work+0x122/0x3e0 [ 7.350163] worker_thread+0x54b/0xf70 [ 7.350165] ? __pfx_worker_thread+0x10/0x10 [ 7.350166] kthread+0x346/0x470 [ 7.350170] ? recalc_sigpending+0x19b/0x230 [ 7.350176] ? __pfx_kthread+0x10/0x10 [ 7.350178] ret_from_fork+0x4fb/0x6c0 [ 7.350183] ? __pfx_ret_from_fork+0x10/0x10 [ 7.350185] ? __switch_to+0x36c/0xbe0 [ 7.350188] ? __pfx_kthread+0x10/0x10 [ 7.350190] ret_from_fork_asm+0x1a/0x30 [ 7.350197] </TASK> [ 7.350197] [ 7.355160] Allocated by task 123: [ 7.355261] kasan_save_stack+0x33/0x60 [ 7.355373] kasan_save_track+0x14/0x30 [ 7.355484] __kasan_kmalloc+0x8f/0xa0 [ 7.355593] ksmbd_conn_alloc+0x44/0x6d0 [ 7.355711] ksmbd_kthread_fn+0x243/0xd70 [ 7.355839] kthread+0x346/0x470 [ 7.355942] ret_from_fork+0x4fb/0x6c0 [ 7.356051] ret_from_fork_asm+0x1a/0x30 [ 7.356164] [ 7.356214] Freed by task 134: [ 7.356305] kasan_save_stack+0x33/0x60 [ 7.356416] kasan_save_track+0x14/0x30 [ 7.356527] kasan_save_free_info+0x3b/0x60 [ 7.356646] __kasan_slab_free+0x43/0x70 [ 7.356761] kfree+0x1ca/0x430 [ 7.356862] ksmbd_tcp_disconnect+0x59/0xe0 [ 7.356993] ksmbd_conn_handler_loop+0x77e/0xd40 [ 7.357138] kthread+0x346/0x470 [ 7.357240] ret_from_fork+0x4fb/0x6c0 [ 7.357350] ret_from_fork_asm+0x1a/0x30 [ 7.357463] [ 7.357513] The buggy address belongs to the object at ffff8881056ac000 [ 7.357513] which belongs to the cache kmalloc-1k of size 1024 [ 7.357857] The buggy address is located 396 bytes inside of [ 7.357857] freed 1024-byte region [ffff8881056ac000, ffff8881056ac400) Fix by removing the unconditional fp->conn assignment and rejecting the replay when fp->conn is non-NULL. This is consistent with ksmbd_lookup_durable_fd(), which also rejects file handles with a non-NULL fp->conn. For disconnected file handles (fp->conn == NULL), ksmbd_reopen_durable_fd() handles setting fp->conn. Fixes: c8efcc786146 ("ksmbd: add support for durable handles v1/v2") Signed-off-by: Hyunwoo Kim <imv4bel@gmail.com> Acked-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-25ksmbd: fix use-after-free of share_conf in compound requestHyunwoo Kim1-0/+2
[ Upstream commit c33615f995aee80657b9fdfbc4ee7f49c2bd733d ] smb2_get_ksmbd_tcon() reuses work->tcon in compound requests without validating tcon->t_state. ksmbd_tree_conn_lookup() checks t_state == TREE_CONNECTED on the initial lookup path, but the compound reuse path bypasses this check entirely. If a prior command in the compound (SMB2_TREE_DISCONNECT) sets t_state to TREE_DISCONNECTED and frees share_conf via ksmbd_share_config_put(), subsequent commands dereference the freed share_conf through work->tcon->share_conf. KASAN report: [ 4.144653] ================================================================== [ 4.145059] BUG: KASAN: slab-use-after-free in smb2_write+0xc74/0xe70 [ 4.145415] Read of size 4 at addr ffff88810430c194 by task kworker/1:1/44 [ 4.145772] [ 4.145867] CPU: 1 UID: 0 PID: 44 Comm: kworker/1:1 Not tainted 7.0.0-rc3+ #60 PREEMPTLAZY [ 4.145871] Hardware name: QEMU Ubuntu 24.04 PC v2 (i440FX + PIIX, arch_caps fix, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014 [ 4.145875] Workqueue: ksmbd-io handle_ksmbd_work [ 4.145888] Call Trace: [ 4.145892] <TASK> [ 4.145894] dump_stack_lvl+0x64/0x80 [ 4.145910] print_report+0xce/0x660 [ 4.145919] ? __pfx__raw_spin_lock_irqsave+0x10/0x10 [ 4.145928] ? smb2_write+0xc74/0xe70 [ 4.145931] kasan_report+0xce/0x100 [ 4.145934] ? smb2_write+0xc74/0xe70 [ 4.145937] smb2_write+0xc74/0xe70 [ 4.145939] ? __pfx_smb2_write+0x10/0x10 [ 4.145942] ? _raw_spin_unlock+0xe/0x30 [ 4.145945] ? ksmbd_smb2_check_message+0xeb2/0x24c0 [ 4.145948] ? smb2_tree_disconnect+0x31c/0x480 [ 4.145951] handle_ksmbd_work+0x40f/0x1080 [ 4.145953] process_one_work+0x5fa/0xef0 [ 4.145962] ? assign_work+0x122/0x3e0 [ 4.145964] worker_thread+0x54b/0xf70 [ 4.145967] ? __pfx_worker_thread+0x10/0x10 [ 4.145970] kthread+0x346/0x470 [ 4.145976] ? recalc_sigpending+0x19b/0x230 [ 4.145980] ? __pfx_kthread+0x10/0x10 [ 4.145984] ret_from_fork+0x4fb/0x6c0 [ 4.145992] ? __pfx_ret_from_fork+0x10/0x10 [ 4.145995] ? __switch_to+0x36c/0xbe0 [ 4.145999] ? __pfx_kthread+0x10/0x10 [ 4.146003] ret_from_fork_asm+0x1a/0x30 [ 4.146013] </TASK> [ 4.146014] [ 4.149858] Allocated by task 44: [ 4.149953] kasan_save_stack+0x33/0x60 [ 4.150061] kasan_save_track+0x14/0x30 [ 4.150169] __kasan_kmalloc+0x8f/0xa0 [ 4.150274] ksmbd_share_config_get+0x1dd/0xdd0 [ 4.150401] ksmbd_tree_conn_connect+0x7e/0x600 [ 4.150529] smb2_tree_connect+0x2e6/0x1000 [ 4.150645] handle_ksmbd_work+0x40f/0x1080 [ 4.150761] process_one_work+0x5fa/0xef0 [ 4.150873] worker_thread+0x54b/0xf70 [ 4.150978] kthread+0x346/0x470 [ 4.151071] ret_from_fork+0x4fb/0x6c0 [ 4.151176] ret_from_fork_asm+0x1a/0x30 [ 4.151286] [ 4.151332] Freed by task 44: [ 4.151418] kasan_save_stack+0x33/0x60 [ 4.151526] kasan_save_track+0x14/0x30 [ 4.151634] kasan_save_free_info+0x3b/0x60 [ 4.151751] __kasan_slab_free+0x43/0x70 [ 4.151861] kfree+0x1ca/0x430 [ 4.151952] __ksmbd_tree_conn_disconnect+0xc8/0x190 [ 4.152088] smb2_tree_disconnect+0x1cd/0x480 [ 4.152211] handle_ksmbd_work+0x40f/0x1080 [ 4.152326] process_one_work+0x5fa/0xef0 [ 4.152438] worker_thread+0x54b/0xf70 [ 4.152545] kthread+0x346/0x470 [ 4.152638] ret_from_fork+0x4fb/0x6c0 [ 4.152743] ret_from_fork_asm+0x1a/0x30 [ 4.152853] [ 4.152900] The buggy address belongs to the object at ffff88810430c180 [ 4.152900] which belongs to the cache kmalloc-96 of size 96 [ 4.153226] The buggy address is located 20 bytes inside of [ 4.153226] freed 96-byte region [ffff88810430c180, ffff88810430c1e0) [ 4.153549] [ 4.153596] The buggy address belongs to the physical page: [ 4.153750] page: refcount:0 mapcount:0 mapping:0000000000000000 index:0xffff88810430ce80 pfn:0x10430c [ 4.154000] flags: 0x100000000000200(workingset|node=0|zone=2) [ 4.154160] page_type: f5(slab) [ 4.154251] raw: 0100000000000200 ffff888100041280 ffff888100040110 ffff888100040110 [ 4.154461] raw: ffff88810430ce80 0000000800200009 00000000f5000000 0000000000000000 [ 4.154668] page dumped because: kasan: bad access detected [ 4.154820] [ 4.154866] Memory state around the buggy address: [ 4.155002] ffff88810430c080: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc [ 4.155196] ffff88810430c100: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc [ 4.155391] >ffff88810430c180: fa fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc [ 4.155587] ^ [ 4.155693] ffff88810430c200: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc [ 4.155891] ffff88810430c280: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc [ 4.156087] ================================================================== Add the same t_state validation to the compound reuse path, consistent with ksmbd_tree_conn_lookup(). Fixes: 5005bcb42191 ("ksmbd: validate session id and tree id in the compound request") Signed-off-by: Hyunwoo Kim <imv4bel@gmail.com> Acked-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-25btrfs: tree-checker: fix misleading root drop_level error messageZhengYuan Huang1-1/+1
[ Upstream commit fc1cd1f18c34f91e78362f9629ab9fd43b9dcab9 ] Fix tree-checker error message to report "invalid root drop_level" instead of the misleading "invalid root level". Fixes: 259ee7754b67 ("btrfs: tree-checker: Add ROOT_ITEM check") Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: ZhengYuan Huang <gality369@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-25btrfs: log new dentries when logging parent dir of a conflicting inodeFilipe Manana1-0/+6
[ Upstream commit 9573a365ff9ff45da9222d3fe63695ce562beb24 ] If we log the parent directory of a conflicting inode, we are not logging the new dentries of the directory, so when we finish we have the parent directory's inode marked as logged but we did not log its new dentries. As a consequence if the parent directory is explicitly fsynced later and it does not have any new changes since we logged it, the fsync is a no-op and after a power failure the new dentries are missing. Example scenario: $ mkdir foo $ sync $rmdir foo $ mkdir dir1 $ mkdir dir2 # A file with the same name and parent as the directory we just deleted # and was persisted in a past transaction. So the deleted directory's # inode is a conflicting inode of this new file's inode. $ touch foo $ ln foo dir2/link # The fsync on dir2 will log the parent directory (".") because the # conflicting inode (deleted directory) does not exists anymore, but it # it does not log its new dentries (dir1). $ xfs_io -c "fsync" dir2 # This fsync on the parent directory is no-op, since the previous fsync # logged it (but without logging its new dentries). $ xfs_io -c "fsync" . <power failure> # After log replay dir1 is missing. Fix this by ensuring we log new dir dentries whenever we log the parent directory of a no longer existing conflicting inode. A test case for fstests will follow soon. Reported-by: Vyacheslav Kovalevsky <slava.kovalevskiy.2014@gmail.com> Link: https://lore.kernel.org/linux-btrfs/182055fa-e9ce-4089-9f5f-4b8a23e8dd91@gmail.com/ Fixes: a3baaf0d786e ("Btrfs: fix fsync after succession of renames and unlink/rmdir") Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-25nfsd: fix heap overflow in NFSv4.0 LOCK replay cacheJeff Layton2-7/+19
[ Upstream commit 5133b61aaf437e5f25b1b396b14242a6bb0508e2 ] The NFSv4.0 replay cache uses a fixed 112-byte inline buffer (rp_ibuf[NFSD4_REPLAY_ISIZE]) to store encoded operation responses. This size was calculated based on OPEN responses and does not account for LOCK denied responses, which include the conflicting lock owner as a variable-length field up to 1024 bytes (NFS4_OPAQUE_LIMIT). When a LOCK operation is denied due to a conflict with an existing lock that has a large owner, nfsd4_encode_operation() copies the full encoded response into the undersized replay buffer via read_bytes_from_xdr_buf() with no bounds check. This results in a slab-out-of-bounds write of up to 944 bytes past the end of the buffer, corrupting adjacent heap memory. This can be triggered remotely by an unauthenticated attacker with two cooperating NFSv4.0 clients: one sets a lock with a large owner string, then the other requests a conflicting lock to provoke the denial. We could fix this by increasing NFSD4_REPLAY_ISIZE to allow for a full opaque, but that would increase the size of every stateowner, when most lockowners are not that large. Instead, fix this by checking the encoded response length against NFSD4_REPLAY_ISIZE before copying into the replay buffer. If the response is too large, set rp_buflen to 0 to skip caching the replay payload. The status is still cached, and the client already received the correct response on the original request. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Cc: stable@kernel.org Reported-by: Nicholas Carlini <npc@anthropic.com> Tested-by: Nicholas Carlini <npc@anthropic.com> Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> [ replaced `op_status_offset + XDR_UNIT` with existing `post_err_offset` variable ] Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-03-25btrfs: fix transaction abort on file creation due to name hash collisionFilipe Manana1-0/+19
[ Upstream commit 2d1ababdedd4ba38867c2500eb7f95af5ddeeef7 ] If we attempt to create several files with names that result in the same hash, we have to pack them in same dir item and that has a limit inherent to the leaf size. However if we reach that limit, we trigger a transaction abort and turns the filesystem into RO mode. This allows for a malicious user to disrupt a system, without the need to have administration privileges/capabilities. Reproducer: $ cat exploit-hash-collisions.sh #!/bin/bash DEV=/dev/sdi MNT=/mnt/sdi # Use smallest node size to make the test faster and require fewer file # names that result in hash collision. mkfs.btrfs -f --nodesize 4K $DEV mount $DEV $MNT # List of names that result in the same crc32c hash for btrfs. declare -a names=( 'foobar' '%a8tYkxfGMLWRGr55QSeQc4PBNH9PCLIvR6jZnkDtUUru1t@RouaUe_L:@xGkbO3nCwvLNYeK9vhE628gss:T$yZjZ5l-Nbd6CbC$M=hqE-ujhJICXyIxBvYrIU9-TDC' 'AQci3EUB%shMsg-N%frgU:02ByLs=IPJU0OpgiWit5nexSyxZDncY6WB:=zKZuk5Zy0DD$Ua78%MelgBuMqaHGyKsJUFf9s=UW80PcJmKctb46KveLSiUtNmqrMiL9-Y0I_l5Fnam04CGIg=8@U:Z' 'CvVqJpJzueKcuA$wqwePfyu7VxuWNN3ho$p0zi2H8QFYK$7YlEqOhhb%:hHgjhIjW5vnqWHKNP4' 'ET:vk@rFU4tsvMB0$C_p=xQHaYZjvoF%-BTc%wkFW8yaDAPcCYoR%x$FH5O:' 'HwTon%v7SGSP4FE08jBwwiu5aot2CFKXHTeEAa@38fUcNGOWvE@Mz6WBeDH_VooaZ6AgsXPkVGwy9l@@ZbNXabUU9csiWrrOp0MWUdfi$EZ3w9GkIqtz7I_eOsByOkBOO' 'Ij%2VlFGXSuPvxJGf5UWy6O@1svxGha%b@=%wjkq:CIgE6u7eJOjmQY5qTtxE2Rjbis9@us' 'KBkjG5%9R8K9sOG8UTnAYjxLNAvBmvV5vz3IiZaPmKuLYO03-6asI9lJ_j4@6Xo$KZicaLWJ3Pv8XEwVeUPMwbHYWwbx0pYvNlGMO9F:ZhHAwyctnGy%_eujl%WPd4U2BI7qooOSr85J-C2V$LfY' 'NcRfDfuUQ2=zP8K3CCF5dFcpfiOm6mwenShsAb_F%n6GAGC7fT2JFFn:c35X-3aYwoq7jNX5$ZJ6hI3wnZs$7KgGi7wjulffhHNUxAT0fRRLF39vJ@NvaEMxsMO' 'Oj42AQAEzRoTxa5OuSKIr=A_lwGMy132v4g3Pdq1GvUG9874YseIFQ6QU' 'Ono7avN5GjC:_6dBJ_' 'WHmN2gnmaN-9dVDy4aWo:yNGFzz8qsJyJhWEWcud7$QzN2D9R0efIWWEdu5kwWr73NZm4=@CoCDxrrZnRITr-kGtU_cfW2:%2_am' 'WiFnuTEhAG9FEC6zopQmj-A-$LDQ0T3WULz%ox3UZAPybSV6v1Z$b4L_XBi4M4BMBtJZpz93r9xafpB77r:lbwvitWRyo$odnAUYlYMmU4RvgnNd--e=I5hiEjGLETTtaScWlQp8mYsBovZwM2k' 'XKyH=OsOAF3p%uziGF_ZVr$ivrvhVgD@1u%5RtrV-gl_vqAwHkK@x7YwlxX3qT6WKKQ%PR56NrUBU2dOAOAdzr2=5nJuKPM-T-$ZpQfCL7phxQbUcb:BZOTPaFExc-qK-gDRCDW2' 'd3uUR6OFEwZr%ns1XH_@tbxA@cCPmbBRLdyh7p6V45H$P2$F%w0RqrD3M0g8aGvWpoTFMiBdOTJXjD:JF7=h9a_43xBywYAP%r$SPZi%zDg%ql-KvkdUCtF9OLaQlxmd' 'ePTpbnit%hyNm@WELlpKzNZYOzOTf8EQ$sEfkMy1VOfIUu3coyvIr13-Y7Sv5v-Ivax2Go_GQRFMU1b3362nktT9WOJf3SpT%z8sZmM3gvYQBDgmKI%%RM-G7hyrhgYflOw%z::ZRcv5O:lDCFm' 'evqk743Y@dvZAiG5J05L_ROFV@$2%rVWJ2%3nxV72-W7$e$-SK3tuSHA2mBt$qloC5jwNx33GmQUjD%akhBPu=VJ5g$xhlZiaFtTrjeeM5x7dt4cHpX0cZkmfImndYzGmvwQG:$euFYmXn$_2rA9mKZ' 'gkgUtnihWXsZQTEkrMAWIxir09k3t7jk_IK25t1:cy1XWN0GGqC%FrySdcmU7M8MuPO_ppkLw3=Dfr0UuBAL4%GFk2$Ma10V1jDRGJje%Xx9EV2ERaWKtjpwiZwh0gCSJsj5UL7CR8RtW5opCVFKGGy8Cky' 'hNgsG_8lNRik3PvphqPm0yEH3P%%fYG:kQLY=6O-61Wa6nrV_WVGR6TLB09vHOv%g4VQRP8Gzx7VXUY1qvZyS' 'isA7JVzN12xCxVPJZ_qoLm-pTBuhjjHMvV7o=F:EaClfYNyFGlsfw-Kf%uxdqW-kwk1sPl2vhbjyHU1A6$hz' 'kiJ_fgcdZFDiOptjgH5PN9-PSyLO4fbk_:u5_2tz35lV_iXiJ6cx7pwjTtKy-XGaQ5IefmpJ4N_ZqGsqCsKuqOOBgf9LkUdffHet@Wu' 'lvwtxyhE9:%Q3UxeHiViUyNzJsy:fm38pg_b6s25JvdhOAT=1s0$pG25x=LZ2rlHTszj=gN6M4zHZYr_qrB49i=pA--@WqWLIuX7o1S_SfS@2FSiUZN' 'rC24cw3UBDZ=5qJBUMs9e$=S4Y94ni%Z8639vnrGp=0Hv4z3dNFL0fBLmQ40=EYIY:Z=SLc@QLMSt2zsss2ZXrP7j4=' 'uwGl2s-fFrf@GqS=DQqq2I0LJSsOmM%xzTjS:lzXguE3wChdMoHYtLRKPvfaPOZF2fER@j53evbKa7R%A7r4%YEkD=kicJe@SFiGtXHbKe4gCgPAYbnVn' 'UG37U6KKua2bgc:IHzRs7BnB6FD:2Mt5Cc5NdlsW%$1tyvnfz7S27FvNkroXwAW:mBZLA1@qa9WnDbHCDmQmfPMC9z-Eq6QT0jhhPpqyymaD:R02ghwYo%yx7SAaaq-:x33LYpei$5g8DMl3C' 'y2vjek0FE1PDJC0qpfnN:x8k2wCFZ9xiUF2ege=JnP98R%wxjKkdfEiLWvQzmnW' '8-HCSgH5B%K7P8_jaVtQhBXpBk:pE-$P7ts58U0J@iR9YZntMPl7j$s62yAJO@_9eanFPS54b=UTw$94C-t=HLxT8n6o9P=QnIxq-f1=Ne2dvhe6WbjEQtc' 'YPPh:IFt2mtR6XWSmjHptXL_hbSYu8bMw-JP8@PNyaFkdNFsk$M=xfL6LDKCDM-mSyGA_2MBwZ8Dr4=R1D%7-mCaaKGxb990jzaagRktDTyp' '9hD2ApKa_t_7x-a@GCG28kY:7$M@5udI1myQ$x5udtggvagmCQcq9QXWRC5hoB0o-_zHQUqZI5rMcz_kbMgvN5jr63LeYA4Cj-c6F5Ugmx6DgVf@2Jqm%MafecpgooqreJ53P-QTS' ) # Now create files with all those names in the same parent directory. # It should not fail since a 4K leaf has enough space for them. for name in "${names[@]}"; do touch $MNT/$name done # Now add one more file name that causes a crc32c hash collision. # This should fail, but it should not turn the filesystem into RO mode # (which could be exploited by malicious users) due to a transaction # abort. touch $MNT/'W6tIm-VK2@BGC@IBfcgg6j_p:pxp_QUqtWpGD5Ok_GmijKOJJt' # Check that we are able to create another file, with a name that does not cause # a crc32c hash collision. echo -n "hello world" > $MNT/baz # Unmount and mount again, verify file baz exists and with the right content. umount $MNT mount $DEV $MNT echo "File baz content: $(cat $MNT/baz)" umount $MNT When running the reproducer: $ ./exploit-hash-collisions.sh (...) touch: cannot touch '/mnt/sdi/W6tIm-VK2@BGC@IBfcgg6j_p:pxp_QUqtWpGD5Ok_GmijKOJJt': Value too large for defined data type ./exploit-hash-collisions.sh: line 57: /mnt/sdi/baz: Read-only file system cat: /mnt/sdi/baz: No such file or directory File baz content: And the transaction abort stack trace in dmesg/syslog: $ dmesg (...) [758240.509761] ------------[ cut here ]------------ [758240.510668] BTRFS: Transaction aborted (error -75) [758240.511577] WARNING: fs/btrfs/inode.c:6854 at btrfs_create_new_inode+0x805/0xb50 [btrfs], CPU#6: touch/888644 [758240.513513] Modules linked in: btrfs dm_zero (...) [758240.523221] CPU: 6 UID: 0 PID: 888644 Comm: touch Tainted: G W 6.19.0-rc8-btrfs-next-225+ #1 PREEMPT(full) [758240.524621] Tainted: [W]=WARN [758240.525037] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-0-gea1b7a073390-prebuilt.qemu.org 04/01/2014 [758240.526331] RIP: 0010:btrfs_create_new_inode+0x80b/0xb50 [btrfs] [758240.527093] Code: 0f 82 cf (...) [758240.529211] RSP: 0018:ffffce64418fbb48 EFLAGS: 00010292 [758240.529935] RAX: 00000000ffffffd3 RBX: 0000000000000000 RCX: 00000000ffffffb5 [758240.531040] RDX: 0000000d04f33e06 RSI: 00000000ffffffb5 RDI: ffffffffc0919dd0 [758240.531920] RBP: ffffce64418fbc10 R08: 0000000000000000 R09: 00000000ffffffb5 [758240.532928] R10: 0000000000000000 R11: ffff8e52c0000000 R12: ffff8e53eee7d0f0 [758240.533818] R13: ffff8e57f70932a0 R14: ffff8e5417629568 R15: 0000000000000000 [758240.534664] FS: 00007f1959a2a740(0000) GS:ffff8e5b27cae000(0000) knlGS:0000000000000000 [758240.535821] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [758240.536644] CR2: 00007f1959b10ce0 CR3: 000000012a2cc005 CR4: 0000000000370ef0 [758240.537517] Call Trace: [758240.537828] <TASK> [758240.538099] btrfs_create_common+0xbf/0x140 [btrfs] [758240.538760] path_openat+0x111a/0x15b0 [758240.539252] do_filp_open+0xc2/0x170 [758240.539699] ? preempt_count_add+0x47/0xa0 [758240.540200] ? __virt_addr_valid+0xe4/0x1a0 [758240.540800] ? __check_object_size+0x1b3/0x230 [758240.541661] ? alloc_fd+0x118/0x180 [758240.542315] do_sys_openat2+0x70/0xd0 [758240.543012] __x64_sys_openat+0x50/0xa0 [758240.543723] do_syscall_64+0x50/0xf20 [758240.544462] entry_SYSCALL_64_after_hwframe+0x76/0x7e [758240.545397] RIP: 0033:0x7f1959abc687 [758240.546019] Code: 48 89 fa (...) [758240.548522] RSP: 002b:00007ffe16ff8690 EFLAGS: 00000202 ORIG_RAX: 0000000000000101 [758240.566278] RAX: ffffffffffffffda RBX: 00007f1959a2a740 RCX: 00007f1959abc687 [758240.567068] RDX: 0000000000000941 RSI: 00007ffe16ffa333 RDI: ffffffffffffff9c [758240.567860] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000 [758240.568707] R10: 00000000000001b6 R11: 0000000000000202 R12: 0000561eec7c4b90 [758240.569712] R13: 0000561eec7c311f R14: 00007ffe16ffa333 R15: 0000000000000000 [758240.570758] </TASK> [758240.571040] ---[ end trace 0000000000000000 ]--- [758240.571681] BTRFS: error (device sdi state A) in btrfs_create_new_inode:6854: errno=-75 unknown [758240.572899] BTRFS info (device sdi state EA): forced readonly Fix this by checking for hash collision, and if the adding a new name is possible, early in btrfs_create_new_inode() before we do any tree updates, so that we don't need to abort the transaction if we cannot add the new name due to the leaf size limit. A test case for fstests will be sent soon. Fixes: caae78e03234 ("btrfs: move common inode creation code into btrfs_create_new_inode()") CC: stable@vger.kernel.org # 6.1+ Reviewed-by: Boris Burkov <boris@bur.io> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-03-25btrfs: fix transaction abort on set received ioctl due to item overflowFilipe Manana3-2/+64
[ Upstream commit 87f2c46003fce4d739138aab4af1942b1afdadac ] If the set received ioctl fails due to an item overflow when attempting to add the BTRFS_UUID_KEY_RECEIVED_SUBVOL we have to abort the transaction since we did some metadata updates before. This means that if a user calls this ioctl with the same received UUID field for a lot of subvolumes, we will hit the overflow, trigger the transaction abort and turn the filesystem into RO mode. A malicious user could exploit this, and this ioctl does not even requires that a user has admin privileges (CAP_SYS_ADMIN), only that he/she owns the subvolume. Fix this by doing an early check for item overflow before starting a transaction. This is also race safe because we are holding the subvol_sem semaphore in exclusive (write) mode. A test case for fstests will follow soon. Fixes: dd5f9615fc5c ("Btrfs: maintain subvolume items in the UUID tree") CC: stable@vger.kernel.org # 3.12+ Reviewed-by: Anand Jain <asj@kernel.org> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> [ adapted BTRFS_PATH_AUTO_FREE macro to manual btrfs_free_path calls ] Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-03-25btrfs: fix transaction abort when snapshotting received subvolumesFilipe Manana1-0/+16
[ Upstream commit e1b18b959025e6b5dbad668f391f65d34b39595a ] Currently a user can trigger a transaction abort by snapshotting a previously received snapshot a bunch of times until we reach a BTRFS_UUID_KEY_RECEIVED_SUBVOL item overflow (the maximum item size we can store in a leaf). This is very likely not common in practice, but if it happens, it turns the filesystem into RO mode. The snapshot, send and set_received_subvol and subvol_setflags (used by receive) don't require CAP_SYS_ADMIN, just inode_owner_or_capable(). A malicious user could use this to turn a filesystem into RO mode and disrupt a system. Reproducer script: $ cat test.sh #!/bin/bash DEV=/dev/sdi MNT=/mnt/sdi # Use smallest node size to make the test faster. mkfs.btrfs -f --nodesize 4K $DEV mount $DEV $MNT # Create a subvolume and set it to RO so that it can be used for send. btrfs subvolume create $MNT/sv touch $MNT/sv/foo btrfs property set $MNT/sv ro true # Send and receive the subvolume into snaps/sv. mkdir $MNT/snaps btrfs send $MNT/sv | btrfs receive $MNT/snaps # Now snapshot the received subvolume, which has a received_uuid, a # lot of times to trigger the leaf overflow. total=500 for ((i = 1; i <= $total; i++)); do echo -ne "\rCreating snapshot $i/$total" btrfs subvolume snapshot -r $MNT/snaps/sv $MNT/snaps/sv_$i > /dev/null done echo umount $MNT When running the test: $ ./test.sh (...) Create subvolume '/mnt/sdi/sv' At subvol /mnt/sdi/sv At subvol sv Creating snapshot 496/500ERROR: Could not create subvolume: Value too large for defined data type Creating snapshot 497/500ERROR: Could not create subvolume: Read-only file system Creating snapshot 498/500ERROR: Could not create subvolume: Read-only file system Creating snapshot 499/500ERROR: Could not create subvolume: Read-only file system Creating snapshot 500/500ERROR: Could not create subvolume: Read-only file system And in dmesg/syslog: $ dmesg (...) [251067.627338] BTRFS warning (device sdi): insert uuid item failed -75 (0x4628b21c4ac8d898, 0x2598bee2b1515c91) type 252! [251067.629212] ------------[ cut here ]------------ [251067.630033] BTRFS: Transaction aborted (error -75) [251067.630871] WARNING: fs/btrfs/transaction.c:1907 at create_pending_snapshot.cold+0x52/0x465 [btrfs], CPU#10: btrfs/615235 [251067.632851] Modules linked in: btrfs dm_zero (...) [251067.644071] CPU: 10 UID: 0 PID: 615235 Comm: btrfs Tainted: G W 6.19.0-rc8-btrfs-next-225+ #1 PREEMPT(full) [251067.646165] Tainted: [W]=WARN [251067.646733] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-0-gea1b7a073390-prebuilt.qemu.org 04/01/2014 [251067.648735] RIP: 0010:create_pending_snapshot.cold+0x55/0x465 [btrfs] [251067.649984] Code: f0 48 0f (...) [251067.653313] RSP: 0018:ffffce644908fae8 EFLAGS: 00010292 [251067.653987] RAX: 00000000ffffff01 RBX: ffff8e5639e63a80 RCX: 00000000ffffffd3 [251067.655042] RDX: ffff8e53faa76b00 RSI: 00000000ffffffb5 RDI: ffffffffc0919750 [251067.656077] RBP: ffffce644908fbd8 R08: 0000000000000000 R09: ffffce644908f820 [251067.657068] R10: ffff8e5adc1fffa8 R11: 0000000000000003 R12: ffff8e53c0431bd0 [251067.658050] R13: ffff8e5414593600 R14: ffff8e55efafd000 R15: 00000000ffffffb5 [251067.659019] FS: 00007f2a4944b3c0(0000) GS:ffff8e5b27dae000(0000) knlGS:0000000000000000 [251067.660115] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [251067.660943] CR2: 00007ffc5aa57898 CR3: 00000005813a2003 CR4: 0000000000370ef0 [251067.661972] Call Trace: [251067.662292] <TASK> [251067.662653] create_pending_snapshots+0x97/0xc0 [btrfs] [251067.663413] btrfs_commit_transaction+0x26e/0xc00 [btrfs] [251067.664257] ? btrfs_qgroup_convert_reserved_meta+0x35/0x390 [btrfs] [251067.665238] ? _raw_spin_unlock+0x15/0x30 [251067.665837] ? record_root_in_trans+0xa2/0xd0 [btrfs] [251067.666531] btrfs_mksubvol+0x330/0x580 [btrfs] [251067.667145] btrfs_mksnapshot+0x74/0xa0 [btrfs] [251067.667827] __btrfs_ioctl_snap_create+0x194/0x1d0 [btrfs] [251067.668595] btrfs_ioctl_snap_create_v2+0x107/0x130 [btrfs] [251067.669479] btrfs_ioctl+0x1580/0x2690 [btrfs] [251067.670093] ? count_memcg_events+0x6d/0x180 [251067.670849] ? handle_mm_fault+0x1a0/0x2a0 [251067.671652] __x64_sys_ioctl+0x92/0xe0 [251067.672406] do_syscall_64+0x50/0xf20 [251067.673129] entry_SYSCALL_64_after_hwframe+0x76/0x7e [251067.674096] RIP: 0033:0x7f2a495648db [251067.674812] Code: 00 48 89 (...) [251067.678227] RSP: 002b:00007ffc5aa57840 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 [251067.679691] RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007f2a495648db [251067.681145] RDX: 00007ffc5aa588b0 RSI: 0000000050009417 RDI: 0000000000000004 [251067.682511] RBP: 0000000000000002 R08: 0000000000000000 R09: 0000000000000000 [251067.683842] R10: 000000000000000a R11: 0000000000000246 R12: 00007ffc5aa59910 [251067.685176] R13: 00007ffc5aa588b0 R14: 0000000000000004 R15: 0000000000000006 [251067.686524] </TASK> [251067.686972] ---[ end trace 0000000000000000 ]--- [251067.687890] BTRFS: error (device sdi state A) in create_pending_snapshot:1907: errno=-75 unknown [251067.689049] BTRFS info (device sdi state EA): forced readonly [251067.689054] BTRFS warning (device sdi state EA): Skipping commit of aborted transaction. [251067.690119] BTRFS: error (device sdi state EA) in cleanup_transaction:2043: errno=-75 unknown [251067.702028] BTRFS info (device sdi state EA): last unmount of filesystem 46dc3975-30a2-4a69-a18f-418b859cccda Fix this by ignoring -EOVERFLOW errors from btrfs_uuid_tree_add() in the snapshot creation code when attempting to add the BTRFS_UUID_KEY_RECEIVED_SUBVOL item. This is OK because it's not critical and we are still able to delete the snapshot, as snapshot/subvolume deletion ignores if a BTRFS_UUID_KEY_RECEIVED_SUBVOL is missing (see inode.c:btrfs_delete_subvolume()). As for send/receive, we can still do send/receive operations since it always peeks the first root ID in the existing BTRFS_UUID_KEY_RECEIVED_SUBVOL (it could peek any since all snapshots have the same content), and even if the key is missing, it falls back to searching by BTRFS_UUID_KEY_SUBVOL key. A test case for fstests will be sent soon. Fixes: dd5f9615fc5c ("Btrfs: maintain subvolume items in the UUID tree") CC: stable@vger.kernel.org # 3.12+ Reviewed-by: Boris Burkov <boris@bur.io> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> [ adapted error check condition to omit unlikely() wrapper ] Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-03-25ksmbd: unset conn->binding on failed binding requestNamjae Jeon1-0/+1
commit 282343cf8a4a5a3603b1cb0e17a7083e4a593b03 upstream. When a multichannel SMB2_SESSION_SETUP request with SMB2_SESSION_REQ_FLAG_BINDING fails ksmbd sets conn->binding = true but never clears it on the error path. This leaves the connection in a binding state where all subsequent ksmbd_session_lookup_all() calls fall back to the global sessions table. This fix it by clearing conn->binding = false in the error path. Cc: stable@vger.kernel.org Reported-by: Hyunwoo Kim <imv4bel@gmail.com> Signed-off-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>