summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)AuthorFilesLines
2 daysext4: fix mballoc-test.c is not compiled when EXT4_KUNIT_TESTS=MYe Bin4-45/+172
[ Upstream commit 519b76ac0b31d86b45784735d4ef964e8efdc56b ] Now, only EXT4_KUNIT_TESTS=Y testcase will be compiled in 'mballoc.c'. To solve this issue, the ext4 test code needs to be decoupled. The ext4 test module is compiled into a separate module. Reported-by: ChenXiaoSong <chenxiaosong@kylinos.cn> Closes: https://patchwork.kernel.org/project/cifs-client/patch/20260118091313.1988168-2-chenxiaosong.chenxiaosong@linux.dev/ Fixes: 7c9fa399a369 ("ext4: add first unit test for ext4_mb_new_blocks_simple in mballoc") Signed-off-by: Ye Bin <yebin10@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20260314075258.1317579-3-yebin@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Sasha Levin <sashal@kernel.org>
2 daysext4: introduce EXPORT_SYMBOL_FOR_EXT4_TEST() helperYe Bin1-0/+5
[ Upstream commit 49504a512587147dd6da3b4b08832ccc157b97dc ] Introduce EXPORT_SYMBOL_FOR_EXT4_TEST() helper for kuint test. Signed-off-by: Ye Bin <yebin10@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20260314075258.1317579-2-yebin@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Stable-dep-of: 519b76ac0b31 ("ext4: fix mballoc-test.c is not compiled when EXT4_KUNIT_TESTS=M") Signed-off-by: Sasha Levin <sashal@kernel.org>
2 daysnetfs: Fix the handling of stream->front by removing itDavid Howells7-12/+7
[ Upstream commit 0e764b9d46071668969410ec5429be0e2f38c6d3 ] The netfs_io_stream::front member is meant to point to the subrequest currently being collected on a stream, but it isn't actually used this way by direct write (which mostly ignores it). However, there's a tracepoint which looks at it. Further, stream->front is actually redundant with stream->subrequests.next. Fix the potential problem in the direct code by just removing the member and using stream->subrequests.next instead, thereby also simplifying the code. Fixes: a0b4c7a49137 ("netfs: Fix unbuffered/DIO writes to dispatch subrequests in strict sequence") Reported-by: Paulo Alcantara <pc@manguebit.org> Signed-off-by: David Howells <dhowells@redhat.com> Link: https://patch.msgid.link/4158599.1774426817@warthog.procyon.org.uk Reviewed-by: Paulo Alcantara (Red Hat) <pc@manguebit.org> cc: netfs@lists.linux.dev cc: linux-fsdevel@vger.kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2 daysbtrfs: fix lost error when running device stats on multiple devices fsFilipe Manana1-2/+3
[ Upstream commit 1c37d896b12dfd0d4c96e310b0033c6676933917 ] Whenever we get an error updating the device stats item for a device in btrfs_run_dev_stats() we allow the loop to go to the next device, and if updating the stats item for the next device succeeds, we end up losing the error we had from the previous device. Fix this by breaking out of the loop once we get an error and make sure it's returned to the caller. Since we are in the transaction commit path (and in the critical section actually), returning the error will result in a transaction abort. Fixes: 733f4fbbc108 ("Btrfs: read device stats on mount, write modified ones during commit") Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2 daysbtrfs: fix leak of kobject name for sub-group space_infoShin'ichiro Kawasaki1-1/+1
[ Upstream commit a4376d9a5d4c9610e69def3fc0b32c86a7ab7a41 ] When create_space_info_sub_group() allocates elements of space_info->sub_group[], kobject_init_and_add() is called for each element via btrfs_sysfs_add_space_info_type(). However, when check_removing_space_info() frees these elements, it does not call btrfs_sysfs_remove_space_info() on them. As a result, kobject_put() is not called and the associated kobj->name objects are leaked. This memory leak is reproduced by running the blktests test case zbd/009 on kernels built with CONFIG_DEBUG_KMEMLEAK. The kmemleak feature reports the following error: unreferenced object 0xffff888112877d40 (size 16): comm "mount", pid 1244, jiffies 4294996972 hex dump (first 16 bytes): 64 61 74 61 2d 72 65 6c 6f 63 00 c4 c6 a7 cb 7f data-reloc...... backtrace (crc 53ffde4d): __kmalloc_node_track_caller_noprof+0x619/0x870 kstrdup+0x42/0xc0 kobject_set_name_vargs+0x44/0x110 kobject_init_and_add+0xcf/0x150 btrfs_sysfs_add_space_info_type+0xfc/0x210 [btrfs] create_space_info_sub_group.constprop.0+0xfb/0x1b0 [btrfs] create_space_info+0x211/0x320 [btrfs] btrfs_init_space_info+0x15a/0x1b0 [btrfs] open_ctree+0x33c7/0x4a50 [btrfs] btrfs_get_tree.cold+0x9f/0x1ee [btrfs] vfs_get_tree+0x87/0x2f0 vfs_cmd_create+0xbd/0x280 __do_sys_fsconfig+0x3df/0x990 do_syscall_64+0x136/0x1540 entry_SYSCALL_64_after_hwframe+0x76/0x7e To avoid the leak, call btrfs_sysfs_remove_space_info() instead of kfree() for the elements. Fixes: f92ee31e031c ("btrfs: introduce btrfs_space_info sub-group") Link: https://lore.kernel.org/linux-block/b9488881-f18d-4f47-91a5-3c9bf63955a5@wdc.com/ Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2 daysbtrfs: fix super block offset in error message in btrfs_validate_super()Mark Harmstone1-2/+2
[ Upstream commit b52fe51f724385b3ed81e37e510a4a33107e8161 ] Fix the superblock offset mismatch error message in btrfs_validate_super(): we changed it so that it considers all the superblocks, but the message still assumes we're only looking at the first one. The change from %u to %llu is because we're changing from a constant to a u64. Fixes: 069ec957c35e ("btrfs: Refactor btrfs_check_super_valid") Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Mark Harmstone <mark@harmstone.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2 daysnetfs: Fix read abandonment during retryDavid Howells1-1/+4
[ Upstream commit 7e57523490cd2efb52b1ea97f2e0a74c0fb634cd ] Under certain circumstances, all the remaining subrequests from a read request will get abandoned during retry. The abandonment process expects the 'subreq' variable to be set to the place to start abandonment from, but it doesn't always have a useful value (it will be uninitialised on the first pass through the loop and it may point to a deleted subrequest on later passes). Fix the first jump to "abandon:" to set subreq to the start of the first subrequest expected to need retry (which, in this abandonment case, turned out unexpectedly to no longer have NEED_RETRY set). Also clear the subreq pointer after discarding superfluous retryable subrequests to cause an oops if we do try to access it. Fixes: ee4cdf7ba857 ("netfs: Speed up buffered reading") Signed-off-by: David Howells <dhowells@redhat.com> Link: https://patch.msgid.link/3775287.1773848338@warthog.procyon.org.uk Reviewed-by: Paulo Alcantara (Red Hat) <pc@manguebit.org> cc: Paulo Alcantara <pc@manguebit.org> cc: netfs@lists.linux.dev cc: linux-fsdevel@vger.kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2 daysnetfs: Fix NULL pointer dereference in netfs_unbuffered_write() on retryDeepanshu Kartikey1-3/+11
[ Upstream commit e9075e420a1eb3b52c60f3b95893a55e77419ce8 ] When a write subrequest is marked NETFS_SREQ_NEED_RETRY, the retry path in netfs_unbuffered_write() unconditionally calls stream->prepare_write() without checking if it is NULL. Filesystems such as 9P do not set the prepare_write operation, so stream->prepare_write remains NULL. When get_user_pages() fails with -EFAULT and the subrequest is flagged for retry, this results in a NULL pointer dereference at fs/netfs/direct_write.c:189. Fix this by mirroring the pattern already used in write_retry.c: if stream->prepare_write is NULL, skip renegotiation and directly reissue the subrequest via netfs_reissue_write(), which handles iterator reset, IN_PROGRESS flag, stats update and reissue internally. Fixes: a0b4c7a49137 ("netfs: Fix unbuffered/DIO writes to dispatch subrequests in strict sequence") Reported-by: syzbot+7227db0fbac9f348dba0@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=7227db0fbac9f348dba0 Signed-off-by: Deepanshu Kartikey <Kartikey406@gmail.com> Link: https://patch.msgid.link/20260307043947.347092-1-kartikey406@gmail.com Tested-by: syzbot+7227db0fbac9f348dba0@syzkaller.appspotmail.com Signed-off-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2 daysnetfs: Fix kernel BUG in netfs_limit_iter() for ITER_KVEC iteratorsDeepanshu Kartikey1-0/+43
[ Upstream commit 67e467a11f62ff64ad219dc6aa5459e132c79d14 ] When a process crashes and the kernel writes a core dump to a 9P filesystem, __kernel_write() creates an ITER_KVEC iterator. This iterator reaches netfs_limit_iter() via netfs_unbuffered_write(), which only handles ITER_FOLIOQ, ITER_BVEC and ITER_XARRAY iterator types, hitting the BUG() for any other type. Fix this by adding netfs_limit_kvec() following the same pattern as netfs_limit_bvec(), since both kvec and bvec are simple segment arrays with pointer and length fields. Dispatch it from netfs_limit_iter() when the iterator type is ITER_KVEC. Fixes: cae932d3aee5 ("netfs: Add func to calculate pagecount/size-limited span of an iterator") Reported-by: syzbot+9c058f0d63475adc97fd@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=9c058f0d63475adc97fd Tested-by: syzbot+9c058f0d63475adc97fd@syzkaller.appspotmail.com Signed-off-by: Deepanshu Kartikey <Kartikey406@gmail.com> Link: https://patch.msgid.link/20260307090041.359870-1-kartikey406@gmail.com Signed-off-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2 daysksmbd: fix use-after-free and NULL deref in smb_grant_oplock()Werner Kasselman1-27/+45
[ Upstream commit 48623ec358c1c600fa1e38368746f933e0f1a617 ] smb_grant_oplock() has two issues in the oplock publication sequence: 1) opinfo is linked into ci->m_op_list (via opinfo_add) before add_lease_global_list() is called. If add_lease_global_list() fails (kmalloc returns NULL), the error path frees the opinfo via __free_opinfo() while it is still linked in ci->m_op_list. Concurrent m_op_list readers (opinfo_get_list, or direct iteration in smb_break_all_levII_oplock) dereference the freed node. 2) opinfo->o_fp is assigned after add_lease_global_list() publishes the opinfo on the global lease list. A concurrent find_same_lease_key() can walk the lease list and dereference opinfo->o_fp->f_ci while o_fp is still NULL. Fix by restructuring the publication sequence to eliminate post-publish failure: - Set opinfo->o_fp before any list publication (fixes NULL deref). - Preallocate lease_table via alloc_lease_table() before opinfo_add() so add_lease_global_list() becomes infallible after publication. - Keep the original m_op_list publication order (opinfo_add before lease list) so concurrent opens via same_client_has_lease() and opinfo_get_list() still see the in-flight grant. - Use opinfo_put() instead of __free_opinfo() on err_out so that the RCU-deferred free path is used. This also requires splitting add_lease_global_list() to take a preallocated lease_table and changing its return type from int to void, since it can no longer fail. Fixes: 1dfd062caa16 ("ksmbd: fix use-after-free by using call_rcu() for oplock_info") Cc: stable@vger.kernel.org Signed-off-by: Werner Kasselman <werner@verivus.com> Reviewed-by: ChenXiaoSong <chenxiaosong@kylinos.cn> Acked-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com> [ adapted kmalloc_obj() macro to kmalloc(sizeof()) ] Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2 daysext4: always drain queued discard work in ext4_mb_release()Theodore Ts'o1-7/+5
commit 9ee29d20aab228adfb02ca93f87fb53c56c2f3af upstream. While reviewing recent ext4 patch[1], Sashiko raised the following concern[2]: > If the filesystem is initially mounted with the discard option, > deleting files will populate sbi->s_discard_list and queue > s_discard_work. If it is then remounted with nodiscard, the > EXT4_MOUNT_DISCARD flag is cleared, but the pending s_discard_work is > neither cancelled nor flushed. [1] https://lore.kernel.org/r/20260319094545.19291-1-qiang.zhang@linux.dev/ [2] https://sashiko.dev/#/patchset/20260319094545.19291-1-qiang.zhang%40linux.dev The concern was valid, but it had nothing to do with the patch[1]. One of the problems with Sashiko in its current (early) form is that it will detect pre-existing issues and report it as a problem with the patch that it is reviewing. In practice, it would be hard to hit deliberately (unless you are a malicious syzkaller fuzzer), since it would involve mounting the file system with -o discard, and then deleting a large number of files, remounting the file system with -o nodiscard, and then immediately unmounting the file system before the queued discard work has a change to drain on its own. Fix it because it's a real bug, and to avoid Sashiko from raising this concern when analyzing future patches to mballoc.c. Signed-off-by: Theodore Ts'o <tytso@mit.edu> Fixes: 55cdd0af2bc5 ("ext4: get discard out of jbd2 commit kthread contex") Cc: stable@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2 daysext4: fix iloc.bh leak in ext4_fc_replay_inode() error pathsBaokun Li1-5/+8
commit ec0a7500d8eace5b4f305fa0c594dd148f0e8d29 upstream. During code review, Joseph found that ext4_fc_replay_inode() calls ext4_get_fc_inode_loc() to get the inode location, which holds a reference to iloc.bh that must be released via brelse(). However, several error paths jump to the 'out' label without releasing iloc.bh: - ext4_handle_dirty_metadata() failure - sync_dirty_buffer() failure - ext4_mark_inode_used() failure - ext4_iget() failure Fix this by introducing an 'out_brelse' label placed just before the existing 'out' label to ensure iloc.bh is always released. Additionally, make ext4_fc_replay_inode() propagate errors properly instead of always returning 0. Reported-by: Joseph Qi <joseph.qi@linux.alibaba.com> Fixes: 8016e29f4362 ("ext4: fast commit recovery path") Signed-off-by: Baokun Li <libaokun@linux.alibaba.com> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20260323060836.3452660-1-libaokun@linux.alibaba.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2 daysext4: handle wraparound when searching for blocks for indirect mapped blocksTheodore Ts'o1-0/+2
commit bb81702370fad22c06ca12b6e1648754dbc37e0f upstream. Commit 4865c768b563 ("ext4: always allocate blocks only from groups inode can use") restricts what blocks will be allocated for indirect block based files to block numbers that fit within 32-bit block numbers. However, when using a review bot running on the latest Gemini LLM to check this commit when backporting into an LTS based kernel, it raised this concern: If ac->ac_g_ex.fe_group is >= ngroups (for instance, if the goal group was populated via stream allocation from s_mb_last_groups), then start will be >= ngroups. Does this allow allocating blocks beyond the 32-bit limit for indirect block mapped files? The commit message mentions that ext4_mb_scan_groups_linear() takes care to not select unsupported groups. However, its loop uses group = *start, and the very first iteration will call ext4_mb_scan_group() with this unsupported group because next_linear_group() is only called at the end of the iteration. After reviewing the code paths involved and considering the LLM review, I determined that this can happen when there is a file system where some files/directories are extent-mapped and others are indirect-block mapped. To address this, add a safety clamp in ext4_mb_scan_groups(). Fixes: 4865c768b563 ("ext4: always allocate blocks only from groups inode can use") Cc: Jan Kara <jack@suse.cz> Reviewed-by: Baokun Li <libaokun@linux.alibaba.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Link: https://patch.msgid.link/20260326045834.1175822-1-tytso@mit.edu Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2 daysext4: fix the might_sleep() warnings in kvfree()Zqiang2-13/+5
commit 496bb99b7e66f48b178126626f47e9ba79e2d0fa upstream. Use the kvfree() in the RCU read critical section can trigger the following warnings: EXT4-fs (vdb): unmounting filesystem cd983e5b-3c83-4f5a-a136-17b00eb9d018. WARNING: suspicious RCU usage ./include/linux/rcupdate.h:409 Illegal context switch in RCU read-side critical section! other info that might help us debug this: rcu_scheduler_active = 2, debug_locks = 1 Call Trace: <TASK> dump_stack_lvl+0xbb/0xd0 dump_stack+0x14/0x20 lockdep_rcu_suspicious+0x15a/0x1b0 __might_resched+0x375/0x4d0 ? put_object.part.0+0x2c/0x50 __might_sleep+0x108/0x160 vfree+0x58/0x910 ? ext4_group_desc_free+0x27/0x270 kvfree+0x23/0x40 ext4_group_desc_free+0x111/0x270 ext4_put_super+0x3c8/0xd40 generic_shutdown_super+0x14c/0x4a0 ? __pfx_shrinker_free+0x10/0x10 kill_block_super+0x40/0x90 ext4_kill_sb+0x6d/0xb0 deactivate_locked_super+0xb4/0x180 deactivate_super+0x7e/0xa0 cleanup_mnt+0x296/0x3e0 __cleanup_mnt+0x16/0x20 task_work_run+0x157/0x250 ? __pfx_task_work_run+0x10/0x10 ? exit_to_user_mode_loop+0x6a/0x550 exit_to_user_mode_loop+0x102/0x550 do_syscall_64+0x44a/0x500 entry_SYSCALL_64_after_hwframe+0x77/0x7f </TASK> BUG: sleeping function called from invalid context at mm/vmalloc.c:3441 in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 556, name: umount preempt_count: 1, expected: 0 CPU: 3 UID: 0 PID: 556 Comm: umount Call Trace: <TASK> dump_stack_lvl+0xbb/0xd0 dump_stack+0x14/0x20 __might_resched+0x275/0x4d0 ? put_object.part.0+0x2c/0x50 __might_sleep+0x108/0x160 vfree+0x58/0x910 ? ext4_group_desc_free+0x27/0x270 kvfree+0x23/0x40 ext4_group_desc_free+0x111/0x270 ext4_put_super+0x3c8/0xd40 generic_shutdown_super+0x14c/0x4a0 ? __pfx_shrinker_free+0x10/0x10 kill_block_super+0x40/0x90 ext4_kill_sb+0x6d/0xb0 deactivate_locked_super+0xb4/0x180 deactivate_super+0x7e/0xa0 cleanup_mnt+0x296/0x3e0 __cleanup_mnt+0x16/0x20 task_work_run+0x157/0x250 ? __pfx_task_work_run+0x10/0x10 ? exit_to_user_mode_loop+0x6a/0x550 exit_to_user_mode_loop+0x102/0x550 do_syscall_64+0x44a/0x500 entry_SYSCALL_64_after_hwframe+0x77/0x7f The above scenarios occur in initialization failures and teardown paths, there are no parallel operations on the resources released by kvfree(), this commit therefore remove rcu_read_lock/unlock() and use rcu_access_pointer() instead of rcu_dereference() operations. Fixes: 7c990728b99e ("ext4: fix potential race between s_flex_groups online resizing and access") Fixes: df3da4ea5a0f ("ext4: fix potential race between s_group_info online resizing and access") Signed-off-by: Zqiang <qiang.zhang@linux.dev> Reviewed-by: Baokun Li <libaokun@linux.alibaba.com> Link: https://patch.msgid.link/20260319094545.19291-1-qiang.zhang@linux.dev Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2 daysext4: fix use-after-free in update_super_work when racing with umountJiayuan Chen3-1/+11
commit d15e4b0a418537aafa56b2cb80d44add83e83697 upstream. Commit b98535d09179 ("ext4: fix bug_on in start_this_handle during umount filesystem") moved ext4_unregister_sysfs() before flushing s_sb_upd_work to prevent new error work from being queued via /proc/fs/ext4/xx/mb_groups reads during unmount. However, this introduced a use-after-free because update_super_work calls ext4_notify_error_sysfs() -> sysfs_notify() which accesses the kobject's kernfs_node after it has been freed by kobject_del() in ext4_unregister_sysfs(): update_super_work ext4_put_super ----------------- -------------- ext4_unregister_sysfs(sb) kobject_del(&sbi->s_kobj) __kobject_del() sysfs_remove_dir() kobj->sd = NULL sysfs_put(sd) kernfs_put() // RCU free ext4_notify_error_sysfs(sbi) sysfs_notify(&sbi->s_kobj) kn = kobj->sd // stale pointer kernfs_get(kn) // UAF on freed kernfs_node ext4_journal_destroy() flush_work(&sbi->s_sb_upd_work) Instead of reordering the teardown sequence, fix this by making ext4_notify_error_sysfs() detect that sysfs has already been torn down by checking s_kobj.state_in_sysfs, and skipping the sysfs_notify() call in that case. A dedicated mutex (s_error_notify_mutex) serializes ext4_notify_error_sysfs() against kobject_del() in ext4_unregister_sysfs() to prevent TOCTOU races where the kobject could be deleted between the state_in_sysfs check and the sysfs_notify() call. Fixes: b98535d09179 ("ext4: fix bug_on in start_this_handle during umount filesystem") Cc: Jiayuan Chen <jiayuan.chen@linux.dev> Suggested-by: Jan Kara <jack@suse.cz> Signed-off-by: Jiayuan Chen <jiayuan.chen@shopee.com> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20260319120336.157873-1-jiayuan.chen@linux.dev Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2 daysext4: reject mount if bigalloc with s_first_data_block != 0Helen Koike1-0/+7
commit 3822743dc20386d9897e999dbb990befa3a5b3f8 upstream. bigalloc with s_first_data_block != 0 is not supported, reject mounting it. Signed-off-by: Helen Koike <koike@igalia.com> Suggested-by: Theodore Ts'o <tytso@mit.edu> Reported-by: syzbot+b73703b873a33d8eb8f6@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=b73703b873a33d8eb8f6 Link: https://patch.msgid.link/20260317142325.135074-1-koike@igalia.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2 daysext4: avoid allocate block from corrupted group in ext4_mb_find_by_goal()Ye Bin1-1/+5
commit 46066e3a06647c5b186cc6334409722622d05c44 upstream. There's issue as follows: ... EXT4-fs (mmcblk0p1): Delayed block allocation failed for inode 206 at logical offset 0 with max blocks 1 with error 117 EXT4-fs (mmcblk0p1): This should not happen!! Data will be lost EXT4-fs (mmcblk0p1): Delayed block allocation failed for inode 206 at logical offset 0 with max blocks 1 with error 117 EXT4-fs (mmcblk0p1): This should not happen!! Data will be lost EXT4-fs (mmcblk0p1): Delayed block allocation failed for inode 206 at logical offset 0 with max blocks 1 with error 117 EXT4-fs (mmcblk0p1): This should not happen!! Data will be lost EXT4-fs (mmcblk0p1): Delayed block allocation failed for inode 206 at logical offset 0 with max blocks 1 with error 117 EXT4-fs (mmcblk0p1): This should not happen!! Data will be lost EXT4-fs (mmcblk0p1): Delayed block allocation failed for inode 2243 at logical offset 0 with max blocks 1 with error 117 EXT4-fs (mmcblk0p1): This should not happen!! Data will be lost EXT4-fs (mmcblk0p1): Delayed block allocation failed for inode 2239 at logical offset 0 with max blocks 1 with error 117 EXT4-fs (mmcblk0p1): This should not happen!! Data will be lost EXT4-fs (mmcblk0p1): error count since last fsck: 1 EXT4-fs (mmcblk0p1): initial error at time 1765597433: ext4_mb_generate_buddy:760 EXT4-fs (mmcblk0p1): last error at time 1765597433: ext4_mb_generate_buddy:760 ... According to the log analysis, blocks are always requested from the corrupted block group. This may happen as follows: ext4_mb_find_by_goal ext4_mb_load_buddy ext4_mb_load_buddy_gfp ext4_mb_init_cache ext4_read_block_bitmap_nowait ext4_wait_block_bitmap ext4_validate_block_bitmap if (!grp || EXT4_MB_GRP_BBITMAP_CORRUPT(grp)) return -EFSCORRUPTED; // There's no logs. if (err) return err; // Will return error ext4_lock_group(ac->ac_sb, group); if (unlikely(EXT4_MB_GRP_BBITMAP_CORRUPT(e4b->bd_info))) // Unreachable goto out; After commit 9008a58e5dce ("ext4: make the bitmap read routines return real error codes") merged, Commit 163a203ddb36 ("ext4: mark block group as corrupt on block bitmap error") is no real solution for allocating blocks from corrupted block groups. This is because if 'EXT4_MB_GRP_BBITMAP_CORRUPT(e4b->bd_info)' is true, then 'ext4_mb_load_buddy()' may return an error. This means that the block allocation will fail. Therefore, check block group if corrupted when ext4_mb_load_buddy() returns error. Fixes: 163a203ddb36 ("ext4: mark block group as corrupt on block bitmap error") Fixes: 9008a58e5dce ("ext4: make the bitmap read routines return real error codes") Signed-off-by: Ye Bin <yebin10@huawei.com> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Andreas Dilger <adilger@dilger.ca> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20260302134619.3145520-1-yebin@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2 daysext4: avoid infinite loops caused by residual dataEdward Adam Davis1-2/+6
commit 5422fe71d26d42af6c454ca9527faaad4e677d6c upstream. On the mkdir/mknod path, when mapping logical blocks to physical blocks, if inserting a new extent into the extent tree fails (in this example, because the file system disabled the huge file feature when marking the inode as dirty), ext4_ext_map_blocks() only calls ext4_free_blocks() to reclaim the physical block without deleting the corresponding data in the extent tree. This causes subsequent mkdir operations to reference the previously reclaimed physical block number again, even though this physical block is already being used by the xattr block. Therefore, a situation arises where both the directory and xattr are using the same buffer head block in memory simultaneously. The above causes ext4_xattr_block_set() to enter an infinite loop about "inserted" and cannot release the inode lock, ultimately leading to the 143s blocking problem mentioned in [1]. If the metadata is corrupted, then trying to remove some extent space can do even more harm. Also in case EXT4_GET_BLOCKS_DELALLOC_RESERVE was passed, remove space wrongly update quota information. Jan Kara suggests distinguishing between two cases: 1) The error is ENOSPC or EDQUOT - in this case the filesystem is fully consistent and we must maintain its consistency including all the accounting. However these errors can happen only early before we've inserted the extent into the extent tree. So current code works correctly for this case. 2) Some other error - this means metadata is corrupted. We should strive to do as few modifications as possible to limit damage. So I'd just skip freeing of allocated blocks. [1] INFO: task syz.0.17:5995 blocked for more than 143 seconds. Call Trace: inode_lock_nested include/linux/fs.h:1073 [inline] __start_dirop fs/namei.c:2923 [inline] start_dirop fs/namei.c:2934 [inline] Reported-by: syzbot+512459401510e2a9a39f@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=1659aaaaa8d9d11265d7 Tested-by: syzbot+1659aaaaa8d9d11265d7@syzkaller.appspotmail.com Reported-by: syzbot+1659aaaaa8d9d11265d7@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=512459401510e2a9a39f Tested-by: syzbot+1659aaaaa8d9d11265d7@syzkaller.appspotmail.com Signed-off-by: Edward Adam Davis <eadavis@qq.com> Reviewed-by: Jan Kara <jack@suse.cz> Tested-by: syzbot+512459401510e2a9a39f@syzkaller.appspotmail.com Link: https://patch.msgid.link/tencent_43696283A68450B761D76866C6F360E36705@qq.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2 daysext4: validate p_idx bounds in ext4_ext_correct_indexesTejas Bharambe1-0/+15
commit 2acb5c12ebd860f30e4faf67e6cc8c44ddfe5fe8 upstream. ext4_ext_correct_indexes() walks up the extent tree correcting index entries when the first extent in a leaf is modified. Before accessing path[k].p_idx->ei_block, there is no validation that p_idx falls within the valid range of index entries for that level. If the on-disk extent header contains a corrupted or crafted eh_entries value, p_idx can point past the end of the allocated buffer, causing a slab-out-of-bounds read. Fix this by validating path[k].p_idx against EXT_LAST_INDEX() at both access sites: before the while loop and inside it. Return -EFSCORRUPTED if the index pointer is out of range, consistent with how other bounds violations are handled in the ext4 extent tree code. Reported-by: syzbot+04c4e65cab786a2e5b7e@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=04c4e65cab786a2e5b7e Signed-off-by: Tejas Bharambe <tejas.bharambe@outlook.com> Link: https://patch.msgid.link/JH0PR06MB66326016F9B6AD24097D232B897CA@JH0PR06MB6632.apcprd06.prod.outlook.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2 daysext4: test if inode's all dirty pages are submitted to diskYe Bin1-0/+8
commit 73bf12adbea10b13647864cd1c62410d19e21086 upstream. The commit aa373cf55099 ("writeback: stop background/kupdate works from livelocking other works") introduced an issue where unmounting a filesystem in a multi-logical-partition scenario could lead to batch file data loss. This problem was not fixed until the commit d92109891f21 ("fs/writeback: bail out if there is no more inodes for IO and queued once"). It took considerable time to identify the root cause. Additionally, in actual production environments, we frequently encountered file data loss after normal system reboots. Therefore, we are adding a check in the inode release flow to verify whether all dirty pages have been flushed to disk, in order to determine whether the data loss is caused by a logic issue in the filesystem code. Signed-off-by: Ye Bin <yebin10@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20260303012242.3206465-1-yebin@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2 daysext4: publish jinode after initializationLi Chen2-6/+13
commit 1aec30021edd410b986c156f195f3d23959a9d11 upstream. ext4_inode_attach_jinode() publishes ei->jinode to concurrent users. It used to set ei->jinode before jbd2_journal_init_jbd_inode(), allowing a reader to observe a non-NULL jinode with i_vfs_inode still unset. The fast commit flush path can then pass this jinode to jbd2_wait_inode_data(), which dereferences i_vfs_inode->i_mapping and may crash. Below is the crash I observe: ``` BUG: unable to handle page fault for address: 000000010beb47f4 PGD 110e51067 P4D 110e51067 PUD 0 Oops: Oops: 0000 [#1] SMP NOPTI CPU: 1 UID: 0 PID: 4850 Comm: fc_fsync_bench_ Not tainted 6.18.0-00764-g795a690c06a5 #1 PREEMPT(voluntary) Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.17.0-2-2 04/01/2014 RIP: 0010:xas_find_marked+0x3d/0x2e0 Code: e0 03 48 83 f8 02 0f 84 f0 01 00 00 48 8b 47 08 48 89 c3 48 39 c6 0f 82 fd 01 00 00 48 85 c9 74 3d 48 83 f9 03 77 63 4c 8b 0f <49> 8b 71 08 48 c7 47 18 00 00 00 00 48 89 f1 83 e1 03 48 83 f9 02 RSP: 0018:ffffbbee806e7bf0 EFLAGS: 00010246 RAX: 000000000010beb4 RBX: 000000000010beb4 RCX: 0000000000000003 RDX: 0000000000000001 RSI: 0000002000300000 RDI: ffffbbee806e7c10 RBP: 0000000000000001 R08: 0000002000300000 R09: 000000010beb47ec R10: ffff9ea494590090 R11: 0000000000000000 R12: 0000002000300000 R13: ffffbbee806e7c90 R14: ffff9ea494513788 R15: ffffbbee806e7c88 FS: 00007fc2f9e3e6c0(0000) GS:ffff9ea6b1444000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000000010beb47f4 CR3: 0000000119ac5000 CR4: 0000000000750ef0 PKRU: 55555554 Call Trace: <TASK> filemap_get_folios_tag+0x87/0x2a0 __filemap_fdatawait_range+0x5f/0xd0 ? srso_alias_return_thunk+0x5/0xfbef5 ? __schedule+0x3e7/0x10c0 ? srso_alias_return_thunk+0x5/0xfbef5 ? srso_alias_return_thunk+0x5/0xfbef5 ? srso_alias_return_thunk+0x5/0xfbef5 ? preempt_count_sub+0x5f/0x80 ? srso_alias_return_thunk+0x5/0xfbef5 ? cap_safe_nice+0x37/0x70 ? srso_alias_return_thunk+0x5/0xfbef5 ? preempt_count_sub+0x5f/0x80 ? srso_alias_return_thunk+0x5/0xfbef5 filemap_fdatawait_range_keep_errors+0x12/0x40 ext4_fc_commit+0x697/0x8b0 ? ext4_file_write_iter+0x64b/0x950 ? srso_alias_return_thunk+0x5/0xfbef5 ? preempt_count_sub+0x5f/0x80 ? srso_alias_return_thunk+0x5/0xfbef5 ? vfs_write+0x356/0x480 ? srso_alias_return_thunk+0x5/0xfbef5 ? preempt_count_sub+0x5f/0x80 ext4_sync_file+0xf7/0x370 do_fsync+0x3b/0x80 ? syscall_trace_enter+0x108/0x1d0 __x64_sys_fdatasync+0x16/0x20 do_syscall_64+0x62/0x2c0 entry_SYSCALL_64_after_hwframe+0x76/0x7e ... ``` Fix this by initializing the jbd2_inode first. Use smp_wmb() and WRITE_ONCE() to publish ei->jinode after initialization. Readers use READ_ONCE() to fetch the pointer. Fixes: a361293f5fede ("jbd2: Fix oops in jbd2_journal_file_inode()") Cc: stable@vger.kernel.org Signed-off-by: Li Chen <me@linux.beauty> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20260225082617.147957-1-me@linux.beauty Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2 daysext4: replace BUG_ON with proper error handling in ext4_read_inline_folioYuto Ohnuki1-1/+9
commit 356227096eb66e41b23caf7045e6304877322edf upstream. Replace BUG_ON() with proper error handling when inline data size exceeds PAGE_SIZE. This prevents kernel panic and allows the system to continue running while properly reporting the filesystem corruption. The error is logged via ext4_error_inode(), the buffer head is released to prevent memory leak, and -EFSCORRUPTED is returned to indicate filesystem corruption. Signed-off-by: Yuto Ohnuki <ytohnuki@amazon.com> Link: https://patch.msgid.link/20260223123345.14838-2-ytohnuki@amazon.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2 daysext4: make recently_deleted() properly work with lazy itable initializationJan Kara1-0/+6
commit bd060afa7cc3e0ad30afa9ecc544a78638498555 upstream. recently_deleted() checks whether inode has been used in the near past. However this can give false positive result when inode table is not initialized yet and we are in fact comparing to random garbage (or stale itable block of a filesystem before mkfs). Ultimately this results in uninitialized inodes being skipped during inode allocation and possibly they are never initialized and thus e2fsck complains. Verify if the inode has been initialized before checking for dtime. Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Link: https://patch.msgid.link/20260216164848.3074-3-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2 daysext4: fix fsync(2) for nojournal modeJan Kara1-2/+14
commit 1308255bbf8452762f89f44f7447ce137ecdbcff upstream. When inode metadata is changed, we sometimes just call ext4_mark_inode_dirty() to track modified metadata. This copies inode metadata into block buffer which is enough when we are journalling metadata. However when we are running in nojournal mode we currently fail to write the dirtied inode buffer during fsync(2) because the inode is not marked as dirty. Use explicit ext4_write_inode() call to make sure the inode table buffer is written to the disk. This is a band aid solution but proper solution requires a much larger rewrite including changes in metadata bh tracking infrastructure. Reported-by: Free Ekanayaka <free.ekanayaka@gmail.com> Link: https://lore.kernel.org/all/87il8nhxdm.fsf@x1.mail-host-address-is-not-set/ CC: stable@vger.kernel.org Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Link: https://patch.msgid.link/20260216164848.3074-4-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2 daysext4: do not check fast symlink during orphan recoveryZhang Yi1-11/+29
commit 84e21e3fb8fd99ea460eb7274584750d11cf3e9f upstream. Commit '5f920d5d6083 ("ext4: verify fast symlink length")' causes the generic/475 test to fail during orphan cleanup of zero-length symlinks. generic/475 84s ... _check_generic_filesystem: filesystem on /dev/vde is inconsistent The fsck reports are provided below: Deleted inode 9686 has zero dtime. Deleted inode 158230 has zero dtime. ... Inode bitmap differences: -9686 -158230 Orphan file (inode 12) block 13 is not clean. Failed to initialize orphan file. In ext4_symlink(), a newly created symlink can be added to the orphan list due to ENOSPC. Its data has not been initialized, and its size is zero. Therefore, we need to disregard the length check of the symbolic link when cleaning up orphan inodes. Instead, we should ensure that the nlink count is zero. Fixes: 5f920d5d6083 ("ext4: verify fast symlink length") Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20260131091156.1733648-1-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2 daysext4: fix stale xarray tags after writebackJan Kara1-2/+8
commit f4a2b42e78914ff15630e71289adc589c3a8eb45 upstream. There are cases where ext4_bio_write_page() gets called for a page which has no buffers to submit. This happens e.g. when the part of the file is actually a hole, when we cannot allocate blocks due to being called from jbd2, or in data=journal mode when checkpointing writes the buffers earlier. In these cases we just return from ext4_bio_write_page() however if the page didn't need redirtying, we will leave stale DIRTY and/or TOWRITE tags in xarray because those get cleared only in __folio_start_writeback(). As a result we can leave these tags set in mappings even after a final sync on filesystem that's getting remounted read-only or that's being frozen. Various assertions can then get upset when writeback is started on such filesystems (Gerald reported assertion in ext4_journal_check_start() firing). Fix the problem by cycling the page through writeback state even if we decide nothing needs to be written for it so that xarray tags get properly updated. This is slightly silly (we could update the xarray tags directly) but I don't think a special helper messing with xarray tags is really worth it in this relatively rare corner case. Reported-by: Gerald Yang <gerald.yang@canonical.com> Link: https://lore.kernel.org/all/20260128074515.2028982-1-gerald.yang@canonical.com Fixes: dff4ac75eeee ("ext4: move keep_towrite handling to ext4_bio_write_page()") Signed-off-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20260205092223.21287-2-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2 daysext4: convert inline data to extents when truncate exceeds inline sizeDeepanshu Kartikey1-0/+12
commit ed9356a30e59c7cc3198e7fc46cfedf3767b9b17 upstream. Add a check in ext4_setattr() to convert files from inline data storage to extent-based storage when truncate() grows the file size beyond the inline capacity. This prevents the filesystem from entering an inconsistent state where the inline data flag is set but the file size exceeds what can be stored inline. Without this fix, the following sequence causes a kernel BUG_ON(): 1. Mount filesystem with inode that has inline flag set and small size 2. truncate(file, 50MB) - grows size but inline flag remains set 3. sendfile() attempts to write data 4. ext4_write_inline_data() hits BUG_ON(write_size > inline_capacity) The crash occurs because ext4_write_inline_data() expects inline storage to accommodate the write, but the actual inline capacity (~60 bytes for i_block + ~96 bytes for xattrs) is far smaller than the file size and write request. The fix checks if the new size from setattr exceeds the inode's actual inline capacity (EXT4_I(inode)->i_inline_size) and converts the file to extent-based storage before proceeding with the size change. This addresses the root cause by ensuring the inline data flag and file size remain consistent during truncate operations. Reported-by: syzbot+7de5fe447862fc37576f@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=7de5fe447862fc37576f Tested-by: syzbot+7de5fe447862fc37576f@syzkaller.appspotmail.com Signed-off-by: Deepanshu Kartikey <Kartikey406@gmail.com> Link: https://patch.msgid.link/20260207043607.1175976-1-kartikey406@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2 daysext4: fix journal credit check when setting fscrypt contextSimon Weber1-1/+8
commit b1d682f1990c19fb1d5b97d13266210457092bcd upstream. Fix an issue arising when ext4 features has_journal, ea_inode, and encrypt are activated simultaneously, leading to ENOSPC when creating an encrypted file. Fix by passing XATTR_CREATE flag to xattr_set_handle function if a handle is specified, i.e., when the function is called in the control flow of creating a new inode. This aligns the number of jbd2 credits set_handle checks for with the number allocated for creating a new inode. ext4_set_context must not be called with a non-null handle (fs_data) if fscrypt context xattr is not guaranteed to not exist yet. The only other usage of this function currently is when handling the ioctl FS_IOC_SET_ENCRYPTION_POLICY, which calls it with fs_data=NULL. Fixes: c1a5d5f6ab21eb7e ("ext4: improve journal credit handling in set xattr paths") Co-developed-by: Anthony Durrer <anthonydev@fastmail.com> Signed-off-by: Anthony Durrer <anthonydev@fastmail.com> Signed-off-by: Simon Weber <simon.weber.39@gmail.com> Reviewed-by: Eric Biggers <ebiggers@kernel.org> Link: https://patch.msgid.link/20260207100148.724275-4-simon.weber.39@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2 daysxfs: remove file_path tracepoint dataDarrick J. Wong2-19/+4
commit e31c53a8060e134111ed095783fee0aa0c43b080 upstream. The xfile/xmbuf shmem file descriptions are no longer as detailed as they were when online fsck was first merged, because moving to static strings in commit 60382993a2e180 ("xfs: get rid of the xchk_xfile_*_descr calls") removed a memory allocation and hence a source of failure. However this makes encoding the description in the tracepoints sort of a waste of memory. David Laight also points out that file_path doesn't zero the whole buffer which causes exposure of stale trace bytes, and Steven Rostedt wonders why we're not using a dynamic array for the file path. I don't think this is worth fixing, so let's just rip it out. Cc: rostedt@goodmis.org Cc: david.laight.linux@gmail.com Link: https://lore.kernel.org/linux-xfs/20260323172204.work.979-kees@kernel.org/ Cc: stable@vger.kernel.org # v6.11 Fixes: 19ebc8f84ea12e ("xfs: fix file_path handling in tracepoints") Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Carlos Maiolino <cem@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2 daysxfs: don't irele after failing to iget in xfs_attri_recover_workDarrick J. Wong1-1/+0
commit 70685c291ef82269180758130394ecdc4496b52c upstream. xlog_recovery_iget* never set @ip to a valid pointer if they return an error, so this irele will walk off a dangling pointer. Fix that. Cc: stable@vger.kernel.org # v6.10 Fixes: ae673f534a3097 ("xfs: record inode generation in xattr update log intent items") Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Long Li <leo.lilong@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2 daysxfs: fix ri_total validation in xlog_recover_attri_commit_pass2Long Li1-2/+2
commit d72f2084e30966097c8eae762e31986a33c3c0ae upstream. The ri_total checks for SET/REPLACE operations are hardcoded to 3, but xfs_attri_item_size() only emits a value iovec when value_len > 0, so ri_total is 2 when value_len == 0. For PPTR_SET/PPTR_REMOVE/PPTR_REPLACE, value_len is validated by xfs_attri_validate() to be exactly sizeof(struct xfs_parent_rec) and is never zero, so their hardcoded checks remain correct. This problem may cause log recovery failures. The following script can be used to reproduce the problem: #!/bin/bash mkfs.xfs -f /dev/sda mount /dev/sda /mnt/test/ touch /mnt/test/file for i in {1..200}; do attr -s "user.attr_$i" -V "value_$i" /mnt/test/file > /dev/null done echo 1 > /sys/fs/xfs/debug/larp echo 1 > /sys/fs/xfs/sda/errortag/larp attr -s "user.zero" -V "" /mnt/test/file echo 0 > /sys/fs/xfs/sda/errortag/larp umount /mnt/test mount /dev/sda /mnt/test/ # mount failed Fix this by deriving the expected count dynamically as "2 + !!value_len" for SET/REPLACE operations. Cc: stable@vger.kernel.org # v6.9 Fixes: ad206ae50eca ("xfs: check opcode and iovec count match in xlog_recover_attri_commit_pass2") Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Long Li <leo.lilong@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Carlos Maiolino <cem@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2 daysxfs: scrub: unlock dquot before early return in quota scrubhongao1-1/+3
commit 268378b6ad20569af0d1957992de1c8b16c6e900 upstream. xchk_quota_item can return early after calling xchk_fblock_process_error. When that helper returns false, the function returned immediately without dropping dq->q_qlock, which can leave the dquot lock held and risk lock leaks or deadlocks in later quota operations. Fix this by unlocking dq->q_qlock before the early return. Signed-off-by: hongao <hongao@uniontech.com> Fixes: 7d1f0e167a067e ("xfs: check the ondisk space mapping behind a dquot") Cc: <stable@vger.kernel.org> # v6.8 Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2 daysxfs: avoid dereferencing log items after push callbacksYuto Ohnuki2-11/+51
commit 79ef34ec0554ec04bdbafafbc9836423734e1bd6 upstream. After xfsaild_push_item() calls iop_push(), the log item may have been freed if the AIL lock was dropped during the push. Background inode reclaim or the dquot shrinker can free the log item while the AIL lock is not held, and the tracepoints in the switch statement dereference the log item after iop_push() returns. Fix this by capturing the log item type, flags, and LSN before calling xfsaild_push_item(), and introducing a new xfs_ail_push_class trace event class that takes these pre-captured values and the ailp pointer instead of the log item pointer. Reported-by: syzbot+652af2b3c5569c4ab63c@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=652af2b3c5569c4ab63c Fixes: 90c60e164012 ("xfs: xfs_iflush() is no longer necessary") Cc: stable@vger.kernel.org # v5.9 Signed-off-by: Yuto Ohnuki <ytohnuki@amazon.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2 daysxfs: save ailp before dropping the AIL lock in push callbacksYuto Ohnuki2-4/+14
commit 394d70b86fae9fe865e7e6d9540b7696f73aa9b6 upstream. In xfs_inode_item_push() and xfs_qm_dquot_logitem_push(), the AIL lock is dropped to perform buffer IO. Once the cluster buffer no longer protects the log item from reclaim, the log item may be freed by background reclaim or the dquot shrinker. The subsequent spin_lock() call dereferences lip->li_ailp, which is a use-after-free. Fix this by saving the ailp pointer in a local variable while the AIL lock is held and the log item is guaranteed to be valid. Reported-by: syzbot+652af2b3c5569c4ab63c@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=652af2b3c5569c4ab63c Fixes: 90c60e164012 ("xfs: xfs_iflush() is no longer necessary") Cc: stable@vger.kernel.org # v5.9 Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Yuto Ohnuki <ytohnuki@amazon.com> Signed-off-by: Carlos Maiolino <cem@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2 daysxfs: stop reclaim before pushing AIL during unmountYuto Ohnuki1-3/+4
commit 4f24a767e3d64a5f58c595b5c29b6063a201f1e3 upstream. The unmount sequence in xfs_unmount_flush_inodes() pushed the AIL while background reclaim and inodegc are still running. This is broken independently of any use-after-free issues - background reclaim and inodegc should not be running while the AIL is being pushed during unmount, as inodegc can dirty and insert inodes into the AIL during the flush, and background reclaim can race to abort and free dirty inodes. Reorder xfs_unmount_flush_inodes() to stop inodegc and cancel background reclaim before pushing the AIL. Stop inodegc before cancelling m_reclaim_work because the inodegc worker can re-queue m_reclaim_work via xfs_inodegc_set_reclaimable. Reported-by: syzbot+652af2b3c5569c4ab63c@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=652af2b3c5569c4ab63c Fixes: 90c60e164012 ("xfs: xfs_iflush() is no longer necessary") Cc: stable@vger.kernel.org # v5.9 Signed-off-by: Yuto Ohnuki <ytohnuki@amazon.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2 daysiomap: fix invalid folio access when i_blkbits differs from I/O granularityJoanne Koong1-5/+10
commit bd71fb3fea9945987053968f028a948997cba8cc upstream. Commit aa35dd5cbc06 ("iomap: fix invalid folio access after folio_end_read()") partially addressed invalid folio access for folios without an ifs attached, but it did not handle the case where 1 << inode->i_blkbits matches the folio size but is different from the granularity used for the IO, which means IO can be submitted for less than the full folio for the !ifs case. In this case, the condition: if (*bytes_submitted == folio_len) ctx->cur_folio = NULL; in iomap_read_folio_iter() will not invalidate ctx->cur_folio, and iomap_read_end() will still be called on the folio even though the IO helper owns it and will finish the read on it. Fix this by unconditionally invalidating ctx->cur_folio for the !ifs case. Reported-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/linux-fsdevel/b3dfe271-4e3d-4922-b618-e73731242bca@wdc.com/ Fixes: b2f35ac4146d ("iomap: add caller-provided callbacks for read and readahead") Cc: stable@vger.kernel.org Signed-off-by: Joanne Koong <joannelkoong@gmail.com> Link: https://patch.msgid.link/20260317203935.830549-1-joannelkoong@gmail.com Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2 daysjbd2: gracefully abort on checkpointing state corruptionsMilos Nikic1-2/+13
commit bac3190a8e79beff6ed221975e0c9b1b5f2a21da upstream. This patch targets two internal state machine invariants in checkpoint.c residing inside functions that natively return integer error codes. - In jbd2_cleanup_journal_tail(): A blocknr of 0 indicates a severely corrupted journal superblock. Replaced the J_ASSERT with a WARN_ON_ONCE and a graceful journal abort, returning -EFSCORRUPTED. - In jbd2_log_do_checkpoint(): Replaced the J_ASSERT_BH checking for an unexpected buffer_jwrite state. If the warning triggers, we explicitly drop the just-taken get_bh() reference and call __flush_batch() to safely clean up any previously queued buffers in the j_chkpt_bhs array, preventing a memory leak before returning -EFSCORRUPTED. Signed-off-by: Milos Nikic <nikic.milos@gmail.com> Reviewed-by: Andreas Dilger <adilger@dilger.ca> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Baokun Li <libaokun@linux.alibaba.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20260311041548.159424-1-nikic.milos@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2 daysovl: fix wrong detection of 32bit inode numbersAmir Goldstein1-1/+4
commit 53a7c171e9dd833f0a96b545adcb89bd57387239 upstream. The implicit FILEID_INO32_GEN encoder was changed to be explicit, so we need to fix the detection. When mounting overlayfs with upperdir and lowerdir on different ext4 filesystems, the expected kmsg log is: overlayfs: "xino" feature enabled using 32 upper inode bits. But instead, since the regressing commit, the kmsg log was: overlayfs: "xino" feature enabled using 2 upper inode bits. Fixes: e21fc2038c1b9 ("exportfs: make ->encode_fh() a mandatory method for NFS export") Cc: stable@vger.kernel.org # v6.7+ Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2 daysovl: make fsync after metadata copy-up opt-in mount optionFei Lv5-15/+54
commit 1f6ee9be92f8df85a8c9a5a78c20fd39c0c21a95 upstream. Commit 7d6899fb69d25 ("ovl: fsync after metadata copy-up") was done to fix durability of overlayfs copy up on an upper filesystem which does not enforce ordering on storing of metadata changes (e.g. ubifs). In an earlier revision of the regressing commit by Lei Lv, the metadata fsync behavior was opt-in via a new "fsync=strict" mount option. We were hoping that the opt-in mount option could be avoided, so the change was only made to depend on metacopy=off, in the hope of not hurting performance of metadata heavy workloads, which are more likely to be using metacopy=on. This hope was proven wrong by a performance regression report from Google COS workload after upgrade to kernel 6.12. This is an adaptation of Lei's original "fsync=strict" mount option to the existing upstream code. The new mount option is mutually exclusive with the "volatile" mount option, so the latter is now an alias to the "fsync=volatile" mount option. Reported-by: Chenglong Tang <chenglongtang@google.com> Closes: https://lore.kernel.org/linux-unionfs/CAOdxtTadAFH01Vui1FvWfcmQ8jH1O45owTzUcpYbNvBxnLeM7Q@mail.gmail.com/ Link: https://lore.kernel.org/linux-unionfs/CAOQ4uxgKC1SgjMWre=fUb00v8rxtd6sQi-S+dxR8oDzAuiGu8g@mail.gmail.com/ Fixes: 7d6899fb69d25 ("ovl: fsync after metadata copy-up") Depends: 50e638beb67e0 ("ovl: Use str_on_off() helper in ovl_show_options()") Cc: stable@vger.kernel.org # v6.12+ Signed-off-by: Fei Lv <feilv@asrmicro.com> Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2 dayswriteback: don't block sync for filesystems with no data integrity guaranteesJoanne Koong3-9/+14
commit 76f9377cd2ab7a9220c25d33940d9ca20d368172 upstream. Add a SB_I_NO_DATA_INTEGRITY superblock flag for filesystems that cannot guarantee data persistence on sync (eg fuse). For superblocks with this flag set, sync kicks off writeback of dirty inodes but does not wait for the flusher threads to complete the writeback. This replaces the per-inode AS_NO_DATA_INTEGRITY mapping flag added in commit f9a49aa302a0 ("fs/writeback: skip AS_NO_DATA_INTEGRITY mappings in wait_sb_inodes()"). The flag belongs at the superblock level because data integrity is a filesystem-wide property, not a per-inode one. Having this flag at the superblock level also allows us to skip having to iterate every dirty inode in wait_sb_inodes() only to skip each inode individually. Prior to this commit, mappings with no data integrity guarantees skipped waiting on writeback completion but still waited on the flusher threads to finish initiating the writeback. Waiting on the flusher threads is unnecessary. This commit kicks off writeback but does not wait on the flusher threads. This change properly addresses a recent report [1] for a suspend-to-RAM hang seen on fuse-overlayfs that was caused by waiting on the flusher threads to finish: Workqueue: pm_fs_sync pm_fs_sync_work_fn Call Trace: <TASK> __schedule+0x457/0x1720 schedule+0x27/0xd0 wb_wait_for_completion+0x97/0xe0 sync_inodes_sb+0xf8/0x2e0 __iterate_supers+0xdc/0x160 ksys_sync+0x43/0xb0 pm_fs_sync_work_fn+0x17/0xa0 process_one_work+0x193/0x350 worker_thread+0x1a1/0x310 kthread+0xfc/0x240 ret_from_fork+0x243/0x280 ret_from_fork_asm+0x1a/0x30 </TASK> On fuse this is problematic because there are paths that may cause the flusher thread to block (eg if systemd freezes the user session cgroups first, which freezes the fuse daemon, before invoking the kernel suspend. The kernel suspend triggers ->write_node() which on fuse issues a synchronous setattr request, which cannot be processed since the daemon is frozen. Or if the daemon is buggy and cannot properly complete writeback, initiating writeback on a dirty folio already under writeback leads to writeback_get_folio() -> folio_prepare_writeback() -> unconditional wait on writeback to finish, which will cause a hang). This commit restores fuse to its prior behavior before tmp folios were removed, where sync was essentially a no-op. [1] https://lore.kernel.org/linux-fsdevel/CAJnrk1a-asuvfrbKXbEwwDSctvemF+6zfhdnuzO65Pt8HsFSRw@mail.gmail.com/T/#m632c4648e9cafc4239299887109ebd880ac6c5c1 Fixes: 0c58a97f919c ("fuse: remove tmp folio for writebacks and internal rb tree") Reported-by: John <therealgraysky@proton.me> Cc: stable@vger.kernel.org Signed-off-by: Joanne Koong <joannelkoong@gmail.com> Link: https://patch.msgid.link/20260320005145.2483161-2-joannelkoong@gmail.com Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: David Hildenbrand (Arm) <david@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2 dayserofs: add GFP_NOIO in the bio completion if neededJiucheng Xu1-0/+3
commit c23df30915f83e7257c8625b690a1cece94142a0 upstream. The bio completion path in the process context (e.g. dm-verity) will directly call into decompression rather than trigger another workqueue context for minimal scheduling latencies, which can then call vm_map_ram() with GFP_KERNEL. Due to insufficient memory, vm_map_ram() may generate memory swapping I/O, which can cause submit_bio_wait to deadlock in some scenarios. Trimmed down the call stack, as follows: f2fs_submit_read_io submit_bio //bio_list is initialized. mmc_blk_mq_recovery z_erofs_endio vm_map_ram __pte_alloc_kernel __alloc_pages_direct_reclaim shrink_folio_list __swap_writepage submit_bio_wait //bio_list is non-NULL, hang!!! Use memalloc_noio_{save,restore}() to wrap up this path. Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Signed-off-by: Jiucheng Xu <jiucheng.xu@amlogic.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2 daysksmbd: do not expire session on binding failureHyunwoo Kim1-2/+8
commit 9bbb19d21ded7d78645506f20d8c44895e3d0fb9 upstream. When a multichannel session binding request fails (e.g. wrong password), the error path unconditionally sets sess->state = SMB2_SESSION_EXPIRED. However, during binding, sess points to the target session looked up via ksmbd_session_lookup_slowpath() -- which belongs to another connection's user. This allows a remote attacker to invalidate any active session by simply sending a binding request with a wrong password (DoS). Fix this by skipping session expiration when the failed request was a binding attempt, since the session does not belong to the current connection. The reference taken by ksmbd_session_lookup_slowpath() is still correctly released via ksmbd_user_session_put(). Cc: stable@vger.kernel.org Signed-off-by: Hyunwoo Kim <imv4bel@gmail.com> Acked-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2 daysksmbd: fix memory leaks and NULL deref in smb2_lock()Werner Kasselman1-9/+18
commit 309b44ed684496ed3f9c5715d10b899338623512 upstream. smb2_lock() has three error handling issues after list_del() detaches smb_lock from lock_list at no_check_cl: 1) If vfs_lock_file() returns an unexpected error in the non-UNLOCK path, goto out leaks smb_lock and its flock because the out: handler only iterates lock_list and rollback_list, neither of which contains the detached smb_lock. 2) If vfs_lock_file() returns -ENOENT in the UNLOCK path, goto out leaks smb_lock and flock for the same reason. The error code returned to the dispatcher is also stale. 3) In the rollback path, smb_flock_init() can return NULL on allocation failure. The result is dereferenced unconditionally, causing a kernel NULL pointer dereference. Add a NULL check to prevent the crash and clean up the bookkeeping; the VFS lock itself cannot be rolled back without the allocation and will be released at file or connection teardown. Fix cases 1 and 2 by hoisting the locks_free_lock()/kfree() to before the if(!rc) check in the UNLOCK branch so all exit paths share one free site, and by freeing smb_lock and flock before goto out in the non-UNLOCK branch. Propagate the correct error code in both cases. Fix case 3 by wrapping the VFS unlock in an if(rlock) guard and adding a NULL check for locks_free_lock(rlock) in the shared cleanup. Found via call-graph analysis using sqry. Fixes: e2f34481b24d ("cifsd: add server-side procedures for SMB3") Cc: stable@vger.kernel.org Suggested-by: ChenXiaoSong <chenxiaosong@kylinos.cn> Signed-off-by: Werner Kasselman <werner@verivus.com> Reviewed-by: ChenXiaoSong <chenxiaosong@kylinos.cn> Acked-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2 daysksmbd: fix potencial OOB in get_file_all_info() for compound requestsNamjae Jeon1-2/+14
commit beef2634f81f1c086208191f7228bce1d366493d upstream. When a compound request consists of QUERY_DIRECTORY + QUERY_INFO (FILE_ALL_INFORMATION) and the first command consumes nearly the entire max_trans_size, get_file_all_info() would blindly call smbConvertToUTF16() with PATH_MAX, causing out-of-bounds write beyond the response buffer. In get_file_all_info(), there was a missing validation check for the client-provided OutputBufferLength before copying the filename into FileName field of the smb2_file_all_info structure. If the filename length exceeds the available buffer space, it could lead to potential buffer overflows or memory corruption during smbConvertToUTF16 conversion. This calculating the actual free buffer size using smb2_calc_max_out_buf_len() and returning -EINVAL if the buffer is insufficient and updating smbConvertToUTF16 to use the actual filename length (clamped by PATH_MAX) to ensure a safe copy operation. Cc: stable@vger.kernel.org Fixes: e2b76ab8b5c9 ("ksmbd: add support for read compound") Reported-by: Asim Viladi Oglu Manizada <manizada@pm.me> Signed-off-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2 daysksmbd: replace hardcoded hdr2_len with offsetof() in smb2_calc_max_out_buf_len()Namjae Jeon1-8/+12
commit 0e55f63dd08f09651d39e1b709a91705a8a0ddcb upstream. After this commit (e2b76ab8b5c9 "ksmbd: add support for read compound"), response buffer management was changed to use dynamic iov array. In the new design, smb2_calc_max_out_buf_len() expects the second argument (hdr2_len) to be the offset of ->Buffer field in the response structure, not a hardcoded magic number. Fix the remaining call sites to use the correct offsetof() value. Cc: stable@vger.kernel.org Fixes: e2b76ab8b5c9 ("ksmbd: add support for read compound") Signed-off-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2 dayserofs: set fileio bio failed in short read caseSheng Yong1-4/+2
[ Upstream commit eade54040384f54b7fb330e4b0975c5734850b3c ] For file-backed mount, IO requests are handled by vfs_iocb_iter_read(). However, it can be interrupted by SIGKILL, returning the number of bytes actually copied. Unused folios in bio are unexpectedly marked as uptodate. vfs_read filemap_read filemap_get_pages filemap_readahead erofs_fileio_readahead erofs_fileio_rq_submit vfs_iocb_iter_read filemap_read filemap_get_pages <= detect signal erofs_fileio_ki_complete <= set all folios uptodate This patch addresses this by setting short read bio with an error directly. Fixes: bc804a8d7e86 ("erofs: handle end of filesystem properly for file-backed mounts") Reported-by: chenguanyou <chenguanyou@xiaomi.com> Signed-off-by: Yunlei He <heyunlei@xiaomi.com> Signed-off-by: Sheng Yong <shengyong1@xiaomi.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2 daysbtrfs: set BTRFS_ROOT_ORPHAN_CLEANUP during subvol createBoris Burkov1-0/+7
[ Upstream commit 5131fa077f9bb386a1b901bf5b247041f0ec8f80 ] We have recently observed a number of subvolumes with broken dentries. ls-ing the parent dir looks like: drwxrwxrwt 1 root root 16 Jan 23 16:49 . drwxr-xr-x 1 root root 24 Jan 23 16:48 .. d????????? ? ? ? ? ? broken_subvol and similarly stat-ing the file fails. In this state, deleting the subvol fails with ENOENT, but attempting to create a new file or subvol over it errors out with EEXIST and even aborts the fs. Which leaves us a bit stuck. dmesg contains a single notable error message reading: "could not do orphan cleanup -2" 2 is ENOENT and the error comes from the failure handling path of btrfs_orphan_cleanup(), with the stack leading back up to btrfs_lookup(). btrfs_lookup btrfs_lookup_dentry btrfs_orphan_cleanup // prints that message and returns -ENOENT After some detailed inspection of the internal state, it became clear that: - there are no orphan items for the subvol - the subvol is otherwise healthy looking, it is not half-deleted or anything, there is no drop progress, etc. - the subvol was created a while ago and does the meaningful first btrfs_orphan_cleanup() call that sets BTRFS_ROOT_ORPHAN_CLEANUP much later. - after btrfs_orphan_cleanup() fails, btrfs_lookup_dentry() returns -ENOENT, which results in a negative dentry for the subvolume via d_splice_alias(NULL, dentry), leading to the observed behavior. The bug can be mitigated by dropping the dentry cache, at which point we can successfully delete the subvolume if we want. i.e., btrfs_lookup() btrfs_lookup_dentry() if (!sb_rdonly(inode->vfs_inode)->vfs_inode) btrfs_orphan_cleanup(sub_root) test_and_set_bit(BTRFS_ROOT_ORPHAN_CLEANUP) btrfs_search_slot() // finds orphan item for inode N ... prints "could not do orphan cleanup -2" if (inode == ERR_PTR(-ENOENT)) inode = NULL; return d_splice_alias(NULL, dentry) // NEGATIVE DENTRY for valid subvolume btrfs_orphan_cleanup() does test_and_set_bit(BTRFS_ROOT_ORPHAN_CLEANUP) on the root when it runs, so it cannot run more than once on a given root, so something else must run concurrently. However, the obvious routes to deleting an orphan when nlinks goes to 0 should not be able to run without first doing a lookup into the subvolume, which should run btrfs_orphan_cleanup() and set the bit. The final important observation is that create_subvol() calls d_instantiate_new() but does not set BTRFS_ROOT_ORPHAN_CLEANUP, so if the dentry cache gets dropped, the next lookup into the subvolume will make a real call into btrfs_orphan_cleanup() for the first time. This opens up the possibility of concurrently deleting the inode/orphan items but most typical evict() paths will be holding a reference on the parent dentry (child dentry holds parent->d_lockref.count via dget in d_alloc(), released in __dentry_kill()) and prevent the parent from being removed from the dentry cache. The one exception is delayed iputs. Ordered extent creation calls igrab() on the inode. If the file is unlinked and closed while those refs are held, iput() in __dentry_kill() decrements i_count but does not trigger eviction (i_count > 0). The child dentry is freed and the subvol dentry's d_lockref.count drops to 0, making it evictable while the inode is still alive. Since there are two races (the race between writeback and unlink and the race between lookup and delayed iputs), and there are too many moving parts, the following three diagrams show the complete picture. (Only the second and third are races) Phase 1: Create Subvol in dentry cache without BTRFS_ROOT_ORPHAN_CLEANUP set btrfs_mksubvol() lookup_one_len() __lookup_slow() d_alloc_parallel() __d_alloc() // d_lockref.count = 1 create_subvol(dentry) // doesn't touch the bit.. d_instantiate_new(dentry, inode) // dentry in cache with d_lockref.count == 1 Phase 2: Create a delayed iput for a file in the subvol but leave the subvol in state where its dentry can be evicted (d_lockref.count == 0) T1 (task) T2 (writeback) T3 (OE workqueue) write() // dirty pages btrfs_writepages() btrfs_run_delalloc_range() cow_file_range() btrfs_alloc_ordered_extent() igrab() // i_count: 1 -> 2 btrfs_unlink_inode() btrfs_orphan_add() close() __fput() dput() finish_dput() __dentry_kill() dentry_unlink_inode() iput() // 2 -> 1 --parent->d_lockref.count // 1 -> 0; evictable finish_ordered_fn() btrfs_finish_ordered_io() btrfs_put_ordered_extent() btrfs_add_delayed_iput() Phase 3: Once the delayed iput is pending and the subvol dentry is evictable, the shrinker can free it, causing the next lookup to go through btrfs_lookup() and call btrfs_orphan_cleanup() for the first time. If the cleaner kthread processes the delayed iput concurrently, the two race: T1 (shrinker) T2 (cleaner kthread) T3 (lookup) super_cache_scan() prune_dcache_sb() __dentry_kill() // subvol dentry freed btrfs_run_delayed_iputs() iput() // i_count -> 0 evict() // sets I_FREEING btrfs_evict_inode() // truncation loop btrfs_lookup() btrfs_lookup_dentry() btrfs_orphan_cleanup() // first call (bit never set) btrfs_iget() // blocks on I_FREEING btrfs_orphan_del() // inode freed // returns -ENOENT btrfs_del_orphan_item() // -ENOENT // "could not do orphan cleanup -2" d_splice_alias(NULL, dentry) // negative dentry for valid subvol The most straightforward fix is to ensure the invariant that a dentry for a subvolume can exist if and only if that subvolume has BTRFS_ROOT_ORPHAN_CLEANUP set on its root (and is known to have no orphans or ran btrfs_orphan_cleanup()). Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
10 daysfs/tests: exec: Remove bad test vectorKees Cook1-3/+0
[ Upstream commit c4192754e836e0ffed95833509b6ada975b74418 ] Drop an unusable test in the bprm stack limits. Reported-by: Guenter Roeck <linux@roeck-us.net> Closes: https://lore.kernel.org/all/a3e9b1c2-40c1-45df-9fa2-14ee6a7b3fe2@roeck-us.net Fixes: 60371f43e56b ("exec: Add KUnit test for bprm_stack_limits()") Signed-off-by: Kees Cook <kees@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
10 daysksmbd: fix use-after-free in durable v2 replay of active file handlesHyunwoo Kim1-1/+5
[ Upstream commit b425e4d0eb321a1116ddbf39636333181675d8f4 ] parse_durable_handle_context() unconditionally assigns dh_info->fp->conn to the current connection when handling a DURABLE_REQ_V2 context with SMB2_FLAGS_REPLAY_OPERATION. ksmbd_lookup_fd_cguid() does not filter by fp->conn, so it returns file handles that are already actively connected. The unconditional overwrite replaces fp->conn, and when the overwriting connection is subsequently freed, __ksmbd_close_fd() dereferences the stale fp->conn via spin_lock(&fp->conn->llist_lock), causing a use-after-free. KASAN report: [ 7.349357] ================================================================== [ 7.349607] BUG: KASAN: slab-use-after-free in _raw_spin_lock+0x75/0xe0 [ 7.349811] Write of size 4 at addr ffff8881056ac18c by task kworker/1:2/108 [ 7.350010] [ 7.350064] CPU: 1 UID: 0 PID: 108 Comm: kworker/1:2 Not tainted 7.0.0-rc3+ #58 PREEMPTLAZY [ 7.350068] Hardware name: QEMU Ubuntu 24.04 PC v2 (i440FX + PIIX, arch_caps fix, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014 [ 7.350070] Workqueue: ksmbd-io handle_ksmbd_work [ 7.350083] Call Trace: [ 7.350087] <TASK> [ 7.350087] dump_stack_lvl+0x64/0x80 [ 7.350094] print_report+0xce/0x660 [ 7.350100] ? __pfx__raw_spin_lock_irqsave+0x10/0x10 [ 7.350101] ? __pfx___mod_timer+0x10/0x10 [ 7.350106] ? _raw_spin_lock+0x75/0xe0 [ 7.350108] kasan_report+0xce/0x100 [ 7.350109] ? _raw_spin_lock+0x75/0xe0 [ 7.350114] kasan_check_range+0x105/0x1b0 [ 7.350116] _raw_spin_lock+0x75/0xe0 [ 7.350118] ? __pfx__raw_spin_lock+0x10/0x10 [ 7.350119] ? __call_rcu_common.constprop.0+0x25e/0x780 [ 7.350125] ? close_id_del_oplock+0x2cc/0x4e0 [ 7.350128] __ksmbd_close_fd+0x27f/0xaf0 [ 7.350131] ksmbd_close_fd+0x135/0x1b0 [ 7.350133] smb2_close+0xb19/0x15b0 [ 7.350142] ? __pfx_smb2_close+0x10/0x10 [ 7.350143] ? xas_load+0x18/0x270 [ 7.350146] ? _raw_spin_lock+0x84/0xe0 [ 7.350148] ? __pfx__raw_spin_lock+0x10/0x10 [ 7.350150] ? _raw_spin_unlock+0xe/0x30 [ 7.350151] ? ksmbd_smb2_check_message+0xeb2/0x24c0 [ 7.350153] ? ksmbd_tree_conn_lookup+0xcd/0xf0 [ 7.350154] handle_ksmbd_work+0x40f/0x1080 [ 7.350156] process_one_work+0x5fa/0xef0 [ 7.350162] ? assign_work+0x122/0x3e0 [ 7.350163] worker_thread+0x54b/0xf70 [ 7.350165] ? __pfx_worker_thread+0x10/0x10 [ 7.350166] kthread+0x346/0x470 [ 7.350170] ? recalc_sigpending+0x19b/0x230 [ 7.350176] ? __pfx_kthread+0x10/0x10 [ 7.350178] ret_from_fork+0x4fb/0x6c0 [ 7.350183] ? __pfx_ret_from_fork+0x10/0x10 [ 7.350185] ? __switch_to+0x36c/0xbe0 [ 7.350188] ? __pfx_kthread+0x10/0x10 [ 7.350190] ret_from_fork_asm+0x1a/0x30 [ 7.350197] </TASK> [ 7.350197] [ 7.355160] Allocated by task 123: [ 7.355261] kasan_save_stack+0x33/0x60 [ 7.355373] kasan_save_track+0x14/0x30 [ 7.355484] __kasan_kmalloc+0x8f/0xa0 [ 7.355593] ksmbd_conn_alloc+0x44/0x6d0 [ 7.355711] ksmbd_kthread_fn+0x243/0xd70 [ 7.355839] kthread+0x346/0x470 [ 7.355942] ret_from_fork+0x4fb/0x6c0 [ 7.356051] ret_from_fork_asm+0x1a/0x30 [ 7.356164] [ 7.356214] Freed by task 134: [ 7.356305] kasan_save_stack+0x33/0x60 [ 7.356416] kasan_save_track+0x14/0x30 [ 7.356527] kasan_save_free_info+0x3b/0x60 [ 7.356646] __kasan_slab_free+0x43/0x70 [ 7.356761] kfree+0x1ca/0x430 [ 7.356862] ksmbd_tcp_disconnect+0x59/0xe0 [ 7.356993] ksmbd_conn_handler_loop+0x77e/0xd40 [ 7.357138] kthread+0x346/0x470 [ 7.357240] ret_from_fork+0x4fb/0x6c0 [ 7.357350] ret_from_fork_asm+0x1a/0x30 [ 7.357463] [ 7.357513] The buggy address belongs to the object at ffff8881056ac000 [ 7.357513] which belongs to the cache kmalloc-1k of size 1024 [ 7.357857] The buggy address is located 396 bytes inside of [ 7.357857] freed 1024-byte region [ffff8881056ac000, ffff8881056ac400) Fix by removing the unconditional fp->conn assignment and rejecting the replay when fp->conn is non-NULL. This is consistent with ksmbd_lookup_durable_fd(), which also rejects file handles with a non-NULL fp->conn. For disconnected file handles (fp->conn == NULL), ksmbd_reopen_durable_fd() handles setting fp->conn. Fixes: c8efcc786146 ("ksmbd: add support for durable handles v1/v2") Signed-off-by: Hyunwoo Kim <imv4bel@gmail.com> Acked-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
10 daysksmbd: fix use-after-free of share_conf in compound requestHyunwoo Kim1-0/+2
[ Upstream commit c33615f995aee80657b9fdfbc4ee7f49c2bd733d ] smb2_get_ksmbd_tcon() reuses work->tcon in compound requests without validating tcon->t_state. ksmbd_tree_conn_lookup() checks t_state == TREE_CONNECTED on the initial lookup path, but the compound reuse path bypasses this check entirely. If a prior command in the compound (SMB2_TREE_DISCONNECT) sets t_state to TREE_DISCONNECTED and frees share_conf via ksmbd_share_config_put(), subsequent commands dereference the freed share_conf through work->tcon->share_conf. KASAN report: [ 4.144653] ================================================================== [ 4.145059] BUG: KASAN: slab-use-after-free in smb2_write+0xc74/0xe70 [ 4.145415] Read of size 4 at addr ffff88810430c194 by task kworker/1:1/44 [ 4.145772] [ 4.145867] CPU: 1 UID: 0 PID: 44 Comm: kworker/1:1 Not tainted 7.0.0-rc3+ #60 PREEMPTLAZY [ 4.145871] Hardware name: QEMU Ubuntu 24.04 PC v2 (i440FX + PIIX, arch_caps fix, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014 [ 4.145875] Workqueue: ksmbd-io handle_ksmbd_work [ 4.145888] Call Trace: [ 4.145892] <TASK> [ 4.145894] dump_stack_lvl+0x64/0x80 [ 4.145910] print_report+0xce/0x660 [ 4.145919] ? __pfx__raw_spin_lock_irqsave+0x10/0x10 [ 4.145928] ? smb2_write+0xc74/0xe70 [ 4.145931] kasan_report+0xce/0x100 [ 4.145934] ? smb2_write+0xc74/0xe70 [ 4.145937] smb2_write+0xc74/0xe70 [ 4.145939] ? __pfx_smb2_write+0x10/0x10 [ 4.145942] ? _raw_spin_unlock+0xe/0x30 [ 4.145945] ? ksmbd_smb2_check_message+0xeb2/0x24c0 [ 4.145948] ? smb2_tree_disconnect+0x31c/0x480 [ 4.145951] handle_ksmbd_work+0x40f/0x1080 [ 4.145953] process_one_work+0x5fa/0xef0 [ 4.145962] ? assign_work+0x122/0x3e0 [ 4.145964] worker_thread+0x54b/0xf70 [ 4.145967] ? __pfx_worker_thread+0x10/0x10 [ 4.145970] kthread+0x346/0x470 [ 4.145976] ? recalc_sigpending+0x19b/0x230 [ 4.145980] ? __pfx_kthread+0x10/0x10 [ 4.145984] ret_from_fork+0x4fb/0x6c0 [ 4.145992] ? __pfx_ret_from_fork+0x10/0x10 [ 4.145995] ? __switch_to+0x36c/0xbe0 [ 4.145999] ? __pfx_kthread+0x10/0x10 [ 4.146003] ret_from_fork_asm+0x1a/0x30 [ 4.146013] </TASK> [ 4.146014] [ 4.149858] Allocated by task 44: [ 4.149953] kasan_save_stack+0x33/0x60 [ 4.150061] kasan_save_track+0x14/0x30 [ 4.150169] __kasan_kmalloc+0x8f/0xa0 [ 4.150274] ksmbd_share_config_get+0x1dd/0xdd0 [ 4.150401] ksmbd_tree_conn_connect+0x7e/0x600 [ 4.150529] smb2_tree_connect+0x2e6/0x1000 [ 4.150645] handle_ksmbd_work+0x40f/0x1080 [ 4.150761] process_one_work+0x5fa/0xef0 [ 4.150873] worker_thread+0x54b/0xf70 [ 4.150978] kthread+0x346/0x470 [ 4.151071] ret_from_fork+0x4fb/0x6c0 [ 4.151176] ret_from_fork_asm+0x1a/0x30 [ 4.151286] [ 4.151332] Freed by task 44: [ 4.151418] kasan_save_stack+0x33/0x60 [ 4.151526] kasan_save_track+0x14/0x30 [ 4.151634] kasan_save_free_info+0x3b/0x60 [ 4.151751] __kasan_slab_free+0x43/0x70 [ 4.151861] kfree+0x1ca/0x430 [ 4.151952] __ksmbd_tree_conn_disconnect+0xc8/0x190 [ 4.152088] smb2_tree_disconnect+0x1cd/0x480 [ 4.152211] handle_ksmbd_work+0x40f/0x1080 [ 4.152326] process_one_work+0x5fa/0xef0 [ 4.152438] worker_thread+0x54b/0xf70 [ 4.152545] kthread+0x346/0x470 [ 4.152638] ret_from_fork+0x4fb/0x6c0 [ 4.152743] ret_from_fork_asm+0x1a/0x30 [ 4.152853] [ 4.152900] The buggy address belongs to the object at ffff88810430c180 [ 4.152900] which belongs to the cache kmalloc-96 of size 96 [ 4.153226] The buggy address is located 20 bytes inside of [ 4.153226] freed 96-byte region [ffff88810430c180, ffff88810430c1e0) [ 4.153549] [ 4.153596] The buggy address belongs to the physical page: [ 4.153750] page: refcount:0 mapcount:0 mapping:0000000000000000 index:0xffff88810430ce80 pfn:0x10430c [ 4.154000] flags: 0x100000000000200(workingset|node=0|zone=2) [ 4.154160] page_type: f5(slab) [ 4.154251] raw: 0100000000000200 ffff888100041280 ffff888100040110 ffff888100040110 [ 4.154461] raw: ffff88810430ce80 0000000800200009 00000000f5000000 0000000000000000 [ 4.154668] page dumped because: kasan: bad access detected [ 4.154820] [ 4.154866] Memory state around the buggy address: [ 4.155002] ffff88810430c080: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc [ 4.155196] ffff88810430c100: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc [ 4.155391] >ffff88810430c180: fa fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc [ 4.155587] ^ [ 4.155693] ffff88810430c200: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc [ 4.155891] ffff88810430c280: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc [ 4.156087] ================================================================== Add the same t_state validation to the compound reuse path, consistent with ksmbd_tree_conn_lookup(). Fixes: 5005bcb42191 ("ksmbd: validate session id and tree id in the compound request") Signed-off-by: Hyunwoo Kim <imv4bel@gmail.com> Acked-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com> Signed-off-by: Sasha Levin <sashal@kernel.org>