summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)AuthorFilesLines
2024-07-13Merge tag '6.10-rc7-smb3-client-fix' of git://git.samba.org/sfrench/cifs-2.6Linus Torvalds1-2/+2
Pull smb client fix from Steve French: "Small fix, also for stable" * tag '6.10-rc7-smb3-client-fix' of git://git.samba.org/sfrench/cifs-2.6: cifs: fix setting SecurityFlags to true
2024-07-13cifs: fix setting SecurityFlags to trueSteve French1-2/+2
If you try to set /proc/fs/cifs/SecurityFlags to 1 it will set them to CIFSSEC_MUST_NTLMV2 which no longer is relevant (the less secure ones like lanman have been removed from cifs.ko) and is also missing some flags (like for signing and encryption) and can even cause mount to fail, so change this to set it to Kerberos in this case. Also change the description of the SecurityFlags to remove mention of flags which are no longer supported. Cc: stable@vger.kernel.org Reviewed-by: Shyam Prasad N <sprasad@microsoft.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2024-07-12Merge tag 'for-6.10-rc7-tag' of ↵Linus Torvalds3-29/+97
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: "Fix a regression in extent map shrinker behaviour. In the past weeks we got reports from users that there are huge latency spikes or freezes. This was bisected to newly added shrinker of extent maps (it was added to fix a build up of the structures in memory). I'm assuming that the freezes would happen to many users after release so I'd like to get it merged now so it's in 6.10. Although the diff size is not small the changes are relatively straightforward, the reporters verified the fixes and we did testing on our side. The fixes: - adjust behaviour under memory pressure and check lock or scheduling conditions, bail out if needed - synchronize tracking of the scanning progress so inode ranges are not skipped or work duplicated - do a delayed iput when scanning a root so evicting an inode does not slow things down in case of lots of dirty data, also fix lockdep warning, a deadlock could happen when writing the dirty data would need to start a transaction" * tag 'for-6.10-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: avoid races when tracking progress for extent map shrinking btrfs: stop extent map shrinker if reschedule is needed btrfs: use delayed iput during extent map shrinking
2024-07-12Merge tag 'bcachefs-2024-07-12' of https://evilpiepirate.org/git/bcachefsLinus Torvalds11-24/+87
Pull more bcachefs fixes from Kent Overstreet: - revert the SLAB_ACCOUNT patch, something crazy is going on in memcg and someone forgot to test - minor fixes: missing rcu_read_lock(), scheduling while atomic (in an emergency shutdown path) - two lockdep fixes; these could have gone earlier, but were left to bake awhile * tag 'bcachefs-2024-07-12' of https://evilpiepirate.org/git/bcachefs: bcachefs: bch2_gc_btree() should not use btree_root_lock bcachefs: Set PF_MEMALLOC_NOFS when trans->locked bcachefs; Use trans_unlock_long() when waiting on allocator Revert "bcachefs: Mark bch_inode_info as SLAB_ACCOUNT" bcachefs: fix scheduling while atomic in break_cycle() bcachefs: Fix RCU splat
2024-07-12bcachefs: bch2_gc_btree() should not use btree_root_lockKent Overstreet1-8/+22
btree_root_lock is for the root keys in btree_root, not the pointers to the nodes themselves; this fixes a lock ordering issue between btree_root_lock and btree node locks. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-12bcachefs: Set PF_MEMALLOC_NOFS when trans->lockedKent Overstreet4-8/+30
proper lock ordering is: fs_reclaim -> btree node locks Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-12bcachefs; Use trans_unlock_long() when waiting on allocatorKent Overstreet1-1/+1
not using unlock_long() blocks key cache reclaim, and the allocator may take awhile Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-12Revert "bcachefs: Mark bch_inode_info as SLAB_ACCOUNT"Kent Overstreet1-2/+1
This reverts commit 86d81ec5f5f05846c7c6e48ffb964b24cba2e669. This wasn't tested with memcg enabled, it immediately hits a null ptr deref in list_lru_add(). Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-11Merge tag 'vfs-6.10-rc8.fixes' of ↵Linus Torvalds25-133/+203
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs fixes from Christian Brauner: "cachefiles: - Export an existing and add a new cachefile helper to be used in filesystems to fix reference count bugs - Use the newly added fscache_ty_get_volume() helper to get a reference count on an fscache_volume to handle volumes that are about to be removed cleanly - After withdrawing a fscache_cache via FSCACHE_CACHE_IS_WITHDRAWN wait for all ongoing cookie lookups to complete and for the object count to reach zero - Propagate errors from vfs_getxattr() to avoid an infinite loop in cachefiles_check_volume_xattr() because it keeps seeing ESTALE - Don't send new requests when an object is dropped by raising CACHEFILES_ONDEMAND_OJBSTATE_DROPPING - Cancel all requests for an object that is about to be dropped - Wait for the ondemand_boject_worker to finish before dropping a cachefiles object to prevent use-after-free - Use cyclic allocation for message ids to better handle id recycling - Add missing lock protection when iterating through the xarray when polling netfs: - Use standard logging helpers for debug logging VFS: - Fix potential use-after-free in file locks during trace_posix_lock_inode(). The tracepoint could fire while another task raced it and freed the lock that was requested to be traced - Only increment the nr_dentry_negative counter for dentries that are present on the superblock LRU. Currently, DCACHE_LRU_LIST list is used to detect this case. However, the flag is also raised in combination with DCACHE_SHRINK_LIST to indicate that dentry->d_lru is used. So checking only DCACHE_LRU_LIST will lead to wrong nr_dentry_negative count. Fix the check to not count dentries that are on a shrink related list Misc: - hfsplus: fix an uninitialized value issue in copy_name - minix: fix minixfs_rename with HIGHMEM. It still uses kunmap() even though we switched it to kmap_local_page() a while ago" * tag 'vfs-6.10-rc8.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: minixfs: Fix minixfs_rename with HIGHMEM hfsplus: fix uninit-value in copy_name vfs: don't mod negative dentry count when on shrinker list filelock: fix potential use-after-free in posix_lock_inode cachefiles: add missing lock protection when polling cachefiles: cyclic allocation of msg_id to avoid reuse cachefiles: wait for ondemand_object_worker to finish when dropping object cachefiles: cancel all requests for the object that is being dropped cachefiles: stop sending new request when dropping object cachefiles: propagate errors from vfs_getxattr() to avoid infinite loop cachefiles: fix slab-use-after-free in cachefiles_withdraw_cookie() cachefiles: fix slab-use-after-free in fscache_withdraw_volume() netfs, fscache: export fscache_put_volume() and add fscache_try_get_volume() netfs: Switch debug logging to pr_debug()
2024-07-11btrfs: avoid races when tracking progress for extent map shrinkingFilipe Manana3-21/+66
We store the progress (root and inode numbers) of the extent map shrinker in fs_info without any synchronization but we can have multiple tasks calling into the shrinker during memory allocations when there's enough memory pressure for example. This can result in a task A reading fs_info->extent_map_shrinker_last_ino after another task B updates it, and task A reading fs_info->extent_map_shrinker_last_root before task B updates it, making task A see an odd state that isn't necessarily harmful but may make it skip certain inode ranges or do more work than necessary by going over the same inodes again. These unprotected accesses would also trigger warnings from tools like KCSAN. So add a lock to protect access to these progress fields. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11btrfs: stop extent map shrinker if reschedule is neededFilipe Manana1-8/+31
The extent map shrinker can be called in a variety of contexts where we are under memory pressure, and of them is when a task is trying to allocate memory. For this reason the shrinker is typically called with a value of struct shrink_control::nr_to_scan that is much smaller than what we return in the nr_cached_objects callback of struct super_operations (fs/btrfs/super.c:btrfs_nr_cached_objects()), so that the shrinker does not take a long time and cause high latencies. However we can still take a lot of time in the shrinker even for a limited amount of nr_to_scan: 1) When traversing the red black tree that tracks open inodes in a root, as for example with millions of open inodes we get a deep tree which takes time searching for an inode; 2) Iterating over the extent map tree, which is a red black tree, of an inode when doing the rb_next() calls and when removing an extent map from the tree, since often that requires rebalancing the red black tree; 3) When trying to write lock an inode's extent map tree we may wait for a significant amount of time, because there's either another task about to do IO and searching for an extent map in the tree or inserting an extent map in the tree, and we can have thousands or even millions of extent maps for an inode. Furthermore, there can be concurrent calls to the shrinker so the lock might be busy simply because there is already another task shrinking extent maps for the same inode; 4) We often reschedule if we need to, which further increases latency. So improve on this by stopping the extent map shrinking code whenever we need to reschedule and make it skip an inode if we can't immediately lock its extent map tree. Reported-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com> Reported-by: Andrea Gelmini <andrea.gelmini@gmail.com> Link: https://lore.kernel.org/linux-btrfs/CABXGCsMmmb36ym8hVNGTiU8yfUS_cGvoUmGCcBrGWq9OxTrs+A@mail.gmail.com/ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11btrfs: use delayed iput during extent map shrinkingFilipe Manana1-1/+1
When putting an inode during extent map shrinking we're doing a standard iput() but that may take a long time in case the inode is dirty and we are doing the final iput that triggers eviction - the VFS will have to wait for writeback before calling the btrfs evict callback (see fs/inode.c:evict()). This slows down the task running the shrinker which may have been triggered while updating some tree for example, meaning locks are held as well as an open transaction handle. Also if the iput() ends up triggering eviction and the inode has no links anymore, then we trigger item truncation which requires flushing delayed items, space reservation to start a transaction and that may trigger the space reclaim task and wait for it, resulting in deadlocks in case the reclaim task needs for example to commit a transaction and the shrinker is being triggered from a path holding a transaction handle. Syzbot reported such a case with the following stack traces: ====================================================== WARNING: possible circular locking dependency detected 6.10.0-rc2-syzkaller-00010-g2ab795141095 #0 Not tainted ------------------------------------------------------ kswapd0/111 is trying to acquire lock: ffff88801eae4610 (sb_internal#3){.+.+}-{0:0}, at: btrfs_commit_inode_delayed_inode+0x110/0x330 fs/btrfs/delayed-inode.c:1275 but task is already holding lock: ffffffff8dd3a9a0 (fs_reclaim){+.+.}-{0:0}, at: balance_pgdat+0xa88/0x1970 mm/vmscan.c:6924 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #3 (fs_reclaim){+.+.}-{0:0}: __fs_reclaim_acquire mm/page_alloc.c:3783 [inline] fs_reclaim_acquire+0x102/0x160 mm/page_alloc.c:3797 might_alloc include/linux/sched/mm.h:334 [inline] slab_pre_alloc_hook mm/slub.c:3890 [inline] slab_alloc_node mm/slub.c:3980 [inline] kmem_cache_alloc_lru_noprof+0x58/0x2f0 mm/slub.c:4019 btrfs_alloc_inode+0x118/0xb20 fs/btrfs/inode.c:8411 alloc_inode+0x5d/0x230 fs/inode.c:261 iget5_locked fs/inode.c:1235 [inline] iget5_locked+0x1c9/0x2c0 fs/inode.c:1228 btrfs_iget_locked fs/btrfs/inode.c:5590 [inline] btrfs_iget_path fs/btrfs/inode.c:5607 [inline] btrfs_iget+0xfb/0x230 fs/btrfs/inode.c:5636 create_reloc_inode+0x403/0x820 fs/btrfs/relocation.c:3911 btrfs_relocate_block_group+0x471/0xe60 fs/btrfs/relocation.c:4114 btrfs_relocate_chunk+0x143/0x450 fs/btrfs/volumes.c:3373 __btrfs_balance fs/btrfs/volumes.c:4157 [inline] btrfs_balance+0x211a/0x3f00 fs/btrfs/volumes.c:4534 btrfs_ioctl_balance fs/btrfs/ioctl.c:3675 [inline] btrfs_ioctl+0x12ed/0x8290 fs/btrfs/ioctl.c:4742 __do_compat_sys_ioctl+0x2c3/0x330 fs/ioctl.c:1007 do_syscall_32_irqs_on arch/x86/entry/common.c:165 [inline] __do_fast_syscall_32+0x73/0x120 arch/x86/entry/common.c:386 do_fast_syscall_32+0x32/0x80 arch/x86/entry/common.c:411 entry_SYSENTER_compat_after_hwframe+0x84/0x8e -> #2 (btrfs_trans_num_extwriters){++++}-{0:0}: join_transaction+0x164/0xf40 fs/btrfs/transaction.c:315 start_transaction+0x427/0x1a70 fs/btrfs/transaction.c:700 btrfs_rebuild_free_space_tree+0xaa/0x480 fs/btrfs/free-space-tree.c:1323 btrfs_start_pre_rw_mount+0x218/0xf60 fs/btrfs/disk-io.c:2999 open_ctree+0x41ab/0x52e0 fs/btrfs/disk-io.c:3554 btrfs_fill_super fs/btrfs/super.c:946 [inline] btrfs_get_tree_super fs/btrfs/super.c:1863 [inline] btrfs_get_tree+0x11e9/0x1b90 fs/btrfs/super.c:2089 vfs_get_tree+0x8f/0x380 fs/super.c:1780 fc_mount+0x16/0xc0 fs/namespace.c:1125 btrfs_get_tree_subvol fs/btrfs/super.c:2052 [inline] btrfs_get_tree+0xa53/0x1b90 fs/btrfs/super.c:2090 vfs_get_tree+0x8f/0x380 fs/super.c:1780 do_new_mount fs/namespace.c:3352 [inline] path_mount+0x6e1/0x1f10 fs/namespace.c:3679 do_mount fs/namespace.c:3692 [inline] __do_sys_mount fs/namespace.c:3898 [inline] __se_sys_mount fs/namespace.c:3875 [inline] __ia32_sys_mount+0x295/0x320 fs/namespace.c:3875 do_syscall_32_irqs_on arch/x86/entry/common.c:165 [inline] __do_fast_syscall_32+0x73/0x120 arch/x86/entry/common.c:386 do_fast_syscall_32+0x32/0x80 arch/x86/entry/common.c:411 entry_SYSENTER_compat_after_hwframe+0x84/0x8e -> #1 (btrfs_trans_num_writers){++++}-{0:0}: join_transaction+0x148/0xf40 fs/btrfs/transaction.c:314 start_transaction+0x427/0x1a70 fs/btrfs/transaction.c:700 btrfs_rebuild_free_space_tree+0xaa/0x480 fs/btrfs/free-space-tree.c:1323 btrfs_start_pre_rw_mount+0x218/0xf60 fs/btrfs/disk-io.c:2999 open_ctree+0x41ab/0x52e0 fs/btrfs/disk-io.c:3554 btrfs_fill_super fs/btrfs/super.c:946 [inline] btrfs_get_tree_super fs/btrfs/super.c:1863 [inline] btrfs_get_tree+0x11e9/0x1b90 fs/btrfs/super.c:2089 vfs_get_tree+0x8f/0x380 fs/super.c:1780 fc_mount+0x16/0xc0 fs/namespace.c:1125 btrfs_get_tree_subvol fs/btrfs/super.c:2052 [inline] btrfs_get_tree+0xa53/0x1b90 fs/btrfs/super.c:2090 vfs_get_tree+0x8f/0x380 fs/super.c:1780 do_new_mount fs/namespace.c:3352 [inline] path_mount+0x6e1/0x1f10 fs/namespace.c:3679 do_mount fs/namespace.c:3692 [inline] __do_sys_mount fs/namespace.c:3898 [inline] __se_sys_mount fs/namespace.c:3875 [inline] __ia32_sys_mount+0x295/0x320 fs/namespace.c:3875 do_syscall_32_irqs_on arch/x86/entry/common.c:165 [inline] __do_fast_syscall_32+0x73/0x120 arch/x86/entry/common.c:386 do_fast_syscall_32+0x32/0x80 arch/x86/entry/common.c:411 entry_SYSENTER_compat_after_hwframe+0x84/0x8e -> #0 (sb_internal#3){.+.+}-{0:0}: check_prev_add kernel/locking/lockdep.c:3134 [inline] check_prevs_add kernel/locking/lockdep.c:3253 [inline] validate_chain kernel/locking/lockdep.c:3869 [inline] __lock_acquire+0x2478/0x3b30 kernel/locking/lockdep.c:5137 lock_acquire kernel/locking/lockdep.c:5754 [inline] lock_acquire+0x1b1/0x560 kernel/locking/lockdep.c:5719 percpu_down_read include/linux/percpu-rwsem.h:51 [inline] __sb_start_write include/linux/fs.h:1655 [inline] sb_start_intwrite include/linux/fs.h:1838 [inline] start_transaction+0xbc1/0x1a70 fs/btrfs/transaction.c:694 btrfs_commit_inode_delayed_inode+0x110/0x330 fs/btrfs/delayed-inode.c:1275 btrfs_evict_inode+0x960/0xe80 fs/btrfs/inode.c:5291 evict+0x2ed/0x6c0 fs/inode.c:667 iput_final fs/inode.c:1741 [inline] iput.part.0+0x5a8/0x7f0 fs/inode.c:1767 iput+0x5c/0x80 fs/inode.c:1757 btrfs_scan_root fs/btrfs/extent_map.c:1118 [inline] btrfs_free_extent_maps+0xbd3/0x1320 fs/btrfs/extent_map.c:1189 super_cache_scan+0x409/0x550 fs/super.c:227 do_shrink_slab+0x44f/0x11c0 mm/shrinker.c:435 shrink_slab+0x18a/0x1310 mm/shrinker.c:662 shrink_one+0x493/0x7c0 mm/vmscan.c:4790 shrink_many mm/vmscan.c:4851 [inline] lru_gen_shrink_node+0x89f/0x1750 mm/vmscan.c:4951 shrink_node mm/vmscan.c:5910 [inline] kswapd_shrink_node mm/vmscan.c:6720 [inline] balance_pgdat+0x1105/0x1970 mm/vmscan.c:6911 kswapd+0x5ea/0xbf0 mm/vmscan.c:7180 kthread+0x2c1/0x3a0 kernel/kthread.c:389 ret_from_fork+0x45/0x80 arch/x86/kernel/process.c:147 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244 other info that might help us debug this: Chain exists of: sb_internal#3 --> btrfs_trans_num_extwriters --> fs_reclaim Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(fs_reclaim); lock(btrfs_trans_num_extwriters); lock(fs_reclaim); rlock(sb_internal#3); *** DEADLOCK *** 2 locks held by kswapd0/111: #0: ffffffff8dd3a9a0 (fs_reclaim){+.+.}-{0:0}, at: balance_pgdat+0xa88/0x1970 mm/vmscan.c:6924 #1: ffff88801eae40e0 (&type->s_umount_key#62){++++}-{3:3}, at: super_trylock_shared fs/super.c:562 [inline] #1: ffff88801eae40e0 (&type->s_umount_key#62){++++}-{3:3}, at: super_cache_scan+0x96/0x550 fs/super.c:196 stack backtrace: CPU: 0 PID: 111 Comm: kswapd0 Not tainted 6.10.0-rc2-syzkaller-00010-g2ab795141095 #0 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014 Call Trace: <TASK> __dump_stack lib/dump_stack.c:88 [inline] dump_stack_lvl+0x116/0x1f0 lib/dump_stack.c:114 check_noncircular+0x31a/0x400 kernel/locking/lockdep.c:2187 check_prev_add kernel/locking/lockdep.c:3134 [inline] check_prevs_add kernel/locking/lockdep.c:3253 [inline] validate_chain kernel/locking/lockdep.c:3869 [inline] __lock_acquire+0x2478/0x3b30 kernel/locking/lockdep.c:5137 lock_acquire kernel/locking/lockdep.c:5754 [inline] lock_acquire+0x1b1/0x560 kernel/locking/lockdep.c:5719 percpu_down_read include/linux/percpu-rwsem.h:51 [inline] __sb_start_write include/linux/fs.h:1655 [inline] sb_start_intwrite include/linux/fs.h:1838 [inline] start_transaction+0xbc1/0x1a70 fs/btrfs/transaction.c:694 btrfs_commit_inode_delayed_inode+0x110/0x330 fs/btrfs/delayed-inode.c:1275 btrfs_evict_inode+0x960/0xe80 fs/btrfs/inode.c:5291 evict+0x2ed/0x6c0 fs/inode.c:667 iput_final fs/inode.c:1741 [inline] iput.part.0+0x5a8/0x7f0 fs/inode.c:1767 iput+0x5c/0x80 fs/inode.c:1757 btrfs_scan_root fs/btrfs/extent_map.c:1118 [inline] btrfs_free_extent_maps+0xbd3/0x1320 fs/btrfs/extent_map.c:1189 super_cache_scan+0x409/0x550 fs/super.c:227 do_shrink_slab+0x44f/0x11c0 mm/shrinker.c:435 shrink_slab+0x18a/0x1310 mm/shrinker.c:662 shrink_one+0x493/0x7c0 mm/vmscan.c:4790 shrink_many mm/vmscan.c:4851 [inline] lru_gen_shrink_node+0x89f/0x1750 mm/vmscan.c:4951 shrink_node mm/vmscan.c:5910 [inline] kswapd_shrink_node mm/vmscan.c:6720 [inline] balance_pgdat+0x1105/0x1970 mm/vmscan.c:6911 kswapd+0x5ea/0xbf0 mm/vmscan.c:7180 kthread+0x2c1/0x3a0 kernel/kthread.c:389 ret_from_fork+0x45/0x80 arch/x86/kernel/process.c:147 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244 </TASK> So fix this by using btrfs_add_delayed_iput() so that the final iput is delegated to the cleaner kthread. Link: https://lore.kernel.org/linux-btrfs/000000000000892280061a344581@google.com/ Reported-by: syzbot+3dad89b3993a4b275e72@syzkaller.appspotmail.com Fixes: 956a17d9d050 ("btrfs: add a shrinker for extent maps") Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11Merge tag 'mm-hotfixes-stable-2024-07-10-13-19' of ↵Linus Torvalds2-3/+36
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull misc fixes from Andrew Morton: "21 hotfixes, 15 of which are cc:stable. No identifiable theme here - all are singleton patches, 19 are for MM" * tag 'mm-hotfixes-stable-2024-07-10-13-19' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (21 commits) mm/hugetlb: fix kernel NULL pointer dereference when migrating hugetlb folio mm/hugetlb: fix potential race in __update_and_free_hugetlb_folio() filemap: replace pte_offset_map() with pte_offset_map_nolock() arch/xtensa: always_inline get_current() and current_thread_info() sched.h: always_inline alloc_tag_{save|restore} to fix modpost warnings MAINTAINERS: mailmap: update Lorenzo Stoakes's email address mm: fix crashes from deferred split racing folio migration lib/build_OID_registry: avoid non-destructive substitution for Perl < 5.13.2 compat mm: gup: stop abusing try_grab_folio nilfs2: fix kernel bug on rename operation of broken directory mm/hugetlb_vmemmap: fix race with speculative PFN walkers cachestat: do not flush stats in recency check mm/shmem: disable PMD-sized page cache if needed mm/filemap: skip to create PMD-sized page cache if needed mm/readahead: limit page cache size in page_cache_ra_order() mm/filemap: make MAX_PAGECACHE_ORDER acceptable to xarray mm/damon/core: merge regions aggressively when max_nr_regions is unmet Fix userfaultfd_api to return EINVAL as expected mm: vmalloc: check if a hash-index is in cpu_possible_mask mm: prevent derefencing NULL ptr in pfn_section_valid() ...
2024-07-10Merge tag 'bcachefs-2024-07-10' of https://evilpiepirate.org/git/bcachefsLinus Torvalds23-142/+256
Pull bcachefs fixes from Kent Overstreet: - Switch some asserts to WARN() - Fix a few "transaction not locked" asserts in the data read retry paths and backpointers gc - Fix a race that would cause the journal to get stuck on a flush commit - Add missing fsck checks for the fragmentation LRU - The usual assorted ssorted syzbot fixes * tag 'bcachefs-2024-07-10' of https://evilpiepirate.org/git/bcachefs: (22 commits) bcachefs: Add missing bch2_trans_begin() bcachefs: Fix missing error check in journal_entry_btree_keys_validate() bcachefs: Warn on attempting a move with no replicas bcachefs: bch2_data_update_to_text() bcachefs: Log mount failure error code bcachefs: Fix undefined behaviour in eytzinger1_first() bcachefs: Mark bch_inode_info as SLAB_ACCOUNT bcachefs: Fix bch2_inode_insert() race path for tmpfiles closures: fix closure_sync + closure debugging bcachefs: Fix journal getting stuck on a flush commit bcachefs: io clock: run timer fns under clock lock bcachefs: Repair fragmentation_lru in alloc_write_key() bcachefs: add check for missing fragmentation in check_alloc_to_lru_ref() bcachefs: bch2_btree_write_buffer_maybe_flush() bcachefs: Add missing printbuf_tabstops_reset() calls bcachefs: Fix loop restart in bch2_btree_transactions_read() bcachefs: Fix bch2_read_retry_nodecode() bcachefs: Don't use the new_fs() bucket alloc path on an initialized fs bcachefs: Fix shift greater than integer size bcachefs: Change bch2_fs_journal_stop() BUG_ON() to warning ...
2024-07-10bcachefs: fix scheduling while atomic in break_cycle()Kent Overstreet3-4/+24
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-10bcachefs: Fix RCU splatKent Overstreet2-1/+9
Reported-by: syzbot+e74fea078710bbca6f4b@syzkaller.appspotmail.com Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-10bcachefs: Add missing bch2_trans_begin()Kent Overstreet1-0/+2
this fixes a 'transaction should be locked' error in backpointers fsck Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-10bcachefs: Fix missing error check in journal_entry_btree_keys_validate()Kent Overstreet1-0/+2
Closes: https://syzkaller.appspot.com/bug?extid=8996d8f176cf946ef641 Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-10bcachefs: Warn on attempting a move with no replicasKent Overstreet1-0/+10
Instead of popping an assert in bch2_write(), WARN and print out some debugging info. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-10bcachefs: bch2_data_update_to_text()Kent Overstreet3-25/+39
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-10bcachefs: Log mount failure error codeKent Overstreet1-0/+2
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-10bcachefs: Fix undefined behaviour in eytzinger1_first()Kent Overstreet1-2/+4
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-10bcachefs: Mark bch_inode_info as SLAB_ACCOUNTYouling Tang1-1/+2
After commit 230e9fc28604 ("slab: add SLAB_ACCOUNT flag"), we need to mark the inode cache as SLAB_ACCOUNT, similar to commit 5d097056c9a0 ("kmemcg: account for certain kmem allocations to memcg") Signed-off-by: Youling Tang <tangyouling@kylinos.cn> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-10bcachefs: Fix bch2_inode_insert() race path for tmpfilesKent Overstreet1-0/+6
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-10bcachefs: Fix journal getting stuck on a flush commitKent Overstreet1-4/+6
silly race Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-10minixfs: Fix minixfs_rename with HIGHMEMMatthew Wilcox (Oracle)1-2/+1
minixfs now uses kmap_local_page(), so we can't call kunmap() to undo it. This one call was missed as part of the commit this fixes. Fixes: 6628f69ee66a (minixfs: Use dir_put_page() in minix_unlink() and minix_rename()) Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Link: https://lore.kernel.org/r/20240709195841.1986374-1-willy@infradead.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-07-09Merge tag '6.10-rc6-smb3-server-fixes' of git://git.samba.org/ksmbdLinus Torvalds2-4/+52
Pull smb server fixes from Steve French: - fix access flags to address fuse incompatibility - fix device type returned by get filesystem info * tag '6.10-rc6-smb3-server-fixes' of git://git.samba.org/ksmbd: ksmbd: discard write access to the directory open ksmbd: return FILE_DEVICE_DISK instead of super magic
2024-07-07Merge tag '6.10-rc6-smb3-client-fix' of git://git.samba.org/sfrench/cifs-2.6Linus Torvalds1-30/+0
Pull smb client fix from Steve French: "Fix for smb3 readahead performance regression" * tag '6.10-rc6-smb3-client-fix' of git://git.samba.org/sfrench/cifs-2.6: cifs: Fix read-performance regression by dropping readahead expansion
2024-07-06hfsplus: fix uninit-value in copy_nameEdward Adam Davis1-1/+1
[syzbot reported] BUG: KMSAN: uninit-value in sized_strscpy+0xc4/0x160 sized_strscpy+0xc4/0x160 copy_name+0x2af/0x320 fs/hfsplus/xattr.c:411 hfsplus_listxattr+0x11e9/0x1a50 fs/hfsplus/xattr.c:750 vfs_listxattr fs/xattr.c:493 [inline] listxattr+0x1f3/0x6b0 fs/xattr.c:840 path_listxattr fs/xattr.c:864 [inline] __do_sys_listxattr fs/xattr.c:876 [inline] __se_sys_listxattr fs/xattr.c:873 [inline] __x64_sys_listxattr+0x16b/0x2f0 fs/xattr.c:873 x64_sys_call+0x2ba0/0x3b50 arch/x86/include/generated/asm/syscalls_64.h:195 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xcf/0x1e0 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f Uninit was created at: slab_post_alloc_hook mm/slub.c:3877 [inline] slab_alloc_node mm/slub.c:3918 [inline] kmalloc_trace+0x57b/0xbe0 mm/slub.c:4065 kmalloc include/linux/slab.h:628 [inline] hfsplus_listxattr+0x4cc/0x1a50 fs/hfsplus/xattr.c:699 vfs_listxattr fs/xattr.c:493 [inline] listxattr+0x1f3/0x6b0 fs/xattr.c:840 path_listxattr fs/xattr.c:864 [inline] __do_sys_listxattr fs/xattr.c:876 [inline] __se_sys_listxattr fs/xattr.c:873 [inline] __x64_sys_listxattr+0x16b/0x2f0 fs/xattr.c:873 x64_sys_call+0x2ba0/0x3b50 arch/x86/include/generated/asm/syscalls_64.h:195 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xcf/0x1e0 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f [Fix] When allocating memory to strbuf, initialize memory to 0. Reported-and-tested-by: syzbot+efde959319469ff8d4d7@syzkaller.appspotmail.com Signed-off-by: Edward Adam Davis <eadavis@qq.com> Link: https://lore.kernel.org/r/tencent_8BBB6433BC9E1C1B7B4BDF1BF52574BA8808@qq.com Reported-and-tested-by: syzbot+01ade747b16e9c8030e0@syzkaller.appspotmail.com Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-07-05vfs: don't mod negative dentry count when on shrinker listBrian Foster1-3/+9
The nr_dentry_negative counter is intended to only account negative dentries that are present on the superblock LRU. Therefore, the LRU add, remove and isolate helpers modify the counter based on whether the dentry is negative, but the shrinker list related helpers do not modify the counter, and the paths that change a dentry between positive and negative only do so if DCACHE_LRU_LIST is set. The problem with this is that a dentry on a shrinker list still has DCACHE_LRU_LIST set to indicate ->d_lru is in use. The additional DCACHE_SHRINK_LIST flag denotes whether the dentry is on LRU or a shrink related list. Therefore if a relevant operation (i.e. unlink) occurs while a dentry is present on a shrinker list, and the associated codepath only checks for DCACHE_LRU_LIST, then it is technically possible to modify the negative dentry count for a dentry that is off the LRU. Since the shrinker list related helpers do not modify the negative dentry count (because non-LRU dentries should not be included in the count) when the dentry is ultimately removed from the shrinker list, this can cause the negative dentry count to become permanently inaccurate. This problem can be reproduced via a heavy file create/unlink vs. drop_caches workload. On an 80xcpu system, I start 80 tasks each running a 1k file create/delete loop, and one task spinning on drop_caches. After 10 minutes or so of runtime, the idle/clean cache negative dentry count increases from somewhere in the range of 5-10 entries to several hundred (and increasingly grows beyond nr_dentry_unused). Tweak the logic in the paths that turn a dentry negative or positive to filter out the case where the dentry is present on a shrink related list. This allows the above workload to maintain an accurate negative dentry count. Fixes: af0c9af1b3f6 ("fs/dcache: Track & report number of negative dentries") Signed-off-by: Brian Foster <bfoster@redhat.com> Link: https://lore.kernel.org/r/20240703121301.247680-1-bfoster@redhat.com Acked-by: Ian Kent <ikent@redhat.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Waiman Long <longman@redhat.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-07-05filelock: fix potential use-after-free in posix_lock_inodeJeff Layton1-1/+1
Light Hsieh reported a KASAN UAF warning in trace_posix_lock_inode(). The request pointer had been changed earlier to point to a lock entry that was added to the inode's list. However, before the tracepoint could fire, another task raced in and freed that lock. Fix this by moving the tracepoint inside the spinlock, which should ensure that this doesn't happen. Fixes: 74f6f5912693 ("locks: fix KASAN: use-after-free in trace_event_raw_event_filelock_lock") Link: https://lore.kernel.org/linux-fsdevel/724ffb0a2962e912ea62bb0515deadf39c325112.camel@kernel.org/ Reported-by: Light Hsieh (謝明燈) <Light.Hsieh@mediatek.com> Signed-off-by: Jeff Layton <jlayton@kernel.org> Link: https://lore.kernel.org/r/20240702-filelock-6-10-v1-1-96e766aadc98@kernel.org Reviewed-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-07-05Merge patch series "cachefiles: random bugfixes"Christian Brauner21-126/+191
libaokun@huaweicloud.com <libaokun@huaweicloud.com> says: This is the third version of this patch series, in which another patch set is subsumed into this one to avoid confusing the two patch sets. (https://patchwork.kernel.org/project/linux-fsdevel/list/?series=854914) We've been testing ondemand mode for cachefiles since January, and we're almost done. We hit a lot of issues during the testing period, and this patch series fixes some of the issues. The patches have passed internal testing without regression. The following is a brief overview of the patches, see the patches for more details. Patch 1-2: Add fscache_try_get_volume() helper function to avoid fscache_volume use-after-free on cache withdrawal. Patch 3: Fix cachefiles_lookup_cookie() and cachefiles_withdraw_cache() concurrency causing cachefiles_volume use-after-free. Patch 4: Propagate error codes returned by vfs_getxattr() to avoid endless loops. Patch 5-7: A read request waiting for reopen could be closed maliciously before the reopen worker is executing or waiting to be scheduled. So ondemand_object_worker() may be called after the info and object and even the cache have been freed and trigger use-after-free. So use cancel_work_sync() in cachefiles_ondemand_clean_object() to cancel the reopen worker or wait for it to finish. Since it makes no sense to wait for the daemon to complete the reopen request, to avoid this pointless operation blocking cancel_work_sync(), Patch 1 avoids request generation by the DROPPING state when the request has not been sent, and Patch 2 flushes the requests of the current object before cancel_work_sync(). Patch 8: Cyclic allocation of msg_id to avoid msg_id reuse misleading the daemon to cause hung. Patch 9: Hold xas_lock during polling to avoid dereferencing reqs causing use-after-free. This issue was triggered frequently in our tests, and we found that anolis 5.10 had fixed it. So to avoid failing the test, this patch is pushed upstream as well. Baokun Li (7): netfs, fscache: export fscache_put_volume() and add fscache_try_get_volume() cachefiles: fix slab-use-after-free in fscache_withdraw_volume() cachefiles: fix slab-use-after-free in cachefiles_withdraw_cookie() cachefiles: propagate errors from vfs_getxattr() to avoid infinite loop cachefiles: stop sending new request when dropping object cachefiles: cancel all requests for the object that is being dropped cachefiles: cyclic allocation of msg_id to avoid reuse Hou Tao (1): cachefiles: wait for ondemand_object_worker to finish when dropping object Jingbo Xu (1): cachefiles: add missing lock protection when polling fs/cachefiles/cache.c | 45 ++++++++++++++++++++++++++++- fs/cachefiles/daemon.c | 4 +-- fs/cachefiles/internal.h | 3 ++ fs/cachefiles/ondemand.c | 52 ++++++++++++++++++++++++++++++---- fs/cachefiles/volume.c | 1 - fs/cachefiles/xattr.c | 5 +++- fs/netfs/fscache_volume.c | 14 +++++++++ fs/netfs/internal.h | 2 -- include/linux/fscache-cache.h | 6 ++++ include/trace/events/fscache.h | 4 +++ 10 files changed, 123 insertions(+), 13 deletions(-) Link: https://lore.kernel.org/r/20240628062930.2467993-1-libaokun@huaweicloud.com Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-07-05ksmbd: discard write access to the directory openHobin Woo1-2/+11
may_open() does not allow a directory to be opened with the write access. However, some writing flags set by client result in adding write access on server, making ksmbd incompatible with FUSE file system. Simply, let's discard the write access when opening a directory. list_add corruption. next is NULL. ------------[ cut here ]------------ kernel BUG at lib/list_debug.c:26! pc : __list_add_valid+0x88/0xbc lr : __list_add_valid+0x88/0xbc Call trace: __list_add_valid+0x88/0xbc fuse_finish_open+0x11c/0x170 fuse_open_common+0x284/0x5e8 fuse_dir_open+0x14/0x24 do_dentry_open+0x2a4/0x4e0 dentry_open+0x50/0x80 smb2_open+0xbe4/0x15a4 handle_ksmbd_work+0x478/0x5ec process_one_work+0x1b4/0x448 worker_thread+0x25c/0x430 kthread+0x104/0x1d4 ret_from_fork+0x10/0x20 Cc: stable@vger.kernel.org Signed-off-by: Yoonho Shin <yoonho.shin@samsung.com> Signed-off-by: Hobin Woo <hobin.woo@samsung.com> Acked-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>
2024-07-04Merge tag 'for-6.10-rc6-tag' of ↵Linus Torvalds5-9/+38
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: - fix folio refcounting when releasing them (encoded write, dummy extent buffer) - fix out of bounds read when checking qgroup inherit data - fix how configurable chunk size is handled in zoned mode - in the ref-verify tool, fix uninitialized return value when checking extent owner ref and simple quota are not enabled * tag 'for-6.10-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: fix folio refcount in __alloc_dummy_extent_buffer() btrfs: fix folio refcount in btrfs_do_encoded_write() btrfs: fix uninitialized return value in the ref-verify tool btrfs: always do the basic checks for btrfs_qgroup_inherit structure btrfs: zoned: fix calc_available_free_space() for zoned mode
2024-07-04Merge tag 'mm-hotfixes-stable-2024-07-03-22-23' of ↵Linus Torvalds8-15/+41
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull misc fixes from, Andrew Morton: "6 hotfies, all cc:stable. Some fixes for longstanding nilfs2 issues and three unrelated MM fixes" * tag 'mm-hotfixes-stable-2024-07-03-22-23' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: nilfs2: fix incorrect inode allocation from reserved inodes nilfs2: add missing check for inode numbers on directory entries nilfs2: fix inode number range checks mm: avoid overflows in dirty throttling logic Revert "mm/writeback: fix possible divide-by-zero in wb_dirty_limits(), again" mm: optimize the redundant loop of mm_update_owner_next()
2024-07-04nilfs2: fix kernel bug on rename operation of broken directoryRyusuke Konishi1-2/+30
Syzbot reported that in rename directory operation on broken directory on nilfs2, __block_write_begin_int() called to prepare block write may fail BUG_ON check for access exceeding the folio/page size. This is because nilfs_dotdot(), which gets parent directory reference entry ("..") of the directory to be moved or renamed, does not check consistency enough, and may return location exceeding folio/page size for broken directories. Fix this issue by checking required directory entries ("." and "..") in the first chunk of the directory in nilfs_dotdot(). Link: https://lkml.kernel.org/r/20240628165107.9006-1-konishi.ryusuke@gmail.com Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Reported-by: syzbot+d3abed1ad3d367fa2627@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=d3abed1ad3d367fa2627 Fixes: 2ba466d74ed7 ("nilfs2: directory entry operations") Tested-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-04Fix userfaultfd_api to return EINVAL as expectedAudra Mitchell1-1/+6
Currently if we request a feature that is not set in the Kernel config we fail silently and return all the available features. However, the man page indicates we should return an EINVAL. We need to fix this issue since we can end up with a Kernel warning should a program request the feature UFFD_FEATURE_WP_UNPOPULATED on a kernel with the config not set with this feature. [ 200.812896] WARNING: CPU: 91 PID: 13634 at mm/memory.c:1660 zap_pte_range+0x43d/0x660 [ 200.820738] Modules linked in: [ 200.869387] CPU: 91 PID: 13634 Comm: userfaultfd Kdump: loaded Not tainted 6.9.0-rc5+ #8 [ 200.877477] Hardware name: Dell Inc. PowerEdge R6525/0N7YGH, BIOS 2.7.3 03/30/2022 [ 200.885052] RIP: 0010:zap_pte_range+0x43d/0x660 Link: https://lkml.kernel.org/r/20240626130513.120193-1-audra@redhat.com Fixes: e06f1e1dd499 ("userfaultfd: wp: enabled write protection in userfaultfd API") Signed-off-by: Audra Mitchell <audra@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Jan Kara <jack@suse.cz> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Peter Xu <peterx@redhat.com> Cc: Rafael Aquini <raquini@redhat.com> Cc: Shaohua Li <shli@fb.com> Cc: Shuah Khan <shuah@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-04ksmbd: return FILE_DEVICE_DISK instead of super magicNamjae Jeon2-2/+41
MS-SMB2 specification describes setting ->DeviceType to FILE_DEVICE_DISK or FILE_DEVICE_CD_ROM. Set FILE_DEVICE_DISK instead of super magic in FS_DEVICE_INFORMATION. And Set FILE_READ_ONLY_DEVICE for read-only share. Signed-off-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>
2024-07-04btrfs: fix folio refcount in __alloc_dummy_extent_buffer()Boris Burkov1-1/+1
Another improper use of __folio_put() in an error path after freshly allocating pages/folios which returns them with the refcount initialized to 1. The refactor from __free_pages() -> __folio_put() (instead of folio_put) removed a refcount decrement found in __free_pages() and folio_put but absent from __folio_put(). Fixes: 13df3775efca ("btrfs: cleanup metadata page pointer usage") CC: stable@vger.kernel.org # 6.8+ Tested-by: Ed Tomlinson <edtoml@gmail.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Boris Burkov <boris@bur.io> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-04btrfs: fix folio refcount in btrfs_do_encoded_write()Boris Burkov1-1/+1
The conversion to folios switched __free_page() to __folio_put() in the error path in btrfs_do_encoded_write(). However, this gets the page refcounting wrong. If we do hit that error path (I reproduced by modifying btrfs_do_encoded_write to pretend to always fail in a way that jumps to out_folios and running the fstests case btrfs/281), then we always hit the following BUG freeing the folio: BUG: Bad page state in process btrfs pfn:40ab0b page: refcount:1 mapcount:0 mapping:0000000000000000 index:0x61be5 pfn:0x40ab0b flags: 0x5ffff0000000000(node=0|zone=2|lastcpupid=0x1ffff) raw: 05ffff0000000000 0000000000000000 dead000000000122 0000000000000000 raw: 0000000000061be5 0000000000000000 00000001ffffffff 0000000000000000 page dumped because: nonzero _refcount Call Trace: <TASK> dump_stack_lvl+0x3d/0xe0 bad_page+0xea/0xf0 free_unref_page+0x8e1/0x900 ? __mem_cgroup_uncharge+0x69/0x90 __folio_put+0xe6/0x190 btrfs_do_encoded_write+0x445/0x780 ? current_time+0x25/0xd0 btrfs_do_write_iter+0x2cc/0x4b0 btrfs_ioctl_encoded_write+0x2b6/0x340 It turns out __free_page() decreases the page reference count while __folio_put() does not. Switch __folio_put() to folio_put() which decreases the folio reference count first. Fixes: 400b172b8cdc ("btrfs: compression: migrate compression/decompression paths to folios") Tested-by: Ed Tomlinson <edtoml@gmail.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Boris Burkov <boris@bur.io> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-03nilfs2: fix incorrect inode allocation from reserved inodesRyusuke Konishi4-12/+20
If the bitmap block that manages the inode allocation status is corrupted, nilfs_ifile_create_inode() may allocate a new inode from the reserved inode area where it should not be allocated. Previous fix commit d325dc6eb763 ("nilfs2: fix use-after-free bug of struct nilfs_root"), fixed the problem that reserved inodes with inode numbers less than NILFS_USER_INO (=11) were incorrectly reallocated due to bitmap corruption, but since the start number of non-reserved inodes is read from the super block and may change, in which case inode allocation may occur from the extended reserved inode area. If that happens, access to that inode will cause an IO error, causing the file system to degrade to an error state. Fix this potential issue by adding a wraparound option to the common metadata object allocation routine and by modifying nilfs_ifile_create_inode() to disable the option so that it only allocates inodes with inode numbers greater than or equal to the inode number read in "nilfs->ns_first_ino", regardless of the bitmap status of reserved inodes. Link: https://lkml.kernel.org/r/20240623051135.4180-4-konishi.ryusuke@gmail.com Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Jan Kara <jack@suse.cz> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-03nilfs2: add missing check for inode numbers on directory entriesRyusuke Konishi2-0/+11
Syzbot reported that mounting and unmounting a specific pattern of corrupted nilfs2 filesystem images causes a use-after-free of metadata file inodes, which triggers a kernel bug in lru_add_fn(). As Jan Kara pointed out, this is because the link count of a metadata file gets corrupted to 0, and nilfs_evict_inode(), which is called from iput(), tries to delete that inode (ifile inode in this case). The inconsistency occurs because directories containing the inode numbers of these metadata files that should not be visible in the namespace are read without checking. Fix this issue by treating the inode numbers of these internal files as errors in the sanity check helper when reading directory folios/pages. Also thanks to Hillf Danton and Matthew Wilcox for their initial mm-layer analysis. Link: https://lkml.kernel.org/r/20240623051135.4180-3-konishi.ryusuke@gmail.com Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Reported-by: syzbot+d79afb004be235636ee8@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=d79afb004be235636ee8 Reported-by: Jan Kara <jack@suse.cz> Closes: https://lkml.kernel.org/r/20240617075758.wewhukbrjod5fp5o@quack3 Tested-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-03nilfs2: fix inode number range checksRyusuke Konishi3-3/+10
Patch series "nilfs2: fix potential issues related to reserved inodes". This series fixes one use-after-free issue reported by syzbot, caused by nilfs2's internal inode being exposed in the namespace on a corrupted filesystem, and a couple of flaws that cause problems if the starting number of non-reserved inodes written in the on-disk super block is intentionally (or corruptly) changed from its default value. This patch (of 3): In the current implementation of nilfs2, "nilfs->ns_first_ino", which gives the first non-reserved inode number, is read from the superblock, but its lower limit is not checked. As a result, if a number that overlaps with the inode number range of reserved inodes such as the root directory or metadata files is set in the super block parameter, the inode number test macros (NILFS_MDT_INODE and NILFS_VALID_INODE) will not function properly. In addition, these test macros use left bit-shift calculations using with the inode number as the shift count via the BIT macro, but the result of a shift calculation that exceeds the bit width of an integer is undefined in the C specification, so if "ns_first_ino" is set to a large value other than the default value NILFS_USER_INO (=11), the macros may potentially malfunction depending on the environment. Fix these issues by checking the lower bound of "nilfs->ns_first_ino" and by preventing bit shifts equal to or greater than the NILFS_USER_INO constant in the inode number test macros. Also, change the type of "ns_first_ino" from signed integer to unsigned integer to avoid the need for type casting in comparisons such as the lower bound check introduced this time. Link: https://lkml.kernel.org/r/20240623051135.4180-1-konishi.ryusuke@gmail.com Link: https://lkml.kernel.org/r/20240623051135.4180-2-konishi.ryusuke@gmail.com Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Jan Kara <jack@suse.cz> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-03cachefiles: add missing lock protection when pollingJingbo Xu1-2/+2
Add missing lock protection in poll routine when iterating xarray, otherwise: Even with RCU read lock held, only the slot of the radix tree is ensured to be pinned there, while the data structure (e.g. struct cachefiles_req) stored in the slot has no such guarantee. The poll routine will iterate the radix tree and dereference cachefiles_req accordingly. Thus RCU read lock is not adequate in this case and spinlock is needed here. Fixes: b817e22b2e91 ("cachefiles: narrow the scope of triggering EPOLLIN events in ondemand mode") Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com> Signed-off-by: Baokun Li <libaokun1@huawei.com> Link: https://lore.kernel.org/r/20240628062930.2467993-10-libaokun@huaweicloud.com Acked-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Jia Zhu <zhujia.zj@bytedance.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-07-03cachefiles: cyclic allocation of msg_id to avoid reuseBaokun Li2-4/+17
Reusing the msg_id after a maliciously completed reopen request may cause a read request to remain unprocessed and result in a hung, as shown below: t1 | t2 | t3 ------------------------------------------------- cachefiles_ondemand_select_req cachefiles_ondemand_object_is_close(A) cachefiles_ondemand_set_object_reopening(A) queue_work(fscache_object_wq, &info->work) ondemand_object_worker cachefiles_ondemand_init_object(A) cachefiles_ondemand_send_req(OPEN) // get msg_id 6 wait_for_completion(&req_A->done) cachefiles_ondemand_daemon_read // read msg_id 6 req_A cachefiles_ondemand_get_fd copy_to_user // Malicious completion msg_id 6 copen 6,-1 cachefiles_ondemand_copen complete(&req_A->done) // will not set the object to close // because ondemand_id && fd is valid. // ondemand_object_worker() is done // but the object is still reopening. // new open req_B cachefiles_ondemand_init_object(B) cachefiles_ondemand_send_req(OPEN) // reuse msg_id 6 process_open_req copen 6,A.size // The expected failed copen was executed successfully Expect copen to fail, and when it does, it closes fd, which sets the object to close, and then close triggers reopen again. However, due to msg_id reuse resulting in a successful copen, the anonymous fd is not closed until the daemon exits. Therefore read requests waiting for reopen to complete may trigger hung task. To avoid this issue, allocate the msg_id cyclically to avoid reusing the msg_id for a very short duration of time. Fixes: c8383054506c ("cachefiles: notify the user daemon when looking up cookie") Signed-off-by: Baokun Li <libaokun1@huawei.com> Link: https://lore.kernel.org/r/20240628062930.2467993-9-libaokun@huaweicloud.com Acked-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Reviewed-by: Jia Zhu <zhujia.zj@bytedance.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-07-03cachefiles: wait for ondemand_object_worker to finish when dropping objectHou Tao1-0/+3
When queuing ondemand_object_worker() to re-open the object, cachefiles_object is not pinned. The cachefiles_object may be freed when the pending read request is completed intentionally and the related erofs is umounted. If ondemand_object_worker() runs after the object is freed, it will incur use-after-free problem as shown below. process A processs B process C process D cachefiles_ondemand_send_req() // send a read req X // wait for its completion // close ondemand fd cachefiles_ondemand_fd_release() // set object as CLOSE cachefiles_ondemand_daemon_read() // set object as REOPENING queue_work(fscache_wq, &info->ondemand_work) // close /dev/cachefiles cachefiles_daemon_release cachefiles_flush_reqs complete(&req->done) // read req X is completed // umount the erofs fs cachefiles_put_object() // object will be freed cachefiles_ondemand_deinit_obj_info() kmem_cache_free(object) // both info and object are freed ondemand_object_worker() When dropping an object, it is no longer necessary to reopen the object, so use cancel_work_sync() to cancel or wait for ondemand_object_worker() to finish. Fixes: 0a7e54c1959c ("cachefiles: resend an open request if the read request's object is closed") Signed-off-by: Hou Tao <houtao1@huawei.com> Signed-off-by: Baokun Li <libaokun1@huawei.com> Link: https://lore.kernel.org/r/20240628062930.2467993-8-libaokun@huaweicloud.com Acked-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Jia Zhu <zhujia.zj@bytedance.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-07-03cachefiles: cancel all requests for the object that is being droppedBaokun Li1-0/+19
Because after an object is dropped, requests for that object are useless, cancel them to avoid causing other problems. This prepares for the later addition of cancel_work_sync(). After the reopen requests is generated, cancel it to avoid cancel_work_sync() blocking by waiting for daemon to complete the reopen requests. Signed-off-by: Baokun Li <libaokun1@huawei.com> Link: https://lore.kernel.org/r/20240628062930.2467993-7-libaokun@huaweicloud.com Acked-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Reviewed-by: Jia Zhu <zhujia.zj@bytedance.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-07-03cachefiles: stop sending new request when dropping objectBaokun Li2-2/+10
Added CACHEFILES_ONDEMAND_OBJSTATE_DROPPING indicates that the cachefiles object is being dropped, and is set after the close request for the dropped object completes, and no new requests are allowed to be sent after this state. This prepares for the later addition of cancel_work_sync(). It prevents leftover reopen requests from being sent, to avoid processing unnecessary requests and to avoid cancel_work_sync() blocking by waiting for daemon to complete the reopen requests. Signed-off-by: Baokun Li <libaokun1@huawei.com> Link: https://lore.kernel.org/r/20240628062930.2467993-6-libaokun@huaweicloud.com Acked-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Reviewed-by: Jia Zhu <zhujia.zj@bytedance.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-07-03cachefiles: propagate errors from vfs_getxattr() to avoid infinite loopBaokun Li1-1/+4
In cachefiles_check_volume_xattr(), the error returned by vfs_getxattr() is not passed to ret, so it ends up returning -ESTALE, which leads to an endless loop as follows: cachefiles_acquire_volume retry: ret = cachefiles_check_volume_xattr ret = -ESTALE xlen = vfs_getxattr // return -EIO // The ret is not updated when xlen < 0, so -ESTALE is returned. return ret // Supposed to jump out of the loop at this judgement. if (ret != -ESTALE) goto error_dir; cachefiles_bury_object // EIO causes rename failure goto retry; Hence propagate the error returned by vfs_getxattr() to avoid the above issue. Do the same in cachefiles_check_auxdata(). Fixes: 32e150037dce ("fscache, cachefiles: Store the volume coherency data") Fixes: 72b957856b0c ("cachefiles: Implement metadata/coherency data storage in xattrs") Signed-off-by: Baokun Li <libaokun1@huawei.com> Link: https://lore.kernel.org/r/20240628062930.2467993-5-libaokun@huaweicloud.com Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-07-03cachefiles: fix slab-use-after-free in cachefiles_withdraw_cookie()Baokun Li2-2/+34
We got the following issue in our fault injection stress test: ================================================================== BUG: KASAN: slab-use-after-free in cachefiles_withdraw_cookie+0x4d9/0x600 Read of size 8 at addr ffff888118efc000 by task kworker/u78:0/109 CPU: 13 PID: 109 Comm: kworker/u78:0 Not tainted 6.8.0-dirty #566 Call Trace: <TASK> kasan_report+0x93/0xc0 cachefiles_withdraw_cookie+0x4d9/0x600 fscache_cookie_state_machine+0x5c8/0x1230 fscache_cookie_worker+0x91/0x1c0 process_one_work+0x7fa/0x1800 [...] Allocated by task 117: kmalloc_trace+0x1b3/0x3c0 cachefiles_acquire_volume+0xf3/0x9c0 fscache_create_volume_work+0x97/0x150 process_one_work+0x7fa/0x1800 [...] Freed by task 120301: kfree+0xf1/0x2c0 cachefiles_withdraw_cache+0x3fa/0x920 cachefiles_put_unbind_pincount+0x1f6/0x250 cachefiles_daemon_release+0x13b/0x290 __fput+0x204/0xa00 task_work_run+0x139/0x230 do_exit+0x87a/0x29b0 [...] ================================================================== Following is the process that triggers the issue: p1 | p2 ------------------------------------------------------------ fscache_begin_lookup fscache_begin_volume_access fscache_cache_is_live(fscache_cache) cachefiles_daemon_release cachefiles_put_unbind_pincount cachefiles_daemon_unbind cachefiles_withdraw_cache fscache_withdraw_cache fscache_set_cache_state(cache, FSCACHE_CACHE_IS_WITHDRAWN); cachefiles_withdraw_objects(cache) fscache_wait_for_objects(fscache) atomic_read(&fscache_cache->object_count) == 0 fscache_perform_lookup cachefiles_lookup_cookie cachefiles_alloc_object refcount_set(&object->ref, 1); object->volume = volume fscache_count_object(vcookie->cache); atomic_inc(&fscache_cache->object_count) cachefiles_withdraw_volumes cachefiles_withdraw_volume fscache_withdraw_volume __cachefiles_free_volume kfree(cachefiles_volume) fscache_cookie_state_machine cachefiles_withdraw_cookie cache = object->volume->cache; // cachefiles_volume UAF !!! After setting FSCACHE_CACHE_IS_WITHDRAWN, wait for all the cookie lookups to complete first, and then wait for fscache_cache->object_count == 0 to avoid the cookie exiting after the volume has been freed and triggering the above issue. Therefore call fscache_withdraw_volume() before calling cachefiles_withdraw_objects(). This way, after setting FSCACHE_CACHE_IS_WITHDRAWN, only the following two cases will occur: 1) fscache_begin_lookup fails in fscache_begin_volume_access(). 2) fscache_withdraw_volume() will ensure that fscache_count_object() has been executed before calling fscache_wait_for_objects(). Fixes: fe2140e2f57f ("cachefiles: Implement volume support") Suggested-by: Hou Tao <houtao1@huawei.com> Signed-off-by: Baokun Li <libaokun1@huawei.com> Link: https://lore.kernel.org/r/20240628062930.2467993-4-libaokun@huaweicloud.com Signed-off-by: Christian Brauner <brauner@kernel.org>