summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)AuthorFilesLines
2023-10-25pNFS: Fix a hang in nfs4_evict_inode()Trond Myklebust1-10/+23
commit f63955721a8020e979b99cc417dcb6da3106aa24 upstream. We are not allowed to call pnfs_mark_matching_lsegs_return() without also holding a reference to the layout header, since doing so could lead to the reference count going to zero when we call pnfs_layout_remove_lseg(). This again can lead to a hang when we get to nfs4_evict_inode() and are unable to clear the layout pointer. pnfs_layout_return_unused_byserver() is guilty of this behaviour, and has been seen to trigger the refcount warning prior to a hang. Fixes: b6d49ecd1081 ("NFSv4: Fix a pNFS layout related use-after-free race when freeing the inode") Cc: stable@vger.kernel.org Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-10-25btrfs: fix some -Wmaybe-uninitialized warnings in ioctl.cJosef Bacik1-2/+2
[ Upstream commit 9147b9ded499d9853bdf0e9804b7eaa99c4429ed ] Jens reported the following warnings from -Wmaybe-uninitialized recent Linus' branch. In file included from ./include/asm-generic/rwonce.h:26, from ./arch/arm64/include/asm/rwonce.h:71, from ./include/linux/compiler.h:246, from ./include/linux/export.h:5, from ./include/linux/linkage.h:7, from ./include/linux/kernel.h:17, from fs/btrfs/ioctl.c:6: In function ‘instrument_copy_from_user_before’, inlined from ‘_copy_from_user’ at ./include/linux/uaccess.h:148:3, inlined from ‘copy_from_user’ at ./include/linux/uaccess.h:183:7, inlined from ‘btrfs_ioctl_space_info’ at fs/btrfs/ioctl.c:2999:6, inlined from ‘btrfs_ioctl’ at fs/btrfs/ioctl.c:4616:10: ./include/linux/kasan-checks.h:38:27: warning: ‘space_args’ may be used uninitialized [-Wmaybe-uninitialized] 38 | #define kasan_check_write __kasan_check_write ./include/linux/instrumented.h:129:9: note: in expansion of macro ‘kasan_check_write’ 129 | kasan_check_write(to, n); | ^~~~~~~~~~~~~~~~~ ./include/linux/kasan-checks.h: In function ‘btrfs_ioctl’: ./include/linux/kasan-checks.h:20:6: note: by argument 1 of type ‘const volatile void *’ to ‘__kasan_check_write’ declared here 20 | bool __kasan_check_write(const volatile void *p, unsigned int size); | ^~~~~~~~~~~~~~~~~~~ fs/btrfs/ioctl.c:2981:39: note: ‘space_args’ declared here 2981 | struct btrfs_ioctl_space_args space_args; | ^~~~~~~~~~ In function ‘instrument_copy_from_user_before’, inlined from ‘_copy_from_user’ at ./include/linux/uaccess.h:148:3, inlined from ‘copy_from_user’ at ./include/linux/uaccess.h:183:7, inlined from ‘_btrfs_ioctl_send’ at fs/btrfs/ioctl.c:4343:9, inlined from ‘btrfs_ioctl’ at fs/btrfs/ioctl.c:4658:10: ./include/linux/kasan-checks.h:38:27: warning: ‘args32’ may be used uninitialized [-Wmaybe-uninitialized] 38 | #define kasan_check_write __kasan_check_write ./include/linux/instrumented.h:129:9: note: in expansion of macro ‘kasan_check_write’ 129 | kasan_check_write(to, n); | ^~~~~~~~~~~~~~~~~ ./include/linux/kasan-checks.h: In function ‘btrfs_ioctl’: ./include/linux/kasan-checks.h:20:6: note: by argument 1 of type ‘const volatile void *’ to ‘__kasan_check_write’ declared here 20 | bool __kasan_check_write(const volatile void *p, unsigned int size); | ^~~~~~~~~~~~~~~~~~~ fs/btrfs/ioctl.c:4341:49: note: ‘args32’ declared here 4341 | struct btrfs_ioctl_send_args_32 args32; | ^~~~~~ This was due to his config options and having KASAN turned on, which adds some extra checks around copy_from_user(), which then triggered the -Wmaybe-uninitialized checker for these cases. Fix the warnings by initializing the different structs we're copying into. Reported-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-10-25overlayfs: set ctime when setting mtime and atimeJeff Layton1-1/+1
[ Upstream commit 03dbab3bba5f009d053635c729d1244f2c8bad38 ] Nathan reported that he was seeing the new warning in setattr_copy_mgtime pop when starting podman containers. Overlayfs is trying to set the atime and mtime via notify_change without also setting the ctime. POSIX states that when the atime and mtime are updated via utimes() that we must also update the ctime to the current time. The situation with overlayfs copy-up is analogies, so add ATTR_CTIME to the bitmask. notify_change will fill in the value. Reported-by: Nathan Chancellor <nathan@kernel.org> Signed-off-by: Jeff Layton <jlayton@kernel.org> Tested-by: Nathan Chancellor <nathan@kernel.org> Acked-by: Christian Brauner <brauner@kernel.org> Acked-by: Amir Goldstein <amir73il@gmail.com> Message-Id: <20230913-ctime-v1-1-c6bc509cbc27@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-10-25btrfs: initialize start_slot in btrfs_log_prealloc_extentsJosef Bacik1-1/+1
[ Upstream commit b4c639f699349880b7918b861e1bd360442ec450 ] Jens reported a compiler warning when using CONFIG_CC_OPTIMIZE_FOR_SIZE=y that looks like this fs/btrfs/tree-log.c: In function ‘btrfs_log_prealloc_extents’: fs/btrfs/tree-log.c:4828:23: warning: ‘start_slot’ may be used uninitialized [-Wmaybe-uninitialized] 4828 | ret = copy_items(trans, inode, dst_path, path, | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 4829 | start_slot, ins_nr, 1, 0); | ~~~~~~~~~~~~~~~~~~~~~~~~~ fs/btrfs/tree-log.c:4725:13: note: ‘start_slot’ was declared here 4725 | int start_slot; | ^~~~~~~~~~ The compiler is incorrect, as we only use this code when ins_len > 0, and when ins_len > 0 we have start_slot properly initialized. However we generally find the -Wmaybe-uninitialized warnings valuable, so initialize start_slot to get rid of the warning. Reported-by: Jens Axboe <axboe@kernel.dk> Tested-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-10-25btrfs: return -EUCLEAN for delayed tree ref with a ref count not equals to 1Filipe Manana1-3/+3
[ Upstream commit 1bf76df3fee56d6637718e267f7c34ed70d0c7dc ] When running a delayed tree reference, if we find a ref count different from 1, we return -EIO. This isn't an IO error, as it indicates either a bug in the delayed refs code or a memory corruption, so change the error code from -EIO to -EUCLEAN. Also tag the branch as 'unlikely' as this is not expected to ever happen, and change the error message to print the tree block's bytenr without the parenthesis (and there was a missing space between the 'block' word and the opening parenthesis), for consistency as that's the style we used everywhere else. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-10-25ceph: fix type promotion bug on 32bit systemsDan Carpenter1-1/+1
commit 07bb00ef00ace88dd6f695fadbba76565756e55c upstream. In this code "ret" is type long and "src_objlen" is unsigned int. The problem is that on 32bit systems, when we do the comparison signed longs are type promoted to unsigned int. So negative error codes from do_splice_direct() are treated as success instead of failure. Cc: stable@vger.kernel.org Fixes: 1b0c3b9f91f0 ("ceph: re-org copy_file_range and fix some error paths") Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org> Reviewed-by: Xiubo Li <xiubli@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-10-25ceph: fix incorrect revoked caps assert in ceph_fill_file_size()Xiubo Li1-3/+1
commit 15c0a870dc44ed14e01efbdd319d232234ee639f upstream. When truncating the inode the MDS will acquire the xlock for the ifile Locker, which will revoke the 'Frwsxl' caps from the clients. But when the client just releases and flushes the 'Fw' caps to MDS, for exmaple, and once the MDS receives the caps flushing msg it just thought the revocation has finished. Then the MDS will continue truncating the inode and then issued the truncate notification to all the clients. While just before the clients receives the cap flushing ack they receive the truncation notification, the clients will detecte that the 'issued | dirty' is still holding the 'Fw' caps. Cc: stable@vger.kernel.org Link: https://tracker.ceph.com/issues/56693 Fixes: b0d7c2231015 ("ceph: introduce i_truncate_mutex") Signed-off-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Milind Changire <mchangir@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-10-25quota: Fix slow quotaoffJan Kara1-27/+39
commit 869b6ea1609f655a43251bf41757aa44e5350a8f upstream. Eric has reported that commit dabc8b207566 ("quota: fix dqput() to follow the guarantees dquot_srcu should provide") heavily increases runtime of generic/270 xfstest for ext4 in nojournal mode. The reason for this is that ext4 in nojournal mode leaves dquots dirty until the last dqput() and thus the cleanup done in quota_release_workfn() has to write them all. Due to the way quota_release_workfn() is written this results in synchronize_srcu() call for each dirty dquot which makes the dquot cleanup when turning quotas off extremely slow. To be able to avoid synchronize_srcu() for each dirty dquot we need to rework how we track dquots to be cleaned up. Instead of keeping the last dquot reference while it is on releasing_dquots list, we drop it right away and mark the dquot with new DQ_RELEASING_B bit instead. This way we can we can remove dquot from releasing_dquots list when new reference to it is acquired and thus there's no need to call synchronize_srcu() each time we drop dq_list_lock. References: https://lore.kernel.org/all/ZRytn6CxFK2oECUt@debian-BULLSEYE-live-builder-AMD64 Reported-by: Eric Whitney <enwlinux@gmail.com> Fixes: dabc8b207566 ("quota: fix dqput() to follow the guarantees dquot_srcu should provide") CC: stable@vger.kernel.org Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-10-10NFSv4: Fix a nfs4_state_manager() raceTrond Myklebust1-0/+7
[ Upstream commit ed1cc05aa1f7fe8197d300e914afc28ab9818f89 ] If the NFS4CLNT_RUN_MANAGER flag got set just before we cleared NFS4CLNT_MANAGER_RUNNING, then we might have won the race against nfs4_schedule_state_manager(), and are responsible for handling the recovery situation. Fixes: aeabb3c96186 ("NFSv4: Fix a NFSv4 state manager deadlock") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-10-10NFSv4: Fix a state manager thread deadlock regressionTrond Myklebust2-12/+28
[ Upstream commit 956fd46f97d238032cb5fa4771cdaccc6e760f9a ] Commit 4dc73c679114 reintroduces the deadlock that was fixed by commit aeabb3c96186 ("NFSv4: Fix a NFSv4 state manager deadlock") because it prevents the setup of new threads to handle reboot recovery, while the older recovery thread is stuck returning delegations. Fixes: 4dc73c679114 ("NFSv4: keep state manager thread active if swap is enabled") Cc: stable@vger.kernel.org Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-10-10NFS: rename nfs_client_kset to nfs_ksetBenjamin Coddington1-8/+8
[ Upstream commit 8b18a2edecc0741b0eecf8b18fdb356a0f8682de ] Be brief and match the subsystem name. There's no need to distinguish this kset variable from the server. Signed-off-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Stable-dep-of: 956fd46f97d2 ("NFSv4: Fix a state manager thread deadlock regression") Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-10-10NFS: Cleanup unused rpc_clnt variableBenjamin Coddington1-4/+0
[ Upstream commit e025f0a73f6acb920d86549b2177a5883535421d ] The root rpc_clnt is not used here, clean it up. Fixes: 4dc73c679114 ("NFSv4: keep state manager thread active if swap is enabled") Signed-off-by: Benjamin Coddington <bcodding@redhat.com> Reviewed-by: NeilBrown <neilb@suse.de> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Stable-dep-of: 956fd46f97d2 ("NFSv4: Fix a state manager thread deadlock regression") Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-10-10fs: binfmt_elf_efpic: fix personality for ELF-FDPICGreg Ungerer1-3/+2
commit 7c3151585730b7095287be8162b846d31e6eee61 upstream. The elf-fdpic loader hard sets the process personality to either PER_LINUX_FDPIC for true elf-fdpic binaries or to PER_LINUX for normal ELF binaries (in this case they would be constant displacement compiled with -pie for example). The problem with that is that it will lose any other bits that may be in the ELF header personality (such as the "bug emulation" bits). On the ARM architecture the ADDR_LIMIT_32BIT flag is used to signify a normal 32bit binary - as opposed to a legacy 26bit address binary. This matters since start_thread() will set the ARM CPSR register as required based on this flag. If the elf-fdpic loader loses this bit the process will be mis-configured and crash out pretty quickly. Modify elf-fdpic loader personality setting so that it preserves the upper three bytes by using the SET_PERSONALITY macro to set it. This macro in the generic case sets PER_LINUX and preserves the upper bytes. Architectures can override this for their specific use case, and ARM does exactly this. The problem shows up quite easily running under qemu using the ARM architecture, but not necessarily on all types of real ARM hardware. If the underlying ARM processor does not support the legacy 26-bit addressing mode then everything will work as expected. Link: https://lkml.kernel.org/r/20230907011808.2985083-1-gerg@kernel.org Fixes: 1bde925d23547 ("fs/binfmt_elf_fdpic.c: provide NOMMU loader for regular ELF binaries") Signed-off-by: Greg Ungerer <gerg@kernel.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Greg Ungerer <gerg@kernel.org> Cc: Kees Cook <keescook@chromium.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-10-10btrfs: properly report 0 avail for very full file systemsJosef Bacik1-1/+1
commit 58bfe2ccec5f9f137b41dd38f335290dcc13cd5c upstream. A user reported some issues with smaller file systems that get very full. While investigating this issue I noticed that df wasn't showing 100% full, despite having 0 chunk space and having < 1MiB of available metadata space. This turns out to be an overflow issue, we're doing: total_available_metadata_space - SZ_4M < global_block_rsv_size to determine if there's not enough space to make metadata allocations, which overflows if total_available_metadata_space is < 4M. Fix this by checking to see if our available space is greater than the 4M threshold. This makes df properly report 100% usage on the file system. CC: stable@vger.kernel.org # 4.14+ Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-10-10proc: nommu: /proc/<pid>/maps: release mmap read lockBen Wolsieffer1-12/+15
commit 578d7699e5c2add8c2e9549d9d75dfb56c460cb3 upstream. The no-MMU implementation of /proc/<pid>/map doesn't normally release the mmap read lock, because it uses !IS_ERR_OR_NULL(_vml) to determine whether to release the lock. Since _vml is NULL when the end of the mappings is reached, the lock is not released. Reading /proc/1/maps twice doesn't cause a hang because it only takes the read lock, which can be taken multiple times and therefore doesn't show any problem if the lock isn't released. Instead, you need to perform some operation that attempts to take the write lock after reading /proc/<pid>/maps. To actually reproduce the bug, compile the following code as 'proc_maps_bug': #include <stdio.h> #include <unistd.h> #include <sys/mman.h> int main(int argc, char *argv[]) { void *buf; sleep(1); buf = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0); puts("mmap returned"); return 0; } Then, run: ./proc_maps_bug &; cat /proc/$!/maps; fg Without this patch, mmap() will hang and the command will never complete. This code was incorrectly adapted from the MMU implementation, which at the time released the lock in m_next() before returning the last entry. The MMU implementation has diverged further from the no-MMU version since then, so this patch brings their locking and error handling into sync, fixing the bug and hopefully avoiding similar issues in the future. Link: https://lkml.kernel.org/r/20230914163019.4050530-2-ben.wolsieffer@hefring.com Fixes: 47fecca15c09 ("fs/proc/task_nommu.c: don't use priv->task->mm") Signed-off-by: Ben Wolsieffer <ben.wolsieffer@hefring.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Cc: Giulio Benetti <giulio.benetti@benettiengineering.com> Cc: Greg Ungerer <gerg@uclinux.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-10-10nilfs2: fix potential use after free in nilfs_gccache_submit_read_data()Pan Bian1-3/+3
commit 7ee29facd8a9c5a26079148e36bcf07141b3a6bc upstream. In nilfs_gccache_submit_read_data(), brelse(bh) is called to drop the reference count of bh when the call to nilfs_dat_translate() fails. If the reference count hits 0 and its owner page gets unlocked, bh may be freed. However, bh->b_page is dereferenced to put the page after that, which may result in a use-after-free bug. This patch moves the release operation after unlocking and putting the page. NOTE: The function in question is only called in GC, and in combination with current userland tools, address translation using DAT does not occur in that function, so the code path that causes this issue will not be executed. However, it is possible to run that code path by intentionally modifying the userland GC library or by calling the GC ioctl directly. [konishi.ryusuke@gmail.com: NOTE added to the commit log] Link: https://lkml.kernel.org/r/1543201709-53191-1-git-send-email-bianpan2016@163.com Link: https://lkml.kernel.org/r/20230921141731.10073-1-konishi.ryusuke@gmail.com Fixes: a3d93f709e89 ("nilfs2: block cache for garbage collection") Signed-off-by: Pan Bian <bianpan2016@163.com> Reported-by: Ferry Meng <mengferry@linux.alibaba.com> Closes: https://lkml.kernel.org/r/20230818092022.111054-1-mengferry@linux.alibaba.com Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Tested-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-10-10btrfs: reset destination buffer when read_extent_buffer() gets invalid rangeQu Wenruo1-1/+7
[ Upstream commit 74ee79142c0a344d4eae2eb7012ebc4e82254109 ] Commit f98b6215d7d1 ("btrfs: extent_io: do extra check for extent buffer read write functions") changed how we handle invalid extent buffer range for read_extent_buffer(). Previously if the range is invalid we just set the destination to zero, but after the patch we do nothing and error out. This can lead to smatch static checker errors like: fs/btrfs/print-tree.c:186 print_uuid_item() error: uninitialized symbol 'subvol_id'. fs/btrfs/tests/extent-io-tests.c:338 check_eb_bitmap() error: uninitialized symbol 'has'. fs/btrfs/tests/extent-io-tests.c:353 check_eb_bitmap() error: uninitialized symbol 'has'. fs/btrfs/uuid-tree.c:203 btrfs_uuid_tree_remove() error: uninitialized symbol 'read_subid'. fs/btrfs/uuid-tree.c:353 btrfs_uuid_tree_iterate() error: uninitialized symbol 'subid_le'. fs/btrfs/uuid-tree.c:72 btrfs_uuid_tree_lookup() error: uninitialized symbol 'data'. fs/btrfs/volumes.c:7415 btrfs_dev_stats_value() error: uninitialized symbol 'val'. Fix those warnings by reverting back to the old memset() behavior. By this we keep the static checker happy and would still make a lot of noise when such invalid ranges are passed in. Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Fixes: f98b6215d7d1 ("btrfs: extent_io: do extra check for extent buffer read write functions") Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-10-10ext4: do not let fstrim block system suspendJan Kara1-2/+10
[ Upstream commit 5229a658f6453362fbb9da6bf96872ef25a7097e ] Len Brown has reported that system suspend sometimes fail due to inability to freeze a task working in ext4_trim_fs() for one minute. Trimming a large filesystem on a disk that slowly processes discard requests can indeed take a long time. Since discard is just an advisory call, it is perfectly fine to interrupt it at any time and the return number of discarded blocks until that moment. Do that when we detect the task is being frozen. Cc: stable@kernel.org Reported-by: Len Brown <lenb@kernel.org> Suggested-by: Dave Chinner <david@fromorbit.com> References: https://bugzilla.kernel.org/show_bug.cgi?id=216322 Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230913150504.9054-2-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-10-10ext4: move setting of trimmed bit into ext4_try_to_trim_range()Jan Kara1-21/+25
[ Upstream commit 45e4ab320c9b5fa67b1fc3b6a9b381cfcc0c8488 ] Currently we set the group's trimmed bit in ext4_trim_all_free() based on return value of ext4_try_to_trim_range(). However when we will want to abort trimming because of suspend attempt, we want to return success from ext4_try_to_trim_range() but not set the trimmed bit. Instead implementing awkward propagation of this information, just move setting of trimmed bit into ext4_try_to_trim_range() when the whole group is trimmed. Cc: stable@kernel.org Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230913150504.9054-1-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-10-10ext4: replace the traditional ternary conditional operator with with max()/min()Kemeng Shi1-4/+2
[ Upstream commit de8bf0e5ee7482585450357c6d4eddec8efc5cb7 ] Replace the traditional ternary conditional operator with with max()/min() Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Link: https://lore.kernel.org/r/20230801143204.2284343-7-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Stable-dep-of: 45e4ab320c9b ("ext4: move setting of trimmed bit into ext4_try_to_trim_range()") Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-10-10ext4: mark group as trimmed only if it was fully scannedDmitry Monakhov1-6/+12
[ Upstream commit d63c00ea435a5352f486c259665a4ced60399421 ] Otherwise nonaligned fstrim calls will works inconveniently for iterative scanners, for example: // trim [0,16MB] for group-1, but mark full group as trimmed fstrim -o $((1024*1024*128)) -l $((1024*1024*16)) ./m // handle [16MB,16MB] for group-1, do nothing because group already has the flag. fstrim -o $((1024*1024*144)) -l $((1024*1024*16)) ./m [ Update function documentation for ext4_trim_all_free -- TYT ] Signed-off-by: Dmitry Monakhov <dmtrmonakhov@yandex-team.ru> Link: https://lore.kernel.org/r/1650214995-860245-1-git-send-email-dmtrmonakhov@yandex-team.ru Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org Stable-dep-of: 45e4ab320c9b ("ext4: move setting of trimmed bit into ext4_try_to_trim_range()") Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-10-10ext4: change s_last_trim_minblks type to unsigned longLukas Czerner2-3/+3
[ Upstream commit 2327fb2e23416cfb2795ccca2f77d4d65925be99 ] There is no good reason for the s_last_trim_minblks to be atomic. There is no data integrity needed and there is no real danger in setting and reading it in a racy manner. Change it to be unsigned long, the same type as s_clusters_per_group which is the maximum that's allowed. Signed-off-by: Lukas Czerner <lczerner@redhat.com> Suggested-by: Andreas Dilger <adilger@dilger.ca> Reviewed-by: Andreas Dilger <adilger@dilger.ca> Link: https://lore.kernel.org/r/20211103145122.17338-1-lczerner@redhat.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Stable-dep-of: 45e4ab320c9b ("ext4: move setting of trimmed bit into ext4_try_to_trim_range()") Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-10-10ext4: scope ret locally in ext4_try_to_trim_range()Lukas Bulwahn1-3/+2
[ Upstream commit afcc4e32f606dbfb47aa7309172c89174b86e74c ] As commit 6920b3913235 ("ext4: add new helper interface ext4_try_to_trim_range()") moves some code into the separate function ext4_try_to_trim_range(), the use of the variable ret within that function is more limited and can be adjusted as well. Scope the use of the variable ret locally and drop dead assignments. No functional change. Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com> Link: https://lore.kernel.org/r/20210820120853.23134-1-lukas.bulwahn@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Stable-dep-of: 45e4ab320c9b ("ext4: move setting of trimmed bit into ext4_try_to_trim_range()") Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-10-10ext4: add new helper interface ext4_try_to_trim_range()Wang Jianchao1-45/+57
[ Upstream commit 6920b3913235f517728bb69abe9b39047a987113 ] There is no functional change in this patch but just split the codes, which serachs free block and does trim, into a new function ext4_try_to_trim_range. This is preparing for the following async backgroup discard. Reviewed-by: Andreas Dilger <adilger@dilger.ca> Signed-off-by: Wang Jianchao <wangjianchao@kuaishou.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20210724074124.25731-3-jianchao.wan9@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Stable-dep-of: 45e4ab320c9b ("ext4: move setting of trimmed bit into ext4_try_to_trim_range()") Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-10-10ext4: remove the 'group' parameter of ext4_trim_extentWang Jianchao1-5/+4
[ Upstream commit bd2eea8d0a6b6a9aca22f20bf74f73b71d8808af ] Get rid of the 'group' parameter of ext4_trim_extent as we can get it from the 'e4b'. Reviewed-by: Andreas Dilger <adilger@dilger.ca> Signed-off-by: Wang Jianchao <wangjianchao@kuaishou.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20210724074124.25731-2-jianchao.wan9@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Stable-dep-of: 45e4ab320c9b ("ext4: move setting of trimmed bit into ext4_try_to_trim_range()") Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-10-10NFS/pNFS: Report EINVAL errors from connect() to the serverTrond Myklebust1-0/+1
[ Upstream commit dd7d7ee3ba2a70d12d02defb478790cf57d5b87b ] With IPv6, connect() can occasionally return EINVAL if a route is unavailable. If this happens during I/O to a data server, we want to report it using LAYOUTERROR as an inability to connect. Fixes: dd52128afdde ("NFSv4.1/pnfs Ensure flexfiles reports all connection related errors") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-10-10NFS: Use the correct commit info in nfs_join_page_group()Trond Myklebust2-14/+17
[ Upstream commit b193a78ddb5ee7dba074d3f28dc050069ba083c0 ] Ensure that nfs_clear_request_commit() updates the correct counters when it removes them from the commit list. Fixes: ed5d588fe47f ("NFS: Try to join page groups before an O_DIRECT retransmission") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-09-23ext4: fix rec_len verify errorShida Zhang1-11/+15
commit 7fda67e8c3ab6069f75888f67958a6d30454a9f6 upstream. With the configuration PAGE_SIZE 64k and filesystem blocksize 64k, a problem occurred when more than 13 million files were directly created under a directory: EXT4-fs error (device xx): ext4_dx_csum_set:492: inode #xxxx: comm xxxxx: dir seems corrupt? Run e2fsck -D. EXT4-fs error (device xx): ext4_dx_csum_verify:463: inode #xxxx: comm xxxxx: dir seems corrupt? Run e2fsck -D. EXT4-fs error (device xx): dx_probe:856: inode #xxxx: block 8188: comm xxxxx: Directory index failed checksum When enough files are created, the fake_dirent->reclen will be 0xffff. it doesn't equal to the blocksize 65536, i.e. 0x10000. But it is not the same condition when blocksize equals to 4k. when enough files are created, the fake_dirent->reclen will be 0x1000. it equals to the blocksize 4k, i.e. 0x1000. The problem seems to be related to the limitation of the 16-bit field when the blocksize is set to 64k. To address this, helpers like ext4_rec_len_{from,to}_disk has already been introduced to complete the conversion between the encoded and the plain form of rec_len. So fix this one by using the helper, and all the other in this file too. Cc: stable@kernel.org Fixes: dbe89444042a ("ext4: Calculate and verify checksums for htree nodes") Suggested-by: Andreas Dilger <adilger@dilger.ca> Suggested-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Shida Zhang <zhangshida@kylinos.cn> Reviewed-by: Andreas Dilger <adilger@dilger.ca> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Link: https://lore.kernel.org/r/20230803060938.1929759-1-zhangshida@kylinos.cn Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-09-23tracefs: Add missing lockdown check to tracefs_create_dir()Steven Rostedt (Google)1-0/+3
commit 51aab5ffceb43e05119eb059048fd75765d2bc21 upstream. The function tracefs_create_dir() was missing a lockdown check and was called by the RV code. This gave an inconsistent behavior of this function returning success while other tracefs functions failed. This caused the inode being freed by the wrong kmem_cache. Link: https://lkml.kernel.org/r/20230905182711.692687042@goodmis.org Link: https://lore.kernel.org/all/202309050916.58201dc6-oliver.sang@intel.com/ Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Ajay Kaher <akaher@vmware.com> Cc: Ching-lin Yu <chinglinyu@google.com> Fixes: bf8e602186ec4 ("tracing: Do not create tracefs files if tracefs lockdown is in effect") Reported-by: kernel test robot <oliver.sang@intel.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-09-23nfsd: fix change_info in NFSv4 RENAME repliesJeff Layton1-2/+2
commit fdd2630a7398191e84822612e589062063bd4f3d upstream. nfsd sends the transposed directory change info in the RENAME reply. The source directory is in save_fh and the target is in current_fh. Reported-by: Zhi Li <yieli@redhat.com> Reported-by: Benjamin Coddington <bcodding@redhat.com> Closes: https://bugzilla.redhat.com/show_bug.cgi?id=2218844 Signed-off-by: Jeff Layton <jlayton@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-09-23proc: fix a dentry lock race between release_task and lookupZhihao Cheng1-8/+38
commit d919a1e79bac890421537cf02ae773007bf55e6b upstream. Commit 7bc3e6e55acf06 ("proc: Use a list of inodes to flush from proc") moved proc_flush_task() behind __exit_signal(). Then, process systemd can take long period high cpu usage during releasing task in following concurrent processes: systemd ps kernel_waitid stat(/proc/tgid) do_wait filename_lookup wait_consider_task lookup_fast release_task __exit_signal __unhash_process detach_pid __change_pid // remove task->pid_links d_revalidate -> pid_revalidate // 0 d_invalidate(/proc/tgid) shrink_dcache_parent(/proc/tgid) d_walk(/proc/tgid) spin_lock_nested(/proc/tgid/fd) // iterating opened fd proc_flush_pid | d_invalidate (/proc/tgid/fd) | shrink_dcache_parent(/proc/tgid/fd) | shrink_dentry_list(subdirs) ↓ shrink_lock_dentry(/proc/tgid/fd) --> race on dentry lock Function d_invalidate() will remove dentry from hash firstly, but why does proc_flush_pid() process dentry '/proc/tgid/fd' before dentry '/proc/tgid'? That's because proc_pid_make_inode() adds proc inode in reverse order by invoking hlist_add_head_rcu(). But proc should not add any inodes under '/proc/tgid' except '/proc/tgid/task/pid', fix it by adding inode into 'pid->inodes' only if the inode is /proc/tgid or /proc/tgid/task/pid. Performance regression: Create 200 tasks, each task open one file for 50,000 times. Kill all tasks when opened files exceed 10,000,000 (cat /proc/sys/fs/file-nr). Before fix: $ time killall -wq aa real 4m40.946s # During this period, we can see 'ps' and 'systemd' taking high cpu usage. After fix: $ time killall -wq aa real 1m20.732s # During this period, we can see 'systemd' taking high cpu usage. Link: https://lkml.kernel.org/r/20220713130029.4133533-1-chengzhihao1@huawei.com Fixes: 7bc3e6e55acf06 ("proc: Use a list of inodes to flush from proc") Link: https://bugzilla.kernel.org/show_bug.cgi?id=216054 Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com> Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Suggested-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Eric Biederman <ebiederm@xmission.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Baoquan He <bhe@redhat.com> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> [ Context adjustments ] Signed-off-by: Suraj Jitindar Singh <surajjs@amazon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-09-23btrfs: release path before inode lookup during the ino lookup ioctlFilipe Manana1-1/+7
commit ee34a82e890a7babb5585daf1a6dd7d4d1cf142a upstream. During the ino lookup ioctl we can end up calling btrfs_iget() to get an inode reference while we are holding on a root's btree. If btrfs_iget() needs to lookup the inode from the root's btree, because it's not currently loaded in memory, then it will need to lock another or the same path in the same root btree. This may result in a deadlock and trigger the following lockdep splat: WARNING: possible circular locking dependency detected 6.5.0-rc7-syzkaller-00004-gf7757129e3de #0 Not tainted ------------------------------------------------------ syz-executor277/5012 is trying to acquire lock: ffff88802df41710 (btrfs-tree-01){++++}-{3:3}, at: __btrfs_tree_read_lock+0x2f/0x220 fs/btrfs/locking.c:136 but task is already holding lock: ffff88802df418e8 (btrfs-tree-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x2f/0x220 fs/btrfs/locking.c:136 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #1 (btrfs-tree-00){++++}-{3:3}: down_read_nested+0x49/0x2f0 kernel/locking/rwsem.c:1645 __btrfs_tree_read_lock+0x2f/0x220 fs/btrfs/locking.c:136 btrfs_search_slot+0x13a4/0x2f80 fs/btrfs/ctree.c:2302 btrfs_init_root_free_objectid+0x148/0x320 fs/btrfs/disk-io.c:4955 btrfs_init_fs_root fs/btrfs/disk-io.c:1128 [inline] btrfs_get_root_ref+0x5ae/0xae0 fs/btrfs/disk-io.c:1338 btrfs_get_fs_root fs/btrfs/disk-io.c:1390 [inline] open_ctree+0x29c8/0x3030 fs/btrfs/disk-io.c:3494 btrfs_fill_super+0x1c7/0x2f0 fs/btrfs/super.c:1154 btrfs_mount_root+0x7e0/0x910 fs/btrfs/super.c:1519 legacy_get_tree+0xef/0x190 fs/fs_context.c:611 vfs_get_tree+0x8c/0x270 fs/super.c:1519 fc_mount fs/namespace.c:1112 [inline] vfs_kern_mount+0xbc/0x150 fs/namespace.c:1142 btrfs_mount+0x39f/0xb50 fs/btrfs/super.c:1579 legacy_get_tree+0xef/0x190 fs/fs_context.c:611 vfs_get_tree+0x8c/0x270 fs/super.c:1519 do_new_mount+0x28f/0xae0 fs/namespace.c:3335 do_mount fs/namespace.c:3675 [inline] __do_sys_mount fs/namespace.c:3884 [inline] __se_sys_mount+0x2d9/0x3c0 fs/namespace.c:3861 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd -> #0 (btrfs-tree-01){++++}-{3:3}: check_prev_add kernel/locking/lockdep.c:3142 [inline] check_prevs_add kernel/locking/lockdep.c:3261 [inline] validate_chain kernel/locking/lockdep.c:3876 [inline] __lock_acquire+0x39ff/0x7f70 kernel/locking/lockdep.c:5144 lock_acquire+0x1e3/0x520 kernel/locking/lockdep.c:5761 down_read_nested+0x49/0x2f0 kernel/locking/rwsem.c:1645 __btrfs_tree_read_lock+0x2f/0x220 fs/btrfs/locking.c:136 btrfs_tree_read_lock fs/btrfs/locking.c:142 [inline] btrfs_read_lock_root_node+0x292/0x3c0 fs/btrfs/locking.c:281 btrfs_search_slot_get_root fs/btrfs/ctree.c:1832 [inline] btrfs_search_slot+0x4ff/0x2f80 fs/btrfs/ctree.c:2154 btrfs_lookup_inode+0xdc/0x480 fs/btrfs/inode-item.c:412 btrfs_read_locked_inode fs/btrfs/inode.c:3892 [inline] btrfs_iget_path+0x2d9/0x1520 fs/btrfs/inode.c:5716 btrfs_search_path_in_tree_user fs/btrfs/ioctl.c:1961 [inline] btrfs_ioctl_ino_lookup_user+0x77a/0xf50 fs/btrfs/ioctl.c:2105 btrfs_ioctl+0xb0b/0xd40 fs/btrfs/ioctl.c:4683 vfs_ioctl fs/ioctl.c:51 [inline] __do_sys_ioctl fs/ioctl.c:870 [inline] __se_sys_ioctl+0xf8/0x170 fs/ioctl.c:856 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd other info that might help us debug this: Possible unsafe locking scenario: CPU0 CPU1 ---- ---- rlock(btrfs-tree-00); lock(btrfs-tree-01); lock(btrfs-tree-00); rlock(btrfs-tree-01); *** DEADLOCK *** 1 lock held by syz-executor277/5012: #0: ffff88802df418e8 (btrfs-tree-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x2f/0x220 fs/btrfs/locking.c:136 stack backtrace: CPU: 1 PID: 5012 Comm: syz-executor277 Not tainted 6.5.0-rc7-syzkaller-00004-gf7757129e3de #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 07/26/2023 Call Trace: <TASK> __dump_stack lib/dump_stack.c:88 [inline] dump_stack_lvl+0x1e7/0x2d0 lib/dump_stack.c:106 check_noncircular+0x375/0x4a0 kernel/locking/lockdep.c:2195 check_prev_add kernel/locking/lockdep.c:3142 [inline] check_prevs_add kernel/locking/lockdep.c:3261 [inline] validate_chain kernel/locking/lockdep.c:3876 [inline] __lock_acquire+0x39ff/0x7f70 kernel/locking/lockdep.c:5144 lock_acquire+0x1e3/0x520 kernel/locking/lockdep.c:5761 down_read_nested+0x49/0x2f0 kernel/locking/rwsem.c:1645 __btrfs_tree_read_lock+0x2f/0x220 fs/btrfs/locking.c:136 btrfs_tree_read_lock fs/btrfs/locking.c:142 [inline] btrfs_read_lock_root_node+0x292/0x3c0 fs/btrfs/locking.c:281 btrfs_search_slot_get_root fs/btrfs/ctree.c:1832 [inline] btrfs_search_slot+0x4ff/0x2f80 fs/btrfs/ctree.c:2154 btrfs_lookup_inode+0xdc/0x480 fs/btrfs/inode-item.c:412 btrfs_read_locked_inode fs/btrfs/inode.c:3892 [inline] btrfs_iget_path+0x2d9/0x1520 fs/btrfs/inode.c:5716 btrfs_search_path_in_tree_user fs/btrfs/ioctl.c:1961 [inline] btrfs_ioctl_ino_lookup_user+0x77a/0xf50 fs/btrfs/ioctl.c:2105 btrfs_ioctl+0xb0b/0xd40 fs/btrfs/ioctl.c:4683 vfs_ioctl fs/ioctl.c:51 [inline] __do_sys_ioctl fs/ioctl.c:870 [inline] __se_sys_ioctl+0xf8/0x170 fs/ioctl.c:856 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd RIP: 0033:0x7f0bec94ea39 Fix this simply by releasing the path before calling btrfs_iget() as at point we don't need the path anymore. Reported-by: syzbot+bf66ad948981797d2f1d@syzkaller.appspotmail.com Link: https://lore.kernel.org/linux-btrfs/00000000000045fa140603c4a969@google.com/ Fixes: 23d0b79dfaed ("btrfs: Add unprivileged version of ino_lookup ioctl") CC: stable@vger.kernel.org # 4.19+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-09-23btrfs: fix lockdep splat and potential deadlock after failure running ↵Filipe Manana1-3/+16
delayed items commit e110f8911ddb93e6f55da14ccbbe705397b30d0b upstream. When running delayed items we are holding a delayed node's mutex and then we will attempt to modify a subvolume btree to insert/update/delete the delayed items. However if have an error during the insertions for example, btrfs_insert_delayed_items() may return with a path that has locked extent buffers (a leaf at the very least), and then we attempt to release the delayed node at __btrfs_run_delayed_items(), which requires taking the delayed node's mutex, causing an ABBA type of deadlock. This was reported by syzbot and the lockdep splat is the following: WARNING: possible circular locking dependency detected 6.5.0-rc7-syzkaller-00024-g93f5de5f648d #0 Not tainted ------------------------------------------------------ syz-executor.2/13257 is trying to acquire lock: ffff88801835c0c0 (&delayed_node->mutex){+.+.}-{3:3}, at: __btrfs_release_delayed_node+0x9a/0xaa0 fs/btrfs/delayed-inode.c:256 but task is already holding lock: ffff88802a5ab8e8 (btrfs-tree-00){++++}-{3:3}, at: __btrfs_tree_lock+0x3c/0x2a0 fs/btrfs/locking.c:198 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #1 (btrfs-tree-00){++++}-{3:3}: __lock_release kernel/locking/lockdep.c:5475 [inline] lock_release+0x36f/0x9d0 kernel/locking/lockdep.c:5781 up_write+0x79/0x580 kernel/locking/rwsem.c:1625 btrfs_tree_unlock_rw fs/btrfs/locking.h:189 [inline] btrfs_unlock_up_safe+0x179/0x3b0 fs/btrfs/locking.c:239 search_leaf fs/btrfs/ctree.c:1986 [inline] btrfs_search_slot+0x2511/0x2f80 fs/btrfs/ctree.c:2230 btrfs_insert_empty_items+0x9c/0x180 fs/btrfs/ctree.c:4376 btrfs_insert_delayed_item fs/btrfs/delayed-inode.c:746 [inline] btrfs_insert_delayed_items fs/btrfs/delayed-inode.c:824 [inline] __btrfs_commit_inode_delayed_items+0xd24/0x2410 fs/btrfs/delayed-inode.c:1111 __btrfs_run_delayed_items+0x1db/0x430 fs/btrfs/delayed-inode.c:1153 flush_space+0x269/0xe70 fs/btrfs/space-info.c:723 btrfs_async_reclaim_metadata_space+0x106/0x350 fs/btrfs/space-info.c:1078 process_one_work+0x92c/0x12c0 kernel/workqueue.c:2600 worker_thread+0xa63/0x1210 kernel/workqueue.c:2751 kthread+0x2b8/0x350 kernel/kthread.c:389 ret_from_fork+0x2e/0x60 arch/x86/kernel/process.c:145 ret_from_fork_asm+0x11/0x20 arch/x86/entry/entry_64.S:304 -> #0 (&delayed_node->mutex){+.+.}-{3:3}: check_prev_add kernel/locking/lockdep.c:3142 [inline] check_prevs_add kernel/locking/lockdep.c:3261 [inline] validate_chain kernel/locking/lockdep.c:3876 [inline] __lock_acquire+0x39ff/0x7f70 kernel/locking/lockdep.c:5144 lock_acquire+0x1e3/0x520 kernel/locking/lockdep.c:5761 __mutex_lock_common+0x1d8/0x2530 kernel/locking/mutex.c:603 __mutex_lock kernel/locking/mutex.c:747 [inline] mutex_lock_nested+0x1b/0x20 kernel/locking/mutex.c:799 __btrfs_release_delayed_node+0x9a/0xaa0 fs/btrfs/delayed-inode.c:256 btrfs_release_delayed_node fs/btrfs/delayed-inode.c:281 [inline] __btrfs_run_delayed_items+0x2b5/0x430 fs/btrfs/delayed-inode.c:1156 btrfs_commit_transaction+0x859/0x2ff0 fs/btrfs/transaction.c:2276 btrfs_sync_file+0xf56/0x1330 fs/btrfs/file.c:1988 vfs_fsync_range fs/sync.c:188 [inline] vfs_fsync fs/sync.c:202 [inline] do_fsync fs/sync.c:212 [inline] __do_sys_fsync fs/sync.c:220 [inline] __se_sys_fsync fs/sync.c:218 [inline] __x64_sys_fsync+0x196/0x1e0 fs/sync.c:218 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd other info that might help us debug this: Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(btrfs-tree-00); lock(&delayed_node->mutex); lock(btrfs-tree-00); lock(&delayed_node->mutex); *** DEADLOCK *** 3 locks held by syz-executor.2/13257: #0: ffff88802c1ee370 (btrfs_trans_num_writers){++++}-{0:0}, at: spin_unlock include/linux/spinlock.h:391 [inline] #0: ffff88802c1ee370 (btrfs_trans_num_writers){++++}-{0:0}, at: join_transaction+0xb87/0xe00 fs/btrfs/transaction.c:287 #1: ffff88802c1ee398 (btrfs_trans_num_extwriters){++++}-{0:0}, at: join_transaction+0xbb2/0xe00 fs/btrfs/transaction.c:288 #2: ffff88802a5ab8e8 (btrfs-tree-00){++++}-{3:3}, at: __btrfs_tree_lock+0x3c/0x2a0 fs/btrfs/locking.c:198 stack backtrace: CPU: 0 PID: 13257 Comm: syz-executor.2 Not tainted 6.5.0-rc7-syzkaller-00024-g93f5de5f648d #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 07/26/2023 Call Trace: <TASK> __dump_stack lib/dump_stack.c:88 [inline] dump_stack_lvl+0x1e7/0x2d0 lib/dump_stack.c:106 check_noncircular+0x375/0x4a0 kernel/locking/lockdep.c:2195 check_prev_add kernel/locking/lockdep.c:3142 [inline] check_prevs_add kernel/locking/lockdep.c:3261 [inline] validate_chain kernel/locking/lockdep.c:3876 [inline] __lock_acquire+0x39ff/0x7f70 kernel/locking/lockdep.c:5144 lock_acquire+0x1e3/0x520 kernel/locking/lockdep.c:5761 __mutex_lock_common+0x1d8/0x2530 kernel/locking/mutex.c:603 __mutex_lock kernel/locking/mutex.c:747 [inline] mutex_lock_nested+0x1b/0x20 kernel/locking/mutex.c:799 __btrfs_release_delayed_node+0x9a/0xaa0 fs/btrfs/delayed-inode.c:256 btrfs_release_delayed_node fs/btrfs/delayed-inode.c:281 [inline] __btrfs_run_delayed_items+0x2b5/0x430 fs/btrfs/delayed-inode.c:1156 btrfs_commit_transaction+0x859/0x2ff0 fs/btrfs/transaction.c:2276 btrfs_sync_file+0xf56/0x1330 fs/btrfs/file.c:1988 vfs_fsync_range fs/sync.c:188 [inline] vfs_fsync fs/sync.c:202 [inline] do_fsync fs/sync.c:212 [inline] __do_sys_fsync fs/sync.c:220 [inline] __se_sys_fsync fs/sync.c:218 [inline] __x64_sys_fsync+0x196/0x1e0 fs/sync.c:218 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd RIP: 0033:0x7f3ad047cae9 Code: 28 00 00 00 75 (...) RSP: 002b:00007f3ad12510c8 EFLAGS: 00000246 ORIG_RAX: 000000000000004a RAX: ffffffffffffffda RBX: 00007f3ad059bf80 RCX: 00007f3ad047cae9 RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000005 RBP: 00007f3ad04c847a R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000 R13: 000000000000000b R14: 00007f3ad059bf80 R15: 00007ffe56af92f8 </TASK> ------------[ cut here ]------------ Fix this by releasing the path before releasing the delayed node in the error path at __btrfs_run_delayed_items(). Reported-by: syzbot+a379155f07c134ea9879@syzkaller.appspotmail.com Link: https://lore.kernel.org/linux-btrfs/000000000000abba27060403b5bd@google.com/ CC: stable@vger.kernel.org # 4.14+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-09-23ovl: fix incorrect fdput() on aio completionAmir Goldstein1-6/+3
commit 724768a39374d35b70eaeae8dd87048a2ec7ae8e upstream. ovl_{read,write}_iter() always call fdput(real) to put one or zero refcounts of the real file, but for aio, whether it was submitted or not, ovl_aio_put() also calls fdput(), which is not balanced. This is only a problem in the less common case when FDPUT_FPUT flag is set. To fix the problem use get_file() to take file refcount and use fput() instead of fdput() in ovl_aio_put(). Fixes: 2406a307ac7d ("ovl: implement async IO routines") Cc: <stable@vger.kernel.org> # v5.6 Reviewed-by: Miklos Szeredi <miklos@szeredi.hu> Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-09-23attr: block mode changes of symlinksChristian Brauner1-2/+18
commit 5d1f903f75a80daa4dfb3d84e114ec8ecbf29956 upstream. Changing the mode of symlinks is meaningless as the vfs doesn't take the mode of a symlink into account during path lookup permission checking. However, the vfs doesn't block mode changes on symlinks. This however, has lead to an untenable mess roughly classifiable into the following two categories: (1) Filesystems that don't implement a i_op->setattr() for symlinks. Such filesystems may or may not know that without i_op->setattr() defined, notify_change() falls back to simple_setattr() causing the inode's mode in the inode cache to be changed. That's a generic issue as this will affect all non-size changing inode attributes including ownership changes. Example: afs (2) Filesystems that fail with EOPNOTSUPP but change the mode of the symlink nonetheless. Some filesystems will happily update the mode of a symlink but still return EOPNOTSUPP. This is the biggest source of confusion for userspace. The EOPNOTSUPP in this case comes from POSIX ACLs. Specifically it comes from filesystems that call posix_acl_chmod(), e.g., btrfs via if (!err && attr->ia_valid & ATTR_MODE) err = posix_acl_chmod(idmap, dentry, inode->i_mode); Filesystems including btrfs don't implement i_op->set_acl() so posix_acl_chmod() will report EOPNOTSUPP. When posix_acl_chmod() is called, most filesystems will have finished updating the inode. Perversely, this has the consequences that this behavior may depend on two kconfig options and mount options: * CONFIG_POSIX_ACL={y,n} * CONFIG_${FSTYPE}_POSIX_ACL={y,n} * Opt_acl, Opt_noacl Example: btrfs, ext4, xfs The only way to change the mode on a symlink currently involves abusing an O_PATH file descriptor in the following manner: fd = openat(-1, "/path/to/link", O_CLOEXEC | O_PATH | O_NOFOLLOW); char path[PATH_MAX]; snprintf(path, sizeof(path), "/proc/self/fd/%d", fd); chmod(path, 0000); But for most major filesystems with POSIX ACL support such as btrfs, ext4, ceph, tmpfs, xfs and others this will fail with EOPNOTSUPP with the mode still updated due to the aforementioned posix_acl_chmod() nonsense. So, given that for all major filesystems this would fail with EOPNOTSUPP and that both glibc (cf. [1]) and musl (cf. [2]) outright block mode changes on symlinks we should just try and block mode changes on symlinks directly in the vfs and have a clean break with this nonsense. If this causes any regressions, we do the next best thing and fix up all filesystems that do return EOPNOTSUPP with the mode updated to not call posix_acl_chmod() on symlinks. But as usual, let's try the clean cut solution first. It's a simple patch that can be easily reverted. Not marking this for backport as I'll do that manually if we're reasonably sure that this works and there are no strong objections. We could block this in chmod_common() but it's more appropriate to do it notify_change() as it will also mean that we catch filesystems that change symlink permissions explicitly or accidently. Similar proposals were floated in the past as in [3] and [4] and again recently in [5]. There's also a couple of bugs about this inconsistency as in [6] and [7]. Link: https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/unix/sysv/linux/fchmodat.c;h=99527a3727e44cb8661ee1f743068f108ec93979;hb=HEAD [1] Link: https://git.musl-libc.org/cgit/musl/tree/src/stat/fchmodat.c [2] Link: https://lore.kernel.org/all/20200911065733.GA31579@infradead.org [3] Link: https://sourceware.org/legacy-ml/libc-alpha/2020-02/msg00518.html [4] Link: https://lore.kernel.org/lkml/87lefmbppo.fsf@oldenburg.str.redhat.com [5] Link: https://sourceware.org/legacy-ml/libc-alpha/2020-02/msg00467.html [6] Link: https://sourceware.org/bugzilla/show_bug.cgi?id=14578#c17 [7] Reviewed-by: Aleksa Sarai <cyphar@cyphar.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: stable@vger.kernel.org # please backport to all LTSes but not before v6.6-rc2 is tagged Suggested-by: Christoph Hellwig <hch@lst.de> Suggested-by: Florian Weimer <fweimer@redhat.com> Message-Id: <20230712-vfs-chmod-symlinks-v2-1-08cfb92b61dd@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-09-23btrfs: compare the correct fsid/metadata_uuid in btrfs_validate_superAnand Jain1-5/+3
[ Upstream commit 6bfe3959b0e7a526f5c64747801a8613f002f05a ] The function btrfs_validate_super() should verify the metadata_uuid in the provided superblock argument. Because, all its callers expect it to do that. Such as in the following stacks: write_all_supers() sb = fs_info->super_for_commit; btrfs_validate_write_super(.., sb) btrfs_validate_super(.., sb, ..) scrub_one_super() btrfs_validate_super(.., sb, ..) And check_dev_super() btrfs_validate_super(.., sb, ..) However, it currently verifies the fs_info::super_copy::metadata_uuid instead. Fix this using the correct metadata_uuid in the superblock argument. CC: stable@vger.kernel.org # 5.4+ Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Tested-by: Guilherme G. Piccoli <gpiccoli@igalia.com> Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-09-23btrfs: add a helper to read the superblock metadata_uuidAnand Jain2-0/+9
[ Upstream commit 4844c3664a72d36cc79752cb651c78860b14c240 ] In some cases, we need to read the FSID from the superblock when the metadata_uuid is not set, and otherwise, read the metadata_uuid. So, add a helper. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Tested-by: Guilherme G. Piccoli <gpiccoli@igalia.com> Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Stable-dep-of: 6bfe3959b0e7 ("btrfs: compare the correct fsid/metadata_uuid in btrfs_validate_super") Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-09-23btrfs: move btrfs_pinned_by_swapfile prototype into volumes.hJosef Bacik2-2/+2
[ Upstream commit c2e79e865b87c2920a3cd39de69c35f2bc758a51 ] This is defined in volumes.c, move the prototype into volumes.h. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Stable-dep-of: 6bfe3959b0e7 ("btrfs: compare the correct fsid/metadata_uuid in btrfs_validate_super") Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-09-23jfs: fix invalid free of JFS_IP(ipimap)->i_imap in diUnmountLiu Shixin via Jfs-discussion1-0/+1
[ Upstream commit 6e2bda2c192d0244b5a78b787ef20aa10cb319b7 ] syzbot found an invalid-free in diUnmount: BUG: KASAN: double-free in slab_free mm/slub.c:3661 [inline] BUG: KASAN: double-free in __kmem_cache_free+0x71/0x110 mm/slub.c:3674 Free of addr ffff88806f410000 by task syz-executor131/3632 CPU: 0 PID: 3632 Comm: syz-executor131 Not tainted 6.1.0-rc7-syzkaller-00012-gca57f02295f1 #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/26/2022 Call Trace: <TASK> __dump_stack lib/dump_stack.c:88 [inline] dump_stack_lvl+0x1b1/0x28e lib/dump_stack.c:106 print_address_description+0x74/0x340 mm/kasan/report.c:284 print_report+0x107/0x1f0 mm/kasan/report.c:395 kasan_report_invalid_free+0xac/0xd0 mm/kasan/report.c:460 ____kasan_slab_free+0xfb/0x120 kasan_slab_free include/linux/kasan.h:177 [inline] slab_free_hook mm/slub.c:1724 [inline] slab_free_freelist_hook+0x12e/0x1a0 mm/slub.c:1750 slab_free mm/slub.c:3661 [inline] __kmem_cache_free+0x71/0x110 mm/slub.c:3674 diUnmount+0xef/0x100 fs/jfs/jfs_imap.c:195 jfs_umount+0x108/0x370 fs/jfs/jfs_umount.c:63 jfs_put_super+0x86/0x190 fs/jfs/super.c:194 generic_shutdown_super+0x130/0x310 fs/super.c:492 kill_block_super+0x79/0xd0 fs/super.c:1428 deactivate_locked_super+0xa7/0xf0 fs/super.c:332 cleanup_mnt+0x494/0x520 fs/namespace.c:1186 task_work_run+0x243/0x300 kernel/task_work.c:179 exit_task_work include/linux/task_work.h:38 [inline] do_exit+0x664/0x2070 kernel/exit.c:820 do_group_exit+0x1fd/0x2b0 kernel/exit.c:950 __do_sys_exit_group kernel/exit.c:961 [inline] __se_sys_exit_group kernel/exit.c:959 [inline] __x64_sys_exit_group+0x3b/0x40 kernel/exit.c:959 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x3d/0xb0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd [...] JFS_IP(ipimap)->i_imap is not setting to NULL after free in diUnmount. If jfs_remount() free JFS_IP(ipimap)->i_imap but then failed at diMount(). JFS_IP(ipimap)->i_imap will be freed once again. Fix this problem by setting JFS_IP(ipimap)->i_imap to NULL after free. Reported-by: syzbot+90a11e6b1e810785c6ff@syzkaller.appspotmail.com Signed-off-by: Liu Shixin <liushixin2@huawei.com> Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-09-23fs/jfs: prevent double-free in dbUnmount() after failed jfs_remount()Andrew Kanner1-0/+1
[ Upstream commit cade5397e5461295f3cb87880534b6a07cafa427 ] Syzkaller reported the following issue: ================================================================== BUG: KASAN: double-free in slab_free mm/slub.c:3787 [inline] BUG: KASAN: double-free in __kmem_cache_free+0x71/0x110 mm/slub.c:3800 Free of addr ffff888086408000 by task syz-executor.4/12750 [...] Call Trace: <TASK> [...] kasan_report_invalid_free+0xac/0xd0 mm/kasan/report.c:482 ____kasan_slab_free+0xfb/0x120 kasan_slab_free include/linux/kasan.h:177 [inline] slab_free_hook mm/slub.c:1781 [inline] slab_free_freelist_hook+0x12e/0x1a0 mm/slub.c:1807 slab_free mm/slub.c:3787 [inline] __kmem_cache_free+0x71/0x110 mm/slub.c:3800 dbUnmount+0xf4/0x110 fs/jfs/jfs_dmap.c:264 jfs_umount+0x248/0x3b0 fs/jfs/jfs_umount.c:87 jfs_put_super+0x86/0x190 fs/jfs/super.c:194 generic_shutdown_super+0x130/0x310 fs/super.c:492 kill_block_super+0x79/0xd0 fs/super.c:1386 deactivate_locked_super+0xa7/0xf0 fs/super.c:332 cleanup_mnt+0x494/0x520 fs/namespace.c:1291 task_work_run+0x243/0x300 kernel/task_work.c:179 resume_user_mode_work include/linux/resume_user_mode.h:49 [inline] exit_to_user_mode_loop+0x124/0x150 kernel/entry/common.c:171 exit_to_user_mode_prepare+0xb2/0x140 kernel/entry/common.c:203 __syscall_exit_to_user_mode_work kernel/entry/common.c:285 [inline] syscall_exit_to_user_mode+0x26/0x60 kernel/entry/common.c:296 do_syscall_64+0x49/0xb0 arch/x86/entry/common.c:86 entry_SYSCALL_64_after_hwframe+0x63/0xcd [...] </TASK> Allocated by task 13352: kasan_save_stack mm/kasan/common.c:45 [inline] kasan_set_track+0x3d/0x60 mm/kasan/common.c:52 ____kasan_kmalloc mm/kasan/common.c:371 [inline] __kasan_kmalloc+0x97/0xb0 mm/kasan/common.c:380 kmalloc include/linux/slab.h:580 [inline] dbMount+0x54/0x980 fs/jfs/jfs_dmap.c:164 jfs_mount+0x1dd/0x830 fs/jfs/jfs_mount.c:121 jfs_fill_super+0x590/0xc50 fs/jfs/super.c:556 mount_bdev+0x26c/0x3a0 fs/super.c:1359 legacy_get_tree+0xea/0x180 fs/fs_context.c:610 vfs_get_tree+0x88/0x270 fs/super.c:1489 do_new_mount+0x289/0xad0 fs/namespace.c:3145 do_mount fs/namespace.c:3488 [inline] __do_sys_mount fs/namespace.c:3697 [inline] __se_sys_mount+0x2d3/0x3c0 fs/namespace.c:3674 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x3d/0xb0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd Freed by task 13352: kasan_save_stack mm/kasan/common.c:45 [inline] kasan_set_track+0x3d/0x60 mm/kasan/common.c:52 kasan_save_free_info+0x27/0x40 mm/kasan/generic.c:518 ____kasan_slab_free+0xd6/0x120 mm/kasan/common.c:236 kasan_slab_free include/linux/kasan.h:177 [inline] slab_free_hook mm/slub.c:1781 [inline] slab_free_freelist_hook+0x12e/0x1a0 mm/slub.c:1807 slab_free mm/slub.c:3787 [inline] __kmem_cache_free+0x71/0x110 mm/slub.c:3800 dbUnmount+0xf4/0x110 fs/jfs/jfs_dmap.c:264 jfs_mount_rw+0x545/0x740 fs/jfs/jfs_mount.c:247 jfs_remount+0x3db/0x710 fs/jfs/super.c:454 reconfigure_super+0x3bc/0x7b0 fs/super.c:935 vfs_fsconfig_locked fs/fsopen.c:254 [inline] __do_sys_fsconfig fs/fsopen.c:439 [inline] __se_sys_fsconfig+0xad5/0x1060 fs/fsopen.c:314 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x3d/0xb0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd [...] JFS_SBI(ipbmap->i_sb)->bmap wasn't set to NULL after kfree() in dbUnmount(). Syzkaller uses faultinject to reproduce this KASAN double-free warning. The issue is triggered if either diMount() or dbMount() fail in jfs_remount(), since diUnmount() or dbUnmount() already happened in such a case - they will do double-free on next execution: jfs_umount or jfs_remount. Tested on both upstream and jfs-next by syzkaller. Reported-and-tested-by: syzbot+6a93efb725385bc4b2e9@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/000000000000471f2d05f1ce8bad@google.com/T/ Link: https://syzkaller.appspot.com/bug?extid=6a93efb725385bc4b2e9 Signed-off-by: Andrew Kanner <andrew.kanner@gmail.com> Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-09-23ext2: fix datatype of block number in ext2_xattr_set2()Georg Ottinger1-2/+2
[ Upstream commit e88076348425b7d0491c8c98d8732a7df8de7aa3 ] I run a small server that uses external hard drives for backups. The backup software I use uses ext2 filesystems with 4KiB block size and the server is running SELinux and therefore relies on xattr. I recently upgraded the hard drives from 4TB to 12TB models. I noticed that after transferring some TBs I got a filesystem error "Freeing blocks not in datazone - block = 18446744071529317386, count = 1" and the backup process stopped. Trying to fix the fs with e2fsck resulted in a completely corrupted fs. The error probably came from ext2_free_blocks(), and because of the large number 18e19 this problem immediately looked like some kind of integer overflow. Whereas the 4TB fs was about 1e9 blocks, the new 12TB is about 3e9 blocks. So, searching the ext2 code, I came across the line in fs/ext2/xattr.c:745 where ext2_new_block() is called and the resulting block number is stored in the variable block as an int datatype. If a block with a block number greater than INT32_MAX is returned, this variable overflows and the call to sb_getblk() at line fs/ext2/xattr.c:750 fails, then the call to ext2_free_blocks() produces the error. Signed-off-by: Georg Ottinger <g.ottinger@gmx.at> Signed-off-by: Jan Kara <jack@suse.cz> Message-Id: <20230815100340.22121-1-g.ottinger@gmx.at> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-09-23locks: fix KASAN: use-after-free in trace_event_raw_event_filelock_lockWill Shiu1-1/+1
[ Upstream commit 74f6f5912693ce454384eaeec48705646a21c74f ] As following backtrace, the struct file_lock request , in posix_lock_inode is free before ftrace function using. Replace the ftrace function ahead free flow could fix the use-after-free issue. [name:report&]=============================================== BUG:KASAN: use-after-free in trace_event_raw_event_filelock_lock+0x80/0x12c [name:report&]Read at addr f6ffff8025622620 by task NativeThread/16753 [name:report_hw_tags&]Pointer tag: [f6], memory tag: [fe] [name:report&] BT: Hardware name: MT6897 (DT) Call trace: dump_backtrace+0xf8/0x148 show_stack+0x18/0x24 dump_stack_lvl+0x60/0x7c print_report+0x2c8/0xa08 kasan_report+0xb0/0x120 __do_kernel_fault+0xc8/0x248 do_bad_area+0x30/0xdc do_tag_check_fault+0x1c/0x30 do_mem_abort+0x58/0xbc el1_abort+0x3c/0x5c el1h_64_sync_handler+0x54/0x90 el1h_64_sync+0x68/0x6c trace_event_raw_event_filelock_lock+0x80/0x12c posix_lock_inode+0xd0c/0xd60 do_lock_file_wait+0xb8/0x190 fcntl_setlk+0x2d8/0x440 ... [name:report&] [name:report&]Allocated by task 16752: ... slab_post_alloc_hook+0x74/0x340 kmem_cache_alloc+0x1b0/0x2f0 posix_lock_inode+0xb0/0xd60 ... [name:report&] [name:report&]Freed by task 16752: ... kmem_cache_free+0x274/0x5b0 locks_dispose_list+0x3c/0x148 posix_lock_inode+0xc40/0xd60 do_lock_file_wait+0xb8/0x190 fcntl_setlk+0x2d8/0x440 do_fcntl+0x150/0xc18 ... Signed-off-by: Will Shiu <Will.Shiu@mediatek.com> Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-09-23btrfs: output extra debug info if we failed to find an inline backrefQu Wenruo1-0/+5
[ Upstream commit 7f72f50547b7af4ddf985b07fc56600a4deba281 ] [BUG] Syzbot reported several warning triggered inside lookup_inline_extent_backref(). [CAUSE] As usual, the reproducer doesn't reliably trigger locally here, but at least we know the WARN_ON() is triggered when an inline backref can not be found, and it can only be triggered when @insert is true. (I.e. inserting a new inline backref, which means the backref should already exist) [ENHANCEMENT] After the WARN_ON(), dump all the parameters and the extent tree leaf to help debug. Link: https://syzkaller.appspot.com/bug?extid=d6f9ff86c1d804ba2bc6 Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-09-23autofs: fix memory leak of waitqueues in autofs_catatonic_modeFedor Pchelkin1-1/+2
[ Upstream commit ccbe77f7e45dfb4420f7f531b650c00c6e9c7507 ] Syzkaller reports a memory leak: BUG: memory leak unreferenced object 0xffff88810b279e00 (size 96): comm "syz-executor399", pid 3631, jiffies 4294964921 (age 23.870s) hex dump (first 32 bytes): 00 00 00 00 00 00 00 00 08 9e 27 0b 81 88 ff ff ..........'..... 08 9e 27 0b 81 88 ff ff 00 00 00 00 00 00 00 00 ..'............. backtrace: [<ffffffff814cfc90>] kmalloc_trace+0x20/0x90 mm/slab_common.c:1046 [<ffffffff81bb75ca>] kmalloc include/linux/slab.h:576 [inline] [<ffffffff81bb75ca>] autofs_wait+0x3fa/0x9a0 fs/autofs/waitq.c:378 [<ffffffff81bb88a7>] autofs_do_expire_multi+0xa7/0x3e0 fs/autofs/expire.c:593 [<ffffffff81bb8c33>] autofs_expire_multi+0x53/0x80 fs/autofs/expire.c:619 [<ffffffff81bb6972>] autofs_root_ioctl_unlocked+0x322/0x3b0 fs/autofs/root.c:897 [<ffffffff81bb6a95>] autofs_root_ioctl+0x25/0x30 fs/autofs/root.c:910 [<ffffffff81602a9c>] vfs_ioctl fs/ioctl.c:51 [inline] [<ffffffff81602a9c>] __do_sys_ioctl fs/ioctl.c:870 [inline] [<ffffffff81602a9c>] __se_sys_ioctl fs/ioctl.c:856 [inline] [<ffffffff81602a9c>] __x64_sys_ioctl+0xfc/0x140 fs/ioctl.c:856 [<ffffffff84608225>] do_syscall_x64 arch/x86/entry/common.c:50 [inline] [<ffffffff84608225>] do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80 [<ffffffff84800087>] entry_SYSCALL_64_after_hwframe+0x63/0xcd autofs_wait_queue structs should be freed if their wait_ctr becomes zero. Otherwise they will be lost. In this case an AUTOFS_IOC_EXPIRE_MULTI ioctl is done, then a new waitqueue struct is allocated in autofs_wait(), its initial wait_ctr equals 2. After that wait_event_killable() is interrupted (it returns -ERESTARTSYS), so that 'wq->name.name == NULL' condition may be not satisfied. Actually, this condition can be satisfied when autofs_wait_release() or autofs_catatonic_mode() is called and, what is also important, wait_ctr is decremented in those places. Upon the exit of autofs_wait(), wait_ctr is decremented to 1. Then the unmounting process begins: kill_sb calls autofs_catatonic_mode(), which should have freed the waitqueues, but it only decrements its usage counter to zero which is not a correct behaviour. edit:imk This description is of course not correct. The umount performed as a result of an expire is a umount of a mount that has been automounted, it's not the autofs mount itself. They happen independently, usually after everything mounted within the autofs file system has been expired away. If everything hasn't been expired away the automount daemon can still exit leaving mounts in place. But expires done in both cases will result in a notification that calls autofs_wait_release() with a result status. The problem case is the summary execution of of the automount daemon. In this case any waiting processes won't be woken up until either they are terminated or the mount is umounted. end edit: imk So in catatonic mode we should free waitqueues which counter becomes zero. edit: imk Initially I was concerned that the calling of autofs_wait_release() and autofs_catatonic_mode() was not mutually exclusive but that can't be the case (obviously) because the queue entry (or entries) is removed from the list when either of these two functions are called. Consequently the wait entry will be freed by only one of these functions or by the woken process in autofs_wait() depending on the order of the calls. end edit: imk Reported-by: syzbot+5e53f70e69ff0c0a1c0c@syzkaller.appspotmail.com Suggested-by: Takeshi Misawa <jeliantsurux@gmail.com> Signed-off-by: Fedor Pchelkin <pchelkin@ispras.ru> Signed-off-by: Alexey Khoroshilov <khoroshilov@ispras.ru> Signed-off-by: Ian Kent <raven@themaw.net> Cc: Matthew Wilcox <willy@infradead.org> Cc: Andrei Vagin <avagin@gmail.com> Cc: autofs@vger.kernel.org Cc: linux-kernel@vger.kernel.org Message-Id: <169112719161.7590.6700123246297365841.stgit@donald.themaw.net> Signed-off-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-09-21Revert "configfs: fix a race in configfs_lookup()"Greg Kroah-Hartman1-2/+0
This reverts commit c709c7ca020a52ef6530733a9a5f014bd6b71937 which is commit c42dd069be8dfc9b2239a5c89e73bbd08ab35de0 upstream. It was backported incorrectly, and will come back later in a way that actually works. Cc: Sishuai Gong <sishuai@purdue.edu> Cc: Christoph Hellwig <hch@lst.de> Cc: Kyle Zeng <zengyhkyle@gmail.com> Reported-by: Guenter Roeck <linux@roeck-us.net> Link: https://lore.kernel.org/r/27f978f3-11ca-6054-5160-ab66d4c918bc@roeck-us.net Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-09-19btrfs: use the correct superblock to compare fsid in btrfs_validate_superAnand Jain1-3/+2
commit d167aa76dc0683828588c25767da07fb549e4f48 upstream. The function btrfs_validate_super() should verify the fsid in the provided superblock argument. Because, all its callers expect it to do that. Such as in the following stack: write_all_supers() sb = fs_info->super_for_commit; btrfs_validate_write_super(.., sb) btrfs_validate_super(.., sb, ..) scrub_one_super() btrfs_validate_super(.., sb, ..) And check_dev_super() btrfs_validate_super(.., sb, ..) However, it currently verifies the fs_info::super_copy::fsid instead, which is not correct. Fix this using the correct fsid in the superblock argument. CC: stable@vger.kernel.org # 5.4+ Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Tested-by: Guilherme G. Piccoli <gpiccoli@igalia.com> Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-09-19btrfs: don't start transaction when joining with TRANS_JOIN_NOSTARTFilipe Manana1-3/+4
commit 4490e803e1fe9fab8db5025e44e23b55df54078b upstream. When joining a transaction with TRANS_JOIN_NOSTART, if we don't find a running transaction we end up creating one. This goes against the purpose of TRANS_JOIN_NOSTART which is to join a running transaction if its state is at or below the state TRANS_STATE_COMMIT_START, otherwise return an -ENOENT error and don't start a new transaction. So fix this to not create a new transaction if there's no running transaction at or below that state. CC: stable@vger.kernel.org # 4.14+ Fixes: a6d155d2e363 ("Btrfs: fix deadlock between fiemap and transaction commits") Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-09-19fuse: nlookup missing decrement in fuse_direntplus_linkruanmeisi1-1/+9
commit b8bd342d50cbf606666488488f9fea374aceb2d5 upstream. During our debugging of glusterfs, we found an Assertion failed error: inode_lookup >= nlookup, which was caused by the nlookup value in the kernel being greater than that in the FUSE file system. The issue was introduced by fuse_direntplus_link, where in the function, fuse_iget increments nlookup, and if d_splice_alias returns failure, fuse_direntplus_link returns failure without decrementing nlookup https://github.com/gluster/glusterfs/pull/4081 Signed-off-by: ruanmeisi <ruan.meisi@zte.com.cn> Fixes: 0b05b18381ee ("fuse: implement NFS-like readdirplus support") Cc: <stable@vger.kernel.org> # v3.9 Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-09-19ext4: add correct group descriptors and reserved GDT blocks to system zoneWang Jianjian3-8/+17
commit 68228da51c9a436872a4ef4b5a7692e29f7e5bc7 upstream. When setup_system_zone, flex_bg is not initialized so it is always 1. Use a new helper function, ext4_num_base_meta_blocks() which does not depend on sbi->s_log_groups_per_flex being initialized. [ Squashed two patches in the Link URL's below together into a single commit, which is simpler to review/understand. Also fix checkpatch warnings. --TYT ] Cc: stable@kernel.org Signed-off-by: Wang Jianjian <wangjianjian0@foxmail.com> Link: https://lore.kernel.org/r/tencent_21AF0D446A9916ED5C51492CC6C9A0A77B05@qq.com Link: https://lore.kernel.org/r/tencent_D744D1450CC169AEA77FCF0A64719909ED05@qq.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-09-19NFSv4/pnfs: minor fix for cleanup path in nfs4_get_device_infoFedor Pchelkin1-1/+1
commit 96562c45af5c31b89a197af28f79bfa838fb8391 upstream. It is an almost improbable error case but when page allocating loop in nfs4_get_device_info() fails then we should only free the already allocated pages, as __free_page() can't deal with NULL arguments. Found by Linux Verification Center (linuxtesting.org). Cc: stable@vger.kernel.org Signed-off-by: Fedor Pchelkin <pchelkin@ispras.ru> Reviewed-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>