summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)AuthorFilesLines
2023-10-10btrfs: file_remove_privs needs an exclusive lock in direct io writeBernd Schubert1-2/+14
commit 9af86694fd5d387992699ec99007ed374966ce9a upstream. This was noticed by Miklos that file_remove_privs might call into notify_change(), which requires to hold an exclusive lock. The problem exists in FUSE and btrfs. We can fix it without any additional helpers from VFS, in case the privileges would need to be dropped, change the lock type to be exclusive and redo the loop. Fixes: e9adabb9712e ("btrfs: use shared lock for direct writes within EOF") CC: Miklos Szeredi <miklos@szeredi.hu> CC: stable@vger.kernel.org # 5.15+ Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Bernd Schubert <bschubert@ddn.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-10-10btrfs: fix fscrypt name leak after failure to join log transactionFilipe Manana1-1/+4
commit fee4c19937439693f2420a916169d08e88576e8e upstream. When logging a new name, we don't expect to fail joining a log transaction since we know at least one of the inodes was logged before in the current transaction. However if we fail for some unexpected reason, we end up not freeing the fscrypt name we previously allocated. So fix that by freeing the name in case we failed to join a log transaction. Fixes: ab3c5c18e8fa ("btrfs: setup qstr from dentrys using fscrypt helper") Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-10-10btrfs: fix an error handling path in btrfs_rename()Christophe JAILLET1-2/+4
commit abe3bf7425fb695a9b37394af18b9ea58a800802 upstream. If new_whiteout_inode() fails, some resources need to be freed. Add the missing goto to the error handling path. Fixes: ab3c5c18e8fa ("btrfs: setup qstr from dentrys using fscrypt helper") Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-10-10ksmbd: fix uaf in smb20_oplock_break_ackluosili1-2/+2
commit c69813471a1ec081a0b9bf0c6bd7e8afd818afce upstream. drop reference after use opinfo. Signed-off-by: luosili <rootlab@huawei.com> Signed-off-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-10-10ksmbd: fix race condition between session lookup and expireNamjae Jeon3-3/+10
commit 53ff5cf89142b978b1a5ca8dc4d4425e6a09745f upstream. Thread A + Thread B ksmbd_session_lookup | smb2_sess_setup sess = xa_load | | | xa_erase(&conn->sessions, sess->id); | | ksmbd_session_destroy(sess) --> kfree(sess) | // UAF! | sess->last_active = jiffies | + This patch add rwsem to fix race condition between ksmbd_session_lookup and ksmbd_expire_session. Reported-by: luosili <rootlab@huawei.com> Signed-off-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-10-10smb: use kernel_connect() and kernel_bind()Jordan Rife1-5/+5
commit cedc019b9f260facfadd20c6c490e403abf292e3 upstream. Recent changes to kernel_connect() and kernel_bind() ensure that callers are insulated from changes to the address parameter made by BPF SOCK_ADDR hooks. This patch wraps direct calls to ops->connect() and ops->bind() with kernel_connect() and kernel_bind() to ensure that SMB mounts do not see their mount address overwritten in such cases. Link: https://lore.kernel.org/netdev/9944248dba1bce861375fcce9de663934d933ba9.camel@redhat.com/ Cc: <stable@vger.kernel.org> # 6.0+ Signed-off-by: Jordan Rife <jrife@google.com> Acked-by: Paulo Alcantara (SUSE) <pc@manguebit.com> Signed-off-by: Steve French <stfrench@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-10-10NFSv4: Fix a nfs4_state_manager() raceTrond Myklebust1-0/+7
[ Upstream commit ed1cc05aa1f7fe8197d300e914afc28ab9818f89 ] If the NFS4CLNT_RUN_MANAGER flag got set just before we cleared NFS4CLNT_MANAGER_RUNNING, then we might have won the race against nfs4_schedule_state_manager(), and are responsible for handling the recovery situation. Fixes: aeabb3c96186 ("NFSv4: Fix a NFSv4 state manager deadlock") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-10-10erofs: fix memory leak of LZMA global compressed deduplicationGao Xiang1-1/+4
[ Upstream commit 75a5221630fe5aa3fedba7a06be618db0f79ba1e ] When stressing microLZMA EROFS images with the new global compressed deduplication feature enabled (`-Ededupe`), I found some short-lived temporary pages weren't properly released, which could slowly cause unexpected OOMs hours later. Let's fix it now (LZ4 and DEFLATE don't have this issue.) Fixes: 5c2a64252c5d ("erofs: introduce partial-referenced pclusters") Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20230907050542.97152-1-hsiangkao@linux.alibaba.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-10-10Revert "NFSv4: Retry LOCK on OLD_STATEID during delegation return"Benjamin Coddington1-4/+2
[ Upstream commit 5b4a82a0724af1dfd1320826e0266117b6a57fbd ] Olga Kornievskaia reports that this patch breaks NFSv4.0 state recovery. It also introduces additional complexity in the error paths for cases not related to the original problem. Let's revert it for now, and address the original problem in another manner. This reverts commit f5ea16137a3fa2858620dc9084466491c128535f. Fixes: f5ea16137a3f ("NFSv4: Retry LOCK on OLD_STATEID during delegation return") Reported-by: Kornievskaia, Olga <Olga.Kornievskaia@netapp.com> Signed-off-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-10-10btrfs: use struct fscrypt_str instead of struct qstrSweet Tea Dorminy12-118/+95
[ Upstream commit 6db75318823a169e836a478ca57d6a7c0a156b77 ] While struct qstr is more natural without fscrypt, since it's provided by dentries, struct fscrypt_str is provided by the fscrypt handlers processing dentries, and is thus more natural in the fscrypt world. Replace all of the struct qstr uses with struct fscrypt_str. Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Stable-dep-of: 9af86694fd5d ("btrfs: file_remove_privs needs an exclusive lock in direct io write") Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-10-10btrfs: setup qstr from dentrys using fscrypt helperSweet Tea Dorminy4-57/+189
[ Upstream commit ab3c5c18e8fa3f8ea116016095d25adab466cd39 ] Most places where we get a struct qstr, we are doing so from a dentry. With fscrypt, the dentry's name may be encrypted on-disk, so fscrypt provides a helper to convert a dentry name to the appropriate disk name if necessary. Convert each of the dentry name accesses to use fscrypt_setup_filename(), then convert the resulting fscrypt_name back to an unencrypted qstr. This does not work for nokey names, but the specific locations that could spawn nokey names are noted. At present, since there are no encrypted directories, nothing goes down the filename encryption paths. Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Stable-dep-of: 9af86694fd5d ("btrfs: file_remove_privs needs an exclusive lock in direct io write") Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-10-10btrfs: use struct qstr instead of name and namelen pairsSweet Tea Dorminy12-335/+287
[ Upstream commit e43eec81c5167b655b72c781b0e75e62a05e415e ] Many functions throughout btrfs take name buffer and name length arguments. Most of these functions at the highest level are usually called with these arguments extracted from a supplied dentry's name. But the entire name can be passed instead, making each function a little more elegant. Each function whose arguments are currently the name and length extracted from a dentry is herein converted to instead take a pointer to the name in the dentry. The couple of calls to these calls without a struct dentry are converted to create an appropriate qstr to pass in. Additionally, every function which is only called with a name/len extracted directly from a qstr is also converted. This change has positive effect on stack consumption, frame of many functions is reduced but this will be used in the future for fscrypt related structures. Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Stable-dep-of: 9af86694fd5d ("btrfs: file_remove_privs needs an exclusive lock in direct io write") Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-10-10NFSv4: Fix a state manager thread deadlock regressionTrond Myklebust2-12/+28
[ Upstream commit 956fd46f97d238032cb5fa4771cdaccc6e760f9a ] Commit 4dc73c679114 reintroduces the deadlock that was fixed by commit aeabb3c96186 ("NFSv4: Fix a NFSv4 state manager deadlock") because it prevents the setup of new threads to handle reboot recovery, while the older recovery thread is stuck returning delegations. Fixes: 4dc73c679114 ("NFSv4: keep state manager thread active if swap is enabled") Cc: stable@vger.kernel.org Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-10-10NFS: rename nfs_client_kset to nfs_ksetBenjamin Coddington1-8/+8
[ Upstream commit 8b18a2edecc0741b0eecf8b18fdb356a0f8682de ] Be brief and match the subsystem name. There's no need to distinguish this kset variable from the server. Signed-off-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Stable-dep-of: 956fd46f97d2 ("NFSv4: Fix a state manager thread deadlock regression") Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-10-10NFS: Cleanup unused rpc_clnt variableBenjamin Coddington1-4/+0
[ Upstream commit e025f0a73f6acb920d86549b2177a5883535421d ] The root rpc_clnt is not used here, clean it up. Fixes: 4dc73c679114 ("NFSv4: keep state manager thread active if swap is enabled") Signed-off-by: Benjamin Coddington <bcodding@redhat.com> Reviewed-by: NeilBrown <neilb@suse.de> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Stable-dep-of: 956fd46f97d2 ("NFSv4: Fix a state manager thread deadlock regression") Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-10-06fs: binfmt_elf_efpic: fix personality for ELF-FDPICGreg Ungerer1-3/+2
commit 7c3151585730b7095287be8162b846d31e6eee61 upstream. The elf-fdpic loader hard sets the process personality to either PER_LINUX_FDPIC for true elf-fdpic binaries or to PER_LINUX for normal ELF binaries (in this case they would be constant displacement compiled with -pie for example). The problem with that is that it will lose any other bits that may be in the ELF header personality (such as the "bug emulation" bits). On the ARM architecture the ADDR_LIMIT_32BIT flag is used to signify a normal 32bit binary - as opposed to a legacy 26bit address binary. This matters since start_thread() will set the ARM CPSR register as required based on this flag. If the elf-fdpic loader loses this bit the process will be mis-configured and crash out pretty quickly. Modify elf-fdpic loader personality setting so that it preserves the upper three bytes by using the SET_PERSONALITY macro to set it. This macro in the generic case sets PER_LINUX and preserves the upper bytes. Architectures can override this for their specific use case, and ARM does exactly this. The problem shows up quite easily running under qemu using the ARM architecture, but not necessarily on all types of real ARM hardware. If the underlying ARM processor does not support the legacy 26-bit addressing mode then everything will work as expected. Link: https://lkml.kernel.org/r/20230907011808.2985083-1-gerg@kernel.org Fixes: 1bde925d23547 ("fs/binfmt_elf_fdpic.c: provide NOMMU loader for regular ELF binaries") Signed-off-by: Greg Ungerer <gerg@kernel.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Greg Ungerer <gerg@kernel.org> Cc: Kees Cook <keescook@chromium.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-10-06fs/smb/client: Reset password pointer to NULLQuang Le1-0/+1
commit e6e43b8aa7cd3c3af686caf0c2e11819a886d705 upstream. Forget to reset ctx->password to NULL will lead to bug like double free Cc: stable@vger.kernel.org Cc: Willy Tarreau <w@1wt.eu> Reviewed-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Quang Le <quanglex97@gmail.com> Signed-off-by: Steve French <stfrench@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-10-06btrfs: properly report 0 avail for very full file systemsJosef Bacik1-1/+1
commit 58bfe2ccec5f9f137b41dd38f335290dcc13cd5c upstream. A user reported some issues with smaller file systems that get very full. While investigating this issue I noticed that df wasn't showing 100% full, despite having 0 chunk space and having < 1MiB of available metadata space. This turns out to be an overflow issue, we're doing: total_available_metadata_space - SZ_4M < global_block_rsv_size to determine if there's not enough space to make metadata allocations, which overflows if total_available_metadata_space is < 4M. Fix this by checking to see if our available space is greater than the 4M threshold. This makes df properly report 100% usage on the file system. CC: stable@vger.kernel.org # 4.14+ Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-10-06nilfs2: fix potential use after free in nilfs_gccache_submit_read_data()Pan Bian1-3/+3
commit 7ee29facd8a9c5a26079148e36bcf07141b3a6bc upstream. In nilfs_gccache_submit_read_data(), brelse(bh) is called to drop the reference count of bh when the call to nilfs_dat_translate() fails. If the reference count hits 0 and its owner page gets unlocked, bh may be freed. However, bh->b_page is dereferenced to put the page after that, which may result in a use-after-free bug. This patch moves the release operation after unlocking and putting the page. NOTE: The function in question is only called in GC, and in combination with current userland tools, address translation using DAT does not occur in that function, so the code path that causes this issue will not be executed. However, it is possible to run that code path by intentionally modifying the userland GC library or by calling the GC ioctl directly. [konishi.ryusuke@gmail.com: NOTE added to the commit log] Link: https://lkml.kernel.org/r/1543201709-53191-1-git-send-email-bianpan2016@163.com Link: https://lkml.kernel.org/r/20230921141731.10073-1-konishi.ryusuke@gmail.com Fixes: a3d93f709e89 ("nilfs2: block cache for garbage collection") Signed-off-by: Pan Bian <bianpan2016@163.com> Reported-by: Ferry Meng <mengferry@linux.alibaba.com> Closes: https://lkml.kernel.org/r/20230818092022.111054-1-mengferry@linux.alibaba.com Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Tested-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-10-06NFSv4.1: fix zero value filehandle in post open getattrOlga Kornievskaia1-1/+5
[ Upstream commit 4506f23e117161a20104c8fa04f33e1ca63c26af ] Currently, if the OPEN compound experiencing an error and needs to get the file attributes separately, it will send a stand alone GETATTR but it would use the filehandle from the results of the OPEN compound. In case of the CLAIM_FH OPEN, nfs_openres's fh is zero value. That generate a GETATTR that's sent with a zero value filehandle, and results in the server returning an error. Instead, for the CLAIM_FH OPEN, take the filehandle that was used in the PUTFH of the OPEN compound. Signed-off-by: Olga Kornievskaia <kolga@netapp.com> Reviewed-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-10-06smb3: correct places where ENOTSUPP is used instead of preferred EOPNOTSUPPSteve French2-4/+4
[ Upstream commit ebc3d4e44a7e05457825e03d0560153687265523 ] checkpatch flagged a few places with: WARNING: ENOTSUPP is not a SUSV4 error code, prefer EOPNOTSUPP Also fixed minor typo Signed-off-by: Steve French <stfrench@microsoft.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-10-06btrfs: assert delayed node locked when removing delayed itemFilipe Manana1-4/+8
[ Upstream commit a57c2d4e46f519b24558ae0752c17eec416ac72a ] When removing a delayed item, or releasing which will remove it as well, we will modify one of the delayed node's rbtrees and item counter if the delayed item is in one of the rbtrees. This require having the delayed node's mutex locked, otherwise we will race with other tasks modifying the rbtrees and the counter. This is motivated by a previous version of another patch actually calling btrfs_release_delayed_item() after unlocking the delayed node's mutex and against a delayed item that is in a rbtree. So assert at __btrfs_remove_delayed_item() that the delayed node's mutex is locked. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-10-06ceph: drop messages from MDS when unmountingXiubo Li7-22/+109
[ Upstream commit e3dfcab2080dc1f9a4b09cc1327361bc2845bfcd ] When unmounting all the dirty buffers will be flushed and after the last osd request is finished the last reference of the i_count will be released. Then it will flush the dirty cap/snap to MDSs, and the unmounting won't wait the possible acks, which will ihold the inodes when updating the metadata locally but makes no sense any more, of this. This will make the evict_inodes() to skip these inodes. If encrypt is enabled the kernel generate a warning when removing the encrypt keys when the skipped inodes still hold the keyring: WARNING: CPU: 4 PID: 168846 at fs/crypto/keyring.c:242 fscrypt_destroy_keyring+0x7e/0xd0 CPU: 4 PID: 168846 Comm: umount Tainted: G S 6.1.0-rc5-ceph-g72ead199864c #1 Hardware name: Supermicro SYS-5018R-WR/X10SRW-F, BIOS 2.0 12/17/2015 RIP: 0010:fscrypt_destroy_keyring+0x7e/0xd0 RSP: 0018:ffffc9000b277e28 EFLAGS: 00010202 RAX: 0000000000000002 RBX: ffff88810d52ac00 RCX: ffff88810b56aa00 RDX: 0000000080000000 RSI: ffffffff822f3a09 RDI: ffff888108f59000 RBP: ffff8881d394fb88 R08: 0000000000000028 R09: 0000000000000000 R10: 0000000000000001 R11: 11ff4fe6834fcd91 R12: ffff8881d394fc40 R13: ffff888108f59000 R14: ffff8881d394f800 R15: 0000000000000000 FS: 00007fd83f6f1080(0000) GS:ffff88885fd00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f918d417000 CR3: 000000017f89a005 CR4: 00000000003706e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> generic_shutdown_super+0x47/0x120 kill_anon_super+0x14/0x30 ceph_kill_sb+0x36/0x90 [ceph] deactivate_locked_super+0x29/0x60 cleanup_mnt+0xb8/0x140 task_work_run+0x67/0xb0 exit_to_user_mode_prepare+0x23d/0x240 syscall_exit_to_user_mode+0x25/0x60 do_syscall_64+0x40/0x80 entry_SYSCALL_64_after_hwframe+0x63/0xcd RIP: 0033:0x7fd83dc39e9b Later the kernel will crash when iput() the inodes and dereferencing the "sb->s_master_keys", which has been released by the generic_shutdown_super(). Link: https://tracker.ceph.com/issues/59162 Signed-off-by: Xiubo Li <xiubli@redhat.com> Reviewed-and-tested-by: Luís Henriques <lhenriques@suse.de> Reviewed-by: Milind Changire <mchangir@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-10-06btrfs: reset destination buffer when read_extent_buffer() gets invalid rangeQu Wenruo1-1/+7
[ Upstream commit 74ee79142c0a344d4eae2eb7012ebc4e82254109 ] Commit f98b6215d7d1 ("btrfs: extent_io: do extra check for extent buffer read write functions") changed how we handle invalid extent buffer range for read_extent_buffer(). Previously if the range is invalid we just set the destination to zero, but after the patch we do nothing and error out. This can lead to smatch static checker errors like: fs/btrfs/print-tree.c:186 print_uuid_item() error: uninitialized symbol 'subvol_id'. fs/btrfs/tests/extent-io-tests.c:338 check_eb_bitmap() error: uninitialized symbol 'has'. fs/btrfs/tests/extent-io-tests.c:353 check_eb_bitmap() error: uninitialized symbol 'has'. fs/btrfs/uuid-tree.c:203 btrfs_uuid_tree_remove() error: uninitialized symbol 'read_subid'. fs/btrfs/uuid-tree.c:353 btrfs_uuid_tree_iterate() error: uninitialized symbol 'subid_le'. fs/btrfs/uuid-tree.c:72 btrfs_uuid_tree_lookup() error: uninitialized symbol 'data'. fs/btrfs/volumes.c:7415 btrfs_dev_stats_value() error: uninitialized symbol 'val'. Fix those warnings by reverting back to the old memset() behavior. By this we keep the static checker happy and would still make a lot of noise when such invalid ranges are passed in. Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Fixes: f98b6215d7d1 ("btrfs: extent_io: do extra check for extent buffer read write functions") Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-10-06f2fs: get out of a repeat loop when getting a locked data pageJaegeuk Kim1-6/+2
[ Upstream commit d2d9bb3b6d2fbccb5b33d3a85a2830971625a4ea ] https://bugzilla.kernel.org/show_bug.cgi?id=216050 Somehow we're getting a page which has a different mapping. Let's avoid the infinite loop. Cc: <stable@vger.kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-10-06f2fs: optimize iteration over sparse directoriesChao Yu4-19/+41
[ Upstream commit 59237a21776f70ffb0420611c23e7158e1317037 ] Wei Chen reports a kernel bug as blew: INFO: task syz-executor.0:29056 blocked for more than 143 seconds. Not tainted 5.15.0-rc5 #1 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. task:syz-executor.0 state:D stack:14632 pid:29056 ppid: 6574 flags:0x00000004 Call Trace: __schedule+0x4a1/0x1720 schedule+0x36/0xe0 rwsem_down_write_slowpath+0x322/0x7a0 fscrypt_ioctl_set_policy+0x11f/0x2a0 __f2fs_ioctl+0x1a9f/0x5780 f2fs_ioctl+0x89/0x3a0 __x64_sys_ioctl+0xe8/0x140 do_syscall_64+0x34/0xb0 entry_SYSCALL_64_after_hwframe+0x44/0xae Eric did some investigation on this issue, quoted from reply of Eric: "Well, the quality of this bug report has a lot to be desired (not on upstream kernel, reproducer is full of totally irrelevant stuff, not sent to the mailing list of the filesystem whose disk image is being fuzzed, etc.). But what is going on is that f2fs_empty_dir() doesn't consider the case of a directory with an extremely large i_size on a malicious disk image. Specifically, the reproducer mounts an f2fs image with a directory that has an i_size of 14814520042850357248, then calls FS_IOC_SET_ENCRYPTION_POLICY on it. That results in a call to f2fs_empty_dir() to check whether the directory is empty. f2fs_empty_dir() then iterates through all 3616826182336513 blocks the directory allegedly contains to check whether any contain anything. i_rwsem is held during this, so anything else that tries to take it will hang." In order to solve this issue, let's use f2fs_get_next_page_offset() to speed up iteration by skipping holes for all below functions: - f2fs_empty_dir - f2fs_readdir - find_in_level The way why we can speed up iteration was described in 'commit 3cf4574705b4 ("f2fs: introduce get_next_page_offset to speed up SEEK_DATA")'. Meanwhile, in f2fs_empty_dir(), let's use f2fs_find_data_page() instead f2fs_get_lock_data_page(), due to i_rwsem was held in caller of f2fs_empty_dir(), there shouldn't be any races, so it's fine to not lock dentry page during lookuping dirents in the page. Link: https://lore.kernel.org/lkml/536944df-a0ae-1dd8-148f-510b476e1347@kernel.org/T/ Reported-by: Wei Chen <harperchen1110@gmail.com> Cc: Eric Biggers <ebiggers@google.com> Signed-off-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org> Stable-dep-of: d2d9bb3b6d2f ("f2fs: get out of a repeat loop when getting a locked data page") Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-10-06cifs: Fix UAF in cifs_demultiplex_thread()Zhang Xiaoxu2-11/+24
[ Upstream commit d527f51331cace562393a8038d870b3e9916686f ] There is a UAF when xfstests on cifs: BUG: KASAN: use-after-free in smb2_is_network_name_deleted+0x27/0x160 Read of size 4 at addr ffff88810103fc08 by task cifsd/923 CPU: 1 PID: 923 Comm: cifsd Not tainted 6.1.0-rc4+ #45 ... Call Trace: <TASK> dump_stack_lvl+0x34/0x44 print_report+0x171/0x472 kasan_report+0xad/0x130 kasan_check_range+0x145/0x1a0 smb2_is_network_name_deleted+0x27/0x160 cifs_demultiplex_thread.cold+0x172/0x5a4 kthread+0x165/0x1a0 ret_from_fork+0x1f/0x30 </TASK> Allocated by task 923: kasan_save_stack+0x1e/0x40 kasan_set_track+0x21/0x30 __kasan_slab_alloc+0x54/0x60 kmem_cache_alloc+0x147/0x320 mempool_alloc+0xe1/0x260 cifs_small_buf_get+0x24/0x60 allocate_buffers+0xa1/0x1c0 cifs_demultiplex_thread+0x199/0x10d0 kthread+0x165/0x1a0 ret_from_fork+0x1f/0x30 Freed by task 921: kasan_save_stack+0x1e/0x40 kasan_set_track+0x21/0x30 kasan_save_free_info+0x2a/0x40 ____kasan_slab_free+0x143/0x1b0 kmem_cache_free+0xe3/0x4d0 cifs_small_buf_release+0x29/0x90 SMB2_negotiate+0x8b7/0x1c60 smb2_negotiate+0x51/0x70 cifs_negotiate_protocol+0xf0/0x160 cifs_get_smb_ses+0x5fa/0x13c0 mount_get_conns+0x7a/0x750 cifs_mount+0x103/0xd00 cifs_smb3_do_mount+0x1dd/0xcb0 smb3_get_tree+0x1d5/0x300 vfs_get_tree+0x41/0xf0 path_mount+0x9b3/0xdd0 __x64_sys_mount+0x190/0x1d0 do_syscall_64+0x35/0x80 entry_SYSCALL_64_after_hwframe+0x46/0xb0 The UAF is because: mount(pid: 921) | cifsd(pid: 923) -------------------------------|------------------------------- | cifs_demultiplex_thread SMB2_negotiate | cifs_send_recv | compound_send_recv | smb_send_rqst | wait_for_response | wait_event_state [1] | | standard_receive3 | cifs_handle_standard | handle_mid | mid->resp_buf = buf; [2] | dequeue_mid [3] KILL the process [4] | resp_iov[i].iov_base = buf | free_rsp_buf [5] | | is_network_name_deleted [6] | callback 1. After send request to server, wait the response until mid->mid_state != SUBMITTED; 2. Receive response from server, and set it to mid; 3. Set the mid state to RECEIVED; 4. Kill the process, the mid state already RECEIVED, get 0; 5. Handle and release the negotiate response; 6. UAF. It can be easily reproduce with add some delay in [3] - [6]. Only sync call has the problem since async call's callback is executed in cifsd process. Add an extra state to mark the mid state to READY before wakeup the waitter, then it can get the resp safely. Fixes: ec637e3ffb6b ("[CIFS] Avoid extra large buffer allocation (and memcpy) in cifs_readpages") Reviewed-by: Paulo Alcantara (SUSE) <pc@manguebit.com> Signed-off-by: Zhang Xiaoxu <zhangxiaoxu5@huawei.com> Signed-off-by: Steve French <stfrench@microsoft.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-10-06proc: nommu: fix empty /proc/<pid>/mapsBen Wolsieffer2-17/+22
[ Upstream commit fe4419801617514765974f3e796269bc512ad146 ] On no-MMU, /proc/<pid>/maps reads as an empty file. This happens because find_vma(mm, 0) always returns NULL (assuming no vma actually contains the zero address, which is normally the case). To fix this bug and improve the maintainability in the future, this patch makes the no-MMU implementation as similar as possible to the MMU implementation. The only remaining differences are the lack of hold/release_task_mempolicy and the extra code to shoehorn the gate vma into the iterator. This has been tested on top of 6.5.3 on an STM32F746. Link: https://lkml.kernel.org/r/20230915160055.971059-2-ben.wolsieffer@hefring.com Fixes: 0c563f148043 ("proc: remove VMA rbtree use from nommu") Signed-off-by: Ben Wolsieffer <ben.wolsieffer@hefring.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Giulio Benetti <giulio.benetti@benettiengineering.com> Cc: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-10-06proc: nommu: /proc/<pid>/maps: release mmap read lockBen Wolsieffer1-12/+15
[ Upstream commit 578d7699e5c2add8c2e9549d9d75dfb56c460cb3 ] The no-MMU implementation of /proc/<pid>/map doesn't normally release the mmap read lock, because it uses !IS_ERR_OR_NULL(_vml) to determine whether to release the lock. Since _vml is NULL when the end of the mappings is reached, the lock is not released. Reading /proc/1/maps twice doesn't cause a hang because it only takes the read lock, which can be taken multiple times and therefore doesn't show any problem if the lock isn't released. Instead, you need to perform some operation that attempts to take the write lock after reading /proc/<pid>/maps. To actually reproduce the bug, compile the following code as 'proc_maps_bug': #include <stdio.h> #include <unistd.h> #include <sys/mman.h> int main(int argc, char *argv[]) { void *buf; sleep(1); buf = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0); puts("mmap returned"); return 0; } Then, run: ./proc_maps_bug &; cat /proc/$!/maps; fg Without this patch, mmap() will hang and the command will never complete. This code was incorrectly adapted from the MMU implementation, which at the time released the lock in m_next() before returning the last entry. The MMU implementation has diverged further from the no-MMU version since then, so this patch brings their locking and error handling into sync, fixing the bug and hopefully avoiding similar issues in the future. Link: https://lkml.kernel.org/r/20230914163019.4050530-2-ben.wolsieffer@hefring.com Fixes: 47fecca15c09 ("fs/proc/task_nommu.c: don't use priv->task->mm") Signed-off-by: Ben Wolsieffer <ben.wolsieffer@hefring.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Cc: Giulio Benetti <giulio.benetti@benettiengineering.com> Cc: Greg Ungerer <gerg@uclinux.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Stable-dep-of: fe4419801617 ("proc: nommu: fix empty /proc/<pid>/maps") Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-10-06ext4: do not let fstrim block system suspendJan Kara1-2/+10
[ Upstream commit 5229a658f6453362fbb9da6bf96872ef25a7097e ] Len Brown has reported that system suspend sometimes fail due to inability to freeze a task working in ext4_trim_fs() for one minute. Trimming a large filesystem on a disk that slowly processes discard requests can indeed take a long time. Since discard is just an advisory call, it is perfectly fine to interrupt it at any time and the return number of discarded blocks until that moment. Do that when we detect the task is being frozen. Cc: stable@kernel.org Reported-by: Len Brown <lenb@kernel.org> Suggested-by: Dave Chinner <david@fromorbit.com> References: https://bugzilla.kernel.org/show_bug.cgi?id=216322 Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230913150504.9054-2-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-10-06ext4: move setting of trimmed bit into ext4_try_to_trim_range()Jan Kara1-21/+25
[ Upstream commit 45e4ab320c9b5fa67b1fc3b6a9b381cfcc0c8488 ] Currently we set the group's trimmed bit in ext4_trim_all_free() based on return value of ext4_try_to_trim_range(). However when we will want to abort trimming because of suspend attempt, we want to return success from ext4_try_to_trim_range() but not set the trimmed bit. Instead implementing awkward propagation of this information, just move setting of trimmed bit into ext4_try_to_trim_range() when the whole group is trimmed. Cc: stable@kernel.org Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230913150504.9054-1-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-10-06ext4: replace the traditional ternary conditional operator with with max()/min()Kemeng Shi1-4/+2
[ Upstream commit de8bf0e5ee7482585450357c6d4eddec8efc5cb7 ] Replace the traditional ternary conditional operator with with max()/min() Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Link: https://lore.kernel.org/r/20230801143204.2284343-7-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Stable-dep-of: 45e4ab320c9b ("ext4: move setting of trimmed bit into ext4_try_to_trim_range()") Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-10-06btrfs: remove BUG() after failure to insert delayed dir index itemFilipe Manana1-27/+47
[ Upstream commit 2c58c3931ede7cd08cbecf1f1a4acaf0a04a41a9 ] Instead of calling BUG() when we fail to insert a delayed dir index item into the delayed node's tree, we can just release all the resources we have allocated/acquired before and return the error to the caller. This is fine because all existing call chains undo anything they have done before calling btrfs_insert_delayed_dir_index() or BUG_ON (when creating pending snapshots in the transaction commit path). So remove the BUG() call and do proper error handling. This relates to a syzbot report linked below, but does not fix it because it only prevents hitting a BUG(), it does not fix the issue where somehow we attempt to use twice the same index number for different index items. Link: https://lore.kernel.org/linux-btrfs/00000000000036e1290603e097e0@google.com/ CC: stable@vger.kernel.org # 5.4+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-10-06btrfs: improve error message after failure to add delayed dir index itemFilipe Manana1-3/+4
[ Upstream commit 91bfe3104b8db0310f76f2dcb6aacef24c889366 ] If we fail to add a delayed dir index item because there's already another item with the same index number, we print an error message (and then BUG). However that message isn't very helpful to debug anything because we don't know what's the index number and what are the values of index counters in the inode and its delayed inode (index_cnt fields of struct btrfs_inode and struct btrfs_delayed_node). So update the error message to include the index number and counters. We actually had a recent case where this issue was hit by a syzbot report (see the link below). Link: https://lore.kernel.org/linux-btrfs/00000000000036e1290603e097e0@google.com/ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Stable-dep-of: 2c58c3931ede ("btrfs: remove BUG() after failure to insert delayed dir index item") Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-10-06netfs: Only call folio_start_fscache() one time for each folioDave Wysochanski1-1/+5
[ Upstream commit df1c357f25d808e30b216188330e708e09e1a412 ] If a network filesystem using netfs implements a clamp_length() function, it can set subrequest lengths smaller than a page size. When we loop through the folios in netfs_rreq_unlock_folios() to set any folios to be written back, we need to make sure we only call folio_start_fscache() once for each folio. Otherwise, this simple testcase: mount -o fsc,rsize=1024,wsize=1024 127.0.0.1:/export /mnt/nfs dd if=/dev/zero of=/mnt/nfs/file.bin bs=4096 count=1 1+0 records in 1+0 records out 4096 bytes (4.1 kB, 4.0 KiB) copied, 0.0126359 s, 324 kB/s echo 3 > /proc/sys/vm/drop_caches cat /mnt/nfs/file.bin > /dev/null will trigger an oops similar to the following: page dumped because: VM_BUG_ON_FOLIO(folio_test_private_2(folio)) ------------[ cut here ]------------ kernel BUG at include/linux/netfs.h:44! ... CPU: 5 PID: 134 Comm: kworker/u16:5 Kdump: loaded Not tainted 6.4.0-rc5 ... RIP: 0010:netfs_rreq_unlock_folios+0x68e/0x730 [netfs] ... Call Trace: netfs_rreq_assess+0x497/0x660 [netfs] netfs_subreq_terminated+0x32b/0x610 [netfs] nfs_netfs_read_completion+0x14e/0x1a0 [nfs] nfs_read_completion+0x2f9/0x330 [nfs] rpc_free_task+0x72/0xa0 [sunrpc] rpc_async_release+0x46/0x70 [sunrpc] process_one_work+0x3bd/0x710 worker_thread+0x89/0x610 kthread+0x181/0x1c0 ret_from_fork+0x29/0x50 Fixes: 3d3c95046742 ("netfs: Provide readahead and readpage netfs helpers" Link: https://bugzilla.redhat.com/show_bug.cgi?id=2210612 Signed-off-by: Dave Wysochanski <dwysocha@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: David Howells <dhowells@redhat.com> Link: https://lore.kernel.org/r/20230608214137.856006-1-dwysocha@redhat.com/ # v1 Link: https://lore.kernel.org/r/20230915185704.1082982-1-dwysocha@redhat.com/ # v2 Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-10-06NFSv4.1: fix pnfs MDS=DS session trunkingOlga Kornievskaia1-1/+5
[ Upstream commit 806a3bc421a115fbb287c1efce63a48c54ee804b ] Currently, when GETDEVICEINFO returns multiple locations where each is a different IP but the server's identity is same as MDS, then nfs4_set_ds_client() finds the existing nfs_client structure which has the MDS's max_connect value (and if it's 1), then the 1st IP on the DS's list will get dropped due to MDS trunking rules. Other IPs would be added as they fall under the pnfs trunking rules. For the list of IPs the 1st goes thru calling nfs4_set_ds_client() which will eventually call nfs4_add_trunk() and call into rpc_clnt_test_and_add_xprt() which has the check for MDS trunking. The other IPs (after the 1st one), would call rpc_clnt_add_xprt() which doesn't go thru that check. nfs4_add_trunk() is called when MDS trunking is happening and it needs to enforce the usage of max_connect mount option of the 1st mount. However, this shouldn't be applied to pnfs flow. Instead, this patch proposed to treat MDS=DS as DS trunking and make sure that MDS's max_connect limit does not apply to the 1st IP returned in the GETDEVICEINFO list. It does so by marking the newly created client with a new flag NFS_CS_PNFS which then used to pass max_connect value to use into the rpc_clnt_test_and_add_xprt() instead of the existing rpc client's max_connect value set by the MDS connection. For example, mount was done without max_connect value set so MDS's rpc client has cl_max_connect=1. Upon calling into rpc_clnt_test_and_add_xprt() and using rpc client's value, the caller passes in max_connect value which is previously been set in the pnfs path (as a part of handling GETDEVICEINFO list of IPs) in nfs4_set_ds_client(). However, when NFS_CS_PNFS flag is not set and we know we are doing MDS trunking, comparing a new IP of the same server, we then set the max_connect value to the existing MDS's value and pass that into rpc_clnt_test_and_add_xprt(). Fixes: dc48e0abee24 ("SUNRPC enforce creation of no more than max_connect xprts") Signed-off-by: Olga Kornievskaia <kolga@netapp.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-10-06NFSv4.1: use EXCHGID4_FLAG_USE_PNFS_DS for DS serverOlga Kornievskaia2-0/+7
[ Upstream commit 51d674a5e4889f1c8e223ac131cf218e1631e423 ] After receiving the location(s) of the DS server(s) in the GETDEVINCEINFO, create the request for the clientid to such server and indicate that the client is connecting to a DS. Signed-off-by: Olga Kornievskaia <kolga@netapp.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Stable-dep-of: 806a3bc421a1 ("NFSv4.1: fix pnfs MDS=DS session trunking") Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-10-06NFS/pNFS: Report EINVAL errors from connect() to the serverTrond Myklebust1-0/+1
[ Upstream commit dd7d7ee3ba2a70d12d02defb478790cf57d5b87b ] With IPv6, connect() can occasionally return EINVAL if a route is unavailable. If this happens during I/O to a data server, we want to report it using LAYOUTERROR as an inability to connect. Fixes: dd52128afdde ("NFSv4.1/pnfs Ensure flexfiles reports all connection related errors") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-10-06NFS: More fixes for nfs_direct_write_reschedule_io()Trond Myklebust1-6/+11
[ Upstream commit b11243f720ee5f9376861099019c8542969b6318 ] Ensure that all requests are put back onto the commit list so that they can be rescheduled. Fixes: 4daaeba93822 ("NFS: Fix nfs_direct_write_reschedule_io()") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-10-06NFS: Use the correct commit info in nfs_join_page_group()Trond Myklebust2-14/+17
[ Upstream commit b193a78ddb5ee7dba074d3f28dc050069ba083c0 ] Ensure that nfs_clear_request_commit() updates the correct counters when it removes them from the commit list. Fixes: ed5d588fe47f ("NFS: Try to join page groups before an O_DIRECT retransmission") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-10-06NFS: More O_DIRECT accounting fixes for error pathsTrond Myklebust1-16/+31
[ Upstream commit 8982f7aff39fb526aba4441fff2525fcedd5e1a3 ] If we hit a fatal error when retransmitting, we do need to record the removal of the request from the count of written bytes. Fixes: 031d73ed768a ("NFS: Fix O_DIRECT accounting of number of bytes read/written") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-10-06NFS: Fix O_DIRECT locking issuesTrond Myklebust1-4/+4
[ Upstream commit 7c6339322ce0c6128acbe36aacc1eeb986dd7bf1 ] The dreq fields are protected by the dreq->lock. Fixes: 954998b60caa ("NFS: Fix error handling for O_DIRECT write scheduling") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-10-06NFS: Fix error handling for O_DIRECT write schedulingTrond Myklebust1-16/+46
[ Upstream commit 954998b60caa8f2a3bf3abe490de6f08d283687a ] If we fail to schedule a request for transmission, there are 2 possibilities: 1) Either we hit a fatal error, and we just want to drop the remaining requests on the floor. 2) We were asked to try again, in which case we should allow the outstanding RPC calls to complete, so that we can recoalesce requests and try again. Fixes: d600ad1f2bdb ("NFS41: pop some layoutget errors to application") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-09-23ext4: fix rec_len verify errorShida Zhang1-11/+15
commit 7fda67e8c3ab6069f75888f67958a6d30454a9f6 upstream. With the configuration PAGE_SIZE 64k and filesystem blocksize 64k, a problem occurred when more than 13 million files were directly created under a directory: EXT4-fs error (device xx): ext4_dx_csum_set:492: inode #xxxx: comm xxxxx: dir seems corrupt? Run e2fsck -D. EXT4-fs error (device xx): ext4_dx_csum_verify:463: inode #xxxx: comm xxxxx: dir seems corrupt? Run e2fsck -D. EXT4-fs error (device xx): dx_probe:856: inode #xxxx: block 8188: comm xxxxx: Directory index failed checksum When enough files are created, the fake_dirent->reclen will be 0xffff. it doesn't equal to the blocksize 65536, i.e. 0x10000. But it is not the same condition when blocksize equals to 4k. when enough files are created, the fake_dirent->reclen will be 0x1000. it equals to the blocksize 4k, i.e. 0x1000. The problem seems to be related to the limitation of the 16-bit field when the blocksize is set to 64k. To address this, helpers like ext4_rec_len_{from,to}_disk has already been introduced to complete the conversion between the encoded and the plain form of rec_len. So fix this one by using the helper, and all the other in this file too. Cc: stable@kernel.org Fixes: dbe89444042a ("ext4: Calculate and verify checksums for htree nodes") Suggested-by: Andreas Dilger <adilger@dilger.ca> Suggested-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Shida Zhang <zhangshida@kylinos.cn> Reviewed-by: Andreas Dilger <adilger@dilger.ca> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Link: https://lore.kernel.org/r/20230803060938.1929759-1-zhangshida@kylinos.cn Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-09-23tracefs: Add missing lockdown check to tracefs_create_dir()Steven Rostedt (Google)1-0/+3
commit 51aab5ffceb43e05119eb059048fd75765d2bc21 upstream. The function tracefs_create_dir() was missing a lockdown check and was called by the RV code. This gave an inconsistent behavior of this function returning success while other tracefs functions failed. This caused the inode being freed by the wrong kmem_cache. Link: https://lkml.kernel.org/r/20230905182711.692687042@goodmis.org Link: https://lore.kernel.org/all/202309050916.58201dc6-oliver.sang@intel.com/ Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Ajay Kaher <akaher@vmware.com> Cc: Ching-lin Yu <chinglinyu@google.com> Fixes: bf8e602186ec4 ("tracing: Do not create tracefs files if tracefs lockdown is in effect") Reported-by: kernel test robot <oliver.sang@intel.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-09-23nfsd: fix change_info in NFSv4 RENAME repliesJeff Layton1-2/+2
commit fdd2630a7398191e84822612e589062063bd4f3d upstream. nfsd sends the transposed directory change info in the RENAME reply. The source directory is in save_fh and the target is in current_fh. Reported-by: Zhi Li <yieli@redhat.com> Reported-by: Benjamin Coddington <bcodding@redhat.com> Closes: https://bugzilla.redhat.com/show_bug.cgi?id=2218844 Signed-off-by: Jeff Layton <jlayton@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-09-23btrfs: check for BTRFS_FS_ERROR in pending ordered assertJosef Bacik1-1/+1
commit 4ca8e03cf2bfaeef7c85939fa1ea0c749cd116ab upstream. If we do fast tree logging we increment a counter on the current transaction for every ordered extent we need to wait for. This means we expect the transaction to still be there when we clear pending on the ordered extent. However if we happen to abort the transaction and clean it up, there could be no running transaction, and thus we'll trip the "ASSERT(trans)" check. This is obviously incorrect, and the code properly deals with the case that the transaction doesn't exist. Fix this ASSERT() to only fire if there's no trans and we don't have BTRFS_FS_ERROR() set on the file system. CC: stable@vger.kernel.org # 4.14+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-09-23btrfs: release path before inode lookup during the ino lookup ioctlFilipe Manana1-1/+7
commit ee34a82e890a7babb5585daf1a6dd7d4d1cf142a upstream. During the ino lookup ioctl we can end up calling btrfs_iget() to get an inode reference while we are holding on a root's btree. If btrfs_iget() needs to lookup the inode from the root's btree, because it's not currently loaded in memory, then it will need to lock another or the same path in the same root btree. This may result in a deadlock and trigger the following lockdep splat: WARNING: possible circular locking dependency detected 6.5.0-rc7-syzkaller-00004-gf7757129e3de #0 Not tainted ------------------------------------------------------ syz-executor277/5012 is trying to acquire lock: ffff88802df41710 (btrfs-tree-01){++++}-{3:3}, at: __btrfs_tree_read_lock+0x2f/0x220 fs/btrfs/locking.c:136 but task is already holding lock: ffff88802df418e8 (btrfs-tree-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x2f/0x220 fs/btrfs/locking.c:136 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #1 (btrfs-tree-00){++++}-{3:3}: down_read_nested+0x49/0x2f0 kernel/locking/rwsem.c:1645 __btrfs_tree_read_lock+0x2f/0x220 fs/btrfs/locking.c:136 btrfs_search_slot+0x13a4/0x2f80 fs/btrfs/ctree.c:2302 btrfs_init_root_free_objectid+0x148/0x320 fs/btrfs/disk-io.c:4955 btrfs_init_fs_root fs/btrfs/disk-io.c:1128 [inline] btrfs_get_root_ref+0x5ae/0xae0 fs/btrfs/disk-io.c:1338 btrfs_get_fs_root fs/btrfs/disk-io.c:1390 [inline] open_ctree+0x29c8/0x3030 fs/btrfs/disk-io.c:3494 btrfs_fill_super+0x1c7/0x2f0 fs/btrfs/super.c:1154 btrfs_mount_root+0x7e0/0x910 fs/btrfs/super.c:1519 legacy_get_tree+0xef/0x190 fs/fs_context.c:611 vfs_get_tree+0x8c/0x270 fs/super.c:1519 fc_mount fs/namespace.c:1112 [inline] vfs_kern_mount+0xbc/0x150 fs/namespace.c:1142 btrfs_mount+0x39f/0xb50 fs/btrfs/super.c:1579 legacy_get_tree+0xef/0x190 fs/fs_context.c:611 vfs_get_tree+0x8c/0x270 fs/super.c:1519 do_new_mount+0x28f/0xae0 fs/namespace.c:3335 do_mount fs/namespace.c:3675 [inline] __do_sys_mount fs/namespace.c:3884 [inline] __se_sys_mount+0x2d9/0x3c0 fs/namespace.c:3861 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd -> #0 (btrfs-tree-01){++++}-{3:3}: check_prev_add kernel/locking/lockdep.c:3142 [inline] check_prevs_add kernel/locking/lockdep.c:3261 [inline] validate_chain kernel/locking/lockdep.c:3876 [inline] __lock_acquire+0x39ff/0x7f70 kernel/locking/lockdep.c:5144 lock_acquire+0x1e3/0x520 kernel/locking/lockdep.c:5761 down_read_nested+0x49/0x2f0 kernel/locking/rwsem.c:1645 __btrfs_tree_read_lock+0x2f/0x220 fs/btrfs/locking.c:136 btrfs_tree_read_lock fs/btrfs/locking.c:142 [inline] btrfs_read_lock_root_node+0x292/0x3c0 fs/btrfs/locking.c:281 btrfs_search_slot_get_root fs/btrfs/ctree.c:1832 [inline] btrfs_search_slot+0x4ff/0x2f80 fs/btrfs/ctree.c:2154 btrfs_lookup_inode+0xdc/0x480 fs/btrfs/inode-item.c:412 btrfs_read_locked_inode fs/btrfs/inode.c:3892 [inline] btrfs_iget_path+0x2d9/0x1520 fs/btrfs/inode.c:5716 btrfs_search_path_in_tree_user fs/btrfs/ioctl.c:1961 [inline] btrfs_ioctl_ino_lookup_user+0x77a/0xf50 fs/btrfs/ioctl.c:2105 btrfs_ioctl+0xb0b/0xd40 fs/btrfs/ioctl.c:4683 vfs_ioctl fs/ioctl.c:51 [inline] __do_sys_ioctl fs/ioctl.c:870 [inline] __se_sys_ioctl+0xf8/0x170 fs/ioctl.c:856 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd other info that might help us debug this: Possible unsafe locking scenario: CPU0 CPU1 ---- ---- rlock(btrfs-tree-00); lock(btrfs-tree-01); lock(btrfs-tree-00); rlock(btrfs-tree-01); *** DEADLOCK *** 1 lock held by syz-executor277/5012: #0: ffff88802df418e8 (btrfs-tree-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x2f/0x220 fs/btrfs/locking.c:136 stack backtrace: CPU: 1 PID: 5012 Comm: syz-executor277 Not tainted 6.5.0-rc7-syzkaller-00004-gf7757129e3de #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 07/26/2023 Call Trace: <TASK> __dump_stack lib/dump_stack.c:88 [inline] dump_stack_lvl+0x1e7/0x2d0 lib/dump_stack.c:106 check_noncircular+0x375/0x4a0 kernel/locking/lockdep.c:2195 check_prev_add kernel/locking/lockdep.c:3142 [inline] check_prevs_add kernel/locking/lockdep.c:3261 [inline] validate_chain kernel/locking/lockdep.c:3876 [inline] __lock_acquire+0x39ff/0x7f70 kernel/locking/lockdep.c:5144 lock_acquire+0x1e3/0x520 kernel/locking/lockdep.c:5761 down_read_nested+0x49/0x2f0 kernel/locking/rwsem.c:1645 __btrfs_tree_read_lock+0x2f/0x220 fs/btrfs/locking.c:136 btrfs_tree_read_lock fs/btrfs/locking.c:142 [inline] btrfs_read_lock_root_node+0x292/0x3c0 fs/btrfs/locking.c:281 btrfs_search_slot_get_root fs/btrfs/ctree.c:1832 [inline] btrfs_search_slot+0x4ff/0x2f80 fs/btrfs/ctree.c:2154 btrfs_lookup_inode+0xdc/0x480 fs/btrfs/inode-item.c:412 btrfs_read_locked_inode fs/btrfs/inode.c:3892 [inline] btrfs_iget_path+0x2d9/0x1520 fs/btrfs/inode.c:5716 btrfs_search_path_in_tree_user fs/btrfs/ioctl.c:1961 [inline] btrfs_ioctl_ino_lookup_user+0x77a/0xf50 fs/btrfs/ioctl.c:2105 btrfs_ioctl+0xb0b/0xd40 fs/btrfs/ioctl.c:4683 vfs_ioctl fs/ioctl.c:51 [inline] __do_sys_ioctl fs/ioctl.c:870 [inline] __se_sys_ioctl+0xf8/0x170 fs/ioctl.c:856 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd RIP: 0033:0x7f0bec94ea39 Fix this simply by releasing the path before calling btrfs_iget() as at point we don't need the path anymore. Reported-by: syzbot+bf66ad948981797d2f1d@syzkaller.appspotmail.com Link: https://lore.kernel.org/linux-btrfs/00000000000045fa140603c4a969@google.com/ Fixes: 23d0b79dfaed ("btrfs: Add unprivileged version of ino_lookup ioctl") CC: stable@vger.kernel.org # 4.19+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-09-23btrfs: fix a compilation error if DEBUG is defined in btree_dirty_folioQu Wenruo1-6/+8
commit 5e0e879926c1ce7e1f5e0dfaacaf2d105f7d8a05 upstream. [BUG] After commit 72a69cd03082 ("btrfs: subpage: pack all subpage bitmaps into a larger bitmap"), the DEBUG section of btree_dirty_folio() would no longer compile. [CAUSE] If DEBUG is defined, we would do extra checks for btree_dirty_folio(), mostly to make sure the range we marked dirty has an extent buffer and that extent buffer is dirty. For subpage, we need to iterate through all the extent buffers covered by that page range, and make sure they all matches the criteria. However commit 72a69cd03082 ("btrfs: subpage: pack all subpage bitmaps into a larger bitmap") changes how we store the bitmap, we pack all the 16 bits bitmaps into a larger bitmap, which would save some space. This means we no longer have btrfs_subpage::dirty_bitmap, instead the dirty bitmap is starting at btrfs_subpage_info::dirty_offset, and has a length of btrfs_subpage_info::bitmap_nr_bits. [FIX] Although I'm not sure if it still makes sense to maintain such code, at least let it compile. This patch would let us test the bits one by one through the bitmaps. CC: stable@vger.kernel.org # 6.1+ Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-09-23btrfs: fix lockdep splat and potential deadlock after failure running ↵Filipe Manana1-3/+16
delayed items commit e110f8911ddb93e6f55da14ccbbe705397b30d0b upstream. When running delayed items we are holding a delayed node's mutex and then we will attempt to modify a subvolume btree to insert/update/delete the delayed items. However if have an error during the insertions for example, btrfs_insert_delayed_items() may return with a path that has locked extent buffers (a leaf at the very least), and then we attempt to release the delayed node at __btrfs_run_delayed_items(), which requires taking the delayed node's mutex, causing an ABBA type of deadlock. This was reported by syzbot and the lockdep splat is the following: WARNING: possible circular locking dependency detected 6.5.0-rc7-syzkaller-00024-g93f5de5f648d #0 Not tainted ------------------------------------------------------ syz-executor.2/13257 is trying to acquire lock: ffff88801835c0c0 (&delayed_node->mutex){+.+.}-{3:3}, at: __btrfs_release_delayed_node+0x9a/0xaa0 fs/btrfs/delayed-inode.c:256 but task is already holding lock: ffff88802a5ab8e8 (btrfs-tree-00){++++}-{3:3}, at: __btrfs_tree_lock+0x3c/0x2a0 fs/btrfs/locking.c:198 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #1 (btrfs-tree-00){++++}-{3:3}: __lock_release kernel/locking/lockdep.c:5475 [inline] lock_release+0x36f/0x9d0 kernel/locking/lockdep.c:5781 up_write+0x79/0x580 kernel/locking/rwsem.c:1625 btrfs_tree_unlock_rw fs/btrfs/locking.h:189 [inline] btrfs_unlock_up_safe+0x179/0x3b0 fs/btrfs/locking.c:239 search_leaf fs/btrfs/ctree.c:1986 [inline] btrfs_search_slot+0x2511/0x2f80 fs/btrfs/ctree.c:2230 btrfs_insert_empty_items+0x9c/0x180 fs/btrfs/ctree.c:4376 btrfs_insert_delayed_item fs/btrfs/delayed-inode.c:746 [inline] btrfs_insert_delayed_items fs/btrfs/delayed-inode.c:824 [inline] __btrfs_commit_inode_delayed_items+0xd24/0x2410 fs/btrfs/delayed-inode.c:1111 __btrfs_run_delayed_items+0x1db/0x430 fs/btrfs/delayed-inode.c:1153 flush_space+0x269/0xe70 fs/btrfs/space-info.c:723 btrfs_async_reclaim_metadata_space+0x106/0x350 fs/btrfs/space-info.c:1078 process_one_work+0x92c/0x12c0 kernel/workqueue.c:2600 worker_thread+0xa63/0x1210 kernel/workqueue.c:2751 kthread+0x2b8/0x350 kernel/kthread.c:389 ret_from_fork+0x2e/0x60 arch/x86/kernel/process.c:145 ret_from_fork_asm+0x11/0x20 arch/x86/entry/entry_64.S:304 -> #0 (&delayed_node->mutex){+.+.}-{3:3}: check_prev_add kernel/locking/lockdep.c:3142 [inline] check_prevs_add kernel/locking/lockdep.c:3261 [inline] validate_chain kernel/locking/lockdep.c:3876 [inline] __lock_acquire+0x39ff/0x7f70 kernel/locking/lockdep.c:5144 lock_acquire+0x1e3/0x520 kernel/locking/lockdep.c:5761 __mutex_lock_common+0x1d8/0x2530 kernel/locking/mutex.c:603 __mutex_lock kernel/locking/mutex.c:747 [inline] mutex_lock_nested+0x1b/0x20 kernel/locking/mutex.c:799 __btrfs_release_delayed_node+0x9a/0xaa0 fs/btrfs/delayed-inode.c:256 btrfs_release_delayed_node fs/btrfs/delayed-inode.c:281 [inline] __btrfs_run_delayed_items+0x2b5/0x430 fs/btrfs/delayed-inode.c:1156 btrfs_commit_transaction+0x859/0x2ff0 fs/btrfs/transaction.c:2276 btrfs_sync_file+0xf56/0x1330 fs/btrfs/file.c:1988 vfs_fsync_range fs/sync.c:188 [inline] vfs_fsync fs/sync.c:202 [inline] do_fsync fs/sync.c:212 [inline] __do_sys_fsync fs/sync.c:220 [inline] __se_sys_fsync fs/sync.c:218 [inline] __x64_sys_fsync+0x196/0x1e0 fs/sync.c:218 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd other info that might help us debug this: Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(btrfs-tree-00); lock(&delayed_node->mutex); lock(btrfs-tree-00); lock(&delayed_node->mutex); *** DEADLOCK *** 3 locks held by syz-executor.2/13257: #0: ffff88802c1ee370 (btrfs_trans_num_writers){++++}-{0:0}, at: spin_unlock include/linux/spinlock.h:391 [inline] #0: ffff88802c1ee370 (btrfs_trans_num_writers){++++}-{0:0}, at: join_transaction+0xb87/0xe00 fs/btrfs/transaction.c:287 #1: ffff88802c1ee398 (btrfs_trans_num_extwriters){++++}-{0:0}, at: join_transaction+0xbb2/0xe00 fs/btrfs/transaction.c:288 #2: ffff88802a5ab8e8 (btrfs-tree-00){++++}-{3:3}, at: __btrfs_tree_lock+0x3c/0x2a0 fs/btrfs/locking.c:198 stack backtrace: CPU: 0 PID: 13257 Comm: syz-executor.2 Not tainted 6.5.0-rc7-syzkaller-00024-g93f5de5f648d #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 07/26/2023 Call Trace: <TASK> __dump_stack lib/dump_stack.c:88 [inline] dump_stack_lvl+0x1e7/0x2d0 lib/dump_stack.c:106 check_noncircular+0x375/0x4a0 kernel/locking/lockdep.c:2195 check_prev_add kernel/locking/lockdep.c:3142 [inline] check_prevs_add kernel/locking/lockdep.c:3261 [inline] validate_chain kernel/locking/lockdep.c:3876 [inline] __lock_acquire+0x39ff/0x7f70 kernel/locking/lockdep.c:5144 lock_acquire+0x1e3/0x520 kernel/locking/lockdep.c:5761 __mutex_lock_common+0x1d8/0x2530 kernel/locking/mutex.c:603 __mutex_lock kernel/locking/mutex.c:747 [inline] mutex_lock_nested+0x1b/0x20 kernel/locking/mutex.c:799 __btrfs_release_delayed_node+0x9a/0xaa0 fs/btrfs/delayed-inode.c:256 btrfs_release_delayed_node fs/btrfs/delayed-inode.c:281 [inline] __btrfs_run_delayed_items+0x2b5/0x430 fs/btrfs/delayed-inode.c:1156 btrfs_commit_transaction+0x859/0x2ff0 fs/btrfs/transaction.c:2276 btrfs_sync_file+0xf56/0x1330 fs/btrfs/file.c:1988 vfs_fsync_range fs/sync.c:188 [inline] vfs_fsync fs/sync.c:202 [inline] do_fsync fs/sync.c:212 [inline] __do_sys_fsync fs/sync.c:220 [inline] __se_sys_fsync fs/sync.c:218 [inline] __x64_sys_fsync+0x196/0x1e0 fs/sync.c:218 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd RIP: 0033:0x7f3ad047cae9 Code: 28 00 00 00 75 (...) RSP: 002b:00007f3ad12510c8 EFLAGS: 00000246 ORIG_RAX: 000000000000004a RAX: ffffffffffffffda RBX: 00007f3ad059bf80 RCX: 00007f3ad047cae9 RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000005 RBP: 00007f3ad04c847a R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000 R13: 000000000000000b R14: 00007f3ad059bf80 R15: 00007ffe56af92f8 </TASK> ------------[ cut here ]------------ Fix this by releasing the path before releasing the delayed node in the error path at __btrfs_run_delayed_items(). Reported-by: syzbot+a379155f07c134ea9879@syzkaller.appspotmail.com Link: https://lore.kernel.org/linux-btrfs/000000000000abba27060403b5bd@google.com/ CC: stable@vger.kernel.org # 4.14+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>