summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)AuthorFilesLines
2022-06-14zonefs: fix handling of explicit_open option on mountDamien Le Moal1-5/+6
commit a2a513be7139b279f1b5b2cee59c6c4950c34346 upstream. Ignoring the explicit_open mount option on mount for devices that do not have a limit on the number of open zones must be done after the mount options are parsed and set in s_mount_opts. Move the check to ignore the explicit_open option after the call to zonefs_parse_options() in zonefs_fill_super(). Fixes: b5c00e975779 ("zonefs: open/close zone on file open/close") Cc: <stable@vger.kernel.org> Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-06-14writeback: Fix inode->i_io_list not be protected by inode->i_lock errorJchao Sun2-10/+29
commit 10e14073107dd0b6d97d9516a02845a8e501c2c9 upstream. Commit b35250c0816c ("writeback: Protect inode->i_io_list with inode->i_lock") made inode->i_io_list not only protected by wb->list_lock but also inode->i_lock, but inode_io_list_move_locked() was missed. Add lock there and also update comment describing things protected by inode->i_lock. This also fixes a race where __mark_inode_dirty() could move inode under flush worker's hands and thus sync(2) could miss writing some inodes. Fixes: b35250c0816c ("writeback: Protect inode->i_io_list with inode->i_lock") Link: https://lore.kernel.org/r/20220524150540.12552-1-sunjunchao2870@gmail.com CC: stable@vger.kernel.org Signed-off-by: Jchao Sun <sunjunchao2870@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-06-14cifs: populate empty hostnames for extra channelsShyam Prasad N2-1/+8
commit 4c14d7043fede258957d7b01da0cad2d9fe3a205 upstream. Currently, the secondary channels of a multichannel session also get hostname populated based on the info in primary channel. However, this will end up with a wrong resolution of hostname to IP address during reconnect. This change fixes this by not populating hostname info for all secondary channels. Fixes: 5112d80c162f ("cifs: populate server_hostname for extra channels") Cc: stable@vger.kernel.org Signed-off-by: Shyam Prasad N <sprasad@microsoft.com> Signed-off-by: Steve French <stfrench@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-06-14cifs: fix reconnect on smb3 mount typesPaulo Alcantara3-13/+18
commit c36ee7dab7749f7be21f7a72392744490b2a9a2b upstream. cifs.ko defines two file system types: cifs & smb3, and __cifs_get_super() was not including smb3 file system type when looking up superblocks, therefore failing to reconnect tcons in cifs_tree_connect(). Fix this by calling iterate_supers_type() on both file system types. Link: https://lore.kernel.org/r/CAFrh3J9soC36+BVuwHB=g9z_KB5Og2+p2_W+BBoBOZveErz14w@mail.gmail.com Cc: stable@vger.kernel.org Tested-by: Satadru Pramanik <satadru@gmail.com> Reported-by: Satadru Pramanik <satadru@gmail.com> Signed-off-by: Paulo Alcantara (SUSE) <pc@cjr.nz> Signed-off-by: Steve French <stfrench@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-06-14cifs: return errors during session setup during reconnectsShyam Prasad N1-0/+3
commit 8ea21823aa584b55ba4b861307093b78054b0c1b upstream. During reconnects, we check the return value from cifs_negotiate_protocol, and have handlers for both success and failures. But if that passes, and cifs_setup_session returns any errors other than -EACCES, we do not handle that. This fix adds a handler for that, so that we don't go ahead and try a tree_connect on a failed session. Signed-off-by: Shyam Prasad N <sprasad@microsoft.com> Reviewed-by: Enzo Matsumiya <ematsumiya@suse.de> Cc: stable@vger.kernel.org Signed-off-by: Steve French <stfrench@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-06-14cifs: version operations for smb20 unneeded when legacy support disabledSteve French2-2/+9
[ Upstream commit 7ef93ffccd55fb0ba000ed16ef6a81cd7dee07b5 ] We should not be including unused smb20 specific code when legacy support is disabled (CONFIG_CIFS_ALLOW_INSECURE_LEGACY turned off). For example smb2_operations and smb2_values aren't used in that case. Over time we can move more and more SMB1/CIFS and SMB2.0 code into the insecure legacy ifdefs Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-06-14ceph: flush the mdlog for filesystem syncXiubo Li1-6/+27
[ Upstream commit 1b2ba3c5616e17ff951359e25c658a1c3f146f1e ] Before waiting for a request's safe reply, we will send the mdlog flush request to the relevant MDS. And this will also flush the mdlog for all the other unsafe requests in the same session, so we can record the last session and no need to flush mdlog again in the next loop. But there still have cases that it may send the mdlog flush requst twice or more, but that should be not often. Rename wait_unsafe_requests() to flush_mdlog_and_wait_mdsc_unsafe_requests() to make it more descriptive. [xiubli: fold in MDS request refcount leak fix from Jeff] URL: https://tracker.ceph.com/issues/55284 URL: https://tracker.ceph.com/issues/55411 Signed-off-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-06-14ceph: allow ceph.dir.rctime xattr to be updatableVenky Shankar1-1/+9
[ Upstream commit d7a2dc523085f8b8c60548ceedc696934aefeb0e ] `rctime' has been a pain point in cephfs due to its buggy nature - inconsistent values reported and those sorts. Fixing rctime is non-trivial needing an overall redesign of the entire nested statistics infrastructure. As a workaround, PR http://github.com/ceph/ceph/pull/37938 allows this extended attribute to be manually set. This allows users to "fixup" inconsistent rctime values. While this sounds messy, its probably the wisest approach allowing users/scripts to workaround buggy rctime values. The above PR enables Ceph MDS to allow manually setting rctime extended attribute with the corresponding user-land changes. We may as well allow the same to be done via kclient for parity. Signed-off-by: Venky Shankar <vshankar@redhat.com> Reviewed-by: Xiubo Li <xiubli@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-06-14ksmbd: smbd: fix connection dropped issueHyunchul Lee1-0/+1
[ Upstream commit 5366afc4065075a4456941fbd51c33604d631ee5 ] When there are bursty connection requests, RDMA connection event handler is deferred and Negotiation requests are received even if connection status is NEW. To handle it, set the status to CONNECTED if Negotiation requests are received. Reported-by: Yufan Chen <wiz.chen@gmail.com> Signed-off-by: Hyunchul Lee <hyc.lee@gmail.com> Tested-by: Yufan Chen <wiz.chen@gmail.com> Acked-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-06-14kernfs: Separate kernfs_pr_cont_buf and rename_lock.Hao Luo1-12/+19
[ Upstream commit 1a702dc88e150487c9c173a249b3d236498b9183 ] Previously the protection of kernfs_pr_cont_buf was piggy backed by rename_lock, which means that pr_cont() needs to be protected under rename_lock. This can cause potential circular lock dependencies. If there is an OOM, we have the following call hierarchy: -> cpuset_print_current_mems_allowed() -> pr_cont_cgroup_name() -> pr_cont_kernfs_name() pr_cont_kernfs_name() will grab rename_lock and call printk. So we have the following lock dependencies: kernfs_rename_lock -> console_sem Sometimes, printk does a wakeup before releasing console_sem, which has the dependence chain: console_sem -> p->pi_lock -> rq->lock Now, imagine one wants to read cgroup_name under rq->lock, for example, printing cgroup_name in a tracepoint in the scheduler code. They will be holding rq->lock and take rename_lock: rq->lock -> kernfs_rename_lock Now they will deadlock. A prevention to this circular lock dependency is to separate the protection of pr_cont_buf from rename_lock. In principle, rename_lock is to protect the integrity of cgroup name when copying to buf. Once pr_cont_buf has got its content, rename_lock can be dropped. So it's safe to drop rename_lock after kernfs_name_locked (and kernfs_path_from_node_locked) and rely on a dedicated pr_cont_lock to protect pr_cont_buf. Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Hao Luo <haoluo@google.com> Link: https://lore.kernel.org/r/20220516190951.3144144-1-haoluo@google.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-06-14f2fs: fix to tag gcing flag on page during file defragmentChao Yu1-0/+1
[ Upstream commit 2d1fe8a86bf5e0663866fd0da83c2af1e1b0e362 ] In order to garantee migrated data be persisted during checkpoint, otherwise out-of-order persistency between data and node may cause data corruption after SPOR. Signed-off-by: Chao Yu <chao.yu@oppo.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-06-14NFSv4: Don't hold the layoutget locks across multiple RPC callsTrond Myklebust1-0/+4
[ Upstream commit 6949493884fe88500de4af182588e071cf1544ee ] When doing layoutget as part of the open() compound, we have to be careful to release the layout locks before we can call any further RPC calls, such as setattr(). The reason is that those calls could trigger a recall, which could deadlock. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-06-14f2fs: remove WARN_ON in f2fs_is_valid_blkaddrDongliang Mu1-2/+2
[ Upstream commit dc2f78e2d4cc844a1458653d57ce1b54d4a29f21 ] Syzbot triggers two WARNs in f2fs_is_valid_blkaddr and __is_bitmap_valid. For example, in f2fs_is_valid_blkaddr, if type is DATA_GENERIC_ENHANCE or DATA_GENERIC_ENHANCE_READ, it invokes WARN_ON if blkaddr is not in the right range. The call trace is as follows: f2fs_get_node_info+0x45f/0x1070 read_node_page+0x577/0x1190 __get_node_page.part.0+0x9e/0x10e0 __get_node_page f2fs_get_node_page+0x109/0x180 do_read_inode f2fs_iget+0x2a5/0x58b0 f2fs_fill_super+0x3b39/0x7ca0 Fix these two WARNs by replacing WARN_ON with dump_stack. Reported-by: syzbot+763ae12a2ede1d99d4dc@syzkaller.appspotmail.com Signed-off-by: Dongliang Mu <mudongliangabcd@gmail.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-06-14afs: Fix infinite loop found by xfstest generic/676David Howells1-1/+4
[ Upstream commit 17eabd42560f4636648ad65ba5b20228071e2363 ] In AFS, a directory is handled as a file that the client downloads and parses locally for the purposes of performing lookup and getdents operations. The in-kernel afs filesystem has a number of functions that do this. A directory file is arranged as a series of 2K blocks divided into 32-byte slots, where a directory entry occupies one or more slots, plus each block starts with one or more metadata blocks. When parsing a block, if the last slots are occupied by a dirent that occupies more than a single slot and the file position points at a slot that's not the initial one, the logic in afs_dir_iterate_block() that skips over it won't advance the file pointer to the end of it. This will cause an infinite loop in getdents() as it will keep retrying that block and failing to advance beyond the final entry. Fix this by advancing the file pointer if the next entry will be beyond it when we skip a block. This was found by the generic/676 xfstest but can also be triggered with something like: ~/xfstests-dev/src/t_readdir_3 /xfstest.test/z 4000 1 Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Marc Dionne <marc.dionne@auristor.com> Tested-by: Marc Dionne <marc.dionne@auristor.com> cc: linux-afs@lists.infradead.org Link: http://lore.kernel.org/r/165391973497.110268.2939296942213894166.stgit@warthog.procyon.org.uk/ Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-06-14jffs2: fix memory leak in jffs2_do_fill_superBaokun Li1-0/+1
[ Upstream commit c14adb1cf70a984ed081c67e9d27bc3caad9537c ] If jffs2_iget() or d_make_root() in jffs2_do_fill_super() returns an error, we can observe the following kmemleak report: -------------------------------------------- unreferenced object 0xffff888105a65340 (size 64): comm "mount", pid 710, jiffies 4302851558 (age 58.239s) hex dump (first 32 bytes): 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ backtrace: [<ffffffff859c45e5>] kmem_cache_alloc_trace+0x475/0x8a0 [<ffffffff86160146>] jffs2_sum_init+0x96/0x1a0 [<ffffffff86140e25>] jffs2_do_mount_fs+0x745/0x2120 [<ffffffff86149fec>] jffs2_do_fill_super+0x35c/0x810 [<ffffffff8614aae9>] jffs2_fill_super+0x2b9/0x3b0 [...] unreferenced object 0xffff8881bd7f0000 (size 65536): comm "mount", pid 710, jiffies 4302851558 (age 58.239s) hex dump (first 32 bytes): bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb ................ bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb ................ backtrace: [<ffffffff858579ba>] kmalloc_order+0xda/0x110 [<ffffffff85857a11>] kmalloc_order_trace+0x21/0x130 [<ffffffff859c2ed1>] __kmalloc+0x711/0x8a0 [<ffffffff86160189>] jffs2_sum_init+0xd9/0x1a0 [<ffffffff86140e25>] jffs2_do_mount_fs+0x745/0x2120 [<ffffffff86149fec>] jffs2_do_fill_super+0x35c/0x810 [<ffffffff8614aae9>] jffs2_fill_super+0x2b9/0x3b0 [...] -------------------------------------------- This is because the resources allocated in jffs2_sum_init() are not released. Call jffs2_sum_exit() to release these resources to solve the problem. Fixes: e631ddba5887 ("[JFFS2] Add erase block summary support (mount time improvement)") Signed-off-by: Baokun Li <libaokun1@huawei.com> Signed-off-by: Richard Weinberger <richard@nod.at> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-06-14ksmbd: fix reference count leak in smb_check_perm_dacl()Xin Xiong1-0/+1
[ Upstream commit d21a580dafc69aa04f46e6099616146a536b0724 ] The issue happens in a specific path in smb_check_perm_dacl(). When "id" and "uid" have the same value, the function simply jumps out of the loop without decrementing the reference count of the object "posix_acls", which is increased by get_acl() earlier. This may result in memory leaks. Fix it by decreasing the reference count of "posix_acls" before jumping to label "check_access_bits". Fixes: 777cad1604d6 ("ksmbd: remove select FS_POSIX_ACL in Kconfig") Signed-off-by: Xin Xiong <xiongx18@fudan.edu.cn> Signed-off-by: Xin Tan <tanxin.ctf@gmail.com> Acked-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-06-09fs/ntfs3: Fix invalid free in log_replayNamjae Jeon1-3/+3
commit f26967b9f7a830e228bb13fb41bd516ddd9d789d upstream. log_read_rst() returns ENOMEM error when there is not enough memory. In this case, if info is returned without initialization, it attempts to kfree the uninitialized info->r_page pointer. This patch moves the memset initialization code to before log_read_rst() is called. Reported-by: Gerald Lee <sundaywind2004@gmail.com> Signed-off-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-06-09exportfs: support idmapped mountsChristian Brauner1-2/+3
commit 3a761d72fa62eec8913e45d29375344f61706541 upstream. Make the two locations where exportfs helpers check permission to lookup a given inode idmapped mount aware by switching it to the lookup_one() helper. This is a bugfix for the open_by_handle_at() system call which doesn't take idmapped mounts into account currently. It's not tied to a specific commit so we'll just Cc stable. In addition this is required to support idmapped base layers in overlay. The overlay filesystem uses exportfs to encode and decode file handles for its index=on mount option and when nfs_export=on. Cc: <stable@vger.kernel.org> Cc: <linux-fsdevel@vger.kernel.org> Tested-by: Giuseppe Scrivano <gscrivan@redhat.com> Reviewed-by: Amir Goldstein <amir73il@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-06-09fs: add two trivial lookup helpersChristian Brauner1-10/+60
commit 00675017e0aeba5305665c52ded4ddce6a4c0231 upstream. Similar to the addition of lookup_one() add a version of lookup_one_unlocked() and lookup_one_positive_unlocked() that take idmapped mounts into account. This is required to port overlay to support idmapped base layers. Cc: <linux-fsdevel@vger.kernel.org> Tested-by: Giuseppe Scrivano <gscrivan@redhat.com> Reviewed-by: Amir Goldstein <amir73il@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-06-09ext4: only allow test_dummy_encryption when supportedEric Biggers2-28/+38
commit 5f41fdaea63ddf96d921ab36b2af4a90ccdb5744 upstream. Make the test_dummy_encryption mount option require that the encrypt feature flag be already enabled on the filesystem, rather than automatically enabling it. Practically, this means that "-O encrypt" will need to be included in MKFS_OPTIONS when running xfstests with the test_dummy_encryption mount option. (ext4/053 also needs an update.) Moreover, as long as the preconditions for test_dummy_encryption are being tightened anyway, take the opportunity to start rejecting it when !CONFIG_FS_ENCRYPTION rather than ignoring it. The motivation for requiring the encrypt feature flag is that: - Having the filesystem auto-enable feature flags is problematic, as it bypasses the usual sanity checks. The specific issue which came up recently is that in kernel versions where ext4 supports casefold but not encrypt+casefold (v5.1 through v5.10), the kernel will happily add the encrypt flag to a filesystem that has the casefold flag, making it unmountable -- but only for subsequent mounts, not the initial one. This confused the casefold support detection in xfstests, causing generic/556 to fail rather than be skipped. - The xfstests-bld test runners (kvm-xfstests et al.) already use the required mkfs flag, so they will not be affected by this change. Only users of test_dummy_encryption alone will be affected. But, this option has always been for testing only, so it should be fine to require that the few users of this option update their test scripts. - f2fs already requires it (for its equivalent feature flag). Signed-off-by: Eric Biggers <ebiggers@google.com> Reviewed-by: Gabriel Krisman Bertazi <krisman@collabora.com> Link: https://lore.kernel.org/r/20220519204437.61645-1-ebiggers@kernel.org Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-06-09SMB3: EBADF/EIO errors in rename/open caused by race condition in ↵Steve French1-2/+0
smb2_compound_op commit 0a55cf74ffb5d004b93647e4389096880ce37d6b upstream. There is a race condition in smb2_compound_op: after_close: num_rqst++; if (cfile) { cifsFileInfo_put(cfile); // sends SMB2_CLOSE to the server cfile = NULL; This is triggered by smb2_query_path_info operation that happens during revalidate_dentry. In smb2_query_path_info, get_readable_path is called to load the cfile, increasing the reference counter. If in the meantime, this reference becomes the very last, this call to cifsFileInfo_put(cfile) will trigger a SMB2_CLOSE request sent to the server just before sending this compound request – and so then the compound request fails either with EBADF/EIO depending on the timing at the server, because the handle is already closed. In the first scenario, the race seems to be happening between smb2_query_path_info triggered by the rename operation, and between “cleanup” of asynchronous writes – while fsync(fd) likely waits for the asynchronous writes to complete, releasing the writeback structures can happen after the close(fd) call. So the EBADF/EIO errors will pop up if the timing is such that: 1) There are still outstanding references after close(fd) in the writeback structures 2) smb2_query_path_info successfully fetches the cfile, increasing the refcounter by 1 3) All writeback structures release the same cfile, reducing refcounter to 1 4) smb2_compound_op is called with that cfile In the second scenario, the race seems to be similar – here open triggers the smb2_query_path_info operation, and if all other threads in the meantime decrease the refcounter to 1 similarly to the first scenario, again SMB2_CLOSE will be sent to the server just before issuing the compound request. This case is harder to reproduce. See https://bugzilla.samba.org/show_bug.cgi?id=15051 Cc: stable@vger.kernel.org Fixes: 8de9e86c67ba ("cifs: create a helper to find a writeable handle by path name") Signed-off-by: Ondrej Hubsch <ohubsch@purestorage.com> Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com> Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz> Signed-off-by: Steve French <stfrench@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-06-09ceph: fix decoding of client session messages flagsLuís Henriques1-5/+9
commit ea16567f11018e2f58e72b667b0c803ff92b8153 upstream. The cephfs kernel client started to show the message: ceph: mds0 session blocklisted when mounting a filesystem. This is due to the fact that the session messages are being incorrectly decoded: the skip needs to take into account the 'len'. While there, fixed some whitespaces too. Cc: stable@vger.kernel.org Fixes: e1c9788cb397 ("ceph: don't rely on error_string to validate blocklisted session.") Signed-off-by: Luís Henriques <lhenriques@suse.de> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-06-09ksmbd: fix outstanding credits related bugsHyunchul Lee3-3/+5
commit 376b9133826865568167b4091ef92a68c4622b87 upstream. outstanding credits must be initialized to 0, because it means the sum of credits consumed by in-flight requests. And outstanding credits must be compared with total credits in smb2_validate_credit_charge(), because total credits are the sum of credits granted by ksmbd. This patch fix the following error, while frametest with Windows clients: Limits exceeding the maximum allowable outstanding requests, given : 128, pending : 8065 Fixes: b589f5db6d4a ("ksmbd: limits exceeding the maximum allowable outstanding requests") Cc: stable@vger.kernel.org Signed-off-by: Hyunchul Lee <hyc.lee@gmail.com> Reported-by: Yufan Chen <wiz.chen@gmail.com> Tested-by: Yufan Chen <wiz.chen@gmail.com> Acked-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-06-09ocfs2: dlmfs: fix error handling of user_dlm_destroy_lockJunxiao Bi via Ocfs2-devel1-1/+15
commit 863e0d81b6683c4cbc588ad831f560c90e494bef upstream. When user_dlm_destroy_lock failed, it didn't clean up the flags it set before exit. For USER_LOCK_IN_TEARDOWN, if this function fails because of lock is still in used, next time when unlink invokes this function, it will return succeed, and then unlink will remove inode and dentry if lock is not in used(file closed), but the dlm lock is still linked in dlm lock resource, then when bast come in, it will trigger a panic due to user-after-free. See the following panic call trace. To fix this, USER_LOCK_IN_TEARDOWN should be reverted if fail. And also error should be returned if USER_LOCK_IN_TEARDOWN is set to let user know that unlink fail. For the case of ocfs2_dlm_unlock failure, besides USER_LOCK_IN_TEARDOWN, USER_LOCK_BUSY is also required to be cleared. Even though spin lock is released in between, but USER_LOCK_IN_TEARDOWN is still set, for USER_LOCK_BUSY, if before every place that waits on this flag, USER_LOCK_IN_TEARDOWN is checked to bail out, that will make sure no flow waits on the busy flag set by user_dlm_destroy_lock(), then we can simplely revert USER_LOCK_BUSY when ocfs2_dlm_unlock fails. Fix user_dlm_cluster_lock() which is the only function not following this. [ 941.336392] (python,26174,16):dlmfs_unlink:562 ERROR: unlink 004fb0000060000b5a90b8c847b72e1, error -16 from destroy [ 989.757536] ------------[ cut here ]------------ [ 989.757709] kernel BUG at fs/ocfs2/dlmfs/userdlm.c:173! [ 989.757876] invalid opcode: 0000 [#1] SMP [ 989.758027] Modules linked in: ksplice_2zhuk2jr_ib_ipoib_new(O) ksplice_2zhuk2jr(O) mptctl mptbase xen_netback xen_blkback xen_gntalloc xen_gntdev xen_evtchn cdc_ether usbnet mii ocfs2 jbd2 rpcsec_gss_krb5 auth_rpcgss nfsv4 nfsv3 nfs_acl nfs fscache lockd grace ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager ocfs2_stackglue configfs bnx2fc fcoe libfcoe libfc scsi_transport_fc sunrpc ipmi_devintf bridge stp llc rds_rdma rds bonding ib_sdp ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm falcon_lsm_serviceable(PE) falcon_nf_netcontain(PE) mlx4_vnic falcon_kal(E) falcon_lsm_pinned_13402(E) mlx4_ib ib_sa ib_mad ib_core ib_addr xenfs xen_privcmd dm_multipath iTCO_wdt iTCO_vendor_support pcspkr sb_edac edac_core i2c_i801 lpc_ich mfd_core ipmi_ssif i2c_core ipmi_si ipmi_msghandler [ 989.760686] ioatdma sg ext3 jbd mbcache sd_mod ahci libahci ixgbe dca ptp pps_core vxlan udp_tunnel ip6_udp_tunnel megaraid_sas mlx4_core crc32c_intel be2iscsi bnx2i cnic uio cxgb4i cxgb4 cxgb3i libcxgbi ipv6 cxgb3 mdio libiscsi_tcp qla4xxx iscsi_boot_sysfs libiscsi scsi_transport_iscsi wmi dm_mirror dm_region_hash dm_log dm_mod [last unloaded: ksplice_2zhuk2jr_ib_ipoib_old] [ 989.761987] CPU: 10 PID: 19102 Comm: dlm_thread Tainted: P OE 4.1.12-124.57.1.el6uek.x86_64 #2 [ 989.762290] Hardware name: Oracle Corporation ORACLE SERVER X5-2/ASM,MOTHERBOARD,1U, BIOS 30350100 06/17/2021 [ 989.762599] task: ffff880178af6200 ti: ffff88017f7c8000 task.ti: ffff88017f7c8000 [ 989.762848] RIP: e030:[<ffffffffc07d4316>] [<ffffffffc07d4316>] __user_dlm_queue_lockres.part.4+0x76/0x80 [ocfs2_dlmfs] [ 989.763185] RSP: e02b:ffff88017f7cbcb8 EFLAGS: 00010246 [ 989.763353] RAX: 0000000000000000 RBX: ffff880174d48008 RCX: 0000000000000003 [ 989.763565] RDX: 0000000000120012 RSI: 0000000000000003 RDI: ffff880174d48170 [ 989.763778] RBP: ffff88017f7cbcc8 R08: ffff88021f4293b0 R09: 0000000000000000 [ 989.763991] R10: ffff880179c8c000 R11: 0000000000000003 R12: ffff880174d48008 [ 989.764204] R13: 0000000000000003 R14: ffff880179c8c000 R15: ffff88021db7a000 [ 989.764422] FS: 0000000000000000(0000) GS:ffff880247480000(0000) knlGS:ffff880247480000 [ 989.764685] CS: e033 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 989.764865] CR2: ffff8000007f6800 CR3: 0000000001ae0000 CR4: 0000000000042660 [ 989.765081] Stack: [ 989.765167] 0000000000000003 ffff880174d48040 ffff88017f7cbd18 ffffffffc07d455f [ 989.765442] ffff88017f7cbd88 ffffffff816fb639 ffff88017f7cbd38 ffff8800361b5600 [ 989.765717] ffff88021db7a000 ffff88021f429380 0000000000000003 ffffffffc0453020 [ 989.765991] Call Trace: [ 989.766093] [<ffffffffc07d455f>] user_bast+0x5f/0xf0 [ocfs2_dlmfs] [ 989.766287] [<ffffffff816fb639>] ? schedule_timeout+0x169/0x2d0 [ 989.766475] [<ffffffffc0453020>] ? o2dlm_lock_ast_wrapper+0x20/0x20 [ocfs2_stack_o2cb] [ 989.766738] [<ffffffffc045303a>] o2dlm_blocking_ast_wrapper+0x1a/0x20 [ocfs2_stack_o2cb] [ 989.767010] [<ffffffffc0864ec6>] dlm_do_local_bast+0x46/0xe0 [ocfs2_dlm] [ 989.767217] [<ffffffffc084f5cc>] ? dlm_lockres_calc_usage+0x4c/0x60 [ocfs2_dlm] [ 989.767466] [<ffffffffc08501f1>] dlm_thread+0xa31/0x1140 [ocfs2_dlm] [ 989.767662] [<ffffffff816f78da>] ? __schedule+0x24a/0x810 [ 989.767834] [<ffffffff816f78ce>] ? __schedule+0x23e/0x810 [ 989.768006] [<ffffffff816f78da>] ? __schedule+0x24a/0x810 [ 989.768178] [<ffffffff816f78ce>] ? __schedule+0x23e/0x810 [ 989.768349] [<ffffffff816f78da>] ? __schedule+0x24a/0x810 [ 989.768521] [<ffffffff816f78ce>] ? __schedule+0x23e/0x810 [ 989.768693] [<ffffffff816f78da>] ? __schedule+0x24a/0x810 [ 989.768893] [<ffffffff816f78ce>] ? __schedule+0x23e/0x810 [ 989.769067] [<ffffffff816f78da>] ? __schedule+0x24a/0x810 [ 989.769241] [<ffffffff810ce4d0>] ? wait_woken+0x90/0x90 [ 989.769411] [<ffffffffc084f7c0>] ? dlm_kick_thread+0x80/0x80 [ocfs2_dlm] [ 989.769617] [<ffffffff810a8bbb>] kthread+0xcb/0xf0 [ 989.769774] [<ffffffff816f78da>] ? __schedule+0x24a/0x810 [ 989.769945] [<ffffffff816f78da>] ? __schedule+0x24a/0x810 [ 989.770117] [<ffffffff810a8af0>] ? kthread_create_on_node+0x180/0x180 [ 989.770321] [<ffffffff816fdaa1>] ret_from_fork+0x61/0x90 [ 989.770492] [<ffffffff810a8af0>] ? kthread_create_on_node+0x180/0x180 [ 989.770689] Code: d0 00 00 00 f0 45 7d c0 bf 00 20 00 00 48 89 83 c0 00 00 00 48 89 83 c8 00 00 00 e8 55 c1 8c c0 83 4b 04 10 48 83 c4 08 5b 5d c3 <0f> 0b 0f 1f 84 00 00 00 00 00 55 48 89 e5 41 55 41 54 53 48 83 [ 989.771892] RIP [<ffffffffc07d4316>] __user_dlm_queue_lockres.part.4+0x76/0x80 [ocfs2_dlmfs] [ 989.772174] RSP <ffff88017f7cbcb8> [ 989.772704] ---[ end trace ebd1e38cebcc93a8 ]--- [ 989.772907] Kernel panic - not syncing: Fatal exception [ 989.773173] Kernel Offset: disabled Link: https://lkml.kernel.org/r/20220518235224.87100-2-junxiao.bi@oracle.com Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Joseph Qi <jiangqi903@gmail.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Gang He <ghe@suse.com> Cc: Jun Piao <piaojun@huawei.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-06-09dlm: fix missing lkb refcount handlingAlexander Aring1-2/+9
commit 1689c169134f4b5a39156122d799b7dca76d8ddb upstream. We always call hold_lkb(lkb) if we increment lkb->lkb_wait_count. So, we always need to call unhold_lkb(lkb) if we decrement lkb->lkb_wait_count. This patch will add missing unhold_lkb(lkb) if we decrement lkb->lkb_wait_count. In case of setting lkb->lkb_wait_count to zero we need to countdown until reaching zero and call unhold_lkb(lkb). The waiters list unhold_lkb(lkb) can be removed because it's done for the last lkb_wait_count decrement iteration as it's done in _remove_from_waiters(). This issue was discovered by a dlm gfs2 test case which use excessively dlm_unlock(LKF_CANCEL) feature. Probably the lkb->lkb_wait_count value never reached above 1 if this feature isn't used and so it was not discovered before. The testcase ended in a rsb on the rsb keep data structure with a refcount of 1 but no lkb was associated with it, which is itself an invalid behaviour. A side effect of that was a condition in which the dlm was sending remove messages in a looping behaviour. With this patch that has not been reproduced. Cc: stable@vger.kernel.org Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-06-09dlm: fix wake_up() calls for pending removeAlexander Aring1-2/+2
commit f6f7418357457ed58cbb020fc97e74d4e0e7b47f upstream. This patch move the wake_up() call at the point when a remove message completed. Before it was only when a remove message was going to be sent. The possible waiter in wait_pending_remove() waits until a remove is done if the resource name matches with the per ls variable ls->ls_remove_name. If this is the case we must wait until a pending remove is done which is indicated if DLM_WAIT_PENDING_COND() returns false which will always be the case when ls_remove_len and ls_remove_name are unset to indicate that a remove is not going on anymore. Fixes: 21d9ac1a5376 ("fs: dlm: use event based wait for pending remove") Cc: stable@vger.kernel.org Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-06-09dlm: uninitialized variable on error in dlm_listen_for_all()Dan Carpenter1-1/+1
commit 1f4f10845e14690b02410de50d9ea9684625a4ae upstream. The "sock" variable is not initialized on this error path. Cc: stable@vger.kernel.org Fixes: 2dc6b1158c28 ("fs: dlm: introduce generic listen") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-06-09dlm: fix plock invalid readAlexander Aring1-7/+5
commit 42252d0d2aa9b94d168241710a761588b3959019 upstream. This patch fixes an invalid read showed by KASAN. A unlock will allocate a "struct plock_op" and a followed send_op() will append it to a global send_list data structure. In some cases a followed dev_read() moves it to recv_list and dev_write() will cast it to "struct plock_xop" and access fields which are only available in those structures. At this point an invalid read happens by accessing those fields. To fix this issue the "callback" field is moved to "struct plock_op" to indicate that a cast to "plock_xop" is allowed and does the additional "plock_xop" handling if set. Example of the KASAN output which showed the invalid read: [ 2064.296453] ================================================================== [ 2064.304852] BUG: KASAN: slab-out-of-bounds in dev_write+0x52b/0x5a0 [dlm] [ 2064.306491] Read of size 8 at addr ffff88800ef227d8 by task dlm_controld/7484 [ 2064.308168] [ 2064.308575] CPU: 0 PID: 7484 Comm: dlm_controld Kdump: loaded Not tainted 5.14.0+ #9 [ 2064.310292] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 [ 2064.311618] Call Trace: [ 2064.312218] dump_stack_lvl+0x56/0x7b [ 2064.313150] print_address_description.constprop.8+0x21/0x150 [ 2064.314578] ? dev_write+0x52b/0x5a0 [dlm] [ 2064.315610] ? dev_write+0x52b/0x5a0 [dlm] [ 2064.316595] kasan_report.cold.14+0x7f/0x11b [ 2064.317674] ? dev_write+0x52b/0x5a0 [dlm] [ 2064.318687] dev_write+0x52b/0x5a0 [dlm] [ 2064.319629] ? dev_read+0x4a0/0x4a0 [dlm] [ 2064.320713] ? bpf_lsm_kernfs_init_security+0x10/0x10 [ 2064.321926] vfs_write+0x17e/0x930 [ 2064.322769] ? __fget_light+0x1aa/0x220 [ 2064.323753] ksys_write+0xf1/0x1c0 [ 2064.324548] ? __ia32_sys_read+0xb0/0xb0 [ 2064.325464] do_syscall_64+0x3a/0x80 [ 2064.326387] entry_SYSCALL_64_after_hwframe+0x44/0xae [ 2064.327606] RIP: 0033:0x7f807e4ba96f [ 2064.328470] Code: 89 54 24 18 48 89 74 24 10 89 7c 24 08 e8 39 87 f8 ff 48 8b 54 24 18 48 8b 74 24 10 41 89 c0 8b 7c 24 08 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 31 44 89 c7 48 89 44 24 08 e8 7c 87 f8 ff 48 [ 2064.332902] RSP: 002b:00007ffd50cfe6e0 EFLAGS: 00000293 ORIG_RAX: 0000000000000001 [ 2064.334658] RAX: ffffffffffffffda RBX: 000055cc3886eb30 RCX: 00007f807e4ba96f [ 2064.336275] RDX: 0000000000000040 RSI: 00007ffd50cfe7e0 RDI: 0000000000000010 [ 2064.337980] RBP: 00007ffd50cfe7e0 R08: 0000000000000000 R09: 0000000000000001 [ 2064.339560] R10: 000055cc3886eb30 R11: 0000000000000293 R12: 000055cc3886eb80 [ 2064.341237] R13: 000055cc3886eb00 R14: 000055cc3886f590 R15: 0000000000000001 [ 2064.342857] [ 2064.343226] Allocated by task 12438: [ 2064.344057] kasan_save_stack+0x1c/0x40 [ 2064.345079] __kasan_kmalloc+0x84/0xa0 [ 2064.345933] kmem_cache_alloc_trace+0x13b/0x220 [ 2064.346953] dlm_posix_unlock+0xec/0x720 [dlm] [ 2064.348811] do_lock_file_wait.part.32+0xca/0x1d0 [ 2064.351070] fcntl_setlk+0x281/0xbc0 [ 2064.352879] do_fcntl+0x5e4/0xfe0 [ 2064.354657] __x64_sys_fcntl+0x11f/0x170 [ 2064.356550] do_syscall_64+0x3a/0x80 [ 2064.358259] entry_SYSCALL_64_after_hwframe+0x44/0xae [ 2064.360745] [ 2064.361511] Last potentially related work creation: [ 2064.363957] kasan_save_stack+0x1c/0x40 [ 2064.365811] __kasan_record_aux_stack+0xaf/0xc0 [ 2064.368100] call_rcu+0x11b/0xf70 [ 2064.369785] dlm_process_incoming_buffer+0x47d/0xfd0 [dlm] [ 2064.372404] receive_from_sock+0x290/0x770 [dlm] [ 2064.374607] process_recv_sockets+0x32/0x40 [dlm] [ 2064.377290] process_one_work+0x9a8/0x16e0 [ 2064.379357] worker_thread+0x87/0xbf0 [ 2064.381188] kthread+0x3ac/0x490 [ 2064.383460] ret_from_fork+0x22/0x30 [ 2064.385588] [ 2064.386518] Second to last potentially related work creation: [ 2064.389219] kasan_save_stack+0x1c/0x40 [ 2064.391043] __kasan_record_aux_stack+0xaf/0xc0 [ 2064.393303] call_rcu+0x11b/0xf70 [ 2064.394885] dlm_process_incoming_buffer+0x47d/0xfd0 [dlm] [ 2064.397694] receive_from_sock+0x290/0x770 [dlm] [ 2064.399932] process_recv_sockets+0x32/0x40 [dlm] [ 2064.402180] process_one_work+0x9a8/0x16e0 [ 2064.404388] worker_thread+0x87/0xbf0 [ 2064.406124] kthread+0x3ac/0x490 [ 2064.408021] ret_from_fork+0x22/0x30 [ 2064.409834] [ 2064.410599] The buggy address belongs to the object at ffff88800ef22780 [ 2064.410599] which belongs to the cache kmalloc-96 of size 96 [ 2064.416495] The buggy address is located 88 bytes inside of [ 2064.416495] 96-byte region [ffff88800ef22780, ffff88800ef227e0) [ 2064.422045] The buggy address belongs to the page: [ 2064.424635] page:00000000b6bef8bc refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0xef22 [ 2064.428970] flags: 0xfffffc0000200(slab|node=0|zone=1|lastcpupid=0x1fffff) [ 2064.432515] raw: 000fffffc0000200 ffffea0000d68b80 0000001400000014 ffff888001041780 [ 2064.436110] raw: 0000000000000000 0000000080200020 00000001ffffffff 0000000000000000 [ 2064.439813] page dumped because: kasan: bad access detected [ 2064.442548] [ 2064.443310] Memory state around the buggy address: [ 2064.445988] ffff88800ef22680: 00 00 00 00 00 00 00 00 00 00 00 00 fc fc fc fc [ 2064.449444] ffff88800ef22700: 00 00 00 00 00 00 00 00 00 00 00 00 fc fc fc fc [ 2064.452941] >ffff88800ef22780: 00 00 00 00 00 00 00 00 00 00 00 fc fc fc fc fc [ 2064.456383] ^ [ 2064.459386] ffff88800ef22800: 00 00 00 00 00 00 00 00 00 fc fc fc fc fc fc fc [ 2064.462788] ffff88800ef22880: 00 00 00 00 00 00 00 00 00 00 00 00 fc fc fc fc [ 2064.466239] ================================================================== reproducer in python: import argparse import struct import fcntl import os parser = argparse.ArgumentParser() parser.add_argument('-f', '--file', help='file to use fcntl, must be on dlm lock filesystem e.g. gfs2') args = parser.parse_args() f = open(args.file, 'wb+') lockdata = struct.pack('hhllhh', fcntl.F_WRLCK,0,0,0,0,0) fcntl.fcntl(f, fcntl.F_SETLK, lockdata) lockdata = struct.pack('hhllhh', fcntl.F_UNLCK,0,0,0,0,0) fcntl.fcntl(f, fcntl.F_SETLK, lockdata) Fixes: 586759f03e2e ("gfs2: nfs lock support for gfs2") Cc: stable@vger.kernel.org Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-06-09ext4: avoid cycles in directory h-treeJan Kara1-3/+19
commit 3ba733f879c2a88910744647e41edeefbc0d92b2 upstream. A maliciously corrupted filesystem can contain cycles in the h-tree stored inside a directory. That can easily lead to the kernel corrupting tree nodes that were already verified under its hands while doing a node split and consequently accessing unallocated memory. Fix the problem by verifying traversed block numbers are unique. Cc: stable@vger.kernel.org Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20220518093332.13986-2-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-06-09ext4: verify dir block before splitting itJan Kara1-11/+21
commit 46c116b920ebec58031f0a78c5ea9599b0d2a371 upstream. Before splitting a directory block verify its directory entries are sane so that the splitting code does not access memory it should not. Cc: stable@vger.kernel.org Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20220518093332.13986-1-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-06-09ext4: fix bug_on in __es_tree_searchBaokun Li1-5/+5
commit d36f6ed761b53933b0b4126486c10d3da7751e7f upstream. Hulk Robot reported a BUG_ON: ================================================================== kernel BUG at fs/ext4/extents_status.c:199! [...] RIP: 0010:ext4_es_end fs/ext4/extents_status.c:199 [inline] RIP: 0010:__es_tree_search+0x1e0/0x260 fs/ext4/extents_status.c:217 [...] Call Trace: ext4_es_cache_extent+0x109/0x340 fs/ext4/extents_status.c:766 ext4_cache_extents+0x239/0x2e0 fs/ext4/extents.c:561 ext4_find_extent+0x6b7/0xa20 fs/ext4/extents.c:964 ext4_ext_map_blocks+0x16b/0x4b70 fs/ext4/extents.c:4384 ext4_map_blocks+0xe26/0x19f0 fs/ext4/inode.c:567 ext4_getblk+0x320/0x4c0 fs/ext4/inode.c:980 ext4_bread+0x2d/0x170 fs/ext4/inode.c:1031 ext4_quota_read+0x248/0x320 fs/ext4/super.c:6257 v2_read_header+0x78/0x110 fs/quota/quota_v2.c:63 v2_check_quota_file+0x76/0x230 fs/quota/quota_v2.c:82 vfs_load_quota_inode+0x5d1/0x1530 fs/quota/dquot.c:2368 dquot_enable+0x28a/0x330 fs/quota/dquot.c:2490 ext4_quota_enable fs/ext4/super.c:6137 [inline] ext4_enable_quotas+0x5d7/0x960 fs/ext4/super.c:6163 ext4_fill_super+0xa7c9/0xdc00 fs/ext4/super.c:4754 mount_bdev+0x2e9/0x3b0 fs/super.c:1158 mount_fs+0x4b/0x1e4 fs/super.c:1261 [...] ================================================================== Above issue may happen as follows: ------------------------------------- ext4_fill_super ext4_enable_quotas ext4_quota_enable ext4_iget __ext4_iget ext4_ext_check_inode ext4_ext_check __ext4_ext_check ext4_valid_extent_entries Check for overlapping extents does't take effect dquot_enable vfs_load_quota_inode v2_check_quota_file v2_read_header ext4_quota_read ext4_bread ext4_getblk ext4_map_blocks ext4_ext_map_blocks ext4_find_extent ext4_cache_extents ext4_es_cache_extent ext4_es_cache_extent __es_tree_search ext4_es_end BUG_ON(es->es_lblk + es->es_len < es->es_lblk) The error ext4 extents is as follows: 0af3 0300 0400 0000 00000000 extent_header 00000000 0100 0000 12000000 extent1 00000000 0100 0000 18000000 extent2 02000000 0400 0000 14000000 extent3 In the ext4_valid_extent_entries function, if prev is 0, no error is returned even if lblock<=prev. This was intended to skip the check on the first extent, but in the error image above, prev=0+1-1=0 when checking the second extent, so even though lblock<=prev, the function does not return an error. As a result, bug_ON occurs in __es_tree_search and the system panics. To solve this problem, we only need to check that: 1. The lblock of the first extent is not less than 0. 2. The lblock of the next extent is not less than the next block of the previous extent. The same applies to extent_idx. Cc: stable@kernel.org Fixes: 5946d089379a ("ext4: check for overlapping extents in ext4_valid_extent_entries()") Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20220518120816.1541863-1-libaokun1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-06-09ext4: filter out EXT4_FC_REPLAY from on-disk superblock field s_stateTheodore Ts'o1-2/+3
commit c878bea3c9d724ddfa05a813f30de3d25a0ba83f upstream. The EXT4_FC_REPLAY bit in sbi->s_mount_state is used to indicate that we are in the middle of replay the fast commit journal. This was actually a mistake, since the sbi->s_mount_info is initialized from es->s_state. Arguably s_mount_state is misleadingly named, but the name is historical --- s_mount_state and s_state dates back to ext2. What should have been used is the ext4_{set,clear,test}_mount_flag() inline functions, which sets EXT4_MF_* bits in sbi->s_mount_flags. The problem with using EXT4_FC_REPLAY is that a maliciously corrupted superblock could result in EXT4_FC_REPLAY getting set in s_mount_state. This bypasses some sanity checks, and this can trigger a BUG() in ext4_es_cache_extent(). As a easy-to-backport-fix, filter out the EXT4_FC_REPLAY bit for now. We should eventually transition away from EXT4_FC_REPLAY to something like EXT4_MF_REPLAY. Cc: stable@kernel.org Signed-off-by: Theodore Ts'o <tytso@mit.edu> Link: https://lore.kernel.org/r/20220420192312.1655305-1-phind.uet@gmail.com Link: https://lore.kernel.org/r/20220517174028.942119-1-tytso@mit.edu Reported-by: syzbot+c7358a3cd05ee786eb31@syzkaller.appspotmail.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-06-09ext4: fix bug_on in ext4_writepagesYe Bin1-0/+12
commit ef09ed5d37b84d18562b30cf7253e57062d0db05 upstream. we got issue as follows: EXT4-fs error (device loop0): ext4_mb_generate_buddy:1141: group 0, block bitmap and bg descriptor inconsistent: 25 vs 31513 free cls ------------[ cut here ]------------ kernel BUG at fs/ext4/inode.c:2708! invalid opcode: 0000 [#1] PREEMPT SMP KASAN PTI CPU: 2 PID: 2147 Comm: rep Not tainted 5.18.0-rc2-next-20220413+ #155 RIP: 0010:ext4_writepages+0x1977/0x1c10 RSP: 0018:ffff88811d3e7880 EFLAGS: 00010246 RAX: 0000000000000000 RBX: 0000000000000001 RCX: ffff88811c098000 RDX: 0000000000000000 RSI: ffff88811c098000 RDI: 0000000000000002 RBP: ffff888128140f50 R08: ffffffffb1ff6387 R09: 0000000000000000 R10: 0000000000000007 R11: ffffed10250281ea R12: 0000000000000001 R13: 00000000000000a4 R14: ffff88811d3e7bb8 R15: ffff888128141028 FS: 00007f443aed9740(0000) GS:ffff8883aef00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000020007200 CR3: 000000011c2a4000 CR4: 00000000000006e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> do_writepages+0x130/0x3a0 filemap_fdatawrite_wbc+0x83/0xa0 filemap_flush+0xab/0xe0 ext4_alloc_da_blocks+0x51/0x120 __ext4_ioctl+0x1534/0x3210 __x64_sys_ioctl+0x12c/0x170 do_syscall_64+0x3b/0x90 It may happen as follows: 1. write inline_data inode vfs_write new_sync_write ext4_file_write_iter ext4_buffered_write_iter generic_perform_write ext4_da_write_begin ext4_da_write_inline_data_begin -> If inline data size too small will allocate block to write, then mapping will has dirty page ext4_da_convert_inline_data_to_extent ->clear EXT4_STATE_MAY_INLINE_DATA 2. fallocate do_vfs_ioctl ioctl_preallocate vfs_fallocate ext4_fallocate ext4_convert_inline_data ext4_convert_inline_data_nolock ext4_map_blocks -> fail will goto restore data ext4_restore_inline_data ext4_create_inline_data ext4_write_inline_data ext4_set_inode_state -> set inode EXT4_STATE_MAY_INLINE_DATA 3. writepages __ext4_ioctl ext4_alloc_da_blocks filemap_flush filemap_fdatawrite_wbc do_writepages ext4_writepages if (ext4_has_inline_data(inode)) BUG_ON(ext4_test_inode_state(inode, EXT4_STATE_MAY_INLINE_DATA)) The root cause of this issue is we destory inline data until call ext4_writepages under delay allocation mode. But there maybe already convert from inline to extent. To solve this issue, we call filemap_flush first.. Cc: stable@kernel.org Signed-off-by: Ye Bin <yebin10@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20220516122634.1690462-1-yebin10@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-06-09ext4: fix memory leak in parse_apply_sb_mount_options()Eric Biggers1-2/+4
commit c069db76ed7b681c69159f44be96d2137e9ca989 upstream. If processing the on-disk mount options fails after any memory was allocated in the ext4_fs_context, e.g. s_qf_names, then this memory is leaked. Fix this by calling ext4_fc_free() instead of kfree() directly. Reproducer: mkfs.ext4 -F /dev/vdc tune2fs /dev/vdc -E mount_opts=usrjquota=file echo clear > /sys/kernel/debug/kmemleak mount /dev/vdc /vdc echo scan > /sys/kernel/debug/kmemleak sleep 5 echo scan > /sys/kernel/debug/kmemleak cat /sys/kernel/debug/kmemleak Fixes: 7edfd85b1ffd ("ext4: Completely separate options parsing and sb setup") Cc: stable@vger.kernel.org Signed-off-by: Eric Biggers <ebiggers@google.com> Tested-by: Ritesh Harjani <ritesh.list@gmail.com> Link: https://lore.kernel.org/r/20220513231605.175121-2-ebiggers@kernel.org Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-06-09ext4: fix warning in ext4_handle_inode_extensionYe Bin1-0/+4
commit f4534c9fc94d22383f187b9409abb3f9df2e3db3 upstream. We got issue as follows: EXT4-fs error (device loop0) in ext4_reserve_inode_write:5741: Out of memory EXT4-fs error (device loop0): ext4_setattr:5462: inode #13: comm syz-executor.0: mark_inode_dirty error EXT4-fs error (device loop0) in ext4_setattr:5519: Out of memory EXT4-fs error (device loop0): ext4_ind_map_blocks:595: inode #13: comm syz-executor.0: Can't allocate blocks for non-extent mapped inodes with bigalloc ------------[ cut here ]------------ WARNING: CPU: 1 PID: 4361 at fs/ext4/file.c:301 ext4_file_write_iter+0x11c9/0x1220 Modules linked in: CPU: 1 PID: 4361 Comm: syz-executor.0 Not tainted 5.10.0+ #1 RIP: 0010:ext4_file_write_iter+0x11c9/0x1220 RSP: 0018:ffff924d80b27c00 EFLAGS: 00010282 RAX: ffffffff815a3379 RBX: 0000000000000000 RCX: 000000003b000000 RDX: ffff924d81601000 RSI: 00000000000009cc RDI: 00000000000009cd RBP: 000000000000000d R08: ffffffffbc5a2c6b R09: 0000902e0e52a96f R10: ffff902e2b7c1b40 R11: ffff902e2b7c1b40 R12: 000000000000000a R13: 0000000000000001 R14: ffff902e0e52aa10 R15: ffffffffffffff8b FS: 00007f81a7f65700(0000) GS:ffff902e3bc80000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffffffffff600400 CR3: 000000012db88001 CR4: 00000000003706e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: do_iter_readv_writev+0x2e5/0x360 do_iter_write+0x112/0x4c0 do_pwritev+0x1e5/0x390 __x64_sys_pwritev2+0x7e/0xa0 do_syscall_64+0x37/0x50 entry_SYSCALL_64_after_hwframe+0x44/0xa9 Above issue may happen as follows: Assume inode.i_size=4096 EXT4_I(inode)->i_disksize=4096 step 1: set inode->i_isize = 8192 ext4_setattr if (attr->ia_size != inode->i_size) EXT4_I(inode)->i_disksize = attr->ia_size; rc = ext4_mark_inode_dirty ext4_reserve_inode_write ext4_get_inode_loc __ext4_get_inode_loc sb_getblk --> return -ENOMEM ... if (!error) ->will not update i_size i_size_write(inode, attr->ia_size); Now: inode.i_size=4096 EXT4_I(inode)->i_disksize=8192 step 2: Direct write 4096 bytes ext4_file_write_iter ext4_dio_write_iter iomap_dio_rw ->return error if (extend) ext4_handle_inode_extension WARN_ON_ONCE(i_size_read(inode) < EXT4_I(inode)->i_disksize); ->Then trigger warning. To solve above issue, if mark inode dirty failed in ext4_setattr just set 'EXT4_I(inode)->i_disksize' with old value. Signed-off-by: Ye Bin <yebin10@huawei.com> Link: https://lore.kernel.org/r/20220326065351.761952-1-yebin10@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-06-09ext4: fix race condition between ext4_write and ext4_convert_inline_dataBaokun Li2-13/+6
commit f87c7a4b084afc13190cbb263538e444cb2b392a upstream. Hulk Robot reported a BUG_ON: ================================================================== EXT4-fs error (device loop3): ext4_mb_generate_buddy:805: group 0, block bitmap and bg descriptor inconsistent: 25 vs 31513 free clusters kernel BUG at fs/ext4/ext4_jbd2.c:53! invalid opcode: 0000 [#1] SMP KASAN PTI CPU: 0 PID: 25371 Comm: syz-executor.3 Not tainted 5.10.0+ #1 RIP: 0010:ext4_put_nojournal fs/ext4/ext4_jbd2.c:53 [inline] RIP: 0010:__ext4_journal_stop+0x10e/0x110 fs/ext4/ext4_jbd2.c:116 [...] Call Trace: ext4_write_inline_data_end+0x59a/0x730 fs/ext4/inline.c:795 generic_perform_write+0x279/0x3c0 mm/filemap.c:3344 ext4_buffered_write_iter+0x2e3/0x3d0 fs/ext4/file.c:270 ext4_file_write_iter+0x30a/0x11c0 fs/ext4/file.c:520 do_iter_readv_writev+0x339/0x3c0 fs/read_write.c:732 do_iter_write+0x107/0x430 fs/read_write.c:861 vfs_writev fs/read_write.c:934 [inline] do_pwritev+0x1e5/0x380 fs/read_write.c:1031 [...] ================================================================== Above issue may happen as follows: cpu1 cpu2 __________________________|__________________________ do_pwritev vfs_writev do_iter_write ext4_file_write_iter ext4_buffered_write_iter generic_perform_write ext4_da_write_begin vfs_fallocate ext4_fallocate ext4_convert_inline_data ext4_convert_inline_data_nolock ext4_destroy_inline_data_nolock clear EXT4_STATE_MAY_INLINE_DATA ext4_map_blocks ext4_ext_map_blocks ext4_mb_new_blocks ext4_mb_regular_allocator ext4_mb_good_group_nolock ext4_mb_init_group ext4_mb_init_cache ext4_mb_generate_buddy --> error ext4_test_inode_state(inode, EXT4_STATE_MAY_INLINE_DATA) ext4_restore_inline_data set EXT4_STATE_MAY_INLINE_DATA ext4_block_write_begin ext4_da_write_end ext4_test_inode_state(inode, EXT4_STATE_MAY_INLINE_DATA) ext4_write_inline_data_end handle=NULL ext4_journal_stop(handle) __ext4_journal_stop ext4_put_nojournal(handle) ref_cnt = (unsigned long)handle BUG_ON(ref_cnt == 0) ---> BUG_ON The lock held by ext4_convert_inline_data is xattr_sem, but the lock held by generic_perform_write is i_rwsem. Therefore, the two locks can be concurrent. To solve above issue, we add inode_lock() for ext4_convert_inline_data(). At the same time, move ext4_convert_inline_data() in front of ext4_punch_hole(), remove similar handling from ext4_punch_hole(). Fixes: 0c8d414f163f ("ext4: let fallocate handle inline data correctly") Cc: stable@vger.kernel.org Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20220428134031.4153381-1-libaokun1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-06-09ext4: fix journal_ioprio mount option handlingOjaswin Mujoo1-5/+10
commit e4e58e5df309d695799c494958962100a4c25039 upstream. In __ext4_super() we always overwrote the user specified journal_ioprio value with a default value, expecting parse_apply_sb_mount_options() to later correctly set ctx->journal_ioprio to the user specified value. However, if parse_apply_sb_mount_options() returned early because of empty sbi->es_s->s_mount_opts, the correct journal_ioprio value was never set. This patch fixes __ext4_super() to only use the default value if the user has not specified any value for journal_ioprio. Similarly, the remount behavior was to either use journal_ioprio value specified during initial mount, or use the default value irrespective of the journal_ioprio value specified during remount. This patch modifies this to first check if a new value for ioprio has been passed during remount and apply it. If no new value is passed, use the value specified during initial mount. Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com> Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> Link: https://lore.kernel.org/r/20220418083545.45778-1-ojaswin@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-06-09ext4: fix use-after-free in ext4_rename_dir_prepareYe Bin1-3/+27
commit 0be698ecbe4471fcad80e81ec6a05001421041b3 upstream. We got issue as follows: EXT4-fs (loop0): mounted filesystem without journal. Opts: ,errors=continue ext4_get_first_dir_block: bh->b_data=0xffff88810bee6000 len=34478 ext4_get_first_dir_block: *parent_de=0xffff88810beee6ae bh->b_data=0xffff88810bee6000 ext4_rename_dir_prepare: [1] parent_de=0xffff88810beee6ae ================================================================== BUG: KASAN: use-after-free in ext4_rename_dir_prepare+0x152/0x220 Read of size 4 at addr ffff88810beee6ae by task rep/1895 CPU: 13 PID: 1895 Comm: rep Not tainted 5.10.0+ #241 Call Trace: dump_stack+0xbe/0xf9 print_address_description.constprop.0+0x1e/0x220 kasan_report.cold+0x37/0x7f ext4_rename_dir_prepare+0x152/0x220 ext4_rename+0xf44/0x1ad0 ext4_rename2+0x11c/0x170 vfs_rename+0xa84/0x1440 do_renameat2+0x683/0x8f0 __x64_sys_renameat+0x53/0x60 do_syscall_64+0x33/0x40 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x7f45a6fc41c9 RSP: 002b:00007ffc5a470218 EFLAGS: 00000246 ORIG_RAX: 0000000000000108 RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f45a6fc41c9 RDX: 0000000000000005 RSI: 0000000020000180 RDI: 0000000000000005 RBP: 00007ffc5a470240 R08: 00007ffc5a470160 R09: 0000000020000080 R10: 00000000200001c0 R11: 0000000000000246 R12: 0000000000400bb0 R13: 00007ffc5a470320 R14: 0000000000000000 R15: 0000000000000000 The buggy address belongs to the page: page:00000000440015ce refcount:0 mapcount:0 mapping:0000000000000000 index:0x1 pfn:0x10beee flags: 0x200000000000000() raw: 0200000000000000 ffffea00043ff4c8 ffffea0004325608 0000000000000000 raw: 0000000000000001 0000000000000000 00000000ffffffff 0000000000000000 page dumped because: kasan: bad access detected Memory state around the buggy address: ffff88810beee580: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ffff88810beee600: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff >ffff88810beee680: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ^ ffff88810beee700: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ffff88810beee780: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ================================================================== Disabling lock debugging due to kernel taint ext4_rename_dir_prepare: [2] parent_de->inode=3537895424 ext4_rename_dir_prepare: [3] dir=0xffff888124170140 ext4_rename_dir_prepare: [4] ino=2 ext4_rename_dir_prepare: ent->dir->i_ino=2 parent=-757071872 Reason is first directory entry which 'rec_len' is 34478, then will get illegal parent entry. Now, we do not check directory entry after read directory block in 'ext4_get_first_dir_block'. To solve this issue, check directory entry in 'ext4_get_first_dir_block'. [ Trigger an ext4_error() instead of just warning if the directory is missing a '.' or '..' entry. Also make sure we return an error code if the file system is corrupted. -TYT ] Signed-off-by: Ye Bin <yebin10@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20220414025223.4113128-1-yebin10@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-06-09ext4: mark group as trimmed only if it was fully scannedDmitry Monakhov1-6/+12
commit d63c00ea435a5352f486c259665a4ced60399421 upstream. Otherwise nonaligned fstrim calls will works inconveniently for iterative scanners, for example: // trim [0,16MB] for group-1, but mark full group as trimmed fstrim -o $((1024*1024*128)) -l $((1024*1024*16)) ./m // handle [16MB,16MB] for group-1, do nothing because group already has the flag. fstrim -o $((1024*1024*144)) -l $((1024*1024*16)) ./m [ Update function documentation for ext4_trim_all_free -- TYT ] Signed-off-by: Dmitry Monakhov <dmtrmonakhov@yandex-team.ru> Link: https://lore.kernel.org/r/1650214995-860245-1-git-send-email-dmtrmonakhov@yandex-team.ru Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-06-09fs-writeback: writeback_sb_inodes:Recalculate 'wrote' according skipped pagesZhihao Cheng1-5/+8
commit 68f4c6eba70df70a720188bce95c85570ddfcc87 upstream. Commit 505a666ee3fc ("writeback: plug writeback in wb_writeback() and writeback_inodes_wb()") has us holding a plug during wb_writeback, which may cause a potential ABBA dead lock: wb_writeback fat_file_fsync blk_start_plug(&plug) for (;;) { iter i-1: some reqs have been added into plug->mq_list // LOCK A iter i: progress = __writeback_inodes_wb(wb, work) . writeback_sb_inodes // fat's bdev . __writeback_single_inode . . generic_writepages . . __block_write_full_page . . . . __generic_file_fsync . . . . sync_inode_metadata . . . . writeback_single_inode . . . . __writeback_single_inode . . . . fat_write_inode . . . . __fat_write_inode . . . . sync_dirty_buffer // fat's bdev . . . . lock_buffer(bh) // LOCK B . . . . submit_bh . . . . blk_mq_get_tag // LOCK A . . . trylock_buffer(bh) // LOCK B . . . redirty_page_for_writepage . . . wbc->pages_skipped++ . . --wbc->nr_to_write . wrote += write_chunk - wbc.nr_to_write // wrote > 0 . requeue_inode . redirty_tail_locked if (progress) // progress > 0 continue; iter i+1: queue_io // similar process with iter i, infinite for-loop ! } blk_finish_plug(&plug) // flush plug won't be called Above process triggers a hungtask like: [ 399.044861] INFO: task bb:2607 blocked for more than 30 seconds. [ 399.046824] Not tainted 5.18.0-rc1-00005-gefae4d9eb6a2-dirty [ 399.051539] task:bb state:D stack: 0 pid: 2607 ppid: 2426 flags:0x00004000 [ 399.051556] Call Trace: [ 399.051570] __schedule+0x480/0x1050 [ 399.051592] schedule+0x92/0x1a0 [ 399.051602] io_schedule+0x22/0x50 [ 399.051613] blk_mq_get_tag+0x1d3/0x3c0 [ 399.051640] __blk_mq_alloc_requests+0x21d/0x3f0 [ 399.051657] blk_mq_submit_bio+0x68d/0xca0 [ 399.051674] __submit_bio+0x1b5/0x2d0 [ 399.051708] submit_bio_noacct+0x34e/0x720 [ 399.051718] submit_bio+0x3b/0x150 [ 399.051725] submit_bh_wbc+0x161/0x230 [ 399.051734] __sync_dirty_buffer+0xd1/0x420 [ 399.051744] sync_dirty_buffer+0x17/0x20 [ 399.051750] __fat_write_inode+0x289/0x310 [ 399.051766] fat_write_inode+0x2a/0xa0 [ 399.051783] __writeback_single_inode+0x53c/0x6f0 [ 399.051795] writeback_single_inode+0x145/0x200 [ 399.051803] sync_inode_metadata+0x45/0x70 [ 399.051856] __generic_file_fsync+0xa3/0x150 [ 399.051880] fat_file_fsync+0x1d/0x80 [ 399.051895] vfs_fsync_range+0x40/0xb0 [ 399.051929] __x64_sys_fsync+0x18/0x30 In my test, 'need_resched()' (which is imported by 590dca3a71 "fs-writeback: unplug before cond_resched in writeback_sb_inodes") in function 'writeback_sb_inodes()' seldom comes true, unless cond_resched() is deleted from write_cache_pages(). Fix it by correcting wrote number according number of skipped pages in writeback_sb_inodes(). Goto Link to find a reproducer. Link: https://bugzilla.kernel.org/show_bug.cgi?id=215837 Cc: stable@vger.kernel.org # v4.3 Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20220510133805.1988292-1-chengzhihao1@huawei.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-06-09f2fs: fix to do sanity check for inline inodeChao Yu3-7/+26
commit 677a82b44ebf263d4f9a0cfbd576a6ade797a07b upstream. Yanming reported a kernel bug in Bugzilla kernel [1], which can be reproduced. The bug message is: The kernel message is shown below: kernel BUG at fs/inode.c:611! Call Trace: evict+0x282/0x4e0 __dentry_kill+0x2b2/0x4d0 dput+0x2dd/0x720 do_renameat2+0x596/0x970 __x64_sys_rename+0x78/0x90 do_syscall_64+0x3b/0x90 [1] https://bugzilla.kernel.org/show_bug.cgi?id=215895 The bug is due to fuzzed inode has both inline_data and encrypted flags. During f2fs_evict_inode(), as the inode was deleted by rename(), it will cause inline data conversion due to conflicting flags. The page cache will be polluted and the panic will be triggered in clear_inode(). Try fixing the bug by doing more sanity checks for inline data inode in sanity_check_inode(). Cc: stable@vger.kernel.org Reported-by: Ming Yan <yanming@tju.edu.cn> Signed-off-by: Chao Yu <chao.yu@oppo.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-06-09f2fs: fix fallocate to use file_modified to update permissions consistentlyChao Yu1-0/+4
commit 958ed92922028ec67f504dcdc72bfdfd0f43936a upstream. This patch tries to fix permission consistency issue as all other mainline filesystems. Since the initial introduction of (posix) fallocate back at the turn of the century, it has been possible to use this syscall to change the user-visible contents of files. This can happen by extending the file size during a preallocation, or through any of the newer modes (punch, zero, collapse, insert range). Because the call can be used to change file contents, we should treat it like we do any other modification to a file -- update the mtime, and drop set[ug]id privileges/capabilities. The VFS function file_modified() does all this for us if pass it a locked inode, so let's make fallocate drop permissions correctly. Cc: stable@kernel.org Signed-off-by: Chao Yu <chao.yu@oppo.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-06-09f2fs: don't use casefolded comparison for "." and ".."Eric Biggers3-11/+13
commit b5639bb4313b9d455fc9fc4768d23a5e4ca8cb9d upstream. Tryng to rename a directory that has all following properties fails with EINVAL and triggers the 'WARN_ON_ONCE(!fscrypt_has_encryption_key(dir))' in f2fs_match_ci_name(): - The directory is casefolded - The directory is encrypted - The directory's encryption key is not yet set up - The parent directory is *not* encrypted The problem is incorrect handling of the lookup of ".." to get the parent reference to update. fscrypt_setup_filename() treats ".." (and ".") specially, as it's never encrypted. It's passed through as-is, and setting up the directory's key is not attempted. As the name isn't a no-key name, f2fs treats it as a "normal" name and attempts a casefolded comparison. That breaks the assumption of the WARN_ON_ONCE() in f2fs_match_ci_name() which assumes that for encrypted directories, casefolded comparisons only happen when the directory's key is set up. We could just remove this WARN_ON_ONCE(). However, since casefolding is always a no-op on "." and ".." anyway, let's instead just not casefold these names. This results in the standard bytewise comparison. Fixes: 7ad08a58bf67 ("f2fs: Handle casefolding with Encryption") Cc: <stable@vger.kernel.org> # v5.11+ Signed-off-by: Eric Biggers <ebiggers@google.com> Reviewed-by: Gabriel Krisman Bertazi <krisman@collabora.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-06-09f2fs: fix to do sanity check on total_data_blocksChao Yu3-13/+25
commit 6b8beca0edd32075a769bfe4178ca00c0dcd22a9 upstream. As Yanming reported in bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=215916 The kernel message is shown below: kernel BUG at fs/f2fs/segment.c:2560! Call Trace: allocate_segment_by_default+0x228/0x440 f2fs_allocate_data_block+0x13d1/0x31f0 do_write_page+0x18d/0x710 f2fs_outplace_write_data+0x151/0x250 f2fs_do_write_data_page+0xef9/0x1980 move_data_page+0x6af/0xbc0 do_garbage_collect+0x312f/0x46f0 f2fs_gc+0x6b0/0x3bc0 f2fs_balance_fs+0x921/0x2260 f2fs_write_single_data_page+0x16be/0x2370 f2fs_write_cache_pages+0x428/0xd00 f2fs_write_data_pages+0x96e/0xd50 do_writepages+0x168/0x550 __writeback_single_inode+0x9f/0x870 writeback_sb_inodes+0x47d/0xb20 __writeback_inodes_wb+0xb2/0x200 wb_writeback+0x4bd/0x660 wb_workfn+0x5f3/0xab0 process_one_work+0x79f/0x13e0 worker_thread+0x89/0xf60 kthread+0x26a/0x300 ret_from_fork+0x22/0x30 RIP: 0010:new_curseg+0xe8d/0x15f0 The root cause is: ckpt.valid_block_count is inconsistent with SIT table, stat info indicates filesystem has free blocks, but SIT table indicates filesystem has no free segment. So that during garbage colloection, it triggers panic when LFS allocator fails to find free segment. This patch tries to fix this issue by checking consistency in between ckpt.valid_block_count and block accounted from SIT. Cc: stable@vger.kernel.org Reported-by: Ming Yan <yanming@tju.edu.cn> Signed-off-by: Chao Yu <chao.yu@oppo.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-06-09f2fs: don't need inode lock for system hidden quotaJaegeuk Kim1-2/+4
commit 6213f5d4d23c50d393a31dc8e351e63a1fd10dbe upstream. Let's avoid false-alarmed lockdep warning. [ 58.914674] [T1501146] -> #2 (&sb->s_type->i_mutex_key#20){+.+.}-{3:3}: [ 58.915975] [T1501146] system_server: down_write+0x7c/0xe0 [ 58.916738] [T1501146] system_server: f2fs_quota_sync+0x60/0x1a8 [ 58.917563] [T1501146] system_server: block_operations+0x16c/0x43c [ 58.918410] [T1501146] system_server: f2fs_write_checkpoint+0x114/0x318 [ 58.919312] [T1501146] system_server: f2fs_issue_checkpoint+0x178/0x21c [ 58.920214] [T1501146] system_server: f2fs_sync_fs+0x48/0x6c [ 58.920999] [T1501146] system_server: f2fs_do_sync_file+0x334/0x738 [ 58.921862] [T1501146] system_server: f2fs_sync_file+0x30/0x48 [ 58.922667] [T1501146] system_server: __arm64_sys_fsync+0x84/0xf8 [ 58.923506] [T1501146] system_server: el0_svc_common.llvm.12821150825140585682+0xd8/0x20c [ 58.924604] [T1501146] system_server: do_el0_svc+0x28/0xa0 [ 58.925366] [T1501146] system_server: el0_svc+0x24/0x38 [ 58.926094] [T1501146] system_server: el0_sync_handler+0x88/0xec [ 58.926920] [T1501146] system_server: el0_sync+0x1b4/0x1c0 [ 58.927681] [T1501146] -> #1 (&sbi->cp_global_sem){+.+.}-{3:3}: [ 58.928889] [T1501146] system_server: down_write+0x7c/0xe0 [ 58.929650] [T1501146] system_server: f2fs_write_checkpoint+0xbc/0x318 [ 58.930541] [T1501146] system_server: f2fs_issue_checkpoint+0x178/0x21c [ 58.931443] [T1501146] system_server: f2fs_sync_fs+0x48/0x6c [ 58.932226] [T1501146] system_server: sync_filesystem+0xac/0x130 [ 58.933053] [T1501146] system_server: generic_shutdown_super+0x38/0x150 [ 58.933958] [T1501146] system_server: kill_block_super+0x24/0x58 [ 58.934791] [T1501146] system_server: kill_f2fs_super+0xcc/0x124 [ 58.935618] [T1501146] system_server: deactivate_locked_super+0x90/0x120 [ 58.936529] [T1501146] system_server: deactivate_super+0x74/0xac [ 58.937356] [T1501146] system_server: cleanup_mnt+0x128/0x168 [ 58.938150] [T1501146] system_server: __cleanup_mnt+0x18/0x28 [ 58.938944] [T1501146] system_server: task_work_run+0xb8/0x14c [ 58.939749] [T1501146] system_server: do_notify_resume+0x114/0x1e8 [ 58.940595] [T1501146] system_server: work_pending+0xc/0x5f0 [ 58.941375] [T1501146] -> #0 (&sbi->gc_lock){+.+.}-{3:3}: [ 58.942519] [T1501146] system_server: __lock_acquire+0x1270/0x2868 [ 58.943366] [T1501146] system_server: lock_acquire+0x114/0x294 [ 58.944169] [T1501146] system_server: down_write+0x7c/0xe0 [ 58.944930] [T1501146] system_server: f2fs_issue_checkpoint+0x13c/0x21c [ 58.945831] [T1501146] system_server: f2fs_sync_fs+0x48/0x6c [ 58.946614] [T1501146] system_server: f2fs_do_sync_file+0x334/0x738 [ 58.947472] [T1501146] system_server: f2fs_ioc_commit_atomic_write+0xc8/0x14c [ 58.948439] [T1501146] system_server: __f2fs_ioctl+0x674/0x154c [ 58.949253] [T1501146] system_server: f2fs_ioctl+0x54/0x88 [ 58.950018] [T1501146] system_server: __arm64_sys_ioctl+0xa8/0x110 [ 58.950865] [T1501146] system_server: el0_svc_common.llvm.12821150825140585682+0xd8/0x20c [ 58.951965] [T1501146] system_server: do_el0_svc+0x28/0xa0 [ 58.952727] [T1501146] system_server: el0_svc+0x24/0x38 [ 58.953454] [T1501146] system_server: el0_sync_handler+0x88/0xec [ 58.954279] [T1501146] system_server: el0_sync+0x1b4/0x1c0 Cc: stable@vger.kernel.org Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-06-09f2fs: fix deadloop in foreground GCChao Yu1-12/+20
commit cfd66bb715fd11fde3338d0660cffa1396adc27d upstream. As Yanming reported in bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=215914 The root cause is: in a very small sized image, it's very easy to exceed threshold of foreground GC, if we calculate free space and dirty data based on section granularity, in corner case, has_not_enough_free_secs() will always return true, result in deadloop in f2fs_gc(). So this patch refactors has_not_enough_free_secs() as below to fix this issue: 1. calculate needed space based on block granularity, and separate all blocks to two parts, section part, and block part, comparing section part to free section, and comparing block part to free space in openned log. 2. account F2FS_DIRTY_NODES, F2FS_DIRTY_IMETA and F2FS_DIRTY_DENTS as node block consumer; 3. account F2FS_DIRTY_DENTS as data block consumer; Cc: stable@vger.kernel.org Reported-by: Ming Yan <yanming@tju.edu.cn> Signed-off-by: Chao Yu <chao.yu@oppo.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-06-09f2fs: fix to clear dirty inode in f2fs_evict_inode()Chao Yu1-1/+15
commit f2db71053dc0409fae785096ad19cce4c8a95af7 upstream. As Yanming reported in bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=215904 The kernel message is shown below: kernel BUG at fs/f2fs/inode.c:825! Call Trace: evict+0x282/0x4e0 __dentry_kill+0x2b2/0x4d0 shrink_dentry_list+0x17c/0x4f0 shrink_dcache_parent+0x143/0x1e0 do_one_tree+0x9/0x30 shrink_dcache_for_umount+0x51/0x120 generic_shutdown_super+0x5c/0x3a0 kill_block_super+0x90/0xd0 kill_f2fs_super+0x225/0x310 deactivate_locked_super+0x78/0xc0 cleanup_mnt+0x2b7/0x480 task_work_run+0xc8/0x150 exit_to_user_mode_prepare+0x14a/0x150 syscall_exit_to_user_mode+0x1d/0x40 do_syscall_64+0x48/0x90 The root cause is: inode node and dnode node share the same nid, so during f2fs_evict_inode(), dnode node truncation will invalidate its NAT entry, so when truncating inode node, it fails due to invalid NAT entry, result in inode is still marked as dirty, fix this issue by clearing dirty for inode and setting SBI_NEED_FSCK flag in filesystem. output from dump.f2fs: [print_node_info: 354] Node ID [0xf:15] is inode i_nid[0] [0x f : 15] Cc: stable@vger.kernel.org Reported-by: Ming Yan <yanming@tju.edu.cn> Signed-off-by: Chao Yu <chao.yu@oppo.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-06-09f2fs: fix to do sanity check on block address in f2fs_do_zero_range()Chao Yu1-4/+12
commit 25f8236213a91efdf708b9d77e9e51b6fc3e141c upstream. As Yanming reported in bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=215894 I have encountered a bug in F2FS file system in kernel v5.17. I have uploaded the system call sequence as case.c, and a fuzzed image can be found in google net disk The kernel should enable CONFIG_KASAN=y and CONFIG_KASAN_INLINE=y. You can reproduce the bug by running the following commands: kernel BUG at fs/f2fs/segment.c:2291! Call Trace: f2fs_invalidate_blocks+0x193/0x2d0 f2fs_fallocate+0x2593/0x4a70 vfs_fallocate+0x2a5/0xac0 ksys_fallocate+0x35/0x70 __x64_sys_fallocate+0x8e/0xf0 do_syscall_64+0x3b/0x90 entry_SYSCALL_64_after_hwframe+0x44/0xae The root cause is, after image was fuzzed, block mapping info in inode will be inconsistent with SIT table, so in f2fs_fallocate(), it will cause panic when updating SIT with invalid blkaddr. Let's fix the issue by adding sanity check on block address before updating SIT table with it. Cc: stable@vger.kernel.org Reported-by: Ming Yan <yanming@tju.edu.cn> Signed-off-by: Chao Yu <chao.yu@oppo.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-06-09f2fs: fix to avoid f2fs_bug_on() in dec_valid_node_count()Chao Yu1-4/+10
commit 4d17e6fe9293d57081ffdc11e1cf313e25e8fd9e upstream. As Yanming reported in bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=215897 I have encountered a bug in F2FS file system in kernel v5.17. The kernel should enable CONFIG_KASAN=y and CONFIG_KASAN_INLINE=y. You can reproduce the bug by running the following commands: The kernel message is shown below: kernel BUG at fs/f2fs/f2fs.h:2511! Call Trace: f2fs_remove_inode_page+0x2a2/0x830 f2fs_evict_inode+0x9b7/0x1510 evict+0x282/0x4e0 do_unlinkat+0x33a/0x540 __x64_sys_unlinkat+0x8e/0xd0 do_syscall_64+0x3b/0x90 entry_SYSCALL_64_after_hwframe+0x44/0xae The root cause is: .total_valid_block_count or .total_valid_node_count could fuzzed to zero, then once dec_valid_node_count() was called, it will cause BUG_ON(), this patch fixes to print warning info and set SBI_NEED_FSCK into CP instead of panic. Cc: stable@vger.kernel.org Reported-by: Ming Yan <yanming@tju.edu.cn> Signed-off-by: Chao Yu <chao.yu@oppo.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-06-09NFSv4.1 mark qualified async operations as MOVEABLE tasksOlga Kornievskaia4-12/+29
[ Upstream commit 118f09eda21d392e1eeb9f8a4bee044958cccf20 ] Mark async operations such as RENAME, REMOVE, COMMIT MOVEABLE for the nfsv4.1+ sessions. Fixes: 85e39feead948 ("NFSv4.1 identify and mark RPC tasks that can move between transports") Signed-off-by: Olga Kornievskaia <kolga@netapp.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: Sasha Levin <sashal@kernel.org>