summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)AuthorFilesLines
2021-07-20nfs: fix acl memory leak of posix_acl_create()Gao Xiang1-2/+2
[ Upstream commit 1fcb6fcd74a222d9ead54d405842fc763bb86262 ] When looking into another nfs xfstests report, I found acl and default_acl in nfs3_proc_create() and nfs3_proc_mknod() error paths are possibly leaked. Fix them in advance. Fixes: 013cdf1088d7 ("nfs: use generic posix ACL infrastructure for v3 Posix ACLs") Cc: Trond Myklebust <trond.myklebust@hammerspace.com> Cc: Anna Schumaker <anna.schumaker@netapp.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Joseph Qi <joseph.qi@linux.alibaba.com> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-07-20ubifs: journal: Fix error return code in ubifs_jnl_write_inode()Zhen Lei1-0/+1
[ Upstream commit a2c2a622d41168f9fea2aa3f76b9fbaa88531aac ] Fix to return a negative error code from the error handling case instead of 0, as done elsewhere in this function. Fixes: 9ca2d7326444 ("ubifs: Limit number of xattrs per inode") Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com> Signed-off-by: Richard Weinberger <richard@nod.at> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-07-20NFSv4: Fix an Oops in pnfs_mark_request_commit() when doing O_DIRECTTrond Myklebust1-10/+7
[ Upstream commit 3731d44bba8e0116b052b1b374476c5f6dd9a456 ] Fix an Oopsable condition in pnfs_mark_request_commit() when we're putting a set of writes on the commit list to reschedule them after a failed pNFS attempt. Fixes: 9c455a8c1e14 ("NFS/pNFS: Clean up pNFS commit operations") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-07-20NFSv4: Initialise connection to the server in nfs4_alloc_client()Trond Myklebust1-40/+42
[ Upstream commit dd99e9f98fbf423ff6d365b37a98e8879170f17c ] Set up the connection to the NFSv4 server in nfs4_alloc_client(), before we've added the struct nfs_client to the net-namespace's nfs_client_list so that a downed server won't cause other mounts to hang in the trunking detection code. Reported-by: Michael Wakabayashi <mwakabayashi@vmware.com> Fixes: 5c6e5b60aae4 ("NFS: Fix an Oops in the pNFS files and flexfiles connection setup to the DS") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-07-20f2fs: compress: fix to disallow temp extensionChao Yu1-4/+12
[ Upstream commit 4a67d9b07ac8dce7f1034e0d887f2f4ee00fe118 ] This patch restricts to configure compress extension as format of: [filename + '.' + extension] rather than: [filename + '.' + extension + (optional: '.' + temp extension)] in order to avoid to enable compression incorrectly: 1. compress_extension=so 2. touch file.soa 3. touch file.so.tmp Fixes: 4c8ff7095bef ("f2fs: support data compression") Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-07-20f2fs: add MODULE_SOFTDEP to ensure crc32 is included in the initramfsChao Yu1-0/+1
[ Upstream commit 0dd571785d61528d62cdd8aa49d76bc6085152fe ] As marcosfrm reported in bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=213089 Initramfs generators rely on "pre" softdeps (and "depends") to include additional required modules. F2FS does not declare "pre: crc32" softdep. Then every generator (dracut, mkinitcpio...) has to maintain a hardcoded list for this purpose. Hence let's use MODULE_SOFTDEP("pre: crc32") in f2fs code. Fixes: 43b6573bac95 ("f2fs: use cryptoapi crc32 functions") Reported-by: marcosfrm <marcosfrm@gmail.com> Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-07-20NFSD: Fix TP_printk() format specifier in nfsd_clid_classChuck Lever2-32/+0
[ Upstream commit a948b1142cae66785521a389cab2cce74069b547 ] Since commit 9a6944fee68e ("tracing: Add a verifier to check string pointers for trace events"), which was merged in v5.13-rc1, TP_printk() no longer tacitly supports the "%.*s" format specifier. These are low value tracepoints, so just remove them. Reported-by: David Wysochanski <dwysocha@redhat.com> Fixes: dd5e3fbc1f47 ("NFSD: Add tracepoints to the NFSD state management code") Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Cc: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-07-20f2fs: atgc: fix to set default age thresholdChao Yu1-0/+1
[ Upstream commit 89e53ff1651a61cf2abef9356e2f60d0086215be ] Default age threshold value is missed to set, fix it. Fixes: 093749e296e2 ("f2fs: support age threshold based garbage collection") Reported-by: Sahitya Tummala <stummala@codeaurora.org> Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-07-20ext4: fix WARN_ON_ONCE(!buffer_uptodate) after an error writing the superblockYe Bin2-3/+11
[ Upstream commit 558d6450c7755aa005d89021204b6cdcae5e848f ] If a writeback of the superblock fails with an I/O error, the buffer is marked not uptodate. However, this can cause a WARN_ON to trigger when we attempt to write superblock a second time. (Which might succeed this time, for cerrtain types of block devices such as iSCSI devices over a flaky network.) Try to detect this case in flush_stashed_error_work(), and also change __ext4_handle_dirty_metadata() so we always set the uptodate flag, not just in the nojournal case. Before this commit, this problem can be repliciated via: 1. dmsetup create dust1 --table '0 2097152 dust /dev/sdc 0 4096' 2. mount /dev/mapper/dust1 /home/test 3. dmsetup message dust1 0 addbadblock 0 10 4. cd /home/test 5. echo "XXXXXXX" > t After a few seconds, we got following warning: [ 80.654487] end_buffer_async_write: bh=0xffff88842f18bdd0 [ 80.656134] Buffer I/O error on dev dm-0, logical block 0, lost async page write [ 85.774450] EXT4-fs error (device dm-0): ext4_check_bdev_write_error:193: comm kworker/u16:8: Error while async write back metadata [ 91.415513] mark_buffer_dirty: bh=0xffff88842f18bdd0 [ 91.417038] ------------[ cut here ]------------ [ 91.418450] WARNING: CPU: 1 PID: 1944 at fs/buffer.c:1092 mark_buffer_dirty.cold+0x1c/0x5e [ 91.440322] Call Trace: [ 91.440652] __jbd2_journal_temp_unlink_buffer+0x135/0x220 [ 91.441354] __jbd2_journal_unfile_buffer+0x24/0x90 [ 91.441981] __jbd2_journal_refile_buffer+0x134/0x1d0 [ 91.442628] jbd2_journal_commit_transaction+0x249a/0x3240 [ 91.443336] ? put_prev_entity+0x2a/0x200 [ 91.443856] ? kjournald2+0x12e/0x510 [ 91.444324] kjournald2+0x12e/0x510 [ 91.444773] ? woken_wake_function+0x30/0x30 [ 91.445326] kthread+0x150/0x1b0 [ 91.445739] ? commit_timeout+0x20/0x20 [ 91.446258] ? kthread_flush_worker+0xb0/0xb0 [ 91.446818] ret_from_fork+0x1f/0x30 [ 91.447293] ---[ end trace 66f0b6bf3d1abade ]--- Signed-off-by: Ye Bin <yebin10@huawei.com> Link: https://lore.kernel.org/r/20210615090537.3423231-1-yebin10@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-07-20NFS: nfs_find_open_context() may only select open filesTrond Myklebust1-0/+4
[ Upstream commit e97bc66377bca097e1f3349ca18ca17f202ff659 ] If a file has already been closed, then it should not be selected to support further I/O. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> [Trond: Fix an invalid pointer deref reported by Colin Ian King] Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-07-20ceph: remove bogus checks and WARN_ONs from ceph_set_page_dirtyJeff Layton1-9/+1
[ Upstream commit 22d41cdcd3cfd467a4af074165357fcbea1c37f5 ] The checks for page->mapping are odd, as set_page_dirty is an address_space operation, and I don't see where it would be called on a non-pagecache page. The warning about the page lock also seems bogus. The comment over set_page_dirty() says that it can be called without the page lock in some rare cases. I don't think we want to warn if that's the case. Reported-by: Matthew Wilcox <willy@infradead.org> Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-07-20orangefs: fix orangefs df output.Mike Marshall1-1/+1
[ Upstream commit 0fdec1b3c9fbb5e856a40db5993c9eaf91c74a83 ] Orangefs df output is whacky. Walt Ligon suggested this might fix it. It seems way more in line with reality now... Signed-off-by: Mike Marshall <hubcap@omnibond.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-07-20NFSv4: Fix handling of non-atomic change attrbute updatesTrond Myklebust1-18/+15
[ Upstream commit 20cf7d4ea4ad7d9830b01ff7444f6ac64a727a23 ] If the change attribute update is declared to be non-atomic by the server, or our cached value does not match the server's value before the operation was performed, then we should declare the inode cache invalid. On the other hand, if the change to the directory raced with a lookup or getattr which already updated the change attribute, then optimise away the revalidation. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-07-20NFS: Fix up inode attribute revalidation timeoutsTrond Myklebust1-34/+16
[ Upstream commit 213bb58475b57786e4336bc8bfd5029e16257c49 ] The inode is considered revalidated when we've checked the value of the change attribute against our cached value since that suffices to establish whether or not the other cached values are valid. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-07-20fuse: fix illegal access to inode with reused nodeidAmir Goldstein4-5/+15
[ Upstream commit 15db16837a35d8007cb8563358787412213db25e ] Server responds to LOOKUP and other ops (READDIRPLUS/CREATE/MKNOD/...) with ourarg containing nodeid and generation. If a fuse inode is found in inode cache with the same nodeid but different generation, the existing fuse inode should be unhashed and marked "bad" and a new inode with the new generation should be hashed instead. This can happen, for example, with passhrough fuse filesystem that returns the real filesystem ino/generation on lookup and where real inode numbers can get recycled due to real files being unlinked not via the fuse passthrough filesystem. With current code, this situation will not be detected and an old fuse dentry that used to point to an older generation real inode, can be used to access a completely new inode, which should be accessed only via the new dentry. Note that because the FORGET message carries the nodeid w/o generation, the server should wait to get FORGET counts for the nlookup counts of the old and reused inodes combined, before it can free the resources associated to that nodeid. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-07-20virtiofs: propagate sync() to file serverGreg Kurz3-0/+44
[ Upstream commit 2d82ab251ef0f6e7716279b04e9b5a01a86ca530 ] Even if POSIX doesn't mandate it, linux users legitimately expect sync() to flush all data and metadata to physical storage when it is located on the same system. This isn't happening with virtiofs though: sync() inside the guest returns right away even though data still needs to be flushed from the host page cache. This is easily demonstrated by doing the following in the guest: $ dd if=/dev/zero of=/mnt/foo bs=1M count=5K ; strace -T -e sync sync 5120+0 records in 5120+0 records out 5368709120 bytes (5.4 GB, 5.0 GiB) copied, 5.22224 s, 1.0 GB/s sync() = 0 <0.024068> and start the following in the host when the 'dd' command completes in the guest: $ strace -T -e fsync /usr/bin/sync virtiofs/foo fsync(3) = 0 <10.371640> There are no good reasons not to honor the expected behavior of sync() actually: it gives an unrealistic impression that virtiofs is super fast and that data has safely landed on HW, which isn't the case obviously. Implement a ->sync_fs() superblock operation that sends a new FUSE_SYNCFS request type for this purpose. Provision a 64-bit placeholder for possible future extensions. Since the file server cannot handle the wait == 0 case, we skip it to avoid a gratuitous roundtrip. Note that this is per-superblock: a FUSE_SYNCFS is send for the root mount and for each submount. Like with FUSE_FSYNC and FUSE_FSYNCDIR, lack of support for FUSE_SYNCFS in the file server is treated as permanent success. This ensures compatibility with older file servers: the client will get the current behavior of sync() not being propagated to the file server. Note that such an operation allows the file server to DoS sync(). Since a typical FUSE file server is an untrusted piece of software running in userspace, this is disabled by default. Only enable it with virtiofs for now since virtiofsd is supposedly trusted by the guest kernel. Reported-by: Robert Krawitz <rlk@redhat.com> Signed-off-by: Greg Kurz <groug@kaod.org> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-07-20NFSv4: Fix delegation return in cases where we have to retryTrond Myklebust3-15/+58
[ Upstream commit be20037725d17935ec669044bd2b15bc40c3b5ab ] If we're unable to immediately recover all locks because the server is unable to immediately service our reclaim calls, then we want to retry after we've finished servicing all the other asynchronous delegation returns on our queue. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-07-20fs/jfs: Fix missing error code in lmLogInit()Jiapeng Chong1-0/+1
[ Upstream commit 492109333c29e1bb16d8732e1d597b02e8e0bf2e ] The error code is missing in this code scenario, add the error code '-EINVAL' to the return value 'rc. Eliminate the follow smatch warning: fs/jfs/jfs_logmgr.c:1327 lmLogInit() warn: missing error code 'rc'. Reported-by: Abaci Robot <abaci@linux.alibaba.com> Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com> Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-07-20btrfs: zoned: fix wrong mutex unlock on failure to allocate log root treeFilipe Manana1-1/+1
commit ea32af47f00a046a1f953370514d6d946efe0152 upstream. When syncing the log, if we fail to allocate the root node for the log root tree: 1) We are unlocking fs_info->tree_log_mutex, but at this point we have not yet locked this mutex; 2) We have locked fs_info->tree_root->log_mutex, but we end up not unlocking it; So fix this by unlocking fs_info->tree_root->log_mutex instead of fs_info->tree_log_mutex. Fixes: e75f9fd194090e ("btrfs: zoned: move log tree node allocation out of log_root_tree->log_mutex") CC: stable@vger.kernel.org # 5.13+ Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-07-20btrfs: don't block if we can't acquire the reclaim lockJohannes Thumshirn1-1/+9
commit 9cc0b837e14ae913581ec1ea6e979a738f71b0fd upstream. If we can't acquire the reclaim_bgs_lock on block group reclaim, we block until it is free. This can potentially stall for a long time. While reclaim of block groups is necessary for a good user experience on a zoned file system, there still is no need to block as it is best effort only, just like when we're deleting unused block groups. CC: stable@vger.kernel.org # 5.13 Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-07-20btrfs: rework chunk allocation to avoid exhaustion of the system chunk arrayFilipe Manana7-184/+546
commit 79bd37120b149532af5b21953643ed74af69654f upstream. Commit eafa4fd0ad0607 ("btrfs: fix exhaustion of the system chunk array due to concurrent allocations") fixed a problem that resulted in exhausting the system chunk array in the superblock when there are many tasks allocating chunks in parallel. Basically too many tasks enter the first phase of chunk allocation without previous tasks having finished their second phase of allocation, resulting in too many system chunks being allocated. That was originally observed when running the fallocate tests of stress-ng on a PowerPC machine, using a node size of 64K. However that commit also introduced a deadlock where a task in phase 1 of the chunk allocation waited for another task that had allocated a system chunk to finish its phase 2, but that other task was waiting on an extent buffer lock held by the first task, therefore resulting in both tasks not making any progress. That change was later reverted by a patch with the subject "btrfs: fix deadlock with concurrent chunk allocations involving system chunks", since there is no simple and short solution to address it and the deadlock is relatively easy to trigger on zoned filesystems, while the system chunk array exhaustion is not so common. This change reworks the chunk allocation to avoid the system chunk array exhaustion. It accomplishes that by making the first phase of chunk allocation do the updates of the device items in the chunk btree and the insertion of the new chunk item in the chunk btree. This is done while under the protection of the chunk mutex (fs_info->chunk_mutex), in the same critical section that checks for available system space, allocates a new system chunk if needed and reserves system chunk space. This way we do not have chunk space reserved until the second phase completes. The same logic is applied to chunk removal as well, since it keeps reserved system space long after it is done updating the chunk btree. For direct allocation of system chunks, the previous behaviour remains, because otherwise we would deadlock on extent buffers of the chunk btree. Changes to the chunk btree are by large done by chunk allocation and chunk removal, which first reserve chunk system space and then later do changes to the chunk btree. The other remaining cases are uncommon and correspond to adding a device, removing a device and resizing a device. All these other cases do not pre-reserve system space, they modify the chunk btree right away, so they don't hold reserved space for a long period like chunk allocation and chunk removal do. The diff of this change is huge, but more than half of it is just addition of comments describing both how things work regarding chunk allocation and removal, including both the new behavior and the parts of the old behavior that did not change. CC: stable@vger.kernel.org # 5.12+ Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> Tested-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Tested-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-07-20btrfs: fix deadlock with concurrent chunk allocations involving system chunksFilipe Manana3-69/+1
commit 1cb3db1cf383a3c7dbda1aa0ce748b0958759947 upstream. When a task attempting to allocate a new chunk verifies that there is not currently enough free space in the system space_info and there is another task that allocated a new system chunk but it did not finish yet the creation of the respective block group, it waits for that other task to finish creating the block group. This is to avoid exhaustion of the system chunk array in the superblock, which is limited, when we have a thundering herd of tasks allocating new chunks. This problem was described and fixed by commit eafa4fd0ad0607 ("btrfs: fix exhaustion of the system chunk array due to concurrent allocations"). However there are two very similar scenarios where this can lead to a deadlock: 1) Task B allocated a new system chunk and task A is waiting on task B to finish creation of the respective system block group. However before task B ends its transaction handle and finishes the creation of the system block group, it attempts to allocate another chunk (like a data chunk for an fallocate operation for a very large range). Task B will be unable to progress and allocate the new chunk, because task A set space_info->chunk_alloc to 1 and therefore it loops at btrfs_chunk_alloc() waiting for task A to finish its chunk allocation and set space_info->chunk_alloc to 0, but task A is waiting on task B to finish creation of the new system block group, therefore resulting in a deadlock; 2) Task B allocated a new system chunk and task A is waiting on task B to finish creation of the respective system block group. By the time that task B enter the final phase of block group allocation, which happens at btrfs_create_pending_block_groups(), when it modifies the extent tree, the device tree or the chunk tree to insert the items for some new block group, it needs to allocate a new chunk, so it ends up at btrfs_chunk_alloc() and keeps looping there because task A has set space_info->chunk_alloc to 1, but task A is waiting for task B to finish creation of the new system block group and release the reserved system space, therefore resulting in a deadlock. In short, the problem is if a task B needs to allocate a new chunk after it previously allocated a new system chunk and if another task A is currently waiting for task B to complete the allocation of the new system chunk. Unfortunately this deadlock scenario introduced by the previous fix for the system chunk array exhaustion problem does not have a simple and short fix, and requires a big change to rework the chunk allocation code so that chunk btree updates are all made in the first phase of chunk allocation. And since this deadlock regression is being frequently hit on zoned filesystems and the system chunk array exhaustion problem is triggered in more extreme cases (originally observed on PowerPC with a node size of 64K when running the fallocate tests from stress-ng), revert the changes from that commit. The next patch in the series, with a subject of "btrfs: rework chunk allocation to avoid exhaustion of the system chunk array" does the necessary changes to fix the system chunk array exhaustion problem. Reported-by: Naohiro Aota <naohiro.aota@wdc.com> Link: https://lore.kernel.org/linux-btrfs/20210621015922.ewgbffxuawia7liz@naota-xeon/ Fixes: eafa4fd0ad0607 ("btrfs: fix exhaustion of the system chunk array due to concurrent allocations") CC: stable@vger.kernel.org # 5.12+ Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> Tested-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Tested-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-07-20btrfs: zoned: fix types for u64 division in btrfs_reclaim_bgs_workDavid Sterba1-1/+1
commit 54afaae34ee49e98c1c902b444b42832551d090c upstream. The types in calculation of the used percentage in the reclaiming messages are both u64, though bg->length is either 1GiB (non-zoned) or the zone size in the zoned mode. The upper limit on zone size is 8GiB so this could theoretically overflow in the future, right now the values fit. Fixes: 18bb8bbf13c1 ("btrfs: zoned: automatically reclaim zones") CC: stable@vger.kernel.org # 5.13 Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-07-20btrfs: properly split extent_map for REQ_OP_ZONE_APPENDNaohiro Aota1-29/+118
commit abb99cfdaf0759f8a619e5fecf52ccccdf310c8c upstream. Damien reported a test failure with btrfs/209. The test itself ran fine, but the fsck ran afterwards reported a corrupted filesystem. The filesystem corruption happens because we're splitting an extent and then writing the extent twice. We have to split the extent though, because we're creating too large extents for a REQ_OP_ZONE_APPEND operation. When dumping the extent tree, we can see two EXTENT_ITEMs at the same start address but different lengths. $ btrfs inspect dump-tree /dev/nullb1 -t extent ... item 19 key (269484032 EXTENT_ITEM 126976) itemoff 15470 itemsize 53 refs 1 gen 7 flags DATA extent data backref root FS_TREE objectid 257 offset 786432 count 1 item 20 key (269484032 EXTENT_ITEM 262144) itemoff 15417 itemsize 53 refs 1 gen 7 flags DATA extent data backref root FS_TREE objectid 257 offset 786432 count 1 The duplicated EXTENT_ITEMs originally come from wrongly split extent_map in extract_ordered_extent(). Since extract_ordered_extent() uses create_io_em() to split an existing extent_map, we will have split->orig_start != split->start. Then, it will be logged with non-zero "extent data offset". Finally, the logged entries are replayed into a duplicated EXTENT_ITEM. Introduce and use proper splitting function for extent_map. The function is intended to be simple and specific usage for extract_ordered_extent() e.g. not supporting compression case (we do not allow splitting compressed extent_map anyway). There was a question raised by Qu, in summary why we want to split the extent map (and not the bio): The problem is not the limit on the zone end, which as you mention is the same as the block group end. The problem is that data write use zone append (ZA) operations. ZA BIOs cannot be split so a large extent may need to be processed with multiple ZA BIOs, While that is also true for regular writes, the major difference is that ZA are "nameless" write operation giving back the written sectors on completion. And ZA operations may be reordered by the block layer (not intentionally though). Combine both of these characteristics and you can see that the data for a large extent may end up being shuffled when written resulting in data corruption and the impossibility to map the extent to some start sector. To avoid this problem, zoned btrfs uses the principle "one data extent == one ZA BIO". So large extents need to be split. This is unfortunate, but we can revisit this later and optimize, e.g. merge back together the fragments of an extent once written if they actually were written sequentially in the zone. Reported-by: Damien Le Moal <damien.lemoal@wdc.com> Fixes: d22002fd37bd ("btrfs: zoned: split ordered extent when bio is sent") CC: stable@vger.kernel.org # 5.12+ CC: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-07-20io_uring: use right task for exiting checksPavel Begunkov1-1/+1
commit 9c6882608bce249a8918744ecdb65748534e3f17 upstream. When we use delayed_work for fallback execution of requests, current will be not of the submitter task, and so checks in io_req_task_submit() may not behave as expected. Currently, it leaves inline completions not flushed, so making io_ring_exit_work() to hang. Use the submitter task for all those checks. Cc: stable@vger.kernel.org Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/cb413c715bed0bc9c98b169059ea9c8a2c770715.1625881431.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-07-20cifs: Do not use the original cruid when following DFS links for multiuser ↵Ronnie Sahlberg1-0/+4
mounts commit 50630b3f1ada0bf412d3f28e73bac310448d9d6f upstream. Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=213565 cruid should only be used for the initial mount and after this we should use the current users credentials. Ignore the original cruid mount argument when creating a new context for a multiuser mount following a DFS link. Fixes: 24e0a1eff9e2 ("cifs: switch to new mount api") Cc: stable@vger.kernel.org # 5.11+ Reported-by: Xiaoli Feng <xifeng@redhat.com> Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com> Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz> Signed-off-by: Steve French <stfrench@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-07-20cifs: handle reconnect of tcon when there is no cached dfs referralPaulo Alcantara1-4/+2
commit 507345b5ae6a57b7ecd7550ff39282ed20de7b8d upstream. When there is no cached DFS referral of tcon->dfs_path, then reconnect to same share. Signed-off-by: Paulo Alcantara (SUSE) <pc@cjr.nz> Cc: <stable@vger.kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-07-20cifs: use the expiry output of dns_query to schedule next resolutionShyam Prasad N6-10/+65
commit 506c1da44fee32ba1d3a70413289ad58c772bba6 upstream. We recently fixed DNS resolution of the server hostname during reconnect. However, server IP address may change, even when the old one continues to server (although sub-optimally). We should schedule the next DNS resolution based on the TTL of the DNS record used for the last resolution. This way, we resolve the server hostname again when a DNS record expires. Signed-off-by: Shyam Prasad N <sprasad@microsoft.com> Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz> Cc: <stable@vger.kernel.org> # v5.11+ Signed-off-by: Steve French <stfrench@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-07-19f2fs: fix to avoid racing on fsync_entry_slab by multi filesystem instancesChao Yu3-10/+23
commit cad83c968c2ebe97905f900326988ed37146c347 upstream. As syzbot reported, there is an use-after-free issue during f2fs recovery: Use-after-free write at 0xffff88823bc16040 (in kfence-#10): kmem_cache_destroy+0x1f/0x120 mm/slab_common.c:486 f2fs_recover_fsync_data+0x75b0/0x8380 fs/f2fs/recovery.c:869 f2fs_fill_super+0x9393/0xa420 fs/f2fs/super.c:3945 mount_bdev+0x26c/0x3a0 fs/super.c:1367 legacy_get_tree+0xea/0x180 fs/fs_context.c:592 vfs_get_tree+0x86/0x270 fs/super.c:1497 do_new_mount fs/namespace.c:2905 [inline] path_mount+0x196f/0x2be0 fs/namespace.c:3235 do_mount fs/namespace.c:3248 [inline] __do_sys_mount fs/namespace.c:3456 [inline] __se_sys_mount+0x2f9/0x3b0 fs/namespace.c:3433 do_syscall_64+0x3f/0xb0 arch/x86/entry/common.c:47 entry_SYSCALL_64_after_hwframe+0x44/0xae The root cause is multi f2fs filesystem instances can race on accessing global fsync_entry_slab pointer, result in use-after-free issue of slab cache, fixes to init/destroy this slab cache only once during module init/destroy procedure to avoid this issue. Reported-by: syzbot+9d90dad32dd9727ed084@syzkaller.appspotmail.com Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-07-19jfs: fix GPF in diFreePavel Skripkin1-1/+2
commit 9d574f985fe33efd6911f4d752de6f485a1ea732 upstream. Avoid passing inode with JFS_SBI(inode->i_sb)->ipimap == NULL to diFree()[1]. GFP will appear: struct inode *ipimap = JFS_SBI(ip->i_sb)->ipimap; struct inomap *imap = JFS_IP(ipimap)->i_imap; JFS_IP() will return invalid pointer when ipimap == NULL Call Trace: diFree+0x13d/0x2dc0 fs/jfs/jfs_imap.c:853 [1] jfs_evict_inode+0x2c9/0x370 fs/jfs/inode.c:154 evict+0x2ed/0x750 fs/inode.c:578 iput_final fs/inode.c:1654 [inline] iput.part.0+0x3fe/0x820 fs/inode.c:1680 iput+0x58/0x70 fs/inode.c:1670 Reported-and-tested-by: syzbot+0a89a7b56db04c21a656@syzkaller.appspotmail.com Signed-off-by: Pavel Skripkin <paskripkin@gmail.com> Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-07-19ext4: fix possible UAF when remounting r/o a mmp-protected file systemTheodore Ts'o2-17/+20
commit 61bb4a1c417e5b95d9edb4f887f131de32e419cb upstream. After commit 618f003199c6 ("ext4: fix memory leak in ext4_fill_super"), after the file system is remounted read-only, there is a race where the kmmpd thread can exit, causing sbi->s_mmp_tsk to point at freed memory, which the call to ext4_stop_mmpd() can trip over. Fix this by only allowing kmmpd() to exit when it is stopped via ext4_stop_mmpd(). Link: https://lore.kernel.org/r/20210707002433.3719773-1-tytso@mit.edu Reported-by: Ye Bin <yebin10@huawei.com> Bug-Report-Link: <20210629143603.2166962-1-yebin10@huawei.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-07-19ubifs: Fix races between xattr_{set|get} and listxattr operationsZhihao Cheng3-11/+36
commit f4e3634a3b642225a530c292fdb1e8a4007507f5 upstream. UBIFS may occur some problems with concurrent xattr_{set|get} and listxattr operations, such as assertion failure, memory corruption, stale xattr value[1]. Fix it by importing a new rw-lock in @ubifs_inode to serilize write operations on xattr, concurrent read operations are still effective, just like ext4. [1] https://lore.kernel.org/linux-mtd/20200630130438.141649-1-houtao1@huawei.com Fixes: 1e51764a3c2ac05a23 ("UBIFS: add new flash file system") Cc: stable@vger.kernel.org # v2.6+ Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com> Reviewed-by: Sascha Hauer <s.hauer@pengutronix.de> Signed-off-by: Richard Weinberger <richard@nod.at> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-07-19io_uring: fix false WARN_ONCEPavel Begunkov1-1/+4
[ Upstream commit e6ab8991c5d0b0deae0961dc22c0edd1dee328f5 ] WARNING: CPU: 1 PID: 11749 at fs/io-wq.c:244 io_wqe_wake_worker fs/io-wq.c:244 [inline] WARNING: CPU: 1 PID: 11749 at fs/io-wq.c:244 io_wqe_enqueue+0x7f6/0x910 fs/io-wq.c:751 A WARN_ON_ONCE() in io_wqe_wake_worker() can be triggered by a valid userspace setup. Replace it with pr_warn. Reported-by: syzbot+ea2f1484cffe5109dc10@syzkaller.appspotmail.com Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/f7ede342c3342c4c26668f5168e2993e38bbd99c.1623949695.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-07-19ext4: fix memory leak in ext4_fill_superPavel Skripkin3-21/+21
[ Upstream commit 618f003199c6188e01472b03cdbba227f1dc5f24 ] static int kthread(void *_create) will return -ENOMEM or -EINTR in case of internal failure or kthread_stop() call happens before threadfn call. To prevent fancy error checking and make code more straightforward we moved all cleanup code out of kmmpd threadfn. Also, dropped struct mmpd_data at all. Now struct super_block is a threadfn data and struct buffer_head embedded into struct ext4_sb_info. Reported-by: syzbot+d9e482e303930fa4f6ff@syzkaller.appspotmail.com Signed-off-by: Pavel Skripkin <paskripkin@gmail.com> Link: https://lore.kernel.org/r/20210430185046.15742-1-paskripkin@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-07-19udf: Fix NULL pointer dereference in udf_symlink functionArturo Giusti1-0/+4
[ Upstream commit fa236c2b2d4436d9f19ee4e5d5924e90ffd7bb43 ] In function udf_symlink, epos.bh is assigned with the value returned by udf_tgetblk. The function udf_tgetblk is defined in udf/misc.c and returns the value of sb_getblk function that could be NULL. Then, epos.bh is used without any check, causing a possible NULL pointer dereference when sb_getblk fails. This fix adds a check to validate the value of epos.bh. Link: https://bugzilla.kernel.org/show_bug.cgi?id=213083 Signed-off-by: Arturo Giusti <koredump@protonmail.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-07-19reiserfs: add check for invalid 1st journal blockPavel Skripkin1-0/+14
[ Upstream commit a149127be52fa7eaf5b3681a0317a2bbb772d5a9 ] syzbot reported divide error in reiserfs. The problem was in incorrect journal 1st block. Syzbot's reproducer manualy generated wrong superblock with incorrect 1st block. In journal_init() wasn't any checks about this particular case. For example, if 1st journal block is before superblock 1st block, it can cause zeroing important superblock members in do_journal_end(). Link: https://lore.kernel.org/r/20210517121545.29645-1-paskripkin@gmail.com Reported-by: syzbot+0ba9909df31c6a36974d@syzkaller.appspotmail.com Signed-off-by: Pavel Skripkin <paskripkin@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-07-14io_uring: add IOPOLL and reserved field checks to IORING_OP_UNLINKATJens Axboe1-0/+4
commit 22634bc5620d29765e5199c7b230a372c7ddcda2 upstream. We can't support IOPOLL with non-pollable request types, and we should check for unused/reserved fields like we do for other request types. Fixes: 14a1143b68ee ("io_uring: add support for IORING_OP_UNLINKAT") Cc: stable@vger.kernel.org Reported-by: Dmitry Kadashev <dkadashev@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-07-14io_uring: add IOPOLL and reserved field checks to IORING_OP_RENAMEATJens Axboe1-0/+4
commit ed7eb2592286ead7d3bfdf8adf65e65392167cc4 upstream. We can't support IOPOLL with non-pollable request types, and we should check for unused/reserved fields like we do for other request types. Fixes: 80a261fd0032 ("io_uring: add support for IORING_OP_RENAMEAT") Cc: stable@vger.kernel.org Reported-by: Dmitry Kadashev <dkadashev@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-07-14io_uring: fix blocking inline submissionPavel Begunkov1-1/+1
commit 976517f162a05f4315b2373fd11585c395506259 upstream. There is a complaint against sys_io_uring_enter() blocking if it submits stdin reads. The problem is in __io_file_supports_async(), which sees that it's a cdev and allows it to be processed inline. Punt char devices using generic rules of io_file_supports_async(), including checking for presence of *_iter() versions of rw callbacks. Apparently, it will affect most of cdevs with some exceptions like null and zero devices. Cc: stable@vger.kernel.org Reported-by: Birk Hirdman <lonjil@gmail.com> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/d60270856b8a4560a639ef5f76e55eb563633599.1623236455.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-07-14erofs: fix error return code in erofs_read_superblock()Wei Yongjun1-0/+1
commit 0508c1ad0f264a24c4643701823a45f6c9bd8146 upstream. 'ret' will be overwritten to 0 if erofs_sb_has_sb_chksum() return true, thus 0 will return in some error handling cases. Fix to return negative error code -EINVAL instead of 0. Link: https://lore.kernel.org/r/20210519141657.3062715-1-weiyongjun1@huawei.com Fixes: b858a4844cfb ("erofs: support superblock checksum") Cc: stable <stable@vger.kernel.org> # 5.5+ Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com> Reviewed-by: Gao Xiang <xiang@kernel.org> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Gao Xiang <xiang@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-07-14fscrypt: fix derivation of SipHash keys on big endian CPUsEric Biggers1-8/+32
commit 2fc2b430f559fdf32d5d1dd5ceaa40e12fb77bdf upstream. Typically, the cryptographic APIs that fscrypt uses take keys as byte arrays, which avoids endianness issues. However, siphash_key_t is an exception. It is defined as 'u64 key[2];', i.e. the 128-bit key is expected to be given directly as two 64-bit words in CPU endianness. fscrypt_derive_dirhash_key() and fscrypt_setup_iv_ino_lblk_32_key() forgot to take this into account. Therefore, the SipHash keys used to index encrypted+casefolded directories differ on big endian vs. little endian platforms, as do the SipHash keys used to hash inode numbers for IV_INO_LBLK_32-encrypted directories. This makes such directories non-portable between these platforms. Fix this by always using the little endian order. This is a breaking change for big endian platforms, but this should be fine in practice since these features (encrypt+casefold support, and the IV_INO_LBLK_32 flag) aren't known to actually be used on any big endian platforms yet. Fixes: aa408f835d02 ("fscrypt: derive dirhash key for casefolded directories") Fixes: e3b1078bedd3 ("fscrypt: add support for IV_INO_LBLK_32 policies") Cc: <stable@vger.kernel.org> # v5.6+ Link: https://lore.kernel.org/r/20210605075033.54424-1-ebiggers@kernel.org Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-07-14fscrypt: don't ignore minor_hash when hash is 0Eric Biggers1-7/+3
commit 77f30bfcfcf484da7208affd6a9e63406420bf91 upstream. When initializing a no-key name, fscrypt_fname_disk_to_usr() sets the minor_hash to 0 if the (major) hash is 0. This doesn't make sense because 0 is a valid hash code, so we shouldn't ignore the filesystem-provided minor_hash in that case. Fix this by removing the special case for 'hash == 0'. This is an old bug that appears to have originated when the encryption code in ext4 and f2fs was moved into fs/crypto/. The original ext4 and f2fs code passed the hash by pointer instead of by value. So 'if (hash)' actually made sense then, as it was checking whether a pointer was NULL. But now the hashes are passed by value, and filesystems just pass 0 for any hashes they don't have. There is no need to handle this any differently from the hashes actually being 0. It is difficult to reproduce this bug, as it only made a difference in the case where a filename's 32-bit major hash happened to be 0. However, it probably had the largest chance of causing problems on ubifs, since ubifs uses minor_hash to do lookups of no-key names, in addition to using it as a readdir cookie. ext4 only uses minor_hash as a readdir cookie, and f2fs doesn't use minor_hash at all. Fixes: 0b81d0779072 ("fs crypto: move per-file encryption from f2fs tree to fs/crypto") Cc: <stable@vger.kernel.org> # v4.6+ Link: https://lore.kernel.org/r/20210527235236.2376556-1-ebiggers@kernel.org Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-07-14exfat: handle wrong stream entry size in exfat_readdir()Namjae Jeon1-3/+5
commit 1e5654de0f51890f88abd409ebf4867782431e81 upstream. The compatibility issue between linux exfat and exfat of some camera company was reported from Florian. In their exfat, if the number of files exceeds any limit, the DataLength in stream entry of the directory is no longer updated. So some files created from camera does not show in linux exfat. because linux exfat doesn't allow that cpos becomes larger than DataLength of stream entry. This patch check DataLength in stream entry only if the type is ALLOC_NO_FAT_CHAIN and add the check ensure that dentry offset does not exceed max dentries size(256 MB) to avoid the circular FAT chain issue. Fixes: ca06197382bd ("exfat: add directory operations") Cc: stable@vger.kernel.org # v5.9 Reported-by: Florian Cramer <flrncrmr@gmail.com> Reviewed-by: Sungjong Seo <sj1557.seo@samsung.com> Tested-by: Chris Down <chris@chrisdown.name> Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-07-14mm/huge_memory.c: add missing read-only THP checking in ↵Miaohe Lin1-1/+1
transparent_hugepage_enabled() [ Upstream commit e6be37b2e7bddfe0c76585ee7c7eee5acc8efeab ] Since commit 99cb0dbd47a1 ("mm,thp: add read-only THP support for (non-shmem) FS"), read-only THP file mapping is supported. But it forgot to add checking for it in transparent_hugepage_enabled(). To fix it, we add checking for read-only THP file mapping and also introduce helper transhuge_vma_enabled() to check whether thp is enabled for specified vma to reduce duplicated code. We rename transparent_hugepage_enabled to transparent_hugepage_active to make the code easier to follow as suggested by David Hildenbrand. [linmiaohe@huawei.com: define transhuge_vma_enabled next to transhuge_vma_suitable] Link: https://lkml.kernel.org/r/20210514093007.4117906-1-linmiaohe@huawei.com Link: https://lkml.kernel.org/r/20210511134857.1581273-4-linmiaohe@huawei.com Fixes: 99cb0dbd47a1 ("mm,thp: add read-only THP support for (non-shmem) FS") Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Yang Shi <shy828301@gmail.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.ibm.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Rik van Riel <riel@surriel.com> Cc: Song Liu <songliubraving@fb.com> Cc: William Kucharski <william.kucharski@oracle.com> Cc: Zi Yan <ziy@nvidia.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-07-14configfs: fix memleak in configfs_release_bin_fileChung-Chiang Cheng1-5/+5
[ Upstream commit 3c252b087de08d3cb32468b54a158bd7ad0ae2f7 ] When reading binary attributes in progress, buffer->bin_buffer is setup in configfs_read_bin_file() but never freed. Fixes: 03607ace807b4 ("configfs: implement binary attributes") Signed-off-by: Chung-Chiang Cheng <cccheng@synology.com> [hch: move the vfree rather than duplicating it] Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-07-14writeback: fix obtain a reference to a freeing memcg cssMuchun Song1-2/+7
[ Upstream commit 8b0ed8443ae6458786580d36b7d5f8125535c5d4 ] The caller of wb_get_create() should pin the memcg, because wb_get_create() relies on this guarantee. The rcu read lock only can guarantee that the memcg css returned by css_from_id() cannot be released, but the reference of the memcg can be zero. rcu_read_lock() memcg_css = css_from_id() wb_get_create(memcg_css) cgwb_create(memcg_css) // css_get can change the ref counter from 0 back to 1 css_get(memcg_css) rcu_read_unlock() Fix it by holding a reference to the css before calling wb_get_create(). This is not a problem I encountered in the real world. Just the result of a code review. Fixes: 682aa8e1a6a1 ("writeback: implement unlocked_inode_to_wb transaction and use it for stat updates") Link: https://lore.kernel.org/r/20210402091145.80635-1-songmuchun@bytedance.com Signed-off-by: Muchun Song <songmuchun@bytedance.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-07-14dax: fix ENOMEM handling in grab_mapping_entry()Jan Kara1-1/+2
[ Upstream commit 1a14e3779dd58c16b30e56558146e5cc850ba8b0 ] grab_mapping_entry() has a bug in handling of ENOMEM condition. Suppose we have a PMD entry at index i which we are downgrading to a PTE entry. grab_mapping_entry() will set pmd_downgrade to true, lock the entry, clear the entry in xarray, and decrement mapping->nrpages. The it will call: entry = dax_make_entry(pfn_to_pfn_t(0), flags); dax_lock_entry(xas, entry); which inserts new PTE entry into xarray. However this may fail allocating the new node. We handle this by: if (xas_nomem(xas, mapping_gfp_mask(mapping) & ~__GFP_HIGHMEM)) goto retry; however pmd_downgrade stays set to true even though 'entry' returned from get_unlocked_entry() will be NULL now. And we will go again through the downgrade branch. This is mostly harmless except that mapping->nrpages is decremented again and we temporarily have an invalid entry stored in xarray. Fix the problem by setting pmd_downgrade to false each time we lookup the entry we work with so that it matches the entry we found. Link: https://lkml.kernel.org/r/20210622160015.18004-1-jack@suse.cz Fixes: b15cd800682f ("dax: Convert page fault handlers to XArray") Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-07-14ocfs2: fix snprintf() checkingDan Carpenter2-11/+3
[ Upstream commit 54e948c60cc843b6e84dc44496edc91f51d2a28e ] The snprintf() function returns the number of bytes which would have been printed if the buffer was large enough. In other words it can return ">= remain" but this code assumes it returns "== remain". The run time impact of this bug is not very severe. The next iteration through the loop would trigger a WARN() when we pass a negative limit to snprintf(). We would then return success instead of -E2BIG. The kernel implementation of snprintf() will never return negatives so there is no need to check and I have deleted that dead code. Link: https://lkml.kernel.org/r/20210511135350.GV1955@kadam Fixes: a860f6eb4c6a ("ocfs2: sysfile interfaces for online file check") Fixes: 74ae4e104dfc ("ocfs2: Create stack glue sysfs files.") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Gang He <ghe@suse.com> Cc: Jun Piao <piaojun@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-07-14mark pstore-blk as brokenChristoph Hellwig1-0/+1
[ Upstream commit d07f3b081ee632268786601f55e1334d1f68b997 ] pstore-blk just pokes directly into the pagecache for the block device without going through the file operations for that by faking up it's own file operations that do not match the block device ones. As this breaks the control of the block layer of it's page cache, and even now just works by accident only the best thing is to just disable this driver. Fixes: 17639f67c1d6 ("pstore/blk: Introduce backend for block devices") Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20210608161327.1537919-1-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-07-14btrfs: clear log tree recovering status if starting transaction failsDavid Sterba1-0/+1
[ Upstream commit 1aeb6b563aea18cd55c73cf666d1d3245a00f08c ] When a log recovery is in progress, lots of operations have to take that into account, so we keep this status per tree during the operation. Long time ago error handling revamp patch 79787eaab461 ("btrfs: replace many BUG_ONs with proper error handling") removed clearing of the status in an error branch. Add it back as was intended in e02119d5a7b4 ("Btrfs: Add a write ahead tree log to optimize synchronous operations"). There are probably no visible effects, log replay is done only during mount and if it fails all structures are cleared so the stale status won't be kept. Fixes: 79787eaab461 ("btrfs: replace many BUG_ONs with proper error handling") Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>