summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)AuthorFilesLines
2020-06-24block: Fix use-after-free in blkdev_get()Jason Yan1-5/+7
[ Upstream commit 2d3a8e2deddea6c89961c422ec0c5b851e648c14 ] In blkdev_get() we call __blkdev_get() to do some internal jobs and if there is some errors in __blkdev_get(), the bdput() is called which means we have released the refcount of the bdev (actually the refcount of the bdev inode). This means we cannot access bdev after that point. But acctually bdev is still accessed in blkdev_get() after calling __blkdev_get(). This results in use-after-free if the refcount is the last one we released in __blkdev_get(). Let's take a look at the following scenerio: CPU0 CPU1 CPU2 blkdev_open blkdev_open Remove disk bd_acquire blkdev_get __blkdev_get del_gendisk bdev_unhash_inode bd_acquire bdev_get_gendisk bd_forget failed because of unhashed bdput bdput (the last one) bdev_evict_inode access bdev => use after free [ 459.350216] BUG: KASAN: use-after-free in __lock_acquire+0x24c1/0x31b0 [ 459.351190] Read of size 8 at addr ffff88806c815a80 by task syz-executor.0/20132 [ 459.352347] [ 459.352594] CPU: 0 PID: 20132 Comm: syz-executor.0 Not tainted 4.19.90 #2 [ 459.353628] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014 [ 459.354947] Call Trace: [ 459.355337] dump_stack+0x111/0x19e [ 459.355879] ? __lock_acquire+0x24c1/0x31b0 [ 459.356523] print_address_description+0x60/0x223 [ 459.357248] ? __lock_acquire+0x24c1/0x31b0 [ 459.357887] kasan_report.cold+0xae/0x2d8 [ 459.358503] __lock_acquire+0x24c1/0x31b0 [ 459.359120] ? _raw_spin_unlock_irq+0x24/0x40 [ 459.359784] ? lockdep_hardirqs_on+0x37b/0x580 [ 459.360465] ? _raw_spin_unlock_irq+0x24/0x40 [ 459.361123] ? finish_task_switch+0x125/0x600 [ 459.361812] ? finish_task_switch+0xee/0x600 [ 459.362471] ? mark_held_locks+0xf0/0xf0 [ 459.363108] ? __schedule+0x96f/0x21d0 [ 459.363716] lock_acquire+0x111/0x320 [ 459.364285] ? blkdev_get+0xce/0xbe0 [ 459.364846] ? blkdev_get+0xce/0xbe0 [ 459.365390] __mutex_lock+0xf9/0x12a0 [ 459.365948] ? blkdev_get+0xce/0xbe0 [ 459.366493] ? bdev_evict_inode+0x1f0/0x1f0 [ 459.367130] ? blkdev_get+0xce/0xbe0 [ 459.367678] ? destroy_inode+0xbc/0x110 [ 459.368261] ? mutex_trylock+0x1a0/0x1a0 [ 459.368867] ? __blkdev_get+0x3e6/0x1280 [ 459.369463] ? bdev_disk_changed+0x1d0/0x1d0 [ 459.370114] ? blkdev_get+0xce/0xbe0 [ 459.370656] blkdev_get+0xce/0xbe0 [ 459.371178] ? find_held_lock+0x2c/0x110 [ 459.371774] ? __blkdev_get+0x1280/0x1280 [ 459.372383] ? lock_downgrade+0x680/0x680 [ 459.373002] ? lock_acquire+0x111/0x320 [ 459.373587] ? bd_acquire+0x21/0x2c0 [ 459.374134] ? do_raw_spin_unlock+0x4f/0x250 [ 459.374780] blkdev_open+0x202/0x290 [ 459.375325] do_dentry_open+0x49e/0x1050 [ 459.375924] ? blkdev_get_by_dev+0x70/0x70 [ 459.376543] ? __x64_sys_fchdir+0x1f0/0x1f0 [ 459.377192] ? inode_permission+0xbe/0x3a0 [ 459.377818] path_openat+0x148c/0x3f50 [ 459.378392] ? kmem_cache_alloc+0xd5/0x280 [ 459.379016] ? entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 459.379802] ? path_lookupat.isra.0+0x900/0x900 [ 459.380489] ? __lock_is_held+0xad/0x140 [ 459.381093] do_filp_open+0x1a1/0x280 [ 459.381654] ? may_open_dev+0xf0/0xf0 [ 459.382214] ? find_held_lock+0x2c/0x110 [ 459.382816] ? lock_downgrade+0x680/0x680 [ 459.383425] ? __lock_is_held+0xad/0x140 [ 459.384024] ? do_raw_spin_unlock+0x4f/0x250 [ 459.384668] ? _raw_spin_unlock+0x1f/0x30 [ 459.385280] ? __alloc_fd+0x448/0x560 [ 459.385841] do_sys_open+0x3c3/0x500 [ 459.386386] ? filp_open+0x70/0x70 [ 459.386911] ? trace_hardirqs_on_thunk+0x1a/0x1c [ 459.387610] ? trace_hardirqs_off_caller+0x55/0x1c0 [ 459.388342] ? do_syscall_64+0x1a/0x520 [ 459.388930] do_syscall_64+0xc3/0x520 [ 459.389490] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 459.390248] RIP: 0033:0x416211 [ 459.390720] Code: 75 14 b8 02 00 00 00 0f 05 48 3d 01 f0 ff ff 0f 83 04 19 00 00 c3 48 83 ec 08 e8 0a fa ff ff 48 89 04 24 b8 02 00 00 00 0f 05 <48> 8b 3c 24 48 89 c2 e8 53 fa ff ff 48 89 d0 48 83 c4 08 48 3d 01 [ 459.393483] RSP: 002b:00007fe45dfe9a60 EFLAGS: 00000293 ORIG_RAX: 0000000000000002 [ 459.394610] RAX: ffffffffffffffda RBX: 00007fe45dfea6d4 RCX: 0000000000416211 [ 459.395678] RDX: 00007fe45dfe9b0a RSI: 0000000000000002 RDI: 00007fe45dfe9b00 [ 459.396758] RBP: 000000000076bf20 R08: 0000000000000000 R09: 000000000000000a [ 459.397930] R10: 0000000000000075 R11: 0000000000000293 R12: 00000000ffffffff [ 459.399022] R13: 0000000000000bd9 R14: 00000000004cdb80 R15: 000000000076bf2c [ 459.400168] [ 459.400430] Allocated by task 20132: [ 459.401038] kasan_kmalloc+0xbf/0xe0 [ 459.401652] kmem_cache_alloc+0xd5/0x280 [ 459.402330] bdev_alloc_inode+0x18/0x40 [ 459.402970] alloc_inode+0x5f/0x180 [ 459.403510] iget5_locked+0x57/0xd0 [ 459.404095] bdget+0x94/0x4e0 [ 459.404607] bd_acquire+0xfa/0x2c0 [ 459.405113] blkdev_open+0x110/0x290 [ 459.405702] do_dentry_open+0x49e/0x1050 [ 459.406340] path_openat+0x148c/0x3f50 [ 459.406926] do_filp_open+0x1a1/0x280 [ 459.407471] do_sys_open+0x3c3/0x500 [ 459.408010] do_syscall_64+0xc3/0x520 [ 459.408572] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 459.409415] [ 459.409679] Freed by task 1262: [ 459.410212] __kasan_slab_free+0x129/0x170 [ 459.410919] kmem_cache_free+0xb2/0x2a0 [ 459.411564] rcu_process_callbacks+0xbb2/0x2320 [ 459.412318] __do_softirq+0x225/0x8ac Fix this by delaying bdput() to the end of blkdev_get() which means we have finished accessing bdev. Fixes: 77ea887e433a ("implement in-kernel gendisk events handling") Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Jason Yan <yanaijie@huawei.com> Tested-by: Sedat Dilek <sedat.dilek@gmail.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dan Carpenter <dan.carpenter@oracle.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Jens Axboe <axboe@kernel.dk> Cc: Ming Lei <ming.lei@redhat.com> Cc: Jan Kara <jack@suse.cz> Cc: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-06-24afs: Fix the mapping of the UAEOVERFLOW abort codeDavid Howells1-0/+1
[ Upstream commit 4ec89596d06bd481ba827f3b409b938d63914157 ] Abort code UAEOVERFLOW is returned when we try and set a time that's out of range, but it's currently mapped to EREMOTEIO by the default case. Fix UAEOVERFLOW to map instead to EOVERFLOW. Found with the generic/258 xfstest. Note that the test is wrong as it assumes that the filesystem will support a pre-UNIX-epoch date. Fixes: 1eda8bab70ca ("afs: Add support for the UAE error table") Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-06-24afs: Remove the error argument from afs_protocol_error()David Howells7-54/+35
[ Upstream commit 7126ead910aa9fcc9e16e9e7a8c9179658261f1d ] Remove the error argument from afs_protocol_error() as it's always -EBADMSG. Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-06-24afs: Set error flag rather than return error from file status decodeDavid Howells4-123/+55
[ Upstream commit 38355eec6a7d2b8f2f313f9174736dc877744e59 ] Set a flag in the call struct to indicate an unmarshalling error rather than return and handle an error from the decoding of file statuses. This flag is checked on a successful return from the delivery function. Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-06-24afs: Always include dir in bulk status fetch from afs_do_lookup()David Howells1-2/+7
[ Upstream commit 13fcc6356a94558a0a4857dc00cd26b3834a1b3e ] When a lookup is done in an AFS directory, the filesystem will speculate and fetch up to 49 other statuses for files in the same directory and fetch those as well, turning them into inodes or updating inodes that already exist. However, occasionally, a callback break might go missing due to NAT timing out, but the afs filesystem doesn't then realise that the directory is not up to date. Alleviate this by using one of the status slots to check the directory in which the lookup is being done. Reported-by: Dave Botsch <botsch@cnf.cornell.edu> Suggested-by: Jeffrey Altman <jaltman@auristor.com> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-06-24afs: Fix EOF corruptionDavid Howells1-1/+11
[ Upstream commit 3f4aa981816368fe6b1d13c2bfbe76df9687e787 ] When doing a partial writeback, afs_write_back_from_locked_page() may generate an FS.StoreData RPC request that writes out part of a file when a file has been constructed from pieces by doing seek, write, seek, write, ... as is done by ld. The FS.StoreData RPC is given the current i_size as the file length, but the server basically ignores it unless the data length is 0 (in which case it's just a truncate operation). The revised file length returned in the result of the RPC may then not reflect what we suggested - and this leads to i_size getting moved backwards - which causes issues later. Fix the client to take account of this by ignoring the returned file size unless the data version number jumped unexpectedly - in which case we're going to have to clear the pagecache and reload anyway. This can be observed when doing a kernel build on an AFS mount. The following pair of commands produce the issue: ld -m elf_x86_64 -z max-page-size=0x200000 --emit-relocs \ -T arch/x86/realmode/rm/realmode.lds \ arch/x86/realmode/rm/header.o \ arch/x86/realmode/rm/trampoline_64.o \ arch/x86/realmode/rm/stack.o \ arch/x86/realmode/rm/reboot.o \ -o arch/x86/realmode/rm/realmode.elf arch/x86/tools/relocs --realmode \ arch/x86/realmode/rm/realmode.elf \ >arch/x86/realmode/rm/realmode.relocs This results in the latter giving: Cannot read ELF section headers 0/18: Success as the realmode.elf file got corrupted. The sequence of events can also be driven with: xfs_io -t -f \ -c "pwrite -S 0x58 0 0x58" \ -c "pwrite -S 0x59 10000 1000" \ -c "close" \ /afs/example.com/scratch/a Fixes: 31143d5d515e ("AFS: implement basic file write support") Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-06-24afs: afs_write_end() should change i_size under the right lockDavid Howells1-2/+2
[ Upstream commit 1f32ef79897052ef7d3d154610d8d6af95abde83 ] Fix afs_write_end() to change i_size under vnode->cb_lock rather than ->wb_lock so that it doesn't race with afs_vnode_commit_status() and afs_getattr(). The ->wb_lock is only meant to guard access to ->wb_keys which isn't accessed by that piece of code. Fixes: 4343d00872e1 ("afs: Get rid of the afs_writeback record") Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-06-24afs: Fix non-setting of mtime when writing into mmapDavid Howells1-0/+1
[ Upstream commit bb413489288e4e457353bac513fddb6330d245ca ] The mtime on an inode needs to be updated when a write is made into an mmap'ed section. There are three ways in which this could be done: update it when page_mkwrite is called, update it when a page is changed from dirty to writeback or leave it to the server and fix the mtime up from the reply to the StoreData RPC. Found with the generic/215 xfstest. Fixes: 1cf7a1518aef ("afs: Implement shared-writeable mmap") Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-06-24ext4: stop overwrite the errcode in ext4_setup_superyangerkun1-0/+1
[ Upstream commit 5adaccac46ea79008d7b75f47913f1a00f91d0ce ] Now the errcode from ext4_commit_super will overwrite EROFS exists in ext4_setup_super. Actually, no need to call ext4_commit_super since we will return EROFS. Fix it by goto done directly. Fixes: c89128a00838 ("ext4: handle errors on ext4_commit_super") Signed-off-by: yangerkun <yangerkun@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20200601073404.3712492-1-yangerkun@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-06-24NFS: Fix direct WRITE throughput regressionChuck Lever1-0/+2
[ Upstream commit ba838a75e73f55a780f1ee896b8e3ecb032dba0f ] I measured a 50% throughput regression for large direct writes. The observed on-the-wire behavior is that the client sends every NFS WRITE twice: once as an UNSTABLE WRITE plus a COMMIT, and once as a FILE_SYNC WRITE. This is because the nfs_write_match_verf() check in nfs_direct_commit_complete() fails for every WRITE. Buffered writes use nfs_write_completion(), which sets req->wb_verf correctly. Direct writes use nfs_direct_write_completion(), which does not set req->wb_verf at all. This leaves req->wb_verf set to all zeroes for every direct WRITE, and thus nfs_direct_commit_completion() always sets NFS_ODIRECT_RESCHED_WRITES. This fix appears to restore nearly all of the lost performance. Fixes: 1f28476dcb98 ("NFS: Fix O_DIRECT commit verifier handling") Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-06-24nfs: set invalid blocks after NFSv4 writesZheng Bin1-3/+11
[ Upstream commit 3a39e778690500066b31fe982d18e2e394d3bce2 ] Use the following command to test nfsv4(size of file1M is 1MB): mount -t nfs -o vers=4.0,actimeo=60 127.0.0.1/dir1 /mnt cp file1M /mnt du -h /mnt/file1M -->0 within 60s, then 1M When write is done(cp file1M /mnt), will call this: nfs_writeback_done nfs4_write_done nfs4_write_done_cb nfs_writeback_update_inode nfs_post_op_update_inode_force_wcc_locked(change, ctime, mtime nfs_post_op_update_inode_force_wcc_locked nfs_set_cache_invalid nfs_refresh_inode_locked nfs_update_inode nfsd write response contains change, ctime, mtime, the flag will be clear after nfs_update_inode. Howerver, write response does not contain space_used, previous open response contains space_used whose value is 0, so inode->i_blocks is still 0. nfs_getattr -->called by "du -h" do_update |= force_sync || nfs_attribute_cache_expired -->false in 60s cache_validity = READ_ONCE(NFS_I(inode)->cache_validity) do_update |= cache_validity & (NFS_INO_INVALID_ATTR -->false if (do_update) { __nfs_revalidate_inode } Within 60s, does not send getattr request to nfsd, thus "du -h /mnt/file1M" is 0. Add a NFS_INO_INVALID_BLOCKS flag, set it when nfsv4 write is done. Fixes: 16e143751727 ("NFS: More fine grained attribute tracking") Signed-off-by: Zheng Bin <zhengbin13@huawei.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-06-24afs: Fix memory leak in afs_put_sysnames()Zhihao Cheng1-0/+1
[ Upstream commit 2ca068be09bf8e285036603823696140026dcbe7 ] Fix afs_put_sysnames() to actually free the specified afs_sysnames object after its reference count has been decreased to zero and its contents have been released. Fixes: 6f8880d8e681557 ("afs: Implement @sys substitution handling") Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-06-24f2fs: don't return vmalloc() memory from f2fs_kmalloc()Eric Biggers4-14/+8
[ Upstream commit 0b6d4ca04a86b9dababbb76e58d33c437e127b77 ] kmalloc() returns kmalloc'ed memory, and kvmalloc() returns either kmalloc'ed or vmalloc'ed memory. But the f2fs wrappers, f2fs_kmalloc() and f2fs_kvmalloc(), both return both kinds of memory. It's redundant to have two functions that do the same thing, and also breaking the standard naming convention is causing bugs since people assume it's safe to kfree() memory allocated by f2fs_kmalloc(). See e.g. the various allocations in fs/f2fs/compress.c. Fix this by making f2fs_kmalloc() just use kmalloc(). And to avoid re-introducing the allocation failures that the vmalloc fallback was intended to fix, convert the largest allocations to use f2fs_kvmalloc(). Signed-off-by: Eric Biggers <ebiggers@google.com> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-06-24gfs2: fix use-after-free on transaction ail listsBob Peterson1-2/+9
[ Upstream commit 83d060ca8d90fa1e3feac227f995c013100862d3 ] Before this patch, transactions could be merged into the system transaction by function gfs2_merge_trans(), but the transaction ail lists were never merged. Because the ail flushing mechanism can run separately, bd elements can be attached to the transaction's buffer list during the transaction (trans_add_meta, etc) but quickly moved to its ail lists. Later, in function gfs2_trans_end, the transaction can be freed (by gfs2_trans_end) while it still has bd elements queued to its ail lists, which can cause it to either lose track of the bd elements altogether (memory leak) or worse, reference the bd elements after the parent transaction has been freed. Although I've not seen any serious consequences, the problem becomes apparent with the previous patch's addition of: gfs2_assert_warn(sdp, list_empty(&tr->tr_ail1_list)); to function gfs2_trans_free(). This patch adds logic into gfs2_merge_trans() to move the merged transaction's ail lists to the sdp transaction. This prevents the use-after-free. To do this properly, we need to hold the ail lock, so we pass sdp into the function instead of the transaction itself. Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-06-24ext4: don't block for O_DIRECT if IOCB_NOWAIT is setJens Axboe1-0/+6
[ Upstream commit 6e014c621e7271649f0d51e54dbe1db4c10486c8 ] Running with some debug patches to detect illegal blocking triggered the extend/unaligned condition in ext4. If ext4 needs to extend the file (and hence go to buffered IO), or if the app is doing unaligned IO, then ext4 asks the iomap code to wait for IO completion. If the caller asked for no-wait semantics by setting IOCB_NOWAIT, then ext4 should return -EAGAIN instead. Signed-off-by: Jens Axboe <axboe@kernel.dk> Link: https://lore.kernel.org/r/76152096-2bbb-7682-8fce-4cb498bcd909@kernel.dk Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-06-24ext4: handle ext4_mark_inode_dirty errorsHarshad Shirwadkar12-73/+139
[ Upstream commit 4209ae12b12265d475bba28634184423149bd14f ] ext4_mark_inode_dirty() can fail for real reasons. Ignoring its return value may lead ext4 to ignore real failures that would result in corruption / crashes. Harden ext4_mark_inode_dirty error paths to fail as soon as possible and return errors to the caller whenever appropriate. One of the possible scnearios when this bug could affected is that while creating a new inode, its directory entry gets added successfully but while writing the inode itself mark_inode_dirty returns error which is ignored. This would result in inconsistency that the directory entry points to a non-existent inode. Ran gce-xfstests smoke tests and verified that there were no regressions. Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com> Link: https://lore.kernel.org/r/20200427013438.219117-1-harshadshirwadkar@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-06-24nfsd: safer handling of corrupted c_typeJ. Bruce Fields1-2/+1
[ Upstream commit c25bf185e57213b54ea0d632ac04907310993433 ] This can only happen if there's a bug somewhere, so let's make it a WARN not a printk. Also, I think it's safest to ignore the corruption rather than trying to fix it by removing a cache entry. Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-06-24gfs2: Allow lock_nolock mount to specify jid=XBob Peterson1-1/+1
[ Upstream commit ea22eee4e6027d8927099de344f7fff43c507ef9 ] Before this patch, a simple typo accidentally added \n to the jid= string for lock_nolock mounts. This made it impossible to mount a gfs2 file system with a journal other than journal0. Thus: mount -tgfs2 -o hostdata="jid=1" <device> <mount pt> Resulted in: mount: wrong fs type, bad option, bad superblock on <device> In most cases this is not a problem. However, for debugging and testing purposes we sometimes want to test the integrity of other journals. This patch removes the unnecessary \n and thus allows lock_nolock users to specify an alternate journal. Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-06-24nfsd4: make drc_slab global, not per-netJ. Bruce Fields4-13/+25
[ Upstream commit 027690c75e8fd91b60a634d31c4891a6e39d45bd ] I made every global per-network-namespace instead. But perhaps doing that to this slab was a step too far. The kmem_cache_create call in our net init method also seems to be responsible for this lockdep warning: [ 45.163710] Unable to find swap-space signature [ 45.375718] trinity-c1 (855): attempted to duplicate a private mapping with mremap. This is not supported. [ 46.055744] futex_wake_op: trinity-c1 tries to shift op by -209; fix this program [ 51.011723] [ 51.013378] ====================================================== [ 51.013875] WARNING: possible circular locking dependency detected [ 51.014378] 5.2.0-rc2 #1 Not tainted [ 51.014672] ------------------------------------------------------ [ 51.015182] trinity-c2/886 is trying to acquire lock: [ 51.015593] 000000005405f099 (slab_mutex){+.+.}, at: slab_attr_store+0xa2/0x130 [ 51.016190] [ 51.016190] but task is already holding lock: [ 51.016652] 00000000ac662005 (kn->count#43){++++}, at: kernfs_fop_write+0x286/0x500 [ 51.017266] [ 51.017266] which lock already depends on the new lock. [ 51.017266] [ 51.017909] [ 51.017909] the existing dependency chain (in reverse order) is: [ 51.018497] [ 51.018497] -> #1 (kn->count#43){++++}: [ 51.018956] __lock_acquire+0x7cf/0x1a20 [ 51.019317] lock_acquire+0x17d/0x390 [ 51.019658] __kernfs_remove+0x892/0xae0 [ 51.020020] kernfs_remove_by_name_ns+0x78/0x110 [ 51.020435] sysfs_remove_link+0x55/0xb0 [ 51.020832] sysfs_slab_add+0xc1/0x3e0 [ 51.021332] __kmem_cache_create+0x155/0x200 [ 51.021720] create_cache+0xf5/0x320 [ 51.022054] kmem_cache_create_usercopy+0x179/0x320 [ 51.022486] kmem_cache_create+0x1a/0x30 [ 51.022867] nfsd_reply_cache_init+0x278/0x560 [ 51.023266] nfsd_init_net+0x20f/0x5e0 [ 51.023623] ops_init+0xcb/0x4b0 [ 51.023928] setup_net+0x2fe/0x670 [ 51.024315] copy_net_ns+0x30a/0x3f0 [ 51.024653] create_new_namespaces+0x3c5/0x820 [ 51.025257] unshare_nsproxy_namespaces+0xd1/0x240 [ 51.025881] ksys_unshare+0x506/0x9c0 [ 51.026381] __x64_sys_unshare+0x3a/0x50 [ 51.026937] do_syscall_64+0x110/0x10b0 [ 51.027509] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 51.028175] [ 51.028175] -> #0 (slab_mutex){+.+.}: [ 51.028817] validate_chain+0x1c51/0x2cc0 [ 51.029422] __lock_acquire+0x7cf/0x1a20 [ 51.029947] lock_acquire+0x17d/0x390 [ 51.030438] __mutex_lock+0x100/0xfa0 [ 51.030995] mutex_lock_nested+0x27/0x30 [ 51.031516] slab_attr_store+0xa2/0x130 [ 51.032020] sysfs_kf_write+0x11d/0x180 [ 51.032529] kernfs_fop_write+0x32a/0x500 [ 51.033056] do_loop_readv_writev+0x21d/0x310 [ 51.033627] do_iter_write+0x2e5/0x380 [ 51.034148] vfs_writev+0x170/0x310 [ 51.034616] do_pwritev+0x13e/0x160 [ 51.035100] __x64_sys_pwritev+0xa3/0x110 [ 51.035633] do_syscall_64+0x110/0x10b0 [ 51.036200] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 51.036924] [ 51.036924] other info that might help us debug this: [ 51.036924] [ 51.037876] Possible unsafe locking scenario: [ 51.037876] [ 51.038556] CPU0 CPU1 [ 51.039130] ---- ---- [ 51.039676] lock(kn->count#43); [ 51.040084] lock(slab_mutex); [ 51.040597] lock(kn->count#43); [ 51.041062] lock(slab_mutex); [ 51.041320] [ 51.041320] *** DEADLOCK *** [ 51.041320] [ 51.041793] 3 locks held by trinity-c2/886: [ 51.042128] #0: 000000001f55e152 (sb_writers#5){.+.+}, at: vfs_writev+0x2b9/0x310 [ 51.042739] #1: 00000000c7d6c034 (&of->mutex){+.+.}, at: kernfs_fop_write+0x25b/0x500 [ 51.043400] #2: 00000000ac662005 (kn->count#43){++++}, at: kernfs_fop_write+0x286/0x500 Reported-by: kernel test robot <lkp@intel.com> Fixes: 3ba75830ce17 "drc containerization" Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-06-24ceph: don't return -ESTALE if there's still an open fileLuis Henriques1-1/+8
[ Upstream commit 878dabb64117406abd40977b87544d05bb3031fc ] Similarly to commit 03f219041fdb ("ceph: check i_nlink while converting a file handle to dentry"), this fixes another corner case with name_to_handle_at/open_by_handle_at. The issue has been detected by xfstest generic/467, when doing: - name_to_handle_at("/cephfs/myfile") - open("/cephfs/myfile") - unlink("/cephfs/myfile") - sync; sync; - drop caches - open_by_handle_at() The call to open_by_handle_at should not fail because the file hasn't been deleted yet (only unlinked) and we do have a valid handle to it. -ESTALE shall be returned only if i_nlink is 0 *and* i_count is 1. This patch also makes sure we have LINK caps before checking i_nlink. Signed-off-by: Luis Henriques <lhenriques@suse.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Acked-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-06-24NFSv4.1 fix rpc_call_done assignment for BIND_CONN_TO_SESSIONOlga Kornievskaia1-1/+1
[ Upstream commit 1c709b766e73e54d64b1dde1b7cfbcf25bcb15b9 ] Fixes: 02a95dee8cf0 ("NFS add callback_ops to nfs4_proc_bind_conn_to_session_callback") Signed-off-by: Olga Kornievskaia <kolga@netapp.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-06-24fuse: copy_file_range should truncate cacheMiklos Szeredi1-0/+22
[ Upstream commit 9b46418c40fe910e6537618f9932a8be78a3dd6c ] After the copy operation completes the cache is not up-to-date. Truncate all pages in the interval that has successfully been copied. Truncating completely copied dirty pages is okay, since the data has been overwritten anyway. Truncating partially copied dirty pages is not okay; add a comment for now. Fixes: 88bc7d5097a1 ("fuse: add support for copy_file_range()") Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-06-24fuse: fix copy_file_range cache issuesMiklos Szeredi1-12/+8
[ Upstream commit 2c4656dfd994538176db30ce09c02cc0dfc361ae ] a) Dirty cache needs to be written back not just in the writeback_cache case, since the dirty pages may come from memory maps. b) The fuse_writeback_range() helper takes an inclusive interval, so the end position needs to be pos+len-1 instead of pos+len. Fixes: 88bc7d5097a1 ("fuse: add support for copy_file_range()") Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-06-24dlm: remove BUG() before panic()Arnd Bergmann1-1/+0
[ Upstream commit fe204591cc9480347af7d2d6029b24a62e449486 ] Building a kernel with clang sometimes fails with an objtool error in dlm: fs/dlm/lock.o: warning: objtool: revert_lock_pc()+0xbd: can't find jump dest instruction at .text+0xd7fc The problem is that BUG() never returns and the compiler knows that anything after it is unreachable, however the panic still emits some code that does not get fully eliminated. Having both BUG() and panic() is really pointless as the BUG() kills the current process and the subsequent panic() never hits. In most cases, we probably don't really want either and should replace the DLM_ASSERT() statements with WARN_ON(), as has been done for some of them. Remove the BUG() here so the user at least sees the panic message and we can reliably build randconfig kernels. Fixes: e7fd41792fc0 ("[DLM] The core of the DLM for GFS2/CLVM") Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: clang-built-linux@googlegroups.com Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: David Teigland <teigland@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-06-24f2fs: compress: fix zstd data corruptionChao Yu1-0/+7
[ Upstream commit 1454c978efbb57b052670d50023f48c759d704ce ] During zstd compression, ZSTD_endStream() may return non-zero value because distination buffer is full, but there is still compressed data remained in intermediate buffer, it means that zstd algorithm can not save at last one block space, let's just writeback raw data instead of compressed one, this can fix data corruption when decompressing incomplete stored compression data. Fixes: 50cfa66f0de0 ("f2fs: compress: support zstd compress algorithm") Signed-off-by: Daeho Jeong <daehojeong@google.com> Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-06-24f2fs: fix potential use-after-free issueChao Yu1-4/+4
[ Upstream commit f3494345ce9999624b36109252a4bf5f00e51a46 ] In error path of f2fs_read_multi_pages(), it should let last referrer release decompress io context memory, otherwise, other referrer will cause use-after-free issue. Fixes: 4c8ff7095bef ("f2fs: support data compression") Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-06-24f2fs: Fix wrong stub helper update_sit_infoYueHaibing1-1/+1
[ Upstream commit 48abe91ac1ad27cd5a5709f983dcf58f2b9a6b70 ] update_sit_info should be f2fs_update_sit_info, otherwise build fails while no CONFIG_F2FS_STAT_FS. Fixes: fc7100ea2a52 ("f2fs: Add f2fs stats to sysfs") Signed-off-by: YueHaibing <yuehaibing@huawei.com> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-06-24fuse: BUG_ON correction in fuse_dev_splice_write()Vasily Averin1-2/+3
[ Upstream commit 0e9fb6f17ad5b386b75451328975a07d7d953c6d ] commit 963545357202 ("fuse: reduce allocation size for splice_write") changed size of bufs array, so BUG_ON which checks the index of the array shold also be fixed. [SzM: turn BUG_ON into WARN_ON] Fixes: 963545357202 ("fuse: reduce allocation size for splice_write") Signed-off-by: Vasily Averin <vvs@virtuozzo.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-06-24virtiofs: schedule blocking async replies in separate workerVivek Goyal3-35/+73
[ Upstream commit bb737bbe48bea9854455cb61ea1dc06e92ce586c ] In virtiofs (unlike in regular fuse) processing of async replies is serialized. This can result in a deadlock in rare corner cases when there's a circular dependency between the completion of two or more async replies. Such a deadlock can be reproduced with xfstests:generic/503 if TEST_DIR == SCRATCH_MNT (which is a misconfiguration): - Process A is waiting for page lock in worker thread context and blocked (virtio_fs_requests_done_work()). - Process B is holding page lock and waiting for pending writes to finish (fuse_wait_on_page_writeback()). - Write requests are waiting in virtqueue and can't complete because worker thread is blocked on page lock (process A). Fix this by creating a unique work_struct for each async reply that can block (O_DIRECT read). Fixes: a62a8ef9d97d ("virtio-fs: add virtiofs filesystem") Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-06-24xfs: Add the missed xfs_perag_put() for xfs_ifree_cluster()Chuhong Yuan1-1/+3
[ Upstream commit 8cc0072469723459dc6bd7beff81b2b3149f4cf4 ] xfs_ifree_cluster() calls xfs_perag_get() at the beginning, but forgets to call xfs_perag_put() in one failed path. Add the missed function call to fix it. Fixes: ce92464c180b ("xfs: make xfs_trans_get_buf return an error code") Signed-off-by: Chuhong Yuan <hslester96@gmail.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-06-24f2fs: handle readonly filesystem in f2fs_ioc_shutdown()Chao Yu1-1/+8
[ Upstream commit 8626441f05dc45a2f4693ee6863d02456ce39e60 ] If mountpoint is readonly, we should allow shutdowning filesystem successfully, this fixes issue found by generic/599 testcase of xfstest. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-06-24cifs: set up next DFS target before generic_ip_connect()Paulo Alcantara1-9/+9
[ Upstream commit aaa3aef34d3ab9499a5c7633823429f7a24e6dff ] If we mount a very specific DFS link \\FS0.FOO.COM\dfs\link -> \FS0\share1, \FS1\share2 where its target list contains NB names ("FS0" & "FS1") rather than FQDN ones ("FS0.FOO.COM" & "FS1.FOO.COM"), we end up connecting to \FOO\share1 but server->hostname will have "FOO.COM". The reason is because both "FS0" and "FS0.FOO.COM" resolve to same IP address and they share same TCP server connection, but "FS0.FOO.COM" was the first hostname set -- which is OK. However, if the echo thread timeouts and we still have a good connection to "FS0", in cifs_reconnect() rc = generic_ip_connect(server) -> success if (rc) { ... reconn_inval_dfs_target(server, cifs_sb, &tgt_list, &tgt_it); ... } ... it successfully reconnects to "FS0" server but does not set up next DFS target - which should be the same target server "\FS0\share1" - and server->hostname remains set to "FS0.FOO.COM" rather than "FS0", as reconn_inval_dfs_target() would have it set to "FS0" if called earlier. Finally, in __smb2_reconnect(), the reconnect of tcons would fail because tcon->ses->server->hostname (FS0.FOO.COM) does not match DFS target's hostname (FS0). Fix that by calling reconn_inval_dfs_target() before generic_ip_connect() so server->hostname will get updated correctly prior to reconnecting its tcons in __smb2_reconnect(). With "cifs: handle hostnames that resolve to same ip in failover" patch - The above problem would not occur. - We could save an DNS query to find out that they both resolve to the same ip address. Signed-off-by: Paulo Alcantara (SUSE) <pc@cjr.nz> Reviewed-by: Aurelien Aptel <aaptel@suse.com> Signed-off-by: Steve French <stfrench@microsoft.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-06-24nfsd: Fix svc_xprt refcnt leak when setup callback client failedXiyu Yang1-0/+2
[ Upstream commit a4abc6b12eb1f7a533c2e7484cfa555454ff0977 ] nfsd4_process_cb_update() invokes svc_xprt_get(), which increases the refcount of the "c->cn_xprt". The reference counting issue happens in one exception handling path of nfsd4_process_cb_update(). When setup callback client failed, the function forgets to decrease the refcnt increased by svc_xprt_get(), causing a refcnt leak. Fix this issue by calling svc_xprt_put() when setup callback client failed. Signed-off-by: Xiyu Yang <xiyuyang19@fudan.edu.cn> Signed-off-by: Xin Tan <tanxin.ctf@gmail.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-06-24f2fs: report delalloc reserve as non-free in statfs for project quotaKonstantin Khlebnikov1-1/+2
[ Upstream commit baaa7ebf25c78c5cb712fac16b7f549100beddd3 ] This reserved space isn't committed yet but cannot be used for allocations. For userspace it has no difference from used space. See the same fix in ext4 commit f06925c73942 ("ext4: report delalloc reserve as non-free in statfs for project quota"). Fixes: ddc34e328d06 ("f2fs: introduce f2fs_statfs_project") Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-06-24f2fs: compress: let lz4 compressor handle output buffer budget properlyChao Yu1-6/+9
[ Upstream commit f6644143c63f2eac88973f7fea087582579b0189 ] Commonly, in order to handle lz4 worst compress case, caller should allocate buffer with size of LZ4_compressBound(inputsize) for target compressed data storing, however in this case, if caller didn't allocate enough space, lz4 compressor still can handle output buffer budget properly, and end up compressing when left space in output buffer is not enough. So we don't have to allocate buffer with size for worst case, then we can avoid 2 * 4KB size intermediate buffer allocation when log_cluster_size is 2, and avoid unnecessary compressing work of compressor if we can not save at least 4KB space. Suggested-by: Daeho Jeong <daehojeong@google.com> Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-06-22f2fs: fix checkpoint=disable:%u%%Jaegeuk Kim2-6/+20
commit 1ae18f71cb522684bac1718f5c188fb5e30eb23d upstream. When parsing the mount option, we don't have sbi->user_block_count. Should do it after getting it. Cc: <stable@vger.kernel.org> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-06-22f2fs: don't leak filename in f2fs_try_convert_inline_dir()Eric Biggers1-2/+4
commit ff5f85c8d62a487bde415ef4c9e2d0be718021df upstream. We need to call fscrypt_free_filename() to free the memory allocated by fscrypt_setup_filename(). Fixes: b06af2aff28b ("f2fs: convert inline_dir early before starting rename") Cc: <stable@vger.kernel.org> # v5.6+ Signed-off-by: Eric Biggers <ebiggers@google.com> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-06-22jbd2: avoid leaking transaction credits when unreserving handleJan Kara1-3/+11
commit 14ff6286309e2853aed50083c9a83328423fdd8c upstream. When reserved transaction handle is unused, we subtract its reserved credits in __jbd2_journal_unreserve_handle() called from jbd2_journal_stop(). However this function forgets to remove reserved credits from transaction->t_outstanding_credits and thus the transaction space that was reserved remains effectively leaked. The leaked transaction space can be quite significant in some cases and leads to unnecessarily small transactions and thus reducing throughput of the journalling machinery. E.g. fsmark workload creating lots of 4k files was observed to have about 20% lower throughput due to this when ext4 is mounted with dioread_nolock mount option. Subtract reserved credits from t_outstanding_credits as well. CC: stable@vger.kernel.org Fixes: 8f7d89f36829 ("jbd2: transaction reservation support") Reviewed-by: Andreas Dilger <adilger@dilger.ca> Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20200520133119.1383-3-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-06-22ext4: fix race between ext4_sync_parent() and rename()Eric Biggers1-15/+13
commit 08adf452e628b0e2ce9a01048cfbec52353703d7 upstream. 'igrab(d_inode(dentry->d_parent))' without holding dentry->d_lock is broken because without d_lock, d_parent can be concurrently changed due to a rename(). Then if the old directory is immediately deleted, old d_parent->inode can be NULL. That causes a NULL dereference in igrab(). To fix this, use dget_parent() to safely grab a reference to the parent dentry, which pins the inode. This also eliminates the need to use d_find_any_alias() other than for the initial inode, as we no longer throw away the dentry at each step. This is an extremely hard race to hit, but it is possible. Adding a udelay() in between the reads of ->d_parent and its ->d_inode makes it reproducible on a no-journal filesystem using the following program: #include <fcntl.h> #include <unistd.h> int main() { if (fork()) { for (;;) { mkdir("dir1", 0700); int fd = open("dir1/file", O_RDWR|O_CREAT|O_SYNC); write(fd, "X", 1); close(fd); } } else { mkdir("dir2", 0700); for (;;) { rename("dir1/file", "dir2/file"); rmdir("dir1"); } } } Fixes: d59729f4e794 ("ext4: fix races in ext4_sync_parent()") Cc: stable@vger.kernel.org Signed-off-by: Eric Biggers <ebiggers@google.com> Link: https://lore.kernel.org/r/20200506183140.541194-1-ebiggers@kernel.org Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-06-22ext4: fix error pointer dereferenceJeffle Xu1-2/+5
commit 8418897f1bf87da0cb6936489d57a4320c32c0af upstream. Don't pass error pointers to brelse(). commit 7159a986b420 ("ext4: fix some error pointer dereferences") has fixed some cases, fix the remaining one case. Once ext4_xattr_block_find()->ext4_sb_bread() failed, error pointer is stored in @bs->bh, which will be passed to brelse() in the cleanup routine of ext4_xattr_set_handle(). This will then cause a NULL panic crash in __brelse(). BUG: unable to handle kernel NULL pointer dereference at 000000000000005b RIP: 0010:__brelse+0x1b/0x50 Call Trace: ext4_xattr_set_handle+0x163/0x5d0 ext4_xattr_set+0x95/0x110 __vfs_setxattr+0x6b/0x80 __vfs_setxattr_noperm+0x68/0x1b0 vfs_setxattr+0xa0/0xb0 setxattr+0x12c/0x1a0 path_setxattr+0x8d/0xc0 __x64_sys_setxattr+0x27/0x30 do_syscall_64+0x60/0x250 entry_SYSCALL_64_after_hwframe+0x49/0xbe In this case, @bs->bh stores '-EIO' actually. Fixes: fb265c9cb49e ("ext4: add ext4_sb_bread() to disambiguate ENOMEM cases") Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: stable@kernel.org # 2.6.19 Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/1587628004-95123-1-git-send-email-jefflexu@linux.alibaba.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-06-22ext4: fix buffer_head refcnt leak when ext4_iget() failsXiyu Yang1-0/+1
commit 3bbd0ef26098d241dc59ee77ba14b7dab0df0786 upstream. ext4_orphan_get() invokes ext4_read_inode_bitmap(), which returns a reference of the specified buffer_head object to "bitmap_bh" with increased refcnt. When ext4_orphan_get() returns, local variable "bitmap_bh" becomes invalid, so the refcount should be decreased to keep refcount balanced. The reference counting issue happens in one exception handling path of ext4_orphan_get(). When ext4_iget() fails, the function forgets to decrease the refcnt increased by ext4_read_inode_bitmap(), causing a refcnt leak. Fix this issue by calling brelse() when ext4_iget() fails. Signed-off-by: Xiyu Yang <xiyuyang19@fudan.edu.cn> Signed-off-by: Xin Tan <tanxin.ctf@gmail.com> Cc: stable@kernel.org Link: https://lore.kernel.org/r/1587618568-13418-1-git-send-email-xiyuyang19@fudan.edu.cn Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-06-22ext4: fix EXT_MAX_EXTENT/INDEX to check for zeroed eh_maxHarshad Shirwadkar1-3/+6
commit c36a71b4e35ab35340facdd6964a00956b9fef0a upstream. If eh->eh_max is 0, EXT_MAX_EXTENT/INDEX would evaluate to unsigned (-1) resulting in illegal memory accesses. Although there is no consistent repro, we see that generic/019 sometimes crashes because of this bug. Ran gce-xfstests smoke and verified that there were no regressions. Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com> Link: https://lore.kernel.org/r/20200421023959.20879-2-harshadshirwadkar@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-06-22btrfs: fix space_info bytes_may_use underflow during space cache writeoutFilipe Manana1-5/+15
commit 2166e5edce9ac1edf3b113d6091ef72fcac2d6c4 upstream. We always preallocate a data extent for writing a free space cache, which causes writeback to always try the nocow path first, since the free space inode has the prealloc bit set in its flags. However if the block group that contains the data extent for the space cache has been turned to RO mode due to a running scrub or balance for example, we have to fallback to the cow path. In that case once a new data extent is allocated we end up calling btrfs_add_reserved_bytes(), which decrements the counter named bytes_may_use from the data space_info object with the expection that this counter was previously incremented with the same amount (the size of the data extent). However when we started writeout of the space cache at cache_save_setup(), we incremented the value of the bytes_may_use counter through a call to btrfs_check_data_free_space() and then decremented it through a call to btrfs_prealloc_file_range_trans() immediately after. So when starting the writeback if we fallback to cow mode we have to increment the counter bytes_may_use of the data space_info again to compensate for the extent allocation done by the cow path. When this issue happens we are incorrectly decrementing the bytes_may_use counter and when its current value is smaller then the amount we try to subtract we end up with the following warning: ------------[ cut here ]------------ WARNING: CPU: 3 PID: 657 at fs/btrfs/space-info.h:115 btrfs_add_reserved_bytes+0x3d6/0x4e0 [btrfs] Modules linked in: btrfs blake2b_generic xor raid6_pq libcrc32c (...) CPU: 3 PID: 657 Comm: kworker/u8:7 Tainted: G W 5.6.0-rc7-btrfs-next-58 #5 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014 Workqueue: writeback wb_workfn (flush-btrfs-1591) RIP: 0010:btrfs_add_reserved_bytes+0x3d6/0x4e0 [btrfs] Code: ff ff 48 (...) RSP: 0000:ffffa41608f13660 EFLAGS: 00010287 RAX: 0000000000001000 RBX: ffff9615b93ae400 RCX: 0000000000000000 RDX: 0000000000000002 RSI: 0000000000000000 RDI: ffff9615b96ab410 RBP: fffffffffffee000 R08: 0000000000000001 R09: 0000000000000000 R10: ffff961585e62a40 R11: 0000000000000000 R12: ffff9615b96ab400 R13: ffff9615a1a2a000 R14: 0000000000012000 R15: ffff9615b93ae400 FS: 0000000000000000(0000) GS:ffff9615bb200000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000055cbbc2ae178 CR3: 0000000115794006 CR4: 00000000003606e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: find_free_extent+0x4a0/0x16c0 [btrfs] btrfs_reserve_extent+0x91/0x180 [btrfs] cow_file_range+0x12d/0x490 [btrfs] btrfs_run_delalloc_range+0x9f/0x6d0 [btrfs] ? find_lock_delalloc_range+0x221/0x250 [btrfs] writepage_delalloc+0xe8/0x150 [btrfs] __extent_writepage+0xe8/0x4c0 [btrfs] extent_write_cache_pages+0x237/0x530 [btrfs] extent_writepages+0x44/0xa0 [btrfs] do_writepages+0x23/0x80 __writeback_single_inode+0x59/0x700 writeback_sb_inodes+0x267/0x5f0 __writeback_inodes_wb+0x87/0xe0 wb_writeback+0x382/0x590 ? wb_workfn+0x4a2/0x6c0 wb_workfn+0x4a2/0x6c0 process_one_work+0x26d/0x6a0 worker_thread+0x4f/0x3e0 ? process_one_work+0x6a0/0x6a0 kthread+0x103/0x140 ? kthread_create_worker_on_cpu+0x70/0x70 ret_from_fork+0x3a/0x50 irq event stamp: 0 hardirqs last enabled at (0): [<0000000000000000>] 0x0 hardirqs last disabled at (0): [<ffffffffb2abdedf>] copy_process+0x74f/0x2020 softirqs last enabled at (0): [<ffffffffb2abdedf>] copy_process+0x74f/0x2020 softirqs last disabled at (0): [<0000000000000000>] 0x0 ---[ end trace bd7c03622e0b0a52 ]--- ------------[ cut here ]------------ So fix this by incrementing the bytes_may_use counter of the data space_info when we fallback to the cow path. If the cow path is successful the counter is decremented after extent allocation (by btrfs_add_reserved_bytes()), if it fails it ends up being decremented as well when clearing the delalloc range (extent_clear_unlock_delalloc()). This could be triggered sporadically by the test case btrfs/061 from fstests. Fixes: 82d5902d9c681b ("Btrfs: Support reading/writing on disk free ino cache") CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-06-22btrfs: fix space_info bytes_may_use underflow after nocow buffered writeFilipe Manana1-5/+56
commit 467dc47ea99c56e966e99d09dae54869850abeeb upstream. When doing a buffered write we always try to reserve data space for it, even when the file has the NOCOW bit set or the write falls into a file range covered by a prealloc extent. This is done both because it is expensive to check if we can do a nocow write (checking if an extent is shared through reflinks or if there's a hole in the range for example), and because when writeback starts we might actually need to fallback to COW mode (for example the block group containing the target extents was turned into RO mode due to a scrub or balance). When we are unable to reserve data space we check if we can do a nocow write, and if we can, we proceed with dirtying the pages and setting up the range for delalloc. In this case the bytes_may_use counter of the data space_info object is not incremented, unlike in the case where we are able to reserve data space (done through btrfs_check_data_free_space() which calls btrfs_alloc_data_chunk_ondemand()). Later when running delalloc we attempt to start writeback in nocow mode but we might revert back to cow mode, for example because in the meanwhile a block group was turned into RO mode by a scrub or relocation. The cow path after successfully allocating an extent ends up calling btrfs_add_reserved_bytes(), which expects the bytes_may_use counter of the data space_info object to have been incremented before - but we did not do it when the buffered write started, since there was not enough available data space. So btrfs_add_reserved_bytes() ends up decrementing the bytes_may_use counter anyway, and when the counter's current value is smaller then the size of the allocated extent we get a stack trace like the following: ------------[ cut here ]------------ WARNING: CPU: 0 PID: 20138 at fs/btrfs/space-info.h:115 btrfs_add_reserved_bytes+0x3d6/0x4e0 [btrfs] Modules linked in: btrfs blake2b_generic xor raid6_pq libcrc32c (...) CPU: 0 PID: 20138 Comm: kworker/u8:15 Not tainted 5.6.0-rc7-btrfs-next-58 #5 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014 Workqueue: writeback wb_workfn (flush-btrfs-1754) RIP: 0010:btrfs_add_reserved_bytes+0x3d6/0x4e0 [btrfs] Code: ff ff 48 (...) RSP: 0018:ffffbda18a4b3568 EFLAGS: 00010287 RAX: 0000000000000000 RBX: ffff9ca076f5d800 RCX: 0000000000000000 RDX: 0000000000000002 RSI: 0000000000000000 RDI: ffff9ca068470410 RBP: fffffffffffff000 R08: 0000000000000001 R09: 0000000000000000 R10: ffff9ca079d58040 R11: 0000000000000000 R12: ffff9ca068470400 R13: ffff9ca0408b2000 R14: 0000000000001000 R15: ffff9ca076f5d800 FS: 0000000000000000(0000) GS:ffff9ca07a600000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00005605dbfe7048 CR3: 0000000138570006 CR4: 00000000003606f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: find_free_extent+0x4a0/0x16c0 [btrfs] btrfs_reserve_extent+0x91/0x180 [btrfs] cow_file_range+0x12d/0x490 [btrfs] run_delalloc_nocow+0x341/0xa40 [btrfs] btrfs_run_delalloc_range+0x1ea/0x6d0 [btrfs] ? find_lock_delalloc_range+0x221/0x250 [btrfs] writepage_delalloc+0xe8/0x150 [btrfs] __extent_writepage+0xe8/0x4c0 [btrfs] extent_write_cache_pages+0x237/0x530 [btrfs] ? btrfs_wq_submit_bio+0x9f/0xc0 [btrfs] extent_writepages+0x44/0xa0 [btrfs] do_writepages+0x23/0x80 __writeback_single_inode+0x59/0x700 writeback_sb_inodes+0x267/0x5f0 __writeback_inodes_wb+0x87/0xe0 wb_writeback+0x382/0x590 ? wb_workfn+0x4a2/0x6c0 wb_workfn+0x4a2/0x6c0 process_one_work+0x26d/0x6a0 worker_thread+0x4f/0x3e0 ? process_one_work+0x6a0/0x6a0 kthread+0x103/0x140 ? kthread_create_worker_on_cpu+0x70/0x70 ret_from_fork+0x3a/0x50 irq event stamp: 0 hardirqs last enabled at (0): [<0000000000000000>] 0x0 hardirqs last disabled at (0): [<ffffffff94ebdedf>] copy_process+0x74f/0x2020 softirqs last enabled at (0): [<ffffffff94ebdedf>] copy_process+0x74f/0x2020 softirqs last disabled at (0): [<0000000000000000>] 0x0 ---[ end trace f9f6ef8ec4cd8ec9 ]--- So to fix this, when falling back into cow mode check if space was not reserved, by testing for the bit EXTENT_NORESERVE in the respective file range, and if not, increment the bytes_may_use counter for the data space_info object. Also clear the EXTENT_NORESERVE bit from the range, so that if the cow path fails it decrements the bytes_may_use counter when clearing the delalloc range (through the btrfs_clear_delalloc_extent() callback). Fixes: 7ee9e4405f264e ("Btrfs: check if we can nocow if we don't have data space") CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-06-22btrfs: fix wrong file range cleanup after an error filling dealloc rangeFilipe Manana1-1/+1
commit e2c8e92d1140754073ad3799eb6620c76bab2078 upstream. If an error happens while running dellaloc in COW mode for a range, we can end up calling extent_clear_unlock_delalloc() for a range that goes beyond our range's end offset by 1 byte, which affects 1 extra page. This results in clearing bits and doing page operations (such as a page unlock) outside our target range. Fix that by calling extent_clear_unlock_delalloc() with an inclusive end offset, instead of an exclusive end offset, at cow_file_range(). Fixes: a315e68f6e8b30 ("Btrfs: fix invalid attempt to free reserved space on failure to cow range") CC: stable@vger.kernel.org # 4.14+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-06-22btrfs: fix corrupt log due to concurrent fsync of inodes with shared extentsFilipe Manana4-4/+27
commit e289f03ea79bbc6574b78ac25682555423a91cbb upstream. When we have extents shared amongst different inodes in the same subvolume, if we fsync them in parallel we can end up with checksum items in the log tree that represent ranges which overlap. For example, consider we have inodes A and B, both sharing an extent that covers the logical range from X to X + 64KiB: 1) Task A starts an fsync on inode A; 2) Task B starts an fsync on inode B; 3) Task A calls btrfs_csum_file_blocks(), and the first search in the log tree, through btrfs_lookup_csum(), returns -EFBIG because it finds an existing checksum item that covers the range from X - 64KiB to X; 4) Task A checks that the checksum item has not reached the maximum possible size (MAX_CSUM_ITEMS) and then releases the search path before it does another path search for insertion (through a direct call to btrfs_search_slot()); 5) As soon as task A releases the path and before it does the search for insertion, task B calls btrfs_csum_file_blocks() and gets -EFBIG too, because there is an existing checksum item that has an end offset that matches the start offset (X) of the checksum range we want to log; 6) Task B releases the path; 7) Task A does the path search for insertion (through btrfs_search_slot()) and then verifies that the checksum item that ends at offset X still exists and extends its size to insert the checksums for the range from X to X + 64KiB; 8) Task A releases the path and returns from btrfs_csum_file_blocks(), having inserted the checksums into an existing checksum item that got its size extended. At this point we have one checksum item in the log tree that covers the logical range from X - 64KiB to X + 64KiB; 9) Task B now does a search for insertion using btrfs_search_slot() too, but it finds that the previous checksum item no longer ends at the offset X, it now ends at an of offset X + 64KiB, so it leaves that item untouched. Then it releases the path and calls btrfs_insert_empty_item() that inserts a checksum item with a key offset corresponding to X and a size for inserting a single checksum (4 bytes in case of crc32c). Subsequent iterations end up extending this new checksum item so that it contains the checksums for the range from X to X + 64KiB. So after task B returns from btrfs_csum_file_blocks() we end up with two checksum items in the log tree that have overlapping ranges, one for the range from X - 64KiB to X + 64KiB, and another for the range from X to X + 64KiB. Having checksum items that represent ranges which overlap, regardless of being in the log tree or in the chekcsums tree, can lead to problems where checksums for a file range end up not being found. This type of problem has happened a few times in the past and the following commits fixed them and explain in detail why having checksum items with overlapping ranges is problematic: 27b9a8122ff71a "Btrfs: fix csum tree corruption, duplicate and outdated checksums" b84b8390d6009c "Btrfs: fix file read corruption after extent cloning and fsync" 40e046acbd2f36 "Btrfs: fix missing data checksums after replaying a log tree" Since this specific instance of the problem can only happen when logging inodes, because it is the only case where concurrent attempts to insert checksums for the same range can happen, fix the issue by using an extent io tree as a range lock to serialize checksum insertion during inode logging. This issue could often be reproduced by the test case generic/457 from fstests. When it happens it produces the following trace: BTRFS critical (device dm-0): corrupt leaf: root=18446744073709551610 block=30625792 slot=42, csum end range (15020032) goes beyond the start range (15015936) of the next csum item BTRFS info (device dm-0): leaf 30625792 gen 7 total ptrs 49 free space 2402 owner 18446744073709551610 BTRFS info (device dm-0): refs 1 lock (w:0 r:0 bw:0 br:0 sw:0 sr:0) lock_owner 0 current 15884 item 0 key (18446744073709551606 128 13979648) itemoff 3991 itemsize 4 item 1 key (18446744073709551606 128 13983744) itemoff 3987 itemsize 4 item 2 key (18446744073709551606 128 13987840) itemoff 3983 itemsize 4 item 3 key (18446744073709551606 128 13991936) itemoff 3979 itemsize 4 item 4 key (18446744073709551606 128 13996032) itemoff 3975 itemsize 4 item 5 key (18446744073709551606 128 14000128) itemoff 3971 itemsize 4 (...) BTRFS error (device dm-0): block=30625792 write time tree block corruption detected ------------[ cut here ]------------ WARNING: CPU: 1 PID: 15884 at fs/btrfs/disk-io.c:539 btree_csum_one_bio+0x268/0x2d0 [btrfs] Modules linked in: btrfs dm_thin_pool ... CPU: 1 PID: 15884 Comm: fsx Tainted: G W 5.6.0-rc7-btrfs-next-58 #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014 RIP: 0010:btree_csum_one_bio+0x268/0x2d0 [btrfs] Code: c7 c7 ... RSP: 0018:ffffbb0109e6f8e0 EFLAGS: 00010296 RAX: 0000000000000000 RBX: ffffe1c0847b6080 RCX: 0000000000000000 RDX: 0000000000000000 RSI: ffffffffaa963988 RDI: 0000000000000001 RBP: ffff956a4f4d2000 R08: 0000000000000000 R09: 0000000000000001 R10: 0000000000000526 R11: 0000000000000000 R12: ffff956a5cd28bb0 R13: 0000000000000000 R14: ffff956a649c9388 R15: 000000011ed82000 FS: 00007fb419959e80(0000) GS:ffff956a7aa00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000fe6d54 CR3: 0000000138696005 CR4: 00000000003606e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: btree_submit_bio_hook+0x67/0xc0 [btrfs] submit_one_bio+0x31/0x50 [btrfs] btree_write_cache_pages+0x2db/0x4b0 [btrfs] ? __filemap_fdatawrite_range+0xb1/0x110 do_writepages+0x23/0x80 __filemap_fdatawrite_range+0xd2/0x110 btrfs_write_marked_extents+0x15e/0x180 [btrfs] btrfs_sync_log+0x206/0x10a0 [btrfs] ? kmem_cache_free+0x315/0x3b0 ? btrfs_log_inode+0x1e8/0xf90 [btrfs] ? __mutex_unlock_slowpath+0x45/0x2a0 ? lockref_put_or_lock+0x9/0x30 ? dput+0x2d/0x580 ? dput+0xb5/0x580 ? btrfs_sync_file+0x464/0x4d0 [btrfs] btrfs_sync_file+0x464/0x4d0 [btrfs] do_fsync+0x38/0x60 __x64_sys_fsync+0x10/0x20 do_syscall_64+0x5c/0x280 entry_SYSCALL_64_after_hwframe+0x49/0xbe RIP: 0033:0x7fb41953a6d0 Code: 48 3d ... RSP: 002b:00007ffcc86bd218 EFLAGS: 00000246 ORIG_RAX: 000000000000004a RAX: ffffffffffffffda RBX: 000000000000000d RCX: 00007fb41953a6d0 RDX: 0000000000000009 RSI: 0000000000040000 RDI: 0000000000000003 RBP: 0000000000040000 R08: 0000000000000001 R09: 0000000000000009 R10: 0000000000000064 R11: 0000000000000246 R12: 0000556cf4b2c060 R13: 0000000000000100 R14: 0000000000000000 R15: 0000556cf322b420 irq event stamp: 0 hardirqs last enabled at (0): [<0000000000000000>] 0x0 hardirqs last disabled at (0): [<ffffffffa96bdedf>] copy_process+0x74f/0x2020 softirqs last enabled at (0): [<ffffffffa96bdedf>] copy_process+0x74f/0x2020 softirqs last disabled at (0): [<0000000000000000>] 0x0 ---[ end trace d543fc76f5ad7fd8 ]--- In that trace the tree checker detected the overlapping checksum items at the time when we triggered writeback for the log tree when syncing the log. Another trace that can happen is due to BUG_ON() when deleting checksum items while logging an inode: BTRFS critical (device dm-0): slot 81 key (18446744073709551606 128 13635584) new key (18446744073709551606 128 13635584) BTRFS info (device dm-0): leaf 30949376 gen 7 total ptrs 98 free space 8527 owner 18446744073709551610 BTRFS info (device dm-0): refs 4 lock (w:1 r:0 bw:0 br:0 sw:1 sr:0) lock_owner 13473 current 13473 item 0 key (257 1 0) itemoff 16123 itemsize 160 inode generation 7 size 262144 mode 100600 item 1 key (257 12 256) itemoff 16103 itemsize 20 item 2 key (257 108 0) itemoff 16050 itemsize 53 extent data disk bytenr 13631488 nr 4096 extent data offset 0 nr 131072 ram 131072 (...) ------------[ cut here ]------------ kernel BUG at fs/btrfs/ctree.c:3153! invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI CPU: 1 PID: 13473 Comm: fsx Not tainted 5.6.0-rc7-btrfs-next-58 #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014 RIP: 0010:btrfs_set_item_key_safe+0x1ea/0x270 [btrfs] Code: 0f b6 ... RSP: 0018:ffff95e3889179d0 EFLAGS: 00010282 RAX: 0000000000000000 RBX: 0000000000000051 RCX: 0000000000000000 RDX: 0000000000000000 RSI: ffffffffb7763988 RDI: 0000000000000001 RBP: fffffffffffffff6 R08: 0000000000000000 R09: 0000000000000001 R10: 00000000000009ef R11: 0000000000000000 R12: ffff8912a8ba5a08 R13: ffff95e388917a06 R14: ffff89138dcf68c8 R15: ffff95e388917ace FS: 00007fe587084e80(0000) GS:ffff8913baa00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fe587091000 CR3: 0000000126dac005 CR4: 00000000003606e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: btrfs_del_csums+0x2f4/0x540 [btrfs] copy_items+0x4b5/0x560 [btrfs] btrfs_log_inode+0x910/0xf90 [btrfs] btrfs_log_inode_parent+0x2a0/0xe40 [btrfs] ? dget_parent+0x5/0x370 btrfs_log_dentry_safe+0x4a/0x70 [btrfs] btrfs_sync_file+0x42b/0x4d0 [btrfs] __x64_sys_msync+0x199/0x200 do_syscall_64+0x5c/0x280 entry_SYSCALL_64_after_hwframe+0x49/0xbe RIP: 0033:0x7fe586c65760 Code: 00 f7 ... RSP: 002b:00007ffe250f98b8 EFLAGS: 00000246 ORIG_RAX: 000000000000001a RAX: ffffffffffffffda RBX: 00000000000040e1 RCX: 00007fe586c65760 RDX: 0000000000000004 RSI: 0000000000006b51 RDI: 00007fe58708b000 RBP: 0000000000006a70 R08: 0000000000000003 R09: 00007fe58700cb61 R10: 0000000000000100 R11: 0000000000000246 R12: 00000000000000e1 R13: 00007fe58708b000 R14: 0000000000006b51 R15: 0000558de021a420 Modules linked in: dm_log_writes ... ---[ end trace c92a7f447a8515f5 ]--- CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-06-22btrfs: fix error handling when submitting direct I/O bioOmar Sandoval1-3/+3
commit 6d3113a193e3385c72240096fe397618ecab6e43 upstream. In btrfs_submit_direct_hook(), if a direct I/O write doesn't span a RAID stripe or chunk, we submit orig_bio without cloning it. In this case, we don't increment pending_bios. Then, if btrfs_submit_dio_bio() fails, we decrement pending_bios to -1, and we never complete orig_bio. Fix it by initializing pending_bios to 1 instead of incrementing later. Fixing this exposes another bug: we put orig_bio prematurely and then put it again from end_io. Fix it by not putting orig_bio. After this change, pending_bios is really more of a reference count, but I'll leave that cleanup separate to keep the fix small. Fixes: e65e15355429 ("btrfs: fix panic caused by direct IO") CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-06-22btrfs: reloc: fix reloc root leak and NULL pointer dereferenceQu Wenruo1-3/+9
commit 51415b6c1b117e223bc083e30af675cb5c5498f3 upstream. [BUG] When balance is canceled, there is a pretty high chance that unmounting the fs can lead to lead the NULL pointer dereference: BTRFS warning (device dm-3): page private not zero on page 223158272 ... BTRFS warning (device dm-3): page private not zero on page 223162368 BTRFS error (device dm-3): leaked root 18446744073709551608-304 refcount 1 BUG: kernel NULL pointer dereference, address: 0000000000000168 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 0 P4D 0 Oops: 0000 [#1] PREEMPT SMP NOPTI CPU: 2 PID: 5793 Comm: umount Tainted: G O 5.7.0-rc5-custom+ #53 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 RIP: 0010:__lock_acquire+0x5dc/0x24c0 Call Trace: lock_acquire+0xab/0x390 _raw_spin_lock+0x39/0x80 btrfs_release_extent_buffer_pages+0xd7/0x200 [btrfs] release_extent_buffer+0xb2/0x170 [btrfs] free_extent_buffer+0x66/0xb0 [btrfs] btrfs_put_root+0x8e/0x130 [btrfs] btrfs_check_leaked_roots.cold+0x5/0x5d [btrfs] btrfs_free_fs_info+0xe5/0x120 [btrfs] btrfs_kill_super+0x1f/0x30 [btrfs] deactivate_locked_super+0x3b/0x80 deactivate_super+0x3e/0x50 cleanup_mnt+0x109/0x160 __cleanup_mnt+0x12/0x20 task_work_run+0x67/0xa0 exit_to_usermode_loop+0xc5/0xd0 syscall_return_slowpath+0x205/0x360 do_syscall_64+0x6e/0xb0 entry_SYSCALL_64_after_hwframe+0x49/0xb3 RIP: 0033:0x7fd028ef740b [CAUSE] When balance is canceled, all reloc roots are marked as orphan, and orphan reloc roots are going to be cleaned up. However for orphan reloc roots and merged reloc roots, their lifespan are quite different: Merged reloc roots | Orphan reloc roots by cancel -------------------------------------------------------------------- create_reloc_root() | create_reloc_root() |- refs == 1 | |- refs == 1 | btrfs_grab_root(reloc_root); | btrfs_grab_root(reloc_root); |- refs == 2 | |- refs == 2 | root->reloc_root = reloc_root; | root->reloc_root = reloc_root; >>> No difference so far <<< | prepare_to_merge() | prepare_to_merge() |- btrfs_set_root_refs(item, 1);| |- if (!err) (err == -EINTR) | merge_reloc_roots() | merge_reloc_roots() |- merge_reloc_root() | |- Doing nothing to put reloc root |- insert_dirty_subvol() | |- refs == 2 |- __del_reloc_root() | |- btrfs_put_root() | |- refs == 1 | >>> Now orphan reloc roots still have refs 2 <<< | clean_dirty_subvols() | clean_dirty_subvols() |- btrfs_drop_snapshot() | |- btrfS_drop_snapshot() |- reloc_root get freed | |- reloc_root still has refs 2 | related ebs get freed, but | reloc_root still recorded in | allocated_roots btrfs_check_leaked_roots() | btrfs_check_leaked_roots() |- No leaked roots | |- Leaked reloc_roots detected | |- btrfs_put_root() | |- free_extent_buffer(root->node); | |- eb already freed, caused NULL | pointer dereference [FIX] The fix is to clear fs_root->reloc_root and put it at merge_reloc_roots() time, so that we won't leak reloc roots. Fixes: d2311e698578 ("btrfs: relocation: Delay reloc tree deletion after merge_reloc_roots") CC: stable@vger.kernel.org # 5.1+ Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-06-22btrfs: force chunk allocation if our global rsv is larger than metadataJosef Bacik2-0/+21
commit 9c343784c4328781129bcf9e671645f69fe4b38a upstream. Nikolay noticed a bunch of test failures with my global rsv steal patches. At first he thought they were introduced by them, but they've been failing for a while with 64k nodes. The problem is with 64k nodes we have a global reserve that calculates out to 13MiB on a freshly made file system, which only has 8MiB of metadata space. Because of changes I previously made we no longer account for the global reserve in the overcommit logic, which means we correctly allow overcommit to happen even though we are already overcommitted. However in some corner cases, for example btrfs/170, we will allocate the entire file system up with data chunks before we have enough space pressure to allocate a metadata chunk. Then once the fs is full we ENOSPC out because we cannot overcommit and the global reserve is taking up all of the available space. The most ideal way to deal with this is to change our space reservation stuff to take into account the height of the tree's that we're modifying, so that our global reserve calculation does not end up so obscenely large. However that is a huge undertaking. Instead fix this by forcing a chunk allocation if the global reserve is larger than the total metadata space. This gives us essentially the same behavior that happened before, we get a chunk allocated and these tests can pass. This is meant to be a stop-gap measure until we can tackle the "tree height only" project. Fixes: 0096420adb03 ("btrfs: do not account global reserve in can_overcommit") CC: stable@vger.kernel.org # 5.4+ Reviewed-by: Nikolay Borisov <nborisov@suse.com> Tested-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-06-22btrfs: send: emit file capabilities after chownMarcos Paulo de Souza1-0/+67
commit 89efda52e6b6930f80f5adda9c3c9edfb1397191 upstream. Whenever a chown is executed, all capabilities of the file being touched are lost. When doing incremental send with a file with capabilities, there is a situation where the capability can be lost on the receiving side. The sequence of actions bellow shows the problem: $ mount /dev/sda fs1 $ mount /dev/sdb fs2 $ touch fs1/foo.bar $ setcap cap_sys_nice+ep fs1/foo.bar $ btrfs subvolume snapshot -r fs1 fs1/snap_init $ btrfs send fs1/snap_init | btrfs receive fs2 $ chgrp adm fs1/foo.bar $ setcap cap_sys_nice+ep fs1/foo.bar $ btrfs subvolume snapshot -r fs1 fs1/snap_complete $ btrfs subvolume snapshot -r fs1 fs1/snap_incremental $ btrfs send fs1/snap_complete | btrfs receive fs2 $ btrfs send -p fs1/snap_init fs1/snap_incremental | btrfs receive fs2 At this point, only a chown was emitted by "btrfs send" since only the group was changed. This makes the cap_sys_nice capability to be dropped from fs2/snap_incremental/foo.bar To fix that, only emit capabilities after chown is emitted. The current code first checks for xattrs that are new/changed, emits them, and later emit the chown. Now, __process_new_xattr skips capabilities, letting only finish_inode_if_needed to emit them, if they exist, for the inode being processed. This behavior was being worked around in "btrfs receive" side by caching the capability and only applying it after chown. Now, xattrs are only emmited _after_ chown, making that workaround not needed anymore. Link: https://github.com/kdave/btrfs-progs/issues/202 CC: stable@vger.kernel.org # 4.4+ Suggested-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>