summaryrefslogtreecommitdiff
path: root/fs/btrfs/inode.c
AgeCommit message (Collapse)AuthorFilesLines
2020-12-29btrfs: fix return value mixup in btrfs_get_extentPavel Machek1-1/+1
commit 881a3a11c2b858fe9b69ef79ac5ee9978a266dc9 upstream btrfs_get_extent() sets variable ret, but out: error path expect error to be in variable err so the error code is lost. Fixes: 6bf9e4bd6a27 ("btrfs: inode: Verify inode mode to avoid NULL pointer dereference") CC: stable@vger.kernel.org # 5.4+ Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Pavel Machek (CIP) <pavel@denx.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sudip Mukherjee <sudipm.mukherjee@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-12-02btrfs: inode: Verify inode mode to avoid NULL pointer dereferenceQu Wenruo1-8/+33
commit 6bf9e4bd6a277840d3fe8c5d5d530a1fbd3db592 upstream [BUG] When accessing a file on a crafted image, btrfs can crash in block layer: BUG: unable to handle kernel NULL pointer dereference at 0000000000000008 PGD 136501067 P4D 136501067 PUD 124519067 PMD 0 CPU: 3 PID: 0 Comm: swapper/3 Not tainted 5.0.0-rc8-default #252 RIP: 0010:end_bio_extent_readpage+0x144/0x700 Call Trace: <IRQ> blk_update_request+0x8f/0x350 blk_mq_end_request+0x1a/0x120 blk_done_softirq+0x99/0xc0 __do_softirq+0xc7/0x467 irq_exit+0xd1/0xe0 call_function_single_interrupt+0xf/0x20 </IRQ> RIP: 0010:default_idle+0x1e/0x170 [CAUSE] The crafted image has a tricky corruption, the INODE_ITEM has a different type against its parent dir: item 20 key (268 INODE_ITEM 0) itemoff 2808 itemsize 160 generation 13 transid 13 size 1048576 nbytes 1048576 block group 0 mode 121644 links 1 uid 0 gid 0 rdev 0 sequence 9 flags 0x0(none) This mode number 0120000 means it's a symlink. But the dir item think it's still a regular file: item 8 key (264 DIR_INDEX 5) itemoff 3707 itemsize 32 location key (268 INODE_ITEM 0) type FILE transid 13 data_len 0 name_len 2 name: f4 item 40 key (264 DIR_ITEM 51821248) itemoff 1573 itemsize 32 location key (268 INODE_ITEM 0) type FILE transid 13 data_len 0 name_len 2 name: f4 For symlink, we don't set BTRFS_I(inode)->io_tree.ops and leave it empty, as symlink is only designed to have inlined extent, all handled by tree block read. Thus no need to trigger btrfs_submit_bio_hook() for inline file extent. However end_bio_extent_readpage() expects tree->ops populated, as it's reading regular data extent. This causes NULL pointer dereference. [FIX] This patch fixes the problem in two ways: - Verify inode mode against its dir item when looking up inode So in btrfs_lookup_dentry() if we find inode mode mismatch with dir item, we error out so that corrupted inode will not be accessed. - Verify inode mode when getting extent mapping Only regular file should have regular or preallocated extent. If we found regular/preallocated file extent for symlink or the rest, we error out before submitting the read bio. With this fix that crafted image can be rejected gracefully: BTRFS critical (device loop0): inode mode mismatch with dir: inode mode=0121644 btrfs type=7 dir type=1 Reported-by: Yoon Jungyeon <jungyeon@gatech.edu> Link: https://bugzilla.kernel.org/show_bug.cgi?id=202763 Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> [sudip: use original btrfs_inode_type()] Signed-off-by: Sudip Mukherjee <sudipm.mukherjee@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-12-02btrfs: adjust return values of btrfs_inode_by_nameSu Yue1-12/+10
commit 005d67127fa9dfb3382f2c9e918feed7a243a7fe upstream Previously, btrfs_inode_by_name() returned 0 which left caller to check objectid of location even location if the type was invalid. Let btrfs_inode_by_name() return -EUCLEAN if a corrupted location of a dir entry is found. Removal of label out_err also simplifies the function. Signed-off-by: Su Yue <suy.fnst@cn.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.com> [ drop unlikely ] Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sudip Mukherjee <sudipm.mukherjee@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-10-14Btrfs: fix unexpected failure of nocow buffered writes after snapshotting ↵Robbie Ko1-21/+4
when low on space commit 8ecebf4d767e2307a946c8905278d6358eda35c3 upstream. Commit e9894fd3e3b3 ("Btrfs: fix snapshot vs nocow writting") forced nocow writes to fallback to COW, during writeback, when a snapshot is created. This resulted in writes made before creating the snapshot to unexpectedly fail with ENOSPC during writeback when success (0) was returned to user space through the write system call. The steps leading to this problem are: 1. When it's not possible to allocate data space for a write, the buffered write path checks if a NOCOW write is possible. If it is, it will not reserve space and success (0) is returned to user space. 2. Then when a snapshot is created, the root's will_be_snapshotted atomic is incremented and writeback is triggered for all inode's that belong to the root being snapshotted. Incrementing that atomic forces all previous writes to fallback to COW during writeback (running delalloc). 3. This results in the writeback for the inodes to fail and therefore setting the ENOSPC error in their mappings, so that a subsequent fsync on them will report the error to user space. So it's not a completely silent data loss (since fsync will report ENOSPC) but it's a very unexpected and undesirable behaviour, because if a clean shutdown/unmount of the filesystem happens without previous calls to fsync, it is expected to have the data present in the files after mounting the filesystem again. So fix this by adding a new atomic named snapshot_force_cow to the root structure which prevents this behaviour and works the following way: 1. It is incremented when we start to create a snapshot after triggering writeback and before waiting for writeback to finish. 2. This new atomic is now what is used by writeback (running delalloc) to decide whether we need to fallback to COW or not. Because we incremented this new atomic after triggering writeback in the snapshot creation ioctl, we ensure that all buffered writes that happened before snapshot creation will succeed and not fallback to COW (which would make them fail with ENOSPC). 3. The existing atomic, will_be_snapshotted, is kept because it is used to force new buffered writes, that start after we started snapshotting, to reserve data space even when NOCOW is possible. This makes these writes fail early with ENOSPC when there's no available space to allocate, preventing the unexpected behaviour of writeback later failing with ENOSPC due to a fallback to COW mode. Fixes: e9894fd3e3b3 ("Btrfs: fix snapshot vs nocow writting") Signed-off-by: Robbie Ko <robbieko@synology.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-10-01btrfs: qgroup: fix data leak caused by race between writeback and truncateQu Wenruo1-13/+10
[ Upstream commit fa91e4aa1716004ea8096d5185ec0451e206aea0 ] [BUG] When running tests like generic/013 on test device with btrfs quota enabled, it can normally lead to data leak, detected at unmount time: BTRFS warning (device dm-3): qgroup 0/5 has unreleased space, type 0 rsv 4096 ------------[ cut here ]------------ WARNING: CPU: 11 PID: 16386 at fs/btrfs/disk-io.c:4142 close_ctree+0x1dc/0x323 [btrfs] RIP: 0010:close_ctree+0x1dc/0x323 [btrfs] Call Trace: btrfs_put_super+0x15/0x17 [btrfs] generic_shutdown_super+0x72/0x110 kill_anon_super+0x18/0x30 btrfs_kill_super+0x17/0x30 [btrfs] deactivate_locked_super+0x3b/0xa0 deactivate_super+0x40/0x50 cleanup_mnt+0x135/0x190 __cleanup_mnt+0x12/0x20 task_work_run+0x64/0xb0 __prepare_exit_to_usermode+0x1bc/0x1c0 __syscall_return_slowpath+0x47/0x230 do_syscall_64+0x64/0xb0 entry_SYSCALL_64_after_hwframe+0x44/0xa9 ---[ end trace caf08beafeca2392 ]--- BTRFS error (device dm-3): qgroup reserved space leaked [CAUSE] In the offending case, the offending operations are: 2/6: writev f2X[269 1 0 0 0 0] [1006997,67,288] 0 2/7: truncate f2X[269 1 0 0 48 1026293] 18388 0 The following sequence of events could happen after the writev(): CPU1 (writeback) | CPU2 (truncate) ----------------------------------------------------------------- btrfs_writepages() | |- extent_write_cache_pages() | |- Got page for 1003520 | | 1003520 is Dirty, no writeback | | So (!clear_page_dirty_for_io()) | | gets called for it | |- Now page 1003520 is Clean. | | | btrfs_setattr() | | |- btrfs_setsize() | | |- truncate_setsize() | | New i_size is 18388 |- __extent_writepage() | | |- page_offset() > i_size | |- btrfs_invalidatepage() | |- Page is clean, so no qgroup | callback executed This means, the qgroup reserved data space is not properly released in btrfs_invalidatepage() as the page is Clean. [FIX] Instead of checking the dirty bit of a page, call btrfs_qgroup_free_data() unconditionally in btrfs_invalidatepage(). As qgroup rsv are completely bound to the QGROUP_RESERVED bit of io_tree, not bound to page status, thus we won't cause double freeing anyway. Fixes: 0b34c261e235 ("btrfs: qgroup: Prevent qgroup->reserved from going subzero") CC: stable@vger.kernel.org # 4.14+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-08-26btrfs: inode: fix NULL pointer dereference if inode doesn't need compressionQu Wenruo1-4/+11
[ Upstream commit 1e6e238c3002ea3611465ce5f32777ddd6a40126 ] [BUG] There is a bug report of NULL pointer dereference caused in compress_file_extent(): Oops: Kernel access of bad area, sig: 11 [#1] LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries Workqueue: btrfs-delalloc btrfs_delalloc_helper [btrfs] NIP [c008000006dd4d34] compress_file_range.constprop.41+0x75c/0x8a0 [btrfs] LR [c008000006dd4d1c] compress_file_range.constprop.41+0x744/0x8a0 [btrfs] Call Trace: [c000000c69093b00] [c008000006dd4d1c] compress_file_range.constprop.41+0x744/0x8a0 [btrfs] (unreliable) [c000000c69093bd0] [c008000006dd4ebc] async_cow_start+0x44/0xa0 [btrfs] [c000000c69093c10] [c008000006e14824] normal_work_helper+0xdc/0x598 [btrfs] [c000000c69093c80] [c0000000001608c0] process_one_work+0x2c0/0x5b0 [c000000c69093d10] [c000000000160c38] worker_thread+0x88/0x660 [c000000c69093db0] [c00000000016b55c] kthread+0x1ac/0x1c0 [c000000c69093e20] [c00000000000b660] ret_from_kernel_thread+0x5c/0x7c ---[ end trace f16954aa20d822f6 ]--- [CAUSE] For the following execution route of compress_file_range(), it's possible to hit NULL pointer dereference: compress_file_extent() |- pages = NULL; |- start = async_chunk->start = 0; |- end = async_chunk = 4095; |- nr_pages = 1; |- inode_need_compress() == false; <<< Possible, see later explanation | Now, we have nr_pages = 1, pages = NULL |- cont: |- ret = cow_file_range_inline(); |- if (ret <= 0) { |- for (i = 0; i < nr_pages; i++) { |- WARN_ON(pages[i]->mapping); <<< Crash To enter above call execution branch, we need the following race: Thread 1 (chattr) | Thread 2 (writeback) --------------------------+------------------------------ | btrfs_run_delalloc_range | |- inode_need_compress = true | |- cow_file_range_async() btrfs_ioctl_set_flag() | |- binode_flags |= | BTRFS_INODE_NOCOMPRESS | | compress_file_range() | |- inode_need_compress = false | |- nr_page = 1 while pages = NULL | | Then hit the crash [FIX] This patch will fix it by checking @pages before doing accessing it. This patch is only designed as a hot fix and easy to backport. More elegant fix may make btrfs only check inode_need_compress() once to avoid such race, but that would be another story. Reported-by: Luciano Chavez <chavez@us.ibm.com> Fixes: 4d3a800ebb12 ("btrfs: merge nr_pages input and output parameter in compress_pages") CC: stable@vger.kernel.org # 4.14.x: cecc8d9038d16: btrfs: Move free_pages_out label in inline extent handling branch in compress_file_range CC: stable@vger.kernel.org # 4.14+ Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-08-26btrfs: Move free_pages_out label in inline extent handling branch in ↵Nikolay Borisov1-8/+8
compress_file_range [ Upstream commit cecc8d9038d164eda61fbcd72520975a554ea63e ] This label is only executed if compress_file_range fails to create an inline extent. So move its code in the semantically related inline extent handling branch. No functional changes. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-07-09btrfs: fix data block group relocation failure due to concurrent scrubFilipe Manana1-1/+18
[ Upstream commit 432cd2a10f1c10cead91fe706ff5dc52f06d642a ] When running relocation of a data block group while scrub is running in parallel, it is possible that the relocation will fail and abort the current transaction with an -EINVAL error: [134243.988595] BTRFS info (device sdc): found 14 extents, stage: move data extents [134243.999871] ------------[ cut here ]------------ [134244.000741] BTRFS: Transaction aborted (error -22) [134244.001692] WARNING: CPU: 0 PID: 26954 at fs/btrfs/ctree.c:1071 __btrfs_cow_block+0x6a7/0x790 [btrfs] [134244.003380] Modules linked in: btrfs blake2b_generic xor raid6_pq (...) [134244.012577] CPU: 0 PID: 26954 Comm: btrfs Tainted: G W 5.6.0-rc7-btrfs-next-58 #5 [134244.014162] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014 [134244.016184] RIP: 0010:__btrfs_cow_block+0x6a7/0x790 [btrfs] [134244.017151] Code: 48 c7 c7 (...) [134244.020549] RSP: 0018:ffffa41607863888 EFLAGS: 00010286 [134244.021515] RAX: 0000000000000000 RBX: ffff9614bdfe09c8 RCX: 0000000000000000 [134244.022822] RDX: 0000000000000001 RSI: ffffffffb3d63980 RDI: 0000000000000001 [134244.024124] RBP: ffff961589e8c000 R08: 0000000000000000 R09: 0000000000000001 [134244.025424] R10: ffffffffc0ae5955 R11: 0000000000000000 R12: ffff9614bd530d08 [134244.026725] R13: ffff9614ced41b88 R14: ffff9614bdfe2a48 R15: 0000000000000000 [134244.028024] FS: 00007f29b63c08c0(0000) GS:ffff9615ba600000(0000) knlGS:0000000000000000 [134244.029491] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [134244.030560] CR2: 00007f4eb339b000 CR3: 0000000130d6e006 CR4: 00000000003606f0 [134244.031997] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [134244.033153] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [134244.034484] Call Trace: [134244.034984] btrfs_cow_block+0x12b/0x2b0 [btrfs] [134244.035859] do_relocation+0x30b/0x790 [btrfs] [134244.036681] ? do_raw_spin_unlock+0x49/0xc0 [134244.037460] ? _raw_spin_unlock+0x29/0x40 [134244.038235] relocate_tree_blocks+0x37b/0x730 [btrfs] [134244.039245] relocate_block_group+0x388/0x770 [btrfs] [134244.040228] btrfs_relocate_block_group+0x161/0x2e0 [btrfs] [134244.041323] btrfs_relocate_chunk+0x36/0x110 [btrfs] [134244.041345] btrfs_balance+0xc06/0x1860 [btrfs] [134244.043382] ? btrfs_ioctl_balance+0x27c/0x310 [btrfs] [134244.045586] btrfs_ioctl_balance+0x1ed/0x310 [btrfs] [134244.045611] btrfs_ioctl+0x1880/0x3760 [btrfs] [134244.049043] ? do_raw_spin_unlock+0x49/0xc0 [134244.049838] ? _raw_spin_unlock+0x29/0x40 [134244.050587] ? __handle_mm_fault+0x11b3/0x14b0 [134244.051417] ? ksys_ioctl+0x92/0xb0 [134244.052070] ksys_ioctl+0x92/0xb0 [134244.052701] ? trace_hardirqs_off_thunk+0x1a/0x1c [134244.053511] __x64_sys_ioctl+0x16/0x20 [134244.054206] do_syscall_64+0x5c/0x280 [134244.054891] entry_SYSCALL_64_after_hwframe+0x49/0xbe [134244.055819] RIP: 0033:0x7f29b51c9dd7 [134244.056491] Code: 00 00 00 (...) [134244.059767] RSP: 002b:00007ffcccc1dd08 EFLAGS: 00000202 ORIG_RAX: 0000000000000010 [134244.061168] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007f29b51c9dd7 [134244.062474] RDX: 00007ffcccc1dda0 RSI: 00000000c4009420 RDI: 0000000000000003 [134244.063771] RBP: 0000000000000003 R08: 00005565cea4b000 R09: 0000000000000000 [134244.065032] R10: 0000000000000541 R11: 0000000000000202 R12: 00007ffcccc2060a [134244.066327] R13: 00007ffcccc1dda0 R14: 0000000000000002 R15: 00007ffcccc1dec0 [134244.067626] irq event stamp: 0 [134244.068202] hardirqs last enabled at (0): [<0000000000000000>] 0x0 [134244.069351] hardirqs last disabled at (0): [<ffffffffb2abdedf>] copy_process+0x74f/0x2020 [134244.070909] softirqs last enabled at (0): [<ffffffffb2abdedf>] copy_process+0x74f/0x2020 [134244.072392] softirqs last disabled at (0): [<0000000000000000>] 0x0 [134244.073432] ---[ end trace bd7c03622e0b0a99 ]--- The -EINVAL error comes from the following chain of function calls: __btrfs_cow_block() <-- aborts the transaction btrfs_reloc_cow_block() replace_file_extents() get_new_location() <-- returns -EINVAL When relocating a data block group, for each allocated extent of the block group, we preallocate another extent (at prealloc_file_extent_cluster()), associated with the data relocation inode, and then dirty all its pages. These preallocated extents have, and must have, the same size that extents from the data block group being relocated have. Later before we start the relocation stage that updates pointers (bytenr field of file extent items) to point to the the new extents, we trigger writeback for the data relocation inode. The expectation is that writeback will write the pages to the previously preallocated extents, that it follows the NOCOW path. That is generally the case, however, if a scrub is running it may have turned the block group that contains those extents into RO mode, in which case writeback falls back to the COW path. However in the COW path instead of allocating exactly one extent with the expected size, the allocator may end up allocating several smaller extents due to free space fragmentation - because we tell it at cow_file_range() that the minimum allocation size can match the filesystem's sector size. This later breaks the relocation's expectation that an extent associated to a file extent item in the data relocation inode has the same size as the respective extent pointed by a file extent item in another tree - in this case the extent to which the relocation inode poins to is smaller, causing relocation.c:get_new_location() to return -EINVAL. For example, if we are relocating a data block group X that has a logical address of X and the block group has an extent allocated at the logical address X + 128KiB with a size of 64KiB: 1) At prealloc_file_extent_cluster() we allocate an extent for the data relocation inode with a size of 64KiB and associate it to the file offset 128KiB (X + 128KiB - X) of the data relocation inode. This preallocated extent was allocated at block group Z; 2) A scrub running in parallel turns block group Z into RO mode and starts scrubing its extents; 3) Relocation triggers writeback for the data relocation inode; 4) When running delalloc (btrfs_run_delalloc_range()), we try first the NOCOW path because the data relocation inode has BTRFS_INODE_PREALLOC set in its flags. However, because block group Z is in RO mode, the NOCOW path (run_delalloc_nocow()) falls back into the COW path, by calling cow_file_range(); 5) At cow_file_range(), in the first iteration of the while loop we call btrfs_reserve_extent() to allocate a 64KiB extent and pass it a minimum allocation size of 4KiB (fs_info->sectorsize). Due to free space fragmentation, btrfs_reserve_extent() ends up allocating two extents of 32KiB each, each one on a different iteration of that while loop; 6) Writeback of the data relocation inode completes; 7) Relocation proceeds and ends up at relocation.c:replace_file_extents(), with a leaf which has a file extent item that points to the data extent from block group X, that has a logical address (bytenr) of X + 128KiB and a size of 64KiB. Then it calls get_new_location(), which does a lookup in the data relocation tree for a file extent item starting at offset 128KiB (X + 128KiB - X) and belonging to the data relocation inode. It finds a corresponding file extent item, however that item points to an extent that has a size of 32KiB, which doesn't match the expected size of 64KiB, resuling in -EINVAL being returned from this function and propagated up to __btrfs_cow_block(), which aborts the current transaction. To fix this make sure that at cow_file_range() when we call the allocator we pass it a minimum allocation size corresponding the desired extent size if the inode belongs to the data relocation tree, otherwise pass it the filesystem's sector size as the minimum allocation size. CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-07-09btrfs: cow_file_range() num_bytes and disk_num_bytes are sameAnand Jain1-10/+6
[ Upstream commit 3752d22fcea160cc2493e34f5e0e41cdd7fdd921 ] This patch deletes local variable disk_num_bytes as its value is same as num_bytes in the function cow_file_range(). Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-06-30btrfs: fix failure of RWF_NOWAIT write into prealloc extent beyond eofFilipe Manana1-3/+0
commit 4b1946284dd6641afdb9457101056d9e6ee6204c upstream. If we attempt to write to prealloc extent located after eof using a RWF_NOWAIT write, we always fail with -EAGAIN. We do actually check if we have an allocated extent for the write at the start of btrfs_file_write_iter() through a call to check_can_nocow(), but later when we go into the actual direct IO write path we simply return -EAGAIN if the write starts at or beyond EOF. Trivial to reproduce: $ mkfs.btrfs -f /dev/sdb $ mount /dev/sdb /mnt $ touch /mnt/foo $ chattr +C /mnt/foo $ xfs_io -d -c "pwrite -S 0xab 0 64K" /mnt/foo wrote 65536/65536 bytes at offset 0 64 KiB, 16 ops; 0.0004 sec (135.575 MiB/sec and 34707.1584 ops/sec) $ xfs_io -c "falloc -k 64K 1M" /mnt/foo $ xfs_io -d -c "pwrite -N -V 1 -S 0xfe -b 64K 64K 64K" /mnt/foo pwrite: Resource temporarily unavailable On xfs and ext4 the write succeeds, as expected. Fix this by removing the wrong check at btrfs_direct_IO(). Fixes: edf064e7c6fec3 ("btrfs: nowait aio support") CC: stable@vger.kernel.org # 4.14+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-06-20btrfs: fix wrong file range cleanup after an error filling dealloc rangeFilipe Manana1-2/+2
[ Upstream commit e2c8e92d1140754073ad3799eb6620c76bab2078 ] If an error happens while running dellaloc in COW mode for a range, we can end up calling extent_clear_unlock_delalloc() for a range that goes beyond our range's end offset by 1 byte, which affects 1 extra page. This results in clearing bits and doing page operations (such as a page unlock) outside our target range. Fix that by calling extent_clear_unlock_delalloc() with an inclusive end offset, instead of an exclusive end offset, at cow_file_range(). Fixes: a315e68f6e8b30 ("Btrfs: fix invalid attempt to free reserved space on failure to cow range") CC: stable@vger.kernel.org # 4.14+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-06-20btrfs: fix error handling when submitting direct I/O bioOmar Sandoval1-3/+3
[ Upstream commit 6d3113a193e3385c72240096fe397618ecab6e43 ] In btrfs_submit_direct_hook(), if a direct I/O write doesn't span a RAID stripe or chunk, we submit orig_bio without cloning it. In this case, we don't increment pending_bios. Then, if btrfs_submit_dio_bio() fails, we decrement pending_bios to -1, and we never complete orig_bio. Fix it by initializing pending_bios to 1 instead of incrementing later. Fixing this exposes another bug: we put orig_bio prematurely and then put it again from end_io. Fix it by not putting orig_bio. After this change, pending_bios is really more of a reference count, but I'll leave that cleanup separate to keep the fix small. Fixes: e65e15355429 ("btrfs: fix panic caused by direct IO") CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-02-28btrfs: fix bytes_may_use underflow in prealloc error condtitionJosef Bacik1-3/+13
commit b778cf962d71a0e737923d55d0432f3bd287258e upstream. I hit the following warning while running my error injection stress testing: WARNING: CPU: 3 PID: 1453 at fs/btrfs/space-info.h:108 btrfs_free_reserved_data_space_noquota+0xfd/0x160 [btrfs] RIP: 0010:btrfs_free_reserved_data_space_noquota+0xfd/0x160 [btrfs] Call Trace: btrfs_free_reserved_data_space+0x4f/0x70 [btrfs] __btrfs_prealloc_file_range+0x378/0x470 [btrfs] elfcorehdr_read+0x40/0x40 ? elfcorehdr_read+0x40/0x40 ? btrfs_commit_transaction+0xca/0xa50 [btrfs] ? dput+0xb4/0x2a0 ? btrfs_log_dentry_safe+0x55/0x70 [btrfs] ? btrfs_sync_file+0x30e/0x420 [btrfs] ? do_fsync+0x38/0x70 ? __x64_sys_fdatasync+0x13/0x20 ? do_syscall_64+0x5b/0x1b0 ? entry_SYSCALL_64_after_hwframe+0x44/0xa9 This happens if we fail to insert our reserved file extent. At this point we've already converted our reservation from ->bytes_may_use to ->bytes_reserved. However once we break we will attempt to free everything from [cur_offset, end] from ->bytes_may_use, but our extent reservation will overlap part of this. Fix this problem by adding ins.offset (our extent allocation size) to cur_offset so we remove the actual remaining part from ->bytes_may_use. I validated this fix using my inject-error.py script python inject-error.py -o should_fail_bio -t cache_save_setup -t \ __btrfs_prealloc_file_range \ -t insert_reserved_file_extent.constprop.0 \ -r "-5" ./run-fsstress.sh where run-fsstress.sh simply mounts and runs fsstress on a disk. CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-02-15btrfs: Get rid of the confusing btrfs_file_extent_inline_lenQu Wenruo1-6/+6
[ Upstream commit e41ca5897489b1c18af75ff0cc8f5c80260b3281 ] We used to call btrfs_file_extent_inline_len() to get the uncompressed data size of an inlined extent. However this function is hiding evil, for compressed extent, it has no choice but to directly read out ram_bytes from btrfs_file_extent_item. While for uncompressed extent, it uses item size to calculate the real data size, and ignoring ram_bytes completely. In fact, for corrupted ram_bytes, due to above behavior kernel btrfs_print_leaf() can't even print correct ram_bytes to expose the bug. Since we have the tree-checker to verify all EXTENT_DATA, such mismatch can be detected pretty easily, thus we can trust ram_bytes without the evil btrfs_file_extent_inline_len(). Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-12-31btrfs: do not call synchronize_srcu() in inode_tree_delJosef Bacik1-2/+0
commit f72ff01df9cf5db25c76674cac16605992d15467 upstream. Testing with the new fsstress uncovered a pretty nasty deadlock with lookup and snapshot deletion. Process A unlink -> final iput -> inode_tree_del -> synchronize_srcu(subvol_srcu) Process B btrfs_lookup <- srcu_read_lock() acquired here -> btrfs_iget -> find inode that has I_FREEING set -> __wait_on_freeing_inode() We're holding the srcu_read_lock() while doing the iget in order to make sure our fs root doesn't go away, and then we are waiting for the inode to finish freeing. However because the free'ing process is doing a synchronize_srcu() we deadlock. Fix this by dropping the synchronize_srcu() in inode_tree_del(). We don't need people to stop accessing the fs root at this point, we're only adding our empty root to the dead roots list. A larger much more invasive fix is forthcoming to address how we deal with fs roots, but this fixes the immediate problem. Fixes: 76dda93c6ae2 ("Btrfs: add snapshot/subvolume destroy ioctl") CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-12-31btrfs: don't double lock the subvol_sem for rename exchangeJosef Bacik1-6/+4
commit 943eb3bf25f4a7b745dd799e031be276aa104d82 upstream. If we're rename exchanging two subvols we'll try to lock this lock twice, which is bad. Just lock once if either of the ino's are subvols. Fixes: cdd1fedf8261 ("btrfs: add support for RENAME_EXCHANGE and RENAME_WHITEOUT") CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-12-17btrfs: record all roots for rename exchange on a subvolJosef Bacik1-0/+3
commit 3e1740993e43116b3bc71b0aad1e6872f6ccf341 upstream. Testing with the new fsstress support for subvolumes uncovered a pretty bad problem with rename exchange on subvolumes. We're modifying two different subvolumes, but we only start the transaction on one of them, so the other one is not added to the dirty root list. This is caught by btrfs_cow_block() with a warning because the root has not been updated, however if we do not modify this root again we'll end up pointing at an invalid root because the root item is never updated. Fix this by making sure we add the destination root to the trans list, the same as we do with normal renames. This fixes the corruption. Fixes: cdd1fedf8261 ("btrfs: add support for RENAME_EXCHANGE and RENAME_WHITEOUT") CC: stable@vger.kernel.org # 4.9+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-07-31btrfs: inode: Don't compress if NODATASUM or NODATACOW setQu Wenruo1-1/+23
commit 42c16da6d684391db83788eb680accd84f6c2083 upstream. As btrfs(5) specified: Note If nodatacow or nodatasum are enabled, compression is disabled. If NODATASUM or NODATACOW set, we should not compress the extent. Normally NODATACOW is detected properly in run_delalloc_range() so compression won't happen for NODATACOW. However for NODATASUM we don't have any check, and it can cause compressed extent without csum pretty easily, just by: mkfs.btrfs -f $dev mount $dev $mnt -o nodatasum touch $mnt/foobar mount -o remount,datasum,compress $mnt xfs_io -f -c "pwrite 0 128K" $mnt/foobar And in fact, we have a bug report about corrupted compressed extent without proper data checksum so even RAID1 can't recover the corruption. (https://bugzilla.kernel.org/show_bug.cgi?id=199707) Running compression without proper checksum could cause more damage when corruption happens, as compressed data could make the whole extent unreadable, so there is no need to allow compression for NODATACSUM. The fix will refactor the inode compression check into two parts: - inode_can_compress() As the hard requirement, checked at btrfs_run_delalloc_range(), so no compression will happen for NODATASUM inode at all. - inode_need_compress() As the soft requirement, checked at btrfs_run_delalloc_range() and compress_file_range(). Reported-by: James Harvey <jamespharvey20@gmail.com> CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-09Btrfs: fix wrong ctime and mtime of a directory after log replayFilipe Manana1-2/+12
commit 5338e43abbab13791144d37fd8846847062351c6 upstream. When replaying a log that contains a new file or directory name that needs to be added to its parent directory, we end up updating the mtime and the ctime of the parent directory to the current time after we have set their values to the correct ones (set at fsync time), efectivelly losing them. Sample reproducer: $ mkfs.btrfs -f /dev/sdb $ mount /dev/sdb /mnt $ mkdir /mnt/dir $ touch /mnt/dir/file # fsync of the directory is optional, not needed $ xfs_io -c fsync /mnt/dir $ xfs_io -c fsync /mnt/dir/file $ stat -c %Y /mnt/dir 1557856079 <power failure> $ sleep 3 $ mount /dev/sdb /mnt $ stat -c %Y /mnt/dir 1557856082 --> should have been 1557856079, the mtime is updated to the current time when replaying the log Fix this by not updating the mtime and ctime to the current time at btrfs_add_link() when we are replaying a log tree. This could be triggered by my recent fsync fuzz tester for fstests, for which an fstests patch exists titled "fstests: generic, fsync fuzz tester with fsstress". Fixes: e02119d5a7b43 ("Btrfs: Add a write ahead tree log to optimize synchronous operations") CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-01-26btrfs: improve error handling of btrfs_add_linkJohannes Thumshirn1-1/+6
[ Upstream commit 1690dd41e0cb1dade80850ed8a3eb0121b96d22f ] In the error handling block, err holds the return value of either btrfs_del_root_ref() or btrfs_del_inode_ref() but it hasn't been checked since it's introduction with commit fe66a05a0679 (Btrfs: improve error handling for btrfs_insert_dir_item callers) in 2012. If the error handling in the error handling fails, there's not much left to do and the abort either happened earlier in the callees or is necessary here. So if one of btrfs_del_root_ref() or btrfs_del_inode_ref() failed, abort the transaction, but still return the original code of the failure stored in 'ret' as this will be reported to the user. Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-01-23Revert "btrfs: balance dirty metadata pages in btrfs_finish_ordered_io"David Sterba1-3/+0
commit 77b7aad195099e7c6da11e94b7fa6ef5e6fb0025 upstream. This reverts commit e73e81b6d0114d4a303205a952ab2e87c44bd279. This patch causes a few problems: - adds latency to btrfs_finish_ordered_io - as btrfs_finish_ordered_io is used for free space cache, generating more work from btrfs_btree_balance_dirty_nodelay could end up in the same workque, effectively deadlocking 12260 kworker/u96:16+btrfs-freespace-write D [<0>] balance_dirty_pages+0x6e6/0x7ad [<0>] balance_dirty_pages_ratelimited+0x6bb/0xa90 [<0>] btrfs_finish_ordered_io+0x3da/0x770 [<0>] normal_work_helper+0x1c5/0x5a0 [<0>] process_one_work+0x1ee/0x5a0 [<0>] worker_thread+0x46/0x3d0 [<0>] kthread+0xf5/0x130 [<0>] ret_from_fork+0x24/0x30 [<0>] 0xffffffffffffffff Transaction commit will wait on the freespace cache: 838 btrfs-transacti D [<0>] btrfs_start_ordered_extent+0x154/0x1e0 [<0>] btrfs_wait_ordered_range+0xbd/0x110 [<0>] __btrfs_wait_cache_io+0x49/0x1a0 [<0>] btrfs_write_dirty_block_groups+0x10b/0x3b0 [<0>] commit_cowonly_roots+0x215/0x2b0 [<0>] btrfs_commit_transaction+0x37e/0x910 [<0>] transaction_kthread+0x14d/0x180 [<0>] kthread+0xf5/0x130 [<0>] ret_from_fork+0x24/0x30 [<0>] 0xffffffffffffffff And then writepages ends up waiting on transaction commit: 9520 kworker/u96:13+flush-btrfs-1 D [<0>] wait_current_trans+0xac/0xe0 [<0>] start_transaction+0x21b/0x4b0 [<0>] cow_file_range_inline+0x10b/0x6b0 [<0>] cow_file_range.isra.69+0x329/0x4a0 [<0>] run_delalloc_range+0x105/0x3c0 [<0>] writepage_delalloc+0x119/0x180 [<0>] __extent_writepage+0x10c/0x390 [<0>] extent_write_cache_pages+0x26f/0x3d0 [<0>] extent_writepages+0x4f/0x80 [<0>] do_writepages+0x17/0x60 [<0>] __writeback_single_inode+0x59/0x690 [<0>] writeback_sb_inodes+0x291/0x4e0 [<0>] __writeback_inodes_wb+0x87/0xb0 [<0>] wb_writeback+0x3bb/0x500 [<0>] wb_workfn+0x40d/0x610 [<0>] process_one_work+0x1ee/0x5a0 [<0>] worker_thread+0x1e0/0x3d0 [<0>] kthread+0xf5/0x130 [<0>] ret_from_fork+0x24/0x30 [<0>] 0xffffffffffffffff Eventually, we have every process in the system waiting on balance_dirty_pages(), and nobody is able to make progress on page writeback. The original patch tried to fix an OOM condition, that happened on 4.4 but no success reproducing that on later kernels (4.19 and 4.20). This is more likely a problem in OOM itself. Link: https://lore.kernel.org/linux-btrfs/20180528054821.9092-1-ethanlien@synology.com/ Reported-by: Chris Mason <clm@fb.com> CC: stable@vger.kernel.org # 4.18+ CC: ethanlien <ethanlien@synology.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-01-09Btrfs: fix fsync of files with multiple hard links in new directoriesFilipe Manana1-0/+17
commit 41bd60676923822de1df2c50b3f9a10171f4338a upstream. The log tree has a long standing problem that when a file is fsync'ed we only check for new ancestors, created in the current transaction, by following only the hard link for which the fsync was issued. We follow the ancestors using the VFS' dget_parent() API. This means that if we create a new link for a file in a directory that is new (or in an any other new ancestor directory) and then fsync the file using an old hard link, we end up not logging the new ancestor, and on log replay that new hard link and ancestor do not exist. In some cases, involving renames, the file will not exist at all. Example: mkfs.btrfs -f /dev/sdb mount /dev/sdb /mnt mkdir /mnt/A touch /mnt/foo ln /mnt/foo /mnt/A/bar xfs_io -c fsync /mnt/foo <power failure> In this example after log replay only the hard link named 'foo' exists and directory A does not exist, which is unexpected. In other major linux filesystems, such as ext4, xfs and f2fs for example, both hard links exist and so does directory A after mounting again the filesystem. Checking if any new ancestors are new and need to be logged was added in 2009 by commit 12fcfd22fe5b ("Btrfs: tree logging unlink/rename fixes"), however only for the ancestors of the hard link (dentry) for which the fsync was issued, instead of checking for all ancestors for all of the inode's hard links. So fix this by tracking the id of the last transaction where a hard link was created for an inode and then on fsync fallback to a full transaction commit when an inode has more than one hard link and at least one new hard link was created in the current transaction. This is the simplest solution since this is not a common use case (adding frequently hard links for which there's an ancestor created in the current transaction and then fsync the file). In case it ever becomes a common use case, a solution that consists of iterating the fs/subvol btree for each hard link and check if any ancestor is new, could be implemented. This solves many unexpected scenarios reported by Jayashree Mohan and Vijay Chidambaram, and for which there is a new test case for fstests under review. Fixes: 12fcfd22fe5b ("Btrfs: tree logging unlink/rename fixes") CC: stable@vger.kernel.org # 4.4+ Reported-by: Vijay Chidambaram <vvijay03@gmail.com> Reported-by: Jayashree Mohan <jayashree2912@gmail.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-11-21Btrfs: fix cur_offset in the error case for nocowRobbie Ko1-3/+2
commit 506481b20e818db40b6198815904ecd2d6daee64 upstream. When the cow_file_range fails, the related resources are unlocked according to the range [start..end), so the unlock cannot be repeated in run_delalloc_nocow. In some cases (e.g. cur_offset <= end && cow_start != -1), cur_offset is not updated correctly, so move the cur_offset update before cow_file_range. kernel BUG at mm/page-writeback.c:2663! Internal error: Oops - BUG: 0 [#1] SMP CPU: 3 PID: 31525 Comm: kworker/u8:7 Tainted: P O Hardware name: Realtek_RTD1296 (DT) Workqueue: writeback wb_workfn (flush-btrfs-1) task: ffffffc076db3380 ti: ffffffc02e9ac000 task.ti: ffffffc02e9ac000 PC is at clear_page_dirty_for_io+0x1bc/0x1e8 LR is at clear_page_dirty_for_io+0x14/0x1e8 pc : [<ffffffc00033c91c>] lr : [<ffffffc00033c774>] pstate: 40000145 sp : ffffffc02e9af4f0 Process kworker/u8:7 (pid: 31525, stack limit = 0xffffffc02e9ac020) Call trace: [<ffffffc00033c91c>] clear_page_dirty_for_io+0x1bc/0x1e8 [<ffffffbffc514674>] extent_clear_unlock_delalloc+0x1e4/0x210 [btrfs] [<ffffffbffc4fb168>] run_delalloc_nocow+0x3b8/0x948 [btrfs] [<ffffffbffc4fb948>] run_delalloc_range+0x250/0x3a8 [btrfs] [<ffffffbffc514c0c>] writepage_delalloc.isra.21+0xbc/0x1d8 [btrfs] [<ffffffbffc516048>] __extent_writepage+0xe8/0x248 [btrfs] [<ffffffbffc51630c>] extent_write_cache_pages.isra.17+0x164/0x378 [btrfs] [<ffffffbffc5185a8>] extent_writepages+0x48/0x68 [btrfs] [<ffffffbffc4f5828>] btrfs_writepages+0x20/0x30 [btrfs] [<ffffffc00033d758>] do_writepages+0x30/0x88 [<ffffffc0003ba0f4>] __writeback_single_inode+0x34/0x198 [<ffffffc0003ba6c4>] writeback_sb_inodes+0x184/0x3c0 [<ffffffc0003ba96c>] __writeback_inodes_wb+0x6c/0xc0 [<ffffffc0003bac20>] wb_writeback+0x1b8/0x1c0 [<ffffffc0003bb0f0>] wb_workfn+0x150/0x250 [<ffffffc0002b0014>] process_one_work+0x1dc/0x388 [<ffffffc0002b02f0>] worker_thread+0x130/0x500 [<ffffffc0002b6344>] kthread+0x10c/0x110 [<ffffffc000284590>] ret_from_fork+0x10/0x40 Code: d503201f a9025bb5 a90363b7 f90023b9 (d4210000) CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Robbie Ko <robbieko@synology.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-11-13Btrfs: fix use-after-free during inode evictionFilipe Manana1-1/+3
commit 421f0922a2cfb0c75acd9746454aaa576c711a65 upstream. At inode.c:evict_inode_truncate_pages(), when we iterate over the inode's extent states, we access an extent state record's "state" field after we unlocked the inode's io tree lock. This can lead to a use-after-free issue because after we unlock the io tree that extent state record might have been freed due to being merged into another adjacent extent state record (a previous inflight bio for a read operation finished in the meanwhile which unlocked a range in the io tree and cause a merge of extent state records, as explained in the comment before the while loop added in commit 6ca0709756710 ("Btrfs: fix hang during inode eviction due to concurrent readahead")). Fix this by keeping a copy of the extent state's flags in a local variable and using it after unlocking the io tree. Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=201189 Fixes: b9d0b38928e2 ("btrfs: Add handler for invalidate page") CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-11-13btrfs: only free reserved extent if we didn't insert itJosef Bacik1-1/+9
commit 49940bdd57779c78462da7aa5a8650b2fea8c2ff upstream. When we insert the file extent once the ordered extent completes we free the reserved extent reservation as it'll have been migrated to the bytes_used counter. However if we error out after this step we'll still clear the reserved extent reservation, resulting in a negative accounting of the reserved bytes for the block group and space info. Fix this by only doing the free if we didn't successfully insert a file extent for this extent. CC: stable@vger.kernel.org # 4.14+ Reviewed-by: Omar Sandoval <osandov@fb.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-11-13Btrfs: fix null pointer dereference on compressed write path errorFilipe Manana1-0/+1
commit 3527a018c00e5dbada2f9d7ed5576437b6dd5cfb upstream. At inode.c:compress_file_range(), under the "free_pages_out" label, we can end up dereferencing the "pages" pointer when it has a NULL value. This case happens when "start" has a value of 0 and we fail to allocate memory for the "pages" pointer. When that happens we jump to the "cont" label and then enter the "if (start == 0)" branch where we immediately call the cow_file_range_inline() function. If that function returns 0 (success creating an inline extent) or an error (like -ENOMEM for example) we jump to the "free_pages_out" label and then access "pages[i]" leading to a NULL pointer dereference, since "nr_pages" has a value greater than zero at that point. Fix this by setting "nr_pages" to 0 when we fail to allocate memory for the "pages" pointer. Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=201119 Fixes: 771ed689d2cd ("Btrfs: Optimize compressed writeback and reads") CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-09-05Btrfs: fix btrfs_write_inode vs delayed iput deadlockJosef Bacik1-26/+0
commit 3c4276936f6fbe52884b4ea4e6cc120b890a0f9f upstream. We recently ran into the following deadlock involving btrfs_write_inode(): [ +0.005066] __schedule+0x38e/0x8c0 [ +0.007144] schedule+0x36/0x80 [ +0.006447] bit_wait+0x11/0x60 [ +0.006446] __wait_on_bit+0xbe/0x110 [ +0.007487] ? bit_wait_io+0x60/0x60 [ +0.007319] __inode_wait_for_writeback+0x96/0xc0 [ +0.009568] ? autoremove_wake_function+0x40/0x40 [ +0.009565] inode_wait_for_writeback+0x21/0x30 [ +0.009224] evict+0xb0/0x190 [ +0.006099] iput+0x1a8/0x210 [ +0.006103] btrfs_run_delayed_iputs+0x73/0xc0 [ +0.009047] btrfs_commit_transaction+0x799/0x8c0 [ +0.009567] btrfs_write_inode+0x81/0xb0 [ +0.008008] __writeback_single_inode+0x267/0x320 [ +0.009569] writeback_sb_inodes+0x25b/0x4e0 [ +0.008702] wb_writeback+0x102/0x2d0 [ +0.007487] wb_workfn+0xa4/0x310 [ +0.006794] ? wb_workfn+0xa4/0x310 [ +0.007143] process_one_work+0x150/0x410 [ +0.008179] worker_thread+0x6d/0x520 [ +0.007490] kthread+0x12c/0x160 [ +0.006620] ? put_pwq_unlocked+0x80/0x80 [ +0.008185] ? kthread_park+0xa0/0xa0 [ +0.007484] ? do_syscall_64+0x53/0x150 [ +0.007837] ret_from_fork+0x29/0x40 Writeback calls: btrfs_write_inode btrfs_commit_transaction btrfs_run_delayed_iputs If iput() is called on that same inode, evict() will wait for writeback forever. btrfs_write_inode() was originally added way back in 4730a4bc5bf3 ("btrfs_dirty_inode") to support O_SYNC writes. However, ->write_inode() hasn't been used for O_SYNC since 148f948ba877 ("vfs: Introduce new helpers for syncing after writing to O_SYNC file or IS_SYNC inode"), so btrfs_write_inode() is actually unnecessary (and leads to a bunch of unnecessary commits). Get rid of it, which also gets rid of the deadlock. CC: stable@vger.kernel.org # 3.2+ Signed-off-by: Josef Bacik <jbacik@fb.com> [Omar: new commit message] Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-08-03Btrfs: don't BUG_ON() in btrfs_truncate_inode_items()Omar Sandoval1-1/+4
[ Upstream commit 0552210997badb6a60740a26ff9d976a416510f0 ] btrfs_free_extent() can fail because of ENOMEM. There's no reason to panic here, we can just abort the transaction. Fixes: f4b9aa8d3b87 ("btrfs_truncate") Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-08-03Btrfs: don't return ino to ino cache if inode item removal failsOmar Sandoval1-12/+13
[ Upstream commit c08db7d8d295a4f3a10faaca376de011afff7950 ] In btrfs_evict_inode(), if btrfs_truncate_inode_items() fails, the inode item will still be in the tree but we still return the ino to the ino cache. That will blow up later when someone tries to allocate that ino, so don't return it to the cache. Fixes: 581bb050941b ("Btrfs: Cache free inode numbers in memory") Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-08-03btrfs: balance dirty metadata pages in btrfs_finish_ordered_ioEthan Lien1-0/+3
[ Upstream commit e73e81b6d0114d4a303205a952ab2e87c44bd279 ] [Problem description and how we fix it] We should balance dirty metadata pages at the end of btrfs_finish_ordered_io, since a small, unmergeable random write can potentially produce dirty metadata which is multiple times larger than the data itself. For example, a small, unmergeable 4KiB write may produce: 16KiB dirty leaf (and possibly 16KiB dirty node) in subvolume tree 16KiB dirty leaf (and possibly 16KiB dirty node) in checksum tree 16KiB dirty leaf (and possibly 16KiB dirty node) in extent tree Although we do call balance dirty pages in write side, but in the buffered write path, most metadata are dirtied only after we reach the dirty background limit (which by far only counts dirty data pages) and wakeup the flusher thread. If there are many small, unmergeable random writes spread in a large btree, we'll find a burst of dirty pages exceeds the dirty_bytes limit after we wakeup the flusher thread - which is not what we expect. In our machine, it caused out-of-memory problem since a page cannot be dropped if it is marked dirty. Someone may worry about we may sleep in btrfs_btree_balance_dirty_nodelay, but since we do btrfs_finish_ordered_io in a separate worker, it will not stop the flusher consuming dirty pages. Also, we use different worker for metadata writeback endio, sleep in btrfs_finish_ordered_io help us throttle the size of dirty metadata pages. [Reproduce steps] To reproduce the problem, we need to do 4KiB write randomly spread in a large btree. In our 2GiB RAM machine: 1) Create 4 subvolumes. 2) Run fio on each subvolume: [global] direct=0 rw=randwrite ioengine=libaio bs=4k iodepth=16 numjobs=1 group_reporting size=128G runtime=1800 norandommap time_based randrepeat=0 3) Take snapshot on each subvolume and repeat fio on existing files. 4) Repeat step (3) until we get large btrees. In our case, by observing btrfs_root_item->bytes_used, we have 2GiB of metadata in each subvolume tree and 12GiB of metadata in extent tree. 5) Stop all fio, take snapshot again, and wait until all delayed work is completed. 6) Start all fio. Few seconds later we hit OOM when the flusher starts to work. It can be reproduced even when using nocow write. Signed-off-by: Ethan Lien <ethanlien@synology.com> Reviewed-by: David Sterba <dsterba@suse.com> [ add comment ] Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-07-03Btrfs: fix return value on rename exchange failureFilipe Manana1-1/+3
commit c5b4a50b74018b3677098151ec5f4fce07d5e6a0 upstream. If we failed during a rename exchange operation after starting/joining a transaction, we would end up replacing the return value, stored in the local 'ret' variable, with the return value from btrfs_end_transaction(). So this could end up returning 0 (success) to user space despite the operation having failed and aborted the transaction, because if there are multiple tasks having a reference on the transaction at the time btrfs_end_transaction() is called by the rename exchange, that function returns 0 (otherwise it returns -EIO and not the original error value). So fix this by not overwriting the return value on error after getting a transaction handle. Fixes: cdd1fedf8261 ("btrfs: add support for RENAME_EXCHANGE and RENAME_WHITEOUT") CC: stable@vger.kernel.org # 4.9+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-06-26btrfs: return error value if create_io_em failed in cow_file_rangeSu Yue1-1/+3
commit 090a127afa8f73e9618d4058d6755f7ec7453dd6 upstream. In cow_file_range(), create_io_em() may fail, but its return value is not recorded. Then return value may be 0 even it failed which is a wrong behavior. Let cow_file_range() return PTR_ERR(em) if create_io_em() failed. Fixes: 6f9994dbabe5 ("Btrfs: create a helper to create em for IO") CC: stable@vger.kernel.org # 4.11+ Signed-off-by: Su Yue <suy.fnst@cn.fujitsu.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-05-30do d_instantiate/unlock_new_inode combinations safelyAl Viro1-12/+4
commit 1e2e547a93a00ebc21582c06ca3c6cfea2a309ee upstream. For anything NFS-exported we do _not_ want to unlock new inode before it has grown an alias; original set of fixes got the ordering right, but missed the nasty complication in case of lockdep being enabled - unlock_new_inode() does lockdep_annotate_inode_mutex_key(inode) which can only be done before anyone gets a chance to touch ->i_mutex. Unfortunately, flipping the order and doing unlock_new_inode() before d_instantiate() opens a window when mkdir can race with open-by-fhandle on a guessed fhandle, leading to multiple aliases for a directory inode and all the breakage that follows from that. Correct solution: a new primitive (d_instantiate_new()) combining these two in the right order - lockdep annotate, then d_instantiate(), then the rest of unlock_new_inode(). All combinations of d_instantiate() with unlock_new_inode() should be converted to that. Cc: stable@kernel.org # 2.6.29 and later Tested-by: Mike Marshall <hubcap@omnibond.com> Reviewed-by: Andreas Dilger <adilger@dilger.ca> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-05-22btrfs: Split btrfs_del_delalloc_inode into 2 functionsNikolay Borisov1-3/+10
commit 2b8773313494ede83a26fb372466e634564002ed upstream. This is in preparation of fixing delalloc inodes leakage on transaction abort. Also export the new function. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-04-26Btrfs: fix unexpected EEXIST from btrfs_get_extentLiu Bo1-14/+3
[ Upstream commit 18e83ac75bfe67009c4ddcdd581bba8eb16f4030 ] This fixes a corner case that is caused by a race of dio write vs dio read/write. Here is how the race could happen. Suppose that no extent map has been loaded into memory yet. There is a file extent [0, 32K), two jobs are running concurrently against it, t1 is doing dio write to [8K, 32K) and t2 is doing dio read from [0, 4K) or [4K, 8K). t1 goes ahead of t2 and splits em [0, 32K) to em [0K, 8K) and [8K 32K). ------------------------------------------------------ t1 t2 btrfs_get_blocks_direct() btrfs_get_blocks_direct() -> btrfs_get_extent() -> btrfs_get_extent() -> lookup_extent_mapping() -> add_extent_mapping() -> lookup_extent_mapping() # load [0, 32K) -> btrfs_new_extent_direct() -> btrfs_drop_extent_cache() # split [0, 32K) and # drop [8K, 32K) -> add_extent_mapping() # add [8K, 32K) -> add_extent_mapping() # handle -EEXIST when adding # [0, 32K) ------------------------------------------------------ About how t2(dio read/write) runs into -EEXIST: a) add_extent_mapping() gets -EEXIST for adding em [0, 32k), b) search_extent_mapping() then returns [0, 8k) as the existing em, even though start == existing->start, em is [0, 32k) so that extent_map_end(em) > extent_map_end(existing), i.e. 32k > 8k, c) then it goes thru merge_extent_mapping() which tries to add a [8k, 8k) (with a length 0) and returns -EEXIST as [8k, 32k) is already in tree, d) so btrfs_get_extent() ends up returning -EEXIST to dio read/write, which is confusing applications. Here I conclude all the possible situations, 1) start < existing->start +-----------+em+-----------+ +--prev---+ | +-------------+ | | | | | | | +---------+ + +---+existing++ ++ + | + start 2) start == existing->start +------------em------------+ | +-------------+ | | | | | + +----existing-+ + | | + start 3) start > existing->start && start < (existing->start + existing->len) +------------em------------+ | +-------------+ | | | | | + +----existing-+ + | | + start 4) start >= (existing->start + existing->len) +-----------+em+-----------+ | +-------------+ | +--next---+ | | | | | | + +---+existing++ + +---------+ + | + start As we can see, it turns out that if start is within existing em (front inclusive), then the existing em should be returned as is, otherwise, we try our best to merge candidate em with sibling ems to form a larger em (in order to reduce the total number of em). Reported-by: David Vallender <david.vallender@landmark.co.uk> Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-04-26btrfs: fix unaligned access in readdirDavid Sterba1-8/+12
commit 92d32170847bfff2dd08af2c016085779f2fd2a1 upstream. The last update to readdir introduced a temporary buffer to store the emitted readdir data, but as there are file names of variable length, there's a lot of unaligned access. This was observed on a sparc64 machine: Kernel unaligned access at TPC[102f3080] btrfs_real_readdir+0x51c/0x718 [btrfs] Fixes: 23b5ec74943 ("btrfs: fix readdir deadlock with pagefault") CC: stable@vger.kernel.org # 4.14+ Reported-and-tested-by: René Rebe <rene@exactcode.com> Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-04-08Btrfs: fix unexpected cow in run_delalloc_nocowLiu Bo1-5/+32
commit 5811375325420052fcadd944792a416a43072b7f upstream. Fstests generic/475 provides a way to fail metadata reads while checking if checksum exists for the inode inside run_delalloc_nocow(), and csum_exist_in_range() interprets error (-EIO) as inode having checksum and makes its caller enter the cow path. In case of free space inode, this ends up with a warning in cow_file_range(). The same problem applies to btrfs_cross_ref_exist() since it may also read metadata in between. With this, run_delalloc_nocow() bails out when errors occur at the two places. cc: <stable@vger.kernel.org> v2.6.28+ Fixes: 17d217fe970d ("Btrfs: fix nodatasum handling in balancing code") Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-02-25btrfs: Fix quota reservation leak on preallocated filesJustin Maggard1-0/+2
[ Upstream commit b430b7751286b3acff2d324553c8cec4f1e87764 ] Commit c6887cd11149 ("Btrfs: don't do nocow check unless we have to") changed the behavior of __btrfs_buffered_write() so that it first tries to get a data space reservation, and then skips the relatively expensive nocow check if the reservation succeeded. If we have quotas enabled, the data space reservation also includes a quota reservation. But in the rewrite case, the space has already been accounted for in qgroups. So btrfs_check_data_free_space() increases the quota reservation, but it never gets decreased when the data actually gets written and overwrites the pre-existing data. So we're left with both the qgroup and qgroup reservation accounting for the same space. This commit adds the missing btrfs_qgroup_free_data() call in the case of BTRFS_ORDERED_PREALLOC extents. Fixes: c6887cd11149 ("Btrfs: don't do nocow check unless we have to") Signed-off-by: Justin Maggard <jmaggard@netgear.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <alexander.levin@verizon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-02-22Btrfs: fix use-after-free on root->orphan_block_rsvLiu Bo1-13/+21
commit 1a932ef4e47984dee227834667b5ff5a334e4805 upstream. I got these from running generic/475, WARNING: CPU: 0 PID: 26384 at fs/btrfs/inode.c:3326 btrfs_orphan_commit_root+0x1ac/0x2b0 [btrfs] BUG: unable to handle kernel NULL pointer dereference at 0000000000000010 IP: btrfs_block_rsv_release+0x1c/0x70 [btrfs] Call Trace: btrfs_orphan_release_metadata+0x9f/0x200 [btrfs] btrfs_orphan_del+0x10d/0x170 [btrfs] btrfs_setattr+0x500/0x640 [btrfs] notify_change+0x7ae/0x870 do_truncate+0xca/0x130 vfs_truncate+0x2ee/0x3d0 do_sys_truncate+0xaf/0xf0 SyS_truncate+0xe/0x10 entry_SYSCALL_64_fastpath+0x1f/0x96 The race is between btrfs_orphan_commit_root and btrfs_orphan_del, t1 t2 btrfs_orphan_commit_root btrfs_orphan_del spin_lock check (&root->orphan_inodes) root->orphan_block_rsv = NULL; spin_unlock atomic_dec(&root->orphan_inodes); access root->orphan_block_rsv Accessing root->orphan_block_rsv must be done before decreasing root->orphan_inodes. cc: <stable@vger.kernel.org> v3.12+ Fixes: 703c88e03524 ("Btrfs: fix tracking of orphan inode count") Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-02-22Btrfs: fix btrfs_evict_inode to handle abnormal inodes correctlyLiu Bo1-1/+1
commit e8f1bc1493855e32b7a2a019decc3c353d94daf6 upstream. This regression is introduced in commit 3d48d9810de4 ("btrfs: Handle uninitialised inode eviction"). There are two problems, a) it is ->destroy_inode() that does the final free on inode, not ->evict_inode(), b) clear_inode() must be called before ->evict_inode() returns. This could end up hitting BUG_ON(inode->i_state != (I_FREEING | I_CLEAR)); in evict() because I_CLEAR is set in clear_inode(). Fixes: commit 3d48d9810de4 ("btrfs: Handle uninitialised inode eviction") Cc: <stable@vger.kernel.org> # v4.7-rc6+ Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-02-22Btrfs: fix deadlock in run_delalloc_nocowLiu Bo1-1/+4
commit e89166990f11c3f21e1649d760dd35f9e410321c upstream. @cur_offset is not set back to what it should be (@cow_start) if btrfs_next_leaf() returns something wrong, and the range [cow_start, cur_offset) remains locked forever. cc: <stable@vger.kernel.org> Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-02-16btrfs: Handle btrfs_set_extent_delalloc failure in fixup workerNikolay Borisov1-2/+9
commit f3038ee3a3f1017a1cbe9907e31fa12d366c5dcb upstream. This function was introduced by 247e743cbe6e ("Btrfs: Use async helpers to deal with pages that have been improperly dirtied") and it didn't do any error handling then. This function might very well fail in ENOMEM situation, yet it's not handled, this could lead to inconsistent state. So let's handle the failure by setting the mapping error bit. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-02-03Btrfs: bail out gracefully rather than BUG_ONLiu Bo1-2/+8
[ Upstream commit 56a0e706fcf870270878d6d72b71092ae42d229c ] If a file's DIR_ITEM key is invalid (due to memory errors) and gets written to disk, a future lookup_path can end up with kernel panic due to BUG_ON(). This gets rid of the BUG_ON(), meanwhile output the corrupted key and return ENOENT if it's invalid. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reported-by: Guillaume Bouchard <bouchard@mercs-eng.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <alexander.levin@verizon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-09-29Merge branch 'for-4.14-rc3' of ↵Linus Torvalds1-5/+22
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: "We've collected a bunch of isolated fixes, for crashes, user-visible behaviour or missing bits from other subsystem cleanups from the past. The overall number is not small but I was not able to make it significantly smaller. Most of the patches are supposed to go to stable" * 'for-4.14-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: log csums for all modified extents Btrfs: fix unexpected result when dio reading corrupted blocks btrfs: Report error on removing qgroup if del_qgroup_item fails Btrfs: skip checksum when reading compressed data if some IO have failed Btrfs: fix kernel oops while reading compressed data Btrfs: use btrfs_op instead of bio_op in __btrfs_map_block Btrfs: do not backup tree roots when fsync btrfs: remove BTRFS_FS_QUOTA_DISABLING flag btrfs: propagate error to btrfs_cmp_data_prepare caller btrfs: prevent to set invalid default subvolid Btrfs: send: fix error number for unknown inode types btrfs: fix NULL pointer dereference from free_reloc_roots() btrfs: finish ordered extent cleaning if no progress is found btrfs: clear ordered flag on cleaning up ordered extents Btrfs: fix incorrect {node,sector}size endianness from BTRFS_IOC_FS_INFO Btrfs: do not reset bio->bi_ops while writing bio Btrfs: use the new helper wbc_to_write_flags
2017-09-26Btrfs: fix unexpected result when dio reading corrupted blocksLiu Bo1-5/+2
commit 4246a0b63bd8 ("block: add a bi_error field to struct bio") changed the logic of how dio read endio reports errors. For single stripe dio read, %bio->bi_status reflects the error before verifying checksum, and now we're updating it when data block matches with its checksum, while in the mismatching case, %bio->bi_status is not updated to relfect that. When some blocks in a file have been corrupted on disk, reading such a file ends up with 1) checksum errors are reported in kernel log 2) read(2) returns successfully with some content being 0x01. In order to fix it, we need to report its checksum mismatch error to the upper layer (dio layer in this case) as well. Fixes: 4246a0b63bd8 ("block: add a bi_error field to struct bio") Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reported-by: Goffredo Baroncelli <kreijack@inwind.it> Tested-by: Goffredo Baroncelli <kreijack@inwind.it> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-09-26btrfs: finish ordered extent cleaning if no progress is foundNaohiro Aota1-0/+8
__endio_write_update_ordered() repeats the search until it reaches the end of the specified range. This works well with direct IO path, because before the function is called, it's ensured that there are ordered extents filling whole the range. It's not the case, however, when it's called from run_delalloc_range(): it is possible to have error in the midle of the loop in e.g. run_delalloc_nocow(), so that there exisits the range not covered by any ordered extents. By cleaning such "uncomplete" range, __endio_write_update_ordered() stucks at offset where there're no ordered extents. Since the ordered extents are created from head to tail, we can stop the search if there are no offset progress. Fixes: 524272607e88 ("btrfs: Handle delalloc error correctly to avoid ordered extent hang") Cc: <stable@vger.kernel.org> # 4.12 Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: Qu Wenruo <quwenruo.btrfs@gmx.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-09-26btrfs: clear ordered flag on cleaning up ordered extentsNaohiro Aota1-0/+12
Commit 524272607e88 ("btrfs: Handle delalloc error correctly to avoid ordered extent hang") introduced btrfs_cleanup_ordered_extents() to cleanup submitted ordered extents. However, it does not clear the ordered bit (Private2) of corresponding pages. Thus, the following BUG occurs from free_pages_check_bad() (on btrfs/125 with nospace_cache). BUG: Bad page state in process btrfs pfn:3fa787 page:ffffdf2acfe9e1c0 count:0 mapcount:0 mapping: (null) index:0xd flags: 0x8000000000002008(uptodate|private_2) raw: 8000000000002008 0000000000000000 000000000000000d 00000000ffffffff raw: ffffdf2acf5c1b20 ffffb443802238b0 0000000000000000 0000000000000000 page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set bad because of flags: 0x2000(private_2) This patch clears the flag same as other places calling btrfs_dec_test_ordered_pending() for every page in the specified range. Fixes: 524272607e88 ("btrfs: Handle delalloc error correctly to avoid ordered extent hang") Cc: <stable@vger.kernel.org> # 4.12 Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: Qu Wenruo <quwenruo.btrfs@gmx.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-09-15Merge branch 'work.mount' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull mount flag updates from Al Viro: "Another chunk of fmount preparations from dhowells; only trivial conflicts for that part. It separates MS_... bits (very grotty mount(2) ABI) from the struct super_block ->s_flags (kernel-internal, only a small subset of MS_... stuff). This does *not* convert the filesystems to new constants; only the infrastructure is done here. The next step in that series is where the conflicts would be; that's the conversion of filesystems. It's purely mechanical and it's better done after the merge, so if you could run something like list=$(for i in MS_RDONLY MS_NOSUID MS_NODEV MS_NOEXEC MS_SYNCHRONOUS MS_MANDLOCK MS_DIRSYNC MS_NOATIME MS_NODIRATIME MS_SILENT MS_POSIXACL MS_KERNMOUNT MS_I_VERSION MS_LAZYTIME; do git grep -l $i fs drivers/staging/lustre drivers/mtd ipc mm include/linux; done|sort|uniq|grep -v '^fs/namespace.c$') sed -i -e 's/\<MS_RDONLY\>/SB_RDONLY/g' \ -e 's/\<MS_NOSUID\>/SB_NOSUID/g' \ -e 's/\<MS_NODEV\>/SB_NODEV/g' \ -e 's/\<MS_NOEXEC\>/SB_NOEXEC/g' \ -e 's/\<MS_SYNCHRONOUS\>/SB_SYNCHRONOUS/g' \ -e 's/\<MS_MANDLOCK\>/SB_MANDLOCK/g' \ -e 's/\<MS_DIRSYNC\>/SB_DIRSYNC/g' \ -e 's/\<MS_NOATIME\>/SB_NOATIME/g' \ -e 's/\<MS_NODIRATIME\>/SB_NODIRATIME/g' \ -e 's/\<MS_SILENT\>/SB_SILENT/g' \ -e 's/\<MS_POSIXACL\>/SB_POSIXACL/g' \ -e 's/\<MS_KERNMOUNT\>/SB_KERNMOUNT/g' \ -e 's/\<MS_I_VERSION\>/SB_I_VERSION/g' \ -e 's/\<MS_LAZYTIME\>/SB_LAZYTIME/g' \ $list and commit it with something along the lines of 'convert filesystems away from use of MS_... constants' as commit message, it would save a quite a bit of headache next cycle" * 'work.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: VFS: Differentiate mount flags (MS_*) from internal superblock flags VFS: Convert sb->s_flags & MS_RDONLY to sb_rdonly(sb) vfs: Add sb_rdonly(sb) to query the MS_RDONLY flag on s_flags
2017-09-09Merge branch 'for-4.14' of ↵Linus Torvalds1-67/+153
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs updates from David Sterba: "The changes range through all types: cleanups, core chagnes, sanity checks, fixes, other user visible changes, detailed list below: - deprecated: user transaction ioctl - mount option ssd does not change allocation alignments - degraded read-write mount is allowed if all the raid profile constraints are met, now based on more accurate check - defrag: do not reset compression afterwards; the NOCOMPRESS flag can be now overriden by defrag - prep work for better extent reference tracking (related to the qgroup slowness with balance) - prep work for compression heuristics - memory allocation reductions (may help latencies on a loaded system) - better accounting for io waiting states - error handling improvements (removed BUGs) - added more sanity checks for shared refs - fix readdir vs pagefault deadlock under some circumstances - fix for 'no-hole' mode, certain combination of compressed and inline extents - send: fix emission of invalid clone operations - fixup file mode if setting acls fail - more fixes from fuzzing - oher cleanups" * 'for-4.14' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (104 commits) btrfs: submit superblock io with REQ_META and REQ_PRIO btrfs: remove unnecessary memory barrier in btrfs_direct_IO btrfs: remove superfluous chunk_tree argument from btrfs_alloc_dev_extent btrfs: Remove chunk_objectid parameter of btrfs_alloc_dev_extent btrfs: pass fs_info to btrfs_del_root instead of tree_root Btrfs: add one more sanity check for shared ref type Btrfs: remove BUG_ON in __add_tree_block Btrfs: remove BUG() in add_data_reference Btrfs: remove BUG() in print_extent_item Btrfs: remove BUG() in btrfs_extent_inline_ref_size Btrfs: convert to use btrfs_get_extent_inline_ref_type Btrfs: add a helper to retrive extent inline ref type btrfs: scrub: simplify scrub worker initialization btrfs: scrub: clean up division in scrub_find_csum btrfs: scrub: clean up division in __scrub_mark_bitmap btrfs: scrub: use bool for flush_all_writes btrfs: preserve i_mode if __btrfs_set_acl() fails btrfs: Remove extraneous chunk_objectid variable btrfs: Remove chunk_objectid argument from btrfs_make_block_group btrfs: Remove extra parentheses from condition in copy_items() ...
2017-08-24Btrfs: fix blk_status_t/errno confusionOmar Sandoval1-33/+37
This fixes several instances of blk_status_t and bare errno ints being mixed up, some of which are real bugs. In the normal case, 0 matches BLK_STS_OK, so we don't observe any effects of the missing conversion, but in case of errors or passes through the repair/retry paths, the errors get mixed up. The changes were identified using 'sparse', we don't have reports of the buggy behaviour. Fixes: 4e4cbee93d56 ("block: switch bios to blk_status_t") Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>