summaryrefslogtreecommitdiff
path: root/fs/btrfs/ioctl.c
AgeCommit message (Collapse)AuthorFilesLines
2021-05-11btrfs: fix metadata extent leak after failure to create subvolumeFilipe Manana1-3/+15
commit 67addf29004c5be9fa0383c82a364bb59afc7f84 upstream. When creating a subvolume we allocate an extent buffer for its root node after starting a transaction. We setup a root item for the subvolume that points to that extent buffer and then attempt to insert the root item into the root tree - however if that fails, due to ENOMEM for example, we do not free the extent buffer previously allocated and we do not abort the transaction (as at that point we did nothing that can not be undone). This means that we effectively do not return the metadata extent back to the free space cache/tree and we leave a delayed reference for it which causes a metadata extent item to be added to the extent tree, in the next transaction commit, without having backreferences. When this happens 'btrfs check' reports the following: $ btrfs check /dev/sdi Opening filesystem to check... Checking filesystem on /dev/sdi UUID: dce2cb9d-025f-4b05-a4bf-cee0ad3785eb [1/7] checking root items [2/7] checking extents ref mismatch on [30425088 16384] extent item 1, found 0 backref 30425088 root 256 not referenced back 0x564a91c23d70 incorrect global backref count on 30425088 found 1 wanted 0 backpointer mismatch on [30425088 16384] owner ref check failed [30425088 16384] ERROR: errors found in extent allocation tree or chunk allocation [3/7] checking free space cache [4/7] checking fs roots [5/7] checking only csums items (without verifying data) [6/7] checking root refs [7/7] checking quota groups skipped (not enabled on this FS) found 212992 bytes used, error(s) found total csum bytes: 0 total tree bytes: 131072 total fs tree bytes: 32768 total extent tree bytes: 16384 btree space waste bytes: 124669 file data blocks allocated: 65536 referenced 65536 So fix this by freeing the metadata extent if btrfs_insert_root() returns an error. CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-03-09btrfs: validate qgroup inherit for SNAP_CREATE_V2 ioctlDan Carpenter1-1/+18
commit 5011c5a663b9c6d6aff3d394f11049b371199627 upstream. The problem is we're copying "inherit" from user space but we don't necessarily know that we're copying enough data for a 64 byte struct. Then the next problem is that 'inherit' has a variable size array at the end, and we have to verify that array is the size we expected. Fixes: 6f72c7e20dba ("Btrfs: add qgroup inheritance") CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-06btrfs: fix race when defragmenting leads to unnecessary IOFilipe Manana1-0/+39
[ Upstream commit 7f458a3873ae94efe1f37c8b96c97e7298769e98 ] When defragmenting we skip ranges that have holes or inline extents, so that we don't do unnecessary IO and waste space. We do this check when calling should_defrag_range() at btrfs_defrag_file(). However we do it without holding the inode's lock. The reason we do it like this is to avoid blocking other tasks for too long, that possibly want to operate on other file ranges, since after the call to should_defrag_range() and before locking the inode, we trigger a synchronous page cache readahead. However before we were able to lock the inode, some other task might have punched a hole in our range, or we may now have an inline extent there, in which case we should not set the range for defrag anymore since that would cause unnecessary IO and make us waste space (i.e. allocating extents to contain zeros for a hole). So after we locked the inode and the range in the iotree, check again if we have holes or an inline extent, and if we do, just skip the range. I hit this while testing my next patch that fixes races when updating an inode's number of bytes (subject "btrfs: update the number of bytes used by an inode atomically"), and it depends on this change in order to work correctly. Alternatively I could rework that other patch to detect holes and flag their range with the 'new delalloc' bit, but this itself fixes an efficiency problem due a race that from a functional point of view is not harmful (it could be triggered with btrfs/062 from fstests). CC: stable@vger.kernel.org # 5.4+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-11-18btrfs: fix potential overflow in cluster_pages_for_defrag on 32bit archMatthew Wilcox (Oracle)1-6/+4
commit a1fbc6750e212c5675a4e48d7f51d44607eb8756 upstream. On 32-bit systems, this shift will overflow for files larger than 4GB as start_index is unsigned long while the calls to btrfs_delalloc_*_space expect u64. CC: stable@vger.kernel.org # 4.4+ Fixes: df480633b891 ("btrfs: extent-tree: Switch to new delalloc space reserve and release") Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: David Sterba <dsterba@suse.com> [ define the variable instead of repeating the shift ] Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-11-18btrfs: reschedule when cloning lots of extentsJohannes Thumshirn1-0/+2
[ Upstream commit 6b613cc97f0ace77f92f7bc112b8f6ad3f52baf8 ] We have several occurrences of a soft lockup from fstest's generic/175 testcase, which look more or less like this one: watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [xfs_io:10030] Kernel panic - not syncing: softlockup: hung tasks CPU: 0 PID: 10030 Comm: xfs_io Tainted: G L 5.9.0-rc5+ #768 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4-rebuilt.opensuse.org 04/01/2014 Call Trace: <IRQ> dump_stack+0x77/0xa0 panic+0xfa/0x2cb watchdog_timer_fn.cold+0x85/0xa5 ? lockup_detector_update_enable+0x50/0x50 __hrtimer_run_queues+0x99/0x4c0 ? recalibrate_cpu_khz+0x10/0x10 hrtimer_run_queues+0x9f/0xb0 update_process_times+0x28/0x80 tick_handle_periodic+0x1b/0x60 __sysvec_apic_timer_interrupt+0x76/0x210 asm_call_on_stack+0x12/0x20 </IRQ> sysvec_apic_timer_interrupt+0x7f/0x90 asm_sysvec_apic_timer_interrupt+0x12/0x20 RIP: 0010:btrfs_tree_unlock+0x91/0x1a0 [btrfs] RSP: 0018:ffffc90007123a58 EFLAGS: 00000282 RAX: ffff8881cea2fbe0 RBX: ffff8881cea2fbe0 RCX: 0000000000000000 RDX: ffff8881d23fd200 RSI: ffffffff82045220 RDI: ffff8881cea2fba0 RBP: 0000000000000001 R08: 0000000000000000 R09: 0000000000000032 R10: 0000160000000000 R11: 0000000000001000 R12: 0000000000001000 R13: ffff8882357fd5b0 R14: ffff88816fa76e70 R15: ffff8881cea2fad0 ? btrfs_tree_unlock+0x15b/0x1a0 [btrfs] btrfs_release_path+0x67/0x80 [btrfs] btrfs_insert_replace_extent+0x177/0x2c0 [btrfs] btrfs_replace_file_extents+0x472/0x7c0 [btrfs] btrfs_clone+0x9ba/0xbd0 [btrfs] btrfs_clone_files.isra.0+0xeb/0x140 [btrfs] ? file_update_time+0xcd/0x120 btrfs_remap_file_range+0x322/0x3b0 [btrfs] do_clone_file_range+0xb7/0x1e0 vfs_clone_file_range+0x30/0xa0 ioctl_file_clone+0x8a/0xc0 do_vfs_ioctl+0x5b2/0x6f0 __x64_sys_ioctl+0x37/0xa0 do_syscall_64+0x33/0x40 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x7f87977fc247 RSP: 002b:00007ffd51a2f6d8 EFLAGS: 00000206 ORIG_RAX: 0000000000000010 RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f87977fc247 RDX: 00007ffd51a2f710 RSI: 000000004020940d RDI: 0000000000000003 RBP: 0000000000000004 R08: 00007ffd51a79080 R09: 0000000000000000 R10: 00005621f11352f2 R11: 0000000000000206 R12: 0000000000000000 R13: 0000000000000000 R14: 00005621f128b958 R15: 0000000080000000 Kernel Offset: disabled ---[ end Kernel panic - not syncing: softlockup: hung tasks ]--- All of these lockup reports have the call chain btrfs_clone_files() -> btrfs_clone() in common. btrfs_clone_files() calls btrfs_clone() with both source and destination extents locked and loops over the source extent to create the clones. Conditionally reschedule in the btrfs_clone() loop, to give some time back to other processes. CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-09-17btrfs: fix wrong address when faulting in pages in the search ioctlFilipe Manana1-1/+2
commit 1c78544eaa4660096aeb6a57ec82b42cdb3bfe5a upstream. When faulting in the pages for the user supplied buffer for the search ioctl, we are passing only the base address of the buffer to the function fault_in_pages_writeable(). This means that after the first iteration of the while loop that searches for leaves, when we have a non-zero offset, stored in 'sk_offset', we try to fault in a wrong page range. So fix this by adding the offset in 'sk_offset' to the base address of the user supplied buffer when calling fault_in_pages_writeable(). Several users have reported that the applications compsize and bees have started to operate incorrectly since commit a48b73eca4ceb9 ("btrfs: fix potential deadlock in the search ioctl") was added to stable trees, and these applications make heavy use of the search ioctls. This fixes their issues. Link: https://lore.kernel.org/linux-btrfs/632b888d-a3c3-b085-cdf5-f9bb61017d92@lechevalier.se/ Link: https://github.com/kilobyte/compsize/issues/34 Fixes: a48b73eca4ceb9 ("btrfs: fix potential deadlock in the search ioctl") CC: stable@vger.kernel.org # 4.4+ Tested-by: A L <mail@lechevalier.se> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-09-09btrfs: fix potential deadlock in the search ioctlJosef Bacik1-7/+20
[ Upstream commit a48b73eca4ceb9b8a4b97f290a065335dbcd8a04 ] With the conversion of the tree locks to rwsem I got the following lockdep splat: ====================================================== WARNING: possible circular locking dependency detected 5.8.0-rc7-00165-g04ec4da5f45f-dirty #922 Not tainted ------------------------------------------------------ compsize/11122 is trying to acquire lock: ffff889fabca8768 (&mm->mmap_lock#2){++++}-{3:3}, at: __might_fault+0x3e/0x90 but task is already holding lock: ffff889fe720fe40 (btrfs-fs-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x39/0x180 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #2 (btrfs-fs-00){++++}-{3:3}: down_write_nested+0x3b/0x70 __btrfs_tree_lock+0x24/0x120 btrfs_search_slot+0x756/0x990 btrfs_lookup_inode+0x3a/0xb4 __btrfs_update_delayed_inode+0x93/0x270 btrfs_async_run_delayed_root+0x168/0x230 btrfs_work_helper+0xd4/0x570 process_one_work+0x2ad/0x5f0 worker_thread+0x3a/0x3d0 kthread+0x133/0x150 ret_from_fork+0x1f/0x30 -> #1 (&delayed_node->mutex){+.+.}-{3:3}: __mutex_lock+0x9f/0x930 btrfs_delayed_update_inode+0x50/0x440 btrfs_update_inode+0x8a/0xf0 btrfs_dirty_inode+0x5b/0xd0 touch_atime+0xa1/0xd0 btrfs_file_mmap+0x3f/0x60 mmap_region+0x3a4/0x640 do_mmap+0x376/0x580 vm_mmap_pgoff+0xd5/0x120 ksys_mmap_pgoff+0x193/0x230 do_syscall_64+0x50/0x90 entry_SYSCALL_64_after_hwframe+0x44/0xa9 -> #0 (&mm->mmap_lock#2){++++}-{3:3}: __lock_acquire+0x1272/0x2310 lock_acquire+0x9e/0x360 __might_fault+0x68/0x90 _copy_to_user+0x1e/0x80 copy_to_sk.isra.32+0x121/0x300 search_ioctl+0x106/0x200 btrfs_ioctl_tree_search_v2+0x7b/0xf0 btrfs_ioctl+0x106f/0x30a0 ksys_ioctl+0x83/0xc0 __x64_sys_ioctl+0x16/0x20 do_syscall_64+0x50/0x90 entry_SYSCALL_64_after_hwframe+0x44/0xa9 other info that might help us debug this: Chain exists of: &mm->mmap_lock#2 --> &delayed_node->mutex --> btrfs-fs-00 Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(btrfs-fs-00); lock(&delayed_node->mutex); lock(btrfs-fs-00); lock(&mm->mmap_lock#2); *** DEADLOCK *** 1 lock held by compsize/11122: #0: ffff889fe720fe40 (btrfs-fs-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x39/0x180 stack backtrace: CPU: 17 PID: 11122 Comm: compsize Kdump: loaded Not tainted 5.8.0-rc7-00165-g04ec4da5f45f-dirty #922 Hardware name: Quanta Tioga Pass Single Side 01-0030993006/Tioga Pass Single Side, BIOS F08_3A18 12/20/2018 Call Trace: dump_stack+0x78/0xa0 check_noncircular+0x165/0x180 __lock_acquire+0x1272/0x2310 lock_acquire+0x9e/0x360 ? __might_fault+0x3e/0x90 ? find_held_lock+0x72/0x90 __might_fault+0x68/0x90 ? __might_fault+0x3e/0x90 _copy_to_user+0x1e/0x80 copy_to_sk.isra.32+0x121/0x300 ? btrfs_search_forward+0x2a6/0x360 search_ioctl+0x106/0x200 btrfs_ioctl_tree_search_v2+0x7b/0xf0 btrfs_ioctl+0x106f/0x30a0 ? __do_sys_newfstat+0x5a/0x70 ? ksys_ioctl+0x83/0xc0 ksys_ioctl+0x83/0xc0 __x64_sys_ioctl+0x16/0x20 do_syscall_64+0x50/0x90 entry_SYSCALL_64_after_hwframe+0x44/0xa9 The problem is we're doing a copy_to_user() while holding tree locks, which can deadlock if we have to do a page fault for the copy_to_user(). This exists even without my locking changes, so it needs to be fixed. Rework the search ioctl to do the pre-fault and then copy_to_user_nofault for the copying. CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-08-21btrfs: add missing check for nocow and compression inode flagsDavid Sterba1-8/+22
commit f37c563bab4297024c300b05c8f48430e323809d upstream. User Forza reported on IRC that some invalid combinations of file attributes are accepted by chattr. The NODATACOW and compression file flags/attributes are mutually exclusive, but they could be set by 'chattr +c +C' on an empty file. The nodatacow will be in effect because it's checked first in btrfs_run_delalloc_range. Extend the flag validation to catch the following cases: - input flags are conflicting - old and new flags are conflicting - initialize the local variable with inode flags after inode ls locked Inode attributes take precedence over mount options and are an independent setting. Nocompress would be a no-op with nodatacow, but we don't want to mix any compression-related options with nodatacow. CC: stable@vger.kernel.org # 4.4+ Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-02-11Btrfs: make deduplication with range including the last block workFilipe Manana1-1/+2
commit 831d2fa25ab8e27592b1b0268dae6f2dfaf7cc43 upstream. Since btrfs was migrated to use the generic VFS helpers for clone and deduplication, it stopped allowing for the last block of a file to be deduplicated when the source file size is not sector size aligned (when eof is somewhere in the middle of the last block). There are two reasons for that: 1) The generic code always rounds down, to a multiple of the block size, the range's length for deduplications. This means we end up never deduplicating the last block when the eof is not block size aligned, even for the safe case where the destination range's end offset matches the destination file's size. That rounding down operation is done at generic_remap_check_len(); 2) Because of that, the btrfs specific code does not expect anymore any non-aligned range length's for deduplication and therefore does not work if such nona-aligned length is given. This patch addresses that second part, and it depends on a patch that fixes generic_remap_check_len(), in the VFS, which was submitted ealier and has the following subject: "fs: allow deduplication of eof block into the end of the destination file" These two patches address reports from users that started seeing lower deduplication rates due to the last block never being deduplicated when the file size is not aligned to the filesystem's block size. Link: https://lore.kernel.org/linux-btrfs/2019-1576167349.500456@svIo.N5dq.dFFD/ CC: stable@vger.kernel.org # 5.1+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-01-23Btrfs: always copy scrub arguments back to user spaceFilipe Manana1-1/+13
commit 5afe6ce748c1ea99e0d648153c05075e1ab93afb upstream. If scrub returns an error we are not copying back the scrub arguments structure to user space. This prevents user space to know how much progress scrub has done if an error happened - this includes -ECANCELED which is returned when users ask for scrub to stop. A particular use case, which is used in btrfs-progs, is to resume scrub after it is canceled, in that case it relies on checking the progress from the scrub arguments structure and then use that progress in a call to resume scrub. So fix this by always copying the scrub arguments structure to user space, overwriting the value returned to user space with -EFAULT only if copying the structure failed to let user space know that either that copying did not happen, and therefore the structure is stale, or it happened partially and the structure is probably not valid and corrupt due to the partial copy. Reported-by: Graham Cobb <g.btrfs@cobb.uk.net> Link: https://lore.kernel.org/linux-btrfs/d0a97688-78be-08de-ca7d-bcb4c7fb397e@cobb.uk.net/ Fixes: 06fe39ab15a6a4 ("Btrfs: do not overwrite scrub error with fault error in scrub ioctl") CC: stable@vger.kernel.org # 5.1+ Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Tested-by: Graham Cobb <g.btrfs@cobb.uk.net> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-01-12Btrfs: fix hole extent items with a zero size after range cloningFilipe Manana1-11/+5
[ Upstream commit 147271e35ba267506dde6550f58ccf8d287db3ef ] Normally when cloning a file range if we find an implicit hole at the end of the range we assume it is because the NO_HOLES feature is enabled. However that is not always the case. One well known case [1] is when we have a power failure after mixing buffered and direct IO writes against the same file. In such cases we need to punch a hole in the destination file, and if the NO_HOLES feature is not enabled, we need to insert explicit file extent items to represent the hole. After commit 690a5dbfc51315 ("Btrfs: fix ENOSPC errors, leading to transaction aborts, when cloning extents"), we started to insert file extent items representing the hole with an item size of 0, which is invalid and should be 53 bytes (the size of a btrfs_file_extent_item structure), resulting in all sorts of corruptions and invalid memory accesses. This is detected by the tree checker when we attempt to write a leaf to disk. The problem can be sporadically triggered by test case generic/561 from fstests. That test case does not exercise power failure and creates a new filesystem when it starts, so it does not use a filesystem created by any previous test that tests power failure. However the test does both buffered and direct IO writes (through fsstress) and it's precisely that which is creating the implicit holes in files. That happens even before the commit mentioned earlier. I need to investigate why we get those implicit holes to check if there is a real problem or not. For now this change fixes the regression of introducing file extent items with an item size of 0 bytes. Fix the issue by calling btrfs_punch_hole_range() without passing a btrfs_clone_extent_info structure, which ensures file extent items are inserted to represent the hole with a correct item size. We were passing a btrfs_clone_extent_info with a value of 0 for its 'item_size' field, which was causing the insertion of file extent items with an item size of 0. [1] https://www.spinics.net/lists/linux-btrfs/msg75350.html Reported-by: David Sterba <dsterba@suse.com> Fixes: 690a5dbfc51315 ("Btrfs: fix ENOSPC errors, leading to transaction aborts, when cloning extents") Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-12-31btrfs: abort transaction after failed inode updates in create_subvolJosef Bacik1-2/+8
commit c7e54b5102bf3614cadb9ca32d7be73bad6cecf0 upstream. We can just abort the transaction here, and in fact do that for every other failure in this function except these two cases. CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-11-04btrfs: un-deprecate ioctls START_SYNC and WAIT_SYNCDavid Sterba1-6/+0
The two ioctls START_SYNC and WAIT_SYNC were mistakenly marked as deprecated and scheduled for removal but we actualy do use them for 'btrfs subvolume delete -C/-c'. The deprecated thing in ebc87351e5fc should have been just the async flag for subvolume creation. The deprecation has been added in this development cycle, remove it until it's time. Fixes: ebc87351e5fc ("btrfs: Deprecate BTRFS_SUBVOL_CREATE_ASYNC flag") Signed-off-by: David Sterba <dsterba@suse.com>
2019-10-15btrfs: qgroup: Always free PREALLOC META reserve in ↵Qu Wenruo1-4/+2
btrfs_delalloc_release_extents() [Background] Btrfs qgroup uses two types of reserved space for METADATA space, PERTRANS and PREALLOC. PERTRANS is metadata space reserved for each transaction started by btrfs_start_transaction(). While PREALLOC is for delalloc, where we reserve space before joining a transaction, and finally it will be converted to PERTRANS after the writeback is done. [Inconsistency] However there is inconsistency in how we handle PREALLOC metadata space. The most obvious one is: In btrfs_buffered_write(): btrfs_delalloc_release_extents(BTRFS_I(inode), reserve_bytes, true); We always free qgroup PREALLOC meta space. While in btrfs_truncate_block(): btrfs_delalloc_release_extents(BTRFS_I(inode), blocksize, (ret != 0)); We only free qgroup PREALLOC meta space when something went wrong. [The Correct Behavior] The correct behavior should be the one in btrfs_buffered_write(), we should always free PREALLOC metadata space. The reason is, the btrfs_delalloc_* mechanism works by: - Reserve metadata first, even it's not necessary In btrfs_delalloc_reserve_metadata() - Free the unused metadata space Normally in: btrfs_delalloc_release_extents() |- btrfs_inode_rsv_release() Here we do calculation on whether we should release or not. E.g. for 64K buffered write, the metadata rsv works like: /* The first page */ reserve_meta: num_bytes=calc_inode_reservations() free_meta: num_bytes=0 total: num_bytes=calc_inode_reservations() /* The first page caused one outstanding extent, thus needs metadata rsv */ /* The 2nd page */ reserve_meta: num_bytes=calc_inode_reservations() free_meta: num_bytes=calc_inode_reservations() total: not changed /* The 2nd page doesn't cause new outstanding extent, needs no new meta rsv, so we free what we have reserved */ /* The 3rd~16th pages */ reserve_meta: num_bytes=calc_inode_reservations() free_meta: num_bytes=calc_inode_reservations() total: not changed (still space for one outstanding extent) This means, if btrfs_delalloc_release_extents() determines to free some space, then those space should be freed NOW. So for qgroup, we should call btrfs_qgroup_free_meta_prealloc() other than btrfs_qgroup_convert_reserved_meta(). The good news is: - The callers are not that hot The hottest caller is in btrfs_buffered_write(), which is already fixed by commit 336a8bb8e36a ("btrfs: Fix wrong btrfs_delalloc_release_extents parameter"). Thus it's not that easy to cause false EDQUOT. - The trans commit in advance for qgroup would hide the bug Since commit f5fef4593653 ("btrfs: qgroup: Make qgroup async transaction commit more aggressive"), when btrfs qgroup metadata free space is slow, it will try to commit transaction and free the wrongly converted PERTRANS space, so it's not that easy to hit such bug. [FIX] So to fix the problem, remove the @qgroup_free parameter for btrfs_delalloc_release_extents(), and always pass true to btrfs_inode_rsv_release(). Reported-by: Filipe Manana <fdmanana@suse.com> Fixes: 43b18595d660 ("btrfs: qgroup: Use separate meta reservation type for delalloc") CC: stable@vger.kernel.org # 4.19+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-09-09btrfs: stop clearing EXTENT_DIRTY in inode I/O treeOmar Sandoval1-3/+2
Since commit fee187d9d9dd ("Btrfs: do not set EXTENT_DIRTY along with EXTENT_DELALLOC"), we never set EXTENT_DIRTY in inode->io_tree, so we can simplify and stop trying to clear it. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-09-09btrfs: Deprecate BTRFS_SUBVOL_CREATE_ASYNC flagNikolay Borisov1-1/+14
Support for asynchronous snapshot creation was originally added in 72fd032e9424 ("Btrfs: add SNAP_CREATE_ASYNC ioctl") to cater for ceph's backend needs. However, since Ceph has deprecated support for btrfs there is no longer need for that support in btrfs. Additionally, this was never supported by btrfs-progs, the official userspace tools. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-09-09btrfs: sysfs: replace direct access to feature set names with a helperDavid Sterba1-1/+1
In order to unexport the feature type array, add a helper for the enum-to-string conversion. Signed-off-by: David Sterba <dsterba@suse.com>
2019-09-09btrfs: move basic block_group definitions to their own headerJosef Bacik1-0/+1
This is prep work for moving all of the block group cache code into its own file. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> [ minor comment updates ] Signed-off-by: David Sterba <dsterba@suse.com>
2019-09-09Btrfs: remove unnecessary condition in btrfs_clone() to avoid too much nestingFilipe Manana1-145/+131
The bulk of the work done when cloning extents, at ioctl.c:btrfs_clone(), is done inside an if statement that checks if the found key has the type BTRFS_EXTENT_DATA_KEY. That if statement is redundant however, because btrfs_search_slot() always leaves us in a leaf slot that points to a key that is always greater then or equals to the search key, and our search key here has that type, and because we bail out before that if statement if the key at the given leaf slot is greater then BTRFS_EXTENT_DATA_KEY. Therefore just remove that if statement, not only because it is useless but mostly because it increases the nesting/indentation level in this function which is quite big and makes things a bit awkward whenever I need to fix something that requires changing btrfs_clone() (and it has been like that for many years already). Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-09-09btrfs: use common vfs LABEL ioctl definitionsEric Sandeen1-4/+4
I lifted the btrfs label get/set ioctls to the vfs some time ago, but never followed up to use those common definitions directly in btrfs. This patch does that. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-09-09Btrfs: fix ENOSPC errors, leading to transaction aborts, when cloning extentsFilipe Manana1-152/+43
When cloning extents (or deduplicating) we create a transaction with a space reservation that considers we will drop or update a single file extent item of the destination inode (that we modify a single leaf). That is fine for the vast majority of scenarios, however it might happen that we need to drop many file extent items, and adjust at most two file extent items, in the destination root, which can span multiple leafs. This will lead to either the call to btrfs_drop_extents() to fail with ENOSPC or the subsequent calls to btrfs_insert_empty_item() or btrfs_update_inode() (called through clone_finish_inode_update()) to fail with ENOSPC. Such failure results in a transaction abort, leaving the filesystem in a read-only mode. In order to fix this we need to follow the same approach as the hole punching code, where we create a local reservation with 1 unit and keep ending and starting transactions, after balancing the btree inode, when __btrfs_drop_extents() returns ENOSPC. So fix this by making the extent cloning call calls the recently added btrfs_punch_hole_range() helper, which is what does the mentioned work for hole punching, and make sure whenever we drop extent items in a transaction, we also add a replacing file extent item, to avoid corruption (a hole) if after ending a transaction and before starting a new one, the old transaction gets committed and a power failure happens before we finish cloning. A test case for fstests follows soon. Reported-by: David Goodwin <david@codepoets.co.uk> Link: https://lore.kernel.org/linux-btrfs/a4a4cf31-9cf4-e52c-1f86-c62d336c9cd1@codepoets.co.uk/ Reported-by: Sam Tygier <sam@tygier.co.uk> Link: https://lore.kernel.org/linux-btrfs/82aace9f-a1e3-1f0b-055f-3ea75f7a41a0@tygier.co.uk/ Fixes: b6f3409b2197e8f ("Btrfs: reserve sufficient space for ioctl clone") Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-17Merge tag 'for-5.3-tag' of ↵Linus Torvalds1-0/+23
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs updates from David Sterba: "Highlights: - chunks that have been trimmed and unchanged since last mount are tracked and skipped on repeated trims - use hw assissed crc32c on more arches, speedups if native instructions or optimized implementation is available - the RAID56 incompat bit is automatically removed when the last block group of that type is removed Fixes: - fsync fix for reflink on NODATACOW files that could lead to ENOSPC - fix data loss after inode eviction, renaming it, and fsync it - fix fsync not persisting dentry deletions due to inode evictions - update ctime/mtime/iversion after hole punching - fix compression type validation (reported by KASAN) - send won't be allowed to start when relocation is in progress, this can cause spurious errors or produce incorrect send stream Core: - new tracepoints for space update - tree-checker: better check for end of extents for some tree items - preparatory work for more checksum algorithms - run delayed iput at unlink time and don't push the work to cleaner thread where it's not properly throttled - wrap block mapping to structures and helpers, base for further refactoring - split large files, part 1: - space info handling - block group reservations - delayed refs - delayed allocation - other cleanups and refactoring" * tag 'for-5.3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (103 commits) btrfs: fix memory leak of path on error return path btrfs: move the subvolume reservation stuff out of extent-tree.c btrfs: migrate the delalloc space stuff to it's own home btrfs: migrate btrfs_trans_release_chunk_metadata btrfs: migrate the delayed refs rsv code btrfs: Evaluate io_tree in find_lock_delalloc_range() btrfs: migrate the global_block_rsv helpers to block-rsv.c btrfs: migrate the block-rsv code to block-rsv.c btrfs: stop using block_rsv_release_bytes everywhere btrfs: cleanup the target logic in __btrfs_block_rsv_release btrfs: export __btrfs_block_rsv_release btrfs: export btrfs_block_rsv_add_bytes btrfs: move btrfs_block_rsv definitions into it's own header btrfs: Simplify update of space_info in __reserve_metadata_bytes() btrfs: unexport can_overcommit btrfs: move reserve_metadata_bytes and supporting code to space-info.c btrfs: move dump_space_info to space-info.c btrfs: export block_rsv_use_bytes btrfs: move btrfs_space_info_add_*_bytes to space-info.c btrfs: move the space info update macro to space-info.h ...
2019-07-13Merge tag 'vfs-fix-ioctl-checking-3' of ↵Linus Torvalds1-19/+11
git://git.kernel.org/pub/scm/fs/xfs/xfs-linux Pull common SETFLAGS/FSSETXATTR parameter checking from Darrick Wong: "Here's a patch series that sets up common parameter checking functions for the FS_IOC_SETFLAGS and FS_IOC_FSSETXATTR ioctl implementations. The goal here is to reduce the amount of behaviorial variance between the filesystems where those ioctls originated (ext2 and XFS, respectively) and everybody else. - Standardize parameter checking for the SETFLAGS and FSSETXATTR ioctls (which were the file attribute setters for ext4 and xfs and have now been hoisted to the vfs) - Only allow the DAX flag to be set on files and directories" * tag 'vfs-fix-ioctl-checking-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: vfs: only allow FSSETXATTR to set DAX flag on files and dirs vfs: teach vfs_ioc_fssetxattr_check to check extent size hints vfs: teach vfs_ioc_fssetxattr_check to check project id info vfs: create a generic checking function for FS_IOC_FSSETXATTR vfs: create a generic checking and prep function for FS_IOC_SETFLAGS
2019-07-11Merge tag 'fsnotify_for_v5.3-rc1' of ↵Linus Torvalds1-1/+3
git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs Pull fsnotify updates from Jan Kara: "This contains cleanups of the fsnotify name removal hook and also a patch to disable fanotify permission events for 'proc' filesystem" * tag 'fsnotify_for_v5.3-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: fsnotify: get rid of fsnotify_nameremove() fsnotify: move fsnotify_nameremove() hook out of d_delete() configfs: call fsnotify_rmdir() hook debugfs: call fsnotify_{unlink,rmdir}() hooks debugfs: simplify __debugfs_remove_file() devpts: call fsnotify_unlink() hook tracefs: call fsnotify_{unlink,rmdir}() hooks rpc_pipefs: call fsnotify_{unlink,rmdir}() hooks btrfs: call fsnotify_rmdir() hook fsnotify: add empty fsnotify_{unlink,rmdir}() hooks fanotify: Disallow permission events for proc filesystem
2019-07-04btrfs: migrate the delalloc space stuff to it's own homeJosef Bacik1-0/+1
We have code for data and metadata reservations for delalloc. There's quite a bit of code here, and it's used in a lot of places so I've separated it out to it's own file. inode.c and file.c are already pretty large, and this code is complicated enough to live in its own space. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-02btrfs: move space_info to space-info.hJosef Bacik1-0/+1
Migrate the struct definition and the one helper that's in ctree.h into space-info.h Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-01vfs: create a generic checking function for FS_IOC_FSSETXATTRDarrick J. Wong1-11/+6
Create a generic checking function for the incoming FS_IOC_FSSETXATTR fsxattr values so that we can standardize some of the implementation behaviors. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Jan Kara <jack@suse.cz>
2019-07-01vfs: create a generic checking and prep function for FS_IOC_SETFLAGSDarrick J. Wong1-8/+5
Create a generic function to check incoming FS_IOC_SETFLAGS flag values and later prepare the inode for updates so that we can standardize the implementations that follow ext4's flag values. Note that the efivarfs implementation no longer fails a no-op SETFLAGS without CAP_LINUX_IMMUTABLE since that's the behavior in ext*. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: David Sterba <dsterba@suse.com> Reviewed-by: Bob Peterson <rpeterso@redhat.com>
2019-07-01btrfs: Flush before reflinking any extent to prevent NOCOW write falling ↵Qu Wenruo1-0/+21
back to COW without data reservation [BUG] The following script can cause unexpected fsync failure: #!/bin/bash dev=/dev/test/test mnt=/mnt/btrfs mkfs.btrfs -f $dev -b 512M > /dev/null mount $dev $mnt -o nospace_cache # Prealloc one extent xfs_io -f -c "falloc 8k 64m" $mnt/file1 # Fill the remaining data space xfs_io -f -c "pwrite 0 -b 4k 512M" $mnt/padding sync # Write into the prealloc extent xfs_io -c "pwrite 1m 16m" $mnt/file1 # Reflink then fsync, fsync would fail due to ENOSPC xfs_io -c "reflink $mnt/file1 8k 0 4k" -c "fsync" $mnt/file1 umount $dev The fsync fails with ENOSPC, and the last page of the buffered write is lost. [CAUSE] This is caused by: - Btrfs' back reference only has extent level granularity So write into shared extent must be COWed even only part of the extent is shared. So for above script we have: - fallocate Create a preallocated extent where we can do NOCOW write. - fill all the remaining data and unallocated space - buffered write into preallocated space As we have not enough space available for data and the extent is not shared (yet) we fall into NOCOW mode. - reflink Now part of the large preallocated extent is shared, later write into that extent must be COWed. - fsync triggers writeback But now the extent is shared and therefore we must fallback into COW mode, which fails with ENOSPC since there's not enough space to allocate data extents. [WORKAROUND] The workaround is to ensure any buffered write in the related extents (not just the reflink source range) get flushed before reflink/dedupe, so that NOCOW writes succeed that happened before reflinking succeed. The workaround is expensive, we could do it better by only flushing NOCOW range, but that needs extra accounting for NOCOW range. For now, fix the possible data loss first. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-06-20btrfs: call fsnotify_rmdir() hookAmir Goldstein1-1/+3
This will allow generating fsnotify delete events after the fsnotify_nameremove() hook is removed from d_delete(). Signed-off-by: Amir Goldstein <amir73il@gmail.com> Acked-by: David Sterba <dsterba@suse.com> Signed-off-by: Jan Kara <jack@suse.cz>
2019-06-17Btrfs: fix failure to persist compression property xattr deletion on fsyncFilipe Manana1-2/+0
After the recent series of cleanups in the properties and xattrs modules that landed in the 5.2 merge window, we ended up with a regression where after deleting the compression xattr property through the setflags ioctl, we don't set the BTRFS_INODE_COPY_EVERYTHING flag in the inode anymore. As a consequence, if the inode was fsync'ed when it had the compression property set, after deleting the compression property through the setflags ioctl and fsync'ing again the inode, the log will still contain the compression xattr, because the inode did not had that bit set, which made the fsync not delete all xattrs from the log and copy all xattrs from the subvolume tree to the log tree. This regression happens due to the fact that that series of cleanups made btrfs_set_prop() call the old function do_setxattr() (which is now named btrfs_setxattr()), and not the old version of btrfs_setxattr(), which is now called btrfs_setxattr_trans(). Fix this by setting the BTRFS_INODE_COPY_EVERYTHING bit in the current btrfs_setxattr() function and remove it from everywhere else, including its setup at btrfs_ioctl_setflags(). This is cleaner, avoids similar regressions in the future, and centralizes the setup of the bit. After all, the need to setup this bit should only be in the xattrs module, since it is an implementation of xattrs. Fixes: 04e6863b19c722 ("btrfs: split btrfs_setxattr calls regarding transaction") CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29btrfs: drop local copy of inode i_modeAnand Jain1-5/+2
There isn't real use of making struct inode::i_mode a local copy, it saves a dereference one time, not much. Just use it directly. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29btrfs: drop old_fsflags in btrfs_ioctl_setflagsAnand Jain1-3/+3
btrfs_inode_flags_to_fsflags() is copied into @old_fsflags and used only once. Instead used it directly. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29btrfs: modify local copy of btrfs_inode flagsAnand Jain1-31/+26
Instead of updating the binode::flags directly, update a local copy, and then at the point of no error, store copy it to the binode::flags. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29btrfs: drop useless inode i_flags copy and restoreAnand Jain1-3/+0
The patch ("btrfs: start transaction in btrfs_ioctl_setflags()") used btrfs_set_prop() instead of btrfs_set_prop_trans() by which now the inode::i_flags update functions such as btrfs_sync_inode_flags_to_i_flags() and btrfs_update_inode() is called in btrfs_ioctl_setflags() instead of btrfs_set_prop_trans()->btrfs_setxattr() as earlier. So the inode::i_flags remains unmodified until the thread has checked all the conditions. So drop the saved inode::i_flags in out_i_flags. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29btrfs: start transaction in btrfs_ioctl_setflags()Anand Jain1-20/+25
Inode attribute can be set through the FS_IOC_SETFLAGS ioctl. This flags also includes compression attribute for which we would set/reset the compression extended attribute. While doing this there is a bit of duplicate code, the following things happens twice: - start/end_transaction - inode_inc_iversion() - current_time update to inode->i_ctime - and btrfs_update_inode() These are updated both at btrfs_ioctl_setflags() and btrfs_set_props() as well. This patch merges these two duplicate codes at btrfs_ioctl_setflags(). Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29btrfs: refactor btrfs_set_props to validate externallyAnand Jain1-0/+3
In preparation to merge multiple transactions when setting the compression flags, split btrfs_set_props() validation part outside of it. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29Btrfs: fix race between send and deduplication that lead to failures and crashesFilipe Manana1-1/+18
Send operates on read only trees and expects them to never change while it is using them. This is part of its initial design, and this expection is due to two different reasons: 1) When it was introduced, no operations were allowed to modifiy read-only subvolumes/snapshots (including defrag for example). 2) It keeps send from having an impact on other filesystem operations. Namely send does not need to keep locks on the trees nor needs to hold on to transaction handles and delay transaction commits. This ends up being a consequence of the former reason. However the deduplication feature was introduced later (on September 2013, while send was introduced in July 2012) and it allowed for deduplication with destination files that belong to read-only trees (subvolumes and snapshots). That means that having a send operation (either full or incremental) running in parallel with a deduplication that has the destination inode in one of the trees used by the send operation, can result in tree nodes and leaves getting freed and reused while send is using them. This problem is similar to the problem solved for the root nodes getting freed and reused when a snapshot is made against one tree that is currenly being used by a send operation, fixed in commits [1] and [2]. These commits explain in detail how the problem happens and the explanation is valid for any node or leaf that is not the root of a tree as well. This problem was also discussed and explained recently in a thread [3]. The problem is very easy to reproduce when using send with large trees (snapshots) and just a few concurrent deduplication operations that target files in the trees used by send. A stress test case is being sent for fstests that triggers the issue easily. The most common error to hit is the send ioctl return -EIO with the following messages in dmesg/syslog: [1631617.204075] BTRFS error (device sdc): did not find backref in send_root. inode=63292, offset=0, disk_byte=5228134400 found extent=5228134400 [1631633.251754] BTRFS error (device sdc): parent transid verify failed on 32243712 wanted 24 found 27 The first one is very easy to hit while the second one happens much less frequently, except for very large trees (in that test case, snapshots with 100000 files having large xattrs to get deep and wide trees). Less frequently, at least one BUG_ON can be hit: [1631742.130080] ------------[ cut here ]------------ [1631742.130625] kernel BUG at fs/btrfs/ctree.c:1806! [1631742.131188] invalid opcode: 0000 [#6] SMP DEBUG_PAGEALLOC PTI [1631742.131726] CPU: 1 PID: 13394 Comm: btrfs Tainted: G B D W 5.0.0-rc8-btrfs-next-45 #1 [1631742.132265] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014 [1631742.133399] RIP: 0010:read_node_slot+0x122/0x130 [btrfs] (...) [1631742.135061] RSP: 0018:ffffb530021ebaa0 EFLAGS: 00010246 [1631742.135615] RAX: ffff93ac8912e000 RBX: 000000000000009d RCX: 0000000000000002 [1631742.136173] RDX: 000000000000009d RSI: ffff93ac564b0d08 RDI: ffff93ad5b48c000 [1631742.136759] RBP: ffffb530021ebb7d R08: 0000000000000001 R09: ffffb530021ebb7d [1631742.137324] R10: ffffb530021eba70 R11: 0000000000000000 R12: ffff93ac87d0a708 [1631742.137900] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000001 [1631742.138455] FS: 00007f4cdb1528c0(0000) GS:ffff93ad76a80000(0000) knlGS:0000000000000000 [1631742.139010] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [1631742.139568] CR2: 00007f5acb3d0420 CR3: 000000012be3e006 CR4: 00000000003606e0 [1631742.140131] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [1631742.140719] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [1631742.141272] Call Trace: [1631742.141826] ? do_raw_spin_unlock+0x49/0xc0 [1631742.142390] tree_advance+0x173/0x1d0 [btrfs] [1631742.142948] btrfs_compare_trees+0x268/0x690 [btrfs] [1631742.143533] ? process_extent+0x1070/0x1070 [btrfs] [1631742.144088] btrfs_ioctl_send+0x1037/0x1270 [btrfs] [1631742.144645] _btrfs_ioctl_send+0x80/0x110 [btrfs] [1631742.145161] ? trace_sched_stick_numa+0xe0/0xe0 [1631742.145685] btrfs_ioctl+0x13fe/0x3120 [btrfs] [1631742.146179] ? account_entity_enqueue+0xd3/0x100 [1631742.146662] ? reweight_entity+0x154/0x1a0 [1631742.147135] ? update_curr+0x20/0x2a0 [1631742.147593] ? check_preempt_wakeup+0x103/0x250 [1631742.148053] ? do_vfs_ioctl+0xa2/0x6f0 [1631742.148510] ? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs] [1631742.148942] do_vfs_ioctl+0xa2/0x6f0 [1631742.149361] ? __fget+0x113/0x200 [1631742.149767] ksys_ioctl+0x70/0x80 [1631742.150159] __x64_sys_ioctl+0x16/0x20 [1631742.150543] do_syscall_64+0x60/0x1b0 [1631742.150931] entry_SYSCALL_64_after_hwframe+0x49/0xbe [1631742.151326] RIP: 0033:0x7f4cd9f5add7 (...) [1631742.152509] RSP: 002b:00007ffe91017708 EFLAGS: 00000202 ORIG_RAX: 0000000000000010 [1631742.152892] RAX: ffffffffffffffda RBX: 0000000000000105 RCX: 00007f4cd9f5add7 [1631742.153268] RDX: 00007ffe91017790 RSI: 0000000040489426 RDI: 0000000000000007 [1631742.153633] RBP: 0000000000000007 R08: 00007f4cd9e79700 R09: 00007f4cd9e79700 [1631742.153999] R10: 00007f4cd9e799d0 R11: 0000000000000202 R12: 0000000000000003 [1631742.154365] R13: 0000555dfae53020 R14: 0000000000000000 R15: 0000000000000001 (...) [1631742.156696] ---[ end trace 5dac9f96dcc3fd6b ]--- That BUG_ON happens because while send is using a node, that node is COWed by a concurrent deduplication, gets freed and gets reused as a leaf (because a transaction commit happened in between), so when it attempts to read a slot from the extent buffer, at ctree.c:read_node_slot(), the extent buffer contents were wiped out and it now matches a leaf (which can even belong to some other tree now), hitting the BUG_ON(level == 0). Fix this concurrency issue by not allowing send and deduplication to run in parallel if both operate on the same readonly trees, returning EAGAIN to user space and logging an exlicit warning in dmesg/syslog. [1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=be6821f82c3cc36e026f5afd10249988852b35ea [2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=6f2f0b394b54e2b159ef969a0b5274e9bbf82ff2 [3] https://lore.kernel.org/linux-btrfs/CAL3q7H7iqSEEyFaEtpRZw3cp613y+4k2Q8b4W7mweR3tZA05bQ@mail.gmail.com/ CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29btrfs: extent-tree: Use btrfs_ref to refactor btrfs_inc_extent_ref()Qu Wenruo1-6/+9
Use the new btrfs_ref structure and replace parameter list to clean up the usage of owner and level to distinguish the extent types. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29btrfs: Perform locking/unlocking in btrfs_remap_file_range()Goldwyn Rodrigues1-23/+11
Move code to make it more readable, so as locking and unlocking is done in the same function. The generic checks that are now performed in the locked section are unaffected. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29btrfs: refactor btrfs_set_prop and add btrfs_set_prop_transAnand Jain1-6/+6
btrfs_set_prop() takes transaction pointer as the first argument, however in ioctl.c for the purpose of setting the compression property, we call btrfs_set_prop() with NULL transaction pointer. Down in the call chain btrfs_setxattr() starts transaction to update the attribute and also to update the inode. So for clarity, create btrfs_set_prop_trans() with no transaction pointer as argument, in preparation to start transaction here instead of doing it down the call chain at btrfs_setxattr(). Also now the btrfs_set_prop() is a static function. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29btrfs: merge _btrfs_set_prop helpersAnand Jain1-4/+6
btrfs_set_prop() is a redirect to __btrfs_set_prop() with the transaction handle equal to NULL. __btrfs_set_prop() in turn passes this to do_setxattr() which then transaction is actually created. Instead merge __btrfs_set_prop() to btrfs_set_prop(), and update the caller with NULL argument. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-03-28Btrfs: do not allow trimming when a fs is mounted with the nologreplay optionFilipe Manana1-0/+10
Whan a filesystem is mounted with the nologreplay mount option, which requires it to be mounted in RO mode as well, we can not allow discard on free space inside block groups, because log trees refer to extents that are not pinned in a block group's free space cache (pinning the extents is precisely the first phase of replaying a log tree). So do not allow the fitrim ioctl to do anything when the filesystem is mounted with the nologreplay option, because later it can be mounted RW without that option, which causes log replay to happen and result in either a failure to replay the log trees (leading to a mount failure), a crash or some silent corruption. Reported-by: Darrick J. Wong <darrick.wong@oracle.com> Fixes: 96da09192cda ("btrfs: Introduce new mount option to disable tree log replay") CC: stable@vger.kernel.org # 4.9+ Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-02-27Btrfs: fix deadlock between clone/dedupe and renameFilipe Manana1-18/+3
Reflinking (clone/dedupe) and rename are operations that operate on two inodes and therefore need to lock them in the same order to avoid ABBA deadlocks. It happens that Btrfs' reflink implementation always locked them in a different order from VFS's lock_two_nondirectories() helper, which is used by the rename code in VFS, resulting in ABBA type deadlocks. Btrfs' locking order: static void btrfs_double_inode_lock(struct inode *inode1, struct inode *inode2) { if (inode1 < inode2) swap(inode1, inode2); inode_lock_nested(inode1, I_MUTEX_PARENT); inode_lock_nested(inode2, I_MUTEX_CHILD); } VFS's locking order: void lock_two_nondirectories(struct inode *inode1, struct inode *inode2) { if (inode1 > inode2) swap(inode1, inode2); if (inode1 && !S_ISDIR(inode1->i_mode)) inode_lock(inode1); if (inode2 && !S_ISDIR(inode2->i_mode) && inode2 != inode1) inode_lock_nested(inode2, I_MUTEX_NONDIR2); } Fix this by killing the btrfs helper function that does the double inode locking and replace it with VFS's helper lock_two_nondirectories(). Reported-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org> Fixes: 416161db9b63e3 ("btrfs: offline dedupe") CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-02-25Btrfs: remove no longer needed range length checks for deduplicationFilipe Manana1-17/+2
Comparing the content of the pages in the range to deduplicate is now done in generic_remap_checks called by the generic helper generic_remap_file_range_prep(), which takes care of ensuring we do not compare/deduplicate undefined data beyond a file's EOF (range from EOF to the next block boundary). So remove these checks which are now redundant. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-02-25btrfs: merge btrfs_find_device and find_deviceAnand Jain1-2/+2
Both btrfs_find_device() and find_device() does the same thing except that the latter does not take the seed device onto account in the device scanning context. We can merge them. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-02-25btrfs: refactor btrfs_find_device() take fs_devices as argumentAnand Jain1-2/+3
btrfs_find_device() accepts fs_info as an argument and retrieves fs_devices from fs_info. Instead use fs_devices, so that this function can be used in non-mount (during device scanning) context as well. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-02-25Btrfs: move duplicated nodatasum check into common reflink/dedupe helperFilipe Manana1-10/+7
Move the check that verifies if both inodes have checksums disabled or both have them enabled, from the clone and deduplication functions into the new common helper btrfs_remap_file_range_prep(). Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-02-25Btrfs: do not overwrite error return value in the balance ioctlFilipe Manana1-1/+1
If the call to btrfs_balance() failed we would overwrite the error returned to user space with -EFAULT if the call to copy_to_user() failed as well. Fix that by calling copy_to_user() only if btrfs_balance() returned success or was canceled. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-02-25Btrfs: do not overwrite error return value in the device replace ioctlFilipe Manana1-1/+1
If the call to btrfs_dev_replace_by_ioctl() failed we would overwrite the error returned to user space with -EFAULT if the call to copy_to_user() failed as well. Fix that by calling copy_to_user() only if no error happened before or a device replace operation was canceled. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>