summaryrefslogtreecommitdiff
path: root/fs/btrfs/ctree.h
AgeCommit message (Collapse)AuthorFilesLines
2021-12-15btrfs: fix invalid delayed ref after subvolume creation failureFilipe Manana1-1/+6
When creating a subvolume, at ioctl.c:create_subvol(), if we fail to insert the new root's root item into the root tree, we are freeing the metadata extent we reserved for the new root to prevent a metadata extent leak, as we don't abort the transaction at that point (since there is nothing at that point that is irreversible). However we allocated the metadata extent for the new root which we are creating for the new subvolume, so its delayed reference refers to the ID of this new root. But when we free the metadata extent we pass the root of the subvolume where the new subvolume is located to btrfs_free_tree_block() - this is incorrect because this will generate a delayed reference that refers to the ID of the parent subvolume's root, and not to ID of the new root. This results in a failure when running delayed references that leads to a transaction abort and a trace like the following: [3868.738042] RIP: 0010:__btrfs_free_extent+0x709/0x950 [btrfs] [3868.739857] Code: 68 0f 85 e6 fb ff (...) [3868.742963] RSP: 0018:ffffb0e9045cf910 EFLAGS: 00010246 [3868.743908] RAX: 00000000fffffffe RBX: 00000000fffffffe RCX: 0000000000000002 [3868.745312] RDX: 00000000fffffffe RSI: 0000000000000002 RDI: ffff90b0cd793b88 [3868.746643] RBP: 000000000e5d8000 R08: 0000000000000000 R09: ffff90b0cd793b88 [3868.747979] R10: 0000000000000002 R11: 00014ded97944d68 R12: 0000000000000000 [3868.749373] R13: ffff90b09afe4a28 R14: 0000000000000000 R15: ffff90b0cd793b88 [3868.750725] FS: 00007f281c4a8b80(0000) GS:ffff90b3ada00000(0000) knlGS:0000000000000000 [3868.752275] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [3868.753515] CR2: 00007f281c6a5000 CR3: 0000000108a42006 CR4: 0000000000370ee0 [3868.754869] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [3868.756228] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [3868.757803] Call Trace: [3868.758281] <TASK> [3868.758655] ? btrfs_merge_delayed_refs+0x178/0x1c0 [btrfs] [3868.759827] __btrfs_run_delayed_refs+0x2b1/0x1250 [btrfs] [3868.761047] btrfs_run_delayed_refs+0x86/0x210 [btrfs] [3868.762069] ? lock_acquired+0x19f/0x420 [3868.762829] btrfs_commit_transaction+0x69/0xb20 [btrfs] [3868.763860] ? _raw_spin_unlock+0x29/0x40 [3868.764614] ? btrfs_block_rsv_release+0x1c2/0x1e0 [btrfs] [3868.765870] create_subvol+0x1d8/0x9a0 [btrfs] [3868.766766] btrfs_mksubvol+0x447/0x4c0 [btrfs] [3868.767669] ? preempt_count_add+0x49/0xa0 [3868.768444] __btrfs_ioctl_snap_create+0x123/0x190 [btrfs] [3868.769639] ? _copy_from_user+0x66/0xa0 [3868.770391] btrfs_ioctl_snap_create_v2+0xbb/0x140 [btrfs] [3868.771495] btrfs_ioctl+0xd1e/0x35c0 [btrfs] [3868.772364] ? __slab_free+0x10a/0x360 [3868.773198] ? rcu_read_lock_sched_held+0x12/0x60 [3868.774121] ? lock_release+0x223/0x4a0 [3868.774863] ? lock_acquired+0x19f/0x420 [3868.775634] ? rcu_read_lock_sched_held+0x12/0x60 [3868.776530] ? trace_hardirqs_on+0x1b/0xe0 [3868.777373] ? _raw_spin_unlock_irqrestore+0x3e/0x60 [3868.778280] ? kmem_cache_free+0x321/0x3c0 [3868.779011] ? __x64_sys_ioctl+0x83/0xb0 [3868.779718] __x64_sys_ioctl+0x83/0xb0 [3868.780387] do_syscall_64+0x3b/0xc0 [3868.781059] entry_SYSCALL_64_after_hwframe+0x44/0xae [3868.781953] RIP: 0033:0x7f281c59e957 [3868.782585] Code: 3c 1c 48 f7 d8 4c (...) [3868.785867] RSP: 002b:00007ffe1f83e2b8 EFLAGS: 00000202 ORIG_RAX: 0000000000000010 [3868.787198] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f281c59e957 [3868.788450] RDX: 00007ffe1f83e2c0 RSI: 0000000050009418 RDI: 0000000000000003 [3868.789748] RBP: 00007ffe1f83f300 R08: 0000000000000000 R09: 00007ffe1f83fe36 [3868.791214] R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000003 [3868.792468] R13: 0000000000000003 R14: 00007ffe1f83e2c0 R15: 00000000000003cc [3868.793765] </TASK> [3868.794037] irq event stamp: 0 [3868.794548] hardirqs last enabled at (0): [<0000000000000000>] 0x0 [3868.795670] hardirqs last disabled at (0): [<ffffffff98294214>] copy_process+0x934/0x2040 [3868.797086] softirqs last enabled at (0): [<ffffffff98294214>] copy_process+0x934/0x2040 [3868.798309] softirqs last disabled at (0): [<0000000000000000>] 0x0 [3868.799284] ---[ end trace be24c7002fe27747 ]--- [3868.799928] BTRFS info (device dm-0): leaf 241188864 gen 1268 total ptrs 214 free space 469 owner 2 [3868.801133] BTRFS info (device dm-0): refs 2 lock_owner 225627 current 225627 [3868.802056] item 0 key (237436928 169 0) itemoff 16250 itemsize 33 [3868.802863] extent refs 1 gen 1265 flags 2 [3868.803447] ref#0: tree block backref root 1610 (...) [3869.064354] item 114 key (241008640 169 0) itemoff 12488 itemsize 33 [3869.065421] extent refs 1 gen 1268 flags 2 [3869.066115] ref#0: tree block backref root 1689 (...) [3869.403834] BTRFS error (device dm-0): unable to find ref byte nr 241008640 parent 0 root 1622 owner 0 offset 0 [3869.405641] BTRFS: error (device dm-0) in __btrfs_free_extent:3076: errno=-2 No such entry [3869.407138] BTRFS: error (device dm-0) in btrfs_run_delayed_refs:2159: errno=-2 No such entry Fix this by passing the new subvolume's root ID to btrfs_free_tree_block(). This requires changing the root argument of btrfs_free_tree_block() from struct btrfs_root * to a u64, since at this point during the subvolume creation we have not yet created the struct btrfs_root for the new subvolume, and btrfs_free_tree_block() only needs a root ID and nothing else from a struct btrfs_root. This was triggered by test case generic/475 from fstests. Fixes: 67addf29004c5b ("btrfs: fix metadata extent leak after failure to create subvolume") CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-29btrfs: remove root argument from btrfs_unlink_inode()Filipe Manana1-1/+0
The root argument passed to btrfs_unlink_inode() and its callee, __btrfs_unlink_inode(), always matches the root of the given directory and the given inode. So remove the argument and make __btrfs_unlink_inode() use the root of the directory. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26btrfs: make btrfs_super_block size match BTRFS_SUPER_INFO_SIZEQu Wenruo1-0/+7
It's a common practice to avoid use sizeof(struct btrfs_super_block) (3531), but to use BTRFS_SUPER_INFO_SIZE (4096). The problem is that, sizeof(struct btrfs_super_block) doesn't match BTRFS_SUPER_INFO_SIZE from the very beginning. Furthermore, for all call sites except selftests, we always allocate BTRFS_SUPER_INFO_SIZE space for super block, there isn't any real reason to use the smaller value, and it doesn't really save any space. So let's get rid of such confusing behavior, and unify those two values. This modification also adds a new static_assert() to verify the size, and moves the BTRFS_SUPER_INFO_* macros to the definition of btrfs_super_block for the static_assert(). Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26btrfs: add a BTRFS_FS_ERROR helperJosef Bacik1-0/+3
We have a few flags that are inconsistently used to describe the fs in different states of failure. As of 5963ffcaf383 ("btrfs: always abort the transaction if we abort a trans handle") we will always set BTRFS_FS_STATE_ERROR if we abort, so we don't have to check both ABORTED and ERROR to see if things have gone wrong. Add a helper to check BTRFS_FS_STATE_ERROR and then convert all checkers of FS_STATE_ERROR to use the helper. The TRANS_ABORTED bit check was added in af7227338135 ("Btrfs: clean up resources during umount after trans is aborted") but is not actually specific. Reviewed-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26btrfs: remove unused function btrfs_bio_fits_in_stripe()Qu Wenruo1-2/+0
As the last caller in compression.c has been removed, we don't need that function anymore. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26btrfs: unexport setup_items_for_insert()Filipe Manana1-19/+5
Since setup_items_for_insert() is not used anymore outside of ctree.c, make it static and remove its prototype from ctree.h. This also requires to move the definition of setup_item_for_insert() from ctree.h to ctree.c and move down btrfs_duplicate_item() so that it's defined after setup_items_for_insert(). Further, since setup_item_for_insert() is used outside ctree.c, rename it to btrfs_setup_item_for_insert(). This patch is part of a small patchset that is comprised of the following patches: btrfs: loop only once over data sizes array when inserting an item batch btrfs: unexport setup_items_for_insert() btrfs: use single bulk copy operations when logging directories This is patch 2/3 and performance results, and the specific tests, are included in the changelog of patch 3/3. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26btrfs: loop only once over data sizes array when inserting an item batchFilipe Manana1-5/+52
When inserting a batch of items into a btree, we end up looping over the data sizes array 3 times: 1) Once in the caller of btrfs_insert_empty_items(), when it populates the array with the data sizes for each item; 2) Once at btrfs_insert_empty_items() to sum the elements of the data sizes array and compute the total data size; 3) And then once again at setup_items_for_insert(), where we do exactly the same as what we do at btrfs_insert_empty_items(), to compute the total data size. That is not bad for small arrays, but when the arrays have hundreds of elements, the time spent on looping is not negligible. For example when doing batch inserts of delayed items for dir index items or when logging a directory, it's common to have 200 to 260 dir index items in a single batch when using a leaf size of 16K and using file names between 8 and 12 characters. For a 64K leaf size, multiply that by 4. Taking into account that during directory logging or when flushing delayed dir index items we can have many of those large batches, the time spent on the looping adds up quickly. It's also more important to avoid it at setup_items_for_insert(), since we are holding a write lock on a leaf and, in some cases, on upper nodes of the btree, which causes us to block other tasks that want to access the leaf and nodes for longer than necessary. So change the code so that setup_items_for_insert() and btrfs_insert_empty_items() no longer compute the total data size, and instead rely on the caller to supply it. This makes us loop over the array only once, where we can both populate the data size array and compute the total data size, taking advantage of spatial and temporal locality. To make this more manageable, use a structure to contain all the relevant details for a batch of items (keys array, data sizes array, total data size, number of items), and use it as an argument for btrfs_insert_empty_items() and setup_items_for_insert(). This patch is part of a small patchset that is comprised of the following patches: btrfs: loop only once over data sizes array when inserting an item batch btrfs: unexport setup_items_for_insert() btrfs: use single bulk copy operations when logging directories This is patch 1/3 and performance results, and the specific tests, are included in the changelog of patch 3/3. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26btrfs: rename struct btrfs_io_bio to btrfs_bioQu Wenruo1-2/+4
Previously we had "struct btrfs_bio", which records IO context for mirrored IO and RAID56, and "strcut btrfs_io_bio", which records extra btrfs specific info for logical bytenr bio. With "btrfs_bio" renamed to "btrfs_io_context", we are safe to rename "btrfs_io_bio" to "btrfs_bio" which is a more suitable name now. The struct btrfs_bio changes meaning by this commit. There was a suggested name like btrfs_logical_bio but it's a bit long and we'd prefer to use a shorter name. This could be a concern for backports to older kernels where the different meaning could possibly cause confusion or bugs. Comparing the new and old structures, there's no overlap among the struct members so a build would break in case of incorrect backport. We haven't had many backports to bio code anyway so this is more of a theoretical cause of bugs and a matter of precaution but we'll need to keep the semantic change in mind. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26btrfs: zoned: add a dedicated data relocation block groupJohannes Thumshirn1-0/+7
Relocation in a zoned filesystem can fail with a transaction abort with error -22 (EINVAL). This happens because the relocation code assumes that the extents we relocated the data to have the same size the source extents had and ensures this by preallocating the extents. But in a zoned filesystem we currently can't preallocate the extents as this would break the sequential write required rule. Therefore it can happen that the writeback process kicks in while we're still adding pages to a delalloc range and starts writing out dirty pages. This then creates destination extents that are smaller than the source extents, triggering the following safety check in get_new_location(): 1034 if (num_bytes != btrfs_file_extent_disk_num_bytes(leaf, fi)) { 1035 ret = -EINVAL; 1036 goto out; 1037 } Temporarily create a dedicated block group for the relocation process, so no non-relocation data writes can interfere with the relocation writes. This is needed that we can switch the relocation process on a zoned filesystem from the REQ_OP_ZONE_APPEND writing we use for data to a scheme like in a non-zoned filesystem using REQ_OP_WRITE and preallocation. Fixes: 32430c614844 ("btrfs: zoned: enable relocation on a zoned filesystem") Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26btrfs: introduce btrfs_is_data_reloc_rootJohannes Thumshirn1-0/+5
There are several places in our codebase where we check if a root is the root of the data reloc tree and subsequent patches will introduce more. Factor out the check into a small helper function instead of open coding it multiple times. Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26btrfs: zoned: implement active zone trackingNaohiro Aota1-0/+3
Add zone_is_active flag to btrfs_block_group. This flag indicates the underlying zones are all active. Such zone active block groups are tracked by fs_info->active_bg_list. btrfs_dev_{set,clear}_active_zone() take responsibility for the underlying device part. They set/clear the bitmap to indicate zone activeness and count the number of zones we can activate left. btrfs_zone_{activate,finish}() take responsibility for the logical part and the list management. In addition, btrfs_zone_finish() wait for any writes on it and send REQ_OP_ZONE_FINISH to the zone. Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26btrfs: defrag: pass file_ra_state instead of file to btrfs_defrag_file()Qu Wenruo1-2/+2
Currently btrfs_defrag_file() accepts both "struct inode" and "struct file" as parameter. We can easily grab "struct inode" from "struct file" using file_inode() helper. The reason why we need "struct file" is just to re-use its f_ra. Change this to pass "struct file_ra_state" parameter, so that it's more clear what we really want. Since we're here, also add some comments on the function btrfs_defrag_file(). Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-25btrfs: subpage: introduce btrfs_subpage_bitmap_infoQu Wenruo1-0/+1
Currently we use fixed size u16 bitmap for subpage bitmap. This is fine for 4K sectorsize with 64K page size. But for 4K sectorsize and larger page size, the bitmap is too small, while for smaller page size like 16K, u16 bitmaps waste too much space. Here we introduce a new helper structure, btrfs_subpage_bitmap_info, to record the proper bitmap size, and where each bitmap should start at. By this, we can later compact all subpage bitmaps into one u32 bitmap. This patch is the first step. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-12Merge tag 'for-5.15-rc5-tag' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: "A few more error handling fixes, stemming from code inspection, error injection or fuzzing" * tag 'for-5.15-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: fix abort logic in btrfs_replace_file_extents btrfs: check for error when looking up inode during dir entry replay btrfs: unify lookup return value when dir entry is missing btrfs: deal with errors when adding inode reference during log replay btrfs: deal with errors when replaying dir entry during log replay btrfs: deal with errors when checking if a dir entry exists during log replay btrfs: update refs for any root except tree log roots btrfs: unlock newly allocated extent buffer after error
2021-10-07btrfs: unify lookup return value when dir entry is missingFilipe Manana1-1/+1
btrfs_lookup_dir_index_item() and btrfs_lookup_dir_item() lookup for dir entries and both are used during log replay or when updating a log tree during an unlink. However when the dir item does not exists, btrfs_lookup_dir_item() returns NULL while btrfs_lookup_dir_index_item() returns PTR_ERR(-ENOENT), and if the dir item exists but there is no matching entry for a given name or index, both return NULL. This makes the call sites during log replay to be more verbose than necessary and it makes it easy to miss this slight difference. Since we don't need to distinguish between those two cases, make btrfs_lookup_dir_index_item() always return NULL when there is no matching directory entry - either because there isn't any dir entry or because there is one but it does not match the given name and index. Also rename the argument 'objectid' of btrfs_lookup_dir_index_item() to 'index' since it is supposed to match an index number, and the name 'objectid' is not very good because it can easily be confused with an inode number (like the inode number a dir entry points to). CC: stable@vger.kernel.org # 4.14+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-09-02Merge tag 'ovl-update-5.15' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs Pull overlayfs update from Miklos Szeredi: - Copy up immutable/append/sync/noatime attributes (Amir Goldstein) - Improve performance by enabling RCU lookup. - Misc fixes and improvements The reason this touches so many files is that the ->get_acl() method now gets a "bool rcu" argument. The ->get_acl() API was updated based on comments from Al and Linus: Link: https://lore.kernel.org/linux-fsdevel/CAJfpeguQxpd6Wgc0Jd3ks77zcsAv_bn0q17L3VNnnmPKu11t8A@mail.gmail.com/ * tag 'ovl-update-5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs: ovl: enable RCU'd ->get_acl() vfs: add rcu argument to ->get_acl() callback ovl: fix BUG_ON() in may_delete() when called from ovl_cleanup() ovl: use kvalloc in xattr copy-up ovl: update ctime when changing fileattr ovl: skip checking lower file's i_writecount on truncate ovl: relax lookup error on mismatch origin ftype ovl: do not set overlay.opaque for new directories ovl: add ovl_allow_offline_changes() helper ovl: disable decoding null uuid with redirect_dir ovl: consistent behavior for immutable/append-only inodes ovl: copy up sync/noatime fileattr flags ovl: pass ovl_fs to ovl_check_setxattr() fs: add generic helper for filling statx attribute flags
2021-08-23btrfs: allow idmapped SNAP_CREATE/SUBVOL_CREATE ioctlsChristian Brauner1-1/+2
Creating subvolumes and snapshots is one of the core features of btrfs and is even available to unprivileged users. Make it possible to use subvolume and snapshot creation on idmapped mounts. This is a fairly straightforward operation since all the permission checking helpers are already capable of handling idmapped mounts. So we just need to pass down the mount's userns. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23btrfs: introduce btrfs_search_backwards functionMarcos Paulo de Souza1-0/+4
It's a common practice to start a search using offset (u64)-1, which is the u64 maximum value, meaning that we want the search_slot function to be set in the last item with the same objectid and type. Once we are in this position, it's a matter to start a search backwards by calling btrfs_previous_item, which will check if we'll need to go to a previous leaf and other necessary checks, only to be sure that we are in last offset of the same object and type. The new btrfs_search_backwards function does the all these steps when necessary, and can be used to avoid code duplication. Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23btrfs: initial fsverity supportBoris Burkov1-2/+29
Add support for fsverity in btrfs. To support the generic interface in fs/verity, we add two new item types in the fs tree for inodes with verity enabled. One stores the per-file verity descriptor and btrfs verity item and the other stores the Merkle tree data itself. Verity checking is done in end_page_read just before a page is marked uptodate. This naturally handles a variety of edge cases like holes, preallocated extents, and inline extents. Some care needs to be taken to not try to verity pages past the end of the file, which are accessed by the generic buffered file reading code under some circumstances like reading to the end of the last page and trying to read again. Direct IO on a verity file falls back to buffered reads. Verity relies on PageChecked for the Merkle tree data itself to avoid re-walking up shared paths in the tree. For this reason, we need to cache the Merkle tree data. Since the file is immutable after verity is turned on, we can cache it at an index past EOF. Use the new inode ro_flags to store verity on the inode item, so that we can enable verity on a file, then rollback to an older kernel and still mount the file system and read the file. Since we can't safely write the file anymore without ruining the invariants of the Merkle tree, we mark a ro_compat flag on the file system when a file has verity enabled. Acked-by: Eric Biggers <ebiggers@google.com> Co-developed-by: Chris Mason <clm@fb.com> Signed-off-by: Chris Mason <clm@fb.com> Signed-off-by: Boris Burkov <boris@bur.io> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23btrfs: add ro compat flags to inodesBoris Burkov1-14/+16
Currently, inode flags are fully backwards incompatible in btrfs. If we introduce a new inode flag, then tree-checker will detect it and fail. This can even cause us to fail to mount entirely. To make it possible to introduce new flags which can be read-only compatible, like VERITY, we add new ro flags to btrfs without treating them quite so harshly in tree-checker. A read-only file system can survive an unexpected flag, and can be mounted. As for the implementation, it unfortunately gets a little complicated. The on-disk representation of the inode, btrfs_inode_item, has an __le64 for flags but the in-memory representation, btrfs_inode, uses a u32. David Sterba had the nice idea that we could reclaim those wasted 32 bits on disk and use them for the new ro_compat flags. It turns out that the tree-checker code which checks for unknown flags is broken, and ignores the upper 32 bits we are hoping to use. The issue is that the flags use the literal 1 rather than 1ULL, so the flags are signed ints, and one of them is specifically (1 << 31). As a result, the mask which ORs the flags is a negative integer on machines where int is 32 bit twos complement. When tree-checker evaluates the expression: btrfs_inode_flags(leaf, iitem) & ~BTRFS_INODE_FLAG_MASK) The mask is something like 0x80000abc, which gets promoted to u64 with sign extension to 0xffffffff80000abc. Negating that 64 bit mask leaves all the upper bits zeroed, and we can't detect unexpected flags. This suggests that we can't use those bits after all. Luckily, we have good reason to believe that they are zero anyway. Inode flags are metadata, which is always checksummed, so any bit flips that would introduce 1s would cause a checksum failure anyway (excluding the improbable case of the checksum getting corrupted exactly badly). Further, unless the 1 << 31 flag is used, the cast to u64 of the 32 bit inode flag should preserve its value and not add leading zeroes (at least for twos complement). The only place that flag (BTRFS_INODE_ROOT_ITEM_INIT) is used is in a special inode embedded in the root item, and indeed for that inode we see 0xffffffff80000000 as the flags on disk. However, that inode is never seen by tree checker, nor is it used in a context where verity might be meaningful. Theoretically, a future ro flag might cause trouble on that inode, so we should proactively clean up that mess before it does. With the introduction of the new ro flags, keep two separate unsigned masks and check them against the appropriate u32. Since we no longer run afoul of sign extension, this also stops writing out 0xffffffff80000000 in root_item inodes going forward. Signed-off-by: Boris Burkov <boris@bur.io> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23btrfs: use delalloc_bytes to determine flush amount for shrink_delallocJosef Bacik1-4/+5
We have been hitting some early ENOSPC issues in production with more recent kernels, and I tracked it down to us simply not flushing delalloc as aggressively as we should be. With tracing I was seeing us failing all tickets with all of the block rsvs at or around 0, with very little pinned space, but still around 120MiB of outstanding bytes_may_used. Upon further investigation I saw that we were flushing around 14 pages per shrink call for delalloc, despite having around 2GiB of delalloc outstanding. Consider the example of a 8 way machine, all CPUs trying to create a file in parallel, which at the time of this commit requires 5 items to do. Assuming a 16k leaf size, we have 10MiB of total metadata reclaim size waiting on reservations. Now assume we have 128MiB of delalloc outstanding. With our current math we would set items to 20, and then set to_reclaim to 20 * 256k, or 5MiB. Assuming that we went through this loop all 3 times, for both FLUSH_DELALLOC and FLUSH_DELALLOC_WAIT, and then did the full loop twice, we'd only flush 60MiB of the 128MiB delalloc space. This could leave a fair bit of delalloc reservations still hanging around by the time we go to ENOSPC out all the remaining tickets. Fix this two ways. First, change the calculations to be a fraction of the total delalloc bytes on the system. Prior to this change we were calculating based on dirty inodes so our math made more sense, now it's just completely unrelated to what we're actually doing. Second add a FLUSH_DELALLOC_FULL state, that we hold off until we've gone through the flush states at least once. This will empty the system of all delalloc so we're sure to be truly out of space when we start failing tickets. I'm tagging stable 5.10 and forward, because this is where we started using the page stuff heavily again. This affects earlier kernel versions as well, but would be a pain to backport to them as the flushing mechanisms aren't the same. CC: stable@vger.kernel.org # 5.10+ Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23btrfs: make btrfs_next_leaf static inlineDavid Sterba1-1/+12
btrfs_next_leaf is a simple wrapper for btrfs_next_old_leaf so move it to header to avoid the function call overhead. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23btrfs: switch uptodate to bool in btrfs_writepage_endio_finish_orderedDavid Sterba1-1/+1
The uptodate parameter should be bool, change the type. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23btrfs: remove unused start and end parameters from btrfs_run_delalloc_range()Qu Wenruo1-1/+1
Since commit d75855b4518b ("btrfs: Remove extent_io_ops::writepage_start_hook") removes the writepage_start_hook() and adds btrfs_writepage_cow_fixup() function, there is no need to follow the old hook parameters. Remove the @start and @end hook, since currently the fixup check is full page check, it doesn't need @start and @end hook. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23btrfs: zoned: remove max_zone_append_size logicJohannes Thumshirn1-2/+0
There used to be a patch in the original series for zoned support which limited the extent size to max_zone_append_size, but this patch has been dropped somewhere around v9. We've decided to go the opposite direction, instead of limiting extents in the first place we split them before submission to comply with the device's limits. Remove the related code, btrfs_fs_info::max_zone_append_size and btrfs_zoned_device_info::max_zone_append_size. This also removes the workaround for dm-crypt introduced in 1d68128c107a ("btrfs: zoned: fail mount if the device does not support zone append") because the fix has been merged as f34ee1dce642 ("dm crypt: Fix zoned block device support"). Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-18vfs: add rcu argument to ->get_acl() callbackMiklos Szeredi1-1/+1
Add a rcu argument to the ->get_acl() callback to allow get_cached_acl_rcu() to call the ->get_acl() method in the next patch. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-06-22btrfs: remove unused btrfs_fs_info::total_pinnedNikolay Borisov1-2/+0
This got added 14 years ago in 324ae4df00fd ("Btrfs: Add block group pinned accounting back") but it was not ever used. Subsequently its usage got gradually removed in 8790d502e440 ("Btrfs: Add support for mirroring across drives") and 11833d66be94 ("Btrfs: improve async block group caching"). Let's remove it for good! Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-22btrfs: rip out may_commit_transactionJosef Bacik1-1/+0
may_commit_transaction was introduced before the ticketing infrastructure existed. There was a problem where we'd legitimately be out of space, but every reservation would trigger a transaction commit and then fail. Thus if you had 1000 things trying to make a reservation, they'd all do the flushing loop and thus commit the transaction 1000 times before they'd get their ENOSPC. This helper was introduced to short circuit this, if there wasn't space that could be reclaimed by committing the transaction then simply ENOSPC out. This made true ENOSPC tests much faster as we didn't waste a bunch of time. However many of our bugs over the years have been from cases where we didn't account for some space that would be reclaimed by committing a transaction. The delayed refs rsv space, delayed rsv, many pinned bytes miscalculations, etc. And in the meantime the original problem has been solved with ticketing. We no longer will commit the transaction 1000 times. Instead we'll get 1000 waiters, we will go through the flushing mechanisms, and if there's no progress after 2 loops we ENOSPC everybody out. The ticketing infrastructure gives us a deterministic way to see if we're making progress or not, thus we avoid a lot of extra work. So simplify this step by simply unconditionally committing the transaction. This removes what is arguably our most common source of early ENOSPC bugs and will allow us to drastically simplify many of the things we track because we simply won't need them with this stuff gone. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-22btrfs: ensure relocation never runs while we have send operations runningFilipe Manana1-2/+3
Relocation and send do not play well together because while send is running a block group can be relocated, a transaction committed and the respective disk extents get re-allocated and written to or discarded while send is about to do something with the extents. This was explained in commit 9e967495e0e0ae ("Btrfs: prevent send failures and crashes due to concurrent relocation"), which prevented balance and send from running in parallel but it did not address one remaining case where chunk relocation can happen: shrinking a device (and device deletion which shrinks a device's size to 0 before deleting the device). We also have now one more case where relocation is triggered: on zoned filesystems partially used block groups get relocated by a background thread, introduced in commit 18bb8bbf13c183 ("btrfs: zoned: automatically reclaim zones"). So make sure that instead of preventing balance from running when there are ongoing send operations, we prevent relocation from happening. This uses the infrastructure recently added by a patch that has the subject: "btrfs: add cancellable chunk relocation support". Also it adds a spinlock used exclusively for the exclusivity between send and relocation, as before fs_info->balance_mutex was used, which would make an attempt to run send to block waiting for balance to finish, which can take a lot of time on large filesystems. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-22btrfs: shorten integrity checker extent data mount optionDavid Sterba1-1/+1
Subjectively, CHECK_INTEGRITY_INCLUDING_EXTENT_DATA is quite long and calling it CHECK_INTEGRITY_DATA still keeps the meaning and matches the mount option name. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-22btrfs: switch mount option bits to enums and use wider typeDavid Sterba1-32/+33
Switch defines of BTRFS_MOUNT_* to an enum (the symbolic names are recorded in the debugging information for convenience). There are two more things done but separating them would not make much sense as it's touching the same lines: - Renumber shifts 18..31 to 17..30 to get rid of the hole in the sequence. - Use 1UL as the value that gets shifted because we're approaching the 32bit limit and due to integer promotions the value of (1 << 31) becomes 0xffffffff80000000 when cast to unsigned long (eg. the option manipulating helpers). This is not causing any problems yet as the operations are in-memory and masking the 31st bit works, we don't have more than 31 bits so the ill effects of not masking higher bits don't happen. But once we have more, the problems will emerge. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-22btrfs: fix typos in commentsDavid Sterba1-3/+3
Fix typos that have snuck in since the last round. Found by codespell. Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-21btrfs: make btrfs_set_range_writeback() subpage compatibleQu Wenruo1-1/+1
Function btrfs_set_range_writeback() currently just sets the page writeback unconditionally. Change it to call the subpage helper so that we can handle both cases well. Since the subpage helpers needs btrfs_fs_info, also change the parameter to accept btrfs_inode. Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> # [ppc64] Tested-by: Anand Jain <anand.jain@oracle.com> # [aarch64] Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-21btrfs: rename PagePrivate2 to PageOrdered inside btrfsQu Wenruo1-0/+10
Inside btrfs we use Private2 page status to indicate we have an ordered extent with pending IO for the sector. But the page status name, Private2, tells us nothing about the bit itself, so this patch will rename it to Ordered. And with extra comment about the bit added, so reader who is still uncertain about the page Ordered status, will find the comment pretty easily. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-21btrfs: pass btrfs_inode to btrfs_writepage_endio_finish_ordered()Qu Wenruo1-1/+2
There is a pretty bad abuse of btrfs_writepage_endio_finish_ordered() in end_compressed_bio_write(). It passes compressed pages to btrfs_writepage_endio_finish_ordered(), which is only supposed to accept inode pages. Thankfully the important info here is the inode, so let's pass btrfs_inode directly into btrfs_writepage_endio_finish_ordered(), and make @page parameter optional. By this, end_compressed_bio_write() can happily pass page=NULL while still getting everything done properly. Also, to cooperate with such modification, replace @page parameter for trace_btrfs_writepage_end_io_hook() with btrfs_inode. Although this removes page_index info, the existing start/len should be enough for most usage. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-21btrfs: refactor submit_extent_page() to make bio and its flag tracing easierQu Wenruo1-2/+0
There is a lot of code inside extent_io.c needs both "struct bio **bio_ret" and "unsigned long prev_bio_flags", along with some parameters like "unsigned long bio_flags". Such strange parameters are here for bio assembly. For example, we have such inode page layout: 0 4K 8K 12K |<-- Extent A-->|<- EB->| Then what we do is: - Page [0, 4K) *bio_ret = NULL So we allocate a new bio to bio_ret, Add page [0, 4K) to *bio_ret. - Page [4K, 8K) *bio_ret != NULL We found this page is continuous to *bio_ret, and if we're not at stripe boundary, we add page [4K, 8K) to *bio_ret. - Page [8K, 12K) *bio_ret != NULL But we found this page is not continuous, so we submit *bio_ret, then allocate a new bio, and add page [8K, 12K) to the new bio. This means we need to record both the bio and its bio_flag, but we record them manually using those strange parameter list, other than encapsulating them into their own structure. So this patch will introduce a new structure, btrfs_bio_ctrl, to record both the bio, and its bio_flags. Also, in above case, for all pages added to the bio, we need to check if the new page crosses stripe boundary. This check itself can be time consuming, and we don't really need to do that for each page. This patch also integrates the stripe boundary check into btrfs_bio_ctrl. When a new bio is allocated, the stripe and ordered extent boundary is also calculated, so no matter how large the bio will be, we only calculate the boundaries once, to save some CPU time. The following functions/structures are affected: - struct extent_page_data Replace its bio pointer with structure btrfs_bio_ctrl (embedded structure, not pointer) - end_write_bio() - flush_write_bio() Just change how bio is fetched - btrfs_bio_add_page() Use pre-calculated boundaries instead of re-calculating them. And use @bio_ctrl to replace @bio and @prev_bio_flags. - calc_bio_boundaries() New function - submit_extent_page() callers - btrfs_do_readpage() callers - contiguous_readpages() callers To Use @bio_ctrl to replace @bio and @prev_bio_flags, and how to grab bio. - btrfs_bio_fits_in_ordered_extent() Removed, as now the ordered extent size limit is done at bio allocation time, no need to check for each page range. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-21btrfs: introduce try-lock semantics for exclusive op startDavid Sterba1-0/+3
Add try-lock for exclusive operation start to allow callers to do more checks. The same operation must already be running. The try-lock and unlock must pair and are a substitute for btrfs_exclop_start, thus it must also pair with btrfs_exclop_finish to release the exclop context. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-21btrfs: add cancellable chunk relocation supportDavid Sterba1-0/+9
Add support code that will allow canceling relocation on the chunk granularity. This is different and independent of balance, that also uses relocation but is a higher level operation and manages it's own state and pause/cancellation requests. Relocation is used for resize (shrink) and device deletion so this will be a common point to implement cancellation for both. The context is entirely in btrfs_relocate_block_group and btrfs_recover_relocation, enclosing one chunk relocation. The status bit is set and unset between the chunks. As relocation can take long, the effects may not be immediate and the request and actual action can slightly race. The fs_info::reloc_cancel_req is only supposed to be increased and does not pair with decrement like fs_info::balance_cancel_req. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-21btrfs: protect exclusive_operation by super_lockDavid Sterba1-2/+2
The exclusive operation is now atomically checked and set using bit operations. Switch it to protection by spinlock. The super block lock is not frequently used and adding a new lock seems like an overkill so it should be safe to reuse it. The reason to use spinlock is to enhance the locking context so more checks can be done, eg. allowing the same exclusive operation enter the exclop section and cancel the running one. This will be used for resize and device delete. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-21btrfs: document byte swap optimization of root_item::flags accessorsDavid Sterba1-0/+2
Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-21btrfs: don't set the full sync flag when truncation does not touch extentsFilipe Manana1-1/+1
At btrfs_truncate() where we truncate the inode either to the same size or to a smaller size, we always set the full sync flag on the inode. This is needed in case the truncation drops or trims any file extent items that start beyond or cross the new inode size, so that the next fsync drops all inode items from the log and scans again the fs/subvolume tree to find all items that must be logged. However if the truncation does not drop or trims any file extent items, we do not need to set the full sync flag and force the next fsync to use the slow code path. So do not set the full sync flag in such cases. One use case where it is frequent to do truncations that do not change the inode size and do not drop any extents (no prealloc extents beyond i_size) is when running Microsoft's SQL Server inside a Docker container. One example workload is the one Philipp Fent reported recently, in the thread with a link below. In this workload a large number of fsyncs are preceded by such truncate operations. After this change I constantly get the runtime for that workload from Philipp to be reduced by about -12%, for example from 184 seconds down to 162 seconds. Link: https://lore.kernel.org/linux-btrfs/93c4600e-5263-5cba-adf0-6f47526e7561@in.tum.de/ Tested-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-21btrfs: make btrfs_verify_data_csum() to return a bitmapQu Wenruo1-2/+2
This will provide the basis for later per-sector repair for subpage, while still keeping the existing code happy. As if all csums match, the return value will be 0, same as now. Only when csum mismatches, the return value is different. The new return value will be a bitmap, for 4K sectorsize and 4K page size, it will be either 1, instead of the -EIO (which is not used directly by the callers, no effective change). But for 4K sectorsize and 64K page size, aka subpage case, since the bvec can contain multiple sectors, knowing which sectors are corrupted will allow us to submit repair only for corrupted sectors. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-05-11Merge tag 'for-5.13-rc1-tag' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: "First batch of various fixes, here's a list of notable ones: - fix unmountable seed device after fstrim - fix silent data loss in zoned mode due to ordered extent splitting - fix race leading to unpersisted data and metadata on fsync - fix deadlock when cloning inline extents and using qgroups" * tag 'for-5.13-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: initialize return variable in cleanup_free_space_cache_v1 btrfs: zoned: sanity check zone type btrfs: fix unmountable seed device after fstrim btrfs: fix deadlock when cloning inline extents and using qgroups btrfs: fix race leading to unpersisted data and metadata on fsync btrfs: do not consider send context as valid when trying to flush qgroups btrfs: zoned: fix silent data loss after failure splitting ordered extent
2021-04-28btrfs: fix deadlock when cloning inline extents and using qgroupsFilipe Manana1-1/+1
There are a few exceptional cases where cloning an inline extent needs to copy the inline extent data into a page of the destination inode. When this happens, we end up starting a transaction while having a dirty page for the destination inode and while having the range locked in the destination's inode iotree too. Because when reserving metadata space for a transaction we may need to flush existing delalloc in case there is not enough free space, we have a mechanism in place to prevent a deadlock, which was introduced in commit 3d45f221ce627d ("btrfs: fix deadlock when cloning inline extent and low on free metadata space"). However when using qgroups, a transaction also reserves metadata qgroup space, which can also result in flushing delalloc in case there is not enough available space at the moment. When this happens we deadlock, since flushing delalloc requires locking the file range in the inode's iotree and the range was already locked at the very beginning of the clone operation, before attempting to start the transaction. When this issue happens, stack traces like the following are reported: [72747.556262] task:kworker/u81:9 state:D stack: 0 pid: 225 ppid: 2 flags:0x00004000 [72747.556268] Workqueue: writeback wb_workfn (flush-btrfs-1142) [72747.556271] Call Trace: [72747.556273] __schedule+0x296/0x760 [72747.556277] schedule+0x3c/0xa0 [72747.556279] io_schedule+0x12/0x40 [72747.556284] __lock_page+0x13c/0x280 [72747.556287] ? generic_file_readonly_mmap+0x70/0x70 [72747.556325] extent_write_cache_pages+0x22a/0x440 [btrfs] [72747.556331] ? __set_page_dirty_nobuffers+0xe7/0x160 [72747.556358] ? set_extent_buffer_dirty+0x5e/0x80 [btrfs] [72747.556362] ? update_group_capacity+0x25/0x210 [72747.556366] ? cpumask_next_and+0x1a/0x20 [72747.556391] extent_writepages+0x44/0xa0 [btrfs] [72747.556394] do_writepages+0x41/0xd0 [72747.556398] __writeback_single_inode+0x39/0x2a0 [72747.556403] writeback_sb_inodes+0x1ea/0x440 [72747.556407] __writeback_inodes_wb+0x5f/0xc0 [72747.556410] wb_writeback+0x235/0x2b0 [72747.556414] ? get_nr_inodes+0x35/0x50 [72747.556417] wb_workfn+0x354/0x490 [72747.556420] ? newidle_balance+0x2c5/0x3e0 [72747.556424] process_one_work+0x1aa/0x340 [72747.556426] worker_thread+0x30/0x390 [72747.556429] ? create_worker+0x1a0/0x1a0 [72747.556432] kthread+0x116/0x130 [72747.556435] ? kthread_park+0x80/0x80 [72747.556438] ret_from_fork+0x1f/0x30 [72747.566958] Workqueue: btrfs-flush_delalloc btrfs_work_helper [btrfs] [72747.566961] Call Trace: [72747.566964] __schedule+0x296/0x760 [72747.566968] ? finish_wait+0x80/0x80 [72747.566970] schedule+0x3c/0xa0 [72747.566995] wait_extent_bit.constprop.68+0x13b/0x1c0 [btrfs] [72747.566999] ? finish_wait+0x80/0x80 [72747.567024] lock_extent_bits+0x37/0x90 [btrfs] [72747.567047] btrfs_invalidatepage+0x299/0x2c0 [btrfs] [72747.567051] ? find_get_pages_range_tag+0x2cd/0x380 [72747.567076] __extent_writepage+0x203/0x320 [btrfs] [72747.567102] extent_write_cache_pages+0x2bb/0x440 [btrfs] [72747.567106] ? update_load_avg+0x7e/0x5f0 [72747.567109] ? enqueue_entity+0xf4/0x6f0 [72747.567134] extent_writepages+0x44/0xa0 [btrfs] [72747.567137] ? enqueue_task_fair+0x93/0x6f0 [72747.567140] do_writepages+0x41/0xd0 [72747.567144] __filemap_fdatawrite_range+0xc7/0x100 [72747.567167] btrfs_run_delalloc_work+0x17/0x40 [btrfs] [72747.567195] btrfs_work_helper+0xc2/0x300 [btrfs] [72747.567200] process_one_work+0x1aa/0x340 [72747.567202] worker_thread+0x30/0x390 [72747.567205] ? create_worker+0x1a0/0x1a0 [72747.567208] kthread+0x116/0x130 [72747.567211] ? kthread_park+0x80/0x80 [72747.567214] ret_from_fork+0x1f/0x30 [72747.569686] task:fsstress state:D stack: 0 pid:841421 ppid:841417 flags:0x00000000 [72747.569689] Call Trace: [72747.569691] __schedule+0x296/0x760 [72747.569694] schedule+0x3c/0xa0 [72747.569721] try_flush_qgroup+0x95/0x140 [btrfs] [72747.569725] ? finish_wait+0x80/0x80 [72747.569753] btrfs_qgroup_reserve_data+0x34/0x50 [btrfs] [72747.569781] btrfs_check_data_free_space+0x5f/0xa0 [btrfs] [72747.569804] btrfs_buffered_write+0x1f7/0x7f0 [btrfs] [72747.569810] ? path_lookupat.isra.48+0x97/0x140 [72747.569833] btrfs_file_write_iter+0x81/0x410 [btrfs] [72747.569836] ? __kmalloc+0x16a/0x2c0 [72747.569839] do_iter_readv_writev+0x160/0x1c0 [72747.569843] do_iter_write+0x80/0x1b0 [72747.569847] vfs_writev+0x84/0x140 [72747.569869] ? btrfs_file_llseek+0x38/0x270 [btrfs] [72747.569873] do_writev+0x65/0x100 [72747.569876] do_syscall_64+0x33/0x40 [72747.569879] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [72747.569899] task:fsstress state:D stack: 0 pid:841424 ppid:841417 flags:0x00004000 [72747.569903] Call Trace: [72747.569906] __schedule+0x296/0x760 [72747.569909] schedule+0x3c/0xa0 [72747.569936] try_flush_qgroup+0x95/0x140 [btrfs] [72747.569940] ? finish_wait+0x80/0x80 [72747.569967] __btrfs_qgroup_reserve_meta+0x36/0x50 [btrfs] [72747.569989] start_transaction+0x279/0x580 [btrfs] [72747.570014] clone_copy_inline_extent+0x332/0x490 [btrfs] [72747.570041] btrfs_clone+0x5b7/0x7a0 [btrfs] [72747.570068] ? lock_extent_bits+0x64/0x90 [btrfs] [72747.570095] btrfs_clone_files+0xfc/0x150 [btrfs] [72747.570122] btrfs_remap_file_range+0x3d8/0x4a0 [btrfs] [72747.570126] do_clone_file_range+0xed/0x200 [72747.570131] vfs_clone_file_range+0x37/0x110 [72747.570134] ioctl_file_clone+0x7d/0xb0 [72747.570137] do_vfs_ioctl+0x138/0x630 [72747.570140] __x64_sys_ioctl+0x62/0xc0 [72747.570143] do_syscall_64+0x33/0x40 [72747.570146] entry_SYSCALL_64_after_hwframe+0x44/0xa9 So fix this by skipping the flush of delalloc for an inode that is flagged with BTRFS_INODE_NO_DELALLOC_FLUSH, meaning it is currently under such a special case of cloning an inline extent, when flushing delalloc during qgroup metadata reservation. The special cases for cloning inline extents were added in kernel 5.7 by by commit 05a5a7621ce66c ("Btrfs: implement full reflink support for inline extents"), while having qgroup metadata space reservation flushing delalloc when low on space was added in kernel 5.9 by commit c53e9653605dbf ("btrfs: qgroup: try to flush qgroup space when we get -EDQUOT"). So use a "Fixes:" tag for the later commit to ease stable kernel backports. Reported-by: Wang Yugui <wangyugui@e16-tech.com> Link: https://lore.kernel.org/linux-btrfs/20210421083137.31E3.409509F4@e16-tech.com/ Fixes: c53e9653605dbf ("btrfs: qgroup: try to flush qgroup space when we get -EDQUOT") CC: stable@vger.kernel.org # 5.9+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-04-27Merge branch 'miklos.fileattr' of ↵Linus Torvalds1-0/+3
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull fileattr conversion updates from Miklos Szeredi via Al Viro: "This splits the handling of FS_IOC_[GS]ETFLAGS from ->ioctl() into a separate method. The interface is reasonably uniform across the filesystems that support it and gives nice boilerplate removal" * 'miklos.fileattr' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (23 commits) ovl: remove unneeded ioctls fuse: convert to fileattr fuse: add internal open/release helpers fuse: unsigned open flags fuse: move ioctl to separate source file vfs: remove unused ioctl helpers ubifs: convert to fileattr reiserfs: convert to fileattr ocfs2: convert to fileattr nilfs2: convert to fileattr jfs: convert to fileattr hfsplus: convert to fileattr efivars: convert to fileattr xfs: convert to fileattr orangefs: convert to fileattr gfs2: convert to fileattr f2fs: convert to fileattr ext4: convert to fileattr ext2: convert to fileattr btrfs: convert to fileattr ...
2021-04-20btrfs: zoned: automatically reclaim zonesJohannes Thumshirn1-0/+5
When a file gets deleted on a zoned file system, the space freed is not returned back into the block group's free space, but is migrated to zone_unusable. As this zone_unusable space is behind the current write pointer it is not possible to use it for new allocations. In the current implementation a zone is reset once all of the block group's space is accounted as zone unusable. This behaviour can lead to premature ENOSPC errors on a busy file system. Instead of only reclaiming the zone once it is completely unusable, kick off a reclaim job once the amount of unusable bytes exceeds a user configurable threshold between 51% and 100%. It can be set per mounted filesystem via the sysfs tunable bg_reclaim_threshold which is set to 75% by default. Similar to reclaiming unused block groups, these dirty block groups are added to a to_reclaim list and then on a transaction commit, the reclaim process is triggered but after we deleted unused block groups, which will free space for the relocation process. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-04-20btrfs: rename delete_unused_bgs_mutex to reclaim_bgs_lockJohannes Thumshirn1-1/+2
As a preparation for extending the block group deletion use case, rename the unused_bgs_mutex to reclaim_bgs_lock. Reviewed-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-04-20btrfs: more graceful errors/warnings on 32bit systems when reaching limitsQu Wenruo1-0/+19
Btrfs uses internally mapped u64 address space for all its metadata. Due to the page cache limit on 32bit systems, btrfs can't access metadata at or beyond (ULONG_MAX + 1) << PAGE_SHIFT. See how MAX_LFS_FILESIZE and page::index are defined. This is 16T for 4K page size while 256T for 64K page size. Users can have a filesystem which doesn't have metadata beyond the boundary at mount time, but later balance can cause it to create metadata beyond the boundary. And modification to MM layer is unrealistic just for such minor use case. We can't do more than to prevent mounting such filesystem or warn early when the numbers are still within the limits. To address such problem, this patch will introduce the following checks: - Mount time rejection This will reject any fs which has metadata chunk at or beyond the boundary. - Mount time early warning If there is any metadata chunk beyond 5/8th of the boundary, we do an early warning and hope the end user will see it. - Runtime extent buffer rejection If we're going to allocate an extent buffer at or beyond the boundary, reject such request with EOVERFLOW. This is definitely going to cause problems like transaction abort, but we have no better ways. - Runtime extent buffer early warning If an extent buffer beyond 5/8th of the max file size is allocated, do an early warning. Above error/warning message will only be printed once for each fs to reduce dmesg flood. If the mount is rejected, the filesystem will be mountable only on a 64bit host. Link: https://lore.kernel.org/linux-btrfs/1783f16d-7a28-80e6-4c32-fdf19b705ed0@gmx.com/ Reported-by: Erik Jensen <erikjensen@rkjnsn.net> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-04-19btrfs: improve btree readahead for full send operationsFilipe Manana1-1/+21
Currently a full send operation uses the standard btree readahead when iterating over the subvolume/snapshot btree, which despite bringing good performance benefits, it could be improved in a few aspects for use cases such as full send operations, which are guaranteed to visit every node and leaf of a btree, in ascending and sequential order. The limitations of that standard btree readahead implementation are the following: 1) It only triggers readahead for leaves that are physically close to the leaf being read, within a 64K range; 2) It only triggers readahead for the next or previous leaves if the leaf being read is not currently in memory; 3) It never triggers readahead for nodes. So add a new readahead mode that addresses all these points and use it for full send operations. The following test script was used to measure the improvement on a box using an average, consumer grade, spinning disk and with 16GiB of RAM: $ cat test.sh #!/bin/bash DEV=/dev/sdj MNT=/mnt/sdj MKFS_OPTIONS="--nodesize 16384" # default, just to be explicit MOUNT_OPTIONS="-o max_inline=2048" # default, just to be explicit mkfs.btrfs -f $MKFS_OPTIONS $DEV > /dev/null mount $MOUNT_OPTIONS $DEV $MNT # Create files with inline data to make it easier and faster to create # large btrees. add_files() { local total=$1 local start_offset=$2 local number_jobs=$3 local total_per_job=$(($total / $number_jobs)) echo "Creating $total new files using $number_jobs jobs" for ((n = 0; n < $number_jobs; n++)); do ( local start_num=$(($start_offset + $n * $total_per_job)) for ((i = 1; i <= $total_per_job; i++)); do local file_num=$((start_num + $i)) local file_path="$MNT/file_${file_num}" xfs_io -f -c "pwrite -S 0xab 0 2000" $file_path > /dev/null if [ $? -ne 0 ]; then echo "Failed creating file $file_path" break fi done ) & worker_pids[$n]=$! done wait ${worker_pids[@]} sync echo echo "btree node/leaf count: $(btrfs inspect-internal dump-tree -t 5 $DEV | egrep '^(node|leaf) ' | wc -l)" } initial_file_count=500000 add_files $initial_file_count 0 4 echo echo "Creating first snapshot..." btrfs subvolume snapshot -r $MNT $MNT/snap1 echo echo "Adding more files..." add_files $((initial_file_count / 4)) $initial_file_count 4 echo echo "Updating 1/50th of the initial files..." for ((i = 1; i < $initial_file_count; i += 50)); do xfs_io -c "pwrite -S 0xcd 0 20" $MNT/file_$i > /dev/null done echo echo "Creating second snapshot..." btrfs subvolume snapshot -r $MNT $MNT/snap2 umount $MNT echo 3 > /proc/sys/vm/drop_caches blockdev --flushbufs $DEV &> /dev/null hdparm -F $DEV &> /dev/null mount $MOUNT_OPTIONS $DEV $MNT echo echo "Testing full send..." start=$(date +%s) btrfs send $MNT/snap1 > /dev/null end=$(date +%s) echo echo "Full send took $((end - start)) seconds" umount $MNT The durations of the full send operation in seconds were the following: Before this change: 217 seconds After this change: 205 seconds (-5.7%) Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-04-19btrfs: use a bit to track the existence of tree mod log usersFilipe Manana1-0/+3
The tree modification log functions are called very frequently, basically they are called every time a btree is modified (a pointer added or removed to a node, a new root for a btree is set, etc). Because of that, to avoid heavy lock contention on the lock that protects the list of tree mod log users, we have checks that test the emptiness of the list with a full memory barrier before the checks, so that when there are no tree mod log users we avoid taking the lock. Replace the memory barrier and list emptiness check with a test for a new bit set at fs_info->flags. This bit is used to indicate when there are tree mod log users, set whenever a user is added to the list and cleared when the last user is removed from the list. This makes the intention a bit more obvious and possibly more efficient (assuming test_bit() may be cheaper than a full memory barrier on some architectures). Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>