summaryrefslogtreecommitdiff
path: root/fs/btrfs
AgeCommit message (Collapse)AuthorFilesLines
2023-12-15btrfs: set default compress type at btrfs_init_fs_info timeJosef Bacik1-7/+3
With the new mount API we'll be setting our compression well before we call open_ctree. We don't want to overwrite our settings, so set the default in btrfs_init_fs_info instead of open_ctree. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Acked-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-15btrfs: split out the mount option validation code into its own helperJosef Bacik1-29/+37
We're going to need to validate mount options after they're all parsed with the new mount API, split this code out into its own helper so we can use it when we swap over to the new mount API. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Acked-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> [ minor adjustments in the messages ] Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-15btrfs: remove no longer used EXTENT_MAP_DELALLOC block start valueFilipe Manana3-7/+1
After commit ac3c0d36a2a2 ("btrfs: make fiemap more efficient and accurate reporting extent sharedness") we no longer need to create special extent maps during fiemap that have a block start with the EXTENT_MAP_DELALLOC value. So this block start value for extent maps is no longer used since then, therefore remove it. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-15btrfs: allow extent buffer helpers to skip cross-page handlingQu Wenruo3-3/+75
Currently btrfs extent buffer helpers are doing all the cross-page handling, as there is no guarantee that all those eb pages are contiguous. However on systems with enough memory, there is a very high chance the page cache for btree_inode are allocated with physically contiguous pages. In that case, we can skip all the complex cross-page handling, thus speeding up the code. This patch adds a new member, extent_buffer::addr, which is only set to non-NULL if all the extent buffer pages are physically contiguous. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-15btrfs: reflow btrfs_free_tree_blockJohannes Thumshirn1-49/+50
Reflow btrfs_free_tree_block() so that there is one level of indentation needed. This patch has no functional changes. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-15btrfs: use memset_page instead of opencoding itJohannes Thumshirn1-1/+1
Use memset_page() in memset_extent_buffer() instead of opencoding it. This does not not change any functionality. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-15btrfs: remove now unneeded btrfs_redirty_list_addJohannes Thumshirn4-27/+1
Now that we're not clearing the dirty flag off of extent_buffers in zoned mode, all that is left of btrfs_redirty_list_add() is a memzero() and some ASSERT()ions. As we're also memzero()ing the buffer on write-out btrfs_redirty_list_add() has become obsolete and can be removed. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-15btrfs: zoned: don't clear dirty flag of extent bufferJohannes Thumshirn3-4/+22
One a zoned filesystem, never clear the dirty flag of an extent buffer, but instead mark it as zeroout. On writeout, when encountering a marked extent_buffer, zero it out. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-15btrfs: rename EXTENT_BUFFER_NO_CHECK to EXTENT_BUFFER_ZONED_ZEROOUTJohannes Thumshirn5-5/+6
EXTENT_BUFFER_ZONED_ZEROOUT better describes the state of the extent buffer, namely it is written as all zeros. This is needed in zoned mode, to preserve I/O ordering. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-15btrfs: always set extent_io_tree::inode and drop fs_infoDavid Sterba4-25/+78
The extent_io_tree is embedded in several structures, notably in struct btrfs_inode. The fs_info is only used for reporting errors and for reference in trace points. We can get to the pointer through the inode, but not all io trees set it. However, we always know the owner and can recognize if inode is valid. For access helpers are provided, const variant for the trace points. This reduces size of extent_io_tree by 8 bytes and following structures in turn: - btrfs_inode 1104 -> 1088 - btrfs_device 520 -> 512 - btrfs_root 1360 -> 1344 - btrfs_transaction 456 -> 440 - btrfs_fs_info 3600 -> 3592 - reloc_control 1520 -> 1512 Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-15btrfs: enhance extent_io_tree error reportsDavid Sterba1-10/+14
Pass the type of the extent io tree operation which failed in the report helper. The message wording and contents is updated, though locking might be the cause of the error it's probably not the only one and we're interested in the state. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-15btrfs: constify fs_info parameter in __btrfs_panic()David Sterba2-2/+2
The printk helpers take const fs_info if it's used just for the identifier in the messages, __btrfs_panic() lacks that. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-15btrfs: drop error message in extent_io_tree insert_state()David Sterba1-3/+0
The helper insert_state errors are handled in all callers and reported by extent_io_tree_panic so we don't need to do it twice. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-15btrfs: move lockdep class setting out of extent_io_tree_initDavid Sterba2-10/+11
The per-inode file extent tree was added in 41a2ee75aab0 ("btrfs: introduce per-inode file extent tree"), it's the only tree type that requires the lockdep class. Move it to the file where it is actually used. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-15btrfs: remove stripe size local variable from insert_dev_extents()Filipe Manana1-4/+1
It's not needed to have a local variable to store the stripe size at insert_dev_extents(), we can just take from the chunk map as it's only used once and typing 'map->stripe_size' is not much more verbose than simply typing 'stripe_size'. So remove the local variable. This was added before the recent addition of a dedicated structure for chunk mappings because the stripe size was encoded in the 'orig_block_len' field of an extent_map structure, so the use of the local variable made things more readable. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-15btrfs: use a dedicated data structure for chunk mapsFilipe Manana16-488/+501
Currently we abuse the extent_map structure for two purposes: 1) To actually represent extents for inodes; 2) To represent chunk mappings. This is odd and has several disadvantages: 1) To create a chunk map, we need to do two memory allocations: one for an extent_map structure and another one for a map_lookup structure, so more potential for an allocation failure and more complicated code to manage and link two structures; 2) For a chunk map we actually only use 3 fields (24 bytes) of the respective extent map structure: the 'start' field to have the logical start address of the chunk, the 'len' field to have the chunk's size, and the 'orig_block_len' field to contain the chunk's stripe size. Besides wasting a memory, it's also odd and not intuitive at all to have the stripe size in a field named 'orig_block_len'. We are also using 'block_len' of the extent_map structure to contain the chunk size, so we have 2 fields for the same value, 'len' and 'block_len', which is pointless; 3) When an extent map is associated to a chunk mapping, we set the bit EXTENT_FLAG_FS_MAPPING on its flags and then make its member named 'map_lookup' point to the associated map_lookup structure. This means that for an extent map associated to an inode extent, we are not using this 'map_lookup' pointer, so wasting 8 bytes (on a 64 bits platform); 4) Extent maps associated to a chunk mapping are never merged or split so it's pointless to use the existing extent map infrastructure. So add a dedicated data structure named 'btrfs_chunk_map' to represent chunk mappings, this is basically the existing map_lookup structure with some extra fields: 1) 'start' to contain the chunk logical address; 2) 'chunk_len' to contain the chunk's length; 3) 'stripe_size' for the stripe size; 4) 'rb_node' for insertion into a rb tree; 5) 'refs' for reference counting. This way we do a single memory allocation for chunk mappings and we don't waste memory for them with unused/unnecessary fields from an extent_map. We also save 8 bytes from the extent_map structure by removing the 'map_lookup' pointer, so the size of struct extent_map is reduced from 144 bytes down to 136 bytes, and we can now have 30 extents map per 4K page instead of 28. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-15btrfs: use btrfs_next_item() at scrub.c:find_first_extent_item()Filipe Manana1-8/+5
There's no reason to open code what btrfs_next_item() does when searching for extent items at scrub.c:scrub.c:find_first_extent_item(), so remove the logic to find the next item and use btrfs_next_item() instead, making the code shorter and less nested code blocks. While at it also fix the comment to the plural "items" instead of "item" and end it with proper punctuation. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-15btrfs: unexport extent_map_block_end()Filipe Manana2-7/+7
The helper extent_map_block_end() is currently not used anywhere outside extent_map.c, so move into from extent_map.h into extent_map.c. While at it, also make the extent map pointer argument as const. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-15btrfs: split assert into two different asserts when removing block groupFilipe Manana1-1/+2
When starting a transaction to remove a block group we have one ASSERT that checks we found an extent map and that the extent map's start offset matches the desired chunk offset. In case one of the conditions fails, we get a stack trace that point to the respective line of code, however we can't tell which condition failed: either there's no extent map or we got one with an unexpected start offset. To make such an issue easier to debug and analyse, split the assertion into two, one for each condition. This was actually triggered during development of another upcoming change. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-15btrfs: mark sanity checks when getting chunk map as unlikelyFilipe Manana1-2/+2
When getting a chunk map, at btrfs_get_chunk_map(), we do some sanity checks to verify that we found an extent map and that it includes the requested logical address. These are never expected to fail, so mark them as unlikely to make it more clear as well as to allow a compiler to generate more efficient code. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-15btrfs: remove unused btrfs_root::typeDavid Sterba1-2/+0
Looks like the struct member was added in 2007 in 2.6.29 in commit 87ee04eb0f2f ("Btrfs: Add simple stripe size parameter") but hasn't been used at all since. So let's remove it. This was found by tool https://github.com/jirislaby/clang-struct, then build tested after removing the struct member. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-15btrfs: remove unused definition of tree_entry in extent-io-tree.cDavid Sterba1-6/+0
The declaration was temporarily moved in a4055213bf69 ("btrfs: unexport all the temporary exports for extent-io-tree.c") and then should have been removed in 6.0 in 071d19f5130f ("btrfs: remove struct tree_entry in extent-io-tree.c") but was not. This was found by tool https://github.com/jirislaby/clang-struct . Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-15btrfs: raid56: remove unused btrfs_plug_cb::workDavid Sterba1-1/+0
The raid56 changes in 6.2 reworked the IO path to RMW, commit 93723095b5d5 ("btrfs: raid56: switch write path to rmw_rbio()") in particular removed the last use of the work member so it can be removed as well. This was found by tool https://github.com/jirislaby/clang-struct . Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-15btrfs: remove unused btrfs_ordered_extent::outstanding_isizeDavid Sterba1-7/+0
The whole isize code was deleted in 5.6 3f1c64ce0438 ("btrfs: delete the ordered isize update code"), except the struct member. This was found by tool https://github.com/jirislaby/clang-struct . Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-15btrfs: scrub: remove unused scrub_ctx::sectors_per_bioDavid Sterba1-1/+0
The recent scrub rewrite forgot to remove the sectors_per_bio in 6.3 in 13a62fd997f0 ("btrfs: scrub: remove scrub_bio structure"). This was found by tool https://github.com/jirislaby/clang-struct . Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-15btrfs: migrate to use folio private instead of page privateQu Wenruo5-89/+124
As a cleanup and preparation for future folio migration, this patch would replace all page->private to folio version. This includes: - PagePrivate() -> folio_test_private() - page->private -> folio_get_private() - attach_page_private() -> folio_attach_private() - detach_page_private() -> folio_detach_private() Since we're here, also remove the forced cast on page->private, since it's (void *) already, we don't really need to do the cast. For now even if we missed some call sites, it won't cause any problem yet, as we're only using order 0 folio (single page), thus all those folio/page flags should be synced. But for the future conversion to utilize higher order folio, the page <-> folio flag sync is no longer guaranteed, thus we have to migrate to utilize folio flags. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-15btrfs: use shrinker for compression page poolDavid Sterba1-0/+102
The pages are now allocated and freed centrally, so we can extend the logic to manage the lifetime. The main idea is to keep a few recently used pages and hand them to all writers. Ideally we won't have to go to allocator at all (a slight performance gain) and also raise chance that we'll have the pages available (slightly increased reliability). In order to avoid gathering too many pages, the shrinker is attached to the cache so we can free them on when MM demands that. The first implementation will drain the whole cache. Further this can be refined to keep some minimal number of pages for emergency purposes. The ultimate goal to avoid memory allocation failures on the write out path from the compression. The pool threshold is set to cover full BTRFS_MAX_COMPRESSED / PAGE_SIZE for minimal thread pool, which is 8 (btrfs_init_fs_info()). This is 128K / 4K * 8 = 256 pages at maximum, which is 1MiB. This is for all filesystems currently mounted, with heavy use of compression IO the allocator is still needed. The cache helps for short burst IO. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-15btrfs: use page alloc/free wrappers for compression pagesDavid Sterba6-12/+30
This is a preparation for managing compression pages in a cache-like manner, instead of asking the allocator each time. The common allocation and free wrappers are introduced and are functionally equivalent to the current code. The freeing helpers need to be carefully placed where the last reference is dropped. This is either after directly allocating (error handling) or when there are no other users of the pages (after copying the contents). It's safe to not use the helper and use put_page() that will handle the reference count. Not using the helper means there's lower number of pages that could be reused without passing them back to allocator. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-15btrfs: do not utilize goto to implement delayed inode ref deletionQu Wenruo1-21/+24
[PROBLEM] The function __btrfs_update_delayed_inode() is doing something not meeting the code standard of today: path->slots[0]++ if (path->slots[0] >= btrfs_header_nritems(leaf)) goto search; again: if (!is_the_target_inode_ref()) goto out; ret = btrfs_delete_item(); /* Some cleanup. */ return ret; search: ret = search_for_the_last_inode_ref(); goto again; With the tag named "again", it's pretty common to think it's a loop, but the truth is, we only need to do the search once, to locate the last (also the first, since there should only be one INODE_REF or INODE_EXTREF now) ref of the inode. [FIX] Instead of the weird jumps, just do them in a stream-lined fashion. This removes those weird labels, and add extra comments on why we can do the different searches. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-15btrfs: make the logic from btrfs_block_can_be_shared() easier to readFilipe Manana1-16/+24
The logic in btrfs_block_can_be_shared() is hard to follow as we have a lot of conditions in a single if statement including a subexpression with a logical or and two nested if statements inside the main if statement. Make this easier to read by using separate if statements that return immediately when we find a condition that determines if a block can be or can not be shared. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-15btrfs: use bool for return type of btrfs_block_can_be_shared()Filipe Manana2-9/+9
Currently btrfs_block_can_be_shared() returns an int that is used as a boolean. Since it all it needs is to return true or false, and it can't return errors for example, change the return type from int to bool to make it a bit more readable and obvious. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-15btrfs: remove log_extents_lock and logged_list from struct btrfs_rootFilipe Manana2-7/+0
The logged_list[2] and log_extents_lock[2] members of struct btrfs_root are no longer used, their last use was removed in commit 5636cf7d6dc8 ("btrfs: remove the logged extents infrastructure"). So remove these fields. This reduces the size of struct btrfs_root, on a release kernel, from 1392 bytes down to 1352 bytes. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-15btrfs: remove duplicate btrfs_clear_buffer_dirty() prototype from disk-io.hFilipe Manana1-2/+0
The prototype for btrfs_clear_buffer_dirty() is declared in both disk-io.h and extent_io.h, but the function is defined at extent_io.c. So remove the prototype declaration from disk-io.h. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-14Merge tag 'for-6.7-rc5-tag' of ↵Linus Torvalds11-50/+116
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: "Some fixes to quota accounting code, mostly around error handling and correctness: - free reserves on various error paths, after IO errors or transaction abort - don't clear reserved range at the folio release time, it'll be properly cleared after final write - fix integer overflow due to int used when passing around size of freed reservations - fix a regression in squota accounting that missed some cases with delayed refs" * tag 'for-6.7-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: ensure releasing squota reserve on head refs btrfs: don't clear qgroup reserved bit in release_folio btrfs: free qgroup pertrans reserve on transaction abort btrfs: fix qgroup_free_reserved_data int overflow btrfs: free qgroup reserve when ORDERED_IOERR is set
2023-12-11fs: convert error_remove_page to error_remove_folioMatthew Wilcox (Oracle)1-1/+1
There were already assertions that we were not passing a tail page to error_remove_page(), so make the compiler enforce that by converting everything to pass and use a folio. Link: https://lkml.kernel.org/r/20231117161447.2461643-7-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-07btrfs: ensure releasing squota reserve on head refsBoris Burkov3-19/+46
A reservation goes through a 3 step lifetime: - generated during delalloc - released/counted by ordered_extent allocation - freed by running delayed ref That third step depends on must_insert_reserved on the head ref, so the head ref with that field set owns the reservation. Once you prepare to run the head ref, must_insert_reserved is unset, which means that running the ref must free the reservation, whether or not it succeeds, or else the reservation is leaked. That results in either a risk of spurious ENOSPC if the fs stays writeable or a warning on unmount if it is readonly. The existing squota code was aware of these invariants, but missed a few cases. Improve it by adding a helper function to use in the cleanup paths and call it from the existing early returns in running delayed refs. This also simplifies btrfs_record_squota_delta and struct btrfs_quota_delta. This fixes (or at least improves the reliability of) generic/475 with "mkfs -O squota". On my machine, that test failed ~4/10 times without this patch and passed 100/100 times with it. Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-07btrfs: don't clear qgroup reserved bit in release_folioBoris Burkov1-1/+2
The EXTENT_QGROUP_RESERVED bit is used to "lock" regions of the file for duplicate reservations. That is two writes to that range in one transaction shouldn't create two reservations, as the reservation will only be freed once when the write finally goes down. Therefore, it is never OK to clear that bit without freeing the associated qgroup reserve. At this point, we don't want to be freeing the reserve, so mask off the bit. CC: stable@vger.kernel.org # 5.15+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-07btrfs: free qgroup pertrans reserve on transaction abortBoris Burkov4-4/+34
If we abort a transaction, we never run the code that frees the pertrans qgroup reservation. This results in warnings on unmount as that reservation has been leaked. The leak isn't a huge issue since the fs is read-only, but it's better to clean it up when we know we can/should. Do it during the cleanup_transaction step of aborting. CC: stable@vger.kernel.org # 5.15+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-07btrfs: fix qgroup_free_reserved_data int overflowBoris Burkov6-25/+31
The reserved data counter and input parameter is a u64, but we inadvertently accumulate it in an int. Overflowing that int results in freeing the wrong amount of data and breaking reserve accounting. Unfortunately, this overflow rot spreads from there, as the qgroup release/free functions rely on returning an int to take advantage of negative values for error codes. Therefore, the full fix is to return the "released" or "freed" amount by a u64 argument and to return 0 or negative error code via the return value. Most of the call sites simply ignore the return value, though some of them handle the error and count the returned bytes. Change all of them accordingly. CC: stable@vger.kernel.org # 6.1+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Boris Burkov <boris@bur.io> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-07btrfs: free qgroup reserve when ORDERED_IOERR is setBoris Burkov1-1/+3
An ordered extent completing is a critical moment in qgroup reserve handling, as the ownership of the reservation is handed off from the ordered extent to the delayed ref. In the happy path we release (unlock) but do not free (decrement counter) the reservation, and the delayed ref drives the free. However, on an error, we don't create a delayed ref, since there is no ref to add. Therefore, free on the error path. CC: stable@vger.kernel.org # 6.1+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-28Merge tag 'for-6.7-rc3-tag' of ↵Linus Torvalds9-10/+62
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: "A few fixes and message updates: - for simple quotas, handle the case when a snapshot is created and the target qgroup already exists - fix a warning when file descriptor given to send ioctl is not writable - fix off-by-one condition when checking chunk maps - free pages when page array allocation fails during compression read, other cases were handled - fix memory leak on error handling path in ref-verify debugging feature - copy missing struct member 'version' in 64/32bit compat send ioctl - tree-checker verifies inline backref ordering - print messages to syslog on first mount and last unmount - update error messages when reading chunk maps" * tag 'for-6.7-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: send: ensure send_fd is writable btrfs: free the allocated memory if btrfs_alloc_page_array() fails btrfs: fix 64bit compat send ioctl arguments not initializing version member btrfs: make error messages more clear when getting a chunk map btrfs: fix off-by-one when checking chunk map includes logical address btrfs: ref-verify: fix memory leaks in btrfs_ref_tree_mod() btrfs: add dmesg output for first mount and last unmount of a filesystem btrfs: do not abort transaction if there is already an existing qgroup btrfs: tree-checker: add type and sequence check for inline backrefs
2023-11-24btrfs: send: ensure send_fd is writableJann Horn1-1/+1
kernel_write() requires the caller to ensure that the file is writable. Let's do that directly after looking up the ->send_fd. We don't need a separate bailout path because the "out" path already does fput() if ->send_filp is non-NULL. This has no security impact for two reasons: - the ioctl requires CAP_SYS_ADMIN - __kernel_write() bails out on read-only files - but only since 5.8, see commit a01ac27be472 ("fs: check FMODE_WRITE in __kernel_write") Reported-and-tested-by: syzbot+12e098239d20385264d3@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=12e098239d20385264d3 Fixes: 31db9f7c23fb ("Btrfs: introduce BTRFS_IOC_SEND for btrfs send/receive") CC: stable@vger.kernel.org # 4.14+ Signed-off-by: Jann Horn <jannh@google.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-24btrfs: free the allocated memory if btrfs_alloc_page_array() failsQu Wenruo1-3/+8
[BUG] If btrfs_alloc_page_array() fail to allocate all pages but part of the slots, then the partially allocated pages would be leaked in function btrfs_submit_compressed_read(). [CAUSE] As explicitly stated, if btrfs_alloc_page_array() returned -ENOMEM, caller is responsible to free the partially allocated pages. For the existing call sites, most of them are fine: - btrfs_raid_bio::stripe_pages Handled by free_raid_bio(). - extent_buffer::pages[] Handled btrfs_release_extent_buffer_pages(). - scrub_stripe::pages[] Handled by release_scrub_stripe(). But there is one exception in btrfs_submit_compressed_read(), if btrfs_alloc_page_array() failed, we didn't cleanup the array and freed the array pointer directly. Initially there is still the error handling in commit dd137dd1f2d7 ("btrfs: factor out allocating an array of pages"), but later in commit 544fe4a903ce ("btrfs: embed a btrfs_bio into struct compressed_bio"), the error handling is removed, leading to the possible memory leak. [FIX] This patch would add back the error handling first, then to prevent such situation from happening again, also Make btrfs_alloc_page_array() to free the allocated pages as a extra safety net, then we don't need to add the error handling to btrfs_submit_compressed_read(). Fixes: 544fe4a903ce ("btrfs: embed a btrfs_bio into struct compressed_bio") CC: stable@vger.kernel.org # 6.4+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-24btrfs: fix 64bit compat send ioctl arguments not initializing version memberDavid Sterba1-0/+1
When the send protocol versioning was added in 5.16 e77fbf990316 ("btrfs: send: prepare for v2 protocol"), the 32/64bit compat code was not updated (added by 2351f431f727 ("btrfs: fix send ioctl on 32bit with 64bit kernel")), missing the version struct member. The compat code is probably rarely used, nobody reported any bugs. Found by tool https://github.com/jirislaby/clang-struct . Fixes: e77fbf990316 ("btrfs: send: prepare for v2 protocol") CC: stable@vger.kernel.org # 6.1+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-24btrfs: move file_start_write() to after permission hookAmir Goldstein1-6/+6
In vfs code, file_start_write() is usually called after the permission hook in rw_verify_area(). btrfs_ioctl_encoded_write() in an exception to this rule. Move file_start_write() to after the rw_verify_area() check in encoded write to make the permission hook "start-write-safe". This is needed for fanotify "pre content" events. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Amir Goldstein <amir73il@gmail.com> Link: https://lore.kernel.org/r/20231122122715.2561213-9-amir73il@gmail.com Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-11-24btrfs: make error messages more clear when getting a chunk mapFilipe Manana1-3/+4
When getting a chunk map, at btrfs_get_chunk_map(), we do some sanity checks to verify we found a chunk map and that map found covers the logical address the caller passed in. However the messages aren't very clear in the sense that don't mention the issue is with a chunk map and one of them prints the 'length' argument as if it were the end offset of the requested range (while the in the string format we use %llu-%llu which suggests a range, and the second %llu-%llu is actually a range for the chunk map). So improve these two details in the error messages. CC: stable@vger.kernel.org # 5.4+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-24btrfs: fix off-by-one when checking chunk map includes logical addressFilipe Manana1-1/+1
At btrfs_get_chunk_map() we get the extent map for the chunk that contains the given logical address stored in the 'logical' argument. Then we do sanity checks to verify the extent map contains the logical address. One of these checks verifies if the extent map covers a range with an end offset behind the target logical address - however this check has an off-by-one error since it will consider an extent map whose start offset plus its length matches the target logical address as inclusive, while the fact is that the last byte it covers is behind the target logical address (by 1). So fix this condition by using '<=' rather than '<' when comparing the extent map's "start + length" against the target logical address. CC: stable@vger.kernel.org # 4.14+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-24btrfs: ref-verify: fix memory leaks in btrfs_ref_tree_mod()Bragatheswaran Manickavel1-0/+2
In btrfs_ref_tree_mod(), when !parent 're' was allocated through kmalloc(). In the following code, if an error occurs, the execution will be redirected to 'out' or 'out_unlock' and the function will be exited. However, on some of the paths, 're' are not deallocated and may lead to memory leaks. For example: lookup_block_entry() for 'be' returns NULL, the out label will be invoked. During that flow ref and 'ra' are freed but not 're', which can potentially lead to a memory leak. CC: stable@vger.kernel.org # 5.10+ Reported-and-tested-by: syzbot+d66de4cbf532749df35f@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=d66de4cbf532749df35f Signed-off-by: Bragatheswaran Manickavel <bragathemanick0908@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-24btrfs: add dmesg output for first mount and last unmount of a filesystemQu Wenruo2-1/+5
There is a feature request to add dmesg output when unmounting a btrfs. There are several alternative methods to do the same thing, but with their own problems: - Use eBPF to watch btrfs_put_super()/open_ctree() Not end user friendly, they have to dip their head into the source code. - Watch for directory /sys/fs/<uuid>/ This is way more simple, but still requires some simple device -> uuid lookups. And a script needs to use inotify to watch /sys/fs/. Compared to all these, directly outputting the information into dmesg would be the most simple one, with both device and UUID included. And since we're here, also add the output when mounting a filesystem for the first time for parity. A more fine grained monitoring of subvolume mounts should be done by another layer, like audit. Now mounting a btrfs with all default mkfs options would look like this: [81.906566] BTRFS info (device dm-8): first mount of filesystem 633b5c16-afe3-4b79-b195-138fe145e4f2 [81.907494] BTRFS info (device dm-8): using crc32c (crc32c-intel) checksum algorithm [81.908258] BTRFS info (device dm-8): using free space tree [81.912644] BTRFS info (device dm-8): auto enabling async discard [81.913277] BTRFS info (device dm-8): checking UUID tree [91.668256] BTRFS info (device dm-8): last unmount of filesystem 633b5c16-afe3-4b79-b195-138fe145e4f2 CC: stable@vger.kernel.org # 5.4+ Link: https://github.com/kdave/btrfs-progs/issues/689 Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> [ update changelog ] Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-21fs: Rename mapping private membersMatthew Wilcox (Oracle)2-28/+28
It is hard to find where mapping->private_lock, mapping->private_list and mapping->private_data are used, due to private_XXX being a relatively common name for variables and structure members in the kernel. To fit with other members of struct address_space, rename them all to have an i_ prefix. Tested with an allmodconfig build. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Link: https://lore.kernel.org/r/20231117215823.2821906-1-willy@infradead.org Acked-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Christian Brauner <brauner@kernel.org>