summaryrefslogtreecommitdiff
path: root/fs/btrfs/compression.c
AgeCommit message (Collapse)AuthorFilesLines
2025-05-15btrfs: rename ret2 to ret in btrfs_submit_compressed_read()David Sterba1-3/+3
We can now rename 'ret2' to 'ret' and use it for generic errors. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15btrfs: rename ret to status in btrfs_submit_compressed_read()David Sterba1-5/+5
We're using 'status' for the blk_status_t variables, rename 'ret' so we can use it for generic errors. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15btrfs: simplify reading bio status in end_compressed_writeback()David Sterba1-3/+3
We don't need to have a separate variable to read the bio status, 'ret' works for that just fine so remove 'error'. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15btrfs: use bvec_kmap_local() in btrfs_decompress_buf2page()Christoph Hellwig1-3/+6
This removes the last direct poke into bvec internals in btrfs. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15btrfs: rename remaining exported extent map functionsFilipe Manana1-2/+2
Rename all the exported functions from extent_map.h that don't have a 'btrfs_' prefix in their names, so that they are consistent with all the other functions, to make it clear they are btrfs specific functions and to avoid potential name collisions in the future with functions defined elsewhere in the kernel. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15btrfs: rename functions to allocate and free extent mapsFilipe Manana1-3/+3
These functions are exported and don't have a 'btrfs_' prefix in their names, which goes against coding style conventions. Rename them to have such prefix, making it clear they are from btrfs and avoiding potential collisions in the future with functions defined elsewhere outside btrfs. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15btrfs: rename extent map functions to get block start, end and check if in treeFilipe Manana1-2/+2
These functions are exported and don't have a 'btrfs_' prefix in their names, which goes against coding style conventions. Rename them to have such prefix, making it clear they are from btrfs and avoiding potential collisions in the future with functions defined elsewhere outside btrfs. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15btrfs: rename exported extent map compression functionsFilipe Manana1-2/+2
These functions are exported and don't have a 'btrfs_' prefix in their names, which goes against coding style conventions. Rename them to have such prefix, making it clear they are from btrfs and avoiding potential collisions in the future with functions defined elsewhere outside btrfs. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15btrfs: add btrfs prefix to main lock, try lock and unlock extent functionsFilipe Manana1-3/+3
These functions are exported so they should have a 'btrfs_' prefix by convention, to make it clear they are btrfs specific and to avoid collisions with functions from elsewhere in the kernel. So add a prefix to their name. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15btrfs: use folio_contains() for EOF detectionQu Wenruo1-1/+1
Currently we use the following pattern to detect if the folio contains the end of a file: if (folio->index == end_index) folio_zero_range(); But that only works if the folio is page sized. For the following case, it will not work and leave the range beyond EOF uninitialized: The page size is 4K, and the fs block size is also 4K. 16K 20K 24K | | | | | EOF at 22K And we have a large folio sized 8K at file offset 16K. In that case, the old "folio->index == end_index" will not work, thus the range [22K, 24K) will not be zeroed out. Fix the following call sites which use the above pattern: - add_ra_bio_pages() - extent_writepage() Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15btrfs: fix the file offset calculation inside btrfs_decompress_buf2page()Qu Wenruo1-1/+17
[BUG WITH EXPERIMENTAL LARGE FOLIOS] When testing the experimental large data folio support with compression, there are several ASSERT()s triggered from btrfs_decompress_buf2page() when running fsstress with compress=zstd mount option: - ASSERT(copy_len) from btrfs_decompress_buf2page() - VM_BUG_ON(offset + len > PAGE_SIZE) from memcpy_to_page() [CAUSE] Inside btrfs_decompress_buf2page(), we need to grab the file offset from the current bvec.bv_page, to check if we even need to copy data into the bio. And since we're using single page bvec, and no large folio, every page inside the folio should have its index properly setup. But when large folios are involved, only the first page (aka, the head page) of a large folio has its index properly initialized. The other pages inside the large folio will not have their indexes properly initialized. Thus the page_offset() call inside btrfs_decompress_buf2page() will result garbage, and completely screw up the @copy_len calculation. [FIX] Instead of using page->index directly, go with page_pgoff(), which can handle non-head pages correctly. So introduce a helper, file_offset_from_bvec(), to get the file offset from a single page bio_vec, so the copy_len calculation can be done correctly. Reviewed-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-02btrfs: compression: adjust cb->compressed_folios allocation typeKees Cook1-1/+1
In preparation for making the kmalloc() family of allocators type aware, we need to make sure that the returned type from the allocation matches the type of the variable being assigned. (Before, the allocator would always return "void *", which can be implicitly cast to any pointer type.) The assigned type is "struct folio **" but the returned type will be "struct page **". These are the same allocation size (pointer size), but the types don't match. Adjust the allocation type to match the assignment. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Kees Cook <kees@kernel.org> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18btrfs: defrag: extend ioctl to accept compression levelsDaniel Vacek1-0/+10
The zstd and zlib compression types support setting compression levels. Enhance the defrag interface to specify the levels as well. For zstd the negative (realtime) levels are also accepted. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Daniel Vacek <neelx@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18btrfs: zstd: enable negative compression levels mount optionDaniel Vacek1-11/+10
Allow using the fast modes (negative compression levels) of zstd as a mount option. As per the results, the compression ratio is (expectedly) lower: for level in {-15..-1} 1 2 3; \ do printf "level %3d\n" $level; \ mount -o compress=zstd:$level /dev/sdb /mnt/test/; \ grep sdb /proc/mounts; \ cp -r /usr/bin /mnt/test/; sync; compsize /mnt/test/bin; \ cp -r /usr/share/doc /mnt/test/; sync; compsize /mnt/test/doc; \ cp enwik9 /mnt/test/; sync; compsize /mnt/test/enwik9; \ cp linux-6.13.tar /mnt/test/; sync; compsize /mnt/test/linux-6.13.tar; \ rm -r /mnt/test/{bin,doc,enwik9,linux-6.13.tar}; \ umount /mnt/test/; \ done |& tee results | \ awk '/^level/{print}/^TOTAL/{print$3"\t"$2" |"}' | paste - - - - - 266M bin | 45M doc | 953M wiki | 1.4G source =============================+===============+===============+===============+ level -15 180M 67% | 30M 68% | 694M 72% | 598M 40% | level -14 180M 67% | 30M 67% | 683M 71% | 581M 39% | level -13 177M 66% | 29M 66% | 671M 70% | 566M 38% | level -12 174M 65% | 29M 65% | 658M 69% | 548M 37% | level -11 174M 65% | 28M 64% | 645M 67% | 530M 35% | level -10 171M 64% | 28M 62% | 631M 66% | 512M 34% | level -9 165M 62% | 27M 61% | 615M 64% | 493M 33% | level -8 161M 60% | 27M 59% | 598M 62% | 475M 32% | level -7 155M 58% | 26M 58% | 582M 61% | 457M 30% | level -6 151M 56% | 25M 56% | 565M 59% | 437M 29% | level -5 145M 54% | 24M 55% | 545M 57% | 417M 28% | level -4 139M 52% | 23M 52% | 520M 54% | 391M 26% | level -3 135M 50% | 22M 50% | 495M 51% | 369M 24% | level -2 127M 47% | 22M 48% | 470M 49% | 349M 23% | level -1 120M 45% | 21M 47% | 452M 47% | 332M 22% | level 1 110M 41% | 17M 39% | 362M 38% | 290M 19% | level 2 106M 40% | 17M 38% | 349M 36% | 288M 19% | level 3 104M 39% | 16M 37% | 340M 35% | 276M 18% | The samples represent some data sets that can be commonly found and show approximate compressibility. The fast levels trade off speed for ratio and are best suitable for highly compressible data. As can be seen above, comparing the results to the current default zstd level 3, the negative levels are roughly 2x worse at -15 and the ratio increases almost linearly with each level. Signed-off-by: Daniel Vacek <neelx@suse.com> [ update changelog ] Signed-off-by: David Sterba <dsterba@suse.com>
2024-11-11btrfs: use filemap_get_folio() helperAnand Jain1-1/+1
When fgp_flags and gfp_flags are zero, use filemap_get_folio(A, B) instead of __filemap_get_folio(A, B, 0, 0)—no need for the extra arguments 0, 0. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-11-11btrfs: rename btrfs_folio_(set|start|end)_writer_lock()Qu Wenruo1-1/+1
Since there is no user of reader locks, rename the writer locks into a more generic name, by removing the "_writer" part from the name. And also rename btrfs_subpage::writer into btrfs_subpage::locked. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-11-11btrfs: unify to use writer locks for subpage lockingQu Wenruo1-2/+1
Since commit d7172f52e993 ("btrfs: use per-buffer locking for extent_buffer reading"), metadata read no longer relies on the subpage reader locking. This means we do not need to maintain a different metadata/data split for locking, so we can convert the existing reader lock users by: - add_ra_bio_pages() Convert to btrfs_folio_set_writer_lock() - end_folio_read() Convert to btrfs_folio_end_writer_lock() - begin_folio_read() Convert to btrfs_folio_set_writer_lock() - folio_range_has_eb() Remove the subpage->readers checks, since it is always 0. - Remove btrfs_subpage_start_reader() and btrfs_subpage_end_reader() Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-11-11btrfs: drop unused parameter level from alloc_heuristic_ws()David Sterba1-2/+2
The compression heuristic pass does not need a level, so we can drop the parameter. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-11-11btrfs: lzo: drop unused paramter level from lzo_alloc_workspace()David Sterba1-1/+1
The LZO compression has only one level, we don't need to pass the parameter. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-11-11btrfs: compression: add an ASSERT() to ensure the read-in length is saneQu Wenruo1-0/+3
There are already two bugs (one in zlib, one in zstd) that involved compression path is not handling sector size < page size cases well. So it makes more sense to make sure that btrfs_compress_folios() returns Since we already have two bugs (one in zlib, one in zstd) in the compression path resulting the @total_in be to larger than the to-be-compressed range length, there is enough reason to add an ASSERT() to make sure the total read-in length doesn't exceed the input length. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10btrfs: convert btrfs_decompress() to take a folioLi Zetao1-7/+7
The old page API is being gradually replaced and converted to use folio to improve code readability and avoid repeated conversion between page and folio. Based on the previous patch, the compression path can be directly used in folio without converting to page. Signed-off-by: Li Zetao <lizetao1@huawei.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10btrfs: convert zstd_decompress() to take a folioLi Zetao1-1/+1
The old page API is being gradually replaced and converted to use folio to improve code readability and avoid repeated conversion between page and folio. And memcpy_to_page() can be replaced with memcpy_to_folio(). But there is no memzero_folio(), but it can be replaced equivalently by folio_zero_range(). Signed-off-by: Li Zetao <lizetao1@huawei.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10btrfs: convert lzo_decompress() to take a folioLi Zetao1-1/+1
The old page API is being gradually replaced and converted to use folio to improve code readability and avoid repeated conversion between page and folio. And memcpy_to_page() can be replaced with memcpy_to_folio(). But there is no memzero_folio(), but it can be replaced equivalently by folio_zero_range(). Signed-off-by: Li Zetao <lizetao1@huawei.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10btrfs: convert zlib_decompress() to take a folioLi Zetao1-1/+1
The old page API is being gradually replaced and converted to use folio to improve code readability and avoid repeated conversion between page and folio. And memcpy_to_page() can be replaced with memcpy_to_folio(). But there is no memzero_folio(), but it can be replaced equivalently by folio_zero_range(). Signed-off-by: Li Zetao <lizetao1@huawei.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10btrfs: do not hold the extent lock for entire readJosef Bacik1-1/+1
Historically we've held the extent lock throughout the entire read. There's been a few reasons for this, but it's mostly just caused us problems. For example, this prevents us from allowing page faults during direct io reads, because we could deadlock. This has forced us to only allow 4k reads at a time for io_uring NOWAIT requests because we have no idea if we'll be forced to page fault and thus have to do a whole lot of work. On the buffered side we are protected by the page lock, as long as we're reading things like buffered writes, punch hole, and even direct IO to a certain degree will get hung up on the page lock while the page is in flight. On the direct side we have the dio extent lock, which acts much like the way the extent lock worked previously to this patch, however just for direct reads. This protects direct reads from concurrent direct writes, while we're protected from buffered writes via the inode lock. Now that we're protected in all cases, narrow the extent lock to the part where we're getting the extent map to submit the reads, no longer holding the extent lock for the entire read operation. Push the extent lock down into do_readpage() so that we're only grabbing it when looking up the extent map. This portion was contributed by Goldwyn. Co-developed-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10btrfs: rename btrfs_submit_bio() to btrfs_submit_bbio()David Sterba1-2/+2
The function name is a bit misleading as it submits the btrfs_bio (bbio), rename it so we can use btrfs_submit_bio() when an actual bio is submitted. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10btrfs: convert add_ra_bio_pages() to use only foliosJosef Bacik1-29/+33
Willy is going to get rid of page->index, and add_ra_bio_pages uses page->index. Make his life easier by converting add_ra_bio_pages to use folios so that we are no longer using page->index. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11btrfs: fix extent map use-after-free when adding pages to compressed bioFilipe Manana1-1/+1
At add_ra_bio_pages() we are accessing the extent map to calculate 'add_size' after we dropped our reference on the extent map, resulting in a use-after-free. Fix this by computing 'add_size' before dropping our extent map reference. Reported-by: syzbot+853d80cba98ce1157ae6@syzkaller.appspotmail.com Link: https://lore.kernel.org/linux-btrfs/000000000000038144061c6d18f2@google.com/ Fixes: 6a4049102055 ("btrfs: subpage: make add_ra_bio_pages() compatible") CC: stable@vger.kernel.org # 6.1+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11btrfs: remove the extra_gfp parameter from btrfs_alloc_folio_array()Qu Wenruo1-1/+1
The function btrfs_alloc_folio_array() is only utilized in btrfs_submit_compressed_read() and no other location, and the only caller is not utilizing the @extra_gfp parameter. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11btrfs: pass a btrfs_inode to btrfs_compress_heuristic()David Sterba1-2/+2
Pass a struct btrfs_inode to btrfs_compress_heuristic() as it's an internal interface, allowing to remove some use of BTRFS_I. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11btrfs: switch btrfs_ordered_extent::inode to struct btrfs_inodeDavid Sterba1-1/+1
The structure is internal so we should use struct btrfs_inode for that, allowing to remove some use of BTRFS_I. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11btrfs: remove extent_map::block_start memberQu Wenruo1-1/+2
The member extent_map::block_start can be calculated from extent_map::disk_bytenr + extent_map::offset for regular extents. And otherwise just extent_map::disk_bytenr. And this is already validated by the validate_extent_map(). Now we can remove the member. However there is a special case in btrfs_create_dio_extent() where we for NOCOW/PREALLOC ordered extents cannot directly use the resulting btrfs_file_extent, as btrfs_split_ordered_extent() cannot handle them yet. So for that call site, we pass file_extent->disk_bytenr + file_extent->num_bytes as disk_bytenr for the ordered extent, and 0 for offset. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11btrfs: remove extent_map::block_len memberQu Wenruo1-1/+1
The extent_map::block_len is either extent_map::len (non-compressed extent) or extent_map::disk_num_bytes (compressed extent). Since we already have sanity checks to do the cross-checks between the new and old members, we can drop the old extent_map::block_len now. For most call sites, they can manually select extent_map::len or extent_map::disk_num_bytes, since most if not all of them have checked if the extent is compressed. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11btrfs: remove extent_map::orig_start memberQu Wenruo1-1/+1
Since we have extent_map::offset, the old extent_map::orig_start is just extent_map::start - extent_map::offset for non-hole/inline extents. And since the new extent_map::offset is already verified by validate_extent_map() while the old orig_start is not, let's just remove the old member from all call sites. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11btrfs: fix misspelled end IO compression callbacksFilipe Manana1-4/+4
Fix typo in the end IO compression callbacks, from "comprssed" to "compressed". Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07btrfs: change root->root_key.objectid to btrfs_root_id()Josef Bacik1-1/+1
A comment from Filipe on one of my previous cleanups brought my attention to a new helper we have for getting the root id of a root, which makes it easier to read in the code. The changes where made with the following Coccinelle semantic patch: // <smpl> @@ expression E,E1; @@ ( E->root_key.objectid = E1 | - E->root_key.objectid + btrfs_root_id(E) ) // </smpl> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> [ minor style fixups ] Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07btrfs: compression: migrate compression/decompression paths to foliosQu Wenruo1-45/+45
For both compression and decompression paths, we always require a "struct page **pages" and "unsigned long nr_pages", this involves quite some part of the btrfs compression paths: - All the compression entry points - compressed_bio structure This affects both compression and decompression. - async_extent structure Unfortunately with all those involved parts, there is no good way to split the conversion into smaller patches while still passing compiling. So do this in one big conversion in one go. Please note this is direct page->folio conversion, no change on the page sized folio requirement yet. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> [ minor style fixups ] Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07btrfs: compression: convert page allocation to folio interfacesQu Wenruo1-12/+12
Currently we have two wrappers to allocate and free a page for compression usage: - btrfs_alloc_compr_page() - btrfs_free_compr_page() The allocator would try to grab a page from the pool, and only allocate a new page if the pool is empty. The reclaimer would check if the pool is full, and if not full it would put the page into the pool. This patch converts both helpers to use folio interfaces, and allowing further conversion of compression path to folios. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07btrfs: compression: add error handling for missed page cacheQu Wenruo1-0/+23
For all the supported compression algorithms, the compression path would always need to grab the page cache, then do the compression. Normally we would get a page reference without any problem, since the write path should have already locked the pages in the write range. For the sake of error handling, we should handle the page cache miss case. Adds a common wrapper, btrfs_compress_find_get_page(), which calls find_get_page(), and do the error handling along with an error message. Callers inside compression path would only need to call btrfs_compress_find_get_page(), and error out if it returned any error. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-05btrfs: compression: remove dead comments in btrfs_compress_heuristic()Qu Wenruo1-5/+0
Since commit a440d48c7f93 ("Btrfs: heuristic: implement sampling logic"), btrfs_compress_heuristic() is no longer a simple "return true", but more complex to determine if we should compress. Thus the comment is dead and can be confusing, just remove it. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-04btrfs: add helper to get fs_info from struct inode pointerDavid Sterba1-3/+3
Add a convenience helper to get a fs_info from a VFS inode pointer instead of open coding the chain or using btrfs_sb() that in some cases does one more pointer hop. This is implemented as a macro (still with type checking) so we don't need full definitions of struct btrfs_inode, btrfs_root or btrfs_fs_info. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-04btrfs: add helpers to get fs_info from page/folio pointersDavid Sterba1-1/+1
Add convenience helpers to get a fs_info from a page or folio pointer instead of open coding the chain or using btrfs_sb() that in some cases does one more pointer hop. This is implemented as a macro (still with type checking) so we don't need full definitions of struct page, folio, btrfs_root and btrfs_fs_info. The latter can't be static inlines as this would create loop between ctree.h <-> fs.h, or the headers would have to be restructured. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-04btrfs: remove unused included headersDavid Sterba1-4/+1
With help of neovim, LSP and clangd we can identify header files that are not actually needed to be included in the .c files. This is focused only on removal (with minor fixups), further cleanups are possible but will require doing the header files properly with forward declarations, minimized includes and include-what-you-use care. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-01-19btrfs: zlib: fix and simplify the inline extent decompressionQu Wenruo1-7/+16
[BUG] If we have a filesystem with 4k sectorsize, and an inlined compressed extent created like this: item 4 key (257 INODE_ITEM 0) itemoff 15863 itemsize 160 generation 8 transid 8 size 4096 nbytes 4096 block group 0 mode 100600 links 1 uid 0 gid 0 rdev 0 sequence 1 flags 0x0(none) item 5 key (257 INODE_REF 256) itemoff 15839 itemsize 24 index 2 namelen 14 name: source_inlined item 6 key (257 EXTENT_DATA 0) itemoff 15770 itemsize 69 generation 8 type 0 (inline) inline extent data size 48 ram_bytes 4096 compression 1 (zlib) Which has an inline compressed extent at file offset 0, and its decompressed size is 4K, allowing us to reflink that 4K range to another location (which will not be compressed). If we do such reflink on a subpage system, it would fail like this: # xfs_io -f -c "reflink $mnt/source_inlined 0 60k 4k" $mnt/dest XFS_IOC_CLONE_RANGE: Input/output error [CAUSE] In zlib_decompress(), we didn't treat @start_byte as just a page offset, but also use it as an indicator on whether we should switch our output buffer. In reality, for subpage cases, although @start_byte can be non-zero, we should never switch input/output buffer, since the whole input/output buffer should never exceed one sector. Note: The above assumption is only not true if we're going to support multi-page sectorsize. Thus the current code using @start_byte as a condition to switch input/output buffer or finish the decompression is completely incorrect. [FIX] The fix involves several modifications: - Rename @start_byte to @dest_pgoff to properly express its meaning - Add an extra ASSERT() inside btrfs_decompress() to make sure the input/output size never exceeds one sector. - Use Z_FINISH flag to make sure the decompression happens in one go - Remove the loop needed to switch input/output buffers - Use correct destination offset inside the destination page - Consider early end as an error After the fix, even on 64K page sized aarch64, above reflink now works as expected: # xfs_io -f -c "reflink $mnt/source_inlined 0 60k 4k" $mnt/dest linked 4096/4096 bytes at offset 61440 And resulted a correct file layout: item 9 key (258 INODE_ITEM 0) itemoff 15542 itemsize 160 generation 10 transid 10 size 65536 nbytes 4096 block group 0 mode 100600 links 1 uid 0 gid 0 rdev 0 sequence 1 flags 0x0(none) item 10 key (258 INODE_REF 256) itemoff 15528 itemsize 14 index 3 namelen 4 name: dest item 11 key (258 XATTR_ITEM 3817753667) itemoff 15445 itemsize 83 location key (0 UNKNOWN.0 0) type XATTR transid 10 data_len 37 name_len 16 name: security.selinux data unconfined_u:object_r:unlabeled_t:s0 item 12 key (258 EXTENT_DATA 61440) itemoff 15392 itemsize 53 generation 10 type 1 (regular) extent data disk byte 13631488 nr 4096 extent data offset 0 nr 4096 ram 4096 extent compression 0 (none) Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-16btrfs: migrate various end io functions to foliosQu Wenruo1-4/+4
If we still go the old page based iterator functions, like bio_for_each_segment_all(), we can hit middle pages of a folio (compound page). In that case if we set any page flag on those middle pages, we can easily trigger VM_BUG_ON(), as for compound page flags, they should follow their flag policies (normally only set on leading or tail pages). To avoid such problem in the future full folio migration, here we do: - Change from bio_for_each_segment_all() to bio_for_each_folio_all() This completely removes the ability to access the middle page. - Add extra ASSERT()s for data read/write paths To ensure we only get single paged folio for data now. - Rename those end io functions to follow a certain schema * end_bbio_compressed_read() * end_bbio_compressed_write() These two endio functions don't set any page flags, as they use pages not mapped to any address space. They can be very good candidates for higher order folio testing. And they are shared between compression and encoded IO. * end_bbio_data_read() * end_bbio_data_write() * end_bbio_meta_read() * end_bbio_meta_write() The old function names are not unified: - end_bio_extent_writepage() - end_bio_extent_readpage() - extent_buffer_write_end_io() - extent_buffer_read_end_io() They share no schema on where the "end_*io" string should be, nor can be confusing just using "extent_buffer" and "extent" to distinguish data and metadata paths. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-16btrfs: migrate subpage code to folio interfacesQu Wenruo1-3/+4
Although subpage itself is conflicting with higher folio, since subpage (sectorsize < PAGE_SIZE and nodesize < PAGE_SIZE) means we will never need higher order folio, there is a hidden pitfall: - btrfs_page_*() helpers Those helpers are an abstraction to handle both subpage and non-subpage cases, which means we're going to pass pages pointers to those helpers. And since those helpers are shared between data and metadata paths, it's unavoidable to let them to handle folios, including higher order folios). Meanwhile for true subpage case, we should only have a single page backed folios anyway, thus add a new ASSERT() for btrfs_subpage_assert() to ensure that. Also since those helpers are shared between both data and metadata, add some extra ASSERT()s for data path to make sure we only get single page backed folio for now. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-16btrfs: refactor alloc_extent_buffer() to allocate-then-attach methodQu Wenruo1-1/+1
Currently alloc_extent_buffer() utilizes find_or_create_page() to allocate one page a time for an extent buffer. This method has the following disadvantages: - find_or_create_page() is the legacy way of allocating new pages With the new folio infrastructure, find_or_create_page() is just redirected to filemap_get_folio(). - Lacks the way to support higher order (order >= 1) folios As we can not yet let filemap give us a higher order folio. This patch would change the workflow by the following way: Old | new -----------------------------------+------------------------------------- | ret = btrfs_alloc_page_array(); for (i = 0; i < num_pages; i++) { | for (i = 0; i < num_pages; i++) { p = find_or_create_page(); | ret = filemap_add_folio(); /* Attach page private */ | /* Reuse page cache if needed */ /* Reused eb if needed */ | | /* Attach page private and | reuse eb if needed */ | } By this we split the page allocation and private attaching into two parts, allowing future updates to each part more easily, and migrate to folio interfaces (especially for possible higher order folios). Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-16btrfs: use the flags of an extent map to identify the compression typeFilipe Manana1-2/+2
Currently, in struct extent_map, we use an unsigned int (32 bits) to identify the compression type of an extent and an unsigned long (64 bits on a 64 bits platform, 32 bits otherwise) for flags. We are only using 6 different flags, so an unsigned long is excessive and we can use flags to identify the compression type instead of using a dedicated 32 bits field. We can easily have tens or hundreds of thousands (or more) of extent maps on busy and large filesystems, specially with compression enabled or many or large files with tons of small extents. So it's convenient to have the extent_map structure as small as possible in order to use less memory. So remove the compression type field from struct extent_map, use flags to identify the compression type and shorten the flags field from an unsigned long to a u32. This saves 8 bytes (on 64 bits platforms) and reduces the size of the structure from 136 bytes down to 128 bytes, using now only two cache lines, and increases the number of extent maps we can have per 4K page from 30 to 32. By using a u32 for the flags instead of an unsigned long, we no longer use test_bit(), set_bit() and clear_bit(), but that level of atomicity is not needed as most flags are never cleared once set (before adding an extent map to the tree), and the ones that can be cleared or set after an extent map is added to the tree, are always performed while holding the write lock on the extent map tree, while the reader holds a lock on the tree or tests for a flag that never changes once the extent map is in the tree (such as compression flags). Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-15btrfs: use shrinker for compression page poolDavid Sterba1-0/+102
The pages are now allocated and freed centrally, so we can extend the logic to manage the lifetime. The main idea is to keep a few recently used pages and hand them to all writers. Ideally we won't have to go to allocator at all (a slight performance gain) and also raise chance that we'll have the pages available (slightly increased reliability). In order to avoid gathering too many pages, the shrinker is attached to the cache so we can free them on when MM demands that. The first implementation will drain the whole cache. Further this can be refined to keep some minimal number of pages for emergency purposes. The ultimate goal to avoid memory allocation failures on the write out path from the compression. The pool threshold is set to cover full BTRFS_MAX_COMPRESSED / PAGE_SIZE for minimal thread pool, which is 8 (btrfs_init_fs_info()). This is 128K / 4K * 8 = 256 pages at maximum, which is 1MiB. This is for all filesystems currently mounted, with heavy use of compression IO the allocator is still needed. The cache helps for short burst IO. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-15btrfs: use page alloc/free wrappers for compression pagesDavid Sterba1-1/+15
This is a preparation for managing compression pages in a cache-like manner, instead of asking the allocator each time. The common allocation and free wrappers are introduced and are functionally equivalent to the current code. The freeing helpers need to be carefully placed where the last reference is dropped. This is either after directly allocating (error handling) or when there are no other users of the pages (after copying the contents). It's safe to not use the helper and use put_page() that will handle the reference count. Not using the helper means there's lower number of pages that could be reused without passing them back to allocator. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>