path: root/fs/bcachefs
Age | Commit message | Author | Files | Lines
2024-09-28 | bcachefs: fast exit when darray_make_room failed | Hongbo Li | 1 | -1/+3
In downgrade_table_extra, the return value of darray_make_room is needed. When it fails, we should exit immediately. Fixes: 7773df19c35f ("bcachefs: metadata version bucket_stripe_sectors") Signed-off-by: Hongbo Li <lihongbo22@huawei.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
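The pattern this fix describes can be sketched in userspace C. This is only an illustration under stated assumptions: `struct int_array`, `int_array_make_room()` and `int_array_append()` are hypothetical stand-ins, not the bcachefs darray API.

```c
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical growable array; stands in for a kernel darray. */
struct int_array {
	int    *data;
	size_t  nr;	/* elements in use */
	size_t  size;	/* elements allocated */
};

/* Ensure room for `more` extra elements; returns 0 or -ENOMEM. */
static int int_array_make_room(struct int_array *a, size_t more)
{
	if (a->nr + more <= a->size)
		return 0;

	size_t new_size = a->size ? a->size * 2 : 8;
	while (new_size < a->nr + more)
		new_size *= 2;

	int *p = realloc(a->data, new_size * sizeof(*p));
	if (!p)
		return -ENOMEM;

	a->data = p;
	a->size = new_size;
	return 0;
}

/* Append `n` values, exiting immediately if growing the array fails. */
static int int_array_append(struct int_array *a, const int *vals, size_t n)
{
	int ret = int_array_make_room(a, n);
	if (ret)
		return ret;	/* fast exit: never write past the old allocation */

	for (size_t i = 0; i < n; i++)
		a->data[a->nr++] = vals[i];
	return 0;
}
```

Ignoring the return value would let the copy loop run against a buffer that was never grown, which is the kind of bug the early return avoids.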
2024-09-28 | bcachefs: Fix iterator leak in check_subvol() | Kent Overstreet | 1 | -28/+26
A couple small error handling fixes Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-28 | bcachefs: Add snapshot to bch_inode_unpacked | Kent Overstreet | 2 | -4/+7
this allows for various cleanups in fsck Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-28 | bcachefs: assign return error when iterating through layout | Diogo Jahchan Koike | 1 | -1/+3
syzbot reported a null ptr deref in __copy_user [0] In __bch2_read_super, when a corrupt backup superblock matches the default opts offset, no error is assigned to ret and the freed superblock gets through, possibly being assigned as the best sb in bch2_fs_open and being later dereferenced, causing a fault. Assign EINVALID to ret when iterating through layout. [0]: https://syzkaller.appspot.com/bug?extid=18a5c5e8a9c856944876 Reported-by: syzbot+18a5c5e8a9c856944876@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=18a5c5e8a9c856944876 Signed-off-by: Diogo Jahchan Koike <djahchankoike@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-28 | bcachefs: Fix srcu warning in check_topology | Kent Overstreet | 1 | -0/+2
check_topology doesn't need the srcu lock and doesn't use normal btree transactions - we can just drop the srcu lock. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-28 | bcachefs: Fix error path in check_dirent_inode_dirent() | Kent Overstreet | 1 | -3/+2
fsck_err() jumps to the fsck_err label when bailing out; need to make sure bp_iter was initialized... Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
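The hazard described here is a general one with goto-based cleanup: anything the cleanup label touches must be initialized before the first jump that can reach it. A minimal userspace sketch, assuming a hypothetical `struct iter` and `iter_exit()` in place of the real btree iterator and its exit function:

```c
#include <stdlib.h>

/* Hypothetical resource handle; stands in for a btree iterator. */
struct iter {
	void *buf;
};

static void iter_exit(struct iter *it)
{
	free(it->buf);	/* free(NULL) is a no-op, so a zeroed iter is safe */
	it->buf = NULL;
}

/* Returns 0 on success, -1 on failure; cleanup runs on both paths. */
static int check_item(int early_error)
{
	struct iter it = { 0 };	/* initialized before any goto can reach `out` */
	int ret = 0;

	if (early_error) {
		ret = -1;
		goto out;	/* bails out before the iterator is set up */
	}

	it.buf = malloc(16);
	if (!it.buf)
		ret = -1;
out:
	iter_exit(&it);
	return ret;
}
```

Without the zero initializer, the early `goto out` would hand `iter_exit()` an indeterminate pointer.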
2024-09-28 | bcachefs: memset bounce buffer portion to 0 after key_sort_fix_overlapping | Piotr Zalewski | 1 | -0/+4
Zero-initialize the part of the allocated bounce buffer that wasn't touched by the subsequent bch2_key_sort_fix_overlapping, to mitigate a later uninit-value use KMSAN bug[1]. After applying the patch, the reproducer still triggers a stack overflow[2], but it seems unrelated to the uninit-value use warning. After further investigation it was found that the stack overflow occurs because KMSAN adds too many function calls[3]. A backtrace of where the stack magic number gets smashed was added as a reply to the syzkaller thread[3]. It was confirmed that the task's stack magic number gets smashed after the code path where KMSAN detects uninit-value use is executed, so it can be assumed that it doesn't contribute in any way to uninit-value use detection. [1] https://syzkaller.appspot.com/bug?extid=6f655a60d3244d0c6718 [2] https://lore.kernel.org/lkml/66e57e46.050a0220.115905.0002.GAE@google.com [3] https://lore.kernel.org/all/rVaWgPULej8K7HqMPNIu8kVNyXNjjCiTB-QBtItLFBmk0alH6fV2tk4joVPk97Evnuv4ZRDd8HB5uDCkiFG6u81xKdzDj-KrtIMJSlF6Kt8=@proton.me Reported-by: syzbot+6f655a60d3244d0c6718@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=6f655a60d3244d0c6718 Fixes: ec4edd7b9d20 ("bcachefs: Prep work for variable size btree node buffers") Suggested-by: Kent Overstreet <kent.overstreet@linux.dev> Signed-off-by: Piotr Zalewski <pZ010001011111@proton.me> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-28 | bcachefs: Improve bch2_is_inode_open() warning message | Kent Overstreet | 1 | -3/+3
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-28 | bcachefs: Add extra padding in bkey_make_mut_noupdate() | Kent Overstreet | 1 | -1/+2
This fixes a kasan splat in propagate_key_to_snapshot_leaves() - varint_decode_fast() does reads (that it never uses) up to 7 bytes past the end of the integer. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
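Why a decoder that over-reads needs padded buffers can be shown with a toy example. The sketch below does not reproduce the bcachefs varint format or `varint_decode_fast()`; `decode_fast()`, `copy_with_padding()` and `DECODE_PAD` are hypothetical names, and the encoding is a plain little-endian integer of 1 to 8 bytes.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define DECODE_PAD 7	/* the fast decoder may read this many bytes past the value */

/* Decode a `len`-byte (1..8) little-endian integer, always reading 8 bytes. */
static uint64_t decode_fast(const uint8_t *p, unsigned len)
{
	uint64_t v = 0;

	for (unsigned i = 0; i < 8; i++)	/* unconditional 8-byte read */
		v |= (uint64_t)p[i] << (8 * i);

	if (len < 8)
		v &= (UINT64_C(1) << (8 * len)) - 1;	/* discard the over-read bytes */
	return v;
}

/* Copy `src` into a buffer with zeroed padding so decode_fast() stays in bounds. */
static uint8_t *copy_with_padding(const uint8_t *src, size_t len)
{
	uint8_t *dst = calloc(1, len + DECODE_PAD);

	if (dst)
		memcpy(dst, src, len);
	return dst;
}
```

The bytes read past the value are masked off and never used, but they are still read, so a memory checker will flag the access unless the allocation is at least 7 bytes larger than the data.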
2024-09-28 | bcachefs: Mark inode errors as autofix | Kent Overstreet | 1 | -16/+16
Most or all errors will be autofix in the future, we're currently just doing the ones that we know are well tested. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-27 | [tree-wide] finally take no_llseek out | Al Viro | 2 | -3/+0
no_llseek had been defined to NULL two years ago, in commit 868941b14441 ("fs: remove no_llseek") To quote that commit, At -rc1 we'll need do a mechanical removal of no_llseek - git grep -l -w no_llseek | grep -v porting.rst | while read i; do sed -i '/\<no_llseek\>/d' $i done would do it. Unfortunately, that hadn't been done. Linus, could you do that now, so that we could finally put that thing to rest? All instances are of the form .llseek = no_llseek, so it's obviously safe. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2024-09-24 | bcachefs: Fix infinite loop in propagate_key_to_snapshot_leaves() | Kent Overstreet | 1 | -0/+1
As we iterate we need to mark that we no longer need iterators - otherwise we'll infinite loop via the "too many iters" check when there's many snapshots. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-24 | bcachefs: Ensure BCH_FS_accounting_replay_done is always set | Kent Overstreet | 1 | -0/+3
if it doesn't get set we'll never be able to flush the btree write buffer; this only happens in fake rw mode, but prevents us from shutting down. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-23 | Merge tag 'bcachefs-2024-09-21' of git://evilpiepirate.org/bcachefs | Linus Torvalds | 84 | -1604/+3037
Pull bcachefs updates from Kent Overstreet:
- rcu_pending, btree key cache rework: this solves lock contention in the key cache, eliminating the biggest source of the srcu lock hold time warnings, and drastically improving performance on some metadata heavy workloads - on multithreaded creates we're now 3-4x faster than xfs.
- We're now using an rhashtable instead of the system inode hash table; this is another significant performance improvement on multithreaded metadata workloads, eliminating more lock contention.
- for_each_btree_key_in_subvolume_upto(): new helper for iterating over keys within a specific subvolume, eliminating a lot of open coded "subvolume_get_snapshot()" and also fixing another source of srcu lock time warnings, by running each loop iteration in its own transaction (as the existing for_each_btree_key() does).
- More work on btree_trans locking asserts; we now assert that we don't hold btree node locks when trans->locked is false, which is important because we don't use lockdep for tracking individual btree node locks.
- Some cleanups and improvements in the bset.c btree node lookup code, from Alan.
- Rework of btree node pinning, which we use in backpointers fsck. The old hacky implementation, where the shrinker just skipped over nodes in the pinned range, was causing OOMs; instead we now use another shrinker with a much higher seeks number for pinned nodes.
- Rebalance now uses BCH_WRITE_ONLY_SPECIFIED_DEVS; this fixes an issue where rebalance would sometimes fall back to allocating from the full filesystem, which is not what we want when it's trying to move data to a specific target.
- Use __GFP_ACCOUNT, GFP_RECLAIMABLE for btree node, key cache allocations.
- Idmap mounts are now supported (Hongbo Li)
- Rename whiteouts are now supported (Hongbo Li)
- Erasure coding can now handle devices being marked as failed, or forcibly removed. We still need the evacuate path for erasure coding, but it's getting very close to ready for people to start using.
* tag 'bcachefs-2024-09-21' of git://evilpiepirate.org/bcachefs: (99 commits)
  bcachefs: return err ptr instead of null in read sb clean
  bcachefs: Remove duplicated include in backpointers.c
  bcachefs: Don't drop devices with stripe pointers
  bcachefs: bch2_ec_stripe_head_get() now checks for change in rw devices
  bcachefs: bch_fs.rw_devs_change_count
  bcachefs: bch2_dev_remove_stripes()
  bcachefs: bch2_trigger_ptr() calculates sectors even when no device
  bcachefs: improve error messages in bch2_ec_read_extent()
  bcachefs: improve error message on too few devices for ec
  bcachefs: improve bch2_new_stripe_to_text()
  bcachefs: ec_stripe_head.nr_created
  bcachefs: bch_stripe.disk_label
  bcachefs: stripe_to_mem()
  bcachefs: EIO errcode cleanup
  bcachefs: Rework btree node pinning
  bcachefs: split up btree cache counters for live, freeable
  bcachefs: btree cache counters should be size_t
  bcachefs: Don't count "skipped access bit" as touched in btree cache scan
  bcachefs: Failed devices no longer require mounting in degraded mode
  bcachefs: bch2_dev_rcu_noerror()
  ...
2024-09-21 | bcachefs: Hold read lock in bch2_snapshot_tree_oldest_subvol() | Ahmed Ehab | 1 | -0/+2
Syzbot reports a problem that a warning is triggered due to suspicious use of rcu_dereference_check(). That is triggered by a call of bch2_snapshot_tree_oldest_subvol(). The cause of the warning is that inside bch2_snapshot_tree_oldest_subvol(), snapshot_t() is called which calls rcu_dereference() that requires a read lock to be held. Also, the call of bch2_snapshot_tree_next() eventually calls snapshot_t(). To fix this, call rcu_read_lock() before calling snapshot_t(). Then, release the lock after the termination of the while loop. Reported-by: <syzbot+f7c41a878676b72c16a6@syzkaller.appspotmail.com> Signed-off-by: Ahmed Ehab <bottaawesome633@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-21 | bcachefs: return err ptr instead of null in read sb clean | Diogo Jahchan Koike | 1 | -1/+1
syzbot reported a null-ptr-deref in bch2_fs_start. [0] When a sb is marked clean but doesn't have a clean section, bch2_read_superblock_clean returns NULL, which PTR_ERR_OR_ZERO lets through, eventually leading to a null ptr dereference down the line. Adjust read sb clean to return an ERR_PTR indicating the invalid clean section. [0] https://syzkaller.appspot.com/bug?extid=1cecc37d87c4286e5543 Reported-by: syzbot+1cecc37d87c4286e5543@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=1cecc37d87c4286e5543 Signed-off-by: Diogo Jahchan Koike <djahchankoike@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
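The reason NULL slips past PTR_ERR_OR_ZERO can be reproduced in userspace. The helpers below are lowercase imitations of the kernel's ERR_PTR, IS_ERR and PTR_ERR_OR_ZERO, written for illustration only; `struct clean_section` and the two reader functions are hypothetical, and `-EINVAL` is a stand-in for whatever error code the real fix returns.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ERRNO 4095

/* Userspace imitations of the kernel's error-pointer helpers. */
static void *err_ptr(long err)
{
	return (void *)(intptr_t)err;
}

static int is_err(const void *p)
{
	/* Error pointers occupy the top MAX_ERRNO addresses; NULL does not. */
	return (uintptr_t)p >= (uintptr_t)-MAX_ERRNO;
}

static int ptr_err_or_zero(const void *p)
{
	return is_err(p) ? (int)(intptr_t)p : 0;
}

/* Hypothetical stand-in for a superblock clean section. */
struct clean_section {
	int journal_seq;
};

static struct clean_section the_section = { 42 };

/* Old behaviour: a missing section is reported as NULL. */
static struct clean_section *read_clean_null(int present)
{
	return present ? &the_section : NULL;
}

/* New behaviour: a missing section is reported as an error pointer. */
static struct clean_section *read_clean_errptr(int present)
{
	return present ? &the_section : err_ptr(-EINVAL);
}
```

With the first reader, `ptr_err_or_zero()` returns 0 for a missing section, so the caller treats NULL as success and later dereferences it. With the second, the same check yields a nonzero error and the caller can bail out.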
2024-09-21 | bcachefs: Remove duplicated include in backpointers.c | Yang Li | 1 | -1/+0
The header file bbpos.h is included twice in backpointers.c, so one of the inclusions can be removed. Reported-by: Abaci Robot <abaci@linux.alibaba.com> Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=10783 Signed-off-by: Yang Li <yang.lee@linux.alibaba.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-21 | bcachefs: Don't drop devices with stripe pointers | Kent Overstreet | 4 | -9/+32
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-21 | bcachefs: bch2_ec_stripe_head_get() now checks for change in rw devices | Kent Overstreet | 2 | -27/+60
This factors out ec_stripe_head_devs_update(), which initializes the bitmap of devices we're allocating from, and runs it every time c->rw_devs_change_count changes. We also cancel pending, not allocated stripes, since they may refer to devices that are no longer available. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-21 | bcachefs: bch_fs.rw_devs_change_count | Kent Overstreet | 2 | -4/+9
Add a counter that's incremented whenever rw devices change; this will be used for erasure coding so that it can keep ec_stripe_head in sync and not deadlock on a new stripe when a device it wants goes away. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
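A change counter of this kind lets a consumer detect cheaply that its cached view is stale. The sketch below is a simplified userspace illustration: only the field name `rw_devs_change_count` comes from the commit; `struct toy_fs`, `struct toy_stripe_head` and both functions are hypothetical, and locking is omitted.

```c
#include <stdbool.h>

/* Hypothetical filesystem state: a bitmap of rw devices plus a change counter. */
struct toy_fs {
	unsigned long rw_devs;
	unsigned      rw_devs_change_count;
};

/* Hypothetical consumer that caches the device bitmap. */
struct toy_stripe_head {
	unsigned long devs;
	unsigned      seen_change_count;
};

/* Every change to the rw device set bumps the counter. */
static void toy_fs_set_rw_devs(struct toy_fs *c, unsigned long devs)
{
	c->rw_devs = devs;
	c->rw_devs_change_count++;
}

/* Refresh the cached bitmap if the counter moved; returns true if it did. */
static bool toy_stripe_head_sync(struct toy_stripe_head *h, const struct toy_fs *c)
{
	if (h->seen_change_count == c->rw_devs_change_count)
		return false;

	h->devs = c->rw_devs;
	h->seen_change_count = c->rw_devs_change_count;
	return true;
}
```

The consumer compares one integer on each use instead of re-deriving the device set, and only does the expensive refresh when something actually changed.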
2024-09-21 | bcachefs: bch2_dev_remove_stripes() | Kent Overstreet | 4 | -3/+74
We can now correctly force-remove a device that has stripes on it; this uses the new BCH_SB_MEMBER_INVALID sentinel value. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-21 | bcachefs: bch2_trigger_ptr() calculates sectors even when no device | Kent Overstreet | 2 | -10/+21
This is necessary for erasure coded pointers to devices that have been removed. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-21 | bcachefs: improve error messages in bch2_ec_read_extent() | Kent Overstreet | 3 | -19/+23
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-21 | bcachefs: improve error message on too few devices for ec | Kent Overstreet | 1 | -3/+16
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-21 | bcachefs: improve bch2_new_stripe_to_text() | Kent Overstreet | 1 | -0/+2
also print out the new stripe key Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-21 | bcachefs: ec_stripe_head.nr_created | Kent Overstreet | 2 | -2/+6
additional debug stat Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-21 | bcachefs: bch_stripe.disk_label | Kent Overstreet | 4 | -16/+43
When reshaping existing stripes, we should keep them on the same target that they were allocated on; to do this, we need to add a field to the btree stripe type. This is a tad awkward, because we only have 8 bits left, and targets are 16 bits - but we only need to store a label, not a full target. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-21 | bcachefs: stripe_to_mem() | Kent Overstreet | 1 | -18/+15
factor out a common helper Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-21 | bcachefs: EIO errcode cleanup | Kent Overstreet | 5 | -27/+33
We want to be using private errcodes whenever possible, for better error messages. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-21 | bcachefs: Rework btree node pinning | Kent Overstreet | 7 | -75/+150
In backpointers fsck, we do a sequential scan of one btree, and check references to another: extents <-> backpointers. Checking references generates random lookups, so we want to pin that btree in memory (or only a range, if it doesn't fit in ram). Previously, this was done with a simple check in the shrinker - "if btree node is in range being pinned, don't free it" - but this generated OOMs, as our shrinker wasn't well behaved if there was less memory available than expected. Instead, we now have two different shrinkers and lru lists; the second shrinker being for pinned nodes, with seeks set much higher than normal - so they can still be freed if necessary, but we'll prefer not to. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-21 | bcachefs: split up btree cache counters for live, freeable | Kent Overstreet | 6 | -32/+47
this is prep for introducing a second live list and shrinker for pinned nodes Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-21 | bcachefs: btree cache counters should be size_t | Kent Overstreet | 6 | -36/+37
32 bits won't overflow any time soon, but size_t is the correct type for counting objects in memory. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-21 | bcachefs: Don't count "skipped access bit" as touched in btree cache scan | Kent Overstreet | 1 | -0/+1
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-21 | bcachefs: Failed devices no longer require mounting in degraded mode | Kent Overstreet | 1 | -1/+1
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-21 | bcachefs: bch2_dev_rcu_noerror() | Kent Overstreet | 6 | -13/+22
bch2_dev_rcu() now properly errors if the device is invalid Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-21 | bcachefs: Progress indicator for extents_to_backpointers | Kent Overstreet | 1 | -6/+82
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-21 | bcachefs: bch2_opts_to_text() | Kent Overstreet | 3 | -21/+35
Factor out bch2_show_options() into a generic helper, for debugging option passing issues. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-21 | bcachefs: improve "no device to read from" message | Kent Overstreet | 1 | -1/+7
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-21 | bcachefs: Fix compilation error for bch2_sb_member_alloc | Hongbo Li | 1 | -6/+10
Fix the following compilation error:
```
fs/bcachefs/sb-members.c: In function ‘bch2_sb_member_alloc’:
fs/bcachefs/sb-members.c:508:2: error: a label can only be part of a statement and a declaration is not a statement
  508 |  unsigned nr_devices = max_t(unsigned, dev_idx + 1, c->sb.nr_devices);
```
Fixes: a7d364a133c7 ("bcachefs: bch2_sb_member_alloc()") Signed-off-by: Hongbo Li <lihongbo22@huawei.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
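The error arises because, before C23, a label must be followed by a statement, and a declaration is not a statement. One common way to satisfy older compilers is to put an empty statement after the label; moving the declaration above the label works too. The sketch below shows the first approach and is not claimed to be what the actual patch does: `max_u()` and `needed_nr_devices()` are hypothetical names.

```c
/* Hypothetical helper standing in for the kernel's max_t(unsigned, a, b). */
static unsigned max_u(unsigned a, unsigned b)
{
	return a > b ? a : b;
}

/*
 * Returns the device count needed to hold slot dev_idx. If no free slot
 * was found, the new device is appended at index nr_devices.
 */
static unsigned needed_nr_devices(int found_slot, unsigned dev_idx,
				  unsigned nr_devices)
{
	if (found_slot)
		goto have_slot;

	dev_idx = nr_devices;	/* no free slot: append at the end */
have_slot:
	;	/* empty statement, so the label is not directly followed by a declaration */
	unsigned nr = max_u(dev_idx + 1, nr_devices);

	return nr;
}
```

Removing the lone `;` reproduces the diagnostic on compilers that enforce the pre-C23 rule.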
2024-09-21 | bcachefs: bch2_sb_member_alloc() | Kent Overstreet | 3 | -46/+53
refactoring Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-21 | bcachefs: bch2_dev_remove_alloc() -> alloc_background.c | Kent Overstreet | 3 | -27/+30
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-21 | bcachefs: Move tabstop setup to bch2_dev_usage_to_text() | Kent Overstreet | 2 | -7/+9
No reason for it not to be where it's needed. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-21 | bcachefs: Options for recovery_passes, recovery_passes_exclude | Kent Overstreet | 8 | -20/+33
This adds mount options for specifying recovery passes to run, or exclude; the immediate need for this is that backpointers fsck is having trouble completing, so we need a way to skip it. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-21 | bcachefs: Use mm_account_reclaimed_pages() when freeing btree nodes | Kent Overstreet | 1 | -0/+11
When freeing in a shrinker callback, we need to notify memory reclaim, so it knows forward progress has been made. Normally this is done in e.g. slab code, but we're not freeing through slab - or rather we are, but these allocations are big, and use the kmalloc_large() path. This is really a bug in the slub code, but we're working around it here for now. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-21 | bcachefs: Use __GFP_ACCOUNT for reclaimable memory | Kent Overstreet | 2 | -0/+4
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-21 | bcachefs: Hook up RENAME_WHITEOUT in rename. | Sasha Finkelstein | 4 | -14/+52
This is needed for overlayfs, which is used by container managers. Signed-off-by: Sasha Finkelstein <fnkl.kernel@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-21 | bcachefs: rebalance writes use BCH_WRITE_ONLY_SPECIFIED_DEVS | Kent Overstreet | 2 | -2/+3
this was an oversight: rebalance is moving data to a specific device, so we don't want it falling back to the full filesystem Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-21 | bcachefs: BCH_WRITE_ALLOC_NOWAIT no longer applies to open bucket allocation | Kent Overstreet | 3 | -12/+16
rebalance writes must be BCH_WRITE_ALLOC_NOWAIT because they don't allocate from the full filesystem - but we don't want spurious allocation failures due to open buckets. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-21 | bcachefs: fix prototype to bch2_alloc_sectors_start_trans() | Kent Overstreet | 4 | -17/+18
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-21 | bcachefs: kill redundant is_vmalloc_addr() | Kent Overstreet | 1 | -8/+4
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>