Age | Commit message | Author | Files | Lines
2023-10-23 | bcachefs: Improved copygc wait debugging | Kent Overstreet | 2 | -1/+10
This just adds a line for how long copygc has been waiting to sysfs copygc_wait, helpful for debugging why copygc isn't running. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23 | bcachefs: Call bch2_path_put_nokeep() before bch2_path_put() | Kent Overstreet | 1 | -3/+3
bch2_path_put_nokeep() is sketchy, and we should consider removing it: it unconditionally frees btree_paths once their ref hits 0. The assumption is that we only use it for paths that have never been visible outside the core btree code; i.e. higher level code will never be making assumptions about locking based on these paths.

However, there's subtle brokenness with this approach:

- If we call bch2_path_put(), then bch2_path_put_nokeep(), bch2_path_put() may free the first path on the assumption that we have another path keeping a node locked - but then bch2_path_put_nokeep() just unconditionally frees it.

The same bug may arise if we're calling bch2_path_put() and bch2_path_put_nokeep() on the same (refcounted) path, or on two adjacent paths that point to the same btree node.

This patch hacks around one of these bugs by calling bch2_path_put_nokeep() first in bch2_trans_iter_exit().

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23 | bcachefs: drop unnecessary journal stuck check from space calculation | Brian Foster | 1 | -18/+1
The journal stuck check in bch2_journal_space_available() is particularly aggressive and can lead to premature shutdown in some rare cases. This is difficult to reproduce, but since it also comes with a fatal error it is worth being cautious about.

For example, we've seen instances where the journal is under heavy reservation pressure, the journal allocation path transitions into the final available journal bucket, and the journal write path immediately consumes that bucket and calls into bch2_journal_space_available(), which in turn flags the journal as stuck because there is no available space and shuts down the filesystem instead of submitting the journal write (which would otherwise have succeeded).

To avoid this problem, simplify the journal stuck checking by just relying on the higher level logic in the journal reservation path. This produces more useful debug output and is a more reliable indicator that things have bogged down.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23 | bcachefs: refactor journal stuck checking into standalone helper | Brian Foster | 1 | -22/+63
bcachefs checks for journal stuck conditions both in the journal space calculation code and the journal reservation slow path. The logic in both places is rather tricky and can result in non-deterministic failure characteristics and debug output. In preparation to condense journal stuck handling to a single place, refactor the __journal_res_get() logic into a standalone helper. Since multiple callers into the reservation code can result in duplicate reports, use the ->err_seq field as a serialization mechanism for the debug dump. Finally, add some comments to help explain the logic and hopefully facilitate further improvements in the future. Signed-off-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23 | bcachefs: gracefully unwind journal res slowpath on shutdown | Brian Foster | 1 | -0/+7
bcachefs detects journal stuck conditions in a couple different places. If the logic in the journal reservation slow path happens to detect the problem, I've seen instances where the filesystem remains deadlocked even though it has been shut down. This is occasionally reproduced by generic/333, and usually manifests as one or more tasks stuck in the journal reservation slow path. To help avoid this problem, repeat the journal error check in __journal_res_get() once under spinlock to cover the case where the previous lock holder might have triggered shutdown. This also helps avoid spurious/duplicate stuck reports. Also, wake the journal from the halt code to make sure blocked callers of the journal res slowpath have a chance to wake up and observe the pending error. This survives an overnight looping run of generic/333 without the aforementioned lockups. Signed-off-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
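A minimal sketch of the "re-check the error under the lock" pattern this commit describes; the struct, field names and error codes below are stand-ins rather than the actual bcachefs journal types:

#include <pthread.h>
#include <errno.h>

struct fake_journal {
    pthread_mutex_t lock;
    int error;              /* set by the halt/shutdown path */
    unsigned long space;    /* remaining reservation space */
};

static int res_get_slowpath(struct fake_journal *j, unsigned long want)
{
    /* unlocked check, as before */
    if (j->error)
        return -EROFS;

    pthread_mutex_lock(&j->lock);

    /*
     * The previous lock holder may have flagged the journal as errored
     * or shut down while we were blocked on the lock: check again so we
     * unwind instead of waiting forever for space that will never appear.
     */
    if (j->error) {
        pthread_mutex_unlock(&j->lock);
        return -EROFS;
    }

    if (j->space < want) {
        pthread_mutex_unlock(&j->lock);
        return -EAGAIN;     /* caller blocks and retries */
    }

    j->space -= want;
    pthread_mutex_unlock(&j->lock);
    return 0;
}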
2023-10-23 | bcachefs: more aggressive fast path write buffer key flushing | Brian Foster | 1 | -21/+22
The btree write buffer flush code is prone to causing journal deadlock due to inefficient use and release of reservation space. Reservation is not pre-reserved for write buffered keys (as is done for key cache keys, for example), because the write buffer flush side uses a fast path that attempts insertion without need for any reservation at all.

The write buffer flush attempts to deal with this by inserting keys using the BTREE_INSERT_JOURNAL_RECLAIM flag to return an error on journal reservations that require blocking. Upon first error, it falls back to a slow path that inserts in journal order and supports moving the associated journal pin forward.

The problem is that under pathological conditions (i.e. smaller log, larger write buffer and journal reservation pressure), we've seen instances where the fast path fails fairly quickly without having completed many insertions, and then the slow path is unable to push the journal pin forward enough to free up the space it needs to completely flush the buffer. This problem is occasionally reproduced by fstest generic/333.

To avoid this problem, update the fast path algorithm to skip key inserts that fail due to inability to acquire needed journal reservation without immediately breaking out of the loop. Instead, insert as many keys as possible, zap the sequence numbers to mark them as processed, and then fall back to the slow path to process the remaining set in journal order. This reduces the amount of journal reservation that might be required to flush the entire buffer and increases the odds that the slow path is able to move the journal pin forward and free up space as keys are processed.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
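Roughly, the new fast path behaves like the sketch below; the key struct and the two insert helpers are illustrative stand-ins, not the real write buffer code:

#include <stddef.h>

struct wb_key {
    unsigned long journal_seq;  /* 0 means "already flushed" */
    /* ... the key itself ... */
};

/* stand-in: fails with nonzero when it would block on journal reservation */
extern int try_insert_nonblocking(struct wb_key *k);
/* stand-in: slow path that inserts in journal order, moving the pin forward */
extern int insert_in_journal_order(struct wb_key *keys, size_t nr);

static int flush_write_buffer(struct wb_key *keys, size_t nr)
{
    size_t skipped = 0;

    for (size_t i = 0; i < nr; i++) {
        if (try_insert_nonblocking(&keys[i])) {
            /* old behaviour: break out here; new behaviour: keep going */
            skipped++;
            continue;
        }

        /* inserted: zap the seq so the slow path skips this key */
        keys[i].journal_seq = 0;
    }

    if (!skipped)
        return 0;

    /* whatever is left gets processed in journal order */
    return insert_in_journal_order(keys, nr);
}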
2023-10-23 | bcachefs: use dedicated workqueue for tasks holding write refs | Brian Foster | 5 | -5/+15
A workqueue resource deadlock has been observed when running fsck on a filesystem with a full/stuck journal. fsck is not currently able to repair the fs due to fairly rapid emergency shutdown, but rather than exit gracefully the fsck process hangs during the shutdown sequence. Fortunately this is easily recoverable from userspace, but the root cause involves code shared between the kernel and userspace and so should be addressed.

The deadlock scenario involves the main task in the bch2_fs_stop() -> bch2_fs_read_only() path waiting on write references to drain with the fs state lock held. A bch2_read_only_work() workqueue task is scheduled on the system_long_wq, blocked on the state lock. Finally, various other write ref holding workqueue tasks are scheduled to run on the same workqueue and must complete in order to release references that the initial task is waiting on.

To avoid this problem, we can split the dependent workqueue tasks across different workqueues. It's a bit of a waste to create a dedicated wq for the read-only worker, but there are several tasks throughout the fs that follow the pattern of acquiring a write reference and then scheduling to the system wq. Use a local wq for such tasks to break the subtle dependency between these and the read-only worker.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
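The shape of the fix, using the stock kernel workqueue API; the queue name, flags and work item below are illustrative rather than the exact ones added by this patch:

#include <linux/workqueue.h>

static struct workqueue_struct *write_ref_wq;

static void write_ref_holding_work_fn(struct work_struct *work)
{
    /* ... does its work while holding a filesystem write ref ... */
}
static DECLARE_WORK(example_work, write_ref_holding_work_fn);

static int example_init(void)
{
    /* dedicated queue instead of schedule_work()/system_long_wq */
    write_ref_wq = alloc_workqueue("bcachefs_write_ref", WQ_FREEZABLE, 0);
    if (!write_ref_wq)
        return -ENOMEM;

    /* write-ref-holding tasks now can't queue behind the read-only worker */
    queue_work(write_ref_wq, &example_work);
    return 0;
}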
2023-10-23 | bcachefs: remove unused bch2_trans_log_msg() | Brian Foster | 2 | -13/+0
Signed-off-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23 | bcachefs: Fix bch2_verify_bucket_evacuated() | Kent Overstreet | 1 | -0/+5
We were going into an infinite loop when printing out backpointers, due to never incrementing bp_offset - whoops. Also limit the number of backpointers we print to 10; this is debug code and we only need to print a sample, not all of them. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
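Sketched out, the fixed loop looks something like the following; get_next_backpointer() and the backpointer struct are stand-ins for the real helpers:

#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

struct backpointer { uint64_t bucket_offset; /* ... */ };

/* stand-in: finds the next backpointer at or after *bp_offset, false if none */
extern bool get_next_backpointer(uint64_t bucket, uint64_t *bp_offset,
                                 struct backpointer *bp);

static void print_bucket_backpointers(uint64_t bucket)
{
    uint64_t bp_offset = 0;
    struct backpointer bp;
    unsigned nr = 0;

    while (get_next_backpointer(bucket, &bp_offset, &bp)) {
        if (nr++ >= 10) {   /* debug output: a sample is enough */
            printf("  ...\n");
            break;
        }

        printf("  backpointer at bucket offset %llu\n",
               (unsigned long long) bp.bucket_offset);

        /* the missing increment: without it we reprint the same entry forever */
        bp_offset++;
    }
}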
2023-10-23 | bcachefs: verify_bucket_evacuated() -> set_btree_iter_dontneed() | Kent Overstreet | 1 | -0/+3
This should help with excessive 'would deadlock' transaction restarts. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23 | bcachefs: Make reconstruct_alloc quieter | Kent Overstreet | 2 | -36/+39
We shouldn't be printing out fsck errors for expected errors - this helps make test logs more readable, and makes it easier to see what the actual failure was. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23 | bcachefs: Fix an unhandled transaction restart error | Kent Overstreet | 1 | -0/+5
This is a bit awkward: we're passing around a btree_trans, but we're not in a context where transaction restarts are handled - we should try to come up with a better way to denote situations like this. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23 | bcachefs: Fix nocow write path closure bug | Kent Overstreet | 1 | -4/+4
With regular waitlists, we need to ensure we always call finish_wait(). With closures, the equivalent is that we need to call closure_sync() before returning with a stack-allocated closure. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
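The rule in sketch form, using the kernel closure API; the surrounding function and the elided submission step are illustrative:

#include <linux/closure.h>

static int nocow_write_example(void)
{
    struct closure cl;
    int ret;

    closure_init_stack(&cl);

    /* ... take refs on @cl, submit async work, set @ret ... */
    ret = 0;

    /*
     * Every return path - including the error path - must wait for
     * outstanding refs before @cl goes out of scope, otherwise async
     * completions would touch freed stack memory. This is the closure
     * equivalent of always pairing prepare_to_wait() with finish_wait().
     */
    closure_sync(&cl);
    return ret;
}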
2023-10-23 | bcachefs: Nocow write error path fix | Kent Overstreet | 1 | -11/+7
The nocow write error path was iterating over pointers in an extent, after we'd dropped btree locks - oops. Fortunately we'd already stashed what we need in nocow_lock_bucket, so use that instead. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23 | bcachefs: Fix bch2_extent_fallocate() in nocow mode | Kent Overstreet | 2 | -0/+9
When we allocate disk space, we need to be incrementing the WRITE io clock, which perhaps should be renamed to sectors allocated - copygc uses this io clock to know when to run. Also, we should be incrementing the same clock when allocating btree nodes. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23 | bcachefs: Add an assert in inode_write for -ENOENT | Kent Overstreet | 1 | -0/+5
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23 | bcachefs: Fix bch2_evict_subvolume_inodes() | Kent Overstreet | 6 | -38/+81
This fixes a bug in bch2_evict_subvolume_inodes(): d_mark_dontcache() doesn't handle the case where i_count is already 0, we need to grab and put the inode in order for it to be dropped. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
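In sketch form, the fix boils down to the following pattern; the helper name is illustrative and the iteration over the subvolume's inodes is omitted:

#include <linux/fs.h>
#include <linux/dcache.h>

static void evict_one_inode(struct inode *inode)
{
    /*
     * d_mark_dontcache() only arranges for the next iput() to evict the
     * inode; if i_count is already 0 there is no next iput(), so take
     * and drop a reference ourselves.
     */
    if (!igrab(inode))
        return;     /* already on its way out */

    d_mark_dontcache(inode);
    iput(inode);
}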
2023-10-23 | bcachefs: Improve error handling in bch2_ioctl_subvolume_destroy() | Kent Overstreet | 1 | -7/+8
Pure style fixes. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23 | bcachefs: Fix for 'missing subvolume' error | Kent Overstreet | 1 | -11/+19
Subvolumes, including their root inodes, get deleted asynchronously after an unlink. But we still need to ensure that we tell the VFS the inode has been deleted, otherwise VFS writeback could fire after asynchronous deletion has finished, and try to write to an inode/subvolume that no longer exists. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23 | bcachefs: Don't run transaction hooks multiple times | Kent Overstreet | 1 | -8/+8
Transaction hooks aren't supposed to run unless we know the transaction is going to commit successfully: this fixes a bug with attempting to delete a subvolume multiple times. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23 | bcachefs: Add a fallback when journal_keys doesn't fit in ram | Kent Overstreet | 1 | -18/+49
We may end up in a situation where allocating the buffer for the sorted journal_keys fails - but it would likely succeed, post compaction where we drop duplicates. We've had reports of this allocation failing, so this adds a slowpath to do the compaction incrementally. This is only a band-aid fix; we need to look at limiting the number of keys in the journal based on the amount of system RAM. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
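The compaction itself is simple; here is a generic sketch of dropping superseded duplicates in place from a sorted array, with stand-in types for the journal keys:

#include <stddef.h>

struct jkey {
    unsigned long id;   /* stand-in for btree id + position */
    unsigned long seq;  /* newer seq supersedes older */
};

/* keys sorted by id, then seq ascending: the last duplicate is the newest */
static size_t compact_keys(struct jkey *keys, size_t nr)
{
    size_t dst = 0;

    for (size_t src = 0; src < nr; src++) {
        /* overwrite the previous entry if this one supersedes it */
        if (dst && keys[dst - 1].id == keys[src].id)
            dst--;
        keys[dst++] = keys[src];
    }

    return dst;     /* new, smaller count */
}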
2023-10-23 | bcachefs: Improve the backpointer to missing extent message | Kent Overstreet | 2 | -16/+25
We now print the pos where the backpointer was found in the btree, as well as the exact bucket:bucket_offset of the data, to aid in grepping through logs. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23 | bcachefs: Add error message for failing to allocate sorted journal keys | Kent Overstreet | 1 | -1/+4
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23 | bcachefs: New erasure coding shutdown path | Kent Overstreet | 10 | -94/+141
This implements a new shutdown path for erasure coding, which is needed for the upcoming BCH_WRITE_WAIT_FOR_EC write path.

The process is:
- Cancel new stripes being built up
- Close out/cancel open buckets on write points or the partial list that are for stripes
- Shutdown rebalance/copygc
- Then wait for in flight new stripes to finish

With BCH_WRITE_WAIT_FOR_EC, move ops will be waiting on stripes to fill up before they complete; the new ec shutdown path is needed for shutting down copygc/rebalance without deadlocking.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23 | bcachefs: bch2_fs_moving_ctxts_to_text() | Kent Overstreet | 7 | -67/+179
This also adds bch2_write_op_to_text(): now we can see outstanding moves, useful for debugging shutdown with the upcoming BCH_WRITE_WAIT_FOR_EC and likely for other things in the future. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23 | bcachefs: Private error codes: ENOMEM | Kent Overstreet | 29 | -157/+259
This adds private error codes for most (but not all) of our ENOMEM uses, which makes it easier to track down assorted allocation failures. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23 | bcachefs: Fix bch2_check_extents_to_backpointers() | Kent Overstreet | 1 | -11/+19
In rare cases, bch2_check_extents_to_backpointers() would incorrectly flag an extent as having a missing backpointer when we just needed to flush the btree write buffer - we weren't tracking the last flushed position correctly. This adds a level field to the last_flushed pos, fixing a bug where we'd sometimes fail on a new root node. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23 | bcachefs: Fix an assert in copygc thread shutdown path | Kent Overstreet | 1 | -1/+1
We're not supposed to have nested (locked) btree_trans on the stack: this means copygc shutdown needs to exit our btree_trans before exiting the move_ctxt, which calls bch2_write(). Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23 | bcachefs: bch2_bucket_is_movable() -> BTREE_ITER_CACHED | Kent Overstreet | 1 | -1/+1
BTREE_ITER_CACHED should really be the default for cached btrees - this is an easy mistake to make. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23 | bcachefs: Don't use BTREE_ITER_INTENT in make_extent_indirect() | Kent Overstreet | 1 | -1/+1
This is a workaround for a btree path overflow - searching with BTREE_ITER_INTENT periodically saves the iterator position for updates, which eventually overflows. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23 | bcachefs: Fix stripe create error path | Kent Overstreet | 1 | -6/+8
If we errored out on a new stripe before fully allocating it, we shouldn't be zeroing out unwritten data. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23 | bcachefs: Mark new snapshots earlier in create path | Kent Overstreet | 1 | -2/+4
This fixes a null ptr deref when creating new snapshots: bch2_create_trans() will lookup the subvolume and find the _new_ snapshot in the BCH_CREATE_SUBVOL path that's being created in that transaction. We have to call bch2_mark_snapshot() earlier so that it's properly initialized, instead of leaving it for transaction commit. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23 | bcachefs: Improve bch2_new_stripes_to_text() | Kent Overstreet | 1 | -5/+9
Print out the alloc reserve, and format it a bit more nicely. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23 | bcachefs: Kill bch_write_op->btree_update_ready | Kent Overstreet | 2 | -25/+14
This changes the write path to not add write ops to the write_point's list of pending work items until it's ready; this means we have to change the lock protecting it to an irq-safe lock, but means bch2_write_point_do_index_updates() no longer has to iterate over the list, which is beneficial with the way the new BCH_WRITE_WAIT_FOR_EC code works. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
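The locking change looks roughly like this; the list and function names are illustrative, not the actual write point code:

#include <linux/spinlock.h>
#include <linux/list.h>

static LIST_HEAD(pending_writes);
static DEFINE_SPINLOCK(pending_lock);

/* may now be called from interrupt context, when a write op becomes ready */
static void add_pending_write(struct list_head *op_entry)
{
    unsigned long flags;

    spin_lock_irqsave(&pending_lock, flags);
    list_add_tail(op_entry, &pending_writes);
    spin_unlock_irqrestore(&pending_lock, flags);
}

/* process context: grab the whole list, then work on it unlocked */
static void do_index_updates(void)
{
    unsigned long flags;
    LIST_HEAD(todo);

    spin_lock_irqsave(&pending_lock, flags);
    list_splice_init(&pending_writes, &todo);
    spin_unlock_irqrestore(&pending_lock, flags);

    /* ... process @todo ... */
}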
2023-10-23 | bcachefs: Simplify stripe_idx_to_delete | Kent Overstreet | 1 | -5/+4
This is not technically correct - it's subject to a race if we ever end up with a stripe with all empty blocks (that needs to be deleted) being held open. But the "correct" version was much too inefficient, and soon we'll be adding a stripes LRU. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23 | bcachefs: Fix next_bucket() | Kent Overstreet | 1 | -1/+1
This fixes an infinite loop in bch2_get_key_or_real_bucket_hole(). Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23 | bcachefs: Second layer of refcounting for new stripes | Kent Overstreet | 3 | -23/+49
This will be used for move writes, which will be waiting until the stripe is created to do the index update. They need to prevent the stripe from being reclaimed until their index update is done, so we need another refcount that just keeps the stripe open.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# Conflicts:
#	fs/bcachefs/ec.c
#	fs/bcachefs/io.c
2023-10-23 | bcachefs: ec: fall back to creating new stripes for copygc | Kent Overstreet | 1 | -0/+8
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23 | bcachefs: Rework __bch2_data_update_index_update() | Kent Overstreet | 1 | -52/+52
This makes some improvements to the logic for adding/removing replicas, as part of the larger erasure coding improvements. We now directly consider number of replicas desired for the given inode, and extent/pointer durability: this ensures that the extent ends up with the desired number of replicas when we're replacing multiple pointers with one that has higher durability (e.g. erasure coded). Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
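The durability accounting amounts to something like the sketch below, with stand-in types; the point is that a single pointer (e.g. an erasure coded one) can count for more than one replica:

#include <stdbool.h>

struct ptr_info {
    unsigned durability;  /* how many replicas this pointer is worth */
    bool keep;
};

static void pick_ptrs(struct ptr_info *ptrs, unsigned nr, unsigned replicas_want)
{
    unsigned have = 0;

    for (unsigned i = 0; i < nr; i++) {
        if (have >= replicas_want) {
            ptrs[i].keep = false;   /* already durable enough, drop extras */
            continue;
        }

        ptrs[i].keep = true;
        have += ptrs[i].durability;
    }
}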
2023-10-23 | bcachefs: Extent helper improvements | Kent Overstreet | 7 | -30/+52
- __bch2_bkey_drop_ptr() -> bch2_bkey_drop_ptr_noerror(), now available outside extents.
- Split bch2_bkey_has_device() and bch2_bkey_has_device_c(), const and non const versions
- bch2_extent_has_ptr() now returns the pointer it found

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23 | bcachefs: evacuate_bucket() no longer moves cached ptrs | Kent Overstreet | 1 | -1/+6
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23 | bcachefs: evacuate_bucket() no longer calls verify_bucket_evacuated() | Kent Overstreet | 1 | -8/+0
The copygc code itself now calls this when all moves from a given bucket are complete. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23 | bcachefs: Suppress transaction restart err message | Kent Overstreet | 1 | -2/+2
This isn't a real error, and doesn't need to be printed. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23 | bcachefs: Rework open bucket partial list allocation | Kent Overstreet | 6 | -167/+228
Now, any open_bucket can go on the partial list: allocating from the partial list has been moved to its own dedicated function, open_bucket_add_buckets() -> bucket_alloc_set_partial(). In particular, this means that erasure coded buckets can safely go on the partial list; the new location works with the "allocate an ec bucket first, then the rest" logic. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23 | bcachefs: don't bump key cache journal seq on nojournal commits | Brian Foster | 3 | -6/+21
fstest generic/388 occasionally reproduces corruptions where an inode has extents beyond i_size. This is a deliberate crash and recovery test, and the post crash+recovery characteristics are usually the same: the inode exists on disk in an early (i.e. just allocated) state based on the journal sequence number associated with the inode. Subsequent inode updates exist in the journal at higher sequence numbers, but the inode hadn't been written back before the associated crash and the post-crash recovery processes a set of journal sequence numbers that doesn't include updates to the inode. In fact, the sequence with the most recent inode key update always happens to be the sequence just before the front of the journal processed by recovery.

This last bit is a significant hint that the problem relates to an on-disk journal update of the front of the journal. The root cause of this problem is basically that the inode is updated (multiple times) in-core and in the key cache, each time bumping the key cache sequence number used to control the cache flush. The cache flush skips one or more times, bumping the associated key cache journal pin to the key cache seq value. This has a side effect of holding the inode in memory a bit longer than normal, which helps exacerbate this problem, but is also unsafe in certain cases where the key cache seq may have been updated by a transaction commit that didn't journal the associated key.

For example, consider an inode that has been allocated, updated several times in the key cache, journaled, but not yet written back. At this stage, everything should be consistent if the fs happens to crash because the latest update has been journaled. Now consider a key update via bch2_extent_update_i_size_sectors() that uses the BTREE_UPDATE_NOJOURNAL flag. While this update may not change inode state, it can have the side effect of bumping ck->seq in bch2_btree_insert_key_cached(). In turn, if a subsequent key cache flush skips due to seq not matching the former, the ck->journal pin is updated to ck->seq even though the most recent key update was not journaled. If this pin happens to reside at the front (tail) of the journal, this means a subsequent journal write can update last_seq to a value beyond that which includes the most recent update to the inode. If this occurs and the fs happens to crash before the inode happens to flush, recovery will see the latest last_seq, fail to recover the inode and leave the inode in the inconsistent state described above.

To avoid this problem, skip the key cache seq update on NOJOURNAL commits, except on initial pin add. Pass the insert entry directly to bch2_btree_insert_key_cached() to make the associated flag available and be consistent with btree_insert_key_leaf().

Signed-off-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
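Reduced to a sketch, the rule is roughly the following; this is an illustration of the logic only, not the actual bch2_btree_insert_key_cached() code:

#include <stdbool.h>

struct cached_key {
    bool pin_active;    /* journal pin already added? */
    unsigned long seq;  /* seq of the last journalled update */
};

static void insert_key_cached(struct cached_key *ck,
                              unsigned long trans_journal_seq,
                              bool nojournal)
{
    /* ... update the cached key itself ... */

    if (!ck->pin_active) {
        /* initial pin add: remember the seq regardless */
        ck->pin_active = true;
        ck->seq = trans_journal_seq;
    } else if (!nojournal) {
        /*
         * Only journalled updates may bump ck->seq; otherwise a later
         * flush could move the journal pin past an update that was
         * never written to the journal.
         */
        ck->seq = trans_journal_seq;
    }
}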
2023-10-23 | bcachefs: When shutting down, flush btree node writes last | Kent Overstreet | 6 | -49/+81
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23 | bcachefs: Verbose on by default when CONFIG_BCACHEFS_DEBUG=y | Kent Overstreet | 1 | -1/+7
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23 | fixup bcachefs: Use for_each_btree_key_upto() more consistently | Kent Overstreet | 1 | -1/+2
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23 | six locks: be more careful about lost wakeups | Kent Overstreet | 1 | -3/+11
This is a workaround for a lost wakeup bug we've been seeing - we still need to discover the actual bug. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23 | bcachefs: Journal resize fixes | Kent Overstreet | 4 | -96/+89
- Fix a sleeping-in-atomic bug due to calling bch2_journal_buckets_to_sb() under the journal lock.
- Additionally, now we mark buckets as journal buckets before adding them to the journal in memory and the superblock. This ensures that if we crash part way through we'll never be writing to journal buckets that aren't marked correctly.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>