summaryrefslogtreecommitdiff
path: root/fs/bcachefs/super.c
AgeCommit message (Collapse)AuthorFilesLines
2023-10-23bcachefs: Fix use-after-free in bch2_dev_add()Christophe JAILLET1-3/+1
If __bch2_dev_attach_bdev() fails, bch2_dev_free() is called twice. Once here and another time in the error handling path. This leads to several use-after-free. Remove the redundant call and only rely on the error handling path. Fixes: 6a44735653d4 ("bcachefs: Improved superblock-related error messages") Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23bcachefs: add module description to fix modpost warningBrian Foster1-0/+1
modpost produces the following warning: WARNING: modpost: missing MODULE_DESCRIPTION() in fs/bcachefs/bcachefs.o Add a module description for bcachefs. Signed-off-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23bcachefs: Heap allocate btree_transKent Overstreet1-7/+0
We're using more stack than we'd like in a number of functions, and btree_trans is the biggest object that we stack allocate. But we have to do a heap allocatation to initialize it anyways, so there's no real downside to heap allocating the entire thing. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23bcachefs: Fix W=12 build errorsKent Overstreet1-1/+1
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23bcachefs: BTREE_ID_logged_opsKent Overstreet1-0/+1
Add a new btree for long running logged operations - i.e. for logging operations that we can't do within a single btree transaction, so that they can be resumed if we crash. Keys in the logged operations btree will represent operations in progress, with the state of the operation stored in the value. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23bcachefs: Break up io.cKent Overstreet1-3/+6
More reorganization, this splits up io.c into - io_read.c - io_misc.c - fallocate, fpunch, truncate - io_write.c Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23bcachefs: Kill incorrect assertionKent Overstreet1-2/+0
In the bch2_fs_alloc() error path we call bch2_fs_free() without setting BCH_FS_STOPPING - this is fine. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23bcachefs: Convert more code to bch_err_msg()Kent Overstreet1-22/+21
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23bcachefs: restart journal reclaim thread on ro->rw transitionsBrian Foster1-0/+4
Commit c2d5ff36065a4 ("bcachefs: Start journal reclaim thread earlier") tweaked reclaim thread management to start a bit earlier in the mount sequence by moving the start call from __bch2_fs_read_write() to bch2_fs_journal_start(). This has the side effect of never starting the reclaim thread on a ro->rw transition, which can be observed by monitoring reclaim behavior via the journal_reclaim tracepoints. I.e. once an fs has remounted ro->rw, we only ever rely on direct reclaim from that point forward. Since bch2_journal_reclaim_start() properly handles the case where the reclaim thread has already been created, restore the start call in the read-write helper. This allows the reclaim thread to start early when appropriate and also exit/restart on remounts or freeze cycles. In the latter case it may be possible to simply allow the task to freeze rather than destroy it, but for now just fix the immediate bug. Signed-off-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23bcachefs: Fix bch2_mount error pathKent Overstreet1-0/+2
In the bch2_mount() error path, we were calling deactivate_locked_super(), which calls ->kill_sb(), which in our case was calling bch2_fs_free() without __bch2_fs_stop(). This changes bch2_mount() to just call bch2_fs_stop() directly. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23bcachefs: Split out snapshot.cKent Overstreet1-0/+1
subvolume.c has gotten a bit large, this splits out a separate file just for managing snapshot trees - BTREE_ID_snapshots. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23bcachefs: btree_journal_iter.cKent Overstreet1-0/+1
Split out a new file from recovery.c for managing the list of keys we read from the journal: before journal replay finishes the btree iterator code needs to be able to iterate over and return keys from the journal as well, so there's a fair bit of code here. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23bcachefs: sb-clean.cKent Overstreet1-0/+1
Pull code for bch_sb_field_clean out into its own file. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23bcachefs: Split up fs-io.[ch]Kent Overstreet1-1/+7
fs-io.c is too big - time for some reorganization - fs-dio.c: direct io - fs-pagecache.c: pagecache data structures (bch_folio), utility code Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23bcachefs: BCH_COMPAT_bformat_overflow_done no longer requiredKent Overstreet1-0/+1
Awhile back, we changed bkey_format generation to ensure that the packed representation could never represent fields larger than the unpacked representation. This was to ensure that bkey_packed_successor() always gave a sensible result, but in the current code bkey_packed_successor() is only used in a debug assertion - not for anything important. This kills the requirement that we've gotten rid of those weird bkey formats, and instead changes the assertion to check if we're dealing with an old weird bkey format. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23bcachefs: Print version, options earlier in startup pathKent Overstreet1-2/+3
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23bcachefs: Don't start copygc until recovery is finishedKent Overstreet1-6/+13
With "bcachefs: Snapshot depth, skiplist fields", we now can't run data move operations until after bch2_check_snapshots() is complete. Ideally we'd have the copygc (and rebalance) threads wait until c->curr_recovery_pass has advanced, but the waitlist handling is tricky - so for now, move starting copygc back to read_write_late(). Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23bcachefs: Convert more -EROFS to private error codesKent Overstreet1-5/+6
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23bcachefs: Delete redundant log messagesKent Overstreet1-13/+1
Now that we have distinct error codes for different memory allocation failures, the early init log messages are no longer needed. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23bcachefs: Assorted sparse fixesKent Overstreet1-3/+3
- endianness fixes - mark some things static - fix a few __percpu annotations - fix silent enum conversions Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23bcachefs: Allow for unknown btree IDsKent Overstreet1-0/+1
We need to allow filesystems with metadata from newer versions to be mountable and usable by older versions. This patch enables us to roll out new btrees without a new major version number; we can now handle btree roots for unknown btree types. The unknown btree roots will be retained, and fsck (including backpointers) will check them, the same as other btree types. We add a dynamic array for the extra, unknown btree roots, in addition to the fixed size btree root array, and add new helpers for looking up btree roots. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23bcachefs: flush journal to avoid invalid dev usage entries on recoveryBrian Foster1-0/+11
A crash immediately after device removal can result in an unmountable filesystem due to recovery failure. The following command reliably reproduces on a multi-device fs: bcachefs device remove <dev> && xfs_io -xc shutdown <mnt> The post-crash mount fails with an error similar to the following, reported by fsck: invalid journal entry dev_usage at offset 7994/8034 seq 12: bad dev, fixing This refers to a device usage entry in the journal that refers to the index of the just removed device. Recovery considers this an invalid entry and fails to proceed. Device usage entries are added to journal buffer writes via bch_journal_write() -> bch2_journal_super_entries_add_common(), which means any journal buffer write has content that refers to member devices at the time of the journal write. The device remove sequence already removes metadata references to the device being removed. It then flushes any pins that refer to the device, clears replica entries, removes the in-memory device object and lastly updates the superblock to reflect that the device is no longer present. The problem is that any journal writes that occur during this sequence will include a dev usage entry so long as the device is present. To avoid this problem, we can flush the journal once more after the device entry is removed from the in-core structures, but before the superblock is updated to fully remove the device on-disk. Signed-off-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23bcachefs: bch2_version_to_text()Kent Overstreet1-1/+2
Add a new helper for printing out metadata versions in a standard format. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23bcachefs: Fix a null ptr deref in bch2_fs_alloc() error pathKent Overstreet1-0/+1
This fixes a null ptr deref in bch2_free_pending_node_rewrites() when the list head wasn't initialized. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23bcachefs: New assertions when marking filesystem cleanKent Overstreet1-0/+5
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23bcachefs: Convert -ENOENT to private error codesKent Overstreet1-1/+1
As with previous conversions, replace -ENOENT uses with more informative private error codes. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23bcachefs: remove bucket_gens btree keys on device removalBrian Foster1-0/+2
If a device has keys in the bucket_gens btree associated with its buckets and is removed from a bcachefs volume, fsck will complain about the presence of keys associated with an invalid device index. A repair removes the associated keys and restores correctness. Update bch2_dev_remove_alloc() to remove device related keys at device removal time to avoid the problem. Signed-off-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23bcachefs: Run freespace init in device hot add pathKent Overstreet1-0/+4
Like in the recovery, and device add, we have to check if devices don't have the freespace btree initialized - this was missed in the device hot add path. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23bcachefs: use dedicated workqueue for tasks holding write refsBrian Foster1-0/+4
A workqueue resource deadlock has been observed when running fsck on a filesystem with a full/stuck journal. fsck is not currently able to repair the fs due to fairly rapid emergency shutdown, but rather than exit gracefully the fsck process hangs during the shutdown sequence. Fortunately this is easily recoverable from userspace, but the root cause involves code shared between the kernel and userspace and so should be addressed. The deadlock scenario involves the main task in the bch2_fs_stop() -> bch2_fs_read_only() path waiting on write references to drain with the fs state lock held. A bch2_read_only_work() workqueue task is scheduled on the system_long_wq, blocked on the state lock. Finally, various other write ref holding workqueue tasks are scheduled to run on the same workqueue and must complete in order to release references that the initial task is waiting on. To avoid this problem, we can split the dependent workqueue tasks across different workqueues. It's a bit of a waste to create a dedicated wq for the read-only worker, but there are several tasks throughout the fs that follow the pattern of acquiring a write reference and then scheduling to the system wq. Use a local wq for such tasks to break the subtle dependency between these and the read-only worker. Signed-off-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23bcachefs: Fix bch2_evict_subvolume_inodes()Kent Overstreet1-0/+3
This fixes a bug in bch2_evict_subvolume_inodes(): d_mark_dontcache() doesn't handle the case where i_count is already 0, we need to grab and put the inode in order for it to be dropped. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23bcachefs: New erasure coding shutdown pathKent Overstreet1-9/+3
This implements a new shutdown path for erasure coding, which is needed for the upcoming BCH_WRITE_WAIT_FOR_EC write path. The process is: - Cancel new stripes being built up - Close out/cancel open buckets on write points or the partial list that are for stripes - Shutdown rebalance/copygc - Then wait for in flight new stripes to finish With BCH_WRITE_WAIT_FOR_EC, move ops will be waiting on stripes to fill up before they complete; the new ec shutdown path is needed for shutting down copygc/rebalance without deadlocking. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23bcachefs: bch2_fs_moving_ctxts_to_text()Kent Overstreet1-2/+1
This also adds bch2_write_op_to_text(): now we can see outstand moves, useful for debugging shutdown with the upcoming BCH_WRITE_WAIT_FOR_EC and likely for other things in the future. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23bcachefs: Private error codes: ENOMEMKent Overstreet1-4/+4
This adds private error codes for most (but not all) of our ENOMEM uses, which makes it easier to track down assorted allocation failures. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23bcachefs: When shutting down, flush btree node writes lastKent Overstreet1-2/+4
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23bcachefs: Improve a verbose log messageKent Overstreet1-1/+2
We should be using bch2_err_str() where applicable. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23bcachefs: Switch ec_stripes_heap_lock to a mutexKent Overstreet1-1/+1
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23bcachefs: Fragmentation LRUKent Overstreet1-1/+0
Now that we have much more efficient updates to the LRU btree, this patch adds a new LRU that indexes buckets by fragmentation. This means copygc no longer has to scan every bucket to find buckets that need to be evacuated. Changes: - A new field in bch_alloc_v4, fragmentation_lru - this corresponds to the bucket's position in the fragmentation LRU. We add a new field for this instead of calculating it as needed because we may make the fragmentation LRU optional; this field indicates whether a bucket is on the fragmentation LRU. Also, zoned devices will introduce variable bucket sizes; explicitly recording the LRU position will be safer for them. - A new copygc path for using the fragmentation LRU instead of scanning every bucket and building up an in-memory heap. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23bcachefs: Handle btree node rewrites before going RWKent Overstreet1-0/+2
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23bcachefs: Improved nocow lockingKent Overstreet1-0/+2
This improves the nocow lock table so that hash table entries have multiple locks, and locks specify which bucket they're for - i.e. we can now resolve hash collisions. This is important because the allocator has to skip buckets that are locked in the nocow lock table, and previously hash collisions would cause it to spuriously skip unlocked buckets. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23bcachefs: Don't stop copygc while removing devicesKent Overstreet1-8/+0
With the new backpointer based copygc we don't need an explicit copygc reserve, we're always evacuating buckets one at a time - so this is no longer needed, and in fact removing it fixes a deadlock in bch2_dev_allocator_remove(). Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23bcachefs: New on disk format: BackpointersKent Overstreet1-0/+2
This patch adds backpointers: we now have a reverse index from device and offset on that device (specifically, offset within a bucket) back to btree nodes and (non cached) data extents. The first 40 backpointers within a bucket are stored in the alloc key; after that backpointers spill over to the next backpointers btree. This is to help avoid performance regressions from additional btree updates on large streaming workloads. This patch adds all the code for creating, checking and repairing backpointers. The next patch in the series is going to use backpointers for copygc - finally getting rid of the need to scan all extents to do copygc. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23bcachefs: Btree write bufferKent Overstreet1-0/+3
This adds a new method of doing btree updates - a straight write buffer, implemented as a flat fixed size array. This is only useful when we don't need to read from the btree in order to do the update, and when reading is infrequent - perfect for the LRU btree. This will make LRU btree updates fast enough that we'll be able to use it for persistently indexing buckets by fragmentation, which will be a massive boost to copygc performance. Changes: - A new btree_insert_type enum, for btree_insert_entries. Specifies btree, btree key cache, or btree write buffer. - bch2_trans_update_buffered(): updates via the btree write buffer don't need a btree path, so we need a new update path. - Transaction commit path changes: The update to the btree write buffer both mutates global, and can fail if there isn't currently room. Therefore we do all write buffer updates in the transaction all at once, and also if it fails we have to revert filesystem usage counter changes. If there isn't room we flush the write buffer in the transaction commit error path and retry. - A new persistent option, for specifying the number of entries in the write buffer. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23bcachefs: Start copygc when first going read-writeKent Overstreet1-14/+12
In the distant past, it wasn't possible to start copygc until after journal replay had finished. Now, the btree iterator code overlays keys from the journal, so there's no reason not to start it earlier - and it solves a rare deadlock. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23bcachefs: Debug mode for c->writes referencesKent Overstreet1-5/+25
This adds a debug mode where we split up the c->writes refcount into distinct refcounts for every codepath that takes a reference, and adds sysfs code to print the value of each ref. This will make it easier to debug shutdown hangs due to refcount leaks. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23bcachefs: ec_stripe_delete_work() now takes ref on c->writesKent Overstreet1-5/+4
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23bcachefs: Make log message at startup a bit cleanerKent Overstreet1-7/+6
Don't print out opts= if no options have been specified. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23bcachefs: More errcode cleanupKent Overstreet1-44/+32
We shouldn't be overloading standard error codes now that we have provisions for bcachefs-specific errorcodes: this patch converts super.c and super-io.c to per error site errcodes, with a bit of cleanup. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23bcachefs: Fix a "no journal entries found" bugKent Overstreet1-0/+8
On startup, we need to ensure the first journal entry written is a flush write: after a clean shutdown we generally don't read the journal, which means we might be overwriting whatever was there previously, and there must always be at least one flush entry in the journal or recovery will fail. Found by fstests generic/388. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23bcachefs: Assorted checkpatch fixesKent Overstreet1-5/+4
checkpatch.pl gives lots of warnings that we don't want - suggested ignore list: ASSIGN_IN_IF UNSPECIFIED_INT - bcachefs coding style prefers single token type names NEW_TYPEDEFS - typedefs are occasionally good FUNCTION_ARGUMENTS - we prefer to look at functions in .c files (hopefully with docbook documentation), not .h file prototypes MULTISTATEMENT_MACRO_USE_DO_WHILE - we have _many_ x-macros and other macros where we can't do this Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23bcachefs: time stats now uses the mean_and_variance module.Daniel Hill1-0/+6
Signed-off-by: Daniel Hill <daniel@gluo.nz> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>