Age | Commit message | Author | Files | Lines
2023-10-23 | bcachefs: Inline bch2_snapshot_is_ancestor() fast path | Kent Overstreet | 2 | -2/+9
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23 | bcachefs: Upgrade path fixes | Kent Overstreet | 4 | -4/+7
Some minor fixes to not print errors that are actually due to a version upgrade.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23 | bcachefs: is_ancestor bitmap | Kent Overstreet | 2 | -8/+18
Further optimization for bch2_snapshot_is_ancestor(). We add a small inline bitmap to snapshot_t, which indicates which of the next 128 snapshot IDs are ancestors of the current id - eliminating the last few iterations of the loop in bch2_snapshot_is_ancestor().
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
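The bitmap idea can be illustrated with a minimal sketch. It assumes, as the message implies, that ancestors always have higher IDs than their descendants; the struct and function names below are hypothetical stand-ins, not the actual snapshot_t layout.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define ANCESTOR_BITS 128u

/* Hypothetical stand-in for snapshot_t; only the fields the sketch needs. */
struct snap_node {
	uint32_t id;
	uint64_t is_ancestor[ANCESTOR_BITS / 64]; /* bit k: is (id + 1 + k) an ancestor? */
};

/* Record `ancestor` in the bitmap if it falls inside the 128-ID window. */
static void snap_mark_ancestor(struct snap_node *n, uint32_t ancestor)
{
	assert(ancestor > n->id); /* assumption: ancestors have higher IDs */
	uint32_t off = ancestor - n->id - 1;

	if (off < ANCESTOR_BITS)
		n->is_ancestor[off / 64] |= UINT64_C(1) << (off % 64);
}

/* Answer from the bitmap alone; only valid for IDs inside the window. */
static bool snap_test_ancestor(const struct snap_node *n, uint32_t ancestor)
{
	assert(ancestor > n->id);
	uint32_t off = ancestor - n->id - 1;

	assert(off < ANCESTOR_BITS);
	return (n->is_ancestor[off / 64] >> (off % 64)) & 1;
}
```

A caller would walk toward the root as usual while the candidate ancestor is more than 128 IDs away, then finish with a single bitmap test instead of the last few loop iterations.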
2023-10-23 | bcachefs: mark bch_inode_info and bkey_cached as reclaimable | Mikulas Patocka | 2 | -2/+2
Mark these caches as reclaimable, so that available memory is correctly reported when there are many cached inodes.
Note that more work is needed - you should add __GFP_RECLAIMABLE to some of the kmalloc calls, so that they are allocated from the "kmalloc-rcl-*" caches.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23 | bcachefs: Compression levels | Kent Overstreet | 10 | -57/+174
This allows including a compression level when specifying a compression type, e.g.
    compression=zstd:15
Values from 1 through 15 indicate compression levels, 0 or unspecified indicates the default.
For LZ4, values 3-15 specify that the HC algorithm should be used.
Note that for compatibility, extents themselves only include the compression type, not the compression level. This means that specifying the same compression algorithm but different compression levels for the compression and background_compression options will have no effect.
XXX: perhaps we could add a warning for this
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
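As an illustration of the `type:level` syntax described above, here is a minimal parsing sketch. The struct and function names are hypothetical and this is not the option parser bcachefs uses; it only encodes the rules stated in the message (levels 1 through 15, 0 or absent meaning the default, LZ4 levels 3 and up selecting HC).

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical parsed form of an option such as "zstd:15". */
struct compression_opt {
	char     type[16];
	unsigned level; /* 0 = default */
};

/* Returns 0 on success, -1 on malformed input. */
static int parse_compression_opt(const char *s, struct compression_opt *out)
{
	assert(s && out);

	const char *colon = strchr(s, ':');
	size_t len = colon ? (size_t)(colon - s) : strlen(s);

	if (len == 0 || len >= sizeof(out->type))
		return -1;
	memcpy(out->type, s, len);
	out->type[len] = '\0';
	out->level = 0;
	if (!colon)
		return 0;

	/* Require a plain decimal number after the colon. */
	if (colon[1] < '0' || colon[1] > '9')
		return -1;

	char *end;
	unsigned long v = strtoul(colon + 1, &end, 10);

	if (*end != '\0' || v > 15)
		return -1;
	out->level = (unsigned)v;
	return 0;
}

/* Per the message: for LZ4, levels 3-15 select the HC algorithm. */
static int lz4_uses_hc(const struct compression_opt *o)
{
	return strcmp(o->type, "lz4") == 0 && o->level >= 3;
}
```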
2023-10-23 | bcachefs: Extent sb compression type fields to 8 bits | Kent Overstreet | 1 | -3/+28
The upper 4 bits are for compression level.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
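A small sketch of the packing this implies. The message only states that the upper 4 bits hold the level; placing the compression type in the lower 4 bits is an assumption here, and the helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Pack type (assumed: low nibble) and level (high nibble) into 8 bits. */
static uint8_t compression_field_pack(unsigned type, unsigned level)
{
	assert(type < 16 && level < 16);
	return (uint8_t)((level << 4) | type);
}

static unsigned compression_field_type(uint8_t v)
{
	return v & 0x0f;
}

static unsigned compression_field_level(uint8_t v)
{
	return v >> 4;
}
```

With this layout a field written before the change (level bits all zero) still decodes to the same type, with level 0 meaning the default.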
2023-10-23 | bcachefs: bcachefs_format.h should be using __u64 | Kent Overstreet | 1 | -10/+10
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23 | bcachefs: fix_errors option is now a proper enum | Kent Overstreet | 6 | -47/+95
Before, it was parsed as a bool but internally it was really an enum: this lets us pass in all the possible values.
But we special case the option parsing: no supplied value is parsed as FSCK_FIX_yes, to match the previous behaviour.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
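The special case can be sketched as follows. Only FSCK_FIX_yes is named in the message; the other enum values and all identifiers below are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical enum; only the "yes" value is confirmed by the message. */
enum fix_opt { FIX_exit, FIX_yes, FIX_no, FIX_ask, FIX_NR };

static const char * const fix_opt_names[FIX_NR] = { "exit", "yes", "no", "ask" };

/* Returns an enum fix_opt value, or -1 for an unrecognised string. */
static int parse_fix_errors(const char *val)
{
	/* No supplied value keeps the old bool-style behaviour: "yes". */
	if (!val || !*val)
		return FIX_yes;

	for (int i = 0; i < FIX_NR; i++) {
		assert(fix_opt_names[i] != NULL);
		if (!strcmp(val, fix_opt_names[i]))
			return i;
	}
	return -1;
}
```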
2023-10-23 | bcachefs: bch_opt_fn | Kent Overstreet | 4 | -19/+33
Minor refactoring to get rid of some unneeded token pasting.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23 | bcachefs: Convert snapshot table to RCU array | Kent Overstreet | 7 | -47/+207
This switches the in-memory table of snapshot nodes from a generic radix tree to a simple RCU array. This means we have to add new locking to deal with reallocations, but it is faster than traversing the radix tree.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23 | bcachefs: Add a race_fault() for write buffer slowpath | Kent Overstreet | 1 | -0/+3
We haven't hooked up dynamic fault injection quite yet, but we will soon.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23 | bcachefs: Add buffered IO fallback for userspace | Kent Overstreet | 1 | -2/+15
In userspace, we want to be able to switch to buffered IO when we're dealing with an image on a filesystem/device that doesn't support the blocksize the filesystem was formatted with.
This plumbs through !opts.direct_io -> FMODE_BUFFERED, which will be supported by the shim version of blkdev_get_by_path() in -tools, and it adds a fallback to disable direct IO and retry for userspace.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23 | bcachefs: Fallocate now checks page cache | Kent Overstreet | 1 | -22/+61
Previously, fallocate would only check the state of the extents btree when determining if we need to create a reservation.
But the page cache might already have dirty data or a disk reservation. This changes __bchfs_fallocate() to call bch2_seek_pagecache_hole() to check for this.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23 | bcachefs: Don't start copygc until recovery is finished | Kent Overstreet | 1 | -6/+13
With "bcachefs: Snapshot depth, skiplist fields", we now can't run data move operations until after bch2_check_snapshots() is complete.
Ideally we'd have the copygc (and rebalance) threads wait until c->curr_recovery_pass has advanced, but the waitlist handling is tricky - so for now, move starting copygc back to read_write_late().
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23 | bcachefs: Fix build error on weird gcc | Kent Overstreet | 1 | -3/+1
Fixes:
    ./include/linux/stddef.h:8:14: error: positional initialization of field in ‘struct’ declared with ‘designated_init’ attribute [-Werror=designated-init]
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23 | bcachefs: Snapshot depth, skiplist fields | Kent Overstreet | 6 | -56/+267
This extends KEY_TYPE_snapshot to include some new fields:
- depth, to indicate depth of this particular node from the root
- skip[3], skiplist entries for quickly walking back up to the root
These are to improve bch2_snapshot_is_ancestor(), making it O(ln(n)) instead of O(n) in the snapshot tree depth.
Skiplist nodes are picked at random from the set of ancestor nodes, not some fixed fraction.
This introduces bcachefs_metadata_version 1.1, snapshot_skiplists.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
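A minimal sketch of how skiplist entries can shorten the ancestor walk. It assumes ancestors always have higher IDs than descendants, that skip[] is sorted ascending, and that a zero entry means "unused"; the table layout and names are hypothetical, not the on-disk key format.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical in-memory node, indexed by snapshot ID in a flat table. */
struct skip_node {
	uint32_t parent;  /* 0 for the root */
	uint32_t depth;   /* distance from the root */
	uint32_t skip[3]; /* some ancestors, ascending; 0 = unused */
};

/* Take the longest jump that does not overshoot `ancestor`. */
static uint32_t skip_step(const struct skip_node *t, uint32_t id, uint32_t ancestor)
{
	assert(id != 0);
	const struct skip_node *s = &t[id];

	for (int i = 2; i >= 0; i--)
		if (s->skip[i] && s->skip[i] <= ancestor)
			return s->skip[i];
	return s->parent;
}

/* True if `ancestor` is `id` itself or lies on the path from `id` to the root. */
static bool skip_is_ancestor(const struct skip_node *t, uint32_t id, uint32_t ancestor)
{
	while (id && id < ancestor)
		id = skip_step(t, id, ancestor);
	return id == ancestor;
}
```

Because IDs only grow along the path to the root, a skip entry that is less than or equal to the target can never jump past it, so the walk stays correct while skipping intermediate nodes.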
2023-10-23 | bcachefs: Version table now lists required recovery passes | Kent Overstreet | 6 | -46/+110
Now that we've got forward compatibility sorted out, we should be doing more frequent version upgrades in the future.
To avoid having to run a full fsck for every version upgrade, this improves the BCH_METADATA_VERSIONS() table to explicitly specify a bitmask of recovery passes to run when upgrading to or past a given version.
This means we can also delete PASS_UPGRADE().
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
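The "bitmask per version" idea can be sketched like this. The table contents, pass names and version numbers are invented for the example and do not reflect the real BCH_METADATA_VERSIONS() entries.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical recovery passes, one bit each. */
enum {
	DEMO_PASS_A = 1u << 0,
	DEMO_PASS_B = 1u << 1,
	DEMO_PASS_C = 1u << 2,
};

struct demo_version {
	unsigned version;
	uint64_t passes; /* passes required when upgrading to or past this version */
};

static const struct demo_version demo_versions[] = {
	{ 10, 0 },
	{ 11, DEMO_PASS_A },
	{ 12, DEMO_PASS_B | DEMO_PASS_C },
};

/* Union of the passes for every version in (old_version, new_version]. */
static uint64_t demo_upgrade_passes(unsigned old_version, unsigned new_version)
{
	assert(old_version <= new_version);
	uint64_t ret = 0;

	for (size_t i = 0; i < sizeof(demo_versions) / sizeof(demo_versions[0]); i++)
		if (demo_versions[i].version > old_version &&
		    demo_versions[i].version <= new_version)
			ret |= demo_versions[i].passes;
	return ret;
}
```

An upgrade that crosses several versions at once simply accumulates the passes of each version it crosses, rather than falling back to a full fsck.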
2023-10-23 | bcachefs: bch2_sb_maybe_downgrade(), bch2_sb_upgrade() | Kent Overstreet | 3 | -14/+33
Add some new helpers, and fix upgrade/downgrade in bch2_fs_initialize().
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23 | bcachefs: Fix a write buffer flush deadlock | Kent Overstreet | 2 | -0/+15
We're not supposed to block if BTREE_INSERT_JOURNAL_RECLAIM && watermark != BCH_WATERMARK_reclaim.
This should really be a separate BTREE_INSERT_NONBLOCK flag - add some comments to that effect, it's not important for this patch.
btree write buffer flush depends on this behaviour though - the first loop tries to flush sequentially, which doesn't free up space in the journal optimally. If that can't proceed we bail out and flush in journal order - that won't work if we're blocked instead of returning an error.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23 | bcachefs: bcachefs_metadata_version_major_minor | Kent Overstreet | 7 | -61/+129
This introduces major/minor versioning to the superblock version number. Major version number changes indicate incompatible releases; we can move forward to a new major version number, but not backwards. Minor version numbers indicate compatible changes - these add features, but can still be mounted and used by old versions.
With the recent patches that make it possible to roll out new btrees and key types without breaking compatibility, we should be able to roll out most new features without incompatible changes.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
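One way to picture the major/minor split is the sketch below. The 10-bit minor field and all helper names are assumptions for illustration, and the mountability rule is a simplification of what the message describes (a real check would presumably also enforce a minimum supported version).

```c
#include <assert.h>
#include <stdbool.h>

#define DEMO_MINOR_BITS 10u /* assumed width of the minor field */

/* Combine major and minor into one version number. */
static unsigned ver_make(unsigned major, unsigned minor)
{
	assert(minor < (1u << DEMO_MINOR_BITS));
	return (major << DEMO_MINOR_BITS) | minor;
}

static unsigned ver_major(unsigned v)
{
	return v >> DEMO_MINOR_BITS;
}

static unsigned ver_minor(unsigned v)
{
	return v & ((1u << DEMO_MINOR_BITS) - 1);
}

/*
 * A newer minor version on disk is still usable by older code; a newer
 * major version is not.
 */
static bool ver_mountable(unsigned on_disk, unsigned running)
{
	return ver_major(on_disk) <= ver_major(running);
}
```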
2023-10-23 | bcachefs: Add new assertions for shutdown path | Kent Overstreet | 2 | -1/+4
We've been seeing assertions pop that indicate the btree node cache or key cache have dirty items when we just did a clean shutdown. Add some more assertions so we can catch this when we're dirtying items.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23 | bcachefs: bch2_xattr_set() now updates ctime | Kent Overstreet | 3 | -13/+22
Fixes fstests generic/728.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23 | bcachefs: Kill bch2_xattr_get() | Kent Overstreet | 2 | -12/+3
Inline it into the only caller.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23 | bcachefs: Fix try_decrease_writepoints() | Kent Overstreet | 1 | -0/+1
We were freeing open buckets on the writepoint list, but forgetting to take them off the writepoint list - whoops.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23 | bcachefs: Mark as EXPERIMENTAL | Kent Overstreet | 1 | -1/+1
As discussed on list, bcachefs is going to be marked as experimental for a few releases, until the inevitable tide of new bug reports subsides.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23 | bcachefs: Enumerate recovery passes | Kent Overstreet | 11 | -239/+175
Recovery and fsck have many different passes/jobs to do, which always run in the same order - but not all of them run all the time. Some are for fsck, some for unclean shutdown, some for version upgrades.
This adds some new structure: a defined list of recovery passes that we can run in a loop, as well as consolidating the log messages.
The main benefit is consolidating the "should run this recovery pass" logic, as well as cleaning up the "this recovery pass has finished" state; instead of having a bunch of ad-hoc state bits in c->flags, we've now got c->curr_recovery_pass.
By consolidating the "should run this recovery pass" logic, in the future, on-disk format upgrades will be able to say "upgrading to this version requires x passes to run", instead of forcing all of fsck to run.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
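The structure described above (an ordered table of passes, one "should this run" predicate, and a single cursor replacing ad-hoc state bits) might look roughly like this. Every name, including the pass names, is hypothetical; only the curr_recovery_pass cursor is taken from the message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Conditions under which a pass runs (hypothetical flags). */
enum { RUN_ALWAYS = 1 << 0, RUN_FSCK = 1 << 1, RUN_UNCLEAN = 1 << 2 };

struct demo_fs {
	bool     fsck;               /* user requested fsck */
	bool     clean;              /* last shutdown was clean */
	size_t   curr_recovery_pass; /* cursor into the pass table */
	unsigned passes_run;         /* counter, for the example only */
};

struct demo_pass {
	const char *name;
	int       (*fn)(struct demo_fs *);
	unsigned    when;
};

static int demo_pass_ok(struct demo_fs *c)
{
	c->passes_run++;
	return 0;
}

/* Passes always run in table order. */
static const struct demo_pass demo_passes[] = {
	{ "read_metadata",      demo_pass_ok, RUN_ALWAYS  },
	{ "repair_after_crash", demo_pass_ok, RUN_UNCLEAN },
	{ "full_check",         demo_pass_ok, RUN_FSCK    },
};

/* The single place that decides whether a pass runs. */
static bool demo_should_run(const struct demo_fs *c, const struct demo_pass *p)
{
	return (p->when & RUN_ALWAYS) ||
	       ((p->when & RUN_FSCK) && c->fsck) ||
	       ((p->when & RUN_UNCLEAN) && !c->clean);
}

static int demo_run_recovery(struct demo_fs *c)
{
	size_t n = sizeof(demo_passes) / sizeof(demo_passes[0]);

	for (c->curr_recovery_pass = 0; c->curr_recovery_pass < n; c->curr_recovery_pass++) {
		const struct demo_pass *p = &demo_passes[c->curr_recovery_pass];

		assert(p->fn);
		if (demo_should_run(c, p)) {
			int ret = p->fn(c);

			if (ret)
				return ret; /* cursor shows which pass failed */
		}
	}
	return 0;
}
```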
2023-10-23 | bcachefs: Stash journal replay params in bch_fs | Kent Overstreet | 2 | -3/+11
For the upcoming enumeration of recovery passes, we need all recovery passes to be called the same way - including journal replay.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23 | bcachefs: Kill bch2_bucket_gens_read() | Kent Overstreet | 3 | -62/+45
This folds bch2_bucket_gens_read() into bch2_alloc_read(), doing the version check there.
This is prep work for enumerating all recovery passes: we need some cleanup first to make calling all the recovery passes consistent.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23 | bcachefs: Fix error path in bch2_journal_flush_device_pins() | Kent Overstreet | 2 | -5/+6
We need to always call bch2_replicas_gc_end() after we've called bch2_replicas_gc_start(), else we leave state around that needs to be cleaned up.
Partial fix for: https://github.com/koverstreet/bcachefs/issues/560
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
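The general shape of this kind of fix, pairing every successful start with an end on all exit paths, can be sketched with the usual goto-cleanup pattern. The functions below are hypothetical stand-ins and not the bch2_replicas_gc_start()/bch2_replicas_gc_end() signatures.

```c
#include <assert.h>

/* Hypothetical stand-in for the state set up by a gc start call. */
struct demo_gc {
	int active;
};

static int demo_gc_start(struct demo_gc *g)
{
	assert(!g->active); /* starting twice would leak state */
	g->active = 1;
	return 0;
}

static void demo_gc_end(struct demo_gc *g)
{
	g->active = 0;
}

/* `fail_midway` simulates an error between start and end. */
static int demo_flush_device_pins(struct demo_gc *g, int fail_midway)
{
	int ret = demo_gc_start(g);

	if (ret)
		return ret; /* nothing was started, nothing to undo */

	if (fail_midway) {
		ret = -1;
		goto out; /* an early return here would leave gc state behind */
	}

	/* ... the actual work would go here ... */
out:
	demo_gc_end(g);
	return ret;
}
```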
2023-10-23 | bcachefs: version_upgrade is now an enum | Kent Overstreet | 6 | -13/+35
The version_upgrade parameter is now an enum, not a bool, and it's persistent in the superblock:
- compatible (default): upgrade to the latest compatible version
- incompatible: upgrade to latest incompatible version
- none
Currently all upgrades are incompatible upgrades, but the next release will introduce major:minor versions.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23 | bcachefs: BCH_SB_VERSION_UPGRADE_COMPLETE() | Kent Overstreet | 5 | -25/+62
Version upgrades are not atomic operations: when we do a version upgrade we need to update the superblock before we start using new features, and then when the upgrade completes we need to update the superblock again. This adds a new superblock field so we can detect and handle incomplete version upgrades.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23 | bcachefs: Convert more -EROFS to private error codes | Kent Overstreet | 4 | -9/+13
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23 | bcachefs: Delete redundant log messages | Kent Overstreet | 8 | -64/+17
Now that we have distinct error codes for different memory allocation failures, the early init log messages are no longer needed.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23 | bcachefs: Change check for invalid key types | Kent Overstreet | 26 | -80/+141
As part of the forward compatibility patch series, we need to allow for new key types without complaining loudly when running an old version.
This patch changes the flags parameter of bkey_invalid to an enum, and adds a new flag to indicate we're being called from the transaction commit path.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23 | bcachefs: Assorted sparse fixes | Kent Overstreet | 38 | -118/+115
- endianness fixes
- mark some things static
- fix a few __percpu annotations
- fix silent enum conversions
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23 | bcachefs: Refactor bch_sb_field_ops handling | Kent Overstreet | 1 | -11/+18
This changes bch_sb_field_ops lookup to match how bkey_ops now works; for an unknown field type we return an empty ops struct.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23 | bcachefs: Allow for unknown key types | Kent Overstreet | 3 | -27/+33
This adds a new helper for looking up the bkey_ops for a given key type, which returns a null bkey_ops for unknown key types; various bkey_ops users are tweaked as well to handle unknown key types.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
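The null-ops lookup can be sketched as below. The ops struct, its single callback and the key types are hypothetical simplifications; the point is only that unknown types resolve to an empty ops struct whose missing callbacks callers must tolerate.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, reduced ops table: one optional validation callback. */
struct demo_key_ops {
	int (*invalid)(const void *key); /* nonzero means the key is invalid */
};

static int demo_reject_null(const void *key)
{
	return key ? 0 : -1;
}

/* Ops for the key types this build knows about. */
static const struct demo_key_ops demo_known_ops[] = {
	{ .invalid = demo_reject_null }, /* type 0 */
	{ .invalid = NULL },             /* type 1: nothing to validate */
};

/* Returned for any type newer than this build understands. */
static const struct demo_key_ops demo_null_ops = { .invalid = NULL };

static const struct demo_key_ops *demo_key_type_ops(unsigned type)
{
	size_t nr = sizeof(demo_known_ops) / sizeof(demo_known_ops[0]);

	return type < nr ? &demo_known_ops[type] : &demo_null_ops;
}

/* Callers check for a missing callback instead of rejecting unknown types. */
static int demo_key_invalid(unsigned type, const void *key)
{
	const struct demo_key_ops *ops = demo_key_type_ops(type);

	assert(ops != NULL);
	return ops->invalid ? ops->invalid(key) : 0;
}
```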
2023-10-23 | bcachefs: Allow for unknown btree IDs | Kent Overstreet | 11 | -42/+91
We need to allow filesystems with metadata from newer versions to be mountable and usable by older versions.
This patch enables us to roll out new btrees without a new major version number; we can now handle btree roots for unknown btree types.
The unknown btree roots will be retained, and fsck (including backpointers) will check them, the same as other btree types.
We add a dynamic array for the extra, unknown btree roots, in addition to the fixed size btree root array, and add new helpers for looking up btree roots.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23 | bcachefs: flush journal to avoid invalid dev usage entries on recovery | Brian Foster | 1 | -0/+11
A crash immediately after device removal can result in an unmountable filesystem due to recovery failure. The following command reliably reproduces on a multi-device fs:
    bcachefs device remove <dev> && xfs_io -xc shutdown <mnt>
The post-crash mount fails with an error similar to the following, reported by fsck:
    invalid journal entry dev_usage at offset 7994/8034 seq 12: bad dev, fixing
This refers to a device usage entry in the journal that refers to the index of the just removed device. Recovery considers this an invalid entry and fails to proceed.
Device usage entries are added to journal buffer writes via bch_journal_write() -> bch2_journal_super_entries_add_common(), which means any journal buffer write has content that refers to member devices at the time of the journal write.
The device remove sequence already removes metadata references to the device being removed. It then flushes any pins that refer to the device, clears replica entries, removes the in-memory device object and lastly updates the superblock to reflect that the device is no longer present.
The problem is that any journal writes that occur during this sequence will include a dev usage entry so long as the device is present. To avoid this problem, we can flush the journal once more after the device entry is removed from the in-core structures, but before the superblock is updated to fully remove the device on-disk.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23 | bcachefs: mark active journal devices on journal replicas gc | Brian Foster | 1 | -1/+13
A simple device evacuate, remove, add test loop with concurrent shutdowns occasionally reproduces a problem where the filesystem fails to mount. The mount failure occurs because the filesystem was uncleanly shut down, yet no member device is marked for journal data in the superblock. An fsck detects the problem, restores the mark and allows the mount to proceed without further consistency issues.
The reason for the lack of journal data marks is the gc mechanism invoked via bch2_journal_flush_device_pins() runs while the journal happens to be empty. This results in garbage collection of all journal replicas entries. Once the updated replicas table is written to the superblock, the filesystem is put in a transiently unrecoverable state until further journal data is written, because journal recovery expects to find at least one marked journal device whenever the filesystem is not otherwise marked clean (i.e. as on clean unmount).
To fix this problem, update the journal replicas gc algorithm to always mark currently active journal replicas entries by writing to the journal. This ensures that only entries for devices that are no longer used for journaling are garbage collected, not just those that don't happen to currently hold journal data. This preserves the journal recovery invariant above and avoids putting the fs into a transiently unrecoverable state.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23 | bcachefs: bch2_version_compatible() | Kent Overstreet | 5 | -66/+60
This adds a new helper for checking if an on-disk version is compatible with the running version of bcachefs - prep work for introducing major:minor version numbers.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23 | bcachefs: bch2_version_to_text() | Kent Overstreet | 6 | -20/+37
Add a new helper for printing out metadata versions in a standard format.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23 | bcachefs: Kill BTREE_INSERT_USE_RESERVE | Kent Overstreet | 9 | -55/+56
Now that we have journal watermarks and alloc watermarks unified, BTREE_INSERT_USE_RESERVE is redundant and can be deleted.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23 | bcachefs: Fix a null ptr deref in bch2_fs_alloc() error path | Kent Overstreet | 3 | -1/+6
This fixes a null ptr deref in bch2_free_pending_node_rewrites() when the list head wasn't initialized.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23 | bcachefs: Fix a format string warning | Kent Overstreet | 1 | -1/+1
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23 | bcachefs: Kill JOURNAL_WATERMARK | Kent Overstreet | 12 | -55/+46
This unifies JOURNAL_WATERMARK with BCH_WATERMARK; we're working towards specifying watermarks once in the transaction commit path.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23 | bcachefs: BCH_WATERMARK_reclaim | Kent Overstreet | 3 | -2/+6
Add another watermark for journal reclaim - this is needed for the next patches, which unify BCH_WATERMARK with JOURNAL_WATERMARK.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23 | bcachefs: struct bch_extent_rebalance | Kent Overstreet | 3 | -2/+26
This adds the extent entry for extents that rebalance needs to do something with.
We're adding this ahead of the main rebalance_work patchset, because adding new extent entries can't be done in a forwards-compatible way.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23 | bcachefs: Expand BTREE_NODE_ID | Kent Overstreet | 1 | -3/+14
We now have 20 bits for the btree ID in the on-disk format - sufficient for about 1 million (2^20 = 1,048,576) distinct btrees.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-23 | bcachefs: Fix btree node write error message | Kent Overstreet | 1 | -1/+1
Error messages should include the error code, when available.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>