diff options
Diffstat (limited to 'Documentation/filesystems/bcachefs')
-rw-r--r-- | Documentation/filesystems/bcachefs/SubmittingPatches.rst | 47 | ||||
-rw-r--r-- | Documentation/filesystems/bcachefs/casefolding.rst | 108 | ||||
-rw-r--r-- | Documentation/filesystems/bcachefs/future/idle_work.rst | 78 | ||||
-rw-r--r-- | Documentation/filesystems/bcachefs/index.rst | 27 |
4 files changed, 239 insertions, 21 deletions
diff --git a/Documentation/filesystems/bcachefs/SubmittingPatches.rst b/Documentation/filesystems/bcachefs/SubmittingPatches.rst index 026b12ae0d6a..18c79d548391 100644 --- a/Documentation/filesystems/bcachefs/SubmittingPatches.rst +++ b/Documentation/filesystems/bcachefs/SubmittingPatches.rst @@ -1,8 +1,13 @@ -Submitting patches to bcachefs: -=============================== +Submitting patches to bcachefs +============================== + +Here are suggestions for submitting patches to bcachefs subsystem. + +Submission checklist +-------------------- Patches must be tested before being submitted, either with the xfstests suite -[0], or the full bcachefs test suite in ktest [1], depending on what's being +[0]_, or the full bcachefs test suite in ktest [1]_, depending on what's being touched. Note that ktest wraps xfstests and will be an easier method to running it for most users; it includes single-command wrappers for all the mainstream in-kernel local filesystems. @@ -26,21 +31,21 @@ considered out of date), but try not to deviate too much without reason. Focus on writing code that reads well and is organized well; code should be aesthetically pleasing. -CI: -=== +CI +-- Instead of running your tests locally, when running the full test suite it's -prefereable to let a server farm do it in parallel, and then have the results +preferable to let a server farm do it in parallel, and then have the results in a nice test dashboard (which can tell you which failures are new, and presents results in a git log view, avoiding the need for most bisecting). -That exists [2], and community members may request an account. If you work for +That exists [2]_, and community members may request an account. If you work for a big tech company, you'll need to help out with server costs to get access - but the CI is not restricted to running bcachefs tests: it runs any ktest test (which generally makes it easy to wrap other tests that can run in qemu). -Other things to think about: -============================ +Other things to think about +--------------------------- - How will we debug this code? Is there sufficient introspection to diagnose when something starts acting wonky on a user machine? @@ -68,7 +73,7 @@ Other things to think about: land - use them. Use them judiciously, and not as a replacement for proper error handling, but use them. -- Does it need to be performance tested? Should we add new peformance counters? +- Does it need to be performance tested? Should we add new performance counters? bcachefs has a set of persistent runtime counters which can be viewed with the 'bcachefs fs top' command; this should give users a basic idea of what @@ -79,20 +84,22 @@ Other things to think about: tested? (Automated tests exists but aren't in the CI, due to the hassle of disk image management; coordinate to have them run.) -Mailing list, IRC: -================== +Mailing list, IRC +----------------- -Patches should hit the list [3], but much discussion and code review happens on -IRC as well [4]; many people appreciate the more conversational approach and -quicker feedback. +Patches should hit the list [3]_, but much discussion and code review happens +on IRC as well [4]_; many people appreciate the more conversational approach +and quicker feedback. Additionally, we have a lively user community doing excellent QA work, which exists primarily on IRC. Please make use of that resource; user feedback is important for any nontrivial feature, and documenting it in commit messages would be a good idea. -[0]: git://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git -[1]: https://evilpiepirate.org/git/ktest.git/ -[2]: https://evilpiepirate.org/~testdashboard/ci/ -[3]: linux-bcachefs@vger.kernel.org -[4]: irc.oftc.net#bcache, #bcachefs-dev +.. rubric:: References + +.. [0] git://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git +.. [1] https://evilpiepirate.org/git/ktest.git/ +.. [2] https://evilpiepirate.org/~testdashboard/ci/ +.. [3] linux-bcachefs@vger.kernel.org +.. [4] irc.oftc.net#bcache, #bcachefs-dev diff --git a/Documentation/filesystems/bcachefs/casefolding.rst b/Documentation/filesystems/bcachefs/casefolding.rst new file mode 100644 index 000000000000..871a38f557e8 --- /dev/null +++ b/Documentation/filesystems/bcachefs/casefolding.rst @@ -0,0 +1,108 @@ +.. SPDX-License-Identifier: GPL-2.0 + +Casefolding +=========== + +bcachefs has support for case-insensitive file and directory +lookups using the regular `chattr +F` (`S_CASEFOLD`, `FS_CASEFOLD_FL`) +casefolding attributes. + +The main usecase for casefolding is compatibility with software written +against other filesystems that rely on casefolded lookups +(eg. NTFS and Wine/Proton). +Taking advantage of file-system level casefolding can lead to great +loading time gains in many applications and games. + +Casefolding support requires a kernel with the `CONFIG_UNICODE` enabled. +Once a directory has been flagged for casefolding, a feature bit +is enabled on the superblock which marks the filesystem as using +casefolding. +When the feature bit for casefolding is enabled, it is no longer possible +to mount that filesystem on kernels without `CONFIG_UNICODE` enabled. + +On the lookup/query side: casefolding is implemented by allocating a new +string of `BCH_NAME_MAX` length using the `utf8_casefold` function to +casefold the query string. + +On the dirent side: casefolding is implemented by ensuring the `bkey`'s +hash is made from the casefolded string and storing the cached casefolded +name with the regular name in the dirent. + +The structure looks like this: + +* Regular: [dirent data][regular name][nul][nul]... +* Casefolded: [dirent data][reg len][cf len][regular name][casefolded name][nul][nul]... + +(Do note, the number of NULs here is merely for illustration; their count can +vary per-key, and they may not even be present if the key is aligned to +`sizeof(u64)`.) + +This is efficient as it means that for all file lookups that require casefolding, +it has identical performance to a regular lookup: +a hash comparison and a `memcmp` of the name. + +Rationale +--------- + +Several designs were considered for this system: +One was to introduce a dirent_v2, however that would be painful especially as +the hash system only has support for a single key type. This would also need +`BCH_NAME_MAX` to change between versions, and a new feature bit. + +Another option was to store without the two lengths, and just take the length of +the regular name and casefolded name contiguously / 2 as the length. This would +assume that the regular length == casefolded length, but that could potentially +not be true, if the uppercase unicode glyph had a different UTF-8 encoding than +the lowercase unicode glyph. +It would be possible to disregard the casefold cache for those cases, but it was +decided to simply encode the two string lengths in the key to avoid random +performance issues if this edgecase was ever hit. + +The option settled on was to use a free-bit in d_type to mark a dirent as having +a casefold cache, and then treat the first 4 bytes the name block as lengths. +You can see this in the `d_cf_name_block` member of union in `bch_dirent`. + +The feature bit was used to allow casefolding support to be enabled for the majority +of users, but some allow users who have no need for the feature to still use bcachefs as +`CONFIG_UNICODE` can increase the kernel side a significant amount due to the tables used, +which may be decider between using bcachefs for eg. embedded platforms. + +Other filesystems like ext4 and f2fs have a super-block level option for casefolding +encoding, but bcachefs currently does not provide this. ext4 and f2fs do not expose +any encodings than a single UTF-8 version. When future encodings are desirable, +they will be added trivially using the opts mechanism. + +dentry/dcache considerations +---------------------------- + +Currently, in casefolded directories, bcachefs (like other filesystems) will not cache +negative dentry's. + +This is because currently doing so presents a problem in the following scenario: + + - Lookup file "blAH" in a casefolded directory + - Creation of file "BLAH" in a casefolded directory + - Lookup file "blAH" in a casefolded directory + +This would fail if negative dentry's were cached. + +This is slightly suboptimal, but could be fixed in future with some vfs work. + + +References +---------- + +(from Peter Anvin, on the list) + +It is worth noting that Microsoft has basically declared their +"recommended" case folding (upcase) table to be permanently frozen (for +new filesystem instances in the case where they use an on-disk +translation table created at format time.) As far as I know they have +never supported anything other than 1:1 conversion of BMP code points, +nor normalization. + +The exFAT specification enumerates the full recommended upcase table, +although in a somewhat annoying format (basically a hex dump of +compressed data): + +https://learn.microsoft.com/en-us/windows/win32/fileio/exfat-specification diff --git a/Documentation/filesystems/bcachefs/future/idle_work.rst b/Documentation/filesystems/bcachefs/future/idle_work.rst new file mode 100644 index 000000000000..59a332509dcd --- /dev/null +++ b/Documentation/filesystems/bcachefs/future/idle_work.rst @@ -0,0 +1,78 @@ +Idle/background work classes design doc: + +Right now, our behaviour at idle isn't ideal, it was designed for servers that +would be under sustained load, to keep pending work at a "medium" level, to +let work build up so we can process it in more efficient batches, while also +giving headroom for bursts in load. + +But for desktops or mobile - scenarios where work is less sustained and power +usage is more important - we want to operate differently, with a "rush to +idle" so the system can go to sleep. We don't want to be dribbling out +background work while the system should be idle. + +The complicating factor is that there are a number of background tasks, which +form a heirarchy (or a digraph, depending on how you divide it up) - one +background task may generate work for another. + +Thus proper idle detection needs to model this heirarchy. + +- Foreground writes +- Page cache writeback +- Copygc, rebalance +- Journal reclaim + +When we implement idle detection and rush to idle, we need to be careful not +to disturb too much the existing behaviour that works reasonably well when the +system is under sustained load (or perhaps improve it in the case of +rebalance, which currently does not actively attempt to let work batch up). + +SUSTAINED LOAD REGIME +--------------------- + +When the system is under continuous load, we want these jobs to run +continuously - this is perhaps best modelled with a P/D controller, where +they'll be trying to keep a target value (i.e. fragmented disk space, +available journal space) roughly in the middle of some range. + +The goal under sustained load is to balance our ability to handle load spikes +without running out of x resource (free disk space, free space in the +journal), while also letting some work accumululate to be batched (or become +unnecessary). + +For example, we don't want to run copygc too aggressively, because then it +will be evacuating buckets that would have become empty (been overwritten or +deleted) anyways, and we don't want to wait until we're almost out of free +space because then the system will behave unpredicably - suddenly we're doing +a lot more work to service each write and the system becomes much slower. + +IDLE REGIME +----------- + +When the system becomes idle, we should start flushing our pending work +quicker so the system can go to sleep. + +Note that the definition of "idle" depends on where in the heirarchy a task +is - a task should start flushing work more quickly when the task above it has +stopped generating new work. + +e.g. rebalance should start flushing more quickly when page cache writeback is +idle, and journal reclaim should only start flushing more quickly when both +copygc and rebalance are idle. + +It's important to let work accumulate when more work is still incoming and we +still have room, because flushing is always more efficient if we let it batch +up. New writes may overwrite data before rebalance moves it, and tasks may be +generating more updates for the btree nodes that journal reclaim needs to flush. + +On idle, how much work we do at each interval should be proportional to the +length of time we have been idle for. If we're idle only for a short duration, +we shouldn't flush everything right away; the system might wake up and start +generating new work soon, and flushing immediately might end up doing a lot of +work that would have been unnecessary if we'd allowed things to batch more. + +To summarize, we will need: + + - A list of classes for background tasks that generate work, which will + include one "foreground" class. + - Tracking for each class - "Am I doing work, or have I gone to sleep?" + - And each class should check the class above it when deciding how much work to issue. diff --git a/Documentation/filesystems/bcachefs/index.rst b/Documentation/filesystems/bcachefs/index.rst index 7db4d7ceab58..e5c4c2120b93 100644 --- a/Documentation/filesystems/bcachefs/index.rst +++ b/Documentation/filesystems/bcachefs/index.rst @@ -4,10 +4,35 @@ bcachefs Documentation ====================== +Subsystem-specific development process notes +-------------------------------------------- + +Development notes specific to bcachefs. These are intended to supplement +:doc:`general kernel development handbook </process/index>`. + .. toctree:: - :maxdepth: 2 + :maxdepth: 1 :numbered: CodingStyle SubmittingPatches + +Filesystem implementation +------------------------- + +Documentation for filesystem features and their implementation details. +At this moment, only a few of these are described here. + +.. toctree:: + :maxdepth: 1 + :numbered: + + casefolding errorcodes + +Future design +------------- +.. toctree:: + :maxdepth: 1 + + future/idle_work |