summaryrefslogtreecommitdiff
path: root/fs/xfs/libxfs
AgeCommit message (Collapse)AuthorFilesLines
2023-04-14Merge tag 'intents-perag-refs-6.4_2023-04-11' of ↵Dave Chinner11-60/+67
git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into guilt/xfs-for-next xfs: make intent items take a perag reference [v24.5] Now that we've cleaned up some code warts in the deferred work item processing code, let's make intent items take an active perag reference from their creation until they are finally freed by the defer ops machinery. This change facilitates the scrub drain in the next patchset and will make it easier for the future AG removal code to detect a busy AG in need of quiescing. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Dave Chinner <david@fromorbit.com>
2023-04-12xfs: _{attr,data}_map_shared should take ILOCK_EXCL until iread_extents is ↵Darrick J. Wong3-3/+25
completely done While fuzzing the data fork extent count on a btree-format directory with xfs/375, I observed the following (excerpted) splat: XFS: Assertion failed: xfs_isilocked(ip, XFS_ILOCK_EXCL), file: fs/xfs/libxfs/xfs_bmap.c, line: 1208 ------------[ cut here ]------------ WARNING: CPU: 0 PID: 43192 at fs/xfs/xfs_message.c:104 assfail+0x46/0x4a [xfs] Call Trace: <TASK> xfs_iread_extents+0x1af/0x210 [xfs 09f66509ece4938760fac7de64732a0cbd3e39cd] xchk_dir_walk+0xb8/0x190 [xfs 09f66509ece4938760fac7de64732a0cbd3e39cd] xchk_parent_count_parent_dentries+0x41/0x80 [xfs 09f66509ece4938760fac7de64732a0cbd3e39cd] xchk_parent_validate+0x199/0x2e0 [xfs 09f66509ece4938760fac7de64732a0cbd3e39cd] xchk_parent+0xdf/0x130 [xfs 09f66509ece4938760fac7de64732a0cbd3e39cd] xfs_scrub_metadata+0x2b8/0x730 [xfs 09f66509ece4938760fac7de64732a0cbd3e39cd] xfs_scrubv_metadata+0x38b/0x4d0 [xfs 09f66509ece4938760fac7de64732a0cbd3e39cd] xfs_ioc_scrubv_metadata+0x111/0x160 [xfs 09f66509ece4938760fac7de64732a0cbd3e39cd] xfs_file_ioctl+0x367/0xf50 [xfs 09f66509ece4938760fac7de64732a0cbd3e39cd] __x64_sys_ioctl+0x82/0xa0 do_syscall_64+0x2b/0x80 entry_SYSCALL_64_after_hwframe+0x46/0xb0 The cause of this is a race condition in xfs_ilock_data_map_shared, which performs an unlocked access to the data fork to guess which lock mode it needs: Thread 0 Thread 1 xfs_need_iread_extents <observe no iext tree> xfs_ilock(..., ILOCK_EXCL) xfs_iread_extents <observe no iext tree> <check ILOCK_EXCL> <load bmbt extents into iext> <notice iext size doesn't match nextents> xfs_need_iread_extents <observe iext tree> xfs_ilock(..., ILOCK_SHARED) <tear down iext tree> xfs_iunlock(..., ILOCK_EXCL) xfs_iread_extents <observe no iext tree> <check ILOCK_EXCL> *BOOM* Fix this race by adding a flag to the xfs_ifork structure to indicate that we have not yet read in the extent records and changing the predicate to look at the flag state, not if_height. The memory barrier ensures that the flag will not be set until the very end of the function. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2023-04-12xfs: don't consider future format versions validDave Chinner1-5/+6
In commit fe08cc504448 we reworked the valid superblock version checks. If it is a V5 filesystem, it is always valid, then we checked if the version was less than V4 (reject) and then checked feature fields in the V4 flags to determine if it was valid. What we missed was that if the version is not V4 at this point, we shoudl reject the fs. i.e. the check current treats V6+ filesystems as if it was a v4 filesystem. Fix this. cc: stable@vger.kernel.org Fixes: fe08cc504448 ("xfs: open code sb verifier feature checks") Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Dave Chinner <david@fromorbit.com>
2023-04-12xfs: stabilize the dirent name transformation function used for ascii-ci dir ↵Darrick J. Wong2-2/+34
hash computation Back in the old days, the "ascii-ci" feature was created to implement case-insensitive directory entry lookups for latin1-encoded names and remove the large overhead of Samba's case-insensitive lookup code. UTF8 names were not allowed, but nobody explicitly wrote in the documentation that this was only expected to work if the system used latin1 names. The kernel tolower function was selected to prepare names for hashed lookups. There's a major discrepancy in the function that computes directory entry hashes for filesystems that have ASCII case-insensitive lookups enabled. The root of this is that the kernel and glibc's tolower implementations have differing behavior for extended ASCII accented characters. I wrote a program to spit out characters for which the tolower() return value is different from the input: glibc tolower: 65:A 66:B 67:C 68:D 69:E 70:F 71:G 72:H 73:I 74:J 75:K 76:L 77:M 78:N 79:O 80:P 81:Q 82:R 83:S 84:T 85:U 86:V 87:W 88:X 89:Y 90:Z kernel tolower: 65:A 66:B 67:C 68:D 69:E 70:F 71:G 72:H 73:I 74:J 75:K 76:L 77:M 78:N 79:O 80:P 81:Q 82:R 83:S 84:T 85:U 86:V 87:W 88:X 89:Y 90:Z 192:À 193:Á 194:Â 195:Ã 196:Ä 197:Å 198:Æ 199:Ç 200:È 201:É 202:Ê 203:Ë 204:Ì 205:Í 206:Î 207:Ï 208:Ð 209:Ñ 210:Ò 211:Ó 212:Ô 213:Õ 214:Ö 215:× 216:Ø 217:Ù 218:Ú 219:Û 220:Ü 221:Ý 222:Þ Which means that the kernel and userspace do not agree on the hash value for a directory filename that contains those higher values. The hash values are written into the leaf index block of directories that are larger than two blocks in size, which means that xfs_repair will flag these directories as having corrupted hash indexes and rewrite the index with hash values that the kernel now will not recognize. Because the ascii-ci feature is not frequently enabled and the kernel touches filesystems far more frequently than xfs_repair does, fix this by encoding the kernel's toupper predicate and tolower functions into libxfs. Give the new functions less provocative names to make it really obvious that this is a pre-hash name preparation function, and nothing else. This change makes userspace's behavior consistent with the kernel. Found by auditing obfuscate_name in xfs_metadump as part of working on parent pointers, wondering how it could possibly work correctly with ci filesystems, writing a test tool to create a directory with hash-colliding names, and watching xfs_repair flag it. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2023-04-12xfs: accumulate iextent records when checking bmapDarrick J. Wong1-1/+1
Currently, the bmap scrubber checks file fork mappings individually. In the case that the file uses multiple mappings to a single contiguous piece of space, the scrubber repeatedly locks the AG to check the existence of a reverse mapping that overlaps this file mapping. If the reverse mapping starts before or ends after the mapping we're checking, it will also crawl around in the bmbt checking correspondence for adjacent extents. This is not very time efficient because it does the crawling while holding the AGF buffer, and checks the middle mappings multiple times. Instead, create a custom iextent record iterator function that combines multiple adjacent allocated mappings into one large incore bmbt record. This is feasible because the incore bmbt record length is 64-bits wide. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-04-12xfs: convert xfs_ialloc_has_inodes_at_extent to return keyfill scan resultsDarrick J. Wong2-35/+52
Convert the xfs_ialloc_has_inodes_at_extent function to return keyfill scan results because for a given range of inode numbers, we might have no indexed inodes at all; the entire region might be allocated ondisk inodes; or there might be a mix of the two. Unfortunately, sparse inodes adds to the complexity, because each inode record can have holes, which means that we cannot use the generic btree _scan_keyfill function because we must look for holes in individual records to decide the result. On the plus side, online fsck can now detect sub-chunk discrepancies in the inobt. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-04-12xfs: teach scrub to check for sole ownership of metadata objectsDarrick J. Wong2-62/+148
Strengthen online scrub's checking even further by enabling us to check that a range of blocks are owned solely by a given owner. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-04-12xfs: remove pointless shadow variable from xfs_difree_inobtDarrick J. Wong1-2/+0
In xfs_difree_inobt, the pag passed in was previously used to look up the AGI buffer. There's no need to extract it again, so remove the shadow variable and shut up -Wshadow. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-04-12xfs: implement masked btree key comparisons for _has_records scansDarrick J. Wong10-40/+142
For keyspace fullness scans, we want to be able to mask off the parts of the key that we don't care about. For most btree types we /do/ want the full keyspace, but for checking that a given space usage also has a full complement of rmapbt records (even if different/multiple owners) we need this masking so that we only track sparseness of rm_startblock, not the whole keyspace (which is extremely sparse). Augment the ->diff_two_keys and ->keys_contiguous helpers to take a third union xfs_btree_key argument, and wire up xfs_rmap_has_records to pass this through. This third "mask" argument should contain a nonzero value in each structure field that should be used in the key comparisons done during the scan. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-04-12xfs: replace xfs_btree_has_record with a general keyspace scannerDarrick J. Wong14-33/+239
The current implementation of xfs_btree_has_record returns true if it finds /any/ record within the given range. Unfortunately, that's not sufficient for scrub. We want to be able to tell if a range of keyspace for a btree is devoid of records, is totally mapped to records, or is somewhere in between. By forcing this to be a boolean, we conflated sparseness and fullness, which caused scrub to return incorrect results. Fix the API so that we can tell the caller which of those three is the current state. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-04-12xfs: refactor ->diff_two_keys callsitesDarrick J. Wong2-33/+79
Create wrapper functions around ->diff_two_keys so that we don't have to remember what the return values mean, and adjust some of the code comments to reflect the longtime code behavior. We're going to introduce more uses of ->diff_two_keys in the next patch, so reduce the cognitive load for readers by doing this refactoring now. Suggested-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-04-12xfs: refactor converting btree irec to btree keyDarrick J. Wong1-8/+15
We keep doing these conversions to support btree queries, so refactor this into a helper. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-04-12xfs: fix rm_offset flag handling in rmap keysDarrick J. Wong1-10/+30
Keys for extent interval records in the reverse mapping btree are supposed to be computed as follows: (physical block, owner, fork, is_btree, offset) This provides users the ability to look up a reverse mapping from a file block mapping record -- start with the physical block; then if there are multiple records for the same block, move on to the owner; then the inode fork type; and so on to the file offset. Unfortunately, the code that creates rmap lookup keys from rmap records forgot to mask off the record attribute flags, leading to ondisk keys that look like this: (physical block, owner, fork, is_btree, unwritten state, offset) Fortunately, this has all worked ok for the past six years because the key comparison functions incorrectly ignore the fork/bmbt/unwritten information that's encoded in the on-disk offset. This means that lookup comparisons are only done with: (physical block, owner, offset) Queries can (theoretically) return incorrect results because of this omission. On consistent filesystems this isn't an issue because xattr and bmbt blocks cannot be shared and hence the comparisons succeed purely on the contents of the rm_startblock field. For the one case where we support sharing (written data fork blocks) all flag bits are zero, so the omission in the comparison has no ill effects. Unfortunately, this bug prevents scrub from detecting incorrect fork and bmbt flag bits in the rmap btree, so we really do need to fix the compare code. Old filesystems with the unwritten bit erroneously set in the rmap key struct will work fine on new kernels since we still ignore the unwritten bit. New filesystems on older kernels will work fine since the old kernels never paid attention to the unwritten bit. A previous version of this patch forgot to keep the (un)written state flag masked during the comparison and caused a major regression in 5.9.x since unwritten extent conversion can update an rmap record without requiring key updates. Note that blocks cannot go directly from data fork to attr fork without being deallocated and reallocated, nor can they be added to or removed from a bmbt without a free/alloc cycle, so this should not cause any regressions. Found by fuzzing keys[1].attrfork = ones on xfs/371. Fixes: 4b8ed67794fe ("xfs: add rmap btree operations") Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-04-12xfs: hoist inode record alignment checks from scrubDarrick J. Wong1-0/+4
Move the inobt record alignment checks from xchk_iallocbt_rec into xfs_inobt_check_irec so that they are applied everywhere. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-04-12xfs: hoist rmap record flag checks from scrubDarrick J. Wong1-0/+5
Move the rmap record flag checks from xchk_rmapbt_rec into xfs_rmap_check_irec so that they are applied everywhere. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-04-12xfs: complain about bad file mapping records in the ondisk bmbtDarrick J. Wong3-2/+34
Similar to what we've just done for the other btrees, create a function to log corrupt bmbt records and call it whenever we encounter a bad record in the ondisk btree. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-04-12xfs: hoist rmap record flag checks from scrubDarrick J. Wong1-0/+22
Move the rmap record flag checks from xchk_rmapbt_rec into xfs_rmap_check_irec so that they are applied everywhere. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-04-12xfs: complain about bad records in query_range helpersDarrick J. Wong4-57/+91
For every btree type except for the bmbt, refactor the code that complains about bad records into a helper and make the ->query_range helpers call it so that corruptions found via that avenue are logged. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-04-12xfs: standardize ondisk to incore conversion for rmap btreesDarrick J. Wong2-23/+42
Create a xfs_rmap_check_irec function to detect corruption in btree records. Fix all xfs_rmap_btrec_to_irec callsites to call the new helper and bubble up corruption reports. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-04-12xfs: return a failure address from xfs_rmap_irec_offset_unpackDarrick J. Wong2-9/+9
Currently, xfs_rmap_irec_offset_unpack returns only 0 or -EFSCORRUPTED. Change this function to return the code address of a failed conversion in preparation for the next patch, which standardizes localized record checking and reporting code. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-04-12xfs: standardize ondisk to incore conversion for refcount btreesDarrick J. Wong2-14/+33
Create a xfs_refcount_check_irec function to detect corruption in btree records. Fix all xfs_refcount_btrec_to_irec callsites to call the new helper and bubble up corruption reports. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-04-12xfs: standardize ondisk to incore conversion for inode btreesDarrick J. Wong4-20/+39
Create a xfs_inobt_check_irec function to detect corruption in btree records. Fix all xfs_inobt_btrec_to_irec callsites to call the new helper and bubble up corruption reports. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-04-12xfs: standardize ondisk to incore conversion for free space btreesDarrick J. Wong2-13/+49
Create a xfs_alloc_btrec_to_irec function to convert an ondisk record to an incore record, and a xfs_alloc_check_irec function to detect corruption. Replace all the open-coded logic with calls to the new helpers and bubble up corruption reports. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-04-12xfs: allow queued AG intents to drain before scrubbingDarrick J. Wong3-2/+16
When a writer thread executes a chain of log intent items, the AG header buffer locks will cycle during a transaction roll to get from one intent item to the next in a chain. Although scrub takes all AG header buffer locks, this isn't sufficient to guard against scrub checking an AG while that writer thread is in the middle of finishing a chain because there's no higher level locking primitive guarding allocation groups. When there's a collision, cross-referencing between data structures (e.g. rmapbt and refcountbt) yields false corruption events; if repair is running, this results in incorrect repairs, which is catastrophic. Fix this by adding to the perag structure the count of active intents and make scrub wait until it has both AG header buffer locks and the intent counter reaches zero. One quirk of the drain code is that deferred bmap updates also bump and drop the intent counter. A fundamental decision made during the design phase of the reverse mapping feature is that updates to the rmapbt records are always made by the same code that updates the primary metadata. In other words, callers of bmapi functions expect that the bmapi functions will queue deferred rmap updates. Some parts of the reflink code queue deferred refcount (CUI) and bmap (BUI) updates in the same head transaction, but the deferred work manager completely finishes the CUI before the BUI work is started. As a result, the CUI drops the intent count long before the deferred rmap (RUI) update even has a chance to bump the intent count. The only way to keep the intent count elevated between the CUI and RUI is for the BUI to bump the counter until the RUI has been created. A second quirk of the intent drain code is that deferred work items must increment the intent counter as soon as the work item is added to the transaction. When a BUI completes and queues an RUI, the RUI must increment the counter before the BUI decrements it. The only way to accomplish this is to require that the counter be bumped as soon as the deferred work item is created in memory. In the next patches we'll improve on this facility, but this patch provides the basic functionality. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-04-12xfs: create traced helper to get extra perag referencesDarrick J. Wong6-14/+18
There are a few places in the XFS codebase where a caller has either an active or a passive reference to a perag structure and wants to give a passive reference to some other piece of code. Btree cursor creation and inode walks are good examples of this. Replace the open-coded logic with a helper to do this. The new function adds a few safeguards -- it checks that there's at least one reference to the perag structure passed in, and it records the refcount bump in the ftrace information. This makes it much easier to debug perag refcounting problems. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-04-12xfs: give xfs_refcount_intent its own perag referenceDarrick J. Wong2-19/+18
Give the xfs_refcount_intent a passive reference to the perag structure data. This reference will be used to enable scrub intent draining functionality in subsequent patches. Any space being modified by a refcount intent is already allocated, so we need to be able to operate even if the AG is being shrunk or offlined. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-04-12xfs: give xfs_rmap_intent its own perag referenceDarrick J. Wong2-18/+15
Give the xfs_rmap_intent a passive reference to the perag structure data. This reference will be used to enable scrub intent draining functionality in subsequent patches. The space we're (reverse) mapping is already allocated, so we need to be able to operate even if the AG is being shrunk or offlined. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-04-12xfs: give xfs_extfree_intent its own perag referenceDarrick J. Wong2-2/+9
Give the xfs_extfree_intent an passive reference to the perag structure data. This reference will be used to enable scrub intent draining functionality in subsequent patches. The space being freed must already be allocated, so we need to able to run even if the AG is being offlined or shrunk. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-04-12xfs: pass per-ag references to xfs_free_extentDarrick J. Wong5-21/+20
Pass a reference to the per-AG structure to xfs_free_extent. Most callers already have one, so we can eliminate unnecessary lookups. The one exception to this is the EFI code, which the next patch will fix. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-04-12xfs: give xfs_bmap_intent its own perag referenceDarrick J. Wong2-0/+5
Give the xfs_bmap_intent an active reference to the perag structure data. This reference will be used to enable scrub intent draining functionality in subsequent patches. Later, shrink will use these passive references to know if an AG is quiesced or not. The reason why we take a passive ref for a file mapping operation is simple: we're committing to some sort of action involving space in an AG, so we want to indicate our interest in that AG. The space is already allocated, so we need to be able to operate on AGs that are offline or being shrunk. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-03-24xfs: fix mismerged tracepointsDarrick J. Wong1-4/+4
At some point in between sending this patch to the list and merging it into for-next, the tracepoints got all mixed up because I've over-reliant on automated tools not sucking. The end result is that the tracepoints are all wrong, so fix them. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2023-03-24xfs: clear incore AGFL_RESET state if it's not neededDarrick J. Wong1-0/+2
Prior to commit 7ac2ff8bb371, when we loaded the incore perag structure with information from the AGF header, we would set or clear the pagf_agfl_reset field based on whether or not the AGFL list was misaligned within the block. IOWs, it's an incore state bit that's supposed to cache something in the ondisk metadata. Therefore, the code still needs to support clearing the incore bit if (somehow) the AGFL were to correct itself. It turns out that xfs_repair does exactly this -- phase 4 loads the AGF to scan the rmapbt for corrupt records, which can set NEEDS_AGFL_RESET. The scan unsets AGF_INIT but doesn't unset NEEDS_AGFL_RESET. Phase 5 totally rewrites the AGFL and fixes the alignment problem, didn't clear NEEDS_AGFL_RESET historically, and reloads the perag state to fix the freelist. This results in the AGFL being reset based on stale data, which then causes the new AGFL blocks to be leaked. A subsequent xfs_repair -n then complains about the leaks. One could argue that phase 5 ought to clear this bit directly when it reloads the perag AGF data after rewriting the AGFL, but libxfs used to handle this for us, so it should go back to doing that. Found by fuzzing flfirst = ones in xfs/352. Fixes: 7ac2ff8bb371 ("xfs: perags need atomic operational state") Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2023-03-19xfs: add tracepoints for each of the externally visible allocatorsDarrick J. Wong1-0/+17
There are now five separate space allocator interfaces exposed to the rest of XFS for five different strategies to find space. Add tracepoints for each of them so that I can tell from a trace dump exactly which ones got called and what happened underneath them. Add a sixth so it's more obvious if an allocation actually happened. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-03-19xfs: walk all AGs if TRYLOCK passed to xfs_alloc_vextent_iterate_agsDarrick J. Wong1-1/+5
Callers of xfs_alloc_vextent_iterate_ags that pass in the TRYLOCK flag want us to perform a non-blocking scan of the AGs for free space. There are no ordering constraints for non-blocking AGF lock acquisition, so the scan can freely start over at AG 0 even when minimum_agno > 0. This manifests fairly reliably on xfs/294 on 6.3-rc2 with the parent pointer patchset applied and the realtime volume enabled. I observed the following sequence as part of an xfs_dir_createname call: 0. Fragment the free space, then allocate nearly all the free space in all AGs except AG 0. 1. Create a directory in AG 2 and let it grow for a while. 2. Try to allocate 2 blocks to expand the dirent part of a directory. The space will be allocated out of AG 0, but the allocation will not be contiguous. This (I think) activates the LOWMODE allocator. 3. The bmapi call decides to convert from extents to bmbt format and tries to allocate 1 block. This allocation request calls xfs_alloc_vextent_start_ag with the inode number, which starts the scan at AG 2. We ignore AG 0 (with all its free space) and instead scrape AG 2 and 3 for more space. We find one block, but this now kicks t_highest_agno to 3. 4. The createname call decides it needs to split the dabtree. It tries to allocate even more space with xfs_alloc_vextent_start_ag, but now we're constrained to AG 3, and we don't find the space. The createname returns ENOSPC and the filesystem shuts down. This change fixes the problem by making the trylock scan wrap around to AG 0 if it doesn't like the AGs that it finds. Since the current transaction itself holds AGF 0, the trylock of AGF 0 will succeed, and we take space from the AG that has plenty. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-03-16xfs: try to idiot-proof the allocatorsDarrick J. Wong1-0/+13
In porting his development branch to 6.3-rc1, yours truly has repeatedly screwed up the args->pag being fed to the xfs_alloc_vextent* functions. Add some debugging assertions to test the preconditions required of the callers. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-02-27xfs: restore old agirotor behaviorDarrick J. Wong1-1/+2
Prior to the removal of xfs_ialloc_next_ag, we would increment the agi rotor and return the *old* value. atomic_inc_return returns the new value, which causes mkfs to allocate the root directory in AG 1. Put back the old behavior (at least for mkfs) by subtracting 1 here. Fixes: 20a5eab49d35 ("xfs: convert xfs_ialloc_next_ag() to an atomic") Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-02-13xfs: return a referenced perag from filestreams allocatorDave Chinner1-11/+28
Now that the filestreams AG selection tracks active perags, we need to return an active perag to the core allocator code. This is because the file allocation the filestreams code will run are AG specific allocations and so need to pin the AG until the allocations complete. We cannot rely on the filestreams item reference to do this - the filestreams association can be torn down at any time, hence we need to have a separate reference for the allocation process to pin the AG after it has been selected. This means there is some perag juggling in allocation failure fallback paths as they will do all AG scans in the case the AG specific allocation fails. Hence we need to track the perag reference that the filestream allocator returned to make sure we don't leak it on repeated allocation failure. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2023-02-13xfs: move xfs_bmap_btalloc_filestreams() to xfs_filestreams.cDave Chinner2-83/+14
xfs_bmap_btalloc_filestreams() calls two filestreams functions to select the AG to allocate from. Both those functions end up in the same selection function that iterates all AGs multiple times. Worst case, xfs_bmap_btalloc_filestreams() can iterate all AGs 4 times just to select the initial AG to allocate in. Move the AG selection to fs/xfs/xfs_filestreams.c as a single interface so that the inefficient AG interation is contained entirely within the filestreams code. This will allow the implementation to be simplified and made more efficient in future patches. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2023-02-13xfs: use xfs_bmap_longest_free_extent() in filestreamsDave Chinner2-1/+3
The code in xfs_bmap_longest_free_extent() is open coded in xfs_filestream_pick_ag(). Export xfs_bmap_longest_free_extent and call it from the filestreams code instead. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2023-02-13xfs: get rid of notinit from xfs_bmap_longest_free_extentDave Chinner1-48/+37
It is only set if reading the AGF gets a EAGAIN error. Just return the EAGAIN error and handle that error in the callers. This means we can remove the not_init parameter from xfs_bmap_select_minlen(), too, because the use of not_init there is pessimistic. If we can't read the agf, it won't increase blen. The only time we actually care whether we checked all the AGFs for contiguous free space is when the best length is less than the minimum allocation length. If not_init is set, then we ignore blen and set the minimum alloc length to the absolute minimum, not the best length we know already is present. However, if blen is less than the minimum we're going to ignore it anyway, regardless of whether we scanned all the AGFs or not. Hence not_init can go away, because we only use if blen is good from the scanned AGs otherwise we ignore it altogether and use minlen. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2023-02-13xfs: factor out filestreams from xfs_bmap_btalloc_nullfbDave Chinner1-71/+96
There's many if (filestreams) {} else {} branches in this function. Split it out into a filestreams specific function so that we can then work directly on cleaning up the filestreams code without impacting the rest of the allocation algorithms. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2023-02-13xfs: convert xfs_alloc_vextent_iterate_ags() to use perag walkerDave Chinner2-61/+57
Now that the AG iteration code in the core allocation code has been cleaned up, we can easily convert it to use a for_each_perag..() variant to use active references and skip AGs that it can't get active references on. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2023-02-13xfs: move the minimum agno checks into xfs_alloc_vextent_check_argsDave Chinner1-55/+33
All of the allocation functions now extract the minimum allowed AG from the transaction and then use it in some way. The allocation functions that are restricted to a single AG all check if the AG requested can be allocated from and return an error if so. These all set args->agno appropriately. All the allocation functions that iterate AGs use it to calculate the scan start AG. args->agno is not set until the iterator starts walking AGs. Hence we can easily set up a conditional check against the minimum AG allowed in xfs_alloc_vextent_check_args() based on whether args->agno contains NULLAGNUMBER or not and move all the repeated setup code to xfs_alloc_vextent_check_args(), further simplifying the allocation functions. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2023-02-13xfs: fold xfs_alloc_ag_vextent() into callersDave Chinner3-99/+28
We don't need the multiplexing xfs_alloc_ag_vextent() provided anymore - we can just call the exact/near/size variants directly. This allows us to remove args->type completely and stop using args->fsbno as an input to the allocator algorithms. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2023-02-13xfs: move allocation accounting to xfs_alloc_vextent_set_fsbno()Dave Chinner1-59/+63
Move it from xfs_alloc_ag_vextent() so we can get rid of that layer. Rename xfs_alloc_vextent_set_fsbno() to xfs_alloc_vextent_finish() to indicate that it's function is finishing off the allocation that we've run now that it contains much more functionality. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2023-02-13xfs: introduce xfs_alloc_vextent_prepare()Dave Chinner1-44/+76
Now that we have wrapper functions for each type of allocation we can ask for, we can start unravelling xfs_alloc_ag_vextent(). That is essentially just a prepare stage, the allocation multiplexer and a post-allocation accounting step is the allocation proceeded. The current xfs_alloc_vextent*() wrappers all have a prepare stage, the allocation operation and a post-allocation accounting step. We can consolidate this by moving the AG alloc prep code into the wrapper functions, the accounting code in the wrapper accounting functions, and cut out the multiplexer layer entirely. This patch consolidates the AG preparation stage. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2023-02-13xfs: introduce xfs_alloc_vextent_exact_bno()Dave Chinner5-23/+71
Two of the callers to xfs_alloc_vextent_this_ag() actually want exact block number allocation, not anywhere-in-ag allocation. Split this out from _this_ag() as a first class citizen so no external extent allocation code needs to care about args->type anymore. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2023-02-13xfs: introduce xfs_alloc_vextent_near_bno()Dave Chinner6-54/+55
The remaining callers of xfs_alloc_vextent() are all doing NEAR_BNO allocations. We can replace that function with a new xfs_alloc_vextent_near_bno() function that does this explicitly. We also multiplex NEAR_BNO allocations through xfs_alloc_vextent_this_ag via args->type. Replace all of these with direct calls to xfs_alloc_vextent_near_bno(), too. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2023-02-13xfs: use xfs_alloc_vextent_start_bno() where appropriateDave Chinner4-38/+51
Change obvious callers of single AG allocation to use xfs_alloc_vextent_start_bno(). Callers no long need to specify XFS_ALLOCTYPE_START_BNO, and so the type can be driven inward and removed. While doing this, also pass the allocation target fsb as a parameter rather than encoding it in args->fsbno. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2023-02-13xfs: use xfs_alloc_vextent_first_ag() where appropriateDave Chinner3-32/+42
Change obvious callers of single AG allocation to use xfs_alloc_vextent_first_ag(). This gets rid of XFS_ALLOCTYPE_FIRST_AG as the type used within xfs_alloc_vextent_first_ag() during iteration is _THIS_AG. Hence we can remove the setting of args->type from all the callers of _first_ag() and remove the alloctype. While doing this, pass the allocation target fsb as a parameter rather than encoding it in args->fsbno. This starts the process of making args->fsbno an output only variable rather than input/output. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>