summaryrefslogtreecommitdiff
path: root/fs/xfs/libxfs
AgeCommit message (Collapse)AuthorFilesLines
2023-02-13xfs: active perag reference countingDave Chinner2-9/+92
We need to be able to dynamically remove instantiated AGs from memory safely, either for shrinking the filesystem or paging AG state in and out of memory (e.g. supporting millions of AGs). This means we need to be able to safely exclude operations from accessing perags while dynamic removal is in progress. To do this, introduce the concept of active and passive references. Active references are required for high level operations that make use of an AG for a given operation (e.g. allocation) and pin the perag in memory for the duration of the operation that is operating on the perag (e.g. transaction scope). This means we can fail to get an active reference to an AG, hence callers of the new active reference API must be able to handle lookup failure gracefully. Passive references are used in low level code, where we might need to access the perag structure for the purposes of completing high level operations. For example, buffers need to use passive references because: - we need to be able to do metadata IO during operations like grow and shrink transactions where high level active references to the AG have already been blocked - buffers need to pin the perag until they are reclaimed from memory, something that high level code has no direct control over. - unused cached buffers should not prevent a shrink from being started. Hence we have active references that will form exclusion barriers for operations to be performed on an AG, and passive references that will prevent reclaim of the perag until all objects with passive references have been reclaimed themselves. This patch introduce xfs_perag_grab()/xfs_perag_rele() as the API for active AG reference functionality. We also need to convert the for_each_perag*() iterators to use active references, which will start the process of converting high level code over to using active references. Conversion of non-iterator based code to active references will be done in followup patches. Note that the implementation using reference counting is really just a development vehicle for the API to ensure we don't have any leaks in the callers. Once we need to remove perag structures from memory dyanmically, we will need a much more robust per-ag state transition mechanism for preventing new references from being taken while we wait for existing references to drain before removal from memory can occur.... Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Allison Henderson <allison.henderson@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2023-02-10xfs: t_firstblock is tracking AGs not blocksDave Chinner4-13/+11
The tp->t_firstblock field is now raelly tracking the highest AG we have locked, not the block number of the highest allocation we've made. It's purpose is to prevent AGF locking deadlocks, so rename it to "highest AG" and simplify the implementation to just track the agno rather than a fsbno. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Allison Henderson <allison.henderson@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2023-02-10xfs: drop firstblock constraints from allocation setupDave Chinner2-132/+66
Now that xfs_alloc_vextent() does all the AGF deadlock prevention filtering for multiple allocations in a single transaction, we no longer need the allocation setup code to care about what AGs we might already have locked. Hence we can remove all the "nullfb" conditional logic in places like xfs_bmap_btalloc() and instead have them focus simply on setting up locality constraints. If the allocation fails due to AGF lock filtering in xfs_alloc_vextent, then we just fall back as we normally do to more relaxed allocation constraints. As a result, any allocation that allows AG scanning (i.e. not confined to a single AG) and does not force a worst case full filesystem scan will now be able to attempt allocation from AGs lower than that defined by tp->t_firstblock. This is because xfs_alloc_vextent() allows try-locking of the AGFs and hence enables low space algorithms to at least -try- to get space from AGs lower than the one that we have currently locked and allocated from. This is a significant improvement in the low space allocation algorithm. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Allison Henderson <allison.henderson@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2023-02-10xfs: block reservation too large for minleft allocationDave Chinner3-11/+12
When we enter xfs_bmbt_alloc_block() without having first allocated a data extent (i.e. tp->t_firstblock == NULLFSBLOCK) because we are doing something like unwritten extent conversion, the transaction block reservation is used as the minleft value. This works for operations like unwritten extent conversion, but it assumes that the block reservation is only for a BMBT split. THis is not always true, and sometimes results in larger than necessary minleft values being set. We only actually need enough space for a btree split, something we already handle correctly in xfs_bmapi_write() via the xfs_bmapi_minleft() calculation. We should use xfs_bmapi_minleft() in xfs_bmbt_alloc_block() to calculate the number of blocks a BMBT split on this inode is going to require, not use the transaction block reservation that contains the maximum number of blocks this transaction may consume in it... Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Allison Henderson <allison.henderson@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2023-02-10xfs: prefer free inodes at ENOSPC over chunk allocationDave Chinner1-0/+17
When an XFS filesystem has free inodes in chunks already allocated on disk, it will still allocate new inode chunks if the target AG has no free inodes in it. Normally, this is a good idea as it preserves locality of all the inodes in a given directory. However, at ENOSPC this can lead to using the last few remaining free filesystem blocks to allocate a new chunk when there are many, many free inodes that could be allocated without consuming free space. This results in speeding up the consumption of the last few blocks and inode create operations then returning ENOSPC when there free inodes available because we don't have enough block left in the filesystem for directory creation reservations to proceed. Hence when we are near ENOSPC, we should be attempting to preserve the remaining blocks for directory block allocation rather than using them for unnecessary inode chunk creation. This particular behaviour is exposed by xfs/294, when it drives to ENOSPC on empty file creation whilst there are still thousands of free inodes available for allocation in other AGs in the filesystem. Hence, when we are within 1% of ENOSPC, change the inode allocation behaviour to prefer to use existing free inodes over allocating new inode chunks, even though it results is poorer locality of the data set. It is more important for the allocations to be space efficient near ENOSPC than to have optimal locality for performance, so lets modify the inode AG selection code to reflect that fact. This allows generic/294 to not only pass with this allocator rework patchset, but to increase the number of post-ENOSPC empty inode allocations to from ~600 to ~9080 before we hit ENOSPC on the directory create transaction reservation. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Allison Henderson <allison.henderson@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2023-02-10xfs: fix low space alloc deadlockDave Chinner2-26/+57
I've recently encountered an ABBA deadlock with g/476. The upcoming changes seem to make this much easier to hit, but the underlying problem is a pre-existing one. Essentially, if we select an AG for allocation, then lock the AGF and then fail to allocate for some reason (e.g. minimum length requirements cannot be satisfied), then we drop out of the allocation with the AGF still locked. The caller then modifies the allocation constraints - usually loosening them up - and tries again. This can result in trying to access AGFs that are lower than the AGF we already have locked from the failed attempt. e.g. the failed attempt skipped several AGs before failing, so we have locks an AG higher than the start AG. Retrying the allocation from the start AG then causes us to violate AGF lock ordering and this can lead to deadlocks. The deadlock exists even if allocation succeeds - we can do a followup allocations in the same transaction for BMBT blocks that aren't guaranteed to be in the same AG as the original, and can move into higher AGs. Hence we really need to move the tp->t_firstblock tracking down into xfs_alloc_vextent() where it can be set when we exit with a locked AG. xfs_alloc_vextent() can also check there if the requested allocation falls within the allow range of AGs set by tp->t_firstblock. If we can't allocate within the range set, we have to fail the allocation. If we are allowed to to non-blocking AGF locking, we can ignore the AG locking order limitations as we can use try-locks for the first iteration over requested AG range. This invalidates a set of post allocation asserts that check that the allocation is always above tp->t_firstblock if it is set. Because we can use try-locks to avoid the deadlock in some circumstances, having a pre-existing locked AGF doesn't always prevent allocation from lower order AGFs. Hence those ASSERTs need to be removed. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Allison Henderson <allison.henderson@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2023-02-05xfs: don't use BMBT btree split workers for IO completionDave Chinner1-2/+16
When we split a BMBT due to record insertion, we offload it to a worker thread because we can be deep in the stack when we try to allocate a new block for the BMBT. Allocation can use several kilobytes of stack (full memory reclaim, swap and/or IO path can end up on the stack during allocation) and we can already be several kilobytes deep in the stack when we need to split the BMBT. A recent workload demonstrated a deadlock in this BMBT split offload. It requires several things to happen at once: 1. two inodes need a BMBT split at the same time, one must be unwritten extent conversion from IO completion, the other must be from extent allocation. 2. there must be a no available xfs_alloc_wq worker threads available in the worker pool. 3. There must be sustained severe memory shortages such that new kworker threads cannot be allocated to the xfs_alloc_wq pool for both threads that need split work to be run 4. The split work from the unwritten extent conversion must run first. 5. when the BMBT block allocation runs from the split work, it must loop over all AGs and not be able to either trylock an AGF successfully, or each AGF is is able to lock has no space available for a single block allocation. 6. The BMBT allocation must then attempt to lock the AGF that the second task queued to the rescuer thread already has locked before it finds an AGF it can allocate from. At this point, we have an ABBA deadlock between tasks queued on the xfs_alloc_wq rescuer thread and a locked AGF. i.e. The queued task holding the AGF lock can't be run by the rescuer thread until the task the rescuer thread is runing gets the AGF lock.... This is a highly improbably series of events, but there it is. There's a couple of ways to fix this, but the easiest way to ensure that we only punt tasks with a locked AGF that holds enough space for the BMBT block allocations to the worker thread. This works for unwritten extent conversion in IO completion (which doesn't have a locked AGF and space reservations) because we have tight control over the IO completion stack. It is typically only 6 functions deep when xfs_btree_split() is called because we've already offloaded the IO completion work to a worker thread and hence we don't need to worry about stack overruns here. The other place we can be called for a BMBT split without a preceeding allocation is __xfs_bunmapi() when punching out the center of an existing extent. We don't remove extents in the IO path, so these operations don't tend to be called with a lot of stack consumed. Hence we don't really need to ship the split off to a worker thread in these cases, either. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2023-02-05xfs: pass refcount intent directly through the log intent codeDarrick J. Wong2-56/+44
Pass the incore refcount intent through the CUI logging code instead of repeatedly boxing and unboxing parameters. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2023-02-05xfs: pass rmap space mapping directly through the log intent codeDarrick J. Wong2-30/+26
Pass the incore rmap space mapping through the RUI logging code instead of repeatedly boxing and unboxing parameters. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2023-02-05xfs: fix confusing xfs_extent_item variable namesDarrick J. Wong1-16/+16
Change the name of all pointers to xfs_extent_item structures to "xefi" to make the name consistent and because the current selections ("new" and "free") mean other things in C. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2023-02-05xfs: pass the xfs_bmbt_irec directly through the log intent codeDarrick J. Wong2-21/+16
Instead of repeatedly boxing and unboxing the incore extent mapping structure as it passes through the BUI code, pass the pointer directly through. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2023-01-03xfs: fix off-by-one error in xfs_btree_space_to_heightDarrick J. Wong1-1/+6
Lately I've been stress-testing extreme-sized rmap btrees by using the (new) xfs_db bmap_inflate command to clone bmbt mappings billions of times and then using xfs_repair to build new rmap and refcount btrees. This of course is /much/ faster than actually FICLONEing a file billions of times. Unfortunately, xfs_repair fails in xfs_btree_bload_compute_geometry with EOVERFLOW, which indicates that xfs_mount.m_rmap_maxlevels is not sufficiently large for the test scenario. For a 1TB filesystem (~67 million AG blocks, 4 AGs) the btheight command reports: $ xfs_db -c 'btheight -n 4400801200 -w min rmapbt' /dev/sda rmapbt: worst case per 4096-byte block: 84 records (leaf) / 45 keyptrs (node) level 0: 4400801200 records, 52390491 blocks level 1: 52390491 records, 1164234 blocks level 2: 1164234 records, 25872 blocks level 3: 25872 records, 575 blocks level 4: 575 records, 13 blocks level 5: 13 records, 1 block 6 levels, 53581186 blocks total The AG is sufficiently large to build this rmap btree. Unfortunately, m_rmap_maxlevels is 5. Augmenting the loop in the space->height function to report height, node blocks, and blocks remaining produces this: ht 1 node_blocks 45 blockleft 67108863 ht 2 node_blocks 2025 blockleft 67108818 ht 3 node_blocks 91125 blockleft 67106793 ht 4 node_blocks 4100625 blockleft 67015668 final height: 5 The goal of this function is to compute the maximum height btree that can be stored in the given number of ondisk fsblocks. Starting with the top level of the tree, each iteration through the loop adds the fanout factor of the next level down until we run out of blocks. IOWs, maximum height is achieved by using the smallest fanout factor that can apply to that level. However, the loop setup is not correct. Top level btree blocks are allowed to contain fewer than minrecs items, so the computation is incorrect because the first time through the loop it should be using a fanout factor of 2. With this corrected, the above becomes: ht 1 node_blocks 2 blockleft 67108863 ht 2 node_blocks 90 blockleft 67108861 ht 3 node_blocks 4050 blockleft 67108771 ht 4 node_blocks 182250 blockleft 67104721 ht 5 node_blocks 8201250 blockleft 66922471 final height: 6 Fixes: 9ec691205e7d ("xfs: compute the maximum height of the rmap btree when reflink enabled") Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2022-12-14Merge tag 'xfs-6.2-merge-8' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds5-29/+148
Pull XFS updates from Darrick Wong: "The highlight of this is a batch of fixes for the online metadata checking code as we start the loooong march towards merging online repair. I aim to merge that in time for the 2023 LTS. There are also a large number of data corruption and race condition fixes in this patchset. Most notably fixed are write() calls to unwritten extents racing with writeback, which required some late(r than I prefer) code changes to iomap to support the necessary revalidations. I don't really like iomap changes going in past -rc4, but Dave and I have been working on it long enough that I chose to push it for 6.2 anyway. There are also a number of other subtle problems fixed, including the log racing with inode writeback to write inodes with incorrect link count to disk; file data mapping corruptions as a result of incorrect lock cycling when attaching dquots; refcount metadata corruption if one actually manages to share a block 2^32 times; and the log clobbering cow staging extents if they were formerly metadata blocks. Summary: - Fix a race condition w.r.t. percpu inode free counters - Fix a broken error return in xfs_remove - Print FS UUID at mount/unmount time - Numerous fixes to the online fsck code - Fix inode locking inconsistency problems when dealing with realtime metadata files - Actually merge pull requests so that we capture the cover letter contents - Fix a race between rebuilding VFS inode state and the AIL flushing inodes that could cause corrupt inodes to be written to the filesystem - Fix a data corruption problem resulting from a write() to an unwritten extent racing with writeback started on behalf of memory reclaim changing the extent state - Add debugging knobs so that we can test iomap invalidation - Fix the blockdev pagecache contents being stale after unmounting the filesystem, leading to spurious xfs_db errors and corrupt metadumps - Fix a file mapping corruption bug due to ilock cycling when attaching dquots to a file during delalloc reservation - Fix a refcount btree corruption problem due to the refcount adjustment code not handling MAXREFCOUNT correctly, resulting in unnecessary record splits - Fix COW staging extent alloctions not being classified as USERDATA, which results in filestreams being ignored and possible data corruption if the allocation was filled from the AGFL and the block buffer is still being tracked in the AIL - Fix new duplicated includes - Fix a race between the dquot shrinker and dquot freeing that could cause a UAF" * tag 'xfs-6.2-merge-8' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (50 commits) xfs: dquot shrinker doesn't check for XFS_DQFLAG_FREEING xfs: Remove duplicated include in xfs_iomap.c xfs: invalidate xfs_bufs when allocating cow extents xfs: get rid of assert from xfs_btree_islastblock xfs: estimate post-merge refcounts correctly xfs: hoist refcount record merge predicates xfs: fix super block buf log item UAF during force shutdown xfs: wait iclog complete before tearing down AIL xfs: attach dquots to inode before reading data/cow fork mappings xfs: shut up -Wuninitialized in xfsaild_push xfs: use memcpy, not strncpy, to format the attr prefix during listxattr xfs: invalidate block device page cache during unmount xfs: add debug knob to slow down write for fun xfs: add debug knob to slow down writeback for fun xfs: drop write error injection is unfixable, remove it xfs: use iomap_valid method to detect stale cached iomaps iomap: write iomap validity checks xfs: xfs_bmap_punch_delalloc_range() should take a byte range iomap: buffered write failure should not truncate the page cache xfs,iomap: move delalloc punching to iomap ...
2022-12-01xfs: invalidate xfs_bufs when allocating cow extentsDarrick J. Wong1-1/+1
While investigating test failures in xfs/17[1-3] in alwayscow mode, I noticed through code inspection that xfs_bmap_alloc_userdata isn't setting XFS_ALLOC_USERDATA when allocating extents for a file's CoW fork. COW staging extents should be flagged as USERDATA, since user data are persisted to these blocks before being remapped into a file. This mis-classification has a few impacts on the behavior of the system. First, the filestreams allocator is supposed to keep allocating from a chosen AG until it runs out of space in that AG. However, it only does that for USERDATA allocations, which means that COW allocations aren't tied to the filestreams AG. Fortunately, few people use filestreams, so nobody's noticed. A more serious problem is that xfs_alloc_ag_vextent_small looks for a buffer to invalidate *if* the USERDATA flag is set and the AG is so full that the allocation had to come from the AGFL because the cntbt is empty. The consequences of not invalidating the buffer are severe -- if the AIL incorrectly checkpoints a buffer that is now being used to store user data, that action will clobber the user's written data. Fix filestreams and yet another data corruption vector by flagging COW allocations as USERDATA. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2022-12-01xfs: get rid of assert from xfs_btree_islastblockGuo Xuenan1-1/+0
xfs_btree_check_block contains debugging knobs. With XFS_DEBUG setting up, turn on the debugging knob can trigger the assert of xfs_btree_islastblock, test script as follows: while true do mount $disk $mountpoint fsstress -d $testdir -l 0 -n 10000 -p 4 >/dev/null echo 1 > /sys/fs/xfs/sda/errortag/btree_chk_sblk sleep 10 umount $mountpoint done Kick off fsstress and only *then* turn on the debugging knob. If it happens that the knob gets turned on after the cntbt lookup succeeds but before the call to xfs_btree_islastblock, then we *can* end up in the situation where a previously checked btree block suddenly starts returning EFSCORRUPTED from xfs_btree_check_block. Kaboom. Darrick give a very detailed explanation as follows: Looking back at commit 27d9ee577dcce, I think the point of all this was to make sure that the cursor has actually performed a lookup, and that the btree block at whatever level we're asking about is ok. If the caller hasn't ever done a lookup, the bc_levels array will be empty, so cur->bc_levels[level].bp pointer will be NULL. The call to xfs_btree_get_block will crash anyway, so the "ASSERT(block);" part is pointless. If the caller did a lookup but the lookup failed due to block corruption, the corresponding cur->bc_levels[level].bp pointer will also be NULL, and we'll still crash. The "ASSERT(xfs_btree_check_block);" logic is also unnecessary. If the cursor level points to an inode root, the block buffer will be incore, so it had better always be consistent. If the caller ignores a failed lookup after a successful one and calls this function, the cursor state is garbage and the assert wouldn't have tripped anyway. So get rid of the assert. Fixes: 27d9ee577dcc ("xfs: actually check xfs_btree_check_block return in xfs_btree_islastblock") Signed-off-by: Guo Xuenan <guoxuenan@huawei.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-12-01xfs: estimate post-merge refcounts correctlyDarrick J. Wong1-4/+21
Upon enabling fsdax + reflink for XFS, xfs/179 began to report refcount metadata corruptions after being run. Specifically, xfs_repair noticed single-block refcount records that could be combined but had not been. The root cause of this is improper MAXREFCOUNT edge case handling in xfs_refcount_merge_extents. When we're trying to find candidates for a refcount btree record merge, we compute the refcount attribute of the merged record, but we fail to account for the fact that once a record hits rc_refcount == MAXREFCOUNT, it is pinned that way forever. Hence the computed refcount is wrong, and we fail to merge the extents. Fix this by adjusting the merge predicates to compute the adjusted refcount correctly. Fixes: 3172725814f9 ("xfs: adjust refcount of an extent of blocks in refcount btree") Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Xiao Yang <yangx.jy@fujitsu.com>
2022-12-01xfs: hoist refcount record merge predicatesDarrick J. Wong1-16/+113
Hoist these multiline conditionals into separate static inline helpers to improve readability and set the stage for corruption fixes that will be introduced in the next patch. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Xiao Yang <yangx.jy@fujitsu.com>
2022-11-29xfs: add debug knob to slow down write for funDarrick J. Wong1-1/+3
Add a new error injection knob so that we can arbitrarily slow down pagecache writes to test for race conditions and aberrant reclaim behavior if the writeback mechanisms are slow to issue writeback. This will enable functional testing for the ifork sequence counters introduced in commit 304a68b9c63b ("xfs: use iomap_valid method to detect stale cached iomaps") that fixes write racing with reclaim writeback. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2022-11-29xfs: add debug knob to slow down writeback for funDarrick J. Wong1-1/+3
Add a new error injection knob so that we can arbitrarily slow down writeback to test for race conditions and aberrant reclaim behavior if the writeback mechanisms are slow to issue writeback. This will enable functional testing for the ifork sequence counters introduced in commit 745b3f76d1c8 ("xfs: maintain a sequence count for inode fork manipulations"). Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2022-11-29Merge tag 'xfs-iomap-stale-fixes' of ↵Darrick J. Wong2-9/+9
git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs into xfs-6.2-mergeB xfs, iomap: fix data corruption due to stale cached iomaps This patch series fixes a data corruption that occurs in a specific multi-threaded write workload. The workload combined racing unaligned adjacent buffered writes with low memory conditions that caused both writeback and memory reclaim to race with the writes. The result of this was random partial blocks containing zeroes instead of the correct data. The underlying problem is that iomap caches the write iomap for the duration of the write() operation, but it fails to take into account that the extent underlying the iomap can change whilst the write is in progress. The short story is that an iomap can span mutliple folios, and so under low memory writeback can be cleaning folios the write() overlaps. Whilst the overlapping data is cached in memory, this isn't a problem, but because the folios are now clean they can be reclaimed. Once reclaimed, the write() does the wrong thing when re-instantiating partial folios because the iomap no longer reflects the underlying state of the extent. e.g. it thinks the extent is unwritten, so it zeroes the partial range, when in fact the underlying extent is now written and so it should have read the data from disk. This is how we get random zero ranges in the file instead of the correct data. The gory details of the race condition can be found here: https://lore.kernel.org/linux-xfs/20220817093627.GZ3600936@dread.disaster.area/ Fixing the problem has two aspects. The first aspect of the problem is ensuring that iomap can detect a stale cached iomap during a write in a race-free manner. We already do this stale iomap detection in the writeback path, so we have a mechanism for detecting that the iomap backing the data range may have changed and needs to be remapped. In the case of the write() path, we have to ensure that the iomap is validated at a point in time when the page cache is stable and cannot be reclaimed from under us. We also need to validate the extent before we start performing any modifications to the folio state or contents. Combine these two requirements together, and the only "safe" place to validate the iomap is after we have looked up and locked the folio we are going to copy the data into, but before we've performed any initialisation operations on that folio. If the iomap fails validation, we then mark it stale, unlock the folio and end the write. This effectively means a stale iomap results in a short write. Filesystems should already be able to handle this, as write operations can end short for many reasons and need to iterate through another mapping cycle to be completed. Hence the iomap changes needed to detect and handle stale iomaps during write() operations is relatively simple... However, the assumption is that filesystems should already be able to handle write failures safely, and that's where the second (first?) part of the problem exists. That is, handling a partial write is harder than just "punching out the unused delayed allocation extent". This is because mmap() based faults can race with writes, and if they land in the delalloc region that the write allocated, then punching out the delalloc region can cause data corruption. This data corruption problem is exposed by generic/346 when iomap is converted to detect stale iomaps during write() operations. Hence write failure in the filesytems needs to handle the fact that the write() in progress doesn't necessarily own the data in the page cache over the range of the delalloc extent it just allocated. As a result, we can't just truncate the page cache over the range the write() didn't reach and punch all the delalloc extent. We have to walk the page cache over the untouched range and skip over any dirty data region in the cache in that range. Which is .... non-trivial. That is, iterating the page cache has to handle partially populated folios (i.e. block size < page size) that contain data. The data might be discontiguous within a folio. Indeed, there might be *multiple* discontiguous data regions within a single folio. And to make matters more complex, multi-page folios mean we just don't know how many sub-folio regions we might have to iterate to find all these regions. All the corner cases between the conversions and rounding between filesystem block size, folio size and multi-page folio size combined with unaligned write offsets kept breaking my brain. However, if we convert the code to track the processed write regions by byte ranges instead of fileystem block or page cache index, we could simply use mapping_seek_hole_data() to find the start and end of each discrete data region within the range we needed to scan. SEEK_DATA finds the start of the cached data region, SEEK_HOLE finds the end of the region. These are byte based interfaces that understand partially uptodate folio regions, and so can iterate discrete sub-folio data regions directly. This largely solved the problem of discovering the dirty regions we need to keep the delalloc extent over. However, to use mapping_seek_hole_data() without needing to export it, we have to move all the delalloc extent cleanup to the iomap core and so now the iomap core can clean up delayed allocation extents in a safe, sane and filesystem neutral manner. With all this done, the original data corruption never occurs anymore, and we now have a generic mechanism for ensuring that page cache writes do not do the wrong thing when writeback and reclaim change the state of the physical extent and/or page cache contents whilst the write is in progress. Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Darrick J. Wong <djwong@kernel.org> * tag 'xfs-iomap-stale-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs: xfs: drop write error injection is unfixable, remove it xfs: use iomap_valid method to detect stale cached iomaps iomap: write iomap validity checks xfs: xfs_bmap_punch_delalloc_range() should take a byte range iomap: buffered write failure should not truncate the page cache xfs,iomap: move delalloc punching to iomap xfs: use byte ranges for write cleanup ranges xfs: punching delalloc extents on write failure is racy xfs: write page faults in iomap are not buffered writes
2022-11-29xfs: drop write error injection is unfixable, remove itDave Chinner1-7/+5
With the changes to scan the page cache for dirty data to avoid data corruptions from partial write cleanup racing with other page cache operations, the drop writes error injection no longer works the same way it used to and causes xfs/196 to fail. This is because xfs/196 writes to the file and populates the page cache before it turns on the error injection and starts failing -overwrites-. The result is that the original drop-writes code failed writes only -after- overwriting the data in the cache, followed by invalidates the cached data, then punching out the delalloc extent from under that data. On the surface, this looks fine. The problem is that page cache invalidation *doesn't guarantee that it removes anything from the page cache* and it doesn't change the dirty state of the folio. When block size == page size and we do page aligned IO (as xfs/196 does) everything happens to align perfectly and page cache invalidation removes the single page folios that span the written data. Hence the followup delalloc punch pass does not find cached data over that range and it can punch the extent out. IOWs, xfs/196 "works" for block size == page size with the new code. I say "works", because it actually only works for the case where IO is page aligned, and no data was read from disk before writes occur. Because the moment we actually read data first, the readahead code allocates multipage folios and suddenly the invalidate code goes back to zeroing subfolio ranges without changing dirty state. Hence, with multipage folios in play, block size == page size is functionally identical to block size < page size behaviour, and drop-writes is manifestly broken w.r.t to this case. Invalidation of a subfolio range doesn't result in the folio being removed from the cache, just the range gets zeroed. Hence after we've sequentially walked over a folio that we've dirtied (via write data) and then invalidated, we end up with a dirty folio full of zeroed data. And because the new code skips punching ranges that have dirty folios covering them, we end up leaving the delalloc range intact after failing all the writes. Hence failed writes now end up writing zeroes to disk in the cases where invalidation zeroes folios rather than removing them from cache. This is a fundamental change of behaviour that is needed to avoid the data corruption vectors that exist in the old write fail path, and it renders the drop-writes injection non-functional and unworkable as it stands. As it is, I think the error injection is also now unnecessary, as partial writes that need delalloc extent are going to be a lot more common with stale iomap detection in place. Hence this patch removes the drop-writes error injection completely. xfs/196 can remain for testing kernels that don't have this data corruption fix, but those that do will report: xfs/196 3s ... [not run] XFS error injection drop_writes unknown on this kernel. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2022-11-29xfs: use iomap_valid method to detect stale cached iomapsDave Chinner1-2/+4
Now that iomap supports a mechanism to validate cached iomaps for buffered write operations, hook it up to the XFS buffered write ops so that we can avoid data corruptions that result from stale cached iomaps. See: https://lore.kernel.org/linux-xfs/20220817093627.GZ3600936@dread.disaster.area/ or the ->iomap_valid() introduction commit for exact details of the corruption vector. The validity cookie we store in the iomap is based on the type of iomap we return. It is expected that the iomap->flags we set in xfs_bmbt_to_iomap() is not perturbed by the iomap core and are returned to us in the iomap passed via the .iomap_valid() callback. This ensures that the validity cookie is always checking the correct inode fork sequence numbers to detect potential changes that affect the extent cached by the iomap. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2022-11-18treewide: use get_random_u32_below() instead of deprecated functionJason A. Donenfeld2-2/+2
This is a simple mechanical transformation done by: @@ expression E; @@ - prandom_u32_max + get_random_u32_below (E) Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Acked-by: Darrick J. Wong <djwong@kernel.org> # for xfs Reviewed-by: SeongJae Park <sj@kernel.org> # for damon Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> # for infiniband Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> # for arm Acked-by: Ulf Hansson <ulf.hansson@linaro.org> # for mmc Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2022-11-17xfs: fix sb write verify for lazysbcountLong Li1-1/+3
When lazysbcount is enabled, fsstress and loop mount/unmount test report the following problems: XFS (loop0): SB summary counter sanity check failed XFS (loop0): Metadata corruption detected at xfs_sb_write_verify+0x13b/0x460, xfs_sb block 0x0 XFS (loop0): Unmount and run xfs_repair XFS (loop0): First 128 bytes of corrupted metadata buffer: 00000000: 58 46 53 42 00 00 10 00 00 00 00 00 00 28 00 00 XFSB.........(.. 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 00000020: 69 fb 7c cd 5f dc 44 af 85 74 e0 cc d4 e3 34 5a i.|._.D..t....4Z 00000030: 00 00 00 00 00 20 00 06 00 00 00 00 00 00 00 80 ..... .......... 00000040: 00 00 00 00 00 00 00 81 00 00 00 00 00 00 00 82 ................ 00000050: 00 00 00 01 00 0a 00 00 00 00 00 04 00 00 00 00 ................ 00000060: 00 00 0a 00 b4 b5 02 00 02 00 00 08 00 00 00 00 ................ 00000070: 00 00 00 00 00 00 00 00 0c 09 09 03 14 00 00 19 ................ XFS (loop0): Corruption of in-memory data (0x8) detected at _xfs_buf_ioapply +0xe1e/0x10e0 (fs/xfs/xfs_buf.c:1580). Shutting down filesystem. XFS (loop0): Please unmount the filesystem and rectify the problem(s) XFS (loop0): log mount/recovery failed: error -117 XFS (loop0): log mount failed This corruption will shutdown the file system and the file system will no longer be mountable. The following script can reproduce the problem, but it may take a long time. #!/bin/bash device=/dev/sda testdir=/mnt/test round=0 function fail() { echo "$*" exit 1 } mkdir -p $testdir while [ $round -lt 10000 ] do echo "******* round $round ********" mkfs.xfs -f $device mount $device $testdir || fail "mount failed!" fsstress -d $testdir -l 0 -n 10000 -p 4 >/dev/null & sleep 4 killall -w fsstress umount $testdir xfs_repair -e $device > /dev/null if [ $? -eq 2 ];then echo "ERR CODE 2: Dirty log exception during repair." exit 1 fi round=$(($round+1)) done With lazysbcount is enabled, There is no additional lock protection for reading m_ifree and m_icount in xfs_log_sb(), if other cpu modifies the m_ifree, this will make the m_ifree greater than m_icount. For example, consider the following sequence and ifreedelta is postive: CPU0 CPU1 xfs_log_sb xfs_trans_unreserve_and_mod_sb ---------- ------------------------------ percpu_counter_sum(&mp->m_icount) percpu_counter_add_batch(&mp->m_icount, idelta, XFS_ICOUNT_BATCH) percpu_counter_add(&mp->m_ifree, ifreedelta); percpu_counter_sum(&mp->m_ifree) After this, incorrect inode count (sb_ifree > sb_icount) will be writen to the log. In the subsequent writing of sb, incorrect inode count (sb_ifree > sb_icount) will fail to pass the boundary check in xfs_validate_sb_write() that cause the file system shutdown. When lazysbcount is enabled, we don't need to guarantee that Lazy sb counters are completely correct, but we do need to guarantee that sb_ifree <= sb_icount. On the other hand, the constraint that m_ifree <= m_icount must be satisfied any time that there /cannot/ be other threads allocating or freeing inode chunks. If the constraint is violated under these circumstances, sb_i{count,free} (the ondisk superblock inode counters) maybe incorrect and need to be marked sick at unmount, the count will be rebuilt on the next mount. Fixes: 8756a5af1819 ("libxfs: add more bounds checking to sb sanity checks") Signed-off-by: Long Li <leo.lilong@huawei.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-10-31xfs: rename XFS_REFC_COW_START to _COWFLAGDarrick J. Wong3-6/+6
We've been (ab)using XFS_REFC_COW_START as both an integer quantity and a bit flag, even though it's *only* a bit flag. Rename the variable to reflect its nature and update the cast target since we're not supposed to be comparing it to xfs_agblock_t now. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2022-10-31xfs: fix uninitialized list head in struct xfs_refcount_recoveryDarrick J. Wong1-4/+6
We're supposed to initialize the list head of an object before adding it to another list. Fix that, and stop using the kmem_{alloc,free} calls from the Irix days. Fixes: 174edb0e46e5 ("xfs: store in-progress CoW allocations in the refcount btree") Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2022-10-31xfs: fix agblocks check in the cow leftover recovery functionDarrick J. Wong1-1/+3
As we've seen, refcount records use the upper bit of the rc_startblock field to ensure that all the refcount records are at the right side of the refcount btree. This works because an AG is never allowed to have more than (1U << 31) blocks in it. If we ever encounter a filesystem claiming to have that many blocks, we absolutely do not want reflink touching it at all. However, this test at the start of xfs_refcount_recover_cow_leftovers is slightly incorrect -- it /should/ be checking that agblocks isn't larger than the XFS_MAX_CRC_AG_BLOCKS constant, and it should check that the constant is never large enough to conflict with that CoW flag. Note that the V5 superblock verifier has not historically rejected filesystems where agblocks >= XFS_MAX_CRC_AG_BLOCKS, which is why this ended up in the COW recovery routine. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2022-10-31xfs: check record domain when accessing refcount recordsDarrick J. Wong1-12/+41
Now that we've separated the startblock and CoW/shared extent domain in the incore refcount record structure, check the domain whenever we retrieve a record to ensure that it's still in the domain that we want. Depending on the circumstances, a change in domain either means we're done processing or that we've found a corruption and need to fail out. The refcount check in xchk_xref_is_cow_staging is redundant since _get_rec has done that for a long time now, so we can get rid of it. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2022-10-31xfs: remove XFS_FIND_RCEXT_SHARED and _COWDarrick J. Wong1-31/+17
Now that we have an explicit enum for shared and CoW staging extents, we can get rid of the old FIND_RCEXT flags. Omit a couple of conversions that disappear in the next patches. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2022-10-31xfs: refactor domain and refcount checkingDarrick J. Wong2-4/+13
Create a helper function to ensure that CoW staging extent records have a single refcount and that shared extent records have more than 1 refcount. We'll put this to more use in the next patch. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2022-10-31xfs: report refcount domain in tracepointsDarrick J. Wong1-0/+4
Now that we've broken out the startblock and shared/cow domain in the incore refcount extent record structure, update the tracepoints to report the domain. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2022-10-31xfs: track cow/shared record domains explicitly in xfs_refcount_irecDarrick J. Wong4-54/+141
Just prior to committing the reflink code into upstream, the xfs maintainer at the time requested that I find a way to shard the refcount records into two domains -- one for records tracking shared extents, and a second for tracking CoW staging extents. The idea here was to minimize mount time CoW reclamation by pushing all the CoW records to the right edge of the keyspace, and it was accomplished by setting the upper bit in rc_startblock. We don't allow AGs to have more than 2^31 blocks, so the bit was free. Unfortunately, this was a very late addition to the codebase, so most of the refcount record processing code still treats rc_startblock as a u32 and pays no attention to whether or not the upper bit (the cow flag) is set. This is a weakness is theoretically exploitable, since we're not fully validating the incoming metadata records. Fuzzing demonstrates practical exploits of this weakness. If the cow flag of a node block key record is corrupted, a lookup operation can go to the wrong record block and start returning records from the wrong cow/shared domain. This causes the math to go all wrong (since cow domain is still implicit in the upper bit of rc_startblock) and we can crash the kernel by tricking xfs into jumping into a nonexistent AG and tripping over xfs_perag_get(mp, <nonexistent AG>) returning NULL. To fix this, start tracking the domain as an explicit part of struct xfs_refcount_irec, adjust all refcount functions to check the domain of a returned record, and alter the function definitions to accept them where necessary. Found by fuzzing keys[2].cowflag = add in xfs/464. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2022-10-31xfs: move _irec structs to xfs_types.hDarrick J. Wong2-20/+20
Structure definitions for incore objects do not belong in the ondisk format header. Move them to the incore types header where they belong. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2022-10-31xfs: check deferred refcount op continuation parametersDarrick J. Wong1-2/+36
If we're in the middle of a deferred refcount operation and decide to roll the transaction to avoid overflowing the transaction space, we need to check the new agbno/aglen parameters that we're about to record in the new intent. Specifically, we need to check that the new extent is completely within the filesystem, and that continuation does not put us into a different AG. If the keys of a node block are wrong, the lookup to resume an xfs_refcount_adjust_extents operation can put us into the wrong record block. If this happens, we might not find that we run out of aglen at an exact record boundary, which will cause the loop control to do the wrong thing. The previous patch should take care of that problem, but let's add this extra sanity check to stop corruption problems sooner than later. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2022-10-31xfs: create a predicate to verify per-AG extentsDarrick J. Wong4-17/+19
Create a predicate function to verify that a given agbno/blockcount pair fit entirely within a single allocation group and don't suffer mathematical overflows. Refactor the existng open-coded logic; we're going to add more calls to this function in the next patch. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2022-10-31xfs: make sure aglen never goes negative in xfs_refcount_adjust_extentsDarrick J. Wong1-3/+17
Prior to calling xfs_refcount_adjust_extents, we trimmed agbno/aglen such that the end of the range would not be in the middle of a refcount record. If this is no longer the case, something is seriously wrong with the btree. Bail out with a corruption error. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2022-10-31xfs: refactor all the EFI/EFD log item sizeof logicDarrick J. Wong1-0/+48
Refactor all the open-coded sizeof logic for EFI/EFD log item and log format structures into common helper functions whose names reflect the struct names. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Allison Henderson <allison.henderson@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2022-10-31xfs: fix memcpy fortify errors in EFI log format copyingDarrick J. Wong1-6/+6
Starting in 6.1, CONFIG_FORTIFY_SOURCE checks the length parameter of memcpy. Since we're already fixing problems with BUI item copying, we should fix it everything else. An extra difficulty here is that the ef[id]_extents arrays are declared as single-element arrays. This is not the convention for flex arrays in the modern kernel, and it causes all manner of problems with static checking tools, since they often cannot tell the difference between a single element array and a flex array. So for starters, change those array[1] declarations to array[] declarations to signal that they are proper flex arrays and adjust all the "size-1" expressions to fit the new declaration style. Next, refactor the xfs_efi_copy_format function to handle the copying of the head and the flex array members separately. While we're at it, fix a minor validation deficiency in the recovery function. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Allison Henderson <allison.henderson@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2022-10-26xfs: increase rename inode reservationAllison Henderson1-2/+2
xfs_rename can update up to 5 inodes: src_dp, target_dp, src_ip, target_ip and wip. So we need to increase the inode reservation to match. Signed-off-by: Allison Henderson <allison.henderson@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-10-20xfs: fix exception caused by unexpected illegal bestcount in leaf dirGuo Xuenan1-2/+7
For leaf dir, In most cases, there should be as many bestfree slots as the dir data blocks that can fit under i_size (except for [1]). Root cause is we don't examin the number bestfree slots, when the slots number less than dir data blocks, if we need to allocate new dir data block and update the bestfree array, we will use the dir block number as index to assign bestfree array, while we did not check the leaf buf boundary which may cause UAF or other memory access problem. This issue can also triggered with test cases xfs/473 from fstests. According to Dave Chinner & Darrick's suggestion, adding buffer verifier to detect this abnormal situation in time. Simplify the testcase for fstest xfs/554 [1] The error log is shown as follows: ================================================================== BUG: KASAN: use-after-free in xfs_dir2_leaf_addname+0x1995/0x1ac0 Write of size 2 at addr ffff88810168b000 by task touch/1552 CPU: 5 PID: 1552 Comm: touch Not tainted 6.0.0-rc3+ #101 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1.1 04/01/2014 Call Trace: <TASK> dump_stack_lvl+0x4d/0x66 print_report.cold+0xf6/0x691 kasan_report+0xa8/0x120 xfs_dir2_leaf_addname+0x1995/0x1ac0 xfs_dir_createname+0x58c/0x7f0 xfs_create+0x7af/0x1010 xfs_generic_create+0x270/0x5e0 path_openat+0x270b/0x3450 do_filp_open+0x1cf/0x2b0 do_sys_openat2+0x46b/0x7a0 do_sys_open+0xb7/0x130 do_syscall_64+0x35/0x80 entry_SYSCALL_64_after_hwframe+0x63/0xcd RIP: 0033:0x7fe4d9e9312b Code: 25 00 00 41 00 3d 00 00 41 00 74 4b 64 8b 04 25 18 00 00 00 85 c0 75 67 44 89 e2 48 89 ee bf 9c ff ff ff b8 01 01 00 00 0f 05 <48> 3d 00 f0 ff ff 0f 87 91 00 00 00 48 8b 4c 24 28 64 48 33 0c 25 RSP: 002b:00007ffda4c16c20 EFLAGS: 00000246 ORIG_RAX: 0000000000000101 RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007fe4d9e9312b RDX: 0000000000000941 RSI: 00007ffda4c17f33 RDI: 00000000ffffff9c RBP: 00007ffda4c17f33 R08: 0000000000000000 R09: 0000000000000000 R10: 00000000000001b6 R11: 0000000000000246 R12: 0000000000000941 R13: 00007fe4d9f631a4 R14: 00007ffda4c17f33 R15: 0000000000000000 </TASK> The buggy address belongs to the physical page: page:ffffea000405a2c0 refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x10168b flags: 0x2fffff80000000(node=0|zone=2|lastcpupid=0x1fffff) raw: 002fffff80000000 ffffea0004057788 ffffea000402dbc8 0000000000000000 raw: 0000000000000000 0000000000170000 00000000ffffffff 0000000000000000 page dumped because: kasan: bad access detected Memory state around the buggy address: ffff88810168af00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ffff88810168af80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 >ffff88810168b000: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ^ ffff88810168b080: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ffff88810168b100: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ================================================================== Disabling lock debugging due to kernel taint 00000000: 58 44 44 33 5b 53 35 c2 00 00 00 00 00 00 00 78 XDD3[S5........x XFS (sdb): Internal error xfs_dir2_data_use_free at line 1200 of file fs/xfs/libxfs/xfs_dir2_data.c. Caller xfs_dir2_data_use_free+0x28a/0xeb0 CPU: 5 PID: 1552 Comm: touch Tainted: G B 6.0.0-rc3+ Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1.1 04/01/2014 Call Trace: <TASK> dump_stack_lvl+0x4d/0x66 xfs_corruption_error+0x132/0x150 xfs_dir2_data_use_free+0x198/0xeb0 xfs_dir2_leaf_addname+0xa59/0x1ac0 xfs_dir_createname+0x58c/0x7f0 xfs_create+0x7af/0x1010 xfs_generic_create+0x270/0x5e0 path_openat+0x270b/0x3450 do_filp_open+0x1cf/0x2b0 do_sys_openat2+0x46b/0x7a0 do_sys_open+0xb7/0x130 do_syscall_64+0x35/0x80 entry_SYSCALL_64_after_hwframe+0x63/0xcd RIP: 0033:0x7fe4d9e9312b Code: 25 00 00 41 00 3d 00 00 41 00 74 4b 64 8b 04 25 18 00 00 00 85 c0 75 67 44 89 e2 48 89 ee bf 9c ff ff ff b8 01 01 00 00 0f 05 <48> 3d 00 f0 ff ff 0f 87 91 00 00 00 48 8b 4c 24 28 64 48 33 0c 25 RSP: 002b:00007ffda4c16c20 EFLAGS: 00000246 ORIG_RAX: 0000000000000101 RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007fe4d9e9312b RDX: 0000000000000941 RSI: 00007ffda4c17f46 RDI: 00000000ffffff9c RBP: 00007ffda4c17f46 R08: 0000000000000000 R09: 0000000000000001 R10: 00000000000001b6 R11: 0000000000000246 R12: 0000000000000941 R13: 00007fe4d9f631a4 R14: 00007ffda4c17f46 R15: 0000000000000000 </TASK> XFS (sdb): Corruption detected. Unmount and run xfs_repair [1] https://lore.kernel.org/all/20220928095355.2074025-1-guoxuenan@huawei.com/ Reviewed-by: Hou Tao <houtao1@huawei.com> Signed-off-by: Guo Xuenan <guoxuenan@huawei.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-10-12treewide: use get_random_u32() when possibleJason A. Donenfeld1-1/+1
The prandom_u32() function has been a deprecated inline wrapper around get_random_u32() for several releases now, and compiles down to the exact same code. Replace the deprecated wrapper with a direct call to the real function. The same also applies to get_random_int(), which is just a wrapper around get_random_u32(). This was done as a basic find and replace. Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Yury Norov <yury.norov@gmail.com> Reviewed-by: Jan Kara <jack@suse.cz> # for ext4 Acked-by: Toke Høiland-Jørgensen <toke@toke.dk> # for sch_cake Acked-by: Chuck Lever <chuck.lever@oracle.com> # for nfsd Acked-by: Jakub Kicinski <kuba@kernel.org> Acked-by: Mika Westerberg <mika.westerberg@linux.intel.com> # for thunderbolt Acked-by: Darrick J. Wong <djwong@kernel.org> # for xfs Acked-by: Helge Deller <deller@gmx.de> # for parisc Acked-by: Heiko Carstens <hca@linux.ibm.com> # for s390 Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2022-10-12treewide: use prandom_u32_max() when possible, part 1Jason A. Donenfeld2-2/+2
Rather than incurring a division or requesting too many random bytes for the given range, use the prandom_u32_max() function, which only takes the minimum required bytes from the RNG and avoids divisions. This was done mechanically with this coccinelle script: @basic@ expression E; type T; identifier get_random_u32 =~ "get_random_int|prandom_u32|get_random_u32"; typedef u64; @@ ( - ((T)get_random_u32() % (E)) + prandom_u32_max(E) | - ((T)get_random_u32() & ((E) - 1)) + prandom_u32_max(E * XXX_MAKE_SURE_E_IS_POW2) | - ((u64)(E) * get_random_u32() >> 32) + prandom_u32_max(E) | - ((T)get_random_u32() & ~PAGE_MASK) + prandom_u32_max(PAGE_SIZE) ) @multi_line@ identifier get_random_u32 =~ "get_random_int|prandom_u32|get_random_u32"; identifier RAND; expression E; @@ - RAND = get_random_u32(); ... when != RAND - RAND %= (E); + RAND = prandom_u32_max(E); // Find a potential literal @literal_mask@ expression LITERAL; type T; identifier get_random_u32 =~ "get_random_int|prandom_u32|get_random_u32"; position p; @@ ((T)get_random_u32()@p & (LITERAL)) // Add one to the literal. @script:python add_one@ literal << literal_mask.LITERAL; RESULT; @@ value = None if literal.startswith('0x'): value = int(literal, 16) elif literal[0] in '123456789': value = int(literal, 10) if value is None: print("I don't know how to handle %s" % (literal)) cocci.include_match(False) elif value == 2**32 - 1 or value == 2**31 - 1 or value == 2**24 - 1 or value == 2**16 - 1 or value == 2**8 - 1: print("Skipping 0x%x for cleanup elsewhere" % (value)) cocci.include_match(False) elif value & (value + 1) != 0: print("Skipping 0x%x because it's not a power of two minus one" % (value)) cocci.include_match(False) elif literal.startswith('0x'): coccinelle.RESULT = cocci.make_expr("0x%x" % (value + 1)) else: coccinelle.RESULT = cocci.make_expr("%d" % (value + 1)) // Replace the literal mask with the calculated result. @plus_one@ expression literal_mask.LITERAL; position literal_mask.p; expression add_one.RESULT; identifier FUNC; @@ - (FUNC()@p & (LITERAL)) + prandom_u32_max(RESULT) @collapse_ret@ type T; identifier VAR; expression E; @@ { - T VAR; - VAR = (E); - return VAR; + return E; } @drop_var@ type T; identifier VAR; @@ { - T VAR; ... when != VAR } Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Yury Norov <yury.norov@gmail.com> Reviewed-by: KP Singh <kpsingh@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> # for ext4 and sbitmap Reviewed-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com> # for drbd Acked-by: Jakub Kicinski <kuba@kernel.org> Acked-by: Heiko Carstens <hca@linux.ibm.com> # for s390 Acked-by: Ulf Hansson <ulf.hansson@linaro.org> # for mmc Acked-by: Darrick J. Wong <djwong@kernel.org> # for xfs Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2022-10-04xfs: rearrange the logic and remove the broken comment for xfs_dir2_isxxShida Zhang2-22/+32
xfs_dir2_isleaf is used to see if the directory is a single-leaf form directory instead, as commented right above the function. Besides getting rid of the broken comment, we rearrange the logic by converting everything over to standard formatting and conventions, at the same time, to make it easier to understand and self documenting. Signed-off-by: Shida Zhang <zhangshida@kylinos.cn> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Dave Chinner <david@fromorbit.com>
2022-10-04xfs: trim the mapp array accordingly in xfs_da_grow_inode_intShida Zhang1-1/+1
Take a look at the for-loop in xfs_da_grow_inode_int: ====== for(){ nmap = min(XFS_BMAP_MAX_NMAP, count); ... error = xfs_bmapi_write(...,&mapp[mapi], &nmap);//(..., $1, $2) ... mapi += nmap; } ===== where $1 stands for the start address of the array, while $2 is used to indicate the size of the array. The array $1 will advance by $nmap in each iteration after the allocation of extents. But the size $2 still remains unchanged, which is determined by min(XFS_BMAP_MAX_NMAP, count). It seems that it has forgotten to trim the mapp array after each iteration, so change it. Signed-off-by: Shida Zhang <zhangshida@kylinos.cn> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Dave Chinner <david@fromorbit.com>
2022-09-18xfs: Remove the unneeded result variableye xingchen1-3/+1
Return the value xfs_dir_cilookup_result() directly instead of storing it in another redundant variable. Reported-by: Zeal Robot <zealci@zte.com.cn> Signed-off-by: ye xingchen <ye.xingchen@zte.com.cn> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Dave Chinner <david@fromorbit.com>
2022-09-18xfs: clean up "%Ld/%Lu" which doesn't meet C standardZeng Heng2-3/+3
The "%Ld" specifier, which represents long long unsigned, doesn't meet C language standard, and even more, it makes people easily mistake with "%ld", which represent long unsigned. So replace "%Ld" with "lld". Do the same with "%Lu". Signed-off-by: Zeng Heng <zengheng4@huawei.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2022-08-11xfs: fix inode reservation space for removing transactionhexiaole1-1/+1
In 'fs/xfs/libxfs/xfs_trans_resv.c', the comment for transaction of removing a directory entry writes: /* fs/xfs/libxfs/xfs_trans_resv.c begin */ /* * For removing a directory entry we can modify: * the parent directory inode: inode size * the removed inode: inode size ... xfs_calc_remove_reservation( struct xfs_mount *mp) { return XFS_DQUOT_LOGRES(mp) + xfs_calc_iunlink_add_reservation(mp) + max((xfs_calc_inode_res(mp, 1) + ... /* fs/xfs/libxfs/xfs_trans_resv.c end */ There has 2 inode size of space to be reserverd, but the actual code for inode reservation space writes. There only count for 1 inode size to be reserved in 'xfs_calc_inode_res(mp, 1)', rather than 2. Signed-off-by: hexiaole <hexiaole@kylinos.cn> Reviewed-by: Darrick J. Wong <djwong@kernel.org> [djwong: remove redundant code citations] Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-07-22xfs: Fix typo 'the the' in commentSlark Xiao1-1/+1
Replace 'the the' with 'the' in the comment. Signed-off-by: Slark Xiao <slark_xiao@163.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-07-21xfs: don't leak memory when attr fork loading failsDarrick J. Wong2-5/+1
I observed the following evidence of a memory leak while running xfs/399 from the xfs fsck test suite (edited for brevity): XFS (sde): Metadata corruption detected at xfs_attr_shortform_verify_struct.part.0+0x7b/0xb0 [xfs], inode 0x1172 attr fork XFS: Assertion failed: ip->i_af.if_u1.if_data == NULL, file: fs/xfs/libxfs/xfs_inode_fork.c, line: 315 ------------[ cut here ]------------ WARNING: CPU: 2 PID: 91635 at fs/xfs/xfs_message.c:104 assfail+0x46/0x4a [xfs] CPU: 2 PID: 91635 Comm: xfs_scrub Tainted: G W 5.19.0-rc7-xfsx #rc7 6e6475eb29fd9dda3181f81b7ca7ff961d277a40 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.15.0-1 04/01/2014 RIP: 0010:assfail+0x46/0x4a [xfs] Call Trace: <TASK> xfs_ifork_zap_attr+0x7c/0xb0 xfs_iformat_attr_fork+0x86/0x110 xfs_inode_from_disk+0x41d/0x480 xfs_iget+0x389/0xd70 xfs_bulkstat_one_int+0x5b/0x540 xfs_bulkstat_iwalk+0x1e/0x30 xfs_iwalk_ag_recs+0xd1/0x160 xfs_iwalk_run_callbacks+0xb9/0x180 xfs_iwalk_ag+0x1d8/0x2e0 xfs_iwalk+0x141/0x220 xfs_bulkstat+0x105/0x180 xfs_ioc_bulkstat.constprop.0.isra.0+0xc5/0x130 xfs_file_ioctl+0xa5f/0xef0 __x64_sys_ioctl+0x82/0xa0 do_syscall_64+0x2b/0x80 entry_SYSCALL_64_after_hwframe+0x46/0xb0 This newly-added assertion checks that there aren't any incore data structures hanging off the incore fork when we're trying to reset its contents. From the call trace, it is evident that iget was trying to construct an incore inode from the ondisk inode, but the attr fork verifier failed and we were trying to undo all the memory allocations that we had done earlier. The three assertions in xfs_ifork_zap_attr check that the caller has already called xfs_idestroy_fork, which clearly has not been done here. As the zap function then zeroes the pointers, we've effectively leaked the memory. The shortest change would have been to insert an extra call to xfs_idestroy_fork, but it makes more sense to bundle the _idestroy_fork call into _zap_attr, since all other callsites call _idestroy_fork immediately prior to calling _zap_attr. IOWs, it eliminates one way to fail. Note: This change only applies cleanly to 2ed5b09b3e8f, since we just reworked the attr fork lifetime. However, I think this memory leak has existed since 0f45a1b20cd8, since the chain xfs_iformat_attr_fork -> xfs_iformat_local -> xfs_init_local_fork will allocate ifp->if_u1.if_data, but if xfs_ifork_verify_local_attr fails, xfs_iformat_attr_fork will free i_afp without freeing any of the stuff hanging off i_afp. The solution for older kernels I think is to add the missing call to xfs_idestroy_fork just prior to calling kmem_cache_free. Found by fuzzing a.sfattr.hdr.totsize = lastbit in xfs/399. Fixes: 2ed5b09b3e8f ("xfs: make inode attribute forks a permanent part of struct xfs_inode") Probably-Fixes: 0f45a1b20cd8 ("xfs: improve local fork verification") Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2022-07-21xfs: delete unnecessary NULL checksDan Carpenter1-2/+1
These NULL check are no long needed after commit 2ed5b09b3e8f ("xfs: make inode attribute forks a permanent part of struct xfs_inode"). Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>