Age | Commit message (Collapse) | Author | Files | Lines |
|
When calling fdget() in xfs_ioc_swapext(), we need to verify that
the file descriptors passed into the ioctl point to XFS inodes
before we start operations on them. If we don't do this, we could be
referencing arbitrary kernel memory as an XFS inode. THis could lead
to memory corruption and/or performing locking operations on
attacker-chosen structures in kernel memory.
[dchinner: rewrite commit message ]
[dchinner: add comment explaining new check ]
Signed-off-by: Jann Horn <jann@thejh.net>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
|
|
|
|
Create a common function to calculate the maximum height of a per-AG
btree. This will eventually be used by the rmapbt and refcountbt
code to calculate appropriate maxlevels values for each. This is
important because the verifiers and the transaction block
reservations depend on accurate estimates of how many blocks are
needed to satisfy a btree split.
We were mistakenly using the max bnobt height for all the btrees,
which creates a dangerous situation since the larger records and
keys in an rmapbt make it very possible that the rmapbt will be
taller than the bnobt and so we can run out of transaction block
reservation.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
|
|
In struct xfs_bmap_free, convert the open-coded free extent list to
a regular list, then use list_sort to sort it prior to processing.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
|
|
Break up xfs_free_extent() into a helper that fixes the freelist.
This helper will be used subsequently to ensure the freelist during
deferred rmap processing.
[darrick: refactor to put this at the head of the patchset]
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
|
|
This is already in xfsprogs' libxfs, so port it to the kernel.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
|
|
Currently we don't check the error_tag when someone's trying to set up
error injection testing. If userspace passes in a value we don't know
about, send back an error. This will help xfstests to _notrun a test
that uses error injection to test things like log replay.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
|
|
Create a second buf_trylock tracepoint so that we can distinguish
between a successful and a failed trylock. With this piece, we can
use a script to look at the ftrace output to detect buffer deadlocks.
[dchinner: update to if/else as per hch's suggestion]
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
|
|
Some of the directory/attr structures contain variable-length objects,
so the enclosing structure doesn't have a meaningful fixed size at
compile time. We can check the offsets of the members before the
variable-length member, so do those.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
|
|
xfs_reserve_blocks() is responsible to update the XFS reserved block
pool count at mount time or based on user request. When the caller
requests to increase the reserve pool, blocks must be allocated from
the global counters such that they are no longer available for
general purpose use. If the requested reserve pool size is too
large, XFS reserves what blocks are available. The implementation
requires looking at the percpu counters and making an educated guess
as to how many blocks to try and allocate from xfs_mod_fdblocks(),
which can return -ENOSPC if the guess was not accurate due to
counters being modified in parallel.
xfs_reserve_blocks() retries the guess in this scenario until the
allocation succeeds or it is determined that there is no space
available in the fs. While not easily reproducible in the current
form, the retry code doesn't actually work correctly if
xfs_mod_fdblocks() actually fails. The problem is that the percpu
calculations use the m_resblks counter to determine how many blocks
to allocate, but unconditionally update m_resblks before the block
allocation has actually succeeded. Therefore, if xfs_mod_fdblocks()
fails, the code jumps to the retry label and uses the already
updated m_resblks value to determine how many blocks to try and
allocate. If the percpu counters previously suggested that the
entire request was available, fdblocks_delta could end up set to 0.
In that case, m_resblks is updated to the requested value, yet no
blocks have been reserved at all.
Refactor xfs_reserve_blocks() to use an explicit loop and make the
code easier to follow. Since we have to drop the spinlock across the
xfs_mod_fdblocks() call, use a delta value for m_resblks as well and
only apply the delta once allocation succeeds.
[dchinner: convert to do {} while() loop]
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
|
|
The filesystem quiesce sequence performs the operations necessary to
drain all background work, push pending transactions through the log
infrastructure and wait on I/O resulting from the final AIL push. We
have had reports of remount,ro hangs in xfs_log_quiesce() ->
xfs_wait_buftarg(), however, and some instrumentation code to detect
transaction commits at this point in the quiesce sequence has inculpated
the eofblocks background scanner as a cause.
While higher level remount code generally prevents user modifications by
the time the filesystem has made it to xfs_log_quiesce(), the background
scanner may still be alive and can perform pending work at any time. If
this occurs between the xfs_log_force() and xfs_wait_buftarg() calls
within xfs_log_quiesce(), this can lead to an indefinite lockup in
xfs_wait_buftarg().
To prevent this problem, cancel the background eofblocks scan worker
during the remount read-only quiesce sequence. This suspends background
trimming when a filesystem is remounted read-only. This is only done in
the remount path because the freeze codepath has already locked out new
transactions by the time the filesystem attempts to quiesce (and thus
waiting on an active work item could deadlock). Kick the eofblocks
worker to pick up where it left off once an fs is remounted back to
read-write.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
|
|
|
|
|
|
Instead punch the whole first, and the use the our zeroing helper
to punch out the edge blocks.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
|
|
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
|
|
We now skip holes in it, so no need to have the caller do it as well.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
|
|
We'll want to use this code for large offsets now that we're
skipping holes and unwritten extents efficiently. Also rename it to
xfs_zero_range to be a bit more descriptive, and tell the caller if
we actually did any zeroing.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
|
|
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
|
|
Note that this removes support for the untested FIEMAP_FLAG_XATTR. It
could be added relatively easily with iomap ops for the attr fork, but
without test coverage I don't feel safe doing this.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
|
|
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
|
|
Convert XFS to use the new iomap based multipage write path. This involves
implementing the ->iomap_begin and ->iomap_end methods, and switching the
buffered file write, page_mkwrite and xfs_iozero paths to the new iomap
helpers.
With this change __xfs_get_blocks will never be used for buffered writes,
and the code handling them can be removed.
Based on earlier code from Dave Chinner.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
|
|
Currently zeroing out blocks and waiting for writeout is a bit of a mess in
truncate. This patch gives it a clear order in preparation for the iomap
path:
(1) we first wait for any direct I/O to complete to prevent any races
for it
(2) we then perform the actual zeroing, and only use the truncate_page
helpers for truncating down. The truncate up case already is
handled by the separate call to xfs_zero_eof.
(3) only then we write back dirty data, as zeroing block may cause
dirty pages when using either xfs_zero_eof or the new iomap
infrastructure.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
|
|
And ensure it works for RT subvolume files an set the block device,
both of which will be needed to be able to use the function in the
buffered write path.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
|
|
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
|
|
When we have a lot of metadata to flush from the AIL, the buffer
list can get very long. The current submission code tries to batch
submission to optimise IO order of the metadata (i.e. ascending
block order) to maximise block layer merging or IO to adjacent
metadata blocks.
Unfortunately, the method used can result in long lock times
occurring as buffers locked early on in the buffer list might not be
dispatched until the end of the IO licst processing. This is because
sorting does not occur util after the buffer list has been processed
and the buffers that are going to be submitted are locked. Hence
when the buffer list is several thousand buffers long, the lock hold
times before IO dispatch can be significant.
To fix this, sort the buffer list before we start trying to lock and
submit buffers. This means we can now submit buffers immediately
after they are locked, allowing merging to occur immediately on the
plug and dispatch to occur as quickly as possible. This means there
is minimal delay between locking the buffer and IO submission
occuring, hence reducing the worst case lock hold times seen during
delayed write buffer IO submission signficantly.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
|
|
And the same for XFS_IOC_THAW. Just because we now have a common
version of the ioctl we still need to provide the old name for it
for anyone using those.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
|
|
Al Viro noticed that xfs_lock_inodes should be static, and
that led to ... a few more.
These are just the easy ones, others require moving functions
higher in source files, so that's not done here to keep
this review simple.
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
|
|
The static checker reports that after commit 8d99fe92fed0 ("xfs: fix
efi/efd error handling to avoid fs shutdown hangs"), the code has been
reworked such that error == -EFSCORRUPTED is not possible in this
codepath.
Remove the spurious error check and just use SHUTDOWN_META_IO_ERROR
unconditionally.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
|
|
Multi-block buffers are logged based on buffer offset in
xfs_trans_log_buf(). xfs_buf_item_log() ultimately walks each mapping in
the buffer and marks the associated range to be logged in the
xfs_buf_log_format bitmap for that mapping. This code is broken,
however, in that it marks the actual buffer offsets of the associated
range in each bitmap rather than shifting to the byte range for that
particular mapping.
For example, on a 4k fsb fs, buffer offset 4096 refers to the first byte
of the second mapping in the buffer. This means byte 0 of the second log
format bitmap should be tagged as dirty. Instead, the current code marks
byte offset 4096 of the second log format bitmap, which is invalid and
potentially out of range of the mapping.
As a result of this, the log item format code invoked at transaction
commit time is not be able to correctly identify what parts of the
buffer to copy into log vectors. This can lead to NULL log vector
pointer dereferences in CIL push context if the item format code was not
able to locate any dirty ranges at all. This crash has been reproduced
on a 4k FSB filesystem using 16k directory blocks where an unlink
operation happened not to log anything in the first block of the
mapping. The logged offsets were all over 4k, marked as such in the
subsequent log format mappings, and thus left the transaction with an
xfs_log_item that is marked DIRTY but without any logged regions.
Further, even when the logged regions are marked correctly in the buffer
log format bitmaps, the format code doesn't copy the correct ranges of
the buffer into the log. This means that any logged region beyond the
first block of a multi-block buffer is subject to corruption after a
crash and log recovery sequence. This is due to a failure to convert the
mapping bm_len field from basic blocks to bytes in the buffer offset
tracking code in xfs_buf_item_format().
Update xfs_buf_item_log() to convert buffer offsets to segment relative
offsets when logging multi-block buffers. This ensures that the modified
regions of a buffer are logged correctly and avoids the aforementioned
crash. Also update xfs_buf_item_format() to correctly track the source
offset into the buffer for the log vector formatting code. This ensures
that the correct data is copied into the log.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs fixes from Al Viro:
"Followups to the parallel lookup work:
- update docs
- restore killability of the places that used to take ->i_mutex
killably now that we have down_write_killable() merged
- Additionally, it turns out that I missed a prerequisite for
security_d_instantiate() stuff - ->getxattr() wasn't the only thing
that could be called before dentry is attached to inode; with smack
we needed the same treatment applied to ->setxattr() as well"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
switch ->setxattr() to passing dentry and inode separately
switch xattr_handler->set() to passing dentry and inode separately
restore killability of old mutex_lock_killable(&inode->i_mutex) users
add down_write_killable_nested()
update D/f/directory-locking
|
|
preparation for similar switch in ->setxattr() (see the next commit for
rationale).
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm
Pull misc DAX updates from Vishal Verma:
"DAX error handling for 4.7
- Until now, dax has been disabled if media errors were found on any
device. This enables the use of DAX in the presence of these
errors by making all sector-aligned zeroing go through the driver.
- The driver (already) has the ability to clear errors on writes that
are sent through the block layer using 'DSMs' defined in ACPI 6.1.
Other misc changes:
- When mounting DAX filesystems, check to make sure the partition is
page aligned. This is a requirement for DAX, and previously, we
allowed such unaligned mounts to succeed, but subsequent
reads/writes would fail.
- Misc/cleanup fixes from Jan that remove unused code from DAX
related to zeroing, writeback, and some size checks"
* tag 'dax-misc-for-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
dax: fix a comment in dax_zero_page_range and dax_truncate_page
dax: for truncate/hole-punch, do zeroing through the driver if possible
dax: export a low-level __dax_zero_page_range helper
dax: use sb_issue_zerout instead of calling dax_clear_sectors
dax: enable dax in the presence of known media errors (badblocks)
dax: fallback from pmd to pte on error
block: Update blkdev_dax_capable() for consistency
xfs: Add alignment check for DAX mount
ext2: Add alignment check for DAX mount
ext4: Add alignment check for DAX mount
block: Add bdev_dax_supported() for dax mount checks
block: Add vfs_msg() interface
dax: Remove redundant inode size checks
dax: Remove pointless writeback from dax_do_io()
dax: Remove zeroing from dax_io()
dax: Remove dead zeroing code from fault handlers
ext2: Avoid DAX zeroing to corrupt data
ext2: Fix block zeroing in ext2_get_blocks() for DAX
dax: Remove complete_unwritten argument
DAX: move RADIX_DAX_ definitions to dax.c
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs
Pull xfs updates from Dave Chinner:
"A pretty average collection of fixes, cleanups and improvements in
this request.
Summary:
- fixes for mount line parsing, sparse warnings, read-only compat
feature remount behaviour
- allow fast path symlink lookups for inline symlinks.
- attribute listing cleanups
- writeback goes direct to bios rather than indirecting through
bufferheads
- transaction allocation cleanup
- optimised kmem_realloc
- added configurable error handling for metadata write errors,
changed default error handling behaviour from "retry forever" to
"retry until unmount then fail"
- fixed several inode cluster writeback lookup vs reclaim race
conditions
- fixed inode cluster writeback checking wrong inode after lookup
- fixed bugs where struct xfs_inode freeing wasn't actually RCU safe
- cleaned up inode reclaim tagging"
* tag 'xfs-for-linus-4.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs: (39 commits)
xfs: fix warning in xfs_finish_page_writeback for non-debug builds
xfs: move reclaim tagging functions
xfs: simplify inode reclaim tagging interfaces
xfs: rename variables in xfs_iflush_cluster for clarity
xfs: xfs_iflush_cluster has range issues
xfs: mark reclaimed inodes invalid earlier
xfs: xfs_inode_free() isn't RCU safe
xfs: optimise xfs_iext_destroy
xfs: skip stale inodes in xfs_iflush_cluster
xfs: fix inode validity check in xfs_iflush_cluster
xfs: xfs_iflush_cluster fails to abort on error
xfs: remove xfs_fs_evict_inode()
xfs: add "fail at unmount" error handling configuration
xfs: add configuration handlers for specific errors
xfs: add configuration of error failure speed
xfs: introduce table-based init for error behaviors
xfs: add configurable error support to metadata buffers
xfs: introduce metadata IO error class
xfs: configurable error behavior via sysfs
xfs: buffer ->bi_end_io function requires irq-safe lock
...
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
blockmask is unused if ASSERTs are disabled.
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
dax_clear_sectors() cannot handle poisoned blocks. These must be
zeroed using the BIO interface instead. Convert ext2 and XFS to use
only sb_issue_zerout().
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
[vishal: Also remove the dax_clear_sectors function entirely]
Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull parallel lookup fixups from Al Viro:
"Fix for xfs parallel readdir (turns out the cxfs exposure was not
enough to catch all problems), and a reversion of btrfs back to
->iterate() until the fs/btrfs/delayed-inode.c gets fixed"
* 'work.lookups' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
xfs: concurrent readdir hangs on data buffer locks
Revert "btrfs: switch to ->iterate_shared()"
|
|
There's a three-process deadlock involving shared/exclusive barriers
and inverted lock orders in the directory readdir implementation.
It's a pre-existing problem with lock ordering, exposed by the
VFS parallelisation code.
process 1 process 2 process 3
--------- --------- ---------
readdir
iolock(shared)
get_leaf_dents
iterate entries
ilock(shared)
map, lock and read buffer
iunlock(shared)
process entries in buffer
.....
readdir
iolock(shared)
get_leaf_dents
iterate entries
ilock(shared)
map, lock buffer
<blocks>
finish ->iterate_shared
file_accessed()
->update_time
start transaction
ilock(excl)
<blocks>
.....
finishes processing buffer
get next buffer
ilock(shared)
<blocks>
And that's the deadlock.
Fix this by dropping the current buffer lock in process 1 before
trying to map the next buffer. This means we keep the lock order of
ilock -> buffer lock intact and hence will allow process 3 to make
progress and drop it's ilock(shared) once it is done.
Reported-by: Xiong Zhou <xzhou@redhat.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
Rearrange the inode tagging functions so that they are higher up in
xfs_cache.c and so there is no need for forward prototypes to be
defined. This is purely code movement, no other change.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
|
|
Inode radix tree tagging for reclaim passes a lot of unnecessary
variables around. Over time the xfs-perag has grown a xfs_mount
backpointer, and an internal agno so we don't need to pass other
variables into the tagging functions to supply this information.
Rework the functions to pass the minimal variable set required
and simplify the internal logic and flow.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
|
|
The cluster inode variable uses unconventional naming - iq - which
makes it hard to distinguish it between the inode passed into the
function - ip - and that is a vector for mistakes to be made.
Rename all the cluster inode variables to use a more conventional
prefixes to reduce potential future confusion (cilist, cilist_size,
cip).
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
|
|
xfs_iflush_cluster() does a gang lookup on the radix tree, meaning
it can find inodes beyond the current cluster if there is sparse
cache population. gang lookups return results in ascending index
order, so stop trying to cluster inodes once the first inode outside
the cluster mask is detected.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
|
|
The last thing we do before using call_rcu() on an xfs_inode to be
freed is mark it as invalid. This means there is a window between
when we know for certain that the inode is going to be freed and
when we do actually mark it as "freed".
This is important in the context of RCU lookups - we can look up the
inode, find that it is valid, and then use it as such not realising
that it is in the final stages of being freed.
As such, mark the inode as being invalid the moment we know it is
going to be reclaimed. This can be done while we still hold the
XFS_ILOCK_EXCL and the flush lock in xfs_inode_reclaim, meaning that
it occurs well before we remove it from the radix tree, and that
the i_flags_lock, the XFS_ILOCK and the inode flush lock all act as
synchronisation points for detecting that an inode is about to go
away.
For defensive purposes, this allows us to add a further check to
xfs_iflush_cluster to ensure we skip inodes that are being freed
after we grab the XFS_ILOCK_SHARED and the flush lock - we know that
if the inode number if valid while we have these locks held we know
that it has not progressed through reclaim to the point where it is
clean and is about to be freed.
[bfoster: fixed __xfs_inode_clear_reclaim() using ip->i_ino after it
had already been zeroed.]
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
|
|
The xfs_inode freed in xfs_inode_free() has multiple allocated
structures attached to it. We free these in xfs_inode_free() before
we mark the inode as invalid, and before we run call_rcu() to queue
the structure for freeing.
Unfortunately, this freeing can race with other accesses that are in
the RCU current grace period that have found the inode in the radix
tree with a valid state. This includes xfs_iflush_cluster(), which
calls xfs_inode_clean(), and that accesses the inode log item on the
xfs_inode.
The log item structure is freed in xfs_inode_free(), so there is the
possibility we can be accessing freed memory in xfs_iflush_cluster()
after validating the xfs_inode structure as being valid for this RCU
context. Hence we can get spuriously incorrect clean state returned
from such checks. This can lead to use thinking the inode is dirty
when it is, in fact, clean, and so incorrectly attaching it to the
buffer for IO and completion processing.
This then leads to use-after-free situations on the xfs_inode itself
if the IO completes after the current RCU grace period expires. The
buffer callbacks will access the xfs_inode and try to do all sorts
of things it shouldn't with freed memory.
IOWs, xfs_iflush_cluster() only works correctly when racing with
inode reclaim if the inode log item is present and correctly stating
the inode is clean. If the inode is being freed, then reclaim has
already made sure the inode is clean, and hence xfs_iflush_cluster
can skip it. However, we are accessing the inode inode under RCU
read lock protection and so also must ensure that all dynamically
allocated memory we reference in this context is not freed until the
RCU grace period expires.
To fix this, move all the potential memory freeing into
xfs_inode_free_callback() so that we are guarantee RCU protected
lookup code will always have the memory structures it needs
available during the RCU grace period that lookup races can occur
in.
Discovered-by: Brain Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
|