path: root/fs/xfs
Age | Commit message | Author | Files | Lines
2016-07-20 | xfs: fix type confusion in xfs_ioc_swapext | Jann Horn | 1 | -0/+11
When calling fdget() in xfs_ioc_swapext(), we need to verify that the file descriptors passed into the ioctl point to XFS inodes before we start operations on them. If we don't do this, we could be referencing arbitrary kernel memory as an XFS inode. This could lead to memory corruption and/or performing locking operations on attacker-chosen structures in kernel memory. [dchinner: rewrite commit message ] [dchinner: add comment explaining new check ] Signed-off-by: Jann Horn <jann@thejh.net> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-06-21 | Merge branch 'xfs-4.8-misc-fixes-2' into for-next | Dave Chinner | 18 | -166/+246
2016-06-21 | xfs: refactor btree maxlevels computation | Darrick J. Wong | 4 | -27/+28
Create a common function to calculate the maximum height of a per-AG btree. This will eventually be used by the rmapbt and refcountbt code to calculate appropriate maxlevels values for each. This is important because the verifiers and the transaction block reservations depend on accurate estimates of how many blocks are needed to satisfy a btree split. We were mistakenly using the max bnobt height for all the btrees, which creates a dangerous situation since the larger records and keys in an rmapbt make it very possible that the rmapbt will be taller than the bnobt and so we can run out of transaction block reservation. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
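The height bound this commit message describes can be sketched in a few lines. This is a minimal userspace illustration, not the kernel function: the name `compute_maxlevels` and the parameters `min_leaf_recs` and `min_node_recs` are hypothetical, and the sketch assumes the worst case in which every block holds only its minimum number of records.

```python
def compute_maxlevels(nrecords, min_leaf_recs, min_node_recs):
    """Worst-case height of a btree that indexes nrecords records.

    Assumes the sparsest legal tree: each leaf holds min_leaf_recs
    records and each interior node holds min_node_recs pointers.
    min_node_recs must be at least 2, or the loop cannot converge.
    """
    if nrecords <= 0:
        return 1
    # Leaf blocks needed when every leaf is minimally full (ceiling division).
    blocks = -(-nrecords // min_leaf_recs)
    level = 1
    # Add interior levels until a single root block remains.
    while blocks > 1:
        blocks = -(-blocks // min_node_recs)
        level += 1
    return level
```

Larger records and keys mean fewer entries per block, so the same number of records yields a taller tree. That is why a single height shared by all btrees can underestimate the blocks needed for a split.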
2016-06-21 | xfs: convert list of extents to free into a regular list | Darrick J. Wong | 5 | -44/+47
In struct xfs_bmap_free, convert the open-coded free extent list to a regular list, then use list_sort to sort it prior to processing. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-06-21 | xfs: separate freelist fixing into a separate helper | Dave Chinner | 2 | -30/+56
Break up xfs_free_extent() into a helper that fixes the freelist. This helper will be used subsequently to ensure the freelist during deferred rmap processing. [darrick: refactor to put this at the head of the patchset] Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-06-21 | xfs: rearrange xfs_bmap_add_free parameters | Darrick J. Wong | 4 | -14/+13
This is already in xfsprogs' libxfs, so port it to the kernel. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-06-21 | xfs: check for a valid error_tag in errortag_add | Darrick J. Wong | 1 | -0/+3
Currently we don't check the error_tag when someone's trying to set up error injection testing. If userspace passes in a value we don't know about, send back an error. This will help xfstests to _notrun a test that uses error injection to test things like log replay. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-06-21 | xfs: enable buffer deadlock postmortem diagnosis via ftrace | Darrick J. Wong | 2 | -3/+6
Create a second buf_trylock tracepoint so that we can distinguish between a successful and a failed trylock. With this piece, we can use a script to look at the ftrace output to detect buffer deadlocks. [dchinner: update to if/else as per hch's suggestion] Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-06-21 | xfs: check offsets of variable length structures | Darrick J. Wong | 1 | -2/+23
Some of the directory/attr structures contain variable-length objects, so the enclosing structure doesn't have a meaningful fixed size at compile time. We can check the offsets of the members before the variable-length member, so do those. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>
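The idea of checking member offsets instead of the total size can be illustrated in userspace with `ctypes`. This is only a sketch: `DirEntry` and `check_offset` are hypothetical names, and the layout is invented for illustration rather than taken from any XFS on-disk structure.

```python
import ctypes


class DirEntry(ctypes.Structure):
    """Hypothetical entry that ends in a variable-length name."""
    _fields_ = [
        ("inumber", ctypes.c_uint64),  # fixed-size member
        ("namelen", ctypes.c_uint8),   # fixed-size member
        ("name", ctypes.c_uint8 * 0),  # variable-length tail, real size unknown
    ]


def check_offset(struct, field, expected):
    """Raise AssertionError if a member is not at the expected byte offset."""
    actual = getattr(struct, field).offset
    if actual != expected:
        raise AssertionError(
            f"{struct.__name__}.{field}: offset {actual}, expected {expected}"
        )


# The total size is not meaningful here, but the offsets of the members
# up to and including the start of the variable-length tail are fixed.
check_offset(DirEntry, "inumber", 0)
check_offset(DirEntry, "namelen", 8)
check_offset(DirEntry, "name", 9)
```

The checks run once at import time, which is the closest userspace analogue to a compile-time assertion.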
2016-06-21 | xfs: refactor xfs_reserve_blocks() to handle ENOSPC correctly | Brian Foster | 1 | -45/+60
xfs_reserve_blocks() is responsible for updating the XFS reserved block pool count at mount time or based on user request. When the caller requests to increase the reserve pool, blocks must be allocated from the global counters such that they are no longer available for general purpose use. If the requested reserve pool size is too large, XFS reserves what blocks are available. The implementation requires looking at the percpu counters and making an educated guess as to how many blocks to try and allocate from xfs_mod_fdblocks(), which can return -ENOSPC if the guess was not accurate due to counters being modified in parallel. xfs_reserve_blocks() retries the guess in this scenario until the allocation succeeds or it is determined that there is no space available in the fs. While not easily reproducible in the current form, the retry code doesn't work correctly if xfs_mod_fdblocks() actually fails. The problem is that the percpu calculations use the m_resblks counter to determine how many blocks to allocate, but unconditionally update m_resblks before the block allocation has actually succeeded. Therefore, if xfs_mod_fdblocks() fails, the code jumps to the retry label and uses the already updated m_resblks value to determine how many blocks to try and allocate. If the percpu counters previously suggested that the entire request was available, fdblocks_delta could end up set to 0. In that case, m_resblks is updated to the requested value, yet no blocks have been reserved at all. Refactor xfs_reserve_blocks() to use an explicit loop and make the code easier to follow. Since we have to drop the spinlock across the xfs_mod_fdblocks() call, use a delta value for m_resblks as well and only apply the delta once allocation succeeds. [dchinner: convert to do {} while() loop] Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
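The pattern of applying the delta only after the allocation succeeds can be modelled in userspace as follows. This is a simplified sketch under stated assumptions, not the kernel code: `Fs`, `take_free_blocks`, `reserve_blocks` and the `read_free_estimate` callback are hypothetical stand-ins for the mount structure, xfs_mod_fdblocks() and the percpu counter read, and locking is not modelled.

```python
class NoSpace(Exception):
    """Stands in for an -ENOSPC return."""


class Fs:
    def __init__(self, free, resblks=0):
        self.free = free        # blocks really available
        self.resblks = resblks  # current reserve pool size

    def take_free_blocks(self, count):
        """Remove count blocks from the free pool or fail without side effects."""
        if count > self.free:
            raise NoSpace()
        self.free -= count


def reserve_blocks(fs, request, read_free_estimate):
    """Grow the reserve pool towards request; returns the resulting pool size.

    read_free_estimate() may return a stale value, so the allocation can fail.
    The sketch retries for as long as that happens.
    """
    while True:
        # Always computed from the unmodified reserve counter.
        wanted = request - fs.resblks
        delta = min(wanted, read_free_estimate())
        if delta <= 0:
            return fs.resblks
        try:
            fs.take_free_blocks(delta)
        except NoSpace:
            continue  # the estimate was stale: guess again
        # Only now, with the blocks in hand, is the reserve counter updated.
        fs.resblks += delta
        return fs.resblks
```

With a first estimate of 100 free blocks when only 40 exist, the first attempt fails, the reserve counter stays untouched, and the retry reserves the 40 blocks that are really there.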
2016-06-21 | xfs: cancel eofblocks background trimming on remount read-only | Brian Foster | 3 | -1/+10
The filesystem quiesce sequence performs the operations necessary to drain all background work, push pending transactions through the log infrastructure and wait on I/O resulting from the final AIL push. We have had reports of remount,ro hangs in xfs_log_quiesce() -> xfs_wait_buftarg(), however, and some instrumentation code to detect transaction commits at this point in the quiesce sequence has inculpated the eofblocks background scanner as a cause. While higher level remount code generally prevents user modifications by the time the filesystem has made it to xfs_log_quiesce(), the background scanner may still be alive and can perform pending work at any time. If this occurs between the xfs_log_force() and xfs_wait_buftarg() calls within xfs_log_quiesce(), this can lead to an indefinite lockup in xfs_wait_buftarg(). To prevent this problem, cancel the background eofblocks scan worker during the remount read-only quiesce sequence. This suspends background trimming when a filesystem is remounted read-only. This is only done in the remount path because the freeze codepath has already locked out new transactions by the time the filesystem attempts to quiesce (and thus waiting on an active work item could deadlock). Kick the eofblocks worker to pick up where it left off once an fs is remounted back to read-write. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-06-21 | Merge branch 'xfs-4.8-iomap-write' into for-next | Dave Chinner | 10 | -776/+367
2016-06-21 | Merge branch 'fs-4.8-iomap-infrastructure' into for-next | Dave Chinner | 1 | -0/+1
2016-06-21 | xfs: kill xfs_zero_remaining_bytes | Christoph Hellwig | 1 | -119/+14
Instead, punch the hole first, and then use our zeroing helper to punch out the edge blocks. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-06-21 | xfs: split xfs_free_file_space in manageable pieces | Christoph Hellwig | 1 | -115/+137
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-06-21 | xfs: use xfs_zero_range in xfs_zero_eof | Christoph Hellwig | 1 | -127/+1
We now skip holes in it, so no need to have the caller do it as well. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-06-21 | xfs: handle 64-bit length in xfs_iozero | Christoph Hellwig | 2 | -6/+8
We'll want to use this code for large offsets now that we're skipping holes and unwritten extents efficiently. Also rename it to xfs_zero_range to be a bit more descriptive, and tell the caller if we actually did any zeroing. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-06-21 | xfs: use iomap infrastructure for DAX zeroing | Christoph Hellwig | 2 | -41/+3
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-06-21 | xfs: use iomap fiemap implementation | Christoph Hellwig | 1 | -75/+5
Note that this removes support for the untested FIEMAP_FLAG_XATTR. It could be added relatively easily with iomap ops for the attr fork, but without test coverage I don't feel safe doing this. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-06-21 | xfs: remove buffered write support from __xfs_get_blocks | Christoph Hellwig | 1 | -52/+19
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-06-21 | xfs: implement iomap based buffered write path | Christoph Hellwig | 7 | -258/+187
Convert XFS to use the new iomap based multipage write path. This involves implementing the ->iomap_begin and ->iomap_end methods, and switching the buffered file write, page_mkwrite and xfs_iozero paths to the new iomap helpers. With this change __xfs_get_blocks will never be used for buffered writes, and the code handling them can be removed. Based on earlier code from Dave Chinner. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-06-21 | xfs: reorder zeroing and flushing sequence in truncate | Christoph Hellwig | 1 | -14/+19
Currently zeroing out blocks and waiting for writeout is a bit of a mess in truncate. This patch gives it a clear order in preparation for the iomap path: (1) we first wait for any direct I/O to complete to prevent any races with it (2) we then perform the actual zeroing, and only use the truncate_page helpers for truncating down. The truncate up case already is handled by the separate call to xfs_zero_eof. (3) only then do we write back dirty data, as zeroing blocks may cause dirty pages when using either xfs_zero_eof or the new iomap infrastructure. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-06-21 | xfs: make xfs_bmbt_to_iomap available outside of xfs_pnfs.c | Christoph Hellwig | 3 | -26/+31
And ensure it works for RT subvolume files and sets the block device, both of which will be needed to be able to use the function in the buffered write path. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-06-21 | fs: move struct iomap from exportfs.h to a separate header | Christoph Hellwig | 1 | -0/+1
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-06-01 | xfs: reduce lock hold times in buffer writeback | Dave Chinner | 1 | -25/+35
When we have a lot of metadata to flush from the AIL, the buffer list can get very long. The current submission code tries to batch submission to optimise IO order of the metadata (i.e. ascending block order) to maximise block layer merging of IO to adjacent metadata blocks. Unfortunately, the method used can result in long lock times occurring as buffers locked early on in the buffer list might not be dispatched until the end of the IO list processing. This is because sorting does not occur until after the buffer list has been processed and the buffers that are going to be submitted are locked. Hence when the buffer list is several thousand buffers long, the lock hold times before IO dispatch can be significant. To fix this, sort the buffer list before we start trying to lock and submit buffers. This means we can now submit buffers immediately after they are locked, allowing merging to occur immediately on the plug and dispatch to occur as quickly as possible. This means there is minimal delay between locking the buffer and IO submission occurring, hence reducing the worst case lock hold times seen during delayed write buffer IO submission significantly. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-06-01 | xfs: define XFS_IOC_FREEZE even if FIFREEZE is defined | Christoph Hellwig | 1 | -6/+2
And the same for XFS_IOC_THAW. Just because we now have a common version of the ioctl we still need to provide the old name for it for anyone using those. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-06-01 | xfs: make several functions static | Eric Sandeen | 15 | -50/+14
Al Viro noticed that xfs_lock_inodes should be static, and that led to ... a few more. These are just the easy ones, others require moving functions higher in source files, so that's not done here to keep this review simple. Signed-off-by: Eric Sandeen <sandeen@sandeen.net> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-06-01 | xfs: remove spurious shutdown type check from xfs_bmap_finish() | Brian Foster | 1 | -3/+1
The static checker reports that after commit 8d99fe92fed0 ("xfs: fix efi/efd error handling to avoid fs shutdown hangs"), the code has been reworked such that error == -EFSCORRUPTED is not possible in this codepath. Remove the spurious error check and just use SHUTDOWN_META_IO_ERROR unconditionally. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-06-01 | xfs: fix broken multi-fsb buffer logging | Brian Foster | 1 | -5/+13
Multi-block buffers are logged based on buffer offset in xfs_trans_log_buf(). xfs_buf_item_log() ultimately walks each mapping in the buffer and marks the associated range to be logged in the xfs_buf_log_format bitmap for that mapping. This code is broken, however, in that it marks the actual buffer offsets of the associated range in each bitmap rather than shifting to the byte range for that particular mapping. For example, on a 4k fsb fs, buffer offset 4096 refers to the first byte of the second mapping in the buffer. This means byte 0 of the second log format bitmap should be tagged as dirty. Instead, the current code marks byte offset 4096 of the second log format bitmap, which is invalid and potentially out of range of the mapping. As a result of this, the log item format code invoked at transaction commit time is not able to correctly identify what parts of the buffer to copy into log vectors. This can lead to NULL log vector pointer dereferences in CIL push context if the item format code was not able to locate any dirty ranges at all. This crash has been reproduced on a 4k FSB filesystem using 16k directory blocks where an unlink operation happened not to log anything in the first block of the mapping. The logged offsets were all over 4k, marked as such in the subsequent log format mappings, and thus left the transaction with an xfs_log_item that is marked DIRTY but without any logged regions. Further, even when the logged regions are marked correctly in the buffer log format bitmaps, the format code doesn't copy the correct ranges of the buffer into the log. This means that any logged region beyond the first block of a multi-block buffer is subject to corruption after a crash and log recovery sequence. This is due to a failure to convert the mapping bm_len field from basic blocks to bytes in the buffer offset tracking code in xfs_buf_item_format().
Update xfs_buf_item_log() to convert buffer offsets to segment relative offsets when logging multi-block buffers. This ensures that the modified regions of a buffer are logged correctly and avoids the aforementioned crash. Also update xfs_buf_item_format() to correctly track the source offset into the buffer for the log vector formatting code. This ensures that the correct data is copied into the log. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
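The offset conversion described in this commit message can be sketched as a small userspace function. It is an illustration only: `split_logged_range` and its parameters are hypothetical names, and the only value taken from XFS is the 512-byte basic block in which mapping lengths are expressed.

```python
BBSIZE = 512  # bytes per basic block, the unit of each mapping's length


def split_logged_range(map_lengths_bb, first, last):
    """Split the buffer byte range [first, last] across the buffer's mappings.

    map_lengths_bb lists each mapping's length in basic blocks.
    Returns (mapping_index, rel_first, rel_last) tuples, where the
    offsets are relative to the start of that mapping.
    """
    ranges = []
    start = 0  # byte offset of the current mapping within the buffer
    for index, length_bb in enumerate(map_lengths_bb):
        length = length_bb * BBSIZE  # convert basic blocks to bytes
        end = start + length - 1
        if first <= end and last >= start:
            rel_first = max(first, start) - start
            rel_last = min(last, end) - start
            ranges.append((index, rel_first, rel_last))
        start += length
    return ranges
```

For a 16k directory block on a 4k block filesystem (four mappings of eight basic blocks each), `split_logged_range([8, 8, 8, 8], 4096, 4099)` returns `[(1, 0, 3)]`: buffer offset 4096 becomes byte 0 of the second mapping, which matches the example in the commit message.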
2016-05-28 | Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs | Linus Torvalds | 1 | -4/+5
Pull vfs fixes from Al Viro:
"Followups to the parallel lookup work:
- update docs
- restore killability of the places that used to take ->i_mutex killably now that we have down_write_killable() merged
- Additionally, it turns out that I missed a prerequisite for security_d_instantiate() stuff - ->getxattr() wasn't the only thing that could be called before dentry is attached to inode; with smack we needed the same treatment applied to ->setxattr() as well"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  switch ->setxattr() to passing dentry and inode separately
  switch xattr_handler->set() to passing dentry and inode separately
  restore killability of old mutex_lock_killable(&inode->i_mutex) users
  add down_write_killable_nested()
  update D/f/directory-locking
2016-05-27 | switch xattr_handler->set() to passing dentry and inode separately | Al Viro | 1 | -4/+5
preparation for similar switch in ->setxattr() (see the next commit for rationale). Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-27 | Merge tag 'dax-misc-for-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm | Linus Torvalds | 3 | -22/+12
Pull misc DAX updates from Vishal Verma:
"DAX error handling for 4.7
- Until now, dax has been disabled if media errors were found on any device. This enables the use of DAX in the presence of these errors by making all sector-aligned zeroing go through the driver.
- The driver (already) has the ability to clear errors on writes that are sent through the block layer using 'DSMs' defined in ACPI 6.1.
Other misc changes:
- When mounting DAX filesystems, check to make sure the partition is page aligned. This is a requirement for DAX, and previously, we allowed such unaligned mounts to succeed, but subsequent reads/writes would fail.
- Misc/cleanup fixes from Jan that remove unused code from DAX related to zeroing, writeback, and some size checks"
* tag 'dax-misc-for-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
  dax: fix a comment in dax_zero_page_range and dax_truncate_page
  dax: for truncate/hole-punch, do zeroing through the driver if possible
  dax: export a low-level __dax_zero_page_range helper
  dax: use sb_issue_zerout instead of calling dax_clear_sectors
  dax: enable dax in the presence of known media errors (badblocks)
  dax: fallback from pmd to pte on error
  block: Update blkdev_dax_capable() for consistency
  xfs: Add alignment check for DAX mount
  ext2: Add alignment check for DAX mount
  ext4: Add alignment check for DAX mount
  block: Add bdev_dax_supported() for dax mount checks
  block: Add vfs_msg() interface
  dax: Remove redundant inode size checks
  dax: Remove pointless writeback from dax_do_io()
  dax: Remove zeroing from dax_io()
  dax: Remove dead zeroing code from fault handlers
  ext2: Avoid DAX zeroing to corrupt data
  ext2: Fix block zeroing in ext2_get_blocks() for DAX
  dax: Remove complete_unwritten argument
  DAX: move RADIX_DAX_ definitions to dax.c
2016-05-26 | Merge tag 'xfs-for-linus-4.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs | Linus Torvalds | 48 | -1134/+1293
Pull xfs updates from Dave Chinner:
"A pretty average collection of fixes, cleanups and improvements in this request.
Summary:
- fixes for mount line parsing, sparse warnings, read-only compat feature remount behaviour
- allow fast path symlink lookups for inline symlinks.
- attribute listing cleanups
- writeback goes direct to bios rather than indirecting through bufferheads
- transaction allocation cleanup
- optimised kmem_realloc
- added configurable error handling for metadata write errors, changed default error handling behaviour from "retry forever" to "retry until unmount then fail"
- fixed several inode cluster writeback lookup vs reclaim race conditions
- fixed inode cluster writeback checking wrong inode after lookup
- fixed bugs where struct xfs_inode freeing wasn't actually RCU safe
- cleaned up inode reclaim tagging"
* tag 'xfs-for-linus-4.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs: (39 commits)
  xfs: fix warning in xfs_finish_page_writeback for non-debug builds
  xfs: move reclaim tagging functions
  xfs: simplify inode reclaim tagging interfaces
  xfs: rename variables in xfs_iflush_cluster for clarity
  xfs: xfs_iflush_cluster has range issues
  xfs: mark reclaimed inodes invalid earlier
  xfs: xfs_inode_free() isn't RCU safe
  xfs: optimise xfs_iext_destroy
  xfs: skip stale inodes in xfs_iflush_cluster
  xfs: fix inode validity check in xfs_iflush_cluster
  xfs: xfs_iflush_cluster fails to abort on error
  xfs: remove xfs_fs_evict_inode()
  xfs: add "fail at unmount" error handling configuration
  xfs: add configuration handlers for specific errors
  xfs: add configuration of error failure speed
  xfs: introduce table-based init for error behaviors
  xfs: add configurable error support to metadata buffers
  xfs: introduce metadata IO error class
  xfs: configurable error behavior via sysfs
  xfs: buffer ->bi_end_io function requires irq-safe lock
  ...
2016-05-20 | Merge branch 'xfs-4.7-inode-reclaim' into for-next | Dave Chinner | 4 | -199/+250
2016-05-20 | Merge branch 'xfs-4.7-error-cfg' into for-next | Dave Chinner | 8 | -54/+450
2016-05-20 | Merge branch 'xfs-4.7-misc-fixes' into for-next | Dave Chinner | 10 | -29/+32
2016-05-20 | Merge branch 'xfs-4.7-cleanup-attr-listent' into for-next | Dave Chinner | 3 | -67/+39
2016-05-20 | Merge branch 'xfs-4.7-optimise-inline-symlinks' into for-next | Dave Chinner | 9 | -84/+125
2016-05-20 | Merge branch 'xfs-4.7-trans-type-cleanup' into for-next | Dave Chinner | 28 | -508/+200
2016-05-20 | Merge branch 'xfs-4.7-writeback-bio' into for-next | Dave Chinner | 3 | -191/+184
2016-05-20 | xfs: fix warning in xfs_finish_page_writeback for non-debug builds | Christoph Hellwig | 1 | -3/+2
blockmask is unused if ASSERTs are disabled. Reported-by: kbuild test robot <fengguang.wu@intel.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2016-05-18 | dax: use sb_issue_zerout instead of calling dax_clear_sectors | Matthew Wilcox | 1 | -11/+4
dax_clear_sectors() cannot handle poisoned blocks. These must be zeroed using the BIO interface instead. Convert ext2 and XFS to use only sb_issue_zeroout(). Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com> [vishal: Also remove the dax_clear_sectors function entirely] Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
2016-05-18 | Merge branch 'work.lookups' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs | Linus Torvalds | 1 | -9/+14
Pull parallel lookup fixups from Al Viro:
"Fix for xfs parallel readdir (turns out the cxfs exposure was not enough to catch all problems), and a reversion of btrfs back to ->iterate() until the fs/btrfs/delayed-inode.c gets fixed"
* 'work.lookups' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  xfs: concurrent readdir hangs on data buffer locks
  Revert "btrfs: switch to ->iterate_shared()"
2016-05-18 | xfs: concurrent readdir hangs on data buffer locks | Dave Chinner | 1 | -9/+14
There's a three-process deadlock involving shared/exclusive barriers and inverted lock orders in the directory readdir implementation. It's a pre-existing problem with lock ordering, exposed by the VFS parallelisation code.

process 1                      process 2                      process 3
---------                      ---------                      ---------
readdir
iolock(shared)
  get_leaf_dents
    iterate entries
      ilock(shared)
      map, lock and read buffer
      iunlock(shared)
      process entries in buffer
      .....
                               readdir
                               iolock(shared)
                                 get_leaf_dents
                                   iterate entries
                                     ilock(shared)
                                     map, lock buffer
                                     <blocks>
                                                              finish ->iterate_shared
                                                              file_accessed()
                                                                ->update_time
                                                                  start transaction
                                                                  ilock(excl)
                                                                  <blocks>
      .....
      finishes processing buffer
      get next buffer
        ilock(shared)
        <blocks>

And that's the deadlock. Fix this by dropping the current buffer lock in process 1 before trying to map the next buffer. This means we keep the lock order of ilock -> buffer lock intact and hence will allow process 3 to make progress and drop its ilock(shared) once it is done. Reported-by: Xiong Zhou <xzhou@redhat.com> Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-18 | xfs: move reclaim tagging functions | Dave Chinner | 1 | -118/+116
Rearrange the inode tagging functions so that they are higher up in xfs_icache.c and so there is no need for forward prototypes to be defined. This is purely code movement, no other change. Signed-off-by: Dave Chinner <dchinner@redhat.com>
2016-05-18 | xfs: simplify inode reclaim tagging interfaces | Dave Chinner | 1 | -48/+50
Inode radix tree tagging for reclaim passes a lot of unnecessary variables around. Over time the xfs_perag has grown an xfs_mount backpointer, and an internal agno so we don't need to pass other variables into the tagging functions to supply this information. Rework the functions to pass the minimal variable set required and simplify the internal logic and flow. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-05-18 | xfs: rename variables in xfs_iflush_cluster for clarity | Dave Chinner | 1 | -37/+37
The cluster inode variable uses unconventional naming - iq - which makes it hard to distinguish from the inode passed into the function - ip - and that is a vector for mistakes to be made. Rename all the cluster inode variables to use more conventional prefixes to reduce potential future confusion (cilist, cilist_size, cip). Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-05-18 | xfs: xfs_iflush_cluster has range issues | Dave Chinner | 1 | -2/+11
xfs_iflush_cluster() does a gang lookup on the radix tree, meaning it can find inodes beyond the current cluster if there is sparse cache population. Gang lookups return results in ascending index order, so stop trying to cluster inodes once the first inode outside the cluster mask is detected. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>
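The early exit this commit message describes relies on the lookup results being sorted. A minimal userspace sketch, assuming a power-of-two cluster size and using the hypothetical names `inodes_in_cluster`, `sorted_inos` and `inodes_per_cluster` (none of which come from the kernel source):

```python
def inodes_in_cluster(sorted_inos, ino, inodes_per_cluster):
    """Return the cached inode numbers that share a cluster with ino.

    sorted_inos must be in ascending order, as a gang lookup would return
    them. inodes_per_cluster must be a power of two.
    """
    mask = ~(inodes_per_cluster - 1)
    first = ino & mask  # first inode number of the cluster
    found = []
    for candidate in sorted_inos:
        if candidate < first:
            continue  # before the cluster: keep scanning
        if (candidate & mask) != first:
            break  # past the cluster: later results are further away still
        found.append(candidate)
    return found
```

With a sparsely populated cache such as `[3, 64, 65, 70, 130, 200]` and 64 inodes per cluster, a lookup for inode 66 stops at 130 and returns `[64, 65, 70]`.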
2016-05-18 | xfs: mark reclaimed inodes invalid earlier | Dave Chinner | 2 | -12/+47
The last thing we do before using call_rcu() on an xfs_inode to be freed is mark it as invalid. This means there is a window between when we know for certain that the inode is going to be freed and when we do actually mark it as "freed". This is important in the context of RCU lookups - we can look up the inode, find that it is valid, and then use it as such not realising that it is in the final stages of being freed. As such, mark the inode as being invalid the moment we know it is going to be reclaimed. This can be done while we still hold the XFS_ILOCK_EXCL and the flush lock in xfs_inode_reclaim, meaning that it occurs well before we remove it from the radix tree, and that the i_flags_lock, the XFS_ILOCK and the inode flush lock all act as synchronisation points for detecting that an inode is about to go away. For defensive purposes, this allows us to add a further check to xfs_iflush_cluster to ensure we skip inodes that are being freed after we grab the XFS_ILOCK_SHARED and the flush lock - we know that if the inode number is valid while we have these locks held, it has not progressed through reclaim to the point where it is clean and is about to be freed. [bfoster: fixed __xfs_inode_clear_reclaim() using ip->i_ino after it had already been zeroed.] Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-05-18 | xfs: xfs_inode_free() isn't RCU safe | Dave Chinner | 1 | -7/+7
The xfs_inode freed in xfs_inode_free() has multiple allocated structures attached to it. We free these in xfs_inode_free() before we mark the inode as invalid, and before we run call_rcu() to queue the structure for freeing. Unfortunately, this freeing can race with other accesses that are in the RCU current grace period that have found the inode in the radix tree with a valid state. This includes xfs_iflush_cluster(), which calls xfs_inode_clean(), and that accesses the inode log item on the xfs_inode. The log item structure is freed in xfs_inode_free(), so there is the possibility we can be accessing freed memory in xfs_iflush_cluster() after validating the xfs_inode structure as being valid for this RCU context. Hence we can get spuriously incorrect clean state returned from such checks. This can lead to us thinking the inode is dirty when it is, in fact, clean, and so incorrectly attaching it to the buffer for IO and completion processing. This then leads to use-after-free situations on the xfs_inode itself if the IO completes after the current RCU grace period expires. The buffer callbacks will access the xfs_inode and try to do all sorts of things it shouldn't with freed memory. IOWs, xfs_iflush_cluster() only works correctly when racing with inode reclaim if the inode log item is present and correctly stating the inode is clean. If the inode is being freed, then reclaim has already made sure the inode is clean, and hence xfs_iflush_cluster can skip it. However, we are accessing the inode under RCU read lock protection and so also must ensure that all dynamically allocated memory we reference in this context is not freed until the RCU grace period expires. To fix this, move all the potential memory freeing into xfs_inode_free_callback() so that we are guaranteed that RCU protected lookup code will always have the memory structures it needs available during the RCU grace period that lookup races can occur in.
Discovered-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>