path: root/fs/xfs/xfs_mount.h
Age  Commit message  Author  Files  Lines
2016-05-26  Merge tag 'xfs-for-linus-4.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs  Linus Torvalds  1  -0/+34
Pull xfs updates from Dave Chinner:
 "A pretty average collection of fixes, cleanups and improvements in this request.

  Summary:
   - fixes for mount line parsing, sparse warnings, read-only compat feature remount behaviour
   - allow fast path symlink lookups for inline symlinks.
   - attribute listing cleanups
   - writeback goes direct to bios rather than indirecting through bufferheads
   - transaction allocation cleanup
   - optimised kmem_realloc
   - added configurable error handling for metadata write errors, changed default error handling behaviour from "retry forever" to "retry until unmount then fail"
   - fixed several inode cluster writeback lookup vs reclaim race conditions
   - fixed inode cluster writeback checking wrong inode after lookup
   - fixed bugs where struct xfs_inode freeing wasn't actually RCU safe
   - cleaned up inode reclaim tagging"

* tag 'xfs-for-linus-4.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs: (39 commits)
  xfs: fix warning in xfs_finish_page_writeback for non-debug builds
  xfs: move reclaim tagging functions
  xfs: simplify inode reclaim tagging interfaces
  xfs: rename variables in xfs_iflush_cluster for clarity
  xfs: xfs_iflush_cluster has range issues
  xfs: mark reclaimed inodes invalid earlier
  xfs: xfs_inode_free() isn't RCU safe
  xfs: optimise xfs_iext_destroy
  xfs: skip stale inodes in xfs_iflush_cluster
  xfs: fix inode validity check in xfs_iflush_cluster
  xfs: xfs_iflush_cluster fails to abort on error
  xfs: remove xfs_fs_evict_inode()
  xfs: add "fail at unmount" error handling configuration
  xfs: add configuration handlers for specific errors
  xfs: add configuration of error failure speed
  xfs: introduce table-based init for error behaviors
  xfs: add configurable error support to metadata buffers
  xfs: introduce metadata IO error class
  xfs: configurable error behavior via sysfs
  xfs: buffer ->bi_end_io function requires irq-safe lock
  ...
2016-05-18xfs: add "fail at unmount" error handling configurationCarlos Maiolino1-0/+2
If we take "retry forever" literally on metadata IO errors, we can hang at unmount, once it retries those writes forever. This is the default behavior, unfortunately. Add an error configuration option for this behavior and default it to "fail" so that an unmount will trigger actuall errors, a shutdown and allow the unmount to succeed. It will be noisy, though, as it will log the errors and shutdown that occurs. To fix this, we need to mark the filesystem as being in the process of unmounting. Do this with a mount flag that is added at the appropriate time (i.e. before the blocking AIL sync). We also need to add this flag if mount fails after the initial phase of log recovery has been run. Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-05-18  xfs: add configuration handlers for specific errors  Carlos Maiolino  1  -0/+3
Now that most of the infrastructure is in place, we can start adding support for configuring specific errors such as ENODEV, ENOSPC, EIO, etc. Add these error configurations and configure them all to have appropriate behaviours. That is, all will be configured to retry forever by default, except for ENODEV, which is an unrecoverable error, so it will be configured not to retry on error.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
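As a rough illustration of what such a table of per-error defaults could look like, here is a sketch in C; the struct and array names are invented, and only the "retry forever except ENODEV" policy comes from the commit message above.

#include <linux/errno.h>

/* Illustrative only: not the real XFS error configuration structures. */
struct err_cfg_default {
	int error;		/* errno this entry applies to */
	int max_retries;	/* -1 means retry forever */
};

static const struct err_cfg_default metadata_err_defaults[] = {
	{ .error = EIO,    .max_retries = -1 },	/* retry forever */
	{ .error = ENOSPC, .max_retries = -1 },	/* retry forever */
	{ .error = ENODEV, .max_retries =  0 },	/* unrecoverable: don't retry */
};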
2016-05-18  xfs: add configuration of error failure speed  Carlos Maiolino  1  -0/+3
On reception of an error, we can fail immediately, perform a bounded number of retries, or retry indefinitely. The current behaviour is to retry forever. However, we'd like the ability to choose how long the filesystem should retry after an error: it can either fail immediately, retry a few times, or retry forever. This is implemented with a max_retries sysfs attribute that holds the number of times we allow the filesystem to retry after an error, with -1 as a special case where the filesystem will retry indefinitely.

Add both a maximum retry count and a retry timeout so that we can bound the retries by time and/or physical IO attempts. Finally, plumb these into xfs_buf_iodone error processing so that the error behaviour follows the selected configuration.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
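A minimal sketch of the retry decision those two bounds imply, assuming a hypothetical configuration struct; only the -1 == "retry forever" convention, the retry count and the timeout come from the commit message, and the real logic lives in XFS's buffer I/O completion path.

#include <linux/jiffies.h>
#include <linux/types.h>

struct io_err_cfg {			/* illustrative, not the XFS struct */
	int		max_retries;	/* -1: unlimited retries */
	unsigned long	retry_timeout;	/* in jiffies; 0: no time limit */
};

/* Decide whether a failed async metadata write should be retried. */
static bool should_retry(const struct io_err_cfg *cfg, int retries_so_far,
			 unsigned long first_error_time)
{
	if (cfg->max_retries >= 0 && retries_so_far >= cfg->max_retries)
		return false;		/* retry budget exhausted */
	if (cfg->retry_timeout &&
	    time_after(jiffies, first_error_time + cfg->retry_timeout))
		return false;		/* retried for too long */
	return true;			/* keep going (forever if unbounded) */
}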
2016-05-18  xfs: add configurable error support to metadata buffers  Carlos Maiolino  1  -0/+3
With the error configuration handle for async metadata write errors in place, we can now add initial support to the IO error processing in xfs_buf_iodone_error(). Add an infrastructure function to look up the configuration handle, and rearrange the error handling to prepare the way for different error handling configurations to be used.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-05-18  xfs: introduce metadata IO error class  Carlos Maiolino  1  -0/+3
Now that we have the basic infrastructure, add the first error class so we can build it up in a meaningful way. Add the metadata async write IO error class and sysfs entry, and introduce a default configuration that matches the existing "retry forever" behavior for async write metadata buffers.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-05-18  xfs: configurable error behavior via sysfs  Carlos Maiolino  1  -0/+20
We need to be able to change the way XFS behaves in error conditions depending on the type of underlying storage. This is necessary for handling non-traditional block devices with extended error cases, such as thin provisioned devices that can return ENOSPC as an IO error.

Introduce the basic sysfs infrastructure needed to define and configure error behaviours. This is made generic enough to extend to configuring behaviour in other error conditions, such as ENOMEM, which also has different desired behaviours according to machine configuration.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
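For illustration, the generic kobject attribute pattern such sysfs knobs are built from; XFS uses its own wrappers in xfs_sysfs.c, and the "max_retries" attribute name and backing variable below are invented.

#include <linux/kernel.h>
#include <linux/kobject.h>
#include <linux/sysfs.h>

static int example_max_retries = -1;	/* -1: retry forever */

static ssize_t max_retries_show(struct kobject *kobj,
				struct kobj_attribute *attr, char *buf)
{
	return sprintf(buf, "%d\n", example_max_retries);
}

static ssize_t max_retries_store(struct kobject *kobj,
				 struct kobj_attribute *attr,
				 const char *buf, size_t count)
{
	int ret, val;

	ret = kstrtoint(buf, 0, &val);
	if (ret)
		return ret;
	example_max_retries = val;
	return count;
}

static struct kobj_attribute max_retries_attr =
	__ATTR(max_retries, 0644, max_retries_show, max_retries_store);

Exposing the knob is then a single sysfs_create_file(kobj, &max_retries_attr.attr) call against the error-class kobject.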
2016-04-04  mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros  Kirill A. Shutemov  1  -2/+2
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced a *long* time ago with the promise that one day it would be possible to implement the page cache with bigger chunks than PAGE_SIZE. This promise never materialized, and it is unlikely it ever will. We have many places where PAGE_CACHE_SIZE is assumed to be equal to PAGE_SIZE, and it's a constant source of confusion whether PAGE_CACHE_* or PAGE_* constants should be used in a particular case, especially on the border between fs and mm.

Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause too much breakage to be doable.

Let's stop pretending that pages in the page cache are special. They are not.

The changes are pretty straightforward:
 - <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
 - <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
 - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
 - page_cache_get() -> get_page();
 - page_cache_release() -> put_page();

This patch contains automated changes generated with coccinelle using the script below. For some reason, coccinelle doesn't patch header files; spatch was called for them manually. The only adjustment after coccinelle is a revert of the changes to the PAGE_CACHE_ALIGN definition: we are going to drop it later.

There are a few places in the code that coccinelle didn't reach. They will be fixed manually in a separate patch. Comments and documentation will also be addressed in a separate patch.

virtual patch

@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E

@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E

@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT

@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE

@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK

@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)

@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)

@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-15  Merge branch 'xfs-misc-fixes-4.6-4' into for-next  Dave Chinner  1  -0/+25
2016-03-15  xfs: debug mode forced buffered write failure  Brian Foster  1  -0/+25
Add a DEBUG mode-only sysfs knob to enable forced buffered write failure. An additional side effect of this mode is brute force killing of delayed allocation blocks in the range of the write. The latter is the prime motivation behind this patch, as userspace test infrastructure requires a reliable mechanism to create and split delalloc extents without causing extent conversion.

Certain fallocate operations (i.e., zero range) were used for this in the past, but the implementations have changed such that delalloc extents are flushed and converted to real blocks, rendering the test useless.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-03-07  Merge branch 'xfs-misc-fixes-4.6-2' into for-next  Dave Chinner  1  -3/+2
2016-03-02  xfs: fix up inode32/64 (re)mount handling  Eric Sandeen  1  -3/+2
inode32/inode64 allocator behavior with respect to mount, remount and growfs is a little tricky. The inode32 mount option should only enable the inode32 allocator heuristics if the filesystem is large enough for 64-bit inodes to exist. Today, it has this behavior on the initial mount, but a remount with inode32 unconditionally changes the allocation heuristics, even for a small fs. Also, an inode32 mounted small filesystem should transition to the inode32 allocator if the filesystem is subsequently grown to a sufficient size. Today that does not happen.

This patch consolidates xfs_set_inode32 and xfs_set_inode64 into a single new function, and moves the "is the maximum inode number big enough to matter" test into that function, so it doesn't rely on the caller to get it right - which remount previously did not do.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-02-08  xfs: remove unused function definitions  Eric Sandeen  1  -1/+0
Old leftovers.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-11-03  Merge branch 'xfs-dax-updates' into for-next  Dave Chinner  1  -0/+3
2015-11-03  Merge branch 'xfs-misc-fixes-for-4.4-2' into for-next  Dave Chinner  1  -0/+1
2015-11-03  xfs: don't leak uuid table on rmmod  Darrick J. Wong  1  -0/+1
Don't leak the UUID table when the module is unloaded. (Found with kmemleak.)

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-11-03  xfs: introduce BMAPI_ZERO for allocating zeroed extents  Dave Chinner  1  -0/+3
To enable DAX to do atomic allocation of zeroed extents, we need to drive the block zeroing deep into the allocator. Because xfs_bmapi_write() can return merged extents on allocation that were only partially allocated (i.e. the requested range spans allocated and hole regions, and allocation into the hole was contiguous), we cannot zero the extent returned from xfs_bmapi_write(), as that can overwrite existing data with zeros. Hence we have to drive the extent zeroing into the allocation code, prior to where we merge the extents into the BMBT and return the resultant map. This means we need to propagate this need down to xfs_alloc_vextent() and issue the block zeroing at that point.

While this functionality is being introduced for DAX, there is no reason why it is specific to DAX - we can pre-zero blocks during the allocation transaction on any type of device. It's just slow (and usually slower than unwritten allocation and conversion) on traditional block devices, so it doesn't tend to get used. We can, however, hook hardware zeroing optimisations via sb_issue_zeroout() into this operation, so it may be useful in the future, and hence the "allocate zeroed blocks" API needs to be implementation neutral.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
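A sketch of the hardware-assisted zeroing hook mentioned above; sb_issue_zeroout() is the real block-layer helper, while the wrapper function and its parameters are hypothetical.

#include <linux/blkdev.h>
#include <linux/fs.h>
#include <linux/gfp.h>

/* Zero a freshly allocated range; block numbers are in sb->s_blocksize units. */
static int zero_allocated_range(struct super_block *sb, sector_t block,
				sector_t nr_blocks)
{
	/* Offloads to the device where possible, else writes zero pages. */
	return sb_issue_zeroout(sb, block, nr_blocks, GFP_NOFS);
}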
2015-10-12  xfs: per-filesystem stats in sysfs  Bill O'Donnell  1  -0/+1
This patch implements per-filesystem stats objects in sysfs. It depends on the application of the previous patch series that develops the infrastructure to support both xfs global stats and xfs per-fs stats in sysfs.

Stats objects are instantiated when an xfs filesystem is mounted and deleted on unmount. With this patch, the stats directory is created and populated with the familiar stats and stats_clear files. Example:
 /sys/fs/xfs/sda9/stats/stats
 /sys/fs/xfs/sda9/stats/stats_clear

With this patch, the individual counts within the new per-fs stats file(s) remain at zero. Functions that use the macros to increment, decrement, and add-to the per-fs stats counts will be covered in a separate new patch to follow this one. Note that the counts within the global stats file (/sys/fs/xfs/stats/stats) advance normally and can be cleared as before this patch.

[dchinner: move setup/teardown to xfs_fs_{fill|put}_super() so it is done before/after any path that uses the per-mount stats.]

Signed-off-by: Bill O'Donnell <billodo@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-06-04  Merge branch 'xfs-dax-support' into for-next  Dave Chinner  1  -0/+2
2015-06-04  xfs: add initial DAX support  Dave Chinner  1  -0/+2
Add initial DAX support to XFS. To do this we need a new mount option to turn DAX on for the filesystem, and we need to propagate this into the inode flags whenever an inode is instantiated so that the per-inode checks throughout the code Do The Right Thing.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-05-29  xfs: use sparse chunk alignment for min. inode allocation requirement  Brian Foster  1  -0/+2
xfs_ialloc_ag_select() iterates through the allocation groups looking for free inodes or free space to determine whether to allow an inode allocation to proceed. If no free inodes are available, it assumes that an AG must have an extent longer than mp->m_ialloc_blks.

Sparse inode chunk support currently allows for allocations smaller than the traditional inode chunk size specified in m_ialloc_blks. The current minimum sparse allocation is set in the superblock sb_spino_align field at mkfs time. Create a new m_ialloc_min_blks field in xfs_mount and use this to represent the minimum supported allocation size for inode chunks. Initialize m_ialloc_min_blks at mount time based on whether sparse inodes are supported.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-02-23  xfs: remove xfs_mod_incore_sb API  Dave Chinner  1  -2/+1
Now that there are no users of the bitfield based incore superblock modification API, just remove the whole damn lot of it, including all the bitfield definitions. This finally removes a lot of cruft that has been around for a long time.

Credit goes to Christoph Hellwig for providing a great patch connecting all the dots to enable us to do this. This patch is derived from that work.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-02-23  xfs: replace xfs_mod_incore_sb_batched  Dave Chinner  1  -11/+0
Introduce helper functions for modifying fields in the superblock into xfs_trans.c, the only caller of xfs_mod_incore_sb_batch(). We can then use these directly in xfs_trans_unreserve_and_mod_sb() and so remove another user of the xfs_mod_incore_sb() API without losing any functionality or scalability of the transaction commit code.

Based on a patch from Christoph Hellwig.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-02-23  xfs: introduce xfs_mod_frextents  Dave Chinner  1  -0/+2
Add a new helper to modify the incore counter of free realtime extents. This matches the helpers used for inode and data block counters, and removes a significant user of the xfs_mod_incore_sb() interface.

Based on a patch originally from Christoph Hellwig.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-02-23  xfs: Remove icsb infrastructure  Dave Chinner  1  -67/+0
Now that the in-core superblock infrastructure has been replaced with generic per-cpu counters, we don't need it anymore. Nuke it from orbit so we are sure that it won't haunt us again...

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-02-23  xfs: use generic percpu counters for free block counter  Dave Chinner  1  -0/+3
XFS has hand-rolled per-cpu counters for the superblock since before there was any generic implementation. The free block counter is special in that it is used for ENOSPC detection outside transaction contexts for delayed allocation. This means that the counter needs to be accurate at zero. The current per-cpu counter code jumps through lots of hoops to ensure we never run past zero, but we don't need to make all those jumps with the generic counter implementation.

The generic counter implementation allows us to pass a "batch" threshold at which the addition/subtraction to the counter value will be folded back into the global value under lock. We can use this feature to reduce the batch size as we approach 0, in a very similar manner to the existing counters and their rebalance algorithm. If we use a batch size of 1 as we approach 0, then every addition and subtraction will be done against the global value and hence allow accurate detection of zero threshold crossing.

Hence we can replace the hand-rolled, accurate-at-zero counters with generic percpu counters.

Note: this removes just enough of the icsb infrastructure to compile without warnings. The rest will go in subsequent commits.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
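A sketch of that shrinking-batch idea using the generic percpu counter API (percpu_counter_add_batch() is the current name of what was __percpu_counter_add() at the time); the thresholds, helper name and undo-on-overshoot handling are illustrative, not the actual xfs_mod_fdblocks() logic.

#include <linux/cpumask.h>
#include <linux/errno.h>
#include <linux/percpu_counter.h>

#define FREE_BLOCKS_BATCH	1024	/* illustrative batch size */
#define FREE_BLOCKS_LOW_WATER	(2 * 1024 * num_online_cpus())

static int mod_free_blocks(struct percpu_counter *fdblocks, s64 delta)
{
	s32 batch = FREE_BLOCKS_BATCH;

	/* Near zero, fold every update into the global count for accuracy. */
	if (percpu_counter_read(fdblocks) < FREE_BLOCKS_LOW_WATER)
		batch = 1;

	percpu_counter_add_batch(fdblocks, delta, batch);

	/* Accurate-at-zero check: did this subtraction overshoot? */
	if (delta < 0 && percpu_counter_compare(fdblocks, 0) < 0) {
		percpu_counter_add_batch(fdblocks, -delta, batch);	/* undo */
		return -ENOSPC;
	}
	return 0;
}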
2015-02-23  xfs: use generic percpu counters for free inode counter  Dave Chinner  1  -0/+2
XFS has hand-rolled per-cpu counters for the superblock since before there was any generic implementation. The free inode counter is not used for any limit enforcement - the per-AG free inode counters are used during allocation to determine if there are inodes available for allocation. Hence we don't need any of the complexity of the hand-rolled counters and we can simply replace them with generic per-cpu counters, similar to the inode counter.

This version introduces an xfs_mod_ifree() helper function from Christoph Hellwig.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-02-23  xfs: use generic percpu counters for inode counter  Dave Chinner  1  -2/+5
XFS has hand-rolled per-cpu counters for the superblock since before there was any generic implementation. There are some warts around the use of them for the inode counter, as the hand-rolled counter is designed to be accurate at zero but has no specific accuracy at any other value. This design causes problems for the maximum inode count threshold enforcement, as there is no trigger that balances the counters as they get close to the maximum threshold.

Instead of designing new triggers for balancing, just replace the hand-rolled per-cpu counter with a generic counter. This enables us to update the counter through the normal superblock modification functions, but rather than do that we add an xfs_mod_icount() helper function (from Christoph Hellwig) and keep the percpu counter outside the superblock in the struct xfs_mount.

This means we still need to initialise the per-cpu counter specifically when we read the superblock, and vice versa when we log/write it, but it does mean that we don't need to change any other code.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-02-16  xfs: implement pNFS export operations  Christoph Hellwig  1  -0/+11
Add operations to export pNFS block layouts from an XFS filesystem. See the previous commit adding the operations for an explanation of them.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-01-22  xfs: consolidate superblock logging functions  Dave Chinner  1  -2/+1
We now have several superblock logging functions that are identical except for the transaction reservation and whether they should be synchronous transactions or not. Consolidate these all into a single function with a single reservation and a sync flag, and call it xfs_sync_sb().

Also, xfs_mod_sb() is not really a modification function - it's the operation of logging the superblock buffer. Hence change its name to reflect this.

Note that we have to change the mp->m_update_flags that are passed around at mount time to a boolean, simply to indicate a superblock update is needed.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-01-22  xfs: remove bitfield based superblock updates  Dave Chinner  1  -1/+1
When we log changes to the superblock, we first have to write them to the on-disk buffer, and then log that. Right now we have a complex bitfield based arrangement to write only the modified fields to the buffer before we log it. This used to be necessary as a performance optimisation because we logged the superblock buffer on every extent or inode allocation or freeing, and so performance was extremely important. We haven't done this for years, however, ever since the lazy superblock counters pulled the superblock logging out of the transaction commit fast path.

Hence we have a bunch of unnecessary complexity that makes writing the in-core superblock to disk much more complex than it needs to be. We only need to log the superblock now during management operations (e.g. during mount, unmount or quota control operations), so it is not a performance critical path anymore. As such, remove the complex field based logging mechanism and replace it with a simple conversion function similar to what we use for all other on-disk structures.

This means we always log the entirety of the superblock, but again, because we rarely modify the superblock this is not an issue for log bandwidth or CPU time. Indeed, if we do log the superblock frequently, delayed logging will minimise the impact of this overhead.

[Fixed gquota/pquota inode sharing regression noticed by bfoster.]

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-11-28  Merge branch 'xfs-consolidate-format-defs' into for-next  Dave Chinner  1  -4/+1
2014-11-28  xfs: merge xfs_ag.h into xfs_format.h  Christoph Hellwig  1  -4/+1
More on-disk format consolidation. A few declarations that weren't on-disk format related move into better suited spots.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-11-28  xfs: allow lazy sb counter sync during filesystem freeze sequence  Brian Foster  1  -1/+1
The expectation since the introduction of the lazy superblock counters is that the counters are synced and the superblock logged appropriately as part of the filesystem freeze sequence. This does not occur, however, due to the logic in xfs_fs_writable() that prevents progress when the fs is in any state other than SB_UNFROZEN.

While this is a bug, it has not been exposed to date because the last thing XFS does during freeze is dirty the log. The log recovery process recalculates the counters from AGI/AGF metadata to ensure everything is correct. Therefore, should a crash occur while an fs is frozen, the subsequent log recovery puts everything back in order. See the following commit for reference:

 92821e2b [XFS] Lazy Superblock Counters

We might not always want to rely on dirtying the log on a frozen fs. Modify xfs_log_sbcount() to proceed when the filesystem is freezing but not once the freeze process has completed. Modify xfs_fs_writable() to accept the minimum freeze level for which modifications should be blocked, to support various codepaths.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-11-28  xfs: replace global xfslogd wq with per-mount wq  Brian Foster  1  -0/+1
The xfslogd workqueue is a global, single-job workqueue for buffer ioend processing. This means we allow for a single work item at a time for all possible XFS mounts on a system. fsstress testing in loopback XFS over XFS configurations has reproduced xfslogd deadlocks due to the single threaded nature of the queue and dependencies introduced between the separate XFS instances by online discard (-o discard).

Discard over a loopback device converts the discard request to a hole punch (fallocate) on the underlying file. Online discard requests are issued synchronously and from xfslogd context in XFS, hence the xfslogd workqueue is blocked in the upper fs waiting on a hole punch request to be serviced in the lower fs. If the lower fs issues I/O that depends on xfslogd to complete, both filesystems end up hung indefinitely. This is reproduced reliably by generic/013 on XFS->loop->XFS test devices with the '-o discard' mount option. Further, docker implementations appear to use this kind of configuration for container instance filesystems by default (container fs->dm->loop->base fs) and are therefore subject to this deadlock when running on XFS.

Replace the global xfslogd workqueue with a per-mount variant. This guarantees each mount access to a single worker and prevents deadlocks due to inter-fs dependencies introduced by discard. Since the queue is only responsible for buffer iodone processing at this point in time, rename xfslogd to xfs-buf.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
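A sketch of the per-mount queue allocation; alloc_workqueue() is the real API, while the name format, flags and parameter below are illustrative rather than a quote of the XFS code.

#include <linux/workqueue.h>

static struct workqueue_struct *alloc_buf_workqueue(const char *fs_name)
{
	/*
	 * WQ_MEM_RECLAIM guarantees a rescuer thread so I/O completion work
	 * can make forward progress under memory pressure.
	 */
	return alloc_workqueue("xfs-buf/%s", WQ_MEM_RECLAIM | WQ_FREEZABLE,
			       1, fs_name);
}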
2014-07-15  xfs: add xfs_mount sysfs kobject  Brian Foster  1  -0/+1
Embed a base kobject into xfs_mount. This creates a kobject associated with each XFS mount and a subdirectory in sysfs with the name of the filesystem. The subdirectory lifecycle matches that of the mount.

Also add the new xfs_sysfs.[c,h] source files with some XFS sysfs infrastructure to facilitate attribute creation. Note that there are currently no attributes exported as part of the xfs_mount kobject. It exists solely to serve as a per-mount container for child objects.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
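A sketch of the embed-and-register pattern the commit describes, using the generic kobject API; the structure, ktype and parent kset names are invented.

#include <linux/kobject.h>

struct demo_mount {
	struct kobject kobj;		/* embedded: lifetime tied to the mount */
	/* ... per-mount state ... */
};

static void demo_mount_release(struct kobject *kobj)
{
	/* Called on the final kobject_put(); nothing extra to free here. */
}

static struct kobj_type demo_mount_ktype = {
	.release	= demo_mount_release,
	.sysfs_ops	= &kobj_sysfs_ops,	/* generic show/store dispatch */
};

/* Creates /sys/fs/xfs/<fsname> when parent is the xfs kset. */
static int demo_mount_sysfs_init(struct demo_mount *dm, struct kset *parent,
				 const char *fsname)
{
	dm->kobj.kset = parent;
	return kobject_init_and_add(&dm->kobj, &demo_mount_ktype, NULL,
				    "%s", fsname);
}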
2014-06-06  xfs: move node entry counts to xfs_da_geometry  Dave Chinner  1  -2/+0
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-06-06  xfs: convert dir/attr btree threshold to xfs_da_geometry  Dave Chinner  1  -2/+0
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-06-06  xfs: convert m_dirblksize to xfs_da_geometry  Dave Chinner  1  -1/+0
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-06-06  xfs: convert m_dirblkfsbs to xfs_da_geometry  Dave Chinner  1  -1/+0
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-06-06  xfs: convert directory segment limits to xfs_da_geometry  Dave Chinner  1  -3/+0
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-06-06  xfs: introduce directory geometry structure  Dave Chinner  1  -0/+3
The directory code has a dependency on the struct xfs_mount to supply the directory block geometry. Block size, block log size, and other parameters are pre-calculated in the struct xfs_mount or accessed directly from the superblock embedded in the struct xfs_mount.

Extract all of this geometry information out of the struct xfs_mount and superblock and place it into a new struct xfs_da_geometry defined by the directory code. Allocate and initialise it at mount time, and attach it to the struct xfs_mount so it can be passed back into the directory code appropriately, rather than using the struct xfs_mount.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
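A cut-down illustration of the kind of fields such a geometry structure gathers in one place; the names below approximate rather than reproduce struct xfs_da_geometry.

#include <linux/types.h>

struct demo_da_geometry {
	unsigned int	blksize;	/* da block size in bytes */
	unsigned int	fsbcount;	/* da block size in filesystem blocks */
	u8		fsblog;		/* log2 of the filesystem block size */
	u8		blklog;		/* log2 of the da block size */
	u32		node_ents;	/* entries in a node block */
	u32		magicpct;	/* split/join threshold within a block */
};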
2013-11-18  xfs: increase inode cluster size for v5 filesystems  Dave Chinner  1  -1/+1
v5 filesystems use 512 byte inodes as a minimum, so read inodes in clusters that are effectively half the size of a v4 filesystem with 256 byte inodes. For v5 filesystems, scale the inode cluster size with the size of the inode so that we keep a constant 32 inodes per cluster ratio for all inode IO.

This only works if mkfs.xfs sets the inode alignment appropriately for larger inode clusters, so this functionality is made conditional on mkfs doing the right thing. xfs_repair needs to know about the inode alignment changes, too.

Wall time:
            create  bulkstat  find+stat  ls -R   unlink
 v4         237s    161s      173s       201s    299s
 v5         235s    163s      205s       31s     356s
 patched    234s    160s      182s       29s     317s

System time:
            create  bulkstat  find+stat  ls -R   unlink
 v4         2601s   2490s     1653s      1656s   2960s
 v5         2637s   2497s     1681s      20s     3216s
 patched    2613s   2451s     1658s      20s     3007s

So, wall time is the same or down across the board, system time is the same or down across the board, and cache hit rates all improve except for the ls -R case, which is a pure cold cache directory read workload on v5 filesystems...

So, this patch removes most of the performance and CPU usage differential between v4 and v5 filesystems on traversal related workloads.

Note: while this patch is currently for v5 filesystems only, there is no reason it can't be ported back to v4 filesystems. This hasn't been done here because bringing the code back to v4 requires forwards and backwards kernel compatibility testing, i.e. to determine if older kernels(*) do the right thing with larger inode alignments but still only using 8k inode cluster sizes. None of this testing and validation on v4 filesystems has been done, so for the moment larger inode clusters are limited to v5 superblocks.

(*) a current default config v4 filesystem should mount just fine on 2.6.23 (when lazy-count support was introduced), and so if we change the alignment emitted by mkfs without a feature bit then we have to make sure it works properly on all kernels since 2.6.23. And if we allow it to be changed when the lazy-count bit is not set, then it's all kernels since v2 logs were introduced that need to be tested for compatibility...

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
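The scaling rule as a one-liner, with invented names; only the constant 32-inodes-per-cluster ratio comes from the commit message.

#include <linux/types.h>

#define DEMO_INODES_PER_CLUSTER	32

static inline unsigned int demo_inode_cluster_size(unsigned int inode_size)
{
	/* 256-byte inodes -> 8k clusters, 512-byte inodes -> 16k clusters. */
	return inode_size * DEMO_INODES_PER_CLUSTER;
}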
2013-10-30  xfs: vectorise DA btree operations  Dave Chinner  1  -0/+1
The remaining non-vectorised code for the directory structure is the node format blocks. This is shared with the attribute tree, and so is slightly more complex to vectorise.

Introduce a "non-directory" directory ops structure that is attached to all non-directory inodes so that attribute operations can be vectorised for all inodes. Once we do this, we can vectorise all the da btree operations.

Because this patch adds more infrastructure than it removes, the binary size does not decrease:

   text    data     bss     dec     hex filename
 794490   96802    1096  892388   d9de4 fs/xfs/xfs.o.orig
 792986   96802    1096  890884   d9804 fs/xfs/xfs.o.p1
 792350   96802    1096  890248   d9588 fs/xfs/xfs.o.p2
 789293   96802    1096  887191   d8997 fs/xfs/xfs.o.p3
 789005   96802    1096  886903   d8997 fs/xfs/xfs.o.p4
 789061   96802    1096  886959   d88af fs/xfs/xfs.o.p5
 789733   96802    1096  887631   d8b4f fs/xfs/xfs.o.p6

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-10-30  xfs: abstract the differences in dir2/dir3 via an ops vector  Dave Chinner  1  -0/+2
Lots of the dir code now goes through switches to determine what is the correct on-disk format to parse. It generally involves an "xfs_sbversion_hasfoo" check, dereferencing the superblock version and feature fields, and hence touching several cache lines per operation in the process. Some operations do multiple checks because they nest conditional operations and don't pass the information between each other in a direct fashion.

Hence, add an ops vector to the xfs_inode structure that is configured when the inode is initialised to point to all the correct decode and encoding operations. This will significantly reduce the branchiness and cacheline footprint of the directory object decoding and encoding.

This is the first patch in a series of conversion patches. It introduces the ops structure, the setup of it, and adds the first operation to the vector. Subsequent patches will convert directory ops one at a time to keep the changes simple and obvious.

Just this patch shows the benefit of such an approach on code size. Just converting the two shortform dir operations as this patch does decreases the built binary size by ~1500 bytes:

$ size fs/xfs/xfs.o.orig fs/xfs/xfs.o.p1
   text    data     bss     dec     hex filename
 794490   96802    1096  892388   d9de4 fs/xfs/xfs.o.orig
 792986   96802    1096  890884   d9804 fs/xfs/xfs.o.p1
$

That's a significant decrease in the instruction cache footprint of the directory code for such a simple change, and it indicates that this approach is definitely worth pursuing further.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-08-13  xfs: Introduce a new structure to hold transaction reservation items  Jie Liu  1  -1/+1
Introduce a new structure xfs_trans_res to hold transaction reservation item info per log ticket. We also need to improve xfs_trans_resv_calc() by initializing the log count as well as log flags for permanent log reservation.

Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-08-13  xfs: make struct xfs_perag kernel only  Dave Chinner  1  -0/+52
The struct xfs_perag has many kernel-only definitions in it, requiring a __KERNEL__ guard so userspace can use it too. Move it to xfs_mount.h so that it is kernel-only, and let userspace redefine its own version of the structure containing only what it needs. This gets rid of another __KERNEL__ check in the XFS header files.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-08-13  xfs: introduce xfs_sb.c for sharing with libxfs  Dave Chinner  1  -17/+2
xfs_mount.c is shared with userspace, but the only functions that are shared are to do with physical superblock manipulations. This means that less than 25% of the xfs_mount.c code is actually shared with userspace. Move all the superblock functions to xfs_sb.c and share that instead with libxfs.

Note that this will leave all the in-core transaction related superblock counter modifications in xfs_mount.c as none of that is shared with userspace. With a few more small changes, xfs_mount.h won't need to be shared with userspace anymore, either.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-08-13  xfs: split out transaction reservation code  Dave Chinner  1  -40/+2
The transaction reservation size calculations are used by both kernel and userspace, but most of the transaction code in xfs_trans.c is kernel specific. Split all the transaction reservation code out into its own files to make sharing with userspace simpler. This just leaves kernel-only definitions in xfs_trans.h, so it doesn't need to be shared with userspace anymore, either.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-06-19  xfs: Remove XFS_MOUNT_RETERR  Jie Liu  1  -2/+0
XFS_MOUNT_RETERR is going to be set in xfs_parseargs() if mp->m_dalign is enabled, so any time we enter the "if (mp->m_dalign)" branch in xfs_update_alignment(), XFS_MOUNT_RETERR is set and we will always emit a warning and return an error. Hence, we can remove it and get rid of a couple of redundant checks against it in xfs_update_alignment().

Thanks to Dave Chinner for the suggestions to simplify the code in xfs_parseargs().

Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Mark Tinguely <tinguely@sgi.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>