summaryrefslogtreecommitdiff
path: root/fs/xfs/xfs_aops.c
AgeCommit message (Collapse)AuthorFilesLines
2019-04-16xfs: merge adjacent io completions of the same typeDarrick J. Wong1-0/+86
It's possible for pagecache writeback to split up a large amount of work into smaller pieces for throttling purposes or to reduce the amount of time a writeback operation is pending. Whatever the reason, XFS can end up with a bunch of IO completions that call for the same operation to be performed on a contiguous extent mapping. Since mappings are extent based in XFS, we'd prefer to run fewer transactions when we can. When we're processing an ioend on the list of io completions, check to see if the next items on the list are both adjacent and of the same type. If so, we can merge the completions to reduce transaction overhead. On fast storage this doesn't seem to make much of a difference in performance, though the number of transactions for an overnight xfstests run seems to drop by ~5%. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
2019-04-16xfs: implement per-inode writeback completion queuesDarrick J. Wong1-11/+38
When scheduling writeback of dirty file data in the page cache, XFS uses IO completion workqueue items to ensure that filesystem metadata only updates after the write completes successfully. This is essential for converting unwritten extents to real extents at the right time and performing COW remappings. Unfortunately, XFS queues each IO completion work item to an unbounded workqueue, which means that the kernel can spawn dozens of threads to try to handle the items quickly. These threads need to take the ILOCK to update file metadata, which results in heavy ILOCK contention if a large number of the work items target a single file, which is inefficient. Worse yet, the writeback completion threads get stuck waiting for the ILOCK while holding transaction reservations, which can use up all available log reservation space. When that happens, metadata updates to other parts of the filesystem grind to a halt, even if the filesystem could otherwise have handled it. Even worse, if one of the things grinding to a halt happens to be a thread in the middle of a defer-ops finish holding the same ILOCK and trying to obtain more log reservation having exhausted the permanent reservation, we now have an ABBA deadlock - writeback completion has a transaction reserved and wants the ILOCK, and someone else has the ILOCK and wants a transaction reservation. Therefore, we create a per-inode writeback io completion queue + work item. When writeback finishes, it can add the ioend to the per-inode queue and let the single worker item process that queue. This dramatically cuts down on the number of kworkers and ILOCK contention in the system, and seems to have eliminated an occasional deadlock I was seeing while running generic/476. Testing with a program that simulates a heavy random-write workload to a single file demonstrates that the number of kworkers drops from approximately 120 threads per file to 1, without dramatically changing write bandwidth or pagecache access latency. Note that we leave the xfs-conv workqueue's max_active alone because we still want to be able to run ioend processing for as many inodes as the system can handle. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
2019-03-09Merge tag 'for-5.1/block-20190302' of git://git.kernel.dk/linux-blockLinus Torvalds1-4/+5
Pull block layer updates from Jens Axboe: "Not a huge amount of changes in this round, the biggest one is that we finally have Mings multi-page bvec support merged. Apart from that, this pull request contains: - Small series that avoids quiescing the queue for sysfs changes that match what we currently have (Aleksei) - Series of bcache fixes (via Coly) - Series of lightnvm fixes (via Mathias) - NVMe pull request from Christoph. Nothing major, just SPDX/license cleanups, RR mp policy (Hannes), and little fixes (Bart, Chaitanya). - BFQ series (Paolo) - Save blk-mq cpu -> hw queue mapping, removing a pointer indirection for the fast path (Jianchao) - fops->iopoll() added for async IO polling, this is a feature that the upcoming io_uring interface will use (Christoph, me) - Partition scan loop fixes (Dongli) - mtip32xx conversion from managed resource API (Christoph) - cdrom registration race fix (Guenter) - MD pull from Song, two minor fixes. - Various documentation fixes (Marcos) - Multi-page bvec feature. This brings a lot of nice improvements with it, like more efficient splitting, larger IOs can be supported without growing the bvec table size, and so on. (Ming) - Various little fixes to core and drivers" * tag 'for-5.1/block-20190302' of git://git.kernel.dk/linux-block: (117 commits) block: fix updating bio's front segment size block: Replace function name in string with __func__ nbd: propagate genlmsg_reply return code floppy: remove set but not used variable 'q' null_blk: fix checking for REQ_FUA block: fix NULL pointer dereference in register_disk fs: fix guard_bio_eod to check for real EOD errors blk-mq: use HCTX_TYPE_DEFAULT but not 0 to index blk_mq_tag_set->map block: optimize bvec iteration in bvec_iter_advance block: introduce mp_bvec_for_each_page() for iterating over page block: optimize blk_bio_segment_split for single-page bvec block: optimize __blk_segment_map_sg() for single-page bvec block: introduce bvec_nth_page() iomap: wire up the iopoll method block: add bio_set_polled() helper block: wire up block device iopoll method fs: add an iopoll method to struct file_operations loop: set GENHD_FL_NO_PART_SCAN after blkdev_reread_part() loop: do not print warn message if partition scan is successful block: bounce: make sure that bvec table is updated ...
2019-02-21xfs: introduce an always_cow modeChristoph Hellwig1-1/+1
Add a mode where XFS never overwrites existing blocks in place. This is to aid debugging our COW code, and also put infatructure in place for things like possible future support for zoned block devices, which can't support overwrites. This mode is enabled globally by doing a: echo 1 > /sys/fs/xfs/debug/always_cow Note that the parameter is global to allow running all tests in xfstests easily in this mode, which would not easily be possible with a per-fs sysfs file. In always_cow mode persistent preallocations are disabled, and fallocate will fail when called with a 0 mode (with our without FALLOC_FL_KEEP_SIZE), and not create unwritten extent for zeroed space when called with FALLOC_FL_ZERO_RANGE or FALLOC_FL_UNSHARE_RANGE. There are a few interesting xfstests failures when run in always_cow mode: - generic/392 fails because the bytes used in the file used to test hole punch recovery are less after the log replay. This is because the blocks written and then punched out are only freed with a delay due to the logging mechanism. - xfs/170 will fail as the already fragile file streams mechanism doesn't seem to interact well with the COW allocator - xfs/180 xfs/182 xfs/192 xfs/198 xfs/204 and xfs/208 will claim the file system is badly fragmented, but there is not much we can do to avoid that when always writing out of place - xfs/205 fails because overwriting a file in always_cow mode will require new space allocation and the assumption in the test thus don't work anymore. - xfs/326 fails to modify the file at all in always_cow mode after injecting the refcount error, leading to an unexpected md5sum after the remount, but that again is expected Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-02-21xfs: also truncate holes covered by COW blocksChristoph Hellwig1-15/+16
This only matters if we want to write data through the COW fork that is not actually an overwrite of existing data. Reasons for that are speculative COW fork allocations using the cowextsize, or a mode where we always write through the COW fork. Currently both can't actually happen, but I plan to enable them. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-02-17xfs: retry COW fork delalloc conversion when no extent was foundChristoph Hellwig1-2/+16
While we can only truncate a block under the page lock for the current page, there is no high-level synchronization for moving extents from the COW to the data fork. This means that for example we can have another thread doing a direct I/O completion that moves extents from the COW to the data fork race with writeback. While this race is very hard to hit the always_cow seems to reproduce it reasonably well, and it also exists without that. Because of that there is a chance that a delalloc conversion for the COW fork might not find any extents to convert. In that case we should retry the whole block lookup and now find the blocks in the data fork. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-02-17xfs: remove the truncate short cut in xfs_map_blocksChristoph Hellwig1-20/+0
Now that we properly handle the race with truncate in the delalloc allocator there is no need to short cut this exceptional case earlier on. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-02-17xfs: move xfs_iomap_write_allocate to xfs_aops.cChristoph Hellwig1-6/+45
This function is a small wrapper only used by the writeback code, so move it together with the writeback code and simplify it down to the glorified do { } while loop that is now is. A few bits intentionally got lost here: no need to call xfs_qm_dqattach because quotas are always attached when we create the delalloc reservation, and no need for the imap->br_startblock == 0 check given that xfs_bmapi_convert_delalloc already has a WARN_ON_ONCE for exactly that condition. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-02-17xfs: remove the s_maxbytes checks in xfs_map_blocksChristoph Hellwig1-6/+2
We already ensure all data fits into s_maxbytes in the write / fault path. The only reason we have them here is that they were copy and pasted from xfs_bmapi_read when we stopped using that function. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-02-17xfs: remove the io_type field from the writeback context and ioendChristoph Hellwig1-51/+42
The io_type field contains what is basically a summary of information from the inode fork and the imap. But we can just as easily use that information directly, simplifying a few bits here and there and improving the trace points. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-02-15Merge tag 'v5.0-rc6' into for-5.1/blockJens Axboe1-0/+2
Pull in 5.0-rc6 to avoid a dumb merge conflict with fs/iomap.c. This is needed since io_uring is now based on the block branch, to avoid a conflict between the multi-page bvecs and the bits of io_uring that touch the core block parts. * tag 'v5.0-rc6': (525 commits) Linux 5.0-rc6 x86/mm: Make set_pmd_at() paravirt aware MAINTAINERS: Update the ocores i2c bus driver maintainer, etc blk-mq: remove duplicated definition of blk_mq_freeze_queue Blk-iolatency: warn on negative inflight IO counter blk-iolatency: fix IO hang due to negative inflight counter MAINTAINERS: unify reference to xen-devel list x86/mm/cpa: Fix set_mce_nospec() futex: Handle early deadlock return correctly futex: Fix barrier comment net: dsa: b53: Fix for failure when irq is not defined in dt blktrace: Show requests without sector mips: cm: reprime error cause mips: loongson64: remove unreachable(), fix loongson_poweroff(). sit: check if IPv6 enabled before calling ip6_err_gen_icmpv6_unreach() geneve: should not call rt6_lookup() when ipv6 was disabled KVM: nVMX: unconditionally cancel preemption timer in free_nested (CVE-2019-7221) KVM: x86: work around leak of uninitialized stack contents (CVE-2019-7222) kvm: fix kvm_ioctl_create_device() reference counting (CVE-2019-6974) signal: Better detection of synchronous signals ...
2019-02-15block: enable multipage bvecsMing Lei1-2/+2
This patch pulls the trigger for multi-page bvecs. Reviewed-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-02-15block: allow bio_for_each_segment_all() to iterate over multi-page bvecMing Lei1-2/+3
This patch introduces one extra iterator variable to bio_for_each_segment_all(), then we can allow bio_for_each_segment_all() to iterate over multi-page bvec. Given it is just one mechannical & simple change on all bio_for_each_segment_all() users, this patch does tree-wide change in one single patch, so that we can avoid to use a temporary helper for this conversion. Reviewed-by: Omar Sandoval <osandov@fb.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-02-12xfs: remove superfluous writeback mapping eof trimmingBrian Foster1-15/+0
Now that the cached writeback mapping is explicitly invalidated on data fork changes, the EOF trimming band-aid is no longer necessary. Remove xfs_trim_extent_eof() as well since it has no other users. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-02-12xfs: validate writeback mapping using data fork seq counterBrian Foster1-13/+47
The writeback code caches the current extent mapping across multiple xfs_do_writepage() calls to avoid repeated lookups for sequential pages backed by the same extent. This is known to be slightly racy with extent fork changes in certain difficult to reproduce scenarios. The cached extent is trimmed to within EOF to help avoid the most common vector for this problem via speculative preallocation management, but this is a band-aid that does not address the fundamental problem. Now that we have an xfs_ifork sequence counter mechanism used to facilitate COW writeback, we can use the same mechanism to validate consistency between the data fork and cached writeback mappings. On its face, this is somewhat of a big hammer approach because any change to the data fork invalidates any mapping currently cached by a writeback in progress regardless of whether the data fork change overlaps with the range under writeback. In practice, however, the impact of this approach is minimal in most cases. First, data fork changes (delayed allocations) caused by sustained sequential buffered writes are amortized across speculative preallocations. This means that a cached mapping won't be invalidated by each buffered write of a common file copy workload, but rather only on less frequent allocation events. Second, the extent tree is always entirely in-core so an additional lookup of a usable extent mostly costs a shared ilock cycle and in-memory tree lookup. This means that a cached mapping reval is relatively cheap compared to the I/O itself. Third, spurious invalidations don't impact ioend construction. This means that even if the same extent is revalidated multiple times across multiple writepage instances, we still construct and submit the same size ioend (and bio) if the blocks are physically contiguous. Update struct xfs_writepage_ctx with a new field to hold the sequence number of the data fork associated with the currently cached mapping. Check the wpc seqno against the data fork when the mapping is validated and reestablish the mapping whenever the fork has changed since the mapping was cached. This ensures that writeback always uses a valid extent mapping and thus prevents lost writebacks and stale delalloc block problems. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Allison Henderson <allison.henderson@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-02-04xfs: eof trim writeback mapping as soon as it is cachedBrian Foster1-0/+2
The cached writeback mapping is EOF trimmed to try and avoid races between post-eof block management and writeback that result in sending cached data to a stale location. The cached mapping is currently trimmed on the validation check, which leaves a race window between the time the mapping is cached and when it is trimmed against the current inode size. For example, if a new mapping is cached by delalloc conversion on a blocksize == page size fs, we could cycle various locks, perform memory allocations, etc. in the writeback codepath before the associated mapping is eventually trimmed to i_size. This leaves enough time for a post-eof truncate and file append before the cached mapping is trimmed. The former event essentially invalidates a range of the cached mapping and the latter bumps the inode size such the trim on the next writepage event won't trim all of the invalid blocks. fstest generic/464 reproduces this scenario occasionally and causes a lost writeback and stale delalloc blocks warning on inode inactivation. To work around this problem, trim the cached writeback mapping as soon as it is cached in addition to on subsequent validation checks. This is a minor tweak to tighten the race window as much as possible until a proper invalidation mechanism is available. Fixes: 40214d128e07 ("xfs: trim writepage mapping to within eof") Cc: <stable@vger.kernel.org> # v4.14+ Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Allison Henderson <allison.henderson@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-10-18xfs: remove XFS_IO_INVALIDChristoph Hellwig1-2/+2
The invalid state isn't any different from a hole, so merge the two states. Use the more descriptive hole name, but keep it as the first value of the enum to catch uninitialized fields. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2018-08-07xfs: use WRITE_ONCE to update if_seqChristoph Hellwig1-2/+2
This adds ordering of the updates and makes sure we always see the if_seq update before the extent tree is modified. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-31xfs: avoid COW fork extent lookups in writeback if the fork didn't changeChristoph Hellwig1-5/+33
Used the per-fork sequence counter to avoid lookups in the writeback code unless the COW fork actually changed. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-30xfs: introduce a new xfs_inode_has_cow_data helperChristoph Hellwig1-2/+2
We have a few places that already check if an inode has actual data in the COW fork to avoid work on reflink inodes that do not actually have outstanding COW blocks. There are a few more places that can avoid working if doing the same check, so add a documented helper for this condition and use it in all places where it makes sense. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-12xfs: update my copyrights for the writeback and iomap codeChristoph Hellwig1-0/+1
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-12xfs: add support for sub-pagesize writeback without buffer_headsChristoph Hellwig1-434/+58
Switch to using the iomap_page structure for checking sub-page uptodate status and track sub-page I/O completion status, and remove large quantities of boilerplate code working around buffer heads. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-12xfs: allow writeback on pages without buffer headsChristoph Hellwig1-14/+37
Disable the IOMAP_F_BUFFER_HEAD flag on file systems with a block size equal to the page size, and deal with pages without buffer heads in writeback. Thanks to the previous refactoring this is basically trivial now. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-12xfs: refactor the tail of xfs_writepage_mapChristoph Hellwig1-33/+32
Rejuggle how we deal with the different error vs non-error and have ioends vs not have ioend cases to keep the fast path streamlined, and the duplicate code at a minimum. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-12xfs: remove xfs_start_page_writebackChristoph Hellwig1-26/+20
This helper only has two callers, one of them with a constant error argument. Remove it to make pending changes to the code a little easier. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-12xfs: move all writeback buffer_head manipulation into xfs_map_at_offsetChristoph Hellwig1-17/+5
This keeps it in a single place so it can be made otional more easily. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-12xfs: don't look at buffer heads in xfs_add_to_ioendChristoph Hellwig1-36/+32
Calculate all information for the bio based on the passed in information without requiring a buffer_head structure. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-12xfs: remove the imap_valid flagChristoph Hellwig1-51/+38
Simplify the way we check for a valid imap - we know we have a valid mapping after xfs_map_blocks returned successfully, and we know we can call xfs_imap_valid on any imap, as it will always fail on a zero-initialized map. We can also remove the xfs_imap_valid function and fold it into xfs_map_blocks now. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-12xfs: simplify xfs_map_blocks by using xfs_iext_lookup_extent directlyChristoph Hellwig1-14/+5
xfs_bmapi_read adds zero value in xfs_map_blocks. Replace it with a direct call to the low-level extent lookup function. Note that we now always pass a 0 length to the trace points as we ask for an unspecified len. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-12xfs: remove xfs_reflink_find_cow_mappingChristoph Hellwig1-6/+13
We only have one caller left, and open coding the simple extent list lookup in it allows us to make the code both more understandable and reuse calculations and variables already present. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-12xfs: make xfs_writepage_map extent map centricDave Chinner1-52/+36
xfs_writepage_map() iterates over the bufferheads on a page to decide what sort of IO to do and what actions to take. However, when it comes to reflink and deciding when it needs to execute a COW operation, we no longer look at the bufferhead state but instead we ignore than and look up internal state held in the COW fork extent list. This means xfs_writepage_map() is somewhat confused. It does stuff, then ignores it, then tries to handle the impedence mismatch by shovelling the results inside the existing mapping code. It works, but it's a bit of a mess and it makes it hard to fix the cached map bug that the writepage code currently has. To unify the two different mechanisms, we first have to choose a direction. That's already been set - we're de-emphasising bufferheads so they are no longer a control structure as we need to do taht to allow for eventual removal. Hence we need to move away from looking at bufferhead state to determine what operations we need to perform. We can't completely get rid of bufferheads yet - they do contain some state that is absolutely necessary, such as whether that part of the page contains valid data or not (buffer_uptodate()). Other state in the bufferhead is redundant: BH_dirty - the page is dirty, so we can ignore this and just write it BH_delay - we have delalloc extent info in the DATA fork extent tree BH_unwritten - same as BH_delay BH_mapped - indicates we've already used it once for IO and it is mapped to a disk address. Needs to be ignored for COW blocks. The BH_mapped flag is an interesting case - it's supposed to indicate that it's already mapped to disk and so we can just use it "as is". In theory, we don't even have to do an extent lookup to find where to write it too, but we have to do that anyway to determine we are actually writing over a valid extent. Hence it's not even serving the purpose of avoiding a an extent lookup during writeback, and so we can pretty much ignore it. Especially as we have to ignore it for COW operations... Therefore, use the extent map as the source of information to tell us what actions we need to take and what sort of IO we should perform. The first step is to have xfs_map_blocks() set the io type according to what it looks up. This means it can easily handle both normal overwrite and COW cases. The only thing we also need to add is the ability to return hole mappings. We need to return and cache hole mappings now for the case of multiple blocks per page. We no longer use the BH_mapped to indicate a block over a hole, so we have to get that info from xfs_map_blocks(). We cache it so that holes that span two pages don't need separate lookups. This allows us to avoid ever doing write IO over a hole, too. Now that we have xfs_map_blocks() returning both a cached map and the type of IO we need to perform, we can rewrite xfs_writepage_map() to drop all the bufferhead control. It's also much simplified because it doesn't need to explicitly handle COW operations. Instead of iterating bufferheads, it iterates blocks within the page and then looks up what per-block state is required from the appropriate bufferhead. It then validates the cached map, and if it's not valid, we get a new map. If we don't get a valid map or it's over a hole, we skip the block. At this point, we have to remap the bufferhead via xfs_map_at_offset(). As previously noted, we had to do this even if the buffer was already mapped as the mapping would be stale for XFS_IO_DELALLOC, XFS_IO_UNWRITTEN and XFS_IO_COW IO types. With xfs_map_blocks() now controlling the type, even XFS_IO_OVERWRITE types need remapping, as converted-but-not-yet- written delalloc extents beyond EOF can be reported at XFS_IO_OVERWRITE. Bufferheads that span such regions still need their BH_Delay flags cleared and their block numbers calculated, so we now unconditionally map each bufferhead before submission. But wait! There's more - remember the old "treat unwritten extents as holes on read" hack? Yeah, that means we can have a dirty page with unmapped, unwritten bufferheads that contain data! What makes these so special is that the unwritten "hole" bufferheads do not have a valid block device pointer, so if we attempt to write them xfs_add_to_ioend() blows up. So we make xfs_map_at_offset() do the "realtime or data device" lookup from the inode and ignore what was or wasn't put into the bufferhead when the buffer was instantiated. The astute reader will have realised by now that this code treats unwritten extents in multiple-blocks-per-page situations differently. If we get any combination of unwritten blocks on a dirty page that contain valid data in the page, we're going to convert them to real extents. This can actually be a win, because it means that pages with interleaving unwritten and written blocks will get converted to a single written extent with zeros replacing the interspersed unwritten blocks. This is actually good for reducing extent list and conversion overhead, and it means we issue a contiguous IO instead of lots of little ones. The downside is that we use up a little extra IO bandwidth. Neither of these seem like a bad thing given that spinning disks are seek sensitive, and SSDs/pmem have bandwidth to burn and the lower Io latency/CPU overhead of fewer, larger IOs will result in better performance on them... As a result of all this, the only state we actually care about from the bufferhead is a single flag - BH_Uptodate. We still use the bufferhead to pass some information to the bio via xfs_add_to_ioend(), but that is trivial to separate and pass explicitly. This means we really only need 1 bit of state per block per page from the buffered write path in the writeback path. Everything else we do with the bufferhead is purely to make the buffered IO front end continue to work correctly. i.e we've pretty much marginalised bufferheads in the writeback path completely. Signed-off-By: Dave Chinner <dchinner@redhat.com> [hch: forward port, refactor and split off bits into other commits] Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-12xfs: rename the offset variable in xfs_writepage_mapChristoph Hellwig1-10/+10
Calling it file_offset makes the usage more clear, especially with a new poffset variable that will be added soon for the offset inside the page. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-12xfs: remove xfs_map_cowChristoph Hellwig1-98/+97
We can handle the existing cow mapping case as a special case directly in xfs_writepage_map, and share code for allocating delalloc blocks with regular I/O in xfs_map_blocks. This means we need to always call xfs_map_blocks for reflink inodes, but we can still skip most of the work if it turns out that there is no COW mapping overlapping the current block. As a subtle detail we need to start caching holes in the wpc to deal with the case of COW reservations between EOF. But we'll need that infrastructure later anyway, so this is no big deal. Based on a patch from Dave Chinner. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-12xfs: remove xfs_reflink_trim_irec_to_next_cowChristoph Hellwig1-7/+0
We already have to check for overlapping COW extents everytime we come back to a page in xfs_writepage_map / xfs_map_cow, so this additional trim is not required. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-12xfs: don't use XFS_BMAPI_IGSTATE in xfs_map_blocksChristoph Hellwig1-4/+1
We want to be able to use the extent state as a reliably indicator for the type of I/O, and stop using the buffer head state. For this we need to stop using the XFS_BMAPI_IGSTATE so that we don't see merged extents of different types. Based on a patch from Dave Chinner. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-12xfs: don't clear imap_valid for a non-uptodate buffersChristoph Hellwig1-7/+2
Finding a buffer that isn't uptodate doesn't invalidate the mapping for any given block. The last_sector check will already take care of starting another ioend as soon as we find any non-update buffer, and if the current mapping doesn't include the next uptodate buffer the xfs_imap_valid check will take care of it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-12xfs: do not set the page uptodate in xfs_writepage_mapChristoph Hellwig1-6/+0
We already track the page uptodate status based on the buffer uptodate status, which is updated whenever reading or zeroing blocks. This code has been there since commit a ptool commit in 2002, which claims to: "merge" the 2.4 fsx fix for block size < page size to 2.5. This needed major changes to actually fit. and isn't present in other writepage implementations. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-12xfs: move locking into xfs_bmap_punch_delalloc_rangeChristoph Hellwig1-2/+0
Both callers want the same looking, so do it only once. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-12xfs: simplify xfs_aops_discard_pageChristoph Hellwig1-76/+9
Instead of looking at the buffer heads to see if a block is delalloc just call xfs_bmap_punch_delalloc_range on the whole page - this will leave any non-delalloc block intact and handle the iteration for us. As a side effect one more place stops caring about buffer heads and we can remove the xfs_check_page_type function entirely. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-12xfs: use iomap for blocksize == PAGE_SIZE readpage and readpagesChristoph Hellwig1-0/+4
For file systems with a block size that equals the page size we never do partial reads, so we can use the buffer_head-less iomap versions of readpage and readpages without conflicting with the buffer_head structures create later in write_begin. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-06-13Merge tag 'xfs-4.18-merge-10' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds1-13/+12
Pull more xfs updates from Darrick Wong: "Here's the second round of patches for XFS for 4.18. Most of the commits are small cleanups, bug fixes, and continued strengthening of metadata verifiers; the bulk of the diff is the conversion of the fs/xfs/ tree to use SPDX tags. This series has been run through a full xfstests run over the weekend and through a quick xfstests run against this morning's master, with no major failures reported. Summary: - Strengthen metadata checking to avoid ASSERTing on bad disk contents - Validate btree records that are being retrieved for clients - Strengthen root inode verification - Convert license blurbs to SPDX tags - Enable changing DAX flag on directories - Fix some writeback deadlocks in reflink - Refactor out some old xfs helpers - Move type verifiers to a separate file - Fix some fuzzer crashes - Various other bug fixes" * tag 'xfs-4.18-merge-10' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (31 commits) xfs: update incore per-AG inode count xfs: replace do_mod with native operations xfs: don't call xfs_da_shrink_inode with NULL bp xfs: clean up MIN/MAX xfs: move various type verifiers to common file xfs: xfs_reflink_convert_cow() memory allocation deadlock xfs: setup VFS i_rwsem lockdep state correctly xfs: fix string handling in label get/set functions xfs: convert to SPDX license tags xfs: validate btree records on retrieval xfs: push corruption -> ESTALE conversion to xfs_nfs_get_inode() xfs: verify root inode more thoroughly xfs: verify COW extent size hint is valid in inode verifier xfs: verify extent size hint is valid in inode verifier xfs: catch bad stripe alignment configurations iomap: fsync swap files before iterating mappings xfs: use xfs_trans_getsb in xfs_sync_sb_buf xfs: don't assert on corrupted unlinked inode list xfs: explicitly pass buffer size to xfs_corruption_error xfs: don't assert when on-disk btree pointers are garbage ...
2018-06-08xfs: xfs_reflink_convert_cow() memory allocation deadlockDave Chinner1-0/+11
xfs_reflink_convert_cow() manipulates the incore extent list in GFP_KERNEL context in the IO submission path whilst holding locked pages under writeback. This is a memory reclaim deadlock vector. This code is not in a transaction, so any memory allocations it makes aren't protected via the memalloc_nofs_save() context that transactions carry. Hence we need to run this call under memalloc_nofs_save() context to prevent potential memory allocations from being run as GFP_KERNEL and deadlocking. Signed-Off-By: Dave Chinner <dchinner@redhat.com> Reviewed-by: Allison Henderson <allison.henderson@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-06-07xfs: convert to SPDX license tagsDave Chinner1-13/+1
Remove the verbose license text from XFS files and replace them with SPDX tags. This does not change the license of any of the code, merely refers to the common, up-to-date license files in LICENSES/ This change was mostly scripted. fs/xfs/Makefile and fs/xfs/libxfs/xfs_fs.h were modified by hand, the rest were detected and modified by the following command: for f in `git grep -l "GNU General" fs/xfs/` ; do echo $f cat $f | awk -f hdr.awk > $f.new mv -f $f.new $f done And the hdr.awk script that did the modification (including detecting the difference between GPL-2.0 and GPL-2.0+ licenses) is as follows: $ cat hdr.awk BEGIN { hdr = 1.0 tag = "GPL-2.0" str = "" } /^ \* This program is free software/ { hdr = 2.0; next } /any later version./ { tag = "GPL-2.0+" next } /^ \*\// { if (hdr > 0.0) { print "// SPDX-License-Identifier: " tag print str print $0 str="" hdr = 0.0 next } print $0 next } /^ \* / { if (hdr > 1.0) next if (hdr > 0.0) { if (str != "") str = str "\n" str = str $0 next } print $0 next } /^ \*/ { if (hdr > 0.0) next print $0 next } // { if (hdr > 0.0) { if (str != "") str = str "\n" str = str $0 next } print $0 } END { } $ Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-06-05Merge tag 'xfs-4.18-merge-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds1-6/+15
Pull xfs updates from Darrick Wong: "New features this cycle include the ability to relabel mounted filesystems, support for fallocated swapfiles, and using FUA for pure data O_DSYNC directio writes. With this cycle we begin to integrate online filesystem repair and refactor the growfs code in preparation for eventual subvolume support, though the road ahead for both features is quite long. There are also numerous refactorings of the iomap code to remove unnecessary log overhead, to disentangle some of the quota code, and to prepare for buffer head removal in a future upstream kernel. Metadata validation continues to improve, both in the hot path veifiers and the online filesystem check code. I anticipate sending a second pull request in a few days with more metadata validation improvements. This series has been run through a full xfstests run over the weekend and through a quick xfstests run against this morning's master, with no major failures reported. Summary: - Strengthen inode number and structure validation when allocating inodes. - Reduce pointless buffer allocations during cache miss - Use FUA for pure data O_DSYNC directio writes - Various iomap refactorings - Strengthen quota metadata verification to avoid unfixable broken quota - Make AGFL block freeing a deferred operation to avoid blowing out transaction reservations when running complex operations - Get rid of the log item descriptors to reduce log overhead - Fix various reflink bugs where inodes were double-joined to transactions - Don't issue discards when trimming unwritten extents - Refactor incore dquot initialization and retrieval interfaces - Fix some locking problmes in the quota scrub code - Strengthen btree structure checks in scrub code - Rewrite swapfile activation to use iomap and support unwritten extents - Make scrub exit to userspace sooner when corruptions or cross-referencing problems are found - Make scrub invoke the data fork scrubber directly on metadata inodes - Don't do background reclamation of post-eof and cow blocks when the fs is suspended - Fix secondary superblock buffer lifespan hinting - Refactor growfs to use table-dispatched functions instead of long stringy functions - Move growfs code to libxfs - Implement online fs label getting and setting - Introduce online filesystem repair (in a very limited capacity) - Fix unit conversion problems in the realtime freemap iteration functions - Various refactorings and cleanups in preparation to remove buffer heads in a future release - Reimplement the old bmap call with iomap - Remove direct buffer head accesses from seek hole/data - Various bug fixes" * tag 'xfs-4.18-merge-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (121 commits) fs: use ->is_partially_uptodate in page_cache_seek_hole_data fs: remove the buffer_unwritten check in page_seek_hole_data fs: move page_cache_seek_hole_data to iomap.c xfs: use iomap_bmap iomap: add an iomap-based bmap implementation iomap: add a iomap_sector helper iomap: use __bio_add_page in iomap_dio_zero iomap: move IOMAP_F_BOUNDARY to gfs2 iomap: fix the comment describing IOMAP_NOWAIT iomap: inline data should be an iomap type, not a flag mm: split ->readpages calls to avoid non-contiguous pages lists mm: return an unsigned int from __do_page_cache_readahead mm: give the 'ret' variable a better name __do_page_cache_readahead block: add a lower-level bio_add_page interface xfs: fix error handling in xfs_refcount_insert() xfs: fix xfs_rtalloc_rec units xfs: strengthen rtalloc query range checks xfs: xfs_rtbuf_get should check the bmapi_read results xfs: xfs_rtword_t should be unsigned, not signed dax: change bdev_dax_supported() to support boolean returns ...
2018-06-02xfs: use iomap_bmapChristoph Hellwig1-6/+3
Switch to the iomap based bmap implementation to get rid of one of the last users of xfs_get_blocks. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-05-31xfs: convert to bioset_init()/mempool_init()Kent Overstreet1-1/+1
Convert XFS to embedded bio sets. Acked-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-16iomap: add a swapfile activation functionDarrick J. Wong1-0/+12
Add a new iomap_swapfile_activate function so that filesystems can activate swap files without having to use the obsolete and slow bmap function. This enables XFS to support fallocate'd swap files and swap files on realtime devices. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Jan Kara <jack@suse.cz>
2018-04-11export __set_page_dirtyMatthew Wilcox1-13/+2
XFS currently contains a copy-and-paste of __set_page_dirty(). Export it from buffer.c instead. Link: http://lkml.kernel.org/r/20180313132639.17387-6-willy@infradead.org Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com> Acked-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Cc: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp> Cc: Dave Chinner <david@fromorbit.com> Cc: Will Deacon <will.deacon@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-10Merge tag 'libnvdimm-for-4.17' of ↵Linus Torvalds1-16/+18
git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm Pull libnvdimm updates from Dan Williams: "This cycle was was not something I ever want to repeat as there were several late changes that have only now just settled. Half of the branch up to commit d2c997c0f145 ("fs, dax: use page->mapping to warn...") have been in -next for several releases. The of_pmem driver and the address range scrub rework were late arrivals, and the dax work was scaled back at the last moment. The of_pmem driver missed a previous merge window due to an oversight. A sense of obligation to rectify that miss is why it is included for 4.17. It has acks from PowerPC folks. Stephen reported a build failure that only occurs when merging it with your latest tree, for now I have fixed that up by disabling modular builds of of_pmem. A test merge with your tree has received a build success report from the 0day robot over 156 configs. An initial version of the ARS rework was submitted before the merge window. It is self contained to libnvdimm, a net code reduction, and passing all unit tests. The filesystem-dax changes are based on the wait_var_event() functionality from tip/sched/core. However, late review feedback showed that those changes regressed truncate performance to a large degree. The branch was rewound to drop the truncate behavior change and now only includes preparation patches and cleanups (with full acks and reviews). The finalization of this dax-dma-vs-trnucate work will need to wait for 4.18. Summary: - A rework of the filesytem-dax implementation provides for detection of unmap operations (truncate / hole punch) colliding with in-progress device-DMA. A fix for these collisions remains a work-in-progress pending resolution of truncate latency and starvation regressions. - The of_pmem driver expands the users of libnvdimm outside of x86 and ACPI to describe an implementation of persistent memory on PowerPC with Open Firmware / Device tree. - Address Range Scrub (ARS) handling is completely rewritten to account for the fact that ARS may run for 100s of seconds and there is no platform defined way to cancel it. ARS will now no longer block namespace initialization. - The NVDIMM Namespace Label implementation is updated to handle label areas as small as 1K, down from 128K. - Miscellaneous cleanups and updates to unit test infrastructure" * tag 'libnvdimm-for-4.17' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (39 commits) libnvdimm, of_pmem: workaround OF_NUMA=n build error nfit, address-range-scrub: add module option to skip initial ars nfit, address-range-scrub: rework and simplify ARS state machine nfit, address-range-scrub: determine one platform max_ars value powerpc/powernv: Create platform devs for nvdimm buses doc/devicetree: Persistent memory region bindings libnvdimm: Add device-tree based driver libnvdimm: Add of_node to region and bus descriptors libnvdimm, region: quiet region probe libnvdimm, namespace: use a safe lookup for dimm device name libnvdimm, dimm: fix dpa reservation vs uninitialized label area libnvdimm, testing: update the default smart ctrl_temperature libnvdimm, testing: Add emulation for smart injection commands nfit, address-range-scrub: introduce nfit_spa->ars_state libnvdimm: add an api to cast a 'struct nd_region' to its 'struct device' nfit, address-range-scrub: fix scrub in-progress reporting dax, dm: allow device-mapper to operate without dax support dax: introduce CONFIG_DAX_DRIVER fs, dax: use page->mapping to warn if truncate collides with a busy page ext2, dax: introduce ext2_dax_aops ...
2018-04-06Merge branch 'work.misc' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull misc vfs updates from Al Viro: "Assorted stuff, including Christoph's I_DIRTY patches" * 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: fs: move I_DIRTY_INODE to fs.h ubifs: fix bogus __mark_inode_dirty(I_DIRTY_SYNC | I_DIRTY_DATASYNC) call ntfs: fix bogus __mark_inode_dirty(I_DIRTY_SYNC | I_DIRTY_DATASYNC) call gfs2: fix bogus __mark_inode_dirty(I_DIRTY_SYNC | I_DIRTY_DATASYNC) calls fs: fold open_check_o_direct into do_dentry_open vfs: Replace stray non-ASCII homoglyph characters with their ASCII equivalents vfs: make sure struct filename->iname is word-aligned get rid of pointless includes of fs_struct.h [poll] annotate SAA6588_CMD_POLL users