summaryrefslogtreecommitdiff
path: root/fs/xfs/xfs_aops.c
AgeCommit message (Collapse)AuthorFilesLines
2016-03-16mm: simplify lock_page_memcg()Johannes Weiner1-4/+3
Now that migration doesn't clear page->mem_cgroup of live pages anymore, it's safe to make lock_page_memcg() and the memcg stat functions take pages, and spare the callers from memcg objects. [akpm@linux-foundation.org: fix warnings] Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Suggested-by: Vladimir Davydov <vdavydov@virtuozzo.com> Acked-by: Vladimir Davydov <vdavydov@virtuozzo.com> Cc: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-16mm: memcontrol: generalize locking for the page->mem_cgroup bindingJohannes Weiner1-4/+4
These patches tag the page cache radix tree eviction entries with the memcg an evicted page belonged to, thus making per-cgroup LRU reclaim work properly and be as adaptive to new cache workingsets as global reclaim already is. This should have been part of the original thrash detection patch series, but was deferred due to the complexity of those patches. This patch (of 5): So far the only sites that needed to exclude charge migration to stabilize page->mem_cgroup have been per-cgroup page statistics, hence the name mem_cgroup_begin_page_stat(). But per-cgroup thrash detection will add another site that needs to ensure page->mem_cgroup lifetime. Rename these locking functions to the more generic lock_page_memcg() and unlock_page_memcg(). Since charge migration is a cgroup1 feature only, we might be able to delete it at some point, and these now easy to identify locking sites along with it. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Suggested-by: Vladimir Davydov <vdavydov@virtuozzo.com> Acked-by: Vladimir Davydov <vdavydov@virtuozzo.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-02-27dax: move writeback calls into the filesystemsRoss Zwisler1-0/+4
Previously calls to dax_writeback_mapping_range() for all DAX filesystems (ext2, ext4 & xfs) were centralized in filemap_write_and_wait_range(). dax_writeback_mapping_range() needs a struct block_device, and it used to get that from inode->i_sb->s_bdev. This is correct for normal inodes mounted on ext2, ext4 and XFS filesystems, but is incorrect for DAX raw block devices and for XFS real-time files. Instead, call dax_writeback_mapping_range() directly from the filesystem ->writepages function so that it can supply us with a valid block device. This also fixes DAX code to properly flush caches in response to sync(2). Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Jan Kara <jack@suse.cz> Cc: Al Viro <viro@ftp.linux.org.uk> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Jens Axboe <axboe@fb.com> Cc: Matthew Wilcox <matthew.r.wilcox@intel.com> Cc: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-02-27dax: give DAX clearing code correct bdevRoss Zwisler1-1/+1
dax_clear_blocks() needs a valid struct block_device and previously it was using inode->i_sb->s_bdev in all cases. This is correct for normal inodes on mounted ext2, ext4 and XFS filesystems, but is incorrect for DAX raw block devices and for XFS real-time devices. Instead, rename dax_clear_blocks() to dax_clear_sectors(), and change its arguments to take a bdev and a sector instead of an inode and a block. This better reflects what the function does, and it allows the filesystem and raw block device code to pass in an appropriate struct block_device. Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> Suggested-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Al Viro <viro@ftp.linux.org.uk> Cc: Dave Chinner <david@fromorbit.com> Cc: Jens Axboe <axboe@fb.com> Cc: Matthew Wilcox <matthew.r.wilcox@intel.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-08xfs: add tracepoints to readpage callsDave Chinner1-0/+2
This allows us to see page cache driven readahead in action as it passes through XFS. This helps to understand buffered read throughput problems such as readahead IO IO sizes being too small for the underlying device to reach max throughput. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-11-03Merge branch 'xfs-dax-updates' into for-nextDave Chinner1-46/+50
2015-11-03xfs: DAX does not use IO completion callbacksDave Chinner1-39/+0
For DAX, we are now doing block zeroing during allocation. This means we no longer need a special DAX fault IO completion callback to do unwritten extent conversion. Because mmap never extends the file size (it SEGVs the process) we don't need a callback to update the file size, either. Hence we can remove the completion callbacks from the __dax_fault and __dax_mkwrite calls. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-11-03xfs: Don't use unwritten extents for DAXDave Chinner1-4/+9
DAX has a page fault serialisation problem with block allocation. Because it allows concurrent page faults and does not have a page lock to serialise faults to the same page, it can get two concurrent faults to the page that race. When two read faults race, this isn't a huge problem as the data underlying the page is not changing and so "detect and drop" works just fine. The issues are to do with write faults. When two write faults occur, we serialise block allocation in get_blocks() so only one faul will allocate the extent. It will, however, be marked as an unwritten extent, and that is where the problem lies - the DAX fault code cannot differentiate between a block that was just allocated and a block that was preallocated and needs zeroing. The result is that both write faults end up zeroing the block and attempting to convert it back to written. The problem is that the first fault can zero and convert before the second fault starts zeroing, resulting in the zeroing for the second fault overwriting the data that the first fault wrote with zeros. The second fault then attempts to convert the unwritten extent, which is then a no-op because it's already written. Data loss occurs as a result of this race. Because there is no sane locking construct in the page fault code that we can use for serialisation across the page faults, we need to ensure block allocation and zeroing occurs atomically in the filesystem. This means we can still take concurrent page faults and the only time they will serialise is in the filesystem mapping/allocation callback. The page fault code will always see written, initialised extents, so we will be able to remove the unwritten extent handling from the DAX code when all filesystems are converted. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-11-03xfs: fix inode size update overflow in xfs_map_direct()Dave Chinner1-6/+44
Both direct IO and DAX pass an offset and count into get_blocks that will overflow a s64 variable when an IO goes into the last supported block in a file (i.e. at offset 2^63 - 1FSB bytes). This can be seen from the tracing: xfs_get_blocks_alloc: [...] offset 0x7ffffffffffff000 count 4096 xfs_gbmap_direct: [...] offset 0x7ffffffffffff000 count 4096 xfs_gbmap_direct_none:[...] offset 0x7ffffffffffff000 count 4096 0x7ffffffffffff000 + 4096 = 0x8000000000000000, and hence that overflows the s64 offset and we fail to detect the need for a filesize update and an ioend is not allocated. This is *mostly* avoided for direct IO because such extending IOs occur with full block allocation, and so the "IS_UNWRITTEN()" check still evaluates as true and we get an ioend that way. However, doing single sector extending IOs to this last block will expose the fact that file size updates will not occur after the first allocating direct IO as the overflow will then be exposed. There is one further complexity: the DAX page fault path also exposes the same issue in block allocation. However, page faults cannot extend the file size, so in this case we want to allocate the block but do not want to allocate an ioend to enable file size update at IO completion. Hence we now need to distinguish between the direct IO patch allocation and dax fault path allocation to avoid leaking ioend structures. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-10-12xfs: add missing ilock around dio write last extent alignmentBrian Foster1-5/+5
The iomap codepath (via get_blocks()) acquires and release the inode lock in the case of a direct write that requires block allocation. This is because xfs_iomap_write_direct() allocates a transaction, which means the ilock must be dropped and reacquired after the transaction is allocated and reserved. xfs_iomap_write_direct() invokes xfs_iomap_eof_align_last_fsb() before the transaction is created and thus before the ilock is reacquired. This can lead to calls to xfs_iread_extents() and reads of the in-core extent list without any synchronization (via xfs_bmap_eof() and xfs_bmap_last_extent()). xfs_iread_extents() assert fails if the ilock is not held, but this is not currently seen in practice as the current callers had already invoked xfs_bmapi_read(). What has been seen in practice are reports of crashes down in the xfs_bmap_eof() codepath on direct writes due to seemingly bogus pointer references from xfs_iext_get_ext(). While an explicit reproducer is not currently available to confirm the cause of the problem, crash analysis and code inspection from David Jeffrey had identified the insufficient locking. xfs_iomap_eof_align_last_fsb() is called from other contexts with the inode lock already held, so we cannot acquire it therein. __xfs_get_blocks() acquires and drops the ilock with variable flags to cover the event that the extent list must be read in. The common case is that __xfs_get_blocks() acquires the shared ilock. To provide locking around the last extent alignment call without adding more lock cycles to the dio path, update xfs_iomap_write_direct() to expect the shared ilock held on entry and do the extent alignment under its protection. Demote the lock, if necessary, from __xfs_get_blocks() and push the xfs_qm_dqattach() call outside of the shared lock critical section. Also, add an assert to document that the extent list is always expected to be present in this path. Otherwise, we risk a call to xfs_iread_extents() while under the shared ilock. This is safe as all current callers have executed an xfs_bmapi_read() call under the current iolock context. Reported-by: David Jeffery <djeffery@redhat.com> Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-10-12cancel the setfilesize transation when io error happenZhaohongjiang1-2/+11
When I ran xfstest/073 case, the remount process was blocked to wait transactions to be zero. I found there was a io error happened, and the setfilesize transaction was not released properly. We should add the changes to cancel the io error in this case. Reproduction steps: 1. dd if=/dev/zero of=xfs1.img bs=1M count=2048 2. mkfs.xfs xfs1.img 3. losetup -f ./xfs1.img /dev/loop0 4. mount -t xfs /dev/loop0 /home/test_dir/ 5. mkdir /home/test_dir/test 6. mkfs.xfs -dfile,name=image,size=2g 7. mount -t xfs -o loop image /home/test_dir/test 8. cp a file bigger than 2g to /home/test_dir/test 9. mount -t xfs -o remount,ro /home/test_dir/test [ dchinner: moved io error detection to xfs_setfilesize_ioend() after transaction context restoration. ] Signed-off-by: Zhao Hongjiang <zhaohongjiang@huawei.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-09-07Merge tag 'xfs-for-linus-4.3' of ↵Linus Torvalds1-1/+2
git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs Pull xfs updates from Dave Chinner: "There isn't a whole lot to this update - it's mostly bug fixes and they are spread pretty much all over XFS. There are some corruption fixes, some fixes for log recovery, some fixes that prevent unount from hanging, a lockdep annotation rework for inode locking to prevent false positives and the usual random bunch of cleanups and minor improvements. Deatils: - large rework of EFI/EFD lifecycle handling to fix log recovery corruption issues, crashes and unmount hangs - separate metadata UUID on disk to enable changing boot label UUID for v5 filesystems - fixes for gcc miscompilation on certain platforms and optimisation levels - remote attribute allocation and recovery corruption fixes - inode lockdep annotation rework to fix bugs with too many subclasses - directory inode locking changes to prevent lockdep false positives - a handful of minor corruption fixes - various other small cleanups and bug fixes" * tag 'xfs-for-linus-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs: (42 commits) xfs: fix error gotos in xfs_setattr_nonsize xfs: add mssing inode cache attempts counter increment xfs: return errors from partial I/O failures to files libxfs: bad magic number should set da block buffer error xfs: fix non-debug build warnings xfs: collapse allocsize and biosize mount option handling xfs: Fix file type directory corruption for btree directories xfs: lockdep annotations throw warnings on non-debug builds xfs: Fix uninitialized return value in xfs_alloc_fix_freelist() xfs: inode lockdep annotations broke non-lockdep build xfs: flush entire file on dio read/write to cached file xfs: Fix xfs_attr_leafblock definition libxfs: readahead of dir3 data blocks should use the read verifier xfs: stop holding ILOCK over filldir callbacks xfs: clean up inode lockdep annotations xfs: swap leaf buffer into path struct atomically during path shift xfs: relocate sparse inode mount warning xfs: dquots should be stamped with sb_meta_uuid xfs: log recovery needs to validate against sb_meta_uuid xfs: growfs not aware of sb_meta_uuid ...
2015-09-06Merge branch 'for-linus' of ↵Linus Torvalds1-4/+2
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs updates from Al Viro: "In this one: - d_move fixes (Eric Biederman) - UFS fixes (me; locking is mostly sane now, a bunch of bugs in error handling ought to be fixed) - switch of sb_writers to percpu rwsem (Oleg Nesterov) - superblock scalability (Josef Bacik and Dave Chinner) - swapon(2) race fix (Hugh Dickins)" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (65 commits) vfs: Test for and handle paths that are unreachable from their mnt_root dcache: Reduce the scope of i_lock in d_splice_alias dcache: Handle escaped paths in prepend_path mm: fix potential data race in SyS_swapon inode: don't softlockup when evicting inodes inode: rename i_wb_list to i_io_list sync: serialise per-superblock sync operations inode: convert inode_sb_list_lock to per-sb inode: add hlist_fake to avoid the inode hash lock in evict writeback: plug writeback at a high level change sb_writers to use percpu_rw_semaphore shift percpu_counter_destroy() into destroy_super_work() percpu-rwsem: kill CONFIG_PERCPU_RWSEM percpu-rwsem: introduce percpu_rwsem_release() and percpu_rwsem_acquire() percpu-rwsem: introduce percpu_down_read_trylock() document rwsem_release() in sb_wait_write() fix the broken lockdep logic in __sb_start_write() introduce __sb_writers_{acquired,release}() helpers ufs_inode_get{frag,block}(): get rid of 'phys' argument ufs_getfrag_block(): tidy up a bit ...
2015-08-28xfs: return errors from partial I/O failures to filesDavid Jeffery1-1/+2
There is an issue with xfs's error reporting in some cases of I/O partially failing and partially succeeding. Calls like fsync() can report success even though not all I/O was successful in partial-failure cases such as one disk of a RAID0 array being offline. The issue can occur when there are more than one bio per xfs_ioend struct. Each call to xfs_end_bio() for a bio completing will write a value to ioend->io_error. If a successful bio completes after any failed bio, no error is reported do to it writing 0 over the error code set by any failed bio. The I/O error information is now lost and when the ioend is completed only success is reported back up the filesystem stack. xfs_end_bio() should only set ioend->io_error in the case of BIO_UPTODATE being clear. ioend->io_error is initialized to 0 at allocation so only needs to be updated by a failed bio. Also check that ioend->io_error is 0 so that the first error reported will be the error code returned. Cc: stable@vger.kernel.org Signed-off-by: David Jeffery <djeffery@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-08-15introduce __sb_writers_{acquired,release}() helpersOleg Nesterov1-4/+2
Preparation to hide the sb->s_writers internals from xfs and btrfs. Add 2 trivial define's they can use rather than play with ->s_writers directly. No changes in btrfs/transaction.o and xfs/xfs_aops.o. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Reviewed-by: Jan Kara <jack@suse.com>
2015-08-13block: remove bio_get_nr_vecs()Kent Overstreet1-2/+1
We can always fill up the bio now, no need to estimate the possible size based on queue parameters. Acked-by: Steven Whitehouse <swhiteho@redhat.com> Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> [hch: rebased and wrote a changelog] Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lin <ming.l@ssi.samsung.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-07-29block: add a bi_error field to struct bioChristoph Hellwig1-3/+2
Currently we have two different ways to signal an I/O error on a BIO: (1) by clearing the BIO_UPTODATE flag (2) by returning a Linux errno value to the bi_end_io callback The first one has the drawback of only communicating a single possible error (-EIO), and the second one has the drawback of not beeing persistent when bios are queued up, and are not passed along from child to parent bio in the ever more popular chaining scenario. Having both mechanisms available has the additional drawback of utterly confusing driver authors and introducing bugs where various I/O submitters only deal with one of them, and the others have to add boilerplate code to deal with both kinds of error returns. So add a new bi_error field to store an errno value directly in struct bio and remove the existing mechanisms to clean all this up. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-07-01Merge tag 'xfs-for-linus-4.2-rc1' of ↵Linus Torvalds1-45/+113
git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs Pul xfs updates from Dave Chinner: "There's a couple of small API changes to the core DAX code which required small changes to the ext2 and ext4 code bases, but otherwise everything is within the XFS codebase. This update contains: - A new sparse on-disk inode record format to allow small extents to be used for inode allocation when free space is fragmented. - DAX support. This includes minor changes to the DAX core code to fix problems with lock ordering and bufferhead mapping abuse. - transaction commit interface cleanup - removal of various unnecessary XFS specific type definitions - cleanup and optimisation of freelist preparation before allocation - various minor cleanups - bug fixes for - transaction reservation leaks - incorrect inode logging in unwritten extent conversion - mmap lock vs freeze ordering - remote symlink mishandling - attribute fork removal issues" * tag 'xfs-for-linus-4.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs: (49 commits) xfs: don't truncate attribute extents if no extents exist xfs: clean up XFS_MIN_FREELIST macros xfs: sanitise error handling in xfs_alloc_fix_freelist xfs: factor out free space extent length check xfs: xfs_alloc_fix_freelist() can use incore perag structures xfs: remove xfs_caddr_t xfs: use void pointers in log validation helpers xfs: return a void pointer from xfs_buf_offset xfs: remove inst_t xfs: remove __psint_t and __psunsigned_t xfs: fix remote symlinks on V5/CRC filesystems xfs: fix xfs_log_done interface xfs: saner xfs_trans_commit interface xfs: remove the flags argument to xfs_trans_cancel xfs: pass a boolean flag to xfs_trans_free_items xfs: switch remaining xfs_trans_dup users to xfs_trans_roll xfs: check min blks for random debug mode sparse allocations xfs: fix sparse inodes 32-bit compile failure xfs: add initial DAX support xfs: add DAX IO path support ...
2015-06-04Merge branch 'xfs-commit-cleanup' into for-nextDave Chinner1-3/+3
Conflicts: fs/xfs/xfs_attr_inactive.c
2015-06-04xfs: saner xfs_trans_commit interfaceChristoph Hellwig1-1/+1
The flags argument to xfs_trans_commit is not useful for most callers, as a commit of a transaction without a permanent log reservation must pass 0 here, and all callers for a transaction with a permanent log reservation except for xfs_trans_roll must pass XFS_TRANS_RELEASE_LOG_RES. So remove the flags argument from the public xfs_trans_commit interfaces, and introduce low-level __xfs_trans_commit variant just for xfs_trans_roll that regrants a log reservation instead of releasing it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-06-04xfs: remove the flags argument to xfs_trans_cancelChristoph Hellwig1-2/+2
xfs_trans_cancel takes two flags arguments: XFS_TRANS_RELEASE_LOG_RES and XFS_TRANS_ABORT. Both of them are a direct product of the transaction state, and can be deducted: - any dirty transaction needs XFS_TRANS_ABORT to be properly canceled, and XFS_TRANS_ABORT is a noop for a transaction that is not dirty. - any transaction with a permanent log reservation needs XFS_TRANS_RELEASE_LOG_RES to be properly canceled, and passing XFS_TRANS_RELEASE_LOG_RES for a transaction without a permanent log reservation is invalid. So just remove the flags argument and do the right thing. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-06-04xfs: add DAX IO path supportDave Chinner1-9/+27
DAX does not do buffered IO (can't buffer direct access!) and hence all read/write IO is vectored through the direct IO path. Hence we need to add the DAX IO path callouts to the direct IO infrastructure. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-06-04xfs: add DAX file operations supportDave Chinner1-33/+83
Add the initial support for DAX file operations to XFS. This includes the necessary block allocation and mmap page fault hooks for DAX to function. Note that there are changes to the splice interfaces to ensure that for DAX splice avoids direct page cache manipulations and instead takes the DAX IO paths for read/write operations. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-06-02memcg: add per cgroup dirty page accountingGreg Thelen1-2/+10
When modifying PG_Dirty on cached file pages, update the new MEM_CGROUP_STAT_DIRTY counter. This is done in the same places where global NR_FILE_DIRTY is managed. The new memcg stat is visible in the per memcg memory.stat cgroupfs file. The most recent past attempt at this was http://thread.gmane.org/gmane.linux.kernel.cgroups/8632 The new accounting supports future efforts to add per cgroup dirty page throttling and writeback. It also helps an administrator break down a container's memory usage and provides evidence to understand memcg oom kills (the new dirty count is included in memcg oom kill messages). The ability to move page accounting between memcg (memory.move_charge_at_immigrate) makes this accounting more complicated than the global counter. The existing mem_cgroup_{begin,end}_page_stat() lock is used to serialize move accounting with stat updates. Typical update operation: memcg = mem_cgroup_begin_page_stat(page) if (TestSetPageDirty()) { [...] mem_cgroup_update_page_stat(memcg) } mem_cgroup_end_page_stat(memcg) Summary of mem_cgroup_end_page_stat() overhead: - Without CONFIG_MEMCG it's a no-op - With CONFIG_MEMCG and no inter memcg task movement, it's just rcu_read_lock() - With CONFIG_MEMCG and inter memcg task movement, it's rcu_read_lock() + spin_lock_irqsave() A memcg parameter is added to several routines because their callers now grab mem_cgroup_begin_page_stat() which returns the memcg later needed by for mem_cgroup_update_page_stat(). Because mem_cgroup_begin_page_stat() may disable interrupts, some adjustments are needed: - move __mark_inode_dirty() from __set_page_dirty() to its caller. __mark_inode_dirty() locking does not want interrupts disabled. - use spin_lock_irqsave(tree_lock) rather than spin_lock_irq() in __delete_from_page_cache(), replace_page_cache_page(), invalidate_complete_page2(), and __remove_mapping(). text data bss dec hex filename 8925147 1774832 1785856 12485835 be84cb vmlinux-!CONFIG_MEMCG-before 8925339 1774832 1785856 12486027 be858b vmlinux-!CONFIG_MEMCG-after +192 text bytes 8965977 1784992 1785856 12536825 bf4bf9 vmlinux-CONFIG_MEMCG-before 8966750 1784992 1785856 12537598 bf4efe vmlinux-CONFIG_MEMCG-after +773 text bytes Performance tests run on v4.0-rc1-36-g4f671fe2f952. Lower is better for all metrics, they're all wall clock or cycle counts. The read and write fault benchmarks just measure fault time, they do not include I/O time. * CONFIG_MEMCG not set: baseline patched kbuild 1m25.030000(+-0.088% 3 samples) 1m25.426667(+-0.120% 3 samples) dd write 100 MiB 0.859211561 +-15.10% 0.874162885 +-15.03% dd write 200 MiB 1.670653105 +-17.87% 1.669384764 +-11.99% dd write 1000 MiB 8.434691190 +-14.15% 8.474733215 +-14.77% read fault cycles 254.0(+-0.000% 10 samples) 253.0(+-0.000% 10 samples) write fault cycles 2021.2(+-3.070% 10 samples) 1984.5(+-1.036% 10 samples) * CONFIG_MEMCG=y root_memcg: baseline patched kbuild 1m25.716667(+-0.105% 3 samples) 1m25.686667(+-0.153% 3 samples) dd write 100 MiB 0.855650830 +-14.90% 0.887557919 +-14.90% dd write 200 MiB 1.688322953 +-12.72% 1.667682724 +-13.33% dd write 1000 MiB 8.418601605 +-14.30% 8.673532299 +-15.00% read fault cycles 266.0(+-0.000% 10 samples) 266.0(+-0.000% 10 samples) write fault cycles 2051.7(+-1.349% 10 samples) 2049.6(+-1.686% 10 samples) * CONFIG_MEMCG=y non-root_memcg: baseline patched kbuild 1m26.120000(+-0.273% 3 samples) 1m25.763333(+-0.127% 3 samples) dd write 100 MiB 0.861723964 +-15.25% 0.818129350 +-14.82% dd write 200 MiB 1.669887569 +-13.30% 1.698645885 +-13.27% dd write 1000 MiB 8.383191730 +-14.65% 8.351742280 +-14.52% read fault cycles 265.7(+-0.172% 10 samples) 267.0(+-0.000% 10 samples) write fault cycles 2070.6(+-1.512% 10 samples) 2084.4(+-2.148% 10 samples) As expected anon page faults are not affected by this patch. tj: Updated to apply on top of the recent cancel_dirty_page() changes. Signed-off-by: Sha Zhengju <handai.szj@gmail.com> Signed-off-by: Greg Thelen <gthelen@google.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-05-05bio: skip atomic inc/dec of ->bi_cnt for most use casesJens Axboe1-1/+0
Struct bio has a reference count that controls when it can be freed. Most uses cases is allocating the bio, which then returns with a single reference to it, doing IO, and then dropping that single reference. We can remove this atomic_dec_and_test() in the completion path, if nobody else is holding a reference to the bio. If someone does call bio_get() on the bio, then we flag the bio as now having valid count and that we must properly honor the reference count when it's being put. Tested-by: Robert Elliott <elliott@hp.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-04-24Merge tag 'xfs-for-linus-4.1-rc1' of ↵Linus Torvalds1-78/+192
git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs Pull xfs update from Dave Chinner: "This update contains: - RENAME_WHITEOUT support - conversion of per-cpu superblock accounting to use generic counters - new inode mmap lock so that we can lock page faults out of truncate, hole punch and other direct extent manipulation functions to avoid racing mmap writes from causing data corruption - rework of direct IO submission and completion to solve data corruption issue when running concurrent extending DIO writes. Also solves problem of running IO completion transactions in interrupt context during size extending AIO writes. - FALLOC_FL_INSERT_RANGE support for inserting holes into a file via direct extent manipulation to avoid needing to copy data within the file - attribute block header field overflow fix for 64k block size filesystems - Lots of changes to log messaging to be more informative and concise when errors occur. Also prevent a lot of unnecessary log spamming due to cascading failures in error conditions. - lots of cleanups and bug fixes One thing of note is the direct IO fixes that we merged last week after the window opened. Even though a little late, they fix a user reported data corruption and have been pretty well tested. I figured there was not much point waiting another 2 weeks for -rc1 to be released just so I could send them to you..." * tag 'xfs-for-linus-4.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs: (49 commits) xfs: using generic_file_direct_write() is unnecessary xfs: direct IO EOF zeroing needs to drain AIO xfs: DIO write completion size updates race xfs: DIO writes within EOF don't need an ioend xfs: handle DIO overwrite EOF update completion correctly xfs: DIO needs an ioend for writes xfs: move DIO mapping size calculation xfs: factor DIO write mapping from get_blocks xfs: unlock i_mutex in xfs_break_layouts xfs: kill unnecessary firstused overflow check on attr3 leaf removal xfs: use larger in-core attr firstused field and detect overflow xfs: pass attr geometry to attr leaf header conversion functions xfs: disallow ro->rw remount on norecovery mount xfs: xfs_shift_file_space can be static xfs: Add support FALLOC_FL_INSERT_RANGE for fallocate fs: Add support FALLOC_FL_INSERT_RANGE for fallocate xfs: Fix incorrect positive ENOMEM return xfs: xfs_mru_cache_insert() should use GFP_NOFS xfs: %pF is only for function pointers xfs: fix shadow warning in xfs_da3_root_split() ...
2015-04-16xfs: DIO write completion size updates raceDave Chinner1-0/+7
xfs_end_io_direct_write() can race with other IO completions when updating the in-core inode size. The IO completion processing is not serialised for direct IO - they are done either under the IOLOCK_SHARED for non-AIO DIO, and without any IOLOCK held at all during AIO DIO completion. Hence the non-atomic test-and-set update of the in-core inode size is racy and can result in the in-core inode size going backwards if the race if hit just right. If the inode size goes backwards, this can trigger the EOF zeroing code to run incorrectly on the next IO, which then will zero data that has successfully been written to disk by a previous DIO. To fix this bug, we need to serialise the test/set updates of the in-core inode size. This first patch introduces locking around the relevant updates and checks in the DIO path. Because we now have an ioend in xfs_end_io_direct_write(), we know exactly then we are doing an IO that requires an in-core EOF update, and we know that they are not running in interrupt context. As such, we do not need to use irqsave() spinlock variants to protect against interrupts while the lock is held. Hence we can use an existing spinlock in the inode to do this serialisation and so not need to grow the struct xfs_inode just to work around this problem. This patch does not address the test/set EOF update in generic_file_write_direct() for various reasons - that will be done as a followup with separate explanation. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-04-16xfs: DIO writes within EOF don't need an ioendDave Chinner1-30/+39
DIO writes that lie entirely within EOF have nothing to do in IO completion. In this case, we don't need no steekin' ioend, and so we can avoid allocating an ioend until we have a mapping that spans EOF. This means that IO completion has two contexts - deferred completion to the dio workqueue that uses an ioend, and interrupt completion that does nothing because there is nothing that can be done in this context. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-04-16xfs: handle DIO overwrite EOF update completion correctlyDave Chinner1-31/+30
Currently a DIO overwrite that extends the EOF (e.g sub-block IO or write into allocated blocks beyond EOF) requires a transaction for the EOF update. Thi is done in IO completion context, but we aren't explicitly handling this situation properly and so it can run in interrupt context. Ensure that we defer IO that spans EOF correctly to the DIO completion workqueue, and now that we have an ioend in IO completion we can use the common ioend completion path to do all the work. Note: we do not preallocate the append transaction as we can have multiple mapping and allocation calls per direct IO. hence preallocating can still leave us with nested transactions by attempting to map and allocate more blocks after we've preallocated an append transaction. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-04-16xfs: DIO needs an ioend for writesDave Chinner1-10/+82
Currently we can only tell DIO completion that an IO requires unwritten extent completion. This is done by a hacky non-null private pointer passed to Io completion, but the private pointer does not actually contain any information that is used. We also need to pass to IO completion the fact that the IO may be beyond EOF and so a size update transaction needs to be done. This is currently determined by checks in the io completion, but we need to determine if this is necessary at block mapping time as we need to defer the size update transactions to a completion workqueue, just like unwritten extent conversion. To do this, first we need to allocate and pass an ioend to to IO completion. Add this for unwritten extent conversion; we'll do the EOF updates in the next commit. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-04-16xfs: move DIO mapping size calculationDave Chinner1-33/+46
The mapping size calculation is done last in __xfs_get_blocks(), but we are going to need the actual mapping size we will use to map the direct IO correctly in xfs_map_direct(). Factor out the calculation for code clarity, and move the call to be the first operation in mapping the extent to the returned buffer. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-04-16xfs: factor DIO write mapping from get_blocksDave Chinner1-13/+27
Clarify and separate the buffer mapping logic so that the direct IO mapping is not tangled up in propagating the extent status to teh mapping buffer. This makes it easier to extend the direct IO mapping to use an ioend in future. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-04-12direct_IO: remove rw from a_ops->direct_IO()Omar Sandoval1-1/+0
Now that no one is using rw, remove it completely. Signed-off-by: Omar Sandoval <osandov@osandov.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-04-12direct_IO: use iov_iter_rw() instead of rw everywhereOmar Sandoval1-1/+1
The rw parameter to direct_IO is redundant with iov_iter->type, and treated slightly differently just about everywhere it's used: some users do rw & WRITE, and others do rw == WRITE where they should be doing a bitwise check. Simplify this with the new iov_iter_rw() helper, which always returns either READ or WRITE. Signed-off-by: Omar Sandoval <osandov@osandov.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-04-12Remove rw from {,__,do_}blockdev_direct_IO()Omar Sandoval1-5/+4
Most filesystems call through to these at some point, so we'll start here. Signed-off-by: Omar Sandoval <osandov@osandov.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-03-26fs: move struct kiocb to fs.hChristoph Hellwig1-1/+0
struct kiocb now is a generic I/O container, so move it to fs.h. Also do a #include diet for aio.h while we're at it. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-02-02xfs: don't allocate an ioend for direct I/O completionsChristoph Hellwig1-88/+61
Back in the days when the direct I/O ->end_io callback could be called from interrupt context for AIO we needed a structure to hand off to the workqueue, and reused the ioend structure for this purpose. These days ->end_io is always called from user or workqueue context, which allows us to avoid this memory allocation and simplify the code significantly. [dchinner: removed now unused xfs_finish_ioend_sync() function after Brian Foster did an initial review. ] Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-11-28xfs: move most of xfs_sb.h to xfs_format.hChristoph Hellwig1-1/+0
More on-disk format consolidation. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-11-28xfs: merge xfs_ag.h into xfs_format.hChristoph Hellwig1-1/+0
More on-disk format consolidation. A few declarations that weren't on-disk format related move into better suitable spots. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-11-28xfs: merge xfs_dinode.h into xfs_format.hChristoph Hellwig1-1/+0
More consolidatation for the on-disk format defintions. Note that the XFS_IS_REALTIME_INODE moves to xfs_linux.h instead as it is not related to the on disk format, but depends on a CONFIG_ option. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-10-13Merge branch 'xfs-misc-fixes-for-3.18-3' into for-nextDave Chinner1-0/+7
2014-10-02xfs: restore buffer_head unwritten bit on ioend cancelBrian Foster1-0/+7
xfs_vm_writepage() walks each buffer_head on the page, maps to the block on disk and attaches to a running ioend structure that represents the I/O submission. A new ioend is created when the type of I/O (unwritten, delayed allocation or overwrite) required for a particular buffer_head differs from the previous. If a buffer_head is a delalloc or unwritten buffer, the associated bits are cleared by xfs_map_at_offset() once the buffer_head is added to the ioend. The process of mapping each buffer_head occurs in xfs_map_blocks() and acquires the ilock in blocking or non-blocking mode, depending on the type of writeback in progress. If the lock cannot be acquired for non-blocking writeback, we cancel the ioend, redirty the page and return. Writeback will revisit the page at some later point. Note that we acquire the ilock for each buffer on the page. Therefore during non-blocking writeback, it is possible to add an unwritten buffer to the ioend, clear the unwritten state, fail to acquire the ilock when mapping a subsequent buffer and cancel the ioend. If this occurs, the unwritten status of the buffer sitting in the ioend has been lost. The page will eventually hit writeback again, but xfs_vm_writepage() submits overwrite I/O instead of unwritten I/O and does not perform unwritten extent conversion at I/O completion. This leads to data corruption because unwritten extents are treated as holes on reads and zeroes are returned instead of reading from disk. Modify xfs_cancel_ioend() to restore the buffer unwritten bit for ioends of type XFS_IO_UNWRITTEN. This ensures that unwritten extent conversion occurs once the page is eventually written back. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-09-23xfs: ensure WB_SYNC_ALL writeback handles partial pages correctlyDave Chinner1-2/+14
XFS has been having trouble with stray delayed allocation extents beyond EOF for a long time. Recent changes to the collapse range code has triggered erroneous EBUSY errors on page invalidtion for block size smaller than page size filesystems. These have been caused by dirty buffers beyond EOF on a partial page which do not get written to disk during a sync. The issue is that write-ahead in xfs_cluster_write() finds such a partial page and handles it by leaving the page dirty but pushing it into a writeback state. This used to work just fine, as the write_cache_pages() code would then find the dirty partial page in the next mapping tree lookup as the dirty tag is still set. Unfortunately, when we moved to a mark and sweep approach to writeback to fix other writeback sync issues, we broken this. THe act of marking the page as under writeback now clears the TOWRITE tag in the radix tree, even though the page is still dirty. This causes the TOWRITE tag to be cleared, and hence the next lookup on the mapping tree does not find the dirty partial page and so doesn't try to write it again. This same writeback bug was found recently in ext4 and fixed in commit 1c8349a ("ext4: fix data integrity sync in ordered mode") without communication to the wider filesystem community. We can use exactly the same fix here so the TOWRITE flag is not cleared on partial page writes. cc: stable@vger.kernel.org # dependent on 1c8349a17137b93f0a83f276c764a6df1b9a116e Root-cause-found-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-09-02xfs: don't dirty buffers beyond EOFDave Chinner1-0/+61
generic/263 is failing fsx at this point with a page spanning EOF that cannot be invalidated. The operations are: 1190 mapwrite 0x52c00 thru 0x5e569 (0xb96a bytes) 1191 mapread 0x5c000 thru 0x5d636 (0x1637 bytes) 1192 write 0x5b600 thru 0x771ff (0x1bc00 bytes) where 1190 extents EOF from 0x54000 to 0x5e569. When the direct IO write attempts to invalidate the cached page over this range, it fails with -EBUSY and so any attempt to do page invalidation fails. The real question is this: Why can't that page be invalidated after it has been written to disk and cleaned? Well, there's data on the first two buffers in the page (1k block size, 4k page), but the third buffer on the page (i.e. beyond EOF) is failing drop_buffers because it's bh->b_state == 0x3, which is BH_Uptodate | BH_Dirty. IOWs, there's dirty buffers beyond EOF. Say what? OK, set_buffer_dirty() is called on all buffers from __set_page_buffers_dirty(), regardless of whether the buffer is beyond EOF or not, which means that when we get to ->writepage, we have buffers marked dirty beyond EOF that we need to clean. So, we need to implement our own .set_page_dirty method that doesn't dirty buffers beyond EOF. This is messy because the buffer code is not meant to be shared and it has interesting locking issues on the buffer dirty bits. So just copy and paste it and then modify it to suit what we need. Note: the solutions the other filesystems and generic block code use of marking the buffers clean in ->writepage does not work for XFS. It still leaves dirty buffers beyond EOF and invalidations still fail. Hence rather than play whack-a-mole, this patch simply prevents those buffers from being dirtied in the first place. cc: <stable@kernel.org> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-06-25xfs: global error sign conversionDave Chinner1-6/+6
Convert all the errors the core XFs code to negative error signs like the rest of the kernel and remove all the sign conversion we do in the interface layers. Errors for conversion (and comparison) found via searches like: $ git grep " E" fs/xfs $ git grep "return E" fs/xfs $ git grep " E[A-Z].*;$" fs/xfs Negation points found via searches like: $ git grep "= -[a-z,A-Z]" fs/xfs $ git grep "return -[a-z,A-D,F-Z]" fs/xfs $ git grep " -[a-z].*;" fs/xfs [ with some bits I missed from Brian Foster ] Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-06-22xfs: Nuke XFS_ERROR macroEric Sandeen1-5/+5
XFS_ERROR was designed long ago to trap return values, but it's not runtime configurable, it's not consistently used, and we can do similar error trapping with ftrace scripts and triggers from userspace. Just nuke XFS_ERROR and associated bits. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-06-12Merge branch 'for-linus' of ↵Linus Torvalds1-10/+7
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs updates from Al Viro: "This the bunch that sat in -next + lock_parent() fix. This is the minimal set; there's more pending stuff. In particular, I really hope to get acct.c fixes merged this cycle - we need that to deal sanely with delayed-mntput stuff. In the next pile, hopefully - that series is fairly short and localized (kernel/acct.c, fs/super.c and fs/namespace.c). In this pile: more iov_iter work. Most of prereqs for ->splice_write with sane locking order are there and Kent's dio rewrite would also fit nicely on top of this pile" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (70 commits) lock_parent: don't step on stale ->d_parent of all-but-freed one kill generic_file_splice_write() ceph: switch to iter_file_splice_write() shmem: switch to iter_file_splice_write() nfs: switch to iter_splice_write_file() fs/splice.c: remove unneeded exports ocfs2: switch to iter_file_splice_write() ->splice_write() via ->write_iter() bio_vec-backed iov_iter optimize copy_page_{to,from}_iter() bury generic_file_aio_{read,write} lustre: get rid of messing with iovecs ceph: switch to ->write_iter() ceph_sync_direct_write: stop poking into iov_iter guts ceph_sync_read: stop poking into iov_iter guts new helper: copy_page_from_iter() fuse: switch to ->write_iter() btrfs: switch to ->write_iter() ocfs2: switch to ->write_iter() xfs: switch to ->write_iter() ...
2014-06-10Merge branch 'xfs-misc-fixes-3-for-3.16' into for-nextDave Chinner1-3/+3
2014-06-06xfs: tone down writepage/releasepage WARN_ONsChristoph Hellwig1-3/+3
I recently ran into the issue fixed by "xfs: kill buffers over failed write ranges properly" which spams the log with lots of backtraces. Make debugging any issues like that easier by using WARN_ON_ONCE in the writeback code. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-05-20xfs: fix infinite loop at xfs_vm_writepage on 32bit systemJie Liu1-6/+43
Write to a file with an offset greater than 16TB on 32-bit system and then trigger page write-back via sync(1) will cause task hang. # block_size=4096 # offset=$(((2**32 - 1) * $block_size)) # xfs_io -f -c "pwrite $offset $block_size" /storage/test_file # sync INFO: task sync:2590 blocked for more than 120 seconds. "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. sync D c1064a28 0 2590 2097 0x00000000 ..... Call Trace: [<c1064a28>] ? ttwu_do_wakeup+0x18/0x130 [<c1066d0e>] ? try_to_wake_up+0x1ce/0x220 [<c1066dbf>] ? wake_up_process+0x1f/0x40 [<c104fc2e>] ? wake_up_worker+0x1e/0x30 [<c15b6083>] schedule+0x23/0x60 [<c15b3c2d>] schedule_timeout+0x18d/0x1f0 [<c12a143e>] ? do_raw_spin_unlock+0x4e/0x90 [<c10515f1>] ? __queue_delayed_work+0x91/0x150 [<c12a12ef>] ? do_raw_spin_lock+0x3f/0x100 [<c12a143e>] ? do_raw_spin_unlock+0x4e/0x90 [<c15b5b5d>] wait_for_completion+0x7d/0xc0 [<c1066d60>] ? try_to_wake_up+0x220/0x220 [<c116a4d2>] sync_inodes_sb+0x92/0x180 [<c116fb05>] sync_inodes_one_sb+0x15/0x20 [<c114a8f8>] iterate_supers+0xb8/0xc0 [<c116faf0>] ? fdatawrite_one_bdev+0x20/0x20 [<c116fc21>] sys_sync+0x31/0x80 [<c15be18d>] sysenter_do_call+0x12/0x28 This issue can be triggered via xfstests/generic/308. The reason is that the end_index is unsigned long with maximum value '2^32-1=4294967295' on 32-bit platform, and the given offset cause it wrapped to 0, so that the following codes will repeat again and again until the task schedule time out: end_index = offset >> PAGE_CACHE_SHIFT; last_index = (offset - 1) >> PAGE_CACHE_SHIFT; if (page->index >= end_index) { unsigned offset_into_page = offset & (PAGE_CACHE_SIZE - 1); /* * Just skip the page if it is fully outside i_size, e.g. due * to a truncate operation that is in progress. */ if (page->index >= end_index + 1 || offset_into_page == 0) { ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ unlock_page(page); return 0; } In order to check if a page is fully outsids i_size or not, we can fix the code logic as below: if (page->index > end_index || (page->index == end_index && offset_into_page == 0)) Secondly, there still has another similar issue when calculating the end offset for mapping the filesystem blocks to the file blocks for delalloc. With the same tests to above, run unmount(8) will cause kernel panic if CONFIG_XFS_DEBUG is enabled: XFS: Assertion failed: XFS_FORCED_SHUTDOWN(ip->i_mount) || \ ip->i_delayed_blks == 0, file: fs/xfs/xfs_super.c, line: 964 kernel BUG at fs/xfs/xfs_message.c:108! invalid opcode: 0000 [#1] SMP task: edddc100 ti: ec6ee000 task.ti: ec6ee000 EIP: 0060:[<f83d87cb>] EFLAGS: 00010296 CPU: 1 EIP is at assfail+0x2b/0x30 [xfs] .............. Call Trace: [<f83d9cd4>] xfs_fs_destroy_inode+0x74/0x120 [xfs] [<c115ddf1>] destroy_inode+0x31/0x50 [<c115deff>] evict+0xef/0x170 [<c115dfb2>] dispose_list+0x32/0x40 [<c115ea3a>] evict_inodes+0xca/0xe0 [<c1149706>] generic_shutdown_super+0x46/0xd0 [<c11497b9>] kill_block_super+0x29/0x70 [<c1149a14>] deactivate_locked_super+0x44/0x70 [<c114a427>] deactivate_super+0x47/0x60 [<c1161c3d>] mntput_no_expire+0xcd/0x120 [<c1162ae8>] SyS_umount+0xa8/0x370 [<c1162dce>] SyS_oldumount+0x1e/0x20 [<c15be18d>] sysenter_do_call+0x12/0x28 That because the end_offset is evaluated to 0 which is the same reason to above, hence the mapping and covertion for dealloc file blocks to file system blocks did not happened. This patch just fixed both issues. Reported-by: Michael L. Semon <mlsemon35@gmail.com> Signed-off-by: Jie Liu <jeff.liu@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>