path: root/fs/iomap
Age         Commit message  (Author, files changed, lines -/+)
2020-08-24  treewide: Use fallthrough pseudo-keyword  (Gustavo A. R. Silva, 1 file, -2/+2)
Replace the existing /* fall through */ comments and their variants with the new pseudo-keyword macro fallthrough[1]. Also remove fall-through markings that are no longer necessary. [1] https://www.kernel.org/doc/html/v5.7/process/deprecated.html?highlight=fallthrough#implicit-switch-case-fall-through Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
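A minimal before/after sketch of the conversion this commit describes (the switch statement and the STATE_* names are illustrative, not taken from the actual patch):

    /* Before: compilers only see an implicit fall through. */
    switch (state) {
    case STATE_PREP:
        prepare();
        /* fall through */
    case STATE_RUN:
        run();
        break;
    }

    /* After: the fallthrough pseudo-keyword documents the intent to the
     * compiler, so -Wimplicit-fallthrough can flag the unintentional cases. */
    switch (state) {
    case STATE_PREP:
        prepare();
        fallthrough;
    case STATE_RUN:
        run();
        break;
    }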
2020-08-05  iomap: fall back to buffered writes for invalidation failures  (Christoph Hellwig, 2 files, -5/+12)
Failing to invalidate the page cache means data is incoherent, which is a very bad state for the system. Always fall back to buffered I/O through the page cache if we can't invalidate mappings. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Acked-by: Bob Peterson <rpeterso@redhat.com> Acked-by: Damien Le Moal <damien.lemoal@wdc.com> Reviewed-by: Theodore Ts'o <tytso@mit.edu> # for ext4 Reviewed-by: Andreas Gruenbacher <agruenba@redhat.com> # for gfs2 Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
2020-08-05  iomap: Only invalidate page cache pages on direct IO writes  (Dave Chinner, 1 file, -16/+15)
The historic requirement for XFS to invalidate cached pages on direct IO reads has been lost in the twisty pages of history - it was inherited from Irix, which implemented page cache invalidation on read as a method of working around problems synchronising page cache state with uncached IO. XFS has carried this ever since. In the initial linux ports it was necessary to get mmap and DIO to play "ok" together and not immediately corrupt data. This was the state of play until the linux kernel had infrastructure to track unwritten extents and synchronise page faults with allocations and unwritten extent conversions (->page_mkwrite infrastructure). IOWs, the page cache invalidation on DIO read was necessary to prevent trivial data corruptions. This didn't solve all the problems, though. There were performance problems if we didn't invalidate the entire page cache over the file on read - we couldn't easily determine if the cached pages were over the range of the IO, and invalidation required taking a serialising lock (i_mutex) on the inode. This serialising lock was an issue for XFS, as it was the only exclusive lock in the direct IO read path. Hence if there were any cached pages, we'd just invalidate the entire file in one go so that subsequent IOs didn't need to take the serialising lock. This was a problem that prevented ranged invalidation from being particularly useful for avoiding the remaining coherency issues. This was solved with the conversion of i_mutex to i_rwsem and the conversion of the XFS inode IO lock to use i_rwsem. Hence we could now just do ranged invalidation and the performance problem went away. However, page cache invalidation was still needed to serialise sub-page/sub-block zeroing via direct IO against buffered IO because bufferhead state attached to the cached page could get out of whack when direct IOs were issued. We've removed bufferheads from the XFS code, and we don't carry any extent state on the cached pages anymore, so this problem has gone away, too. IOWs, it would appear that we don't have any good reason to be invalidating the page cache on DIO reads anymore. Hence remove the invalidation on read because it is unnecessary overhead, not needed to maintain coherency between mmap/buffered access and direct IO anymore, and removing it stops anyone from using direct IO reads to intentionally invalidate the page cache of a file. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-07-06  iomap: Make sure iomap_end is called after iomap_begin  (Andreas Gruenbacher, 1 file, -4/+9)
Make sure iomap_end is always called when iomap_begin succeeds. Without this fix, iomap_end won't be called when a filesystem's iomap_begin operation returns an invalid mapping, bypassing any unlocking done in iomap_end. With this fix, the unlocking will still happen. This bug was found by Bob Peterson during code review. It's unlikely that such iomap_begin bugs will survive to affect users, so backporting this fix seems unnecessary. Fixes: ae259a9c8593 ("fs: introduce iomap infrastructure") Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
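A simplified sketch of the control flow this fix establishes in iomap_apply() (arguments abridged; the real function also passes a srcmap and performs additional sanity checks):

    ret = ops->iomap_begin(inode, pos, length, flags, &iomap);
    if (ret)
        return ret;
    if (WARN_ON(iomap.offset > pos)) {
        written = -EIO;         /* invalid mapping: still fall through to iomap_end */
        goto out;
    }
    written = actor(inode, pos, length, data, &iomap);
out:
    if (ops->iomap_end)
        ret = ops->iomap_end(inode, pos, length,
                             written > 0 ? written : 0, flags, &iomap);
    return written ? written : ret;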
2020-06-13  Merge tag 'iomap-5.8-merge-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux  (Linus Torvalds, 1 file, -1/+1)
Pull iomap fix from Darrick Wong: "A single iomap bug fix for a variable type mistake on 32-bit architectures, fixing an integer overflow problem in the unshare actor" * tag 'iomap-5.8-merge-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: iomap: Fix unsharing of an extent >2GB on a 32-bit machine
2020-06-09  iomap: Fix unsharing of an extent >2GB on a 32-bit machine  (Matthew Wilcox (Oracle), 1 file, -1/+1)
Widen the type used for counting the number of bytes unshared. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-06-06  Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4  (Linus Torvalds, 1 file, -8/+3)
Pull ext4 updates from Ted Ts'o: "A lot of bug fixes and cleanups for ext4, including: - Fix performance problems found in dioread_nolock now that it is the default, caused by transaction leaks. - Clean up fiemap handling in ext4 - Clean up and refactor multiple block allocator (mballoc) code - Fix a problem with mballoc with smaller file systems running out of blocks because they couldn't properly use blocks that had been reserved by inode preallocation. - Fixed a race in ext4_sync_parent() versus rename() - Simplify the error handling in the extent manipulation code - Make sure all metadata I/O errors are reflected to ext4_ext_dirty()'s and ext4_make_inode_dirty()'s callers. - Avoid passing an error pointer to brelse in ext4_xattr_set() - Fix a race which could result in freeing an inode on the dirty list in data=journal mode. - Fix refcount handling if ext4_iget() fails - Fix a crash in generic/019 caused by a corrupted extent node" * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (58 commits) ext4: avoid unnecessary transaction starts during writeback ext4: don't block for O_DIRECT if IOCB_NOWAIT is set ext4: remove the access_ok() check in ext4_ioctl_get_es_cache fs: remove the access_ok() check in ioctl_fiemap fs: handle FIEMAP_FLAG_SYNC in fiemap_prep fs: move fiemap range validation into the file systems instances iomap: fix the iomap_fiemap prototype fs: move the fiemap definitions out of fs.h fs: mark __generic_block_fiemap static ext4: remove the call to fiemap_check_flags in ext4_fiemap ext4: split _ext4_fiemap ext4: fix fiemap size checks for bitmap files ext4: fix EXT4_MAX_LOGICAL_BLOCK macro add comment for ext4_dir_entry_2 file_type member jbd2: avoid leaking transaction credits when unreserving handle ext4: drop ext4_journal_free_reserved() ext4: mballoc: use lock for checking free blocks while retrying ext4: mballoc: refactor ext4_mb_good_group() ext4: mballoc: introduce pcpu seqcnt for freeing PA to improve ENOSPC handling ext4: mballoc: refactor ext4_mb_discard_preallocations() ...
2020-06-04  fs: handle FIEMAP_FLAG_SYNC in fiemap_prep  (Christoph Hellwig, 1 file, -7/+1)
By moving FIEMAP_FLAG_SYNC handling to fiemap_prep we ensure it is handled once instead of duplicated, but can still be done under fs locks, like xfs/iomap intended with its duplicate handling. Also make sure the error value of filemap_write_and_wait is propagated to user space. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Amir Goldstein <amir73il@gmail.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Link: https://lore.kernel.org/r/20200523073016.2944131-8-hch@lst.de Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-04  fs: move fiemap range validation into the file systems instances  (Christoph Hellwig, 1 file, -1/+1)
Replace fiemap_check_flags with a fiemap_prep helper that also takes the inode and mapped range, and performs the sanity check and truncation previously done in fiemap_check_range. This way the validation is inside the file system itself and thus properly works for the stacked overlayfs case as well. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Amir Goldstein <amir73il@gmail.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Link: https://lore.kernel.org/r/20200523073016.2944131-7-hch@lst.de Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-04  iomap: fix the iomap_fiemap prototype  (Christoph Hellwig, 1 file, -1/+1)
iomap_fiemap should take u64 start and len arguments, just like the ->fiemap prototype. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Link: https://lore.kernel.org/r/20200523073016.2944131-6-hch@lst.de Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-04  fs: move the fiemap definitions out of fs.h  (Christoph Hellwig, 1 file, -0/+1)
No need to pull the fiemap definitions into almost every file in the kernel build. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Link: https://lore.kernel.org/r/20200523073016.2944131-5-hch@lst.de Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-03  Merge tag 'for-5.8-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux  (Linus Torvalds, 1 file, -7/+10)
Pull btrfs updates from David Sterba: "Highlights: - speedup dead root detection during orphan cleanup, eg. when there are many deleted subvolumes waiting to be cleaned, the trees are now looked up in radix tree instead of a O(N^2) search - snapshot creation with inherited qgroup will mark the qgroup inconsistent, requires a rescan - send will emit file capabilities after chown, this produces a stream that does not need postprocessing to set the capabilities again - direct io ported to iomap infrastructure, cleaned up and simplified code, notably removing last use of struct buffer_head in btrfs code Core changes: - factor out backreference iteration, to be used by ordinary backreferences and relocation code - improved global block reserve utilization * better logic to serialize requests * increased maximum available for unlink * improved handling on large pages (64K) - direct io cleanups and fixes * simplify layering, where cloned bios were unnecessarily created for some cases * error handling fixes (submit, endio) * remove repair worker thread, used to avoid deadlocks during repair - refactored block group reading code, preparatory work for new type of block group storage that should improve mount time on large filesystems Cleanups: - cleaned up (and slightly sped up) set/get helpers for metadata data structure members - root bit REF_COWS got renamed to SHAREABLE to reflect that the blocks of the tree get shared either among subvolumes or with the relocation trees Fixes: - when subvolume deletion fails due to ENOSPC, the filesystem is not turned read-only - device scan deals with devices from other filesystems that changed ownership due to overwrite (mkfs) - fix a race between scrub and block group removal/allocation - fix long standing bug of a runaway balance operation, printing the same line to the syslog, caused by a stale status bit on a reloc tree that prevented progress - fix corrupt log due to concurrent fsync of inodes with shared extents - fix space underflow for NODATACOW and buffered writes when it for some reason needs to fall back to COW mode" * tag 'for-5.8-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (133 commits) btrfs: fix space_info bytes_may_use underflow during space cache writeout btrfs: fix space_info bytes_may_use underflow after nocow buffered write btrfs: fix wrong file range cleanup after an error filling dealloc range btrfs: remove redundant local variable in read_block_for_search btrfs: open code key_search btrfs: split btrfs_direct_IO to read and write part btrfs: remove BTRFS_INODE_READDIO_NEED_LOCK fs: remove dio_end_io() btrfs: switch to iomap_dio_rw() for dio iomap: remove lockdep_assert_held() iomap: add a filesystem hook for direct I/O bio submission fs: export generic_file_buffered_read() btrfs: turn space cache writeout failure messages into debug messages btrfs: include error on messages about failure to write space/inode caches btrfs: remove useless 'fail_unlock' label from btrfs_csum_file_blocks() btrfs: do not ignore error from btrfs_next_leaf() when inserting checksums btrfs: make checksum item extension more efficient btrfs: fix corrupt log due to concurrent fsync of inodes with shared extents btrfs: unexport btrfs_compress_set_level() btrfs: simplify iget helpers ...
2020-06-03  Merge tag 'for-5.8/block-2020-06-01' of git://git.kernel.dk/linux-block  (Linus Torvalds, 1 file, -1/+1)
Pull block updates from Jens Axboe: "Core block changes that have been queued up for this release: - Remove dead blk-throttle and blk-wbt code (Guoqing) - Include pid in blktrace note traces (Jan) - Don't spew I/O errors on wouldblock termination (me) - Zone append addition (Johannes, Keith, Damien) - IO accounting improvements (Konstantin, Christoph) - blk-mq hardware map update improvements (Ming) - Scheduler dispatch improvement (Salman) - Inline block encryption support (Satya) - Request map fixes and improvements (Weiping) - blk-iocost tweaks (Tejun) - Fix for timeout failing with error injection (Keith) - Queue re-run fixes (Douglas) - CPU hotplug improvements (Christoph) - Queue entry/exit improvements (Christoph) - Move DMA drain handling to the few drivers that use it (Christoph) - Partition handling cleanups (Christoph)" * tag 'for-5.8/block-2020-06-01' of git://git.kernel.dk/linux-block: (127 commits) block: mark bio_wouldblock_error() bio with BIO_QUIET blk-wbt: rename __wbt_update_limits to wbt_update_limits blk-wbt: remove wbt_update_limits blk-throttle: remove tg_drain_bios blk-throttle: remove blk_throtl_drain null_blk: force complete for timeout request blk-mq: drain I/O when all CPUs in a hctx are offline blk-mq: add blk_mq_all_tag_iter blk-mq: open code __blk_mq_alloc_request in blk_mq_alloc_request_hctx blk-mq: use BLK_MQ_NO_TAG in more places blk-mq: rename BLK_MQ_TAG_FAIL to BLK_MQ_NO_TAG blk-mq: move more request initialization to blk_mq_rq_ctx_init blk-mq: simplify the blk_mq_get_request calling convention blk-mq: remove the bio argument to ->prepare_request nvme: force complete cancelled requests blk-mq: blk-mq: provide forced completion method block: fix a warning when blkdev.h is included for !CONFIG_BLOCK builds block: blk-crypto-fallback: remove redundant initialization of variable err block: reduce part_stat_lock() scope block: use __this_cpu_add() instead of access by smp_processor_id() ...
2020-06-02  iomap: use attach/detach_page_private  (Guoqing Jiang, 1 file, -15/+4)
Since the new pair of functions has been introduced, we can call them to clean up the code in iomap. Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dave Chinner <david@fromorbit.com> Link: http://lkml.kernel.org/r/20200517214718.468-7-guoqing.jiang@cloud.ionos.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
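Roughly, the cleanup replaces the open-coded page-private handling with the new helper pair (sketch; 'iop' stands in for the per-page iomap structure):

    /* Before: open-coded attach/detach of the per-page structure. */
    get_page(page);
    set_page_private(page, (unsigned long)iop);
    SetPagePrivate(page);
    /* ... use the page ... */
    ClearPagePrivate(page);
    set_page_private(page, 0);
    put_page(page);

    /* After: the new helpers encapsulate the refcount and flag handling. */
    attach_page_private(page, iop);
    /* ... use the page ... */
    iop = detach_page_private(page);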
2020-06-02  iomap: convert from readpages to readahead  (Matthew Wilcox (Oracle), 2 files, -60/+32)
Use the new readahead operation in iomap. Convert XFS and ZoneFS to use it. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: William Kucharski <william.kucharski@oracle.com> Cc: Chao Yu <yuchao0@huawei.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: Dave Chinner <dchinner@redhat.com> Cc: Eric Biggers <ebiggers@google.com> Cc: Gao Xiang <gaoxiang25@huawei.com> Cc: Jaegeuk Kim <jaegeuk@kernel.org> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Zi Yan <ziy@nvidia.com> Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com> Cc: Miklos Szeredi <mszeredi@redhat.com> Link: http://lkml.kernel.org/r/20200414150233.24495-26-willy@infradead.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02  fs: convert mpage_readpages to mpage_readahead  (Matthew Wilcox (Oracle), 1 file, -1/+1)
Implement the new readahead aop and convert all callers (block_dev, exfat, ext2, fat, gfs2, hpfs, isofs, jfs, nilfs2, ocfs2, omfs, qnx6, reiserfs & udf). The callers are all trivial except for GFS2 & OCFS2. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com> # ocfs2 Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> # ocfs2 Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: William Kucharski <william.kucharski@oracle.com> Cc: Chao Yu <yuchao0@huawei.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: Darrick J. Wong <darrick.wong@oracle.com> Cc: Eric Biggers <ebiggers@google.com> Cc: Gao Xiang <gaoxiang25@huawei.com> Cc: Jaegeuk Kim <jaegeuk@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Zi Yan <ziy@nvidia.com> Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com> Cc: Miklos Szeredi <mszeredi@redhat.com> Link: http://lkml.kernel.org/r/20200414150233.24495-17-willy@infradead.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-05-25  iomap: remove lockdep_assert_held()  (Goldwyn Rodrigues, 1 file, -2/+0)
Filesystems such as btrfs can perform direct I/O without holding the inode->i_rwsem in some of the cases like writing within i_size. So, remove the check for lockdep_assert_held() in iomap_dio_rw(). Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-25  iomap: add a filesystem hook for direct I/O bio submission  (Goldwyn Rodrigues, 1 file, -5/+10)
This helps filesystems to perform tasks on the bio while submitting for I/O. This could be post-write operations such as data CRC or data replication for fs-handled RAID. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
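A sketch of how a filesystem might wire up such a hook (the member name and signature are approximated from the description; the myfs_* names are hypothetical, and the exact prototype should be checked in include/linux/iomap.h of that kernel):

    static blk_qc_t myfs_dio_submit_io(struct inode *inode, struct iomap *iomap,
                                       struct bio *bio, loff_t file_offset)
    {
        /* e.g. compute data checksums or queue replication before issuing */
        myfs_csum_bio(inode, bio, file_offset);
        return submit_bio(bio);
    }

    static const struct iomap_dio_ops myfs_dio_ops = {
        .submit_io = myfs_dio_submit_io,
    };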
2020-05-13  block: add blk_io_schedule() for avoiding task hung in sync dio  (Ming Lei, 1 file, -1/+1)
Sync dio could be big, or may take a long time in discard or in case of IO failure. We have already prevented task-hung warnings in submit_bio_wait() and blk_execute_rq(), so apply the same trick to prevent them from happening in sync dio. Add the helper blk_io_schedule() and use io_schedule_timeout() to prevent the task-hung warning. Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Cc: Salman Qazi <sqazi@google.com> Cc: Jesse Barnes <jsbarnes@google.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Bart Van Assche <bvanassche@acm.org> Cc: Hannes Reinecke <hare@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
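The helper is essentially a wrapper that bounds each sleep below the hung-task watchdog period (sketch of the idea, not necessarily the exact definition):

    static inline void blk_io_schedule(void)
    {
        /* Sleep in chunks shorter than the hung-task watchdog period so
         * long-running sync dio does not trigger a false task-hung warning. */
        unsigned long timeout = sysctl_hung_task_timeout_secs * HZ / 2;

        if (timeout)
            io_schedule_timeout(timeout);
        else
            io_schedule();
    }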
2020-04-30  fibmap: Warn and return an error in case of block > INT_MAX  (Ritesh Harjani, 1 file, -4/+1)
We had better warn the fibmap user and not return a truncated, and therefore incorrect, block map address if the block address returned by bmap() is greater than INT_MAX (since the user supplied an integer pointer). It's better to pr_warn() for all users of ioctl_fibmap() and return a proper error code rather than silently letting an FS corruption happen if the user tries to fiddle around with the returned block map address. We fix this by returning an error code of -ERANGE and returning 0 as the block mapping address in case it is > INT_MAX. Now iomap_bmap() can be called from either of two paths: when a user calls the ioctl_fibmap() interface to get the block mapping address, or by some filesystem via the bmap() internal kernel API. The bmap() kernel API is well equipped to handle u64 addresses. The WARN condition in iomap_bmap_actor() was mainly added to warn all fibmap users. But now that we have added this warning directly for all fibmap users and also made sure to return 0 as the block map address in case addr > INT_MAX, we can remove that logic from iomap_bmap_actor(). Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
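A condensed sketch of the behaviour described above for ioctl_fibmap() (variable names and the warning text are illustrative, not copied from the patch):

    error = bmap(inode, &block);        /* block is a 64-bit sector_t */
    if (error)
        ur_block = 0;
    else if (block > INT_MAX) {
        /* The user handed us an int pointer; refuse to truncate silently. */
        error = -ERANGE;
        ur_block = 0;
        pr_warn_ratelimited("fibmap: block number exceeds INT_MAX, use FIEMAP instead\n");
    } else {
        ur_block = block;
    }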
2020-04-09  Merge tag 'iomap-5.7-merge-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux  (Linus Torvalds, 1 file, -0/+8)
Pull iomap fix from Darrick Wong: "Fix a problem in readahead where we can crash if we can't allocate a full bio due to GFP_NORETRY" * tag 'iomap-5.7-merge-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: iomap: Handle memory allocation failure in readahead
2020-04-09  Merge tag 'libnvdimm-for-5.7' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm  (Linus Torvalds, 1 file, -8/+1)
Pull libnvdimm and dax updates from Dan Williams: "There were multiple touches outside of drivers/nvdimm/ this round to add cross arch compatibility to the devm_memremap_pages() interface, enhance numa information for persistent memory ranges, and add a zero_page_range() dax operation. This cycle I switched from the patchwork api to Konstantin's b4 script for collecting tags (from x86, PowerPC, filesystem, and device-mapper folks), and everything looks to have gone ok there. This has all appeared in -next with no reported issues. Summary: - Add support for region alignment configuration and enforcement to fix compatibility across architectures and PowerPC page size configurations. - Introduce 'zero_page_range' as a dax operation. This facilitates filesystem-dax operation without a block-device. - Introduce phys_to_target_node() to facilitate drivers that want to know the resulting numa node if a given reserved address range was onlined. - Advertise a persistence-domain for of_pmem and papr_scm. The persistence domain indicates where cpu-store cycles need to reach in the platform-memory subsystem before the platform will consider them power-fail protected. - Promote numa_map_to_online_node() to a cross-kernel generic facility. - Save x86 numa information to allow for node-id lookups for reserved memory ranges, deploy that capability for the e820-pmem driver. - Pick up some miscellaneous minor fixes that missed v5.6-final, including some smatch reports in the ioctl path and some unit test compilation fixups. - Fixup some flexible-array declarations" * tag 'libnvdimm-for-5.7' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (29 commits) dax: Move mandatory ->zero_page_range() check in alloc_dax() dax,iomap: Add helper dax_iomap_zero() to zero a range dax: Use new dax zero page method for zeroing a page dm,dax: Add dax zero_page_range operation s390,dcssblk,dax: Add dax zero_page_range operation to dcssblk driver dax, pmem: Add a dax operation zero_page_range pmem: Add functions for reading/writing page to/from pmem libnvdimm: Update persistence domain value for of_pmem and papr_scm device tools/test/nvdimm: Fix out of tree build libnvdimm/region: Fix build error libnvdimm/region: Replace zero-length array with flexible-array member libnvdimm/label: Replace zero-length array with flexible-array member ACPI: NFIT: Replace zero-length array with flexible-array member libnvdimm/region: Introduce an 'align' attribute libnvdimm/region: Introduce NDD_LABELING libnvdimm/namespace: Enforce memremap_compat_align() libnvdimm/pfn: Prevent raw mode fallback if pfn-infoblock valid libnvdimm: Out of bounds read in __nd_ioctl() acpi/nfit: improve bounds checking for 'func' mm/memremap_pages: Introduce memremap_compat_align() ...
2020-04-03  dax,iomap: Add helper dax_iomap_zero() to zero a range  (Vivek Goyal, 1 file, -8/+1)
Add a helper dax_iomap_zero() to zero a range. This patch basically merges __dax_zero_page_range() and iomap_dax_zero(). Suggested-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20200228163456.1587-7-vgoyal@redhat.com Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2020-04-02  iomap: Handle memory allocation failure in readahead  (Matthew Wilcox (Oracle), 1 file, -0/+8)
bio_alloc() can fail when we use GFP_NORETRY. If it does, allocate a bio large enough for a single page like mpage_readpages() does. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
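Sketch of the fallback (simplified from the description; in readahead the first attempt uses a gfp mask carrying __GFP_NORETRY | __GFP_NOWARN, and orig_gfp here stands for the unmodified mask):

    ctx->bio = bio_alloc(gfp, min(nr_vecs, BIO_MAX_PAGES));
    if (!ctx->bio) {
        /* Allocation may fail with GFP_NORETRY; fall back to a bio that
         * holds a single page, as mpage_readpages() does. */
        ctx->bio = bio_alloc(orig_gfp, 1);
    }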
2020-03-18  iomap: fix comments in iomap_dio_rw  (yangerkun, 1 file, -2/+2)
The word 'three' appears twice in the comments of iomap_dio_rw; drop the duplicate. Signed-off-by: yangerkun <yangerkun@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-03-05  iomap: Remove pgoff from tracepoints  (Matthew Wilcox (Oracle), 2 files, -19/+15)
The 'pgoff' displayed by the tracepoints wasn't a pgoff at all; it was a byte offset from the start of the file. We already emit that in the form of the 'offset', so we can just remove pgoff. That means we can remove 'page' as an argument to the tracepoint, and rename this type of tracepoint from being a page class to being a range class. Fixes: 0b1b213fcf3a ("xfs: event tracing support") Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2020-01-06  fs: Fix page_mkwrite off-by-one errors  (Andreas Gruenbacher, 1 file, -13/+5)
The check in block_page_mkwrite that is meant to determine whether an offset is within the inode size is off by one. This bug has been copied into iomap_page_mkwrite and several filesystems (ubifs, ext4, f2fs, ceph). Fix that by introducing a new page_mkwrite_check_truncate helper that checks for truncate and computes the bytes in the page up to EOF. Use the helper in iomap. NOTE from Darrick: The original patch fixed a number of filesystems, but then there were merge conflicts with the f2fs for-next tree; a subsequent re-submission of the patch had different btrfs changes with no explanation; and Christoph complained that each per-fs fix should be a separate patch. In my view that's too much risk to take on, so I decided to drop all the hunks except for iomap, since I've actually QA'd XFS. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> [darrick: drop everything but the iomap parts] Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
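The new helper looks roughly like this (sketch reconstructed from the description; the real definition lives in include/linux/pagemap.h and may differ in detail):

    static inline int page_mkwrite_check_truncate(struct page *page,
                                                  struct inode *inode)
    {
        loff_t size = i_size_read(inode);
        pgoff_t index = size >> PAGE_SHIFT;
        int offset = offset_in_page(size);

        if (page->mapping != inode->i_mapping)
            return -EFAULT;        /* page was truncated away */
        if (page->index < index)
            return PAGE_SIZE;      /* page is wholly inside EOF */
        if (page->index > index || !offset)
            return -EFAULT;        /* page is wholly past EOF */
        return offset;             /* bytes of the page inside EOF */
    }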
2019-12-05  iomap: stop using ioend after it's been freed in iomap_finish_ioend()  (Zorro Lang, 1 file, -2/+3)
This patch fixes the following KASAN report. The @ioend has been freed by dio_put(), but the iomap_finish_ioend() still trys to access its data. [20563.631624] BUG: KASAN: use-after-free in iomap_finish_ioend+0x58c/0x5c0 [20563.638319] Read of size 8 at addr fffffc0c54a36928 by task kworker/123:2/22184 [20563.647107] CPU: 123 PID: 22184 Comm: kworker/123:2 Not tainted 5.4.0+ #1 [20563.653887] Hardware name: HPE Apollo 70 /C01_APACHE_MB , BIOS L50_5.13_1.11 06/18/2019 [20563.664499] Workqueue: xfs-conv/sda5 xfs_end_io [xfs] [20563.669547] Call trace: [20563.671993] dump_backtrace+0x0/0x370 [20563.675648] show_stack+0x1c/0x28 [20563.678958] dump_stack+0x138/0x1b0 [20563.682455] print_address_description.isra.9+0x60/0x378 [20563.687759] __kasan_report+0x1a4/0x2a8 [20563.691587] kasan_report+0xc/0x18 [20563.694985] __asan_report_load8_noabort+0x18/0x20 [20563.699769] iomap_finish_ioend+0x58c/0x5c0 [20563.703944] iomap_finish_ioends+0x110/0x270 [20563.708396] xfs_end_ioend+0x168/0x598 [xfs] [20563.712823] xfs_end_io+0x1e0/0x2d0 [xfs] [20563.716834] process_one_work+0x7f0/0x1ac8 [20563.720922] worker_thread+0x334/0xae0 [20563.724664] kthread+0x2c4/0x348 [20563.727889] ret_from_fork+0x10/0x18 [20563.732941] Allocated by task 83403: [20563.736512] save_stack+0x24/0xb0 [20563.739820] __kasan_kmalloc.isra.9+0xc4/0xe0 [20563.744169] kasan_slab_alloc+0x14/0x20 [20563.747998] slab_post_alloc_hook+0x50/0xa8 [20563.752173] kmem_cache_alloc+0x154/0x330 [20563.756185] mempool_alloc_slab+0x20/0x28 [20563.760186] mempool_alloc+0xf4/0x2a8 [20563.763845] bio_alloc_bioset+0x2d0/0x448 [20563.767849] iomap_writepage_map+0x4b8/0x1740 [20563.772198] iomap_do_writepage+0x200/0x8d0 [20563.776380] write_cache_pages+0x8a4/0xed8 [20563.780469] iomap_writepages+0x4c/0xb0 [20563.784463] xfs_vm_writepages+0xf8/0x148 [xfs] [20563.788989] do_writepages+0xc8/0x218 [20563.792658] __writeback_single_inode+0x168/0x18f8 [20563.797441] writeback_sb_inodes+0x370/0xd30 [20563.801703] wb_writeback+0x2d4/0x1270 [20563.805446] wb_workfn+0x344/0x1178 [20563.808928] process_one_work+0x7f0/0x1ac8 [20563.813016] worker_thread+0x334/0xae0 [20563.816757] kthread+0x2c4/0x348 [20563.819979] ret_from_fork+0x10/0x18 [20563.825028] Freed by task 22184: [20563.828251] save_stack+0x24/0xb0 [20563.831559] __kasan_slab_free+0x10c/0x180 [20563.835648] kasan_slab_free+0x10/0x18 [20563.839389] slab_free_freelist_hook+0xb4/0x1c0 [20563.843912] kmem_cache_free+0x8c/0x3e8 [20563.847745] mempool_free_slab+0x20/0x28 [20563.851660] mempool_free+0xd4/0x2f8 [20563.855231] bio_free+0x33c/0x518 [20563.858537] bio_put+0xb8/0x100 [20563.861672] iomap_finish_ioend+0x168/0x5c0 [20563.865847] iomap_finish_ioends+0x110/0x270 [20563.870328] xfs_end_ioend+0x168/0x598 [xfs] [20563.874751] xfs_end_io+0x1e0/0x2d0 [xfs] [20563.878755] process_one_work+0x7f0/0x1ac8 [20563.882844] worker_thread+0x334/0xae0 [20563.886584] kthread+0x2c4/0x348 [20563.889804] ret_from_fork+0x10/0x18 [20563.894855] The buggy address belongs to the object at fffffc0c54a36900 which belongs to the cache bio-1 of size 248 [20563.906844] The buggy address is located 40 bytes inside of 248-byte region [fffffc0c54a36900, fffffc0c54a369f8) [20563.918485] The buggy address belongs to the page: [20563.923269] page:ffffffff82f528c0 refcount:1 mapcount:0 mapping:fffffc8e4ba31900 index:0xfffffc0c54a33300 [20563.932832] raw: 17ffff8000000200 ffffffffa3060100 0000000700000007 fffffc8e4ba31900 [20563.940567] raw: fffffc0c54a33300 0000000080aa0042 00000001ffffffff 0000000000000000 [20563.948300] page dumped 
because: kasan: bad access detected [20563.955345] Memory state around the buggy address: [20563.960129] fffffc0c54a36800: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fc [20563.967342] fffffc0c54a36880: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc [20563.974554] >fffffc0c54a36900: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [20563.981766] ^ [20563.986288] fffffc0c54a36980: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fc [20563.993501] fffffc0c54a36a00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc [20564.000713] ================================================================== Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=205703 Signed-off-by: Zorro Lang <zlang@redhat.com> Fixes: 9cd0ed63ca514 ("iomap: enhance writeback error message") Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2019-12-04  iomap: fix sub-page uptodate handling  (Christoph Hellwig, 1 file, -10/+25)
bio completions can race when a page spans more than one file system block. Add a spinlock to synchronize marking the page uptodate. Fixes: 9dc55f1389f9 ("iomap: add support for sub-pagesize buffered I/O without buffer heads") Reported-by: Jan Stancek <jstancek@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
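Sketch of the synchronization added (simplified; the real code also handles the no-iop case and other page flags):

    static void iomap_iop_set_range_uptodate(struct page *page,
                                             unsigned int first, unsigned int count)
    {
        struct inode *inode = page->mapping->host;
        struct iomap_page *iop = to_iomap_page(page);
        unsigned int blocks_per_page = PAGE_SIZE >> inode->i_blkbits;
        unsigned long flags;

        /* Bios completing for different blocks of the same page can race here;
         * the new spinlock makes the bitmap update and the fullness check atomic. */
        spin_lock_irqsave(&iop->uptodate_lock, flags);
        bitmap_set(iop->uptodate, first, count);
        if (bitmap_full(iop->uptodate, blocks_per_page))
            SetPageUptodate(page);
        spin_unlock_irqrestore(&iop->uptodate_lock, flags);
    }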
2019-11-26  iomap: remove unneeded variable in iomap_dio_rw()  (Johannes Thumshirn, 1 file, -4/+4)
The 'start' variable indicates the start of the mapping and is set to the iocb's position, which we have already cached as 'pos', upon function entry. 'pos' is used as a cursor indicating the current position and is updated later in iomap_dio_rw(), but not before the last use of 'start'. Remove 'start' as it is a synonym for 'pos' before we enter the loop calling iomap_apply(). Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-11-26  iomap: Do not create fake iter in iomap_dio_bio_actor()  (Jan Kara, 1 file, -13/+18)
iomap_dio_bio_actor() copies iter to a local variable and then limits it to a file extent we have mapped. When IO is submitted, iomap_dio_bio_actor() advances the original iter while the copied iter is advanced inside bio_iov_iter_get_pages(). This logic is non-obvious especially because both iters still point to same shared structures (such as pipe info) so if iov_iter_advance() changes anything in the shared structure, this scheme breaks. Let's just truncate and reexpand the original iter as needed instead of playing games with copying iters and keeping them in sync. Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
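Conceptually the change looks like this (sketch; names follow the description above and the bookkeeping is abridged):

    size_t orig_count = iov_iter_count(dio->submit.iter);

    /* Limit the one shared iter to the extent we just mapped ... */
    iov_iter_truncate(dio->submit.iter, length);

    /* ... submit bios, advancing dio->submit.iter directly ... */

    /* ... and afterwards restore whatever is still left to transfer. */
    iov_iter_reexpand(dio->submit.iter, orig_count - copied);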
2019-11-22  iomap: Fix pipe page leakage during splicing  (Jan Kara, 1 file, -1/+8)
When splicing using iomap_dio_rw() to a pipe, we may leak pipe pages because bio_iov_iter_get_pages() records that the pipe will receive a full extent's worth of data; however, if the file size is not block size aligned, iomap_dio_rw() returns less than what bio_iov_iter_get_pages() set up, the splice code gets confused, and a pipe page holding the file tail is leaked. Handle the situation similarly to the old direct IO implementation and revert the iter to the actually returned read amount, which makes the iter consistent with the value returned from iomap_dio_rw() and thus keeps the splice code happy. Fixes: ff6a9292e6f6 ("iomap: implement direct I/O") CC: stable@vger.kernel.org Reported-by: syzbot+991400e8eba7e00a26e1@syzkaller.appspotmail.com Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
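The handling amounts to reverting the iter by the amount that was set up but not actually read (sketch; the condition is simplified and the variable names follow the surrounding description):

    /* A short direct read (e.g. an unaligned EOF) left the pipe iter advanced
     * past what was really filled; give those pages back to the iter. */
    if (iov_iter_rw(iter) == READ && pos >= dio->i_size)
        iov_iter_revert(iter, pos - dio->i_size);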
2019-11-22  iomap: trace iomap_apply results  (Darrick J. Wong, 2 files, -0/+110)
Add some tracepoints so that we can more easily debug what the filesystem is returning from ->iomap_begin. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2019-11-11  iomap: fix return value of iomap_dio_bio_actor on 32bit systems  (Jan Stancek, 1 file, -1/+3)
Naresh reported LTP diotest4 failing for 32bit x86 and arm -next kernels on ext4. Same problem exists in 5.4-rc7 on xfs. The failure comes down to: openat(AT_FDCWD, "testdata-4.5918", O_RDWR|O_DIRECT) = 4 mmap2(NULL, 4096, PROT_READ, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7f7b000 read(4, 0xb7f7b000, 4096) = 0 // expects -EFAULT Problem is conversion at iomap_dio_bio_actor() return. Ternary operator has a return type and an attempt is made to convert each of operands to the type of the other. In this case "ret" (int) is converted to type of "copied" (unsigned long). Both have size of 4 bytes: size_t copied = 0; int ret = -14; long long actor_ret = copied ? copied : ret; On x86_64: actor_ret == -14; On x86 : actor_ret == 4294967282 Replace ternary operator with 2 return statements to avoid this unwanted conversion. Fixes: 4721a6010990 ("iomap: dio data corruption and spurious errors when pipes fill") Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org> Signed-off-by: Jan Stancek <jstancek@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
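The fix avoids balancing the two operands against each other by returning each value directly (sketch):

    /* Before: the operands are converted to a common (unsigned) type, so on
     * 32-bit the negative errno in 'ret' comes back as a huge positive value. */
    return copied ? copied : ret;

    /* After: no conversion between the two operands takes place. */
    if (copied)
        return copied;
    return ret;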
2019-11-08  iomap: iomap_bmap should check iomap_apply return value  (Darrick J. Wong, 1 file, -1/+5)
Coverity caught this fairly minor bug, but we should check the return value of iomap_apply regardless. Coverity-id: 1437065 Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2019-11-07  iomap: Fix overflow in iomap_page_mkwrite  (Andreas Gruenbacher, 1 file, -4/+3)
On architectures where loff_t is wider than pgoff_t, the expression ((page->index + 1) << PAGE_SHIFT) can overflow. Rewrite to use the page offset, which we already compute here anyway. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
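Illustration of the overflow and the fix (on such architectures pgoff_t is an unsigned long, narrower than loff_t; variable names are illustrative):

    loff_t end_before = (page->index + 1) << PAGE_SHIFT;   /* shift in pgoff_t width: can wrap */
    loff_t end_after  = page_offset(page) + PAGE_SIZE;     /* page_offset() widens to loff_t first */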
2019-10-29  fs/iomap: remove redundant check in iomap_dio_rw()  (Joseph Qi, 1 file, -1/+1)
We've already checked whether this is a READ iov_iter; there is no need to check again. Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Jens Axboe <axboe@kernel.dk> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-10-21  iomap: use a srcmap for a read-modify-write I/O  (Goldwyn Rodrigues, 6 files, -42/+61)
The srcmap is used to identify where the read is to be performed from. It is passed to ->iomap_begin, which can fill it in if we need to read data for partially written blocks from a different location than the write target. The srcmap is only supported for buffered writes so far. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> [hch: merged two patches, removed the IOMAP_F_COW flag, use iomap as srcmap if not set, adjust length down to srcmap end as well] Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Acked-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
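Sketch of the resulting ->iomap_begin calling convention on the filesystem side (the myfs_* helpers are hypothetical and the body is illustrative only):

    static int myfs_iomap_begin(struct inode *inode, loff_t pos, loff_t length,
                                unsigned flags, struct iomap *iomap,
                                struct iomap *srcmap)
    {
        /* iomap describes where the write will go ... */
        myfs_map_write_target(inode, pos, length, iomap);

        /* ... and, for copy-on-write, srcmap describes where partially written
         * blocks must be read from.  If srcmap is left untouched, the core
         * reads from iomap itself. */
        if (myfs_needs_cow(inode, pos, length))
            myfs_map_read_source(inode, pos, length, srcmap);
        return 0;
    }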
2019-10-21  iomap: use write_begin to read pages to unshare  (Christoph Hellwig, 1 file, -33/+16)
Use the existing iomap write_begin code to read the pages unshared by iomap_file_unshare. That avoids the extra ->readpage call and extent tree lookup currently done by read_mapping_page. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-10-21  iomap: move the zeroing case out of iomap_read_page_sync  (Christoph Hellwig, 1 file, -17/+16)
That keeps the function a little easier to understand, and easier to modify for pending enhancements. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-10-21  iomap: ignore non-shared or non-data blocks in xfs_file_dirty  (Christoph Hellwig, 1 file, -4/+11)
xfs_file_dirty is used to unshare reflink blocks. Rename the function to xfs_file_unshare to better document that purpose, and skip iomaps that are not shared and don't need zeroing. This will allow us to simplify the caller. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-10-21  iomap: always use AOP_FLAG_NOFS in iomap_write_begin  (Christoph Hellwig, 1 file, -9/+5)
All callers pass AOP_FLAG_NOFS, so lift that flag to iomap_write_begin to allow reusing the flags argument for an internal flags namespace soon. Also remove the local index variable that is only used once. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-10-21  iomap: remove the unused iomap argument to __iomap_write_end  (Christoph Hellwig, 1 file, -2/+2)
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Allison Collins <allison.henderson@oracle.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-10-21  iomap: enhance writeback error message  (Darrick J. Wong, 1 file, -2/+3)
If we encounter an IO error during writeback, log the inode, offset, and sector number of the failure, instead of forcing the user to do some sort of reverse mapping to figure out which file is affected. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2019-10-21  iomap: pass a struct page to iomap_finish_page_writeback  (Christoph Hellwig, 1 file, -5/+5)
No need to pass the full bio_vec. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-10-21  iomap: cleanup iomap_ioend_compare  (Christoph Hellwig, 1 file, -4/+3)
Move the initialization of ia and ib to the declaration line and remove a superfluous else. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-10-21  iomap: move struct iomap_page out of iomap.h  (Christoph Hellwig, 1 file, -0/+17)
Now that all the writepage code is in the iomap code there is no need to keep this structure public. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-10-21  iomap: warn on inline maps in iomap_writepage_map  (Christoph Hellwig, 1 file, -0/+2)
An inline mapping should never mark the page dirty and thus never end up in writepages. Add a check for that condition and warn if it happens. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
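The check is a short guard inside the block-mapping loop of iomap_writepage_map, along these lines (sketch; whether the loop skips or aborts may differ in the real patch):

    error = wpc->ops->map_blocks(wpc, inode, file_offset);
    if (error)
        break;
    if (WARN_ON_ONCE(wpc->iomap.type == IOMAP_INLINE))
        continue;    /* an inline mapping must never reach writeback */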
2019-10-21  iomap: lift the xfs writeback code to iomap  (Christoph Hellwig, 2 files, -1/+551)
Take the xfs writeback code and move it to fs/iomap. A new structure with three methods is added as the abstraction from the generic writeback code to the file system. These methods are used to map blocks, submit an ioend, and cancel a page that encountered an error before it was added to an ioend. Signed-off-by: Christoph Hellwig <hch@lst.de> [darrick: rename ->submit_ioend to ->prepare_ioend to clarify what it does] Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2019-10-21  iomap: lift common tracing code from xfs to iomap  (Christoph Hellwig, 4 files, -7/+117)
Lift the xfs code for tracing address space operations to the iomap layer. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>