summaryrefslogtreecommitdiff
path: root/fs/ext4
AgeCommit message (Collapse)AuthorFilesLines
2022-10-11Merge tag 'pull-tmpfile' of ↵Linus Torvalds1-3/+3
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs tmpfile updates from Al Viro: "Miklos' ->tmpfile() signature change; pass an unopened struct file to it, let it open the damn thing. Allows to add tmpfile support to FUSE" * tag 'pull-tmpfile' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: fuse: implement ->tmpfile() vfs: open inside ->tmpfile() vfs: move open right after ->tmpfile() vfs: make vfs_tmpfile() static ovl: use vfs_tmpfile_open() helper cachefiles: use vfs_tmpfile_open() helper cachefiles: only pass inode to *mark_inode_inuse() helpers cachefiles: tmpfile error handling cleanup hugetlbfs: cleanup mknod and tmpfile vfs: add vfs_tmpfile_open() helper
2022-10-07Merge tag 'ext4_for_linus' of ↵Linus Torvalds15-743/+903
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 updates from Ted Ts'o: "The first two changes involve files outside of fs/ext4: - submit_bh() can never return an error, so change it to return void, and remove the unused checks from its callers - fix I_DIRTY_TIME handling so it will be set even if the inode already has I_DIRTY_INODE Performance: - Always enable i_version counter (as btrfs and xfs already do). Remove some uneeded i_version bumps to avoid unnecessary nfs cache invalidations - Wake up journal waiters in FIFO order, to avoid some journal users from not getting a journal handle for an unfairly long time - In ext4_write_begin() allocate any necessary buffer heads before starting the journal handle - Don't try to prefetch the block allocation bitmaps for a read-only file system Bug Fixes: - Fix a number of fast commit bugs, including resources leaks and out of bound references in various error handling paths and/or if the fast commit log is corrupted - Avoid stopping the online resize early when expanding a file system which is less than 16TiB to a size greater than 16TiB - Fix apparent metadata corruption caused by a race with a metadata buffer head getting migrated while it was trying to be read - Mark the lazy initialization thread freezable to prevent suspend failures - Other miscellaneous bug fixes Cleanups: - Break up the incredibly long ext4_full_super() function by refactoring to move code into more understandable, smaller functions - Remove the deprecated (and ignored) noacl and nouser_attr mount option - Factor out some common code in fast commit handling - Other miscellaneous cleanups" * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (53 commits) ext4: fix potential out of bound read in ext4_fc_replay_scan() ext4: factor out ext4_fc_get_tl() ext4: introduce EXT4_FC_TAG_BASE_LEN helper ext4: factor out ext4_free_ext_path() ext4: remove unnecessary drop path references in mext_check_coverage() ext4: update 'state->fc_regions_size' after successful memory allocation ext4: fix potential memory leak in ext4_fc_record_regions() ext4: fix potential memory leak in ext4_fc_record_modified_inode() ext4: remove redundant checking in ext4_ioctl_checkpoint jbd2: add miss release buffer head in fc_do_one_pass() ext4: move DIOREAD_NOLOCK setting to ext4_set_def_opts() ext4: remove useless local variable 'blocksize' ext4: unify the ext4 super block loading operation ext4: factor out ext4_journal_data_mode_check() ext4: factor out ext4_load_and_init_journal() ext4: factor out ext4_group_desc_init() and ext4_group_desc_free() ext4: factor out ext4_geometry_check() ext4: factor out ext4_check_feature_compatibility() ext4: factor out ext4_init_metadata_csum() ext4: factor out ext4_encoding_init() ...
2022-10-04Merge tag 'statx-dioalign-for-linus' of ↵Linus Torvalds3-11/+64
git://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux Pull STATX_DIOALIGN support from Eric Biggers: "Make statx() support reporting direct I/O (DIO) alignment information. This provides a generic interface for userspace programs to determine whether a file supports DIO, and if so with what alignment restrictions. Specifically, STATX_DIOALIGN works on block devices, and on regular files when their containing filesystem has implemented support. An interface like this has been requested for years, since the conditions for when DIO is supported in Linux have gotten increasingly complex over time. Today, DIO support and alignment requirements can be affected by various filesystem features such as multi-device support, data journalling, inline data, encryption, verity, compression, checkpoint disabling, log-structured mode, etc. Further complicating things, Linux v6.0 relaxed the traditional rule of DIO needing to be aligned to the block device's logical block size; now user buffers (but not file offsets) only need to be aligned to the DMA alignment. The approach of uplifting the XFS specific ioctl XFS_IOC_DIOINFO was discarded in favor of creating a clean new interface with statx(). For more information, see the individual commits and the man page update[1]" Link: https://lore.kernel.org/r/20220722074229.148925-1-ebiggers@kernel.org [1] * tag 'statx-dioalign-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux: xfs: support STATX_DIOALIGN f2fs: support STATX_DIOALIGN f2fs: simplify f2fs_force_buffered_io() f2fs: move f2fs_force_buffered_io() into file.c ext4: support STATX_DIOALIGN fscrypt: change fscrypt_dio_supported() to prepare for STATX_DIOALIGN vfs: support STATX_DIOALIGN on block devices statx: add direct I/O alignment information
2022-10-04Merge tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/fscryptLinus Torvalds1-4/+6
Pull fscrypt updates from Eric Biggers: "This release contains some implementation changes, but no new features: - Rework the implementation of the fscrypt filesystem-level keyring to not be as tightly coupled to the keyrings subsystem. This resolves several issues. - Eliminate most direct uses of struct request_queue from fs/crypto/, since struct request_queue is considered to be a block layer implementation detail. - Stop using the PG_error flag to track decryption failures. This is a prerequisite for freeing up PG_error for other uses" * tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/fscrypt: fscrypt: work on block_devices instead of request_queues fscrypt: stop holding extra request_queue references fscrypt: stop using keyrings subsystem for fscrypt_master_key fscrypt: stop using PG_error to track error status fscrypt: remove fscrypt_set_test_dummy_encryption()
2022-10-01ext4: fix potential out of bound read in ext4_fc_replay_scan()Ye Bin1-2/+36
For scan loop must ensure that at least EXT4_FC_TAG_BASE_LEN space. If remain space less than EXT4_FC_TAG_BASE_LEN which will lead to out of bound read when mounting corrupt file system image. ADD_RANGE/HEAD/TAIL is needed to add extra check when do journal scan, as this three tags will read data during scan, tag length couldn't less than data length which will read. Cc: stable@kernel.org Signed-off-by: Ye Bin <yebin10@huawei.com> Link: https://lore.kernel.org/r/20220924075233.2315259-4-yebin10@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-10-01ext4: factor out ext4_fc_get_tl()Ye Bin1-21/+25
Factor out ext4_fc_get_tl() to fill 'tl' with host byte order. Signed-off-by: Ye Bin <yebin10@huawei.com> Link: https://lore.kernel.org/r/20220924075233.2315259-3-yebin10@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-10-01ext4: introduce EXT4_FC_TAG_BASE_LEN helperYe Bin2-26/+31
Introduce EXT4_FC_TAG_BASE_LEN helper for calculate length of struct ext4_fc_tl. Signed-off-by: Ye Bin <yebin10@huawei.com> Link: https://lore.kernel.org/r/20220924075233.2315259-2-yebin10@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-10-01ext4: factor out ext4_free_ext_path()Ye Bin7-82/+54
Factor out ext4_free_ext_path() to free extent path. As after previous patch 'ext4_ext_drop_refs()' is only used in 'extents.c', so make it static. Signed-off-by: Ye Bin <yebin10@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20220924021211.3831551-3-yebin10@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-10-01ext4: remove unnecessary drop path references in mext_check_coverage()Ye Bin1-1/+0
According to Jan Kara's suggestion: "The use in mext_check_coverage() can be actually removed - get_ext_path() -> ext4_find_extent() takes care of dropping the references." So remove unnecessary call ext4_ext_drop_refs() in mext_check_coverage(). Signed-off-by: Ye Bin <yebin10@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20220924021211.3831551-2-yebin10@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-10-01ext4: update 'state->fc_regions_size' after successful memory allocationYe Bin1-4/+5
To avoid to 'state->fc_regions_size' mismatch with 'state->fc_regions' when fail to reallocate 'fc_reqions',only update 'state->fc_regions_size' after 'state->fc_regions' is allocated successfully. Cc: stable@kernel.org Signed-off-by: Ye Bin <yebin10@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20220921064040.3693255-4-yebin10@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-10-01ext4: fix potential memory leak in ext4_fc_record_regions()Ye Bin1-6/+8
As krealloc may return NULL, in this case 'state->fc_regions' may not be freed by krealloc, but 'state->fc_regions' already set NULL. Then will lead to 'state->fc_regions' memory leak. Cc: stable@kernel.org Signed-off-by: Ye Bin <yebin10@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20220921064040.3693255-3-yebin10@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-10-01ext4: fix potential memory leak in ext4_fc_record_modified_inode()Ye Bin1-3/+5
As krealloc may return NULL, in this case 'state->fc_modified_inodes' may not be freed by krealloc, but 'state->fc_modified_inodes' already set NULL. Then will lead to 'state->fc_modified_inodes' memory leak. Cc: stable@kernel.org Signed-off-by: Ye Bin <yebin10@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20220921064040.3693255-2-yebin10@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-10-01ext4: remove redundant checking in ext4_ioctl_checkpointGuoqing Jiang1-3/+0
It is already checked after comment "check for invalid bits set", so let's remove this one. Signed-off-by: Guoqing Jiang <guoqing.jiang@linux.dev> Link: https://lore.kernel.org/r/20220918115219.12407-1-guoqing.jiang@linux.dev Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-10-01ext4: move DIOREAD_NOLOCK setting to ext4_set_def_opts()Jason Yan1-3/+3
Now since all preparations is done, we can move the DIOREAD_NOLOCK setting to ext4_set_def_opts(). Suggested-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Signed-off-by: Jason Yan <yanaijie@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20220916141527.1012715-17-yanaijie@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-10-01ext4: remove useless local variable 'blocksize'Jason Yan1-24/+21
Since sb->s_blocksize is now initialized at the very beginning, the local variable 'blocksize' in __ext4_fill_super() is not needed now. Remove it and use sb->s_blocksize instead. Signed-off-by: Jason Yan <yanaijie@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20220916141527.1012715-16-yanaijie@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-10-01ext4: unify the ext4 super block loading operationJason Yan1-80/+106
Now we load the super block from the disk in two steps. First we load the super block with the default block size(EXT4_MIN_BLOCK_SIZE). Second we load the super block with the real block size. The second step is a little far from the first step. This patch move these two steps together in a new function. Signed-off-by: Jason Yan <yanaijie@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20220916141527.1012715-15-yanaijie@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-10-01ext4: factor out ext4_journal_data_mode_check()Jason Yan1-25/+35
Factor out ext4_journal_data_mode_check(). No functional change. Signed-off-by: Jason Yan <yanaijie@huawei.com> Reviewed-by: Jan Kara<jack@suse.cz> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Link: https://lore.kernel.org/r/20220916141527.1012715-14-yanaijie@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-10-01ext4: factor out ext4_load_and_init_journal()Jason Yan1-69/+88
This patch group the journal load and initialize code together and factor out ext4_load_and_init_journal(). This patch also removes the lable 'no_journal' which is not needed after refactor. Signed-off-by: Jason Yan <yanaijie@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Link: https://lore.kernel.org/r/20220916141527.1012715-13-yanaijie@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-10-01ext4: factor out ext4_group_desc_init() and ext4_group_desc_free()Jason Yan1-59/+84
Factor out ext4_group_desc_init() and ext4_group_desc_free(). No functional change. Signed-off-by: Jason Yan <yanaijie@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Link: https://lore.kernel.org/r/20220916141527.1012715-12-yanaijie@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-10-01ext4: factor out ext4_geometry_check()Jason Yan1-50/+61
Factor out ext4_geometry_check(). No functional change. Signed-off-by: Jason Yan <yanaijie@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Link: https://lore.kernel.org/r/20220916141527.1012715-11-yanaijie@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-10-01ext4: factor out ext4_check_feature_compatibility()Jason Yan1-67/+77
Factor out ext4_check_feature_compatibility(). No functional change. Signed-off-by: Jason Yan <yanaijie@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Link: https://lore.kernel.org/r/20220916141527.1012715-10-yanaijie@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-10-01ext4: factor out ext4_init_metadata_csum()Jason Yan1-37/+46
Factor out ext4_init_metadata_csum(). No functional change. Signed-off-by: Jason Yan <yanaijie@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Link: https://lore.kernel.org/r/20220916141527.1012715-9-yanaijie@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-10-01ext4: factor out ext4_encoding_init()Jason Yan1-36/+50
Factor out ext4_encoding_init(). No functional change. Signed-off-by: Jason Yan <yanaijie@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20220916141527.1012715-8-yanaijie@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-10-01ext4: factor out ext4_inode_info_init()Jason Yan1-62/+75
Factor out ext4_inode_info_init(). No functional change. Signed-off-by: Jason Yan <yanaijie@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Link: https://lore.kernel.org/r/20220916141527.1012715-7-yanaijie@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-10-01ext4: factor out ext4_fast_commit_init()Jason Yan1-18/+25
Factor out ext4_fast_commit_init(). No functional change. Signed-off-by: Jason Yan <yanaijie@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Link: https://lore.kernel.org/r/20220916141527.1012715-6-yanaijie@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-10-01ext4: factor out ext4_handle_clustersize()Jason Yan1-49/+61
Factor out ext4_handle_clustersize(). No functional change. Signed-off-by: Jason Yan <yanaijie@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Link: https://lore.kernel.org/r/20220916141527.1012715-5-yanaijie@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-10-01ext4: factor out ext4_set_def_opts()Jason Yan1-49/+56
Factor out ext4_set_def_opts(). No functional change. Signed-off-by: Jason Yan <yanaijie@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Link: https://lore.kernel.org/r/20220916141527.1012715-4-yanaijie@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-10-01ext4: remove cantfind_ext4 error handlerJason Yan1-16/+13
The 'cantfind_ext4' error handler is just a error msg print and then goto failed_mount. This two level goto makes the code complex and not easy to read. The only benefit is that is saves a little bit code. However some branches can merge and some branches dot not even need it. So do some refactor and remove it. Signed-off-by: Jason Yan <yanaijie@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Link: https://lore.kernel.org/r/20220916141527.1012715-3-yanaijie@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-10-01ext4: goto right label 'failed_mount3a'Jason Yan1-5/+5
Before these two branches neither loaded the journal nor created the xattr cache. So the right label to goto is 'failed_mount3a'. Although this did not cause any issues because the error handler validated if the pointer is null. However this still made me confused when reading the code. So it's still worth to modify to goto the right label. Signed-off-by: Jason Yan <yanaijie@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Link: https://lore.kernel.org/r/20220916141527.1012715-2-yanaijie@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-10-01ext4: adjust fast commit disable judgement order in ext4_fc_track_inodeYe Bin1-3/+3
If fastcommit is already disabled, there isn't need to mark inode ineligible. So move 'ext4_fc_disabled()' judgement bofore 'ext4_should_journal_data(inode)' judgement which can avoid to do meaningless judgement. Signed-off-by: Ye Bin <yebin10@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20220916083836.388347-3-yebin10@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-10-01ext4: factor out ext4_fc_disabled()Ye Bin1-23/+15
Factor out ext4_fc_disabled(). No functional change. Signed-off-by: Ye Bin <yebin10@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20220916083836.388347-2-yebin10@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-10-01ext4: fix miss release buffer head in ext4_fc_write_inodeYe Bin1-6/+9
In 'ext4_fc_write_inode' function first call 'ext4_get_inode_loc' get 'iloc', after use it miss release 'iloc.bh'. So just release 'iloc.bh' before 'ext4_fc_write_inode' return. Cc: stable@kernel.org Signed-off-by: Ye Bin <yebin10@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20220914100859.1415196-1-yebin10@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-10-01ext4: fix dir corruption when ext4_dx_add_entry() failsZhihao Cheng1-5/+10
Following process may lead to fs corruption: 1. ext4_create(dir/foo) ext4_add_nondir ext4_add_entry ext4_dx_add_entry a. add_dirent_to_buf ext4_mark_inode_dirty ext4_handle_dirty_metadata // dir inode bh is recorded into journal b. ext4_append // dx_get_count(entries) == dx_get_limit(entries) ext4_bread(EXT4_GET_BLOCKS_CREATE) ext4_getblk ext4_map_blocks ext4_ext_map_blocks ext4_mb_new_blocks dquot_alloc_block dquot_alloc_space_nodirty inode_add_bytes // update dir's i_blocks ext4_ext_insert_extent ext4_ext_dirty // record extent bh into journal ext4_handle_dirty_metadata(bh) // record new block into journal inode->i_size += inode->i_sb->s_blocksize // new size(in mem) c. ext4_handle_dirty_dx_node(bh2) // record dir's new block(dx_node) into journal d. ext4_handle_dirty_dx_node((frame - 1)->bh) e. ext4_handle_dirty_dx_node(frame->bh) f. do_split // ret err! g. add_dirent_to_buf ext4_mark_inode_dirty(dir) // update raw_inode on disk(skipped) 2. fsck -a /dev/sdb drop last block(dx_node) which beyonds dir's i_size. /dev/sdb: recovering journal /dev/sdb contains a file system with errors, check forced. /dev/sdb: Inode 12, end of extent exceeds allowed value (logical block 128, physical block 3938, len 1) 3. fsck -fn /dev/sdb dx_node->entry[i].blk > dir->i_size Pass 2: Checking directory structure Problem in HTREE directory inode 12 (/dir): bad block number 128. Clear HTree index? no Problem in HTREE directory inode 12: block #3 has invalid depth (2) Problem in HTREE directory inode 12: block #3 has bad max hash Problem in HTREE directory inode 12: block #3 not referenced Fix it by marking inode dirty directly inside ext4_append(). Fetch a reproducer in [Link]. Link: https://bugzilla.kernel.org/show_bug.cgi?id=216466 Cc: stable@vger.kernel.org Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20220911045204.516460-1-chengzhihao1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-10-01ext4: remove ext4_inline_data_fiemap() declarationGaosheng Cui1-3/+0
ext4_inline_data_fiemap() has been removed since commit d3b6f23f7167 ("ext4: move ext4_fiemap to use iomap framework"), so remove it. Signed-off-by: Gaosheng Cui <cuigaosheng1@huawei.com> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Link: https://lore.kernel.org/r/20220909065307.1155201-1-cuigaosheng1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-10-01ext4: fix i_version handling in ext4Jeff Layton3-9/+10
ext4 currently updates the i_version counter when the atime is updated during a read. This is less than ideal as it can cause unnecessary cache invalidations with NFSv4 and unnecessary remeasurements for IMA. The increment in ext4_mark_iloc_dirty is also problematic since it can corrupt the i_version counter for ea_inodes. We aren't bumping the file times in ext4_mark_iloc_dirty, so changing the i_version there seems wrong, and is the cause of both problems. Remove that callsite and add increments to the setattr, setxattr and ioctl codepaths, at the same times that we update the ctime. The i_version bump that already happens during timestamp updates should take care of the rest. In ext4_move_extents, increment the i_version on both inodes, and also add in missing ctime updates. [ Some minor updates since we've already enabled the i_version counter unconditionally already via another patch series. -- TYT ] Cc: stable@kernel.org Cc: Lukas Czerner <lczerner@redhat.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org> Signed-off-by: Jeff Layton <jlayton@kernel.org> Link: https://lore.kernel.org/r/20220908172448.208585-3-jlayton@kernel.org Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-10-01ext4: place buffer head allocation before handle startJinke Han1-0/+7
In our product environment, we encounter some jbd hung waiting handles to stop while several writters were doing memory reclaim for buffer head allocation in delay alloc write path. Ext4 do buffer head allocation with holding transaction handle which may be blocked too long if the reclaim works not so smooth. According to our bcc trace, the reclaim time in buffer head allocation can reach 258s and the jbd transaction commit also take almost the same time meanwhile. Except for these extreme cases, we often see several seconds delays for cgroup memory reclaim on our servers. This is more likely to happen considering docker environment. One thing to note, the allocation of buffer heads is as often as page allocation or more often when blocksize less than page size. Just like page cache allocation, we should also place the buffer head allocation before startting the handle. Cc: stable@kernel.org Signed-off-by: Jinke Han <hanjinke.666@bytedance.com> Link: https://lore.kernel.org/r/20220903012429.22555-1-hanjinke.666@bytedance.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-10-01ext4: ext4_read_bh_lock() should submit IO if the buffer isn't uptodateZhang Yi1-11/+5
Recently we notice that ext4 filesystem would occasionally fail to read metadata from disk and report error message, but the disk and block layer looks fine. After analyse, we lockon commit 88dbcbb3a484 ("blkdev: avoid migration stalls for blkdev pages"). It provide a migration method for the bdev, we could move page that has buffers without extra users now, but it lock the buffers on the page, which breaks the fragile metadata read operation on ext4 filesystem, ext4_read_bh_lock() was copied from ll_rw_block(), it depends on the assumption of that locked buffer means it is under IO. So it just trylock the buffer and skip submit IO if it lock failed, after wait_on_buffer() we conclude IO error because the buffer is not uptodate. This issue could be easily reproduced by add some delay just after buffer_migrate_lock_buffers() in __buffer_migrate_folio() and do fsstress on ext4 filesystem. EXT4-fs error (device pmem1): __ext4_find_entry:1658: inode #73193: comm fsstress: reading directory lblock 0 EXT4-fs error (device pmem1): __ext4_find_entry:1658: inode #75334: comm fsstress: reading directory lblock 0 Fix it by removing the trylock logic in ext4_read_bh_lock(), just lock the buffer and submit IO if it's not uptodate, and also leave over readahead helper. Cc: stable@kernel.org Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20220831074629.3755110-1-yi.zhang@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-10-01ext4: unconditionally enable the i_version counterJeff Layton2-20/+7
The original i_version implementation was pretty expensive, requiring a log flush on every change. Because of this, it was gated behind a mount option (implemented via the MS_I_VERSION mountoption flag). Commit ae5e165d855d (fs: new API for handling inode->i_version) made the i_version flag much less expensive, so there is no longer a performance penalty from enabling it. xfs and btrfs already enable it unconditionally when the on-disk format can support it. Have ext4 ignore the SB_I_VERSION flag, and just enable it unconditionally. While we're in here, mark the i_version mount option Opt_removed. [ Removed leftover bits of i_version from ext4_apply_options() since it now can't ever be set in ctx->mask_s_flags -- lczerner ] Cc: stable@kernel.org Cc: Dave Chinner <david@fromorbit.com> Cc: Benjamin Coddington <bcodding@redhat.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Lukas Czerner <lczerner@redhat.com> Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20220824160349.39664-3-lczerner@redhat.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-09-30ext4: don't increase iversion counter for ea_inodesLukas Czerner1-1/+6
ea_inodes are using i_version for storing part of the reference count so we really need to leave it alone. The problem can be reproduced by xfstest ext4/026 when iversion is enabled. Fix it by not calling inode_inc_iversion() for EXT4_EA_INODE_FL inodes in ext4_mark_iloc_dirty(). Cc: stable@kernel.org Signed-off-by: Lukas Czerner <lczerner@redhat.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org> Link: https://lore.kernel.org/r/20220824160349.39664-1-lczerner@redhat.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-09-30ext4: fix check for block being out of directory sizeJan Kara1-1/+1
The check in __ext4_read_dirblock() for block being outside of directory size was wrong because it compared block number against directory size in bytes. Fix it. Fixes: 65f8ea4cd57d ("ext4: check if directory block is within i_size") CVE: CVE-2022-1184 CC: stable@vger.kernel.org Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Lukas Czerner <lczerner@redhat.com> Link: https://lore.kernel.org/r/20220822114832.1482-1-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-09-30ext4: make ext4_lazyinit_thread freezableLalith Rajendran1-0/+1
ext4_lazyinit_thread is not set freezable. Hence when the thread calls try_to_freeze it doesn't freeze during suspend and continues to send requests to the storage during suspend, resulting in suspend failures. Cc: stable@kernel.org Signed-off-by: Lalith Rajendran <lalithkraj@google.com> Link: https://lore.kernel.org/r/20220818214049.1519544-1-lalithkraj@google.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-09-29ext4: fix null-ptr-deref in ext4_write_infoBaokun Li1-1/+1
I caught a null-ptr-deref bug as follows: ================================================================== KASAN: null-ptr-deref in range [0x0000000000000068-0x000000000000006f] CPU: 1 PID: 1589 Comm: umount Not tainted 5.10.0-02219-dirty #339 RIP: 0010:ext4_write_info+0x53/0x1b0 [...] Call Trace: dquot_writeback_dquots+0x341/0x9a0 ext4_sync_fs+0x19e/0x800 __sync_filesystem+0x83/0x100 sync_filesystem+0x89/0xf0 generic_shutdown_super+0x79/0x3e0 kill_block_super+0xa1/0x110 deactivate_locked_super+0xac/0x130 deactivate_super+0xb6/0xd0 cleanup_mnt+0x289/0x400 __cleanup_mnt+0x16/0x20 task_work_run+0x11c/0x1c0 exit_to_user_mode_prepare+0x203/0x210 syscall_exit_to_user_mode+0x5b/0x3a0 do_syscall_64+0x59/0x70 entry_SYSCALL_64_after_hwframe+0x44/0xa9 ================================================================== Above issue may happen as follows: ------------------------------------- exit_to_user_mode_prepare task_work_run __cleanup_mnt cleanup_mnt deactivate_super deactivate_locked_super kill_block_super generic_shutdown_super shrink_dcache_for_umount dentry = sb->s_root sb->s_root = NULL <--- Here set NULL sync_filesystem __sync_filesystem sb->s_op->sync_fs > ext4_sync_fs dquot_writeback_dquots sb->dq_op->write_info > ext4_write_info ext4_journal_start(d_inode(sb->s_root), EXT4_HT_QUOTA, 2) d_inode(sb->s_root) s_root->d_inode <--- Null pointer dereference To solve this problem, we use ext4_journal_start_sb directly to avoid s_root being used. Cc: stable@kernel.org Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20220805123947.565152-1-libaokun1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-09-29ext4: don't run ext4lazyinit for read-only filesystemsJosh Triplett1-3/+3
On a read-only filesystem, we won't invoke the block allocator, so we don't need to prefetch the block bitmaps. This avoids starting and running the ext4lazyinit thread at all on a system with no read-write ext4 filesystems (for instance, a container VM with read-only filesystems underneath an overlayfs). Fixes: 21175ca434c5 ("ext4: make prefetch_block_bitmaps default") Signed-off-by: Josh Triplett <josh@joshtriplett.org> Reviewed-by: Lukas Czerner <lczerner@redhat.com> Link: https://lore.kernel.org/r/48b41da1498fcac3287e2e06b660680646c1c050.1659323972.git.josh@joshtriplett.org Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-09-29ext4: remove deprecated noacl/nouser_xattr optionsYang Xu1-10/+1
These two options should have been removed since 3.5, but none notices it. Recently, I and Darrick found this. Also, have some discussion for this[1][2][3]. So now, let's remove them. Link: https://lore.kernel.org/linux-ext4/6258F7BB.8010104@fujitsu.com/T/#u[1] Link: https://lore.kernel.org/linux-ext4/20220602110421.ymoug3rwfspmryqg@fedora/T/#t[2] Link: https://lore.kernel.org/linux-ext4/08e2ca4c8f6344bdcd76d75b821116c6147fd57a.camel@kernel.org/T/#t[3] Signed-off-by: Yang Xu <xuyang2018.jy@fujitsu.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/1658977369-2478-1-git-send-email-xuyang2018.jy@fujitsu.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-09-29ext4: avoid crash when inline data creation follows DIO writeJan Kara1-0/+6
When inode is created and written to using direct IO, there is nothing to clear the EXT4_STATE_MAY_INLINE_DATA flag. Thus when inode gets truncated later to say 1 byte and written using normal write, we will try to store the data as inline data. This confuses the code later because the inode now has both normal block and inline data allocated and the confusion manifests for example as: kernel BUG at fs/ext4/inode.c:2721! invalid opcode: 0000 [#1] PREEMPT SMP KASAN CPU: 0 PID: 359 Comm: repro Not tainted 5.19.0-rc8-00001-g31ba1e3b8305-dirty #15 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.0-1.fc36 04/01/2014 RIP: 0010:ext4_writepages+0x363d/0x3660 RSP: 0018:ffffc90000ccf260 EFLAGS: 00010293 RAX: ffffffff81e1abcd RBX: 0000008000000000 RCX: ffff88810842a180 RDX: 0000000000000000 RSI: 0000008000000000 RDI: 0000000000000000 RBP: ffffc90000ccf650 R08: ffffffff81e17d58 R09: ffffed10222c680b R10: dfffe910222c680c R11: 1ffff110222c680a R12: ffff888111634128 R13: ffffc90000ccf880 R14: 0000008410000000 R15: 0000000000000001 FS: 00007f72635d2640(0000) GS:ffff88811b000000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000565243379180 CR3: 000000010aa74000 CR4: 0000000000150eb0 Call Trace: <TASK> do_writepages+0x397/0x640 filemap_fdatawrite_wbc+0x151/0x1b0 file_write_and_wait_range+0x1c9/0x2b0 ext4_sync_file+0x19e/0xa00 vfs_fsync_range+0x17b/0x190 ext4_buffered_write_iter+0x488/0x530 ext4_file_write_iter+0x449/0x1b90 vfs_write+0xbcd/0xf40 ksys_write+0x198/0x2c0 __x64_sys_write+0x7b/0x90 do_syscall_64+0x3d/0x90 entry_SYSCALL_64_after_hwframe+0x63/0xcd </TASK> Fix the problem by clearing EXT4_STATE_MAY_INLINE_DATA when we are doing direct IO write to a file. Cc: stable@kernel.org Reported-by: Tadeusz Struk <tadeusz.struk@linaro.org> Reported-by: syzbot+bd13648a53ed6933ca49@syzkaller.appspotmail.com Link: https://syzkaller.appspot.com/bug?id=a1e89d09bbbcbd5c4cb45db230ee28c822953984 Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Lukas Czerner <lczerner@redhat.com> Tested-by: Tadeusz Struk<tadeusz.struk@linaro.org> Link: https://lore.kernel.org/r/20220727155753.13969-1-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-09-27ext4: minor defrag code improvementsEric Whitney1-9/+7
Modify the error returns for two file types that can't be defragged to more clearly communicate those restrictions to a caller. When the defrag code is applied to swap files, return -ETXTBSY, and when applied to quota files, return -EOPNOTSUPP. Move an extent tree search whose results are only occasionally required to the site always requiring them for improved efficiency. Address a few typos. Signed-off-by: Eric Whitney <enwlinux@gmail.com> Link: https://lore.kernel.org/r/20220722163910.268564-1-enwlinux@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-09-27ext4: continue to expand file system when the target size doesn't reachJerry Lee 李修賢1-1/+1
When expanding a file system from (16TiB-2MiB) to 18TiB, the operation exits early which leads to result inconsistency between resize2fs and Ext4 kernel driver. === before === ○ → resize2fs /dev/mapper/thin resize2fs 1.45.5 (07-Jan-2020) Filesystem at /dev/mapper/thin is mounted on /mnt/test; on-line resizing required old_desc_blocks = 2048, new_desc_blocks = 2304 The filesystem on /dev/mapper/thin is now 4831837696 (4k) blocks long. [ 865.186308] EXT4-fs (dm-5): mounted filesystem with ordered data mode. Opts: (null). Quota mode: none. [ 912.091502] dm-4: detected capacity change from 34359738368 to 38654705664 [ 970.030550] dm-5: detected capacity change from 34359734272 to 38654701568 [ 1000.012751] EXT4-fs (dm-5): resizing filesystem from 4294966784 to 4831837696 blocks [ 1000.012878] EXT4-fs (dm-5): resized filesystem to 4294967296 === after === [ 129.104898] EXT4-fs (dm-5): mounted filesystem with ordered data mode. Opts: (null). Quota mode: none. [ 143.773630] dm-4: detected capacity change from 34359738368 to 38654705664 [ 198.203246] dm-5: detected capacity change from 34359734272 to 38654701568 [ 207.918603] EXT4-fs (dm-5): resizing filesystem from 4294966784 to 4831837696 blocks [ 207.918754] EXT4-fs (dm-5): resizing filesystem from 4294967296 to 4831837696 blocks [ 207.918758] EXT4-fs (dm-5): Converting file system to meta_bg [ 207.918790] EXT4-fs (dm-5): resizing filesystem from 4294967296 to 4831837696 blocks [ 221.454050] EXT4-fs (dm-5): resized to 4658298880 blocks [ 227.634613] EXT4-fs (dm-5): resized filesystem to 4831837696 Signed-off-by: Jerry Lee <jerrylee@qnap.com> Link: https://lore.kernel.org/r/PU1PR04MB22635E739BD21150DC182AC6A18C9@PU1PR04MB2263.apcprd04.prod.outlook.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-09-26ext4: fixup possible uninitialized variable access in ↵Jan Kara1-2/+1
ext4_mb_choose_next_group_cr1() Variable 'grp' may be left uninitialized if there's no group with suitable average fragment size (or larger). Fix the problem by initializing it earlier. Link: https://lore.kernel.org/r/20220922091542.pkhedytey7wzp5fi@quack3 Fixes: 83e80a6e3543 ("ext4: use buckets for cr 1 block scan instead of rbtree") Cc: stable@kernel.org Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-09-24vfs: open inside ->tmpfile()Miklos Szeredi1-3/+3
This is in preparation for adding tmpfile support to fuse, which requires that the tmpfile creation and opening are done as a single operation. Replace the 'struct dentry *' argument of i_op->tmpfile with 'struct file *'. Call finish_open_simple() as the last thing in ->tmpfile() instances (may be omitted in the error case). Change d_tmpfile() argument to 'struct file *' as well to make callers more readable. Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2022-09-22ext4: limit the number of retries after discarding preallocations blocksTheodore Ts'o1-1/+3
This patch avoids threads live-locking for hours when a large number threads are competing over the last few free extents as they blocks getting added and removed from preallocation pools. From our bug reporter: A reliable way for triggering this has multiple writers continuously write() to files when the filesystem is full, while small amounts of space are freed (e.g. by truncating a large file -1MiB at a time). In the local filesystem, this can be done by simply not checking the return code of write (0) and/or the error (ENOSPACE) that is set. Over NFS with an async mount, even clients with proper error checking will behave this way since the linux NFS client implementation will not propagate the server errors [the write syscalls immediately return success] until the file handle is closed. This leads to a situation where NFS clients send a continuous stream of WRITE rpcs which result in ERRNOSPACE -- but since the client isn't seeing this, the stream of writes continues at maximum network speed. When some space does appear, multiple writers will all attempt to claim it for their current write. For NFS, we may see dozens to hundreds of threads that do this. The real-world scenario of this is database backup tooling (in particular, github.com/mdkent/percona-xtrabackup) which may write large files (>1TiB) to NFS for safe keeping. Some temporary files are written, rewound, and read back -- all before closing the file handle (the temp file is actually unlinked, to trigger automatic deletion on close/crash.) An application like this operating on an async NFS mount will not see an error code until TiB have been written/read. The lockup was observed when running this database backup on large filesystems (64 TiB in this case) with a high number of block groups and no free space. Fragmentation is generally not a factor in this filesystem (~thousands of large files, mostly contiguous except for the parts written while the filesystem is at capacity.) Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org