summaryrefslogtreecommitdiff
path: root/fs/ext4
AgeCommit message (Collapse)AuthorFilesLines
2025-04-20ext4: fix off-by-one error in do_splitArtem Sadovnikov1-1/+1
commit 94824ac9a8aaf2fb3c54b4bdde842db80ffa555d upstream. Syzkaller detected a use-after-free issue in ext4_insert_dentry that was caused by out-of-bounds access due to incorrect splitting in do_split. BUG: KASAN: use-after-free in ext4_insert_dentry+0x36a/0x6d0 fs/ext4/namei.c:2109 Write of size 251 at addr ffff888074572f14 by task syz-executor335/5847 CPU: 0 UID: 0 PID: 5847 Comm: syz-executor335 Not tainted 6.12.0-rc6-syzkaller-00318-ga9cda7c0ffed #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/30/2024 Call Trace: <TASK> __dump_stack lib/dump_stack.c:94 [inline] dump_stack_lvl+0x241/0x360 lib/dump_stack.c:120 print_address_description mm/kasan/report.c:377 [inline] print_report+0x169/0x550 mm/kasan/report.c:488 kasan_report+0x143/0x180 mm/kasan/report.c:601 kasan_check_range+0x282/0x290 mm/kasan/generic.c:189 __asan_memcpy+0x40/0x70 mm/kasan/shadow.c:106 ext4_insert_dentry+0x36a/0x6d0 fs/ext4/namei.c:2109 add_dirent_to_buf+0x3d9/0x750 fs/ext4/namei.c:2154 make_indexed_dir+0xf98/0x1600 fs/ext4/namei.c:2351 ext4_add_entry+0x222a/0x25d0 fs/ext4/namei.c:2455 ext4_add_nondir+0x8d/0x290 fs/ext4/namei.c:2796 ext4_symlink+0x920/0xb50 fs/ext4/namei.c:3431 vfs_symlink+0x137/0x2e0 fs/namei.c:4615 do_symlinkat+0x222/0x3a0 fs/namei.c:4641 __do_sys_symlink fs/namei.c:4662 [inline] __se_sys_symlink fs/namei.c:4660 [inline] __x64_sys_symlink+0x7a/0x90 fs/namei.c:4660 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f </TASK> The following loop is located right above 'if' statement. for (i = count-1; i >= 0; i--) { /* is more than half of this entry in 2nd half of the block? */ if (size + map[i].size/2 > blocksize/2) break; size += map[i].size; move++; } 'i' in this case could go down to -1, in which case sum of active entries wouldn't exceed half the block size, but previous behaviour would also do split in half if sum would exceed at the very last block, which in case of having too many long name files in a single block could lead to out-of-bounds access and following use-after-free. Found by Linux Verification Center (linuxtesting.org) with Syzkaller. Cc: stable@vger.kernel.org Fixes: 5872331b3d91 ("ext4: fix potential negative array index in do_split()") Signed-off-by: Artem Sadovnikov <a.sadovnikov@ispras.ru> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20250404082804.2567-3-a.sadovnikov@ispras.ru Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-04-20ext4: don't treat fhandle lookup of ea_inode as FS corruptionJann Horn1-20/+48
[ Upstream commit 642335f3ea2b3fd6dba03e57e01fa9587843a497 ] A file handle that userspace provides to open_by_handle_at() can legitimately contain an outdated inode number that has since been reused for another purpose - that's why the file handle also contains a generation number. But if the inode number has been reused for an ea_inode, check_igot_inode() will notice, __ext4_iget() will go through ext4_error_inode(), and if the inode was newly created, it will also be marked as bad by iget_failed(). This all happens before the point where the inode generation is checked. ext4_error_inode() is supposed to only be used on filesystem corruption; it should not be used when userspace just got unlucky with a stale file handle. So when this happens, let __ext4_iget() just return an error. Fixes: b3e6bcb94590 ("ext4: add EA_INODE checking to ext4_iget()") Signed-off-by: Jann Horn <jannh@google.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20241129-ext4-ignore-ea-fhandle-v1-1-e532c0d1cee0@google.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-04-20ext4: ignore xattrs past endBhupesh1-1/+10
[ Upstream commit c8e008b60492cf6fd31ef127aea6d02fd3d314cd ] Once inside 'ext4_xattr_inode_dec_ref_all' we should ignore xattrs entries past the 'end' entry. This fixes the following KASAN reported issue: ================================================================== BUG: KASAN: slab-use-after-free in ext4_xattr_inode_dec_ref_all+0xb8c/0xe90 Read of size 4 at addr ffff888012c120c4 by task repro/2065 CPU: 1 UID: 0 PID: 2065 Comm: repro Not tainted 6.13.0-rc2+ #11 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014 Call Trace: <TASK> dump_stack_lvl+0x1fd/0x300 ? tcp_gro_dev_warn+0x260/0x260 ? _printk+0xc0/0x100 ? read_lock_is_recursive+0x10/0x10 ? irq_work_queue+0x72/0xf0 ? __virt_addr_valid+0x17b/0x4b0 print_address_description+0x78/0x390 print_report+0x107/0x1f0 ? __virt_addr_valid+0x17b/0x4b0 ? __virt_addr_valid+0x3ff/0x4b0 ? __phys_addr+0xb5/0x160 ? ext4_xattr_inode_dec_ref_all+0xb8c/0xe90 kasan_report+0xcc/0x100 ? ext4_xattr_inode_dec_ref_all+0xb8c/0xe90 ext4_xattr_inode_dec_ref_all+0xb8c/0xe90 ? ext4_xattr_delete_inode+0xd30/0xd30 ? __ext4_journal_ensure_credits+0x5f0/0x5f0 ? __ext4_journal_ensure_credits+0x2b/0x5f0 ? inode_update_timestamps+0x410/0x410 ext4_xattr_delete_inode+0xb64/0xd30 ? ext4_truncate+0xb70/0xdc0 ? ext4_expand_extra_isize_ea+0x1d20/0x1d20 ? __ext4_mark_inode_dirty+0x670/0x670 ? ext4_journal_check_start+0x16f/0x240 ? ext4_inode_is_fast_symlink+0x2f2/0x3a0 ext4_evict_inode+0xc8c/0xff0 ? ext4_inode_is_fast_symlink+0x3a0/0x3a0 ? do_raw_spin_unlock+0x53/0x8a0 ? ext4_inode_is_fast_symlink+0x3a0/0x3a0 evict+0x4ac/0x950 ? proc_nr_inodes+0x310/0x310 ? trace_ext4_drop_inode+0xa2/0x220 ? _raw_spin_unlock+0x1a/0x30 ? iput+0x4cb/0x7e0 do_unlinkat+0x495/0x7c0 ? try_break_deleg+0x120/0x120 ? 0xffffffff81000000 ? __check_object_size+0x15a/0x210 ? strncpy_from_user+0x13e/0x250 ? getname_flags+0x1dc/0x530 __x64_sys_unlinkat+0xc8/0xf0 do_syscall_64+0x65/0x110 entry_SYSCALL_64_after_hwframe+0x67/0x6f RIP: 0033:0x434ffd Code: 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 8 RSP: 002b:00007ffc50fa7b28 EFLAGS: 00000246 ORIG_RAX: 0000000000000107 RAX: ffffffffffffffda RBX: 00007ffc50fa7e18 RCX: 0000000000434ffd RDX: 0000000000000000 RSI: 0000000020000240 RDI: 0000000000000005 RBP: 00007ffc50fa7be0 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000001 R13: 00007ffc50fa7e08 R14: 00000000004bbf30 R15: 0000000000000001 </TASK> The buggy address belongs to the object at ffff888012c12000 which belongs to the cache filp of size 360 The buggy address is located 196 bytes inside of freed 360-byte region [ffff888012c12000, ffff888012c12168) The buggy address belongs to the physical page: page: refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x12c12 head: order:1 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 pincount:0 flags: 0x40(head|node=0|zone=0) page_type: f5(slab) raw: 0000000000000040 ffff888000ad7640 ffffea0000497a00 dead000000000004 raw: 0000000000000000 0000000000100010 00000001f5000000 0000000000000000 head: 0000000000000040 ffff888000ad7640 ffffea0000497a00 dead000000000004 head: 0000000000000000 0000000000100010 00000001f5000000 0000000000000000 head: 0000000000000001 ffffea00004b0481 ffffffffffffffff 0000000000000000 head: 0000000000000002 0000000000000000 00000000ffffffff 0000000000000000 page dumped because: kasan: bad access detected Memory state around the buggy address: ffff888012c11f80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ffff888012c12000: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb > ffff888012c12080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ^ ffff888012c12100: fb fb fb fb fb fb fb fb fb fb fb fb fb fc fc fc ffff888012c12180: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc ================================================================== Reported-by: syzbot+b244bda78289b00204ed@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=b244bda78289b00204ed Suggested-by: Thadeu Lima de Souza Cascardo <cascardo@igalia.com> Signed-off-by: Bhupesh <bhupesh@igalia.com> Link: https://patch.msgid.link/20250128082751.124948-2-bhupesh@igalia.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-04-20ext4: protect ext4_release_dquot against freezingOjaswin Mujoo1-0/+17
[ Upstream commit 530fea29ef82e169cd7fe048c2b7baaeb85a0028 ] Protect ext4_release_dquot against freezing so that we don't try to start a transaction when FS is frozen, leading to warnings. Further, avoid taking the freeze protection if a transaction is already running so that we don't need end up in a deadlock as described in 46e294efc355 ext4: fix deadlock with fs freezing and EA inodes Suggested-by: Jan Kara <jack@suse.cz> Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Reviewed-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20241121123855.645335-3-ojaswin@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-04-10ext4: fix OOB read when checking dotdot dirAcs, Jakub1-0/+3
commit d5e206778e96e8667d3bde695ad372c296dc9353 upstream. Mounting a corrupted filesystem with directory which contains '.' dir entry with rec_len == block size results in out-of-bounds read (later on, when the corrupted directory is removed). ext4_empty_dir() assumes every ext4 directory contains at least '.' and '..' as directory entries in the first data block. It first loads the '.' dir entry, performs sanity checks by calling ext4_check_dir_entry() and then uses its rec_len member to compute the location of '..' dir entry (in ext4_next_entry). It assumes the '..' dir entry fits into the same data block. If the rec_len of '.' is precisely one block (4KB), it slips through the sanity checks (it is considered the last directory entry in the data block) and leaves "struct ext4_dir_entry_2 *de" point exactly past the memory slot allocated to the data block. The following call to ext4_check_dir_entry() on new value of de then dereferences this pointer which results in out-of-bounds mem access. Fix this by extending __ext4_check_dir_entry() to check for '.' dir entries that reach the end of data block. Make sure to ignore the phony dir entries for checksum (by checking name_len for non-zero). Note: This is reported by KASAN as use-after-free in case another structure was recently freed from the slot past the bound, but it is really an OOB read. This issue was found by syzkaller tool. Call Trace: [ 38.594108] BUG: KASAN: slab-use-after-free in __ext4_check_dir_entry+0x67e/0x710 [ 38.594649] Read of size 2 at addr ffff88802b41a004 by task syz-executor/5375 [ 38.595158] [ 38.595288] CPU: 0 UID: 0 PID: 5375 Comm: syz-executor Not tainted 6.14.0-rc7 #1 [ 38.595298] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014 [ 38.595304] Call Trace: [ 38.595308] <TASK> [ 38.595311] dump_stack_lvl+0xa7/0xd0 [ 38.595325] print_address_description.constprop.0+0x2c/0x3f0 [ 38.595339] ? __ext4_check_dir_entry+0x67e/0x710 [ 38.595349] print_report+0xaa/0x250 [ 38.595359] ? __ext4_check_dir_entry+0x67e/0x710 [ 38.595368] ? kasan_addr_to_slab+0x9/0x90 [ 38.595378] kasan_report+0xab/0xe0 [ 38.595389] ? __ext4_check_dir_entry+0x67e/0x710 [ 38.595400] __ext4_check_dir_entry+0x67e/0x710 [ 38.595410] ext4_empty_dir+0x465/0x990 [ 38.595421] ? __pfx_ext4_empty_dir+0x10/0x10 [ 38.595432] ext4_rmdir.part.0+0x29a/0xd10 [ 38.595441] ? __dquot_initialize+0x2a7/0xbf0 [ 38.595455] ? __pfx_ext4_rmdir.part.0+0x10/0x10 [ 38.595464] ? __pfx___dquot_initialize+0x10/0x10 [ 38.595478] ? down_write+0xdb/0x140 [ 38.595487] ? __pfx_down_write+0x10/0x10 [ 38.595497] ext4_rmdir+0xee/0x140 [ 38.595506] vfs_rmdir+0x209/0x670 [ 38.595517] ? lookup_one_qstr_excl+0x3b/0x190 [ 38.595529] do_rmdir+0x363/0x3c0 [ 38.595537] ? __pfx_do_rmdir+0x10/0x10 [ 38.595544] ? strncpy_from_user+0x1ff/0x2e0 [ 38.595561] __x64_sys_unlinkat+0xf0/0x130 [ 38.595570] do_syscall_64+0x5b/0x180 [ 38.595583] entry_SYSCALL_64_after_hwframe+0x76/0x7e Fixes: ac27a0ec112a0 ("[PATCH] ext4: initial copy of files from ext3") Signed-off-by: Jakub Acs <acsjakub@amazon.de> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Andreas Dilger <adilger.kernel@dilger.ca> Cc: linux-ext4@vger.kernel.org Cc: linux-kernel@vger.kernel.org Cc: Mahmoud Adam <mngyadam@amazon.com> Cc: stable@vger.kernel.org Cc: security@kernel.org Link: https://patch.msgid.link/b3ae36a6794c4a01944c7d70b403db5b@amazon.de Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-04-10ext4: don't over-report free space or inodes in statvfsTheodore Ts'o1-10/+17
commit f87d3af7419307ae26e705a2b2db36140db367a2 upstream. This fixes an analogus bug that was fixed in xfs in commit 4b8d867ca6e2 ("xfs: don't over-report free space or inodes in statvfs") where statfs can report misleading / incorrect information where project quota is enabled, and the free space is less than the remaining quota. This commit will resolve a test failure in generic/762 which tests for this bug. Cc: stable@kernel.org Fixes: 689c958cbe6b ("ext4: add project quota support") Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-11-19Merge tag 'ext4_for_linus-6.13-rc1' of ↵Linus Torvalds20-102/+227
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 updates from Ted Ts'o: "A lot of miscellaneous ext4 bug fixes and cleanups this cycle, most notably in the journaling code, bufered I/O, and compiler warning cleanups" * tag 'ext4_for_linus-6.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (33 commits) jbd2: Fix comment describing journal_init_common() ext4: prevent an infinite loop in the lazyinit thread ext4: use struct_size() to improve ext4_htree_store_dirent() ext4: annotate struct fname with __counted_by() jbd2: avoid dozens of -Wflex-array-member-not-at-end warnings ext4: use str_yes_no() helper function ext4: prevent delalloc to nodelalloc on remount jbd2: make b_frozen_data allocation always succeed ext4: cleanup variable name in ext4_fc_del() ext4: use string choices helpers jbd2: remove the 'success' parameter from the jbd2_do_replay() function jbd2: remove useless 'block_error' variable jbd2: factor out jbd2_do_replay() jbd2: refactor JBD2_COMMIT_BLOCK process in do_one_pass() jbd2: unified release of buffer_head in do_one_pass() jbd2: remove redundant judgments for check v1 checksum ext4: use ERR_CAST to return an error-valued pointer mm: zero range of eof folio exposed by inode size extension ext4: partial zero eof block on unaligned inode size extension ext4: disambiguate the return value of ext4_dio_write_end_io() ...
2024-11-18Merge tag 'pull-fd' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfsLinus Torvalds1-14/+7
Pull 'struct fd' class updates from Al Viro: "The bulk of struct fd memory safety stuff Making sure that struct fd instances are destroyed in the same scope where they'd been created, getting rid of reassignments and passing them by reference, converting to CLASS(fd{,_pos,_raw}). We are getting very close to having the memory safety of that stuff trivial to verify" * tag 'pull-fd' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (28 commits) deal with the last remaing boolean uses of fd_file() css_set_fork(): switch to CLASS(fd_raw, ...) memcg_write_event_control(): switch to CLASS(fd) assorted variants of irqfd setup: convert to CLASS(fd) do_pollfd(): convert to CLASS(fd) convert do_select() convert vfs_dedupe_file_range(). convert cifs_ioctl_copychunk() convert media_request_get_by_fd() convert spu_run(2) switch spufs_calls_{get,put}() to CLASS() use convert cachestat(2) convert do_preadv()/do_pwritev() fdget(), more trivial conversions fdget(), trivial conversions privcmd_ioeventfd_assign(): don't open-code eventfd_ctx_fdget() o2hb_region_dev_store(): avoid goto around fdget()/fdput() introduce "fd_pos" class, convert fdget_pos() users to it. fdget_raw() users: switch to CLASS(fd_raw) convert vmsplice() to CLASS(fd) ...
2024-11-18Merge tag 'vfs-6.13.untorn.writes' of ↵Linus Torvalds4-5/+99
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs untorn write support from Christian Brauner: "An atomic write is a write issed with torn-write protection. This means for a power failure or any hardware failure all or none of the data from the write will be stored, never a mix of old and new data. This work is already supported for block devices. If a block device is opened with O_DIRECT and the block device supports atomic write, then FMODE_CAN_ATOMIC_WRITE is added to the file of the opened block device. This contains the work to expand atomic write support to filesystems, specifically ext4 and XFS. Currently, only support for writing exactly one filesystem block atomically is added. Since it's now possible to have filesystem block size > page size for XFS, it's possible to write 4K+ blocks atomically on x86" * tag 'vfs-6.13.untorn.writes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: iomap: drop an obsolete comment in iomap_dio_bio_iter ext4: Do not fallback to buffered-io for DIO atomic write ext4: Support setting FMODE_CAN_ATOMIC_WRITE ext4: Check for atomic writes support in write iter ext4: Add statx support for atomic writes xfs: Support setting FMODE_CAN_ATOMIC_WRITE xfs: Validate atomic writes xfs: Support atomic write for statx fs: iomap: Atomic write support fs: Export generic_atomic_write_valid() block: Add bdev atomic write limits helpers fs/block: Check for IOCB_DIRECT in generic_atomic_write_valid() block/fs: Pass an iocb to generic_atomic_write_valid()
2024-11-18Merge tag 'vfs-6.13.tmpfs' of ↵Linus Torvalds1-4/+1
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull tmpfs case folding updates from Christian Brauner: "This adds case-insensitive support for tmpfs. The work contained in here adds support for case-insensitive file names lookups in tmpfs. The main difference from other casefold filesystems is that tmpfs has no information on disk, just on RAM, so we can't use mkfs to create a case-insensitive tmpfs. For this implementation, there's a mount option for casefolding. The rest of the patchset follows a similar approach as ext4 and f2fs. The use case for this feature is similar to the use case for ext4, to better support compatibility layers (like Wine), particularly in combination with sandboxing/container tools (like Flatpak). Those containerization tools can share a subset of the host filesystem with an application. In the container, the root directory and any parent directories required for a shared directory are on tmpfs, with the shared directories bind-mounted into the container's view of the filesystem. If the host filesystem is using case-insensitive directories, then the application can do lookups inside those directories in a case-insensitive way, without this needing to be implemented in user-space. However, if the host is only sharing a subset of a case-insensitive directory with the application, then the parent directories of the mount point will be part of the container's root tmpfs. When the application tries to do case-insensitive lookups of those parent directories on a case-sensitive tmpfs, the lookup will fail" * tag 'vfs-6.13.tmpfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: tmpfs: Initialize sysfs during tmpfs init tmpfs: Fix type for sysfs' casefold attribute libfs: Fix kernel-doc warning in generic_ci_validate_strict_name docs: tmpfs: Add casefold options tmpfs: Expose filesystem features via sysfs tmpfs: Add flag FS_CASEFOLD_FL support for tmpfs dirs tmpfs: Add casefold lookup support libfs: Export generic_ci_ dentry functions unicode: Recreate utf8_parse_version() unicode: Export latest available UTF-8 version number ext4: Use generic_ci_validate_strict_name helper libfs: Create the helper function generic_ci_validate_strict_name()
2024-11-18Merge tag 'vfs-6.13.misc' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull misc vfs updates from Christian Brauner: "Features: - Fixup and improve NLM and kNFSD file lock callbacks Last year both GFS2 and OCFS2 had some work done to make their locking more robust when exported over NFS. Unfortunately, part of that work caused both NLM (for NFS v3 exports) and kNFSD (for NFSv4.1+ exports) to no longer send lock notifications to clients This in itself is not a huge problem because most NFS clients will still poll the server in order to acquire a conflicted lock It's important for NLM and kNFSD that they do not block their kernel threads inside filesystem's file_lock implementations because that can produce deadlocks. We used to make sure of this by only trusting that posix_lock_file() can correctly handle blocking lock calls asynchronously, so the lock managers would only setup their file_lock requests for async callbacks if the filesystem did not define its own lock() file operation However, when GFS2 and OCFS2 grew the capability to correctly handle blocking lock requests asynchronously, they started signalling this behavior with EXPORT_OP_ASYNC_LOCK, and the check for also trusting posix_lock_file() was inadvertently dropped, so now most filesystems no longer produce lock notifications when exported over NFS Fix this by using an fop_flag which greatly simplifies the problem and grooms the way for future uses by both filesystems and lock managers alike - Add a sysctl to delete the dentry when a file is removed instead of making it a negative dentry Commit 681ce8623567 ("vfs: Delete the associated dentry when deleting a file") introduced an unconditional deletion of the associated dentry when a file is removed. However, this led to performance regressions in specific benchmarks, such as ilebench.sum_operations/s, prompting a revert in commit 4a4be1ad3a6e ("Revert "vfs: Delete the associated dentry when deleting a file""). This reintroduces the concept conditionally through a sysctl - Expand the statmount() system call: * Report the filesystem subtype in a new fs_subtype field to e.g., report fuse filesystem subtypes * Report the superblock source in a new sb_source field * Add a new way to return filesystem specific mount options in an option array that returns filesystem specific mount options separated by zero bytes and unescaped. This allows caller's to retrieve filesystem specific mount options and immediately pass them to e.g., fsconfig() without having to unescape or split them * Report security (LSM) specific mount options in a separate security option array. We don't lump them together with filesystem specific mount options as security mount options are generic and most users aren't interested in them The format is the same as for the filesystem specific mount option array - Support relative paths in fsconfig()'s FSCONFIG_SET_STRING command - Optimize acl_permission_check() to avoid costly {g,u}id ownership checks if possible - Use smp_mb__after_spinlock() to avoid full smp_mb() in evict() - Add synchronous wakeup support for ep_poll_callback. Currently, epoll only uses wake_up() to wake up task. But sometimes there are epoll users which want to use the synchronous wakeup flag to give a hint to the scheduler, e.g., the Android binder driver. So add a wake_up_sync() define, and use wake_up_sync() when sync is true in ep_poll_callback() Fixes: - Fix kernel documentation for inode_insert5() and iget5_locked() - Annotate racy epoll check on file->f_ep - Make F_DUPFD_QUERY associative - Avoid filename buffer overrun in initramfs - Don't let statmount() return empty strings - Add a cond_resched() to dump_user_range() to avoid hogging the CPU - Don't query the device logical blocksize multiple times for hfsplus - Make filemap_read() check that the offset is positive or zero Cleanups: - Various typo fixes - Cleanup wbc_attach_fdatawrite_inode() - Add __releases annotation to wbc_attach_and_unlock_inode() - Add hugetlbfs tracepoints - Fix various vfs kernel doc parameters - Remove obsolete TODO comment from io_cancel() - Convert wbc_account_cgroup_owner() to take a folio - Fix comments for BANDWITH_INTERVAL and wb_domain_writeout_add() - Reorder struct posix_acl to save 8 bytes - Annotate struct posix_acl with __counted_by() - Replace one-element array with flexible array member in freevxfs - Use idiomatic atomic64_inc_return() in alloc_mnt_ns()" * tag 'vfs-6.13.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (35 commits) statmount: retrieve security mount options vfs: make evict() use smp_mb__after_spinlock instead of smp_mb statmount: add flag to retrieve unescaped options fs: add the ability for statmount() to report the sb_source writeback: wbc_attach_fdatawrite_inode out of line writeback: add a __releases annoation to wbc_attach_and_unlock_inode fs: add the ability for statmount() to report the fs_subtype fs: don't let statmount return empty strings fs:aio: Remove TODO comment suggesting hash or array usage in io_cancel() hfsplus: don't query the device logical block size multiple times freevxfs: Replace one-element array with flexible array member fs: optimize acl_permission_check() initramfs: avoid filename buffer overrun fs/writeback: convert wbc_account_cgroup_owner to take a folio acl: Annotate struct posix_acl with __counted_by() acl: Realign struct posix_acl to save 8 bytes epoll: Add synchronous wakeup support for ep_poll_callback coredump: add cond_resched() to dump_user_range mm/page-writeback.c: Fix comment of wb_domain_writeout_add() mm/page-writeback.c: Update comment for BANDWIDTH_INTERVAL ...
2024-11-13ext4: prevent an infinite loop in the lazyinit threadMathieu Othacehe1-7/+13
Use ktime_get_ns instead of ktime_get_real_ns when computing the lr_timeout not to be affected by system time jumps. Use a boolean instead of the MAX_JIFFY_OFFSET value to determine whether the next_wakeup value has been set. Comparing elr->lr_next_sched to MAX_JIFFY_OFFSET can cause the lazyinit thread to loop indefinitely. Co-developed-by: Lukas Skupinski <lukas.skupinski@landisgyr.com> Signed-off-by: Lukas Skupinski <lukas.skupinski@landisgyr.com> Signed-off-by: Mathieu Othacehe <othacehe@gnu.org> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20241106134741.26948-2-othacehe@gnu.org Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-11-13ext4: use struct_size() to improve ext4_htree_store_dirent()Thorsten Blum1-3/+2
Inline and use struct_size() to calculate the number of bytes to allocate for new_fn and remove the local variable len. Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev> Link: https://patch.msgid.link/20241105103353.11590-2-thorsten.blum@linux.dev Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-11-13ext4: annotate struct fname with __counted_by()Thorsten Blum1-1/+1
Add the __counted_by compiler attribute to the flexible array member name to improve access bounds-checking via CONFIG_UBSAN_BOUNDS and CONFIG_FORTIFY_SOURCE. Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev> Link: https://patch.msgid.link/20241105101813.10864-2-thorsten.blum@linux.dev Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-11-13ext4: use str_yes_no() helper functionThorsten Blum1-1/+1
Remove hard-coded strings by using the str_yes_no() helper function. Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev> Link: https://patch.msgid.link/20241021100056.5521-2-thorsten.blum@linux.dev Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-11-13ext4: prevent delalloc to nodelalloc on remountNicolas Bretz1-0/+7
Implemented the suggested solution mentioned in the bug https://bugzilla.kernel.org/show_bug.cgi?id=218820 Preventing the disabling of delayed allocation mode on remount. delalloc to nodelalloc not permitted anymore nodelalloc to delalloc permitted, not affected Signed-off-by: Nicolas Bretz <bretznic@gmail.com> Link: https://patch.msgid.link/20241014034143.59779-1-bretznic@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-11-13ext4: cleanup variable name in ext4_fc_del()Dan Carpenter1-2/+2
The variables "&EXT4_SB(inode->i_sb)->s_fc_lock" and "&sbi->s_fc_lock" are the same lock. This function uses a mix of both, which is a bit unsightly and confuses Smatch. Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org> Link: https://patch.msgid.link/96008557-8ff4-44cc-b5e3-ce242212f1a3@stanley.mountain Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-11-13ext4: use string choices helpersR Sundar1-1/+1
Use string choice helpers for better readability and to fix cocci warning Reported-by: kernel test robot <lkp@intel.com> Reported-by: Julia Lawall <julia.lawall@inria.fr> Closes: https://lore.kernel.org/r/202410062256.BoynX3c2-lkp@intel.com/ Signed-off-by: R Sundar <prosunofficial@gmail.com> Link: https://patch.msgid.link/20241007172006.83339-1-prosunofficial@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-11-13ext4: use ERR_CAST to return an error-valued pointerYu Jiaoliang1-2/+2
Instead of directly casting and returning an error-valued pointer, use ERR_CAST to make the error handling more explicit and improve code clarity. Signed-off-by: Yu Jiaoliang <yujiaoliang@vivo.com> Link: https://patch.msgid.link/20240920021440.1959243-1-yujiaoliang@vivo.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-11-13ext4: partial zero eof block on unaligned inode size extensionBrian Foster2-16/+42
Using mapped writes, it's technically possible to expose stale post-eof data on a truncate up operation. Consider the following example: $ xfs_io -fc "pwrite 0 2k" -c "mmap 0 4k" -c "mwrite 2k 2k" \ -c "truncate 8k" -c "pread -v 2k 16" <file> ... 00000800: 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 XXXXXXXXXXXXXXXX ... This shows that the post-eof data written via mwrite lands within EOF after a truncate up. While this is deliberate of the test case, behavior is somewhat unpredictable because writeback does post-eof zeroing, and writeback can occur at any time in the background. For example, an fsync inserted between the mwrite and truncate causes the subsequent read to instead return zeroes. This basically means that there is a race window in this situation between any subsequent extending operation and writeback that dictates whether post-eof data is exposed to the file or zeroed. To prevent this problem, perform partial block zeroing as part of the various inode size extending operations that are susceptible to it. For truncate extension, zero around the original eof similar to how truncate down does partial zeroing of the new eof. For extension via writes and fallocate related operations, zero the newly exposed range of the file to cover any partial zeroing that must occur at the original and new eof blocks. Signed-off-by: Brian Foster <bfoster@redhat.com> Link: https://patch.msgid.link/20240919160741.208162-2-bfoster@redhat.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-11-13ext4: disambiguate the return value of ext4_dio_write_end_io()Jinliang Zheng1-2/+3
The commit 91562895f803 ("ext4: properly sync file size update after O_SYNC direct IO") causes confusion about the meaning of the return value of ext4_dio_write_end_io(). Specifically, when the ext4_handle_inode_extension() operation succeeds, ext4_dio_write_end_io() directly returns count instead of 0. This does not cause a bug in the current kernel, but the semantics of the return value of the ext4_dio_write_end_io() function are wrong, which is likely to introduce bugs in the future code evolution. Signed-off-by: Jinliang Zheng <alexjlzheng@tencent.com> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Link: https://patch.msgid.link/20240919082539.381626-1-alexjlzheng@tencent.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-11-13ext4: pass write-hint for buffered IOj.xia1-1/+3
Commit 449813515d3e ("block, fs: Restore the per-bio/request data lifetime fields") restored write-hint support in ext4. But that is applicable only for direct IO. This patch supports passing write-hint for buffered IO from ext4 file system to block layer by filling bi_write_hint of struct bio in io_submit_add_bh(). Signed-off-by: j.xia <j.xia@samsung.com> Link: https://patch.msgid.link/20240919020341.2657646-1-j.xia@samsung.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-11-13ext4: fix race in buffer_head read fault injectionLong Li10-29/+29
When I enabled ext4 debug for fault injection testing, I encountered the following warning: EXT4-fs error (device sda): ext4_read_inode_bitmap:201: comm fsstress: Cannot read inode bitmap - block_group = 8, inode_bitmap = 1051 WARNING: CPU: 0 PID: 511 at fs/buffer.c:1181 mark_buffer_dirty+0x1b3/0x1d0 The root cause of the issue lies in the improper implementation of ext4's buffer_head read fault injection. The actual completion of buffer_head read and the buffer_head fault injection are not atomic, which can lead to the uptodate flag being cleared on normally used buffer_heads in race conditions. [CPU0] [CPU1] [CPU2] ext4_read_inode_bitmap ext4_read_bh() <bh read complete> ext4_read_inode_bitmap if (buffer_uptodate(bh)) return bh jbd2_journal_commit_transaction __jbd2_journal_refile_buffer __jbd2_journal_unfile_buffer __jbd2_journal_temp_unlink_buffer ext4_simulate_fail_bh() clear_buffer_uptodate mark_buffer_dirty <report warning> WARN_ON_ONCE(!buffer_uptodate(bh)) The best approach would be to perform fault injection in the IO completion callback function, rather than after IO completion. However, the IO completion callback function cannot get the fault injection code in sb. Fix it by passing the result of fault injection into the bh read function, we simulate faults within the bh read function itself. This requires adding an extra parameter to the bh read functions that need fault injection. Fixes: 46f870d690fe ("ext4: simulate various I/O and checksum errors when reading metadata") Signed-off-by: Long Li <leo.lilong@huawei.com> Link: https://patch.msgid.link/20240906091746.510163-1-leo.lilong@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-11-13ext4: don't pass full mapping flags to ext4_es_insert_extent()Zhang Yi4-10/+11
When converting a delalloc extent in ext4_es_insert_extent(), since we only want to pass the info of whether the quota has already been claimed if the allocation is a direct allocation from ext4_map_create_blocks(), there is no need to pass full mapping flags, so changes to just pass whether the EXT4_GET_BLOCKS_DELALLOC_RESERVE bit is set. Suggested-by: Jan Kara <jack@suse.cz> Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20240906061401.2980330-1-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-11-13ext4: mark ctx_*_flags() with __maybe_unusedAndy Shevchenko1-4/+4
When ctx_set_flags() is unused, it prevents kernel builds with clang, `make W=1` and CONFIG_WERROR=y: .../ext4/super.c:2120:1: error: unused function 'ctx_set_flags' [-Werror,-Wunused-function] 2120 | EXT4_SET_CTX(flags); /* set only */ | ^~~~~~~~~~~~~~~~~~~ Fix this by marking ctx_*_flags() with __maybe_unused (mark both for the sake of symmetry). See also commit 6863f5643dd7 ("kbuild: allow Clang to find unused static inline functions for W=1 build"). Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Link: https://patch.msgid.link/20240905163229.140522-1-andriy.shevchenko@linux.intel.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-11-13ext4: return error on syncfs after shutdownAmir Goldstein1-1/+1
This is the logic behavior and one that we would like to verify using a generic fstest similar to xfs/546. Link: https://lore.kernel.org/fstests/20240830152648.GE6216@frogsfrogsfrogs/ Suggested-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Amir Goldstein <amir73il@gmail.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20240904084657.1062243-1-amir73il@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-11-13fs: ext4: Don't use CMA for buffer_headZhaoyang Huang1-1/+8
cma_alloc() keep failed in our system which thanks to a jh->bh->b_page can not be migrated out of CMA area[1] as the jh has one cp_transaction pending on it because of j_free > j_max_transaction_buffers[2][3][4][5][6]. We temporarily solve this by launching jbd2_log_do_checkpoint forcefully somewhere. Since journal is common mechanism to all JFSs and cp_transaction has a little fewer opportunity to be launched, the cma_alloc() could be affected under the same scenario. This patch would like to have buffer_head of ext4 not use CMA pages when doing sb_getblk. [1] crash_arm64_v8.0.4++> kmem -p|grep ffffff808f0aa150(sb->s_bdev->bd_inode->i_mapping) fffffffe01a51c00 e9470000 ffffff808f0aa150 3 2 8000000008020 lru,private fffffffe03d189c0 174627000 ffffff808f0aa150 4 2 2004000000008020 lru,private fffffffe03d88e00 176238000 ffffff808f0aa150 3f9 2 2008000000008020 lru,private fffffffe03d88e40 176239000 ffffff808f0aa150 6 2 2008000000008020 lru,private fffffffe03d88e80 17623a000 ffffff808f0aa150 5 2 2008000000008020 lru,private fffffffe03d88ec0 17623b000 ffffff808f0aa150 1 2 2008000000008020 lru,private fffffffe03d88f00 17623c000 ffffff808f0aa150 0 2 2008000000008020 lru,private fffffffe040e6540 183995000 ffffff808f0aa150 3f4 2 2004000000008020 lru,private [2] page -> buffer_head crash_arm64_v8.0.4++> struct page.private fffffffe01a51c00 -x private = 0xffffff802fca0c00 [3] buffer_head -> journal_head crash_arm64_v8.0.4++> struct buffer_head.b_private 0xffffff802fca0c00 b_private = 0xffffff8041338e10, [4] journal_head -> b_cp_transaction crash_arm64_v8.0.4++> struct journal_head.b_cp_transaction 0xffffff8041338e10 -x b_cp_transaction = 0xffffff80410f1900, [5] transaction_t -> journal crash_arm64_v8.0.4++> struct transaction_t.t_journal 0xffffff80410f1900 -x t_journal = 0xffffff80e70f3000, [6] j_free & j_max_transaction_buffers crash_arm64_v8.0.4++> struct journal_t.j_free,j_max_transaction_buffers 0xffffff80e70f3000 -x j_free = 0x3f1, j_max_transaction_buffers = 0x100, Suggested-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Zhaoyang Huang <zhaoyang.huang@unisoc.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20240904075300.1148836-1-zhaoyang.huang@unisoc.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-11-13ext4: simplify if conditionJiapeng Chong1-3/+1
The if condition !A || A && B can be simplified to !A || B. ./fs/ext4/fast_commit.c:362:21-23: WARNING !A || A && B is equivalent to !A || B. Reported-by: Abaci Robot <abaci@linux.alibaba.com> Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=9837 Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20240830071713.40565-1-jiapeng.chong@linux.alibaba.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-11-13ext4: fix FS_IOC_GETFSMAP handlingTheodore Ts'o3-5/+68
The original implementation ext4's FS_IOC_GETFSMAP handling only worked when the range of queried blocks included at least one free (unallocated) block range. This is because how the metadata blocks were emitted was as a side effect of ext4_mballoc_query_range() calling ext4_getfsmap_datadev_helper(), and that function was only called when a free block range was identified. As a result, this caused generic/365 to fail. Fix this by creating a new function ext4_getfsmap_meta_helper() which gets called so that blocks before the first free block range in a block group can get properly reported. Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@vger.kernel.org
2024-11-13ext4: WARN if a full dir leaf block has only one dentryBaokun Li1-2/+12
The maximum length of a filename is 255 and the minimum block size is 1024, so it is always guaranteed that the number of entries is greater than or equal to 2 when do_split() is called. So unless ext4_dx_add_entry() and make_indexed_dir() or some other functions are buggy, 'split == 0' will not occur. Setting 'continued' to 0 in this case masks the problem that the file system has become corrupted, even though it prevents possible out-of-bounds access. Hence WARN_ON_ONCE() is used to check if 'split' is 0, and if it is then warns and returns an error to abort split. Suggested-by: Theodore Ts'o <tytso@mit.edu> Link: https://lore.kernel.org/r/20240823160518.GA424729@mit.edu Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20241008121152.3771906-1-libaokun@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-11-13ext4: show the default enabled prefetch_block_bitmaps optionBaokun Li1-0/+3
After commit 21175ca434c5 ("ext4: make prefetch_block_bitmaps default"), we enable 'prefetch_block_bitmaps' by default, but this is not shown in the '/proc/fs/ext4/sdx/options' procfs interface. This makes it impossible to distinguish whether the feature is enabled by default or not, so 'prefetch_block_bitmaps' is shown in the 'options' procfs interface when prefetch_block_bitmaps is enabled by default. This makes it easy to notice changes to the default mount options between versions through the '/proc/fs/ext4/sdx/options' procfs interface. Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20241008120134.3758097-1-libaokun@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-11-06ext4: Do not fallback to buffered-io for DIO atomic writeRitesh Harjani (IBM)2-5/+29
atomic writes is currently only supported for single fsblock and only for direct-io. We should not return -ENOTBLK for atomic writes since we want the atomic write request to either complete fully or fail otherwise. Hence, we should never fallback to buffered-io in case of DIO atomic write requests. Let's also catch if this ever happens by adding some WARN_ON_ONCE before buffered-io handling for direct-io atomic writes. More details of the discussion [1]. While at it let's add an inline helper ext4_want_directio_fallback() which simplifies the logic checks and inherently fixes condition on when to return -ENOTBLK which otherwise was always returning true for any write or directio in ext4_iomap_end(). It was ok since ext4 only supports direct-io via iomap. [1]: https://lore.kernel.org/linux-xfs/cover.1729825985.git.ritesh.list@gmail.com/T/#m9dbecc11bed713ed0d7a486432c56b105b555f04 Suggested-by: Darrick J. Wong <djwong@kernel.org> # inline helper Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz>
2024-11-06ext4: Support setting FMODE_CAN_ATOMIC_WRITERitesh Harjani (IBM)1-0/+3
FS needs to add the fmode capability in order to support atomic writes during file open (refer kiocb_set_rw_flags()). Set this capability on a regular file if ext4 can do atomic write. Reviewed-by: John Garry <john.g.garry@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz>
2024-11-06ext4: Check for atomic writes support in write iterRitesh Harjani (IBM)1-0/+14
Let's validate the given constraints for atomic write request. Otherwise it will fail with -EINVAL. Currently atomic write is only supported on DIO, so for buffered-io it will return -EOPNOTSUPP. Reviewed-by: John Garry <john.g.garry@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz>
2024-11-06ext4: Add statx support for atomic writesRitesh Harjani (IBM)3-0/+53
This patch adds base support for atomic writes via statx getattr. On bs < ps systems, we can create FS with say bs of 16k. That means both atomic write min and max unit can be set to 16k for supporting atomic writes. Co-developed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz>
2024-11-03fdget(), more trivial conversionsAl Viro1-14/+7
all failure exits prior to fdget() leave the scope, all matching fdput() are immediately followed by leaving the scope. [xfs_ioc_commit_range() chunk moved here as well] Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-10-31ext4: avoid remount errors with 'abort' mount optionJan Kara1-3/+8
When we remount filesystem with 'abort' mount option while changing other mount options as well (as is LTP test doing), we can return error from the system call after commit d3476f3dad4a ("ext4: don't set SB_RDONLY after filesystem errors") because the application of mount option changes detects shutdown filesystem and refuses to do anything. The behavior of application of other mount options in presence of 'abort' mount option is currently rather arbitary as some mount option changes are handled before 'abort' and some after it. Move aborting of the filesystem to the end of remount handling so all requested changes are properly applied before the filesystem is shutdown to have a reasonably consistent behavior. Fixes: d3476f3dad4a ("ext4: don't set SB_RDONLY after filesystem errors") Reported-by: Jan Stancek <jstancek@redhat.com> Link: https://lore.kernel.org/all/Zvp6L+oFnfASaoHl@t14s Signed-off-by: Jan Kara <jack@suse.cz> Tested-by: Jan Stancek <jstancek@redhat.com> Link: https://patch.msgid.link/20241004221556.19222-1-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-10-31ext4: supress data-race warnings in ext4_free_inodes_{count,set}()Jeongjun Park1-4/+4
find_group_other() and find_group_orlov() read *_lo, *_hi with ext4_free_inodes_count without additional locking. This can cause data-race warning, but since the lock is held for most writes and free inodes value is generally not a problem even if it is incorrect, it is more appropriate to use READ_ONCE()/WRITE_ONCE() than to add locking. ================================================================== BUG: KCSAN: data-race in ext4_free_inodes_count / ext4_free_inodes_set write to 0xffff88810404300e of 2 bytes by task 6254 on cpu 1: ext4_free_inodes_set+0x1f/0x80 fs/ext4/super.c:405 __ext4_new_inode+0x15ca/0x2200 fs/ext4/ialloc.c:1216 ext4_symlink+0x242/0x5a0 fs/ext4/namei.c:3391 vfs_symlink+0xca/0x1d0 fs/namei.c:4615 do_symlinkat+0xe3/0x340 fs/namei.c:4641 __do_sys_symlinkat fs/namei.c:4657 [inline] __se_sys_symlinkat fs/namei.c:4654 [inline] __x64_sys_symlinkat+0x5e/0x70 fs/namei.c:4654 x64_sys_call+0x1dda/0x2d60 arch/x86/include/generated/asm/syscalls_64.h:267 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0x54/0x120 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x76/0x7e read to 0xffff88810404300e of 2 bytes by task 6257 on cpu 0: ext4_free_inodes_count+0x1c/0x80 fs/ext4/super.c:349 find_group_other fs/ext4/ialloc.c:594 [inline] __ext4_new_inode+0x6ec/0x2200 fs/ext4/ialloc.c:1017 ext4_symlink+0x242/0x5a0 fs/ext4/namei.c:3391 vfs_symlink+0xca/0x1d0 fs/namei.c:4615 do_symlinkat+0xe3/0x340 fs/namei.c:4641 __do_sys_symlinkat fs/namei.c:4657 [inline] __se_sys_symlinkat fs/namei.c:4654 [inline] __x64_sys_symlinkat+0x5e/0x70 fs/namei.c:4654 x64_sys_call+0x1dda/0x2d60 arch/x86/include/generated/asm/syscalls_64.h:267 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0x54/0x120 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x76/0x7e Cc: stable@vger.kernel.org Signed-off-by: Jeongjun Park <aha310510@gmail.com> Reviewed-by: Andreas Dilger <adilger@dilger.ca> Link: https://patch.msgid.link/20241003125337.47283-1-aha310510@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-10-31ext4: Call ext4_journal_stop(handle) only once in ext4_dio_write_iter()Markus Elfring1-5/+2
An ext4_journal_stop(handle) call was immediately used after a return value check for a ext4_orphan_add() call in this function implementation. Thus call such a function only once instead directly before the check. This issue was transformed by using the Coccinelle software. Signed-off-by: Markus Elfring <elfring@users.sourceforge.net> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/cf895072-43cf-412c-bced-8268498ad13e@web.de Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-10-28ext4: Use generic_ci_validate_strict_name helperAndré Almeida1-4/+1
Use the helper function to check the requirements for casefold directories using strict encoding. Suggested-by: Gabriel Krisman Bertazi <krisman@suse.de> Signed-off-by: André Almeida <andrealmeid@igalia.com> Link: https://lore.kernel.org/r/20241021-tonyk-tmpfs-v8-2-f443d5814194@igalia.com Acked-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-10-28fs/writeback: convert wbc_account_cgroup_owner to take a folioPankaj Raghav1-1/+1
Most of the callers of wbc_account_cgroup_owner() are converting a folio to page before calling the function. wbc_account_cgroup_owner() is converting the page back to a folio to call mem_cgroup_css_from_folio(). Convert wbc_account_cgroup_owner() to take a folio instead of a page, and convert all callers to pass a folio directly except f2fs. Convert the page to folio for all the callers from f2fs as they were the only callers calling wbc_account_cgroup_owner() with a page. As f2fs is already in the process of converting to folios, these call sites might also soon be calling wbc_account_cgroup_owner() with a folio directly in the future. No functional changes. Only compile tested. Signed-off-by: Pankaj Raghav <p.raghav@samsung.com> Link: https://lore.kernel.org/r/20240926140121.203821-1-kernel@pankajraghav.com Acked-by: David Sterba <dsterba@suse.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-10-10Merge patch series "timekeeping/fs: multigrain timestamp redux"Christian Brauner1-1/+1
Jeff Layton <jlayton@kernel.org> says: The VFS has always used coarse-grained timestamps when updating the ctime and mtime after a change. This has the benefit of allowing filesystems to optimize away a lot metadata updates, down to around 1 per jiffy, even when a file is under heavy writes. Unfortunately, this has always been an issue when we're exporting via NFSv3, which relies on timestamps to validate caches. A lot of changes can happen in a jiffy, so timestamps aren't sufficient to help the client decide when to invalidate the cache. Even with NFSv4, a lot of exported filesystems don't properly support a change attribute and are subject to the same problems with timestamp granularity. Other applications have similar issues with timestamps (e.g backup applications). If we were to always use fine-grained timestamps, that would improve the situation, but that becomes rather expensive, as the underlying filesystem would have to log a lot more metadata updates. What we need is a way to only use fine-grained timestamps when they are being actively queried. Use the (unused) top bit in inode->i_ctime_nsec as a flag that indicates whether the current timestamps have been queried via stat() or the like. When it's set, we allow the kernel to use a fine-grained timestamp iff it's necessary to make the ctime show a different value. This solves the problem of being able to distinguish the timestamp between updates, but introduces a new problem: it's now possible for a file being changed to get a fine-grained timestamp. A file that is altered just a bit later can then get a coarse-grained one that appears older than the earlier fine-grained time. This violates timestamp ordering guarantees. To remedy this, keep a global monotonic atomic64_t value that acts as a timestamp floor. When we go to stamp a file, we first get the latter of the current floor value and the current coarse-grained time. If the inode ctime hasn't been queried then we just attempt to stamp it with that value. If it has been queried, then first see whether the current coarse time is later than the existing ctime. If it is, then we accept that value. If it isn't, then we get a fine-grained time and try to swap that into the global floor. Whether that succeeds or fails, we take the resulting floor time, convert it to realtime and try to swap that into the ctime. We take the result of the ctime swap whether it succeeds or fails, since either is just as valid. Filesystems can opt into this by setting the FS_MGTIME fstype flag. Others should be unaffected (other than being subject to the same floor value as multigrain filesystems). * patches from https://lore.kernel.org/r/20241002-mgtime-v10-0-d1c4717f5284@kernel.org: tmpfs: add support for multigrain timestamps btrfs: convert to multigrain timestamps ext4: switch to multigrain timestamps xfs: switch to multigrain timestamps Documentation: add a new file documenting multigrain timestamps fs: add percpu counters for significant multigrain timestamp events fs: tracepoints around multigrain timestamp events fs: handle delegated timestamps in setattr_copy_mgtime fs: have setattr_copy handle multigrain timestamps appropriately fs: add infrastructure for multigrain timestamps Link: https://lore.kernel.org/r/20241002-mgtime-v10-0-d1c4717f5284@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-10-10ext4: switch to multigrain timestampsJeff Layton1-1/+1
Enable multigrain timestamps, which should ensure that there is an apparent change to the timestamp whenever it has been written after being actively observed via getattr. For ext4, we only need to enable the FS_MGTIME flag. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Jan Kara <jack@suse.cz> Tested-by: Randy Dunlap <rdunlap@infradead.org> # documentation bits Signed-off-by: Jeff Layton <jlayton@kernel.org> Link: https://lore.kernel.org/r/20241002-mgtime-v10-10-d1c4717f5284@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-10-05ext4: fix off by one issue in alloc_flex_gd()Baokun Li1-8/+10
Wesley reported an issue: ================================================================== EXT4-fs (dm-5): resizing filesystem from 7168 to 786432 blocks ------------[ cut here ]------------ kernel BUG at fs/ext4/resize.c:324! CPU: 9 UID: 0 PID: 3576 Comm: resize2fs Not tainted 6.11.0+ #27 RIP: 0010:ext4_resize_fs+0x1212/0x12d0 Call Trace: __ext4_ioctl+0x4e0/0x1800 ext4_ioctl+0x12/0x20 __x64_sys_ioctl+0x99/0xd0 x64_sys_call+0x1206/0x20d0 do_syscall_64+0x72/0x110 entry_SYSCALL_64_after_hwframe+0x76/0x7e ================================================================== While reviewing the patch, Honza found that when adjusting resize_bg in alloc_flex_gd(), it was possible for flex_gd->resize_bg to be bigger than flexbg_size. The reproduction of the problem requires the following: o_group = flexbg_size * 2 * n; o_size = (o_group + 1) * group_size; n_group: [o_group + flexbg_size, o_group + flexbg_size * 2) o_size = (n_group + 1) * group_size; Take n=0,flexbg_size=16 as an example: last:15 |o---------------|--------------n-| o_group:0 resize to n_group:30 The corresponding reproducer is: img=test.img rm -f $img truncate -s 600M $img mkfs.ext4 -F $img -b 1024 -G 16 8M dev=`losetup -f --show $img` mkdir -p /tmp/test mount $dev /tmp/test resize2fs $dev 248M Delete the problematic plus 1 to fix the issue, and add a WARN_ON_ONCE() to prevent the issue from happening again. [ Note: another reproucer which this commit fixes is: img=test.img rm -f $img truncate -s 25MiB $img mkfs.ext4 -b 4096 -E nodiscard,lazy_itable_init=0,lazy_journal_init=0 $img truncate -s 3GiB $img dev=`losetup -f --show $img` mkdir -p /tmp/test mount $dev /tmp/test resize2fs $dev 3G umount $dev losetup -d $dev -- TYT ] Reported-by: Wesley Hershberger <wesley.hershberger@canonical.com> Closes: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2081231 Reported-by: Stéphane Graber <stgraber@stgraber.org> Closes: https://lore.kernel.org/all/20240925143325.518508-1-aleksandr.mikhalitsyn@canonical.com/ Tested-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com> Tested-by: Eric Sandeen <sandeen@redhat.com> Fixes: 665d3e0af4d3 ("ext4: reduce unnecessary memory allocation in alloc_flex_gd()") Cc: stable@vger.kernel.org Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20240927133329.1015041-1-libaokun@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-10-05ext4: mark fc as ineligible using an handle in ext4_xattr_set()Luis Henriques (SUSE)1-1/+2
Calling ext4_fc_mark_ineligible() with a NULL handle is racy and may result in a fast-commit being done before the filesystem is effectively marked as ineligible. This patch moves the call to this function so that an handle can be used. If a transaction fails to start, then there's not point in trying to mark the filesystem as ineligible, and an error will eventually be returned to user-space. Suggested-by: Jan Kara <jack@suse.cz> Signed-off-by: Luis Henriques (SUSE) <luis.henriques@linux.dev> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20240923104909.18342-3-luis.henriques@linux.dev Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org
2024-10-05ext4: use handle to mark fc as ineligible in __track_dentry_update()Luis Henriques (SUSE)1-8/+11
Calling ext4_fc_mark_ineligible() with a NULL handle is racy and may result in a fast-commit being done before the filesystem is effectively marked as ineligible. This patch fixes the calls to this function in __track_dentry_update() by adding an extra parameter to the callback used in ext4_fc_track_template(). Suggested-by: Jan Kara <jack@suse.cz> Signed-off-by: Luis Henriques (SUSE) <luis.henriques@linux.dev> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20240923104909.18342-2-luis.henriques@linux.dev Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org
2024-09-23Merge tag 'pull-stable-struct_fd' of ↵Linus Torvalds1-3/+3
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull 'struct fd' updates from Al Viro: "Just the 'struct fd' layout change, with conversion to accessor helpers" * tag 'pull-stable-struct_fd' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: add struct fd constructors, get rid of __to_fd() struct fd: representation change introduce fd_file(), convert all accessors to it.
2024-09-21Merge tag 'ext4_for_linus-6.12-rc1' of ↵Linus Torvalds21-921/+930
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 updates from Ted Ts'o: "Lots of cleanups and bug fixes this cycle, primarily in the block allocation, extent management, fast commit, and journalling" * tag 'ext4_for_linus-6.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (93 commits) ext4: convert EXT4_B2C(sbi->s_stripe) users to EXT4_NUM_B2C ext4: check stripe size compatibility on remount as well ext4: fix i_data_sem unlock order in ext4_ind_migrate() ext4: remove the special buffer dirty handling in do_journal_get_write_access ext4: fix a potential assertion failure due to improperly dirtied buffer ext4: hoist ext4_block_write_begin and replace the __block_write_begin ext4: persist the new uptodate buffers in ext4_journalled_zero_new_buffers ext4: dax: keep orphan list before truncate overflow allocated blocks ext4: fix error message when rejecting the default hash ext4: save unnecessary indentation in ext4_ext_create_new_leaf() ext4: make some fast commit functions reuse extents path ext4: refactor ext4_swap_extents() to reuse extents path ext4: get rid of ppath in convert_initialized_extent() ext4: get rid of ppath in ext4_ext_handle_unwritten_extents() ext4: get rid of ppath in ext4_ext_convert_to_initialized() ext4: get rid of ppath in ext4_convert_unwritten_extents_endio() ext4: get rid of ppath in ext4_split_convert_extents() ext4: get rid of ppath in ext4_split_extent() ext4: get rid of ppath in ext4_force_split_extent_at() ext4: get rid of ppath in ext4_split_extent_at() ...
2024-09-16Merge tag 'vfs-6.12.file' of ↵Linus Torvalds3-25/+34
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs file updates from Christian Brauner: "This is the work to cleanup and shrink struct file significantly. Right now, (focusing on x86) struct file is 232 bytes. After this series struct file will be 184 bytes aka 3 cacheline and a spare 8 bytes for future extensions at the end of the struct. With struct file being as ubiquitous as it is this should make a difference for file heavy workloads and allow further optimizations in the future. - struct fown_struct was embedded into struct file letting it take up 32 bytes in total when really it shouldn't even be embedded in struct file in the first place. Instead, actual users of struct fown_struct now allocate the struct on demand. This frees up 24 bytes. - Move struct file_ra_state into the union containg the cleanup hooks and move f_iocb_flags out of the union. This closes a 4 byte hole we created earlier and brings struct file to 192 bytes. Which means struct file is 3 cachelines and we managed to shrink it by 40 bytes. - Reorder struct file so that nothing crosses a cacheline. I suspect that in the future we will end up reordering some members to mitigate false sharing issues or just because someone does actually provide really good perf data. - Shrinking struct file to 192 bytes is only part of the work. Files use a slab that is SLAB_TYPESAFE_BY_RCU and when a kmem cache is created with SLAB_TYPESAFE_BY_RCU the free pointer must be located outside of the object because the cache doesn't know what part of the memory can safely be overwritten as it may be needed to prevent object recycling. That has the consequence that SLAB_TYPESAFE_BY_RCU may end up adding a new cacheline. So this also contains work to add a new kmem_cache_create_rcu() function that allows the caller to specify an offset where the freelist pointer is supposed to be placed. Thus avoiding the implicit addition of a fourth cacheline. - And finally this removes the f_version member in struct file. The f_version member isn't particularly well-defined. It is mainly used as a cookie to detect concurrent seeks when iterating directories. But it is also abused by some subsystems for completely unrelated things. It is mostly a directory and filesystem specific thing that doesn't really need to live in struct file and with its wonky semantics it really lacks a specific function. For pipes, f_version is (ab)used to defer poll notifications until a write has happened. And struct pipe_inode_info is used by multiple struct files in their ->private_data so there's no chance of pushing that down into file->private_data without introducing another pointer indirection. But pipes don't rely on f_pos_lock so this adds a union into struct file encompassing f_pos_lock and a pipe specific f_pipe member that pipes can use. This union of course can be extended to other file types and is similar to what we do in struct inode already" * tag 'vfs-6.12.file' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (26 commits) fs: remove f_version pipe: use f_pipe fs: add f_pipe ubifs: store cookie in private data ufs: store cookie in private data udf: store cookie in private data proc: store cookie in private data ocfs2: store cookie in private data input: remove f_version abuse ext4: store cookie in private data ext2: store cookie in private data affs: store cookie in private data fs: add generic_llseek_cookie() fs: use must_set_pos() fs: add must_set_pos() fs: add vfs_setpos_cookie() s390: remove unused f_version ceph: remove unused f_version adi: remove unused f_version mm: Removed @freeptr_offset to prevent doc warning ...
2024-09-09ext4: store cookie in private dataChristian Brauner3-25/+34
Store the cookie to detect concurrent seeks on directories in file->private_data. Link: https://lore.kernel.org/r/20240830-vfs-file-f_version-v1-11-6d3e4816aa7b@kernel.org Acked-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>