summaryrefslogtreecommitdiff
path: root/fs/ext4
AgeCommit message (Collapse)AuthorFilesLines
2025-05-29ext4: remove writable userspace mappings before truncating page cacheZhang Yi3-14/+70
[ Upstream commit 17207d0bb209e8b40f27d7f3f96e82a78af0bf2c ] When zeroing a range of folios on the filesystem which block size is less than the page size, the file's mapped blocks within one page will be marked as unwritten, we should remove writable userspace mappings to ensure that ext4_page_mkwrite() can be called during subsequent write access to these partial folios. Otherwise, data written by subsequent mmap writes may not be saved to disk. $mkfs.ext4 -b 1024 /dev/vdb $mount /dev/vdb /mnt $xfs_io -t -f -c "pwrite -S 0x58 0 4096" -c "mmap -rw 0 4096" \ -c "mwrite -S 0x5a 2048 2048" -c "fzero 2048 2048" \ -c "mwrite -S 0x59 2048 2048" -c "close" /mnt/foo $od -Ax -t x1z /mnt/foo 000000 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 * 000800 59 59 59 59 59 59 59 59 59 59 59 59 59 59 59 59 * 001000 $umount /mnt && mount /dev/vdb /mnt $od -Ax -t x1z /mnt/foo 000000 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 * 000800 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 * 001000 Fix this by introducing ext4_truncate_page_cache_block_range() to remove writable userspace mappings when truncating a partial folio range. Additionally, move the journal data mode-specific handlers and truncate_pagecache_range() into this function, allowing it to serve as a common helper that correctly manages the page cache in preparation for block range manipulations. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Link: https://patch.msgid.link/20241220011637.1157197-2-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-05-29ext4: don't write back data before punch hole in nojournal modeZhang Yi1-13/+5
[ Upstream commit 43d0105e2c7523cc6b14cad65e2044e829c0a07a ] There is no need to write back all data before punching a hole in non-journaled mode since it will be dropped soon after removing space. Therefore, the call to filemap_write_and_wait_range() can be eliminated. Besides, similar to ext4_zero_range(), we must address the case of partially punched folios when block size < page size. It is essential to remove writable userspace mappings to ensure that the folio can be faulted again during subsequent mmap write access. In journaled mode, we need to write dirty pages out before discarding page cache in case of crash before committing the freeing data transaction, which could expose old, stale data, even if synchronization has been performed. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Link: https://patch.msgid.link/20241220011637.1157197-4-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-05-29ext4: do not convert the unwritten extents if data writeback failsBaokun Li2-3/+16
[ Upstream commit e856f93e0fb249955f7d5efb18fe20500a9ccc6d ] When dioread_nolock is turned on (the default), it will convert unwritten extents to written at ext4_end_io_end(), even if the data writeback fails. It leads to the possibility that stale data may be exposed when the physical block corresponding to the file data is read-only (i.e., writes return -EIO, but reads are normal). Therefore a new ext4_io_end->flags EXT4_IO_END_FAILED is added, which indicates that some bio write-back failed in the current ext4_io_end. When this flag is set, the unwritten to written conversion is no longer performed. Users can read the data normally until the caches are dropped, after that, the failed extents can only be read to all 0. Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Link: https://patch.msgid.link/20250122110533.4116662-3-libaokun@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-05-29ext4: reject the 'data_err=abort' option in nojournal modeBaokun Li1-0/+12
[ Upstream commit 26343ca0df715097065b02a6cddb4a029d5b9327 ] data_err=abort aborts the journal on I/O errors. However, this option is meaningless if journal is disabled, so it is rejected in nojournal mode to reduce unnecessary checks. Also, this option is ignored upon remount. Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20250122110533.4116662-4-libaokun@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-05-29ext4: reorder capability check lastChristian Göttsche1-2/+2
[ Upstream commit 1b419c889c0767a5b66d0a6c566cae491f1cb0f7 ] capable() calls refer to enabled LSMs whether to permit or deny the request. This is relevant in connection with SELinux, where a capability check results in a policy decision and by default a denial message on insufficient permission is issued. It can lead to three undesired cases: 1. A denial message is generated, even in case the operation was an unprivileged one and thus the syscall succeeded, creating noise. 2. To avoid the noise from 1. the policy writer adds a rule to ignore those denial messages, hiding future syscalls, where the task performs an actual privileged operation, leading to hidden limited functionality of that task. 3. To avoid the noise from 1. the policy writer adds a rule to permit the task the requested capability, while it does not need it, violating the principle of least privilege. Signed-off-by: Christian Göttsche <cgzones@googlemail.com> Reviewed-by: Serge Hallyn <serge@hallyn.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20250302160657.127253-2-cgoettsche@seltendoof.de Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-05-29ext4: on a remount, only log the ro or r/w state when it has changedNicolas Bretz1-3/+4
[ Upstream commit d7b0befd09320e3356a75cb96541c030515e7f5f ] A user complained that a message such as: EXT4-fs (nvme0n1p3): re-mounted UUID ro. Quota mode: none. implied that the file system was previously mounted read/write and was now remounted read-only, when it could have been some other mount state that had changed by the "mount -o remount" operation. Fix this by only logging "ro"or "r/w" when it has changed. https://bugzilla.kernel.org/show_bug.cgi?id=219132 Signed-off-by: Nicolas Bretz <bretznic@gmail.com> Link: https://patch.msgid.link/20250319171011.8372-1-bretznic@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-05-29fs/ext4: use sleeping version of sb_find_get_block()Davidlohr Bueso1-1/+2
[ Upstream commit 6e8f57fd09c9fb569d10b2ccc3878155b702591a ] Enable ext4_free_blocks() to use it, which has a cond_resched to begin with. Convert to the new nonatomic flavor to benefit from potential performance benefits and adapt in the future vs migration such that semantics are kept. Suggested-by: Jan Kara <jack@suse.cz> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Davidlohr Bueso <dave@stgolabs.net> Link: https://kdevops.org/ext4/v6.15-rc2.html # [0] Link: https://lore.kernel.org/all/aAAEvcrmREWa1SKF@bombadil.infradead.org/ # [1] Link: https://lore.kernel.org/20250418015921.132400-7-dave@stgolabs.net Tested-by: kdevops@lists.linux.dev Reviewed-by: Luis Chamberlain <mcgrof@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-05-02ext4: make block validity check resistent to sb bh corruptionOjaswin Mujoo2-6/+6
[ Upstream commit ccad447a3d331a239477c281533bacb585b54a98 ] Block validity checks need to be skipped in case they are called for journal blocks since they are part of system's protected zone. Currently, this is done by checking inode->ino against sbi->s_es->s_journal_inum, which is a direct read from the ext4 sb buffer head. If someone modifies this underneath us then the s_journal_inum field might get corrupted. To prevent against this, change the check to directly compare the inode with journal->j_inode. **Slight change in behavior**: During journal init path, check_block_validity etc might be called for journal inode when sbi->s_journal is not set yet. In this case we now proceed with ext4_inode_block_valid() instead of returning early. Since systems zones have not been set yet, it is okay to proceed so we can perform basic checks on the blocks. Suggested-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Link: https://patch.msgid.link/0c06bc9ebfcd6ccfed84a36e79147bf45ff5adc1.1743142920.git.ojaswin@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-04-20ext4: fix off-by-one error in do_splitArtem Sadovnikov1-1/+1
commit 94824ac9a8aaf2fb3c54b4bdde842db80ffa555d upstream. Syzkaller detected a use-after-free issue in ext4_insert_dentry that was caused by out-of-bounds access due to incorrect splitting in do_split. BUG: KASAN: use-after-free in ext4_insert_dentry+0x36a/0x6d0 fs/ext4/namei.c:2109 Write of size 251 at addr ffff888074572f14 by task syz-executor335/5847 CPU: 0 UID: 0 PID: 5847 Comm: syz-executor335 Not tainted 6.12.0-rc6-syzkaller-00318-ga9cda7c0ffed #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/30/2024 Call Trace: <TASK> __dump_stack lib/dump_stack.c:94 [inline] dump_stack_lvl+0x241/0x360 lib/dump_stack.c:120 print_address_description mm/kasan/report.c:377 [inline] print_report+0x169/0x550 mm/kasan/report.c:488 kasan_report+0x143/0x180 mm/kasan/report.c:601 kasan_check_range+0x282/0x290 mm/kasan/generic.c:189 __asan_memcpy+0x40/0x70 mm/kasan/shadow.c:106 ext4_insert_dentry+0x36a/0x6d0 fs/ext4/namei.c:2109 add_dirent_to_buf+0x3d9/0x750 fs/ext4/namei.c:2154 make_indexed_dir+0xf98/0x1600 fs/ext4/namei.c:2351 ext4_add_entry+0x222a/0x25d0 fs/ext4/namei.c:2455 ext4_add_nondir+0x8d/0x290 fs/ext4/namei.c:2796 ext4_symlink+0x920/0xb50 fs/ext4/namei.c:3431 vfs_symlink+0x137/0x2e0 fs/namei.c:4615 do_symlinkat+0x222/0x3a0 fs/namei.c:4641 __do_sys_symlink fs/namei.c:4662 [inline] __se_sys_symlink fs/namei.c:4660 [inline] __x64_sys_symlink+0x7a/0x90 fs/namei.c:4660 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f </TASK> The following loop is located right above 'if' statement. for (i = count-1; i >= 0; i--) { /* is more than half of this entry in 2nd half of the block? */ if (size + map[i].size/2 > blocksize/2) break; size += map[i].size; move++; } 'i' in this case could go down to -1, in which case sum of active entries wouldn't exceed half the block size, but previous behaviour would also do split in half if sum would exceed at the very last block, which in case of having too many long name files in a single block could lead to out-of-bounds access and following use-after-free. Found by Linux Verification Center (linuxtesting.org) with Syzkaller. Cc: stable@vger.kernel.org Fixes: 5872331b3d91 ("ext4: fix potential negative array index in do_split()") Signed-off-by: Artem Sadovnikov <a.sadovnikov@ispras.ru> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20250404082804.2567-3-a.sadovnikov@ispras.ru Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-04-20ext4: don't treat fhandle lookup of ea_inode as FS corruptionJann Horn1-20/+48
[ Upstream commit 642335f3ea2b3fd6dba03e57e01fa9587843a497 ] A file handle that userspace provides to open_by_handle_at() can legitimately contain an outdated inode number that has since been reused for another purpose - that's why the file handle also contains a generation number. But if the inode number has been reused for an ea_inode, check_igot_inode() will notice, __ext4_iget() will go through ext4_error_inode(), and if the inode was newly created, it will also be marked as bad by iget_failed(). This all happens before the point where the inode generation is checked. ext4_error_inode() is supposed to only be used on filesystem corruption; it should not be used when userspace just got unlucky with a stale file handle. So when this happens, let __ext4_iget() just return an error. Fixes: b3e6bcb94590 ("ext4: add EA_INODE checking to ext4_iget()") Signed-off-by: Jann Horn <jannh@google.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20241129-ext4-ignore-ea-fhandle-v1-1-e532c0d1cee0@google.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-04-20ext4: ignore xattrs past endBhupesh1-1/+10
[ Upstream commit c8e008b60492cf6fd31ef127aea6d02fd3d314cd ] Once inside 'ext4_xattr_inode_dec_ref_all' we should ignore xattrs entries past the 'end' entry. This fixes the following KASAN reported issue: ================================================================== BUG: KASAN: slab-use-after-free in ext4_xattr_inode_dec_ref_all+0xb8c/0xe90 Read of size 4 at addr ffff888012c120c4 by task repro/2065 CPU: 1 UID: 0 PID: 2065 Comm: repro Not tainted 6.13.0-rc2+ #11 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014 Call Trace: <TASK> dump_stack_lvl+0x1fd/0x300 ? tcp_gro_dev_warn+0x260/0x260 ? _printk+0xc0/0x100 ? read_lock_is_recursive+0x10/0x10 ? irq_work_queue+0x72/0xf0 ? __virt_addr_valid+0x17b/0x4b0 print_address_description+0x78/0x390 print_report+0x107/0x1f0 ? __virt_addr_valid+0x17b/0x4b0 ? __virt_addr_valid+0x3ff/0x4b0 ? __phys_addr+0xb5/0x160 ? ext4_xattr_inode_dec_ref_all+0xb8c/0xe90 kasan_report+0xcc/0x100 ? ext4_xattr_inode_dec_ref_all+0xb8c/0xe90 ext4_xattr_inode_dec_ref_all+0xb8c/0xe90 ? ext4_xattr_delete_inode+0xd30/0xd30 ? __ext4_journal_ensure_credits+0x5f0/0x5f0 ? __ext4_journal_ensure_credits+0x2b/0x5f0 ? inode_update_timestamps+0x410/0x410 ext4_xattr_delete_inode+0xb64/0xd30 ? ext4_truncate+0xb70/0xdc0 ? ext4_expand_extra_isize_ea+0x1d20/0x1d20 ? __ext4_mark_inode_dirty+0x670/0x670 ? ext4_journal_check_start+0x16f/0x240 ? ext4_inode_is_fast_symlink+0x2f2/0x3a0 ext4_evict_inode+0xc8c/0xff0 ? ext4_inode_is_fast_symlink+0x3a0/0x3a0 ? do_raw_spin_unlock+0x53/0x8a0 ? ext4_inode_is_fast_symlink+0x3a0/0x3a0 evict+0x4ac/0x950 ? proc_nr_inodes+0x310/0x310 ? trace_ext4_drop_inode+0xa2/0x220 ? _raw_spin_unlock+0x1a/0x30 ? iput+0x4cb/0x7e0 do_unlinkat+0x495/0x7c0 ? try_break_deleg+0x120/0x120 ? 0xffffffff81000000 ? __check_object_size+0x15a/0x210 ? strncpy_from_user+0x13e/0x250 ? getname_flags+0x1dc/0x530 __x64_sys_unlinkat+0xc8/0xf0 do_syscall_64+0x65/0x110 entry_SYSCALL_64_after_hwframe+0x67/0x6f RIP: 0033:0x434ffd Code: 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 8 RSP: 002b:00007ffc50fa7b28 EFLAGS: 00000246 ORIG_RAX: 0000000000000107 RAX: ffffffffffffffda RBX: 00007ffc50fa7e18 RCX: 0000000000434ffd RDX: 0000000000000000 RSI: 0000000020000240 RDI: 0000000000000005 RBP: 00007ffc50fa7be0 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000001 R13: 00007ffc50fa7e08 R14: 00000000004bbf30 R15: 0000000000000001 </TASK> The buggy address belongs to the object at ffff888012c12000 which belongs to the cache filp of size 360 The buggy address is located 196 bytes inside of freed 360-byte region [ffff888012c12000, ffff888012c12168) The buggy address belongs to the physical page: page: refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x12c12 head: order:1 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 pincount:0 flags: 0x40(head|node=0|zone=0) page_type: f5(slab) raw: 0000000000000040 ffff888000ad7640 ffffea0000497a00 dead000000000004 raw: 0000000000000000 0000000000100010 00000001f5000000 0000000000000000 head: 0000000000000040 ffff888000ad7640 ffffea0000497a00 dead000000000004 head: 0000000000000000 0000000000100010 00000001f5000000 0000000000000000 head: 0000000000000001 ffffea00004b0481 ffffffffffffffff 0000000000000000 head: 0000000000000002 0000000000000000 00000000ffffffff 0000000000000000 page dumped because: kasan: bad access detected Memory state around the buggy address: ffff888012c11f80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ffff888012c12000: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb > ffff888012c12080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ^ ffff888012c12100: fb fb fb fb fb fb fb fb fb fb fb fb fb fc fc fc ffff888012c12180: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc ================================================================== Reported-by: syzbot+b244bda78289b00204ed@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=b244bda78289b00204ed Suggested-by: Thadeu Lima de Souza Cascardo <cascardo@igalia.com> Signed-off-by: Bhupesh <bhupesh@igalia.com> Link: https://patch.msgid.link/20250128082751.124948-2-bhupesh@igalia.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-04-20ext4: protect ext4_release_dquot against freezingOjaswin Mujoo1-0/+17
[ Upstream commit 530fea29ef82e169cd7fe048c2b7baaeb85a0028 ] Protect ext4_release_dquot against freezing so that we don't try to start a transaction when FS is frozen, leading to warnings. Further, avoid taking the freeze protection if a transaction is already running so that we don't need end up in a deadlock as described in 46e294efc355 ext4: fix deadlock with fs freezing and EA inodes Suggested-by: Jan Kara <jack@suse.cz> Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Reviewed-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20241121123855.645335-3-ojaswin@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-04-10ext4: fix OOB read when checking dotdot dirAcs, Jakub1-0/+3
commit d5e206778e96e8667d3bde695ad372c296dc9353 upstream. Mounting a corrupted filesystem with directory which contains '.' dir entry with rec_len == block size results in out-of-bounds read (later on, when the corrupted directory is removed). ext4_empty_dir() assumes every ext4 directory contains at least '.' and '..' as directory entries in the first data block. It first loads the '.' dir entry, performs sanity checks by calling ext4_check_dir_entry() and then uses its rec_len member to compute the location of '..' dir entry (in ext4_next_entry). It assumes the '..' dir entry fits into the same data block. If the rec_len of '.' is precisely one block (4KB), it slips through the sanity checks (it is considered the last directory entry in the data block) and leaves "struct ext4_dir_entry_2 *de" point exactly past the memory slot allocated to the data block. The following call to ext4_check_dir_entry() on new value of de then dereferences this pointer which results in out-of-bounds mem access. Fix this by extending __ext4_check_dir_entry() to check for '.' dir entries that reach the end of data block. Make sure to ignore the phony dir entries for checksum (by checking name_len for non-zero). Note: This is reported by KASAN as use-after-free in case another structure was recently freed from the slot past the bound, but it is really an OOB read. This issue was found by syzkaller tool. Call Trace: [ 38.594108] BUG: KASAN: slab-use-after-free in __ext4_check_dir_entry+0x67e/0x710 [ 38.594649] Read of size 2 at addr ffff88802b41a004 by task syz-executor/5375 [ 38.595158] [ 38.595288] CPU: 0 UID: 0 PID: 5375 Comm: syz-executor Not tainted 6.14.0-rc7 #1 [ 38.595298] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014 [ 38.595304] Call Trace: [ 38.595308] <TASK> [ 38.595311] dump_stack_lvl+0xa7/0xd0 [ 38.595325] print_address_description.constprop.0+0x2c/0x3f0 [ 38.595339] ? __ext4_check_dir_entry+0x67e/0x710 [ 38.595349] print_report+0xaa/0x250 [ 38.595359] ? __ext4_check_dir_entry+0x67e/0x710 [ 38.595368] ? kasan_addr_to_slab+0x9/0x90 [ 38.595378] kasan_report+0xab/0xe0 [ 38.595389] ? __ext4_check_dir_entry+0x67e/0x710 [ 38.595400] __ext4_check_dir_entry+0x67e/0x710 [ 38.595410] ext4_empty_dir+0x465/0x990 [ 38.595421] ? __pfx_ext4_empty_dir+0x10/0x10 [ 38.595432] ext4_rmdir.part.0+0x29a/0xd10 [ 38.595441] ? __dquot_initialize+0x2a7/0xbf0 [ 38.595455] ? __pfx_ext4_rmdir.part.0+0x10/0x10 [ 38.595464] ? __pfx___dquot_initialize+0x10/0x10 [ 38.595478] ? down_write+0xdb/0x140 [ 38.595487] ? __pfx_down_write+0x10/0x10 [ 38.595497] ext4_rmdir+0xee/0x140 [ 38.595506] vfs_rmdir+0x209/0x670 [ 38.595517] ? lookup_one_qstr_excl+0x3b/0x190 [ 38.595529] do_rmdir+0x363/0x3c0 [ 38.595537] ? __pfx_do_rmdir+0x10/0x10 [ 38.595544] ? strncpy_from_user+0x1ff/0x2e0 [ 38.595561] __x64_sys_unlinkat+0xf0/0x130 [ 38.595570] do_syscall_64+0x5b/0x180 [ 38.595583] entry_SYSCALL_64_after_hwframe+0x76/0x7e Fixes: ac27a0ec112a0 ("[PATCH] ext4: initial copy of files from ext3") Signed-off-by: Jakub Acs <acsjakub@amazon.de> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Andreas Dilger <adilger.kernel@dilger.ca> Cc: linux-ext4@vger.kernel.org Cc: linux-kernel@vger.kernel.org Cc: Mahmoud Adam <mngyadam@amazon.com> Cc: stable@vger.kernel.org Cc: security@kernel.org Link: https://patch.msgid.link/b3ae36a6794c4a01944c7d70b403db5b@amazon.de Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-04-10ext4: don't over-report free space or inodes in statvfsTheodore Ts'o1-10/+17
commit f87d3af7419307ae26e705a2b2db36140db367a2 upstream. This fixes an analogus bug that was fixed in xfs in commit 4b8d867ca6e2 ("xfs: don't over-report free space or inodes in statvfs") where statfs can report misleading / incorrect information where project quota is enabled, and the free space is less than the remaining quota. This commit will resolve a test failure in generic/762 which tests for this bug. Cc: stable@kernel.org Fixes: 689c958cbe6b ("ext4: add project quota support") Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-04-10ext4: avoid journaling sb update on error if journal is destroyingOjaswin Mujoo3-9/+25
[ Upstream commit ce2f26e73783b4a7c46a86e3af5b5c8de0971790 ] Presently we always BUG_ON if trying to start a transaction on a journal marked with JBD2_UNMOUNT, since this should never happen. However, while ltp running stress tests, it was observed that in case of some error handling paths, it is possible for update_super_work to start a transaction after the journal is destroyed eg: (umount) ext4_kill_sb kill_block_super generic_shutdown_super sync_filesystem /* commits all txns */ evict_inodes /* might start a new txn */ ext4_put_super flush_work(&sbi->s_sb_upd_work) /* flush the workqueue */ jbd2_journal_destroy journal_kill_thread journal->j_flags |= JBD2_UNMOUNT; jbd2_journal_commit_transaction jbd2_journal_get_descriptor_buffer jbd2_journal_bmap ext4_journal_bmap ext4_map_blocks ... ext4_inode_error ext4_handle_error schedule_work(&sbi->s_sb_upd_work) /* work queue kicks in */ update_super_work jbd2_journal_start start_this_handle BUG_ON(journal->j_flags & JBD2_UNMOUNT) Hence, introduce a new mount flag to indicate journal is destroying and only do a journaled (and deferred) update of sb if this flag is not set. Otherwise, just fallback to an un-journaled commit. Further, in the journal destroy path, we have the following sequence: 1. Set mount flag indicating journal is destroying 2. force a commit and wait for it 3. flush pending sb updates This sequence is important as it ensures that, after this point, there is no sb update that might be journaled so it is safe to update the sb outside the journal. (To avoid race discussed in 2d01ddc86606) Also, we don't need a similar check in ext4_grp_locked_error since it is only called from mballoc and AFAICT it would be always valid to schedule work here. Fixes: 2d01ddc86606 ("ext4: save error info to sb through journal if available") Reported-by: Mahesh Kumar <maheshkumar657g@gmail.com> Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/9613c465d6ff00cd315602f99283d5f24018c3f7.1742279837.git.ojaswin@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-04-10ext4: define ext4_journal_destroy wrapperOjaswin Mujoo2-10/+20
[ Upstream commit 5a02a6204ca37e7c22fbb55a789c503f05e8e89a ] Define an ext4 wrapper over jbd2_journal_destroy to make sure we have consistent behavior during journal destruction. This will also come useful in the next patch where we add some ext4 specific logic in the destroy path. Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Baokun Li <libaokun1@huawei.com> Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Link: https://patch.msgid.link/c3ba78c5c419757e6d5f2d8ebb4a8ce9d21da86a.1742279837.git.ojaswin@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Stable-dep-of: ce2f26e73783 ("ext4: avoid journaling sb update on error if journal is destroying") Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-04-10ext4: goto right label 'out_mmap_sem' in ext4_setattr()Baokun Li1-1/+1
[ Upstream commit 7e91ae31e2d264155dfd102101afc2de7bd74a64 ] Otherwise, if ext4_inode_attach_jinode() fails, a hung task will happen because filemap_invalidate_unlock() isn't called to unlock mapping->invalidate_lock. Like this: EXT4-fs error (device sda) in ext4_setattr:5557: Out of memory INFO: task fsstress:374 blocked for more than 122 seconds. Not tainted 6.14.0-rc1-next-20250206-xfstests-dirty #726 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. task:fsstress state:D stack:0 pid:374 tgid:374 ppid:373 task_flags:0x440140 flags:0x00000000 Call Trace: <TASK> __schedule+0x2c9/0x7f0 schedule+0x27/0xa0 schedule_preempt_disabled+0x15/0x30 rwsem_down_read_slowpath+0x278/0x4c0 down_read+0x59/0xb0 page_cache_ra_unbounded+0x65/0x1b0 filemap_get_pages+0x124/0x3e0 filemap_read+0x114/0x3d0 vfs_read+0x297/0x360 ksys_read+0x6c/0xe0 do_syscall_64+0x4b/0x110 entry_SYSCALL_64_after_hwframe+0x76/0x7e Fixes: c7fc0366c656 ("ext4: partial zero eof block on unaligned inode size extension") Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Brian Foster <bfoster@redhat.com> Link: https://patch.msgid.link/20250213112247.3168709-1-libaokun@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-04-10ext4: fix out-of-bound read in ext4_xattr_inode_dec_ref_all()Ye Bin3-25/+13
[ Upstream commit 5701875f9609b000d91351eaa6bfd97fe2f157f4 ] There's issue as follows: BUG: KASAN: use-after-free in ext4_xattr_inode_dec_ref_all+0x6ff/0x790 Read of size 4 at addr ffff88807b003000 by task syz-executor.0/15172 CPU: 3 PID: 15172 Comm: syz-executor.0 Call Trace: __dump_stack lib/dump_stack.c:82 [inline] dump_stack+0xbe/0xfd lib/dump_stack.c:123 print_address_description.constprop.0+0x1e/0x280 mm/kasan/report.c:400 __kasan_report.cold+0x6c/0x84 mm/kasan/report.c:560 kasan_report+0x3a/0x50 mm/kasan/report.c:585 ext4_xattr_inode_dec_ref_all+0x6ff/0x790 fs/ext4/xattr.c:1137 ext4_xattr_delete_inode+0x4c7/0xda0 fs/ext4/xattr.c:2896 ext4_evict_inode+0xb3b/0x1670 fs/ext4/inode.c:323 evict+0x39f/0x880 fs/inode.c:622 iput_final fs/inode.c:1746 [inline] iput fs/inode.c:1772 [inline] iput+0x525/0x6c0 fs/inode.c:1758 ext4_orphan_cleanup fs/ext4/super.c:3298 [inline] ext4_fill_super+0x8c57/0xba40 fs/ext4/super.c:5300 mount_bdev+0x355/0x410 fs/super.c:1446 legacy_get_tree+0xfe/0x220 fs/fs_context.c:611 vfs_get_tree+0x8d/0x2f0 fs/super.c:1576 do_new_mount fs/namespace.c:2983 [inline] path_mount+0x119a/0x1ad0 fs/namespace.c:3316 do_mount+0xfc/0x110 fs/namespace.c:3329 __do_sys_mount fs/namespace.c:3540 [inline] __se_sys_mount+0x219/0x2e0 fs/namespace.c:3514 do_syscall_64+0x33/0x40 arch/x86/entry/common.c:46 entry_SYSCALL_64_after_hwframe+0x67/0xd1 Memory state around the buggy address: ffff88807b002f00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ffff88807b002f80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 >ffff88807b003000: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ^ ffff88807b003080: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ffff88807b003100: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff Above issue happens as ext4_xattr_delete_inode() isn't check xattr is valid if xattr is in inode. To solve above issue call xattr_check_inode() check if xattr if valid in inode. In fact, we can directly verify in ext4_iget_extra_inode(), so that there is no divergent verification. Fixes: e50e5129f384 ("ext4: xattr-in-inode support") Signed-off-by: Ye Bin <yebin10@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20250208063141.1539283-3-yebin@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-04-10ext4: introduce ITAIL helperYe Bin2-5/+8
[ Upstream commit 69f3a3039b0d0003de008659cafd5a1eaaa0a7a4 ] Introduce ITAIL helper to get the bound of xattr in inode. Signed-off-by: Ye Bin <yebin10@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20250208063141.1539283-2-yebin@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Stable-dep-of: 5701875f9609 ("ext4: fix out-of-bound read in ext4_xattr_inode_dec_ref_all()") Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-04-10ext4: verify fast symlink lengthJan Kara1-2/+10
[ Upstream commit 5f920d5d60839f7cbbb1ed50eac68d8bc0a73a7c ] Verify fast symlink length stored in inode->i_size matches the string stored in the inode to avoid surprises from corrupted filesystems. Reported-by: syzbot+48a99e426f29859818c0@syzkaller.appspotmail.com Tested-by: syzbot+48a99e426f29859818c0@syzkaller.appspotmail.com Fixes: bae80473f7b0 ("ext4: use inode_set_cached_link()") Suggested-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Link: https://patch.msgid.link/20250206094454.20522-2-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-04-10ext4: add missing brelse() for bh2 in ext4_dx_add_entry()Kemeng Shi1-4/+12
[ Upstream commit eb640af64db6d4702a85ab001b9cc7f4c5dd6abb ] Add missing brelse() for bh2 in ext4_dx_add_entry(). Fixes: ac27a0ec112a ("[PATCH] ext4: initial copy of files from ext3") Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Link: https://patch.msgid.link/20250123162050.2114499-2-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-04-10ext4: show 'emergency_ro' when EXT4_FLAGS_EMERGENCY_RO is setBaokun Li1-0/+3
[ Upstream commit 6b76715d5e41fc332b0b879e66fad6ef3db07a3f ] After commit d3476f3dad4a ("ext4: don't set SB_RDONLY after filesystem errors") in v6.12-rc1, the 'errors=remount-ro' mode no longer sets SB_RDONLY on errors, which results in us seeing the filesystem is still in rw state after errors. Therefore, after setting EXT4_FLAGS_EMERGENCY_RO, display the emergency_ro option so that users can query whether the current file system has become emergency read-only due to errors through commands such as 'mount' or 'cat /proc/fs/ext4/sdx/options'. Fixes: d3476f3dad4a ("ext4: don't set SB_RDONLY after filesystem errors") Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Link: https://patch.msgid.link/20250122114130.229709-7-libaokun@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-04-10ext4: correct behavior under errors=remount-ro modeBaokun Li1-11/+8
[ Upstream commit 8f984530c242c569bafecfa35bce969a9b8fb0dd ] And after commit 95257987a638 ("ext4: drop EXT4_MF_FS_ABORTED flag") in v6.6-rc1, the EXT4_FLAGS_SHUTDOWN bit is set in ext4_handle_error() under errors=remount-ro mode. This causes the read to fail even when the error is triggered in errors=remount-ro mode. To correct the behavior under errors=remount-ro, EXT4_FLAGS_SHUTDOWN is replaced by the newly introduced EXT4_FLAGS_EMERGENCY_RO. This new flag only prevents writes, matching the previous behavior with SB_RDONLY. Fixes: 95257987a638 ("ext4: drop EXT4_MF_FS_ABORTED flag") Closes: https://lore.kernel.org/all/22d652f6-cb3c-43f5-b2fe-0a4bb6516a04@huawei.com/ Suggested-by: Jan Kara <jack@suse.cz> Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20250122114130.229709-6-libaokun@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-04-10ext4: add EXT4_FLAGS_EMERGENCY_RO bitBaokun Li1-0/+6
[ Upstream commit f3054e53c2f367bd3cf6535afe9ab13198d2d739 ] EXT4_FLAGS_EMERGENCY_RO Indicates that the current file system has become read-only due to some error. Compared to SB_RDONLY, setting it does not require a lock because we won't clear it, which avoids over-coupling with vfs freeze. Also, add a helper function ext4_emergency_ro() to check if the bit is set. Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Link: https://patch.msgid.link/20250122114130.229709-3-libaokun@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Stable-dep-of: 8f984530c242 ("ext4: correct behavior under errors=remount-ro mode") Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-04-10ext4: convert EXT4_FLAGS_* defines to enumBaokun Li1-3/+5
[ Upstream commit 99708f8a9d30081a71638d6f4e216350a4340cc3 ] Do away with the defines and use an enum as it's cleaner. Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Link: https://patch.msgid.link/20250122114130.229709-2-libaokun@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Stable-dep-of: 8f984530c242 ("ext4: correct behavior under errors=remount-ro mode") Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-04-10ext4: fix potential null dereference in ext4 kunit testCharles Han1-0/+2
[ Upstream commit 57e7239ce0ed14e81e414c99d57f516f6220a995 ] kunit_kzalloc() may return a NULL pointer, dereferencing it without NULL check may lead to NULL dereference. Add a NULL check for grp. Fixes: ac96b56a2fbd ("ext4: Add unit test for mb_mark_used") Fixes: b7098e1fa7bc ("ext4: Add unit test for mb_free_blocks") Signed-off-by: Charles Han <hanchunchao@inspur.com> Reviewed-by: Kemeng Shi <shikemeng@huaweicloud.com> Link: https://patch.msgid.link/20250110092421.35619-1-hanchunchao@inspur.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-03-13Revert "ext4: add pre-content fsnotify hook for DAX faults"Amir Goldstein1-3/+0
This reverts commit bb480760ffc7018e21ee6f60241c2b99ff26ee0e. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20250312073852.2123409-3-amir73il@gmail.com
2025-01-30Merge tag 'pull-revalidate' of ↵Linus Torvalds2-26/+6
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs d_revalidate updates from Al Viro: "Provide stable parent and name to ->d_revalidate() instances Most of the filesystem methods where we care about dentry name and parent have their stability guaranteed by the callers; ->d_revalidate() is the major exception. It's easy enough for callers to supply stable values for expected name and expected parent of the dentry being validated. That kills quite a bit of boilerplate in ->d_revalidate() instances, along with a bunch of races where they used to access ->d_name without sufficient precautions" * tag 'pull-revalidate' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: 9p: fix ->rename_sem exclusion orangefs_d_revalidate(): use stable parent inode and name passed by caller ocfs2_dentry_revalidate(): use stable parent inode and name passed by caller nfs: fix ->d_revalidate() UAF on ->d_name accesses nfs{,4}_lookup_validate(): use stable parent inode passed by caller gfs2_drevalidate(): use stable parent inode and name passed by caller fuse_dentry_revalidate(): use stable parent inode and name passed by caller vfat_revalidate{,_ci}(): use stable parent inode passed by caller exfat_d_revalidate(): use stable parent inode passed by caller fscrypt_d_revalidate(): use stable parent inode passed by caller ceph_d_revalidate(): propagate stable name down into request encoding ceph_d_revalidate(): use stable parent inode passed by caller afs_d_revalidate(): use stable name and parent inode passed by caller Pass parent directory inode and expected name to ->d_revalidate() generic_ci_d_compare(): use shortname_storage ext4 fast_commit: make use of name_snapshot primitives dissolve external_name.u into separate members make take_dentry_name_snapshot() lockless dcache: back inline names with a struct-wrapped array of unsigned long make sure that DNAME_INLINE_LEN is a multiple of word size
2025-01-28ext4 fast_commit: make use of name_snapshot primitivesAl Viro2-26/+6
... rather than open-coding them. As a bonus, that avoids the pointless work with extra allocations, etc. for long names. Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-01-24Merge tag 'fsnotify_hsm_for_v6.14-rc1' of ↵Linus Torvalds2-0/+6
git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs Pull fsnotify pre-content notification support from Jan Kara: "This introduces a new fsnotify event (FS_PRE_ACCESS) that gets generated before a file contents is accessed. The event is synchronous so if there is listener for this event, the kernel waits for reply. On success the execution continues as usual, on failure we propagate the error to userspace. This allows userspace to fill in file content on demand from slow storage. The context in which the events are generated has been picked so that we don't hold any locks and thus there's no risk of a deadlock for the userspace handler. The new pre-content event is available only for users with global CAP_SYS_ADMIN capability (similarly to other parts of fanotify functionality) and it is an administrator responsibility to make sure the userspace event handler doesn't do stupid stuff that can DoS the system. Based on your feedback from the last submission, fsnotify code has been improved and now file->f_mode encodes whether pre-content event needs to be generated for the file so the fast path when nobody wants pre-content event for the file just grows the additional file->f_mode check. As a bonus this also removes the checks whether the old FS_ACCESS event needs to be generated from the fast path. Also the place where the event is generated during page fault has been moved so now filemap_fault() generates the event if and only if there is no uptodate folio in the page cache. Also we have dropped FS_PRE_MODIFY event as current real-world users of the pre-content functionality don't really use it so let's start with the minimal useful feature set" * tag 'fsnotify_hsm_for_v6.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: (21 commits) fanotify: Fix crash in fanotify_init(2) fs: don't block write during exec on pre-content watched files fs: enable pre-content events on supported file systems ext4: add pre-content fsnotify hook for DAX faults btrfs: disable defrag on pre-content watched files xfs: add pre-content fsnotify hook for DAX faults fsnotify: generate pre-content permission event on page fault mm: don't allow huge faults for files with pre content watches fanotify: disable readahead if we have pre-content watches fanotify: allow to set errno in FAN_DENY permission response fanotify: report file range info with pre-content events fanotify: introduce FAN_PRE_ACCESS permission event fsnotify: generate pre-content permission event on truncate fsnotify: pass optional file access range in pre-content event fsnotify: introduce pre-content permission events fanotify: reserve event bit of deprecated FAN_DIR_MODIFY fanotify: rename a misnamed constant fanotify: don't skip extra event info if no info_mode is set fsnotify: check if file is actually being watched for pre-content events on open fsnotify: opt-in for permission events at file open time ...
2025-01-23Merge tag 'crc-for-linus' of ↵Linus Torvalds3-39/+4
git://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux Pull CRC updates from Eric Biggers: - Reorganize the architecture-optimized CRC32 and CRC-T10DIF code to be directly accessible via the library API, instead of requiring the crypto API. This is much simpler and more efficient. - Convert some users such as ext4 to use the CRC32 library API instead of the crypto API. More conversions like this will come later. - Add a KUnit test that tests and benchmarks multiple CRC variants. Remove older, less-comprehensive tests that are made redundant by this. - Add an entry to MAINTAINERS for the kernel's CRC library code. I'm volunteering to maintain it. I have additional cleanups and optimizations planned for future cycles. * tag 'crc-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux: (31 commits) MAINTAINERS: add entry for CRC library powerpc/crc: delete obsolete crc-vpmsum_test.c lib/crc32test: delete obsolete crc32test.c lib/crc16_kunit: delete obsolete crc16_kunit.c lib/crc_kunit.c: add KUnit test suite for CRC library functions powerpc/crc-t10dif: expose CRC-T10DIF function through lib arm64/crc-t10dif: expose CRC-T10DIF function through lib arm/crc-t10dif: expose CRC-T10DIF function through lib x86/crc-t10dif: expose CRC-T10DIF function through lib crypto: crct10dif - expose arch-optimized lib function lib/crc-t10dif: add support for arch overrides lib/crc-t10dif: stop wrapping the crypto API scsi: target: iscsi: switch to using the crc32c library f2fs: switch to using the crc32 library jbd2: switch to using the crc32c library ext4: switch to using the crc32c library lib/crc32: make crc32c() go directly to lib bcachefs: Explicitly select CRYPTO from BCACHEFS_FS x86/crc32: expose CRC32 functions through lib x86/crc32: update prototype for crc32_pclmul_le_16() ...
2024-12-22ext4: use inode_set_cached_link()Mateusz Guzik2-2/+5
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> Link: https://lore.kernel.org/r/20241120112037.822078-3-mjguzik@gmail.com Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-12-11fs: enable pre-content events on supported file systemsJosef Bacik1-0/+3
Now that all the code has been added for pre-content events, and the various file systems that need the page fault hooks for fsnotify have been updated, add SB_I_ALLOW_HSM to the supported file systems. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/46960dcb2725fa0317895ed66a8409ba1c306a82.1731684329.git.josef@toxicpanda.com
2024-12-11ext4: add pre-content fsnotify hook for DAX faultsJan Kara1-0/+3
ext4 has its own handling for DAX faults. Add the pre-content fsnotify hook for this case. Signed-off-by: Jan Kara <jack@suse.cz>
2024-12-02ext4: switch to using the crc32c libraryEric Biggers3-39/+4
Now that the crc32c() library function directly takes advantage of architecture-specific optimizations, it is unnecessary to go through the crypto API. Just use crc32c(). This is much simpler, and it improves performance due to eliminating the crypto API overhead. Reviewed-by: Ard Biesheuvel <ardb@kernel.org> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Acked-by: Theodore Ts'o <tytso@mit.edu> Link: https://lore.kernel.org/r/20241202010844.144356-17-ebiggers@kernel.org Signed-off-by: Eric Biggers <ebiggers@google.com>
2024-11-19Merge tag 'ext4_for_linus-6.13-rc1' of ↵Linus Torvalds20-102/+227
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 updates from Ted Ts'o: "A lot of miscellaneous ext4 bug fixes and cleanups this cycle, most notably in the journaling code, bufered I/O, and compiler warning cleanups" * tag 'ext4_for_linus-6.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (33 commits) jbd2: Fix comment describing journal_init_common() ext4: prevent an infinite loop in the lazyinit thread ext4: use struct_size() to improve ext4_htree_store_dirent() ext4: annotate struct fname with __counted_by() jbd2: avoid dozens of -Wflex-array-member-not-at-end warnings ext4: use str_yes_no() helper function ext4: prevent delalloc to nodelalloc on remount jbd2: make b_frozen_data allocation always succeed ext4: cleanup variable name in ext4_fc_del() ext4: use string choices helpers jbd2: remove the 'success' parameter from the jbd2_do_replay() function jbd2: remove useless 'block_error' variable jbd2: factor out jbd2_do_replay() jbd2: refactor JBD2_COMMIT_BLOCK process in do_one_pass() jbd2: unified release of buffer_head in do_one_pass() jbd2: remove redundant judgments for check v1 checksum ext4: use ERR_CAST to return an error-valued pointer mm: zero range of eof folio exposed by inode size extension ext4: partial zero eof block on unaligned inode size extension ext4: disambiguate the return value of ext4_dio_write_end_io() ...
2024-11-18Merge tag 'pull-fd' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfsLinus Torvalds1-14/+7
Pull 'struct fd' class updates from Al Viro: "The bulk of struct fd memory safety stuff Making sure that struct fd instances are destroyed in the same scope where they'd been created, getting rid of reassignments and passing them by reference, converting to CLASS(fd{,_pos,_raw}). We are getting very close to having the memory safety of that stuff trivial to verify" * tag 'pull-fd' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (28 commits) deal with the last remaing boolean uses of fd_file() css_set_fork(): switch to CLASS(fd_raw, ...) memcg_write_event_control(): switch to CLASS(fd) assorted variants of irqfd setup: convert to CLASS(fd) do_pollfd(): convert to CLASS(fd) convert do_select() convert vfs_dedupe_file_range(). convert cifs_ioctl_copychunk() convert media_request_get_by_fd() convert spu_run(2) switch spufs_calls_{get,put}() to CLASS() use convert cachestat(2) convert do_preadv()/do_pwritev() fdget(), more trivial conversions fdget(), trivial conversions privcmd_ioeventfd_assign(): don't open-code eventfd_ctx_fdget() o2hb_region_dev_store(): avoid goto around fdget()/fdput() introduce "fd_pos" class, convert fdget_pos() users to it. fdget_raw() users: switch to CLASS(fd_raw) convert vmsplice() to CLASS(fd) ...
2024-11-18Merge tag 'vfs-6.13.untorn.writes' of ↵Linus Torvalds4-5/+99
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs untorn write support from Christian Brauner: "An atomic write is a write issed with torn-write protection. This means for a power failure or any hardware failure all or none of the data from the write will be stored, never a mix of old and new data. This work is already supported for block devices. If a block device is opened with O_DIRECT and the block device supports atomic write, then FMODE_CAN_ATOMIC_WRITE is added to the file of the opened block device. This contains the work to expand atomic write support to filesystems, specifically ext4 and XFS. Currently, only support for writing exactly one filesystem block atomically is added. Since it's now possible to have filesystem block size > page size for XFS, it's possible to write 4K+ blocks atomically on x86" * tag 'vfs-6.13.untorn.writes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: iomap: drop an obsolete comment in iomap_dio_bio_iter ext4: Do not fallback to buffered-io for DIO atomic write ext4: Support setting FMODE_CAN_ATOMIC_WRITE ext4: Check for atomic writes support in write iter ext4: Add statx support for atomic writes xfs: Support setting FMODE_CAN_ATOMIC_WRITE xfs: Validate atomic writes xfs: Support atomic write for statx fs: iomap: Atomic write support fs: Export generic_atomic_write_valid() block: Add bdev atomic write limits helpers fs/block: Check for IOCB_DIRECT in generic_atomic_write_valid() block/fs: Pass an iocb to generic_atomic_write_valid()
2024-11-18Merge tag 'vfs-6.13.tmpfs' of ↵Linus Torvalds1-4/+1
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull tmpfs case folding updates from Christian Brauner: "This adds case-insensitive support for tmpfs. The work contained in here adds support for case-insensitive file names lookups in tmpfs. The main difference from other casefold filesystems is that tmpfs has no information on disk, just on RAM, so we can't use mkfs to create a case-insensitive tmpfs. For this implementation, there's a mount option for casefolding. The rest of the patchset follows a similar approach as ext4 and f2fs. The use case for this feature is similar to the use case for ext4, to better support compatibility layers (like Wine), particularly in combination with sandboxing/container tools (like Flatpak). Those containerization tools can share a subset of the host filesystem with an application. In the container, the root directory and any parent directories required for a shared directory are on tmpfs, with the shared directories bind-mounted into the container's view of the filesystem. If the host filesystem is using case-insensitive directories, then the application can do lookups inside those directories in a case-insensitive way, without this needing to be implemented in user-space. However, if the host is only sharing a subset of a case-insensitive directory with the application, then the parent directories of the mount point will be part of the container's root tmpfs. When the application tries to do case-insensitive lookups of those parent directories on a case-sensitive tmpfs, the lookup will fail" * tag 'vfs-6.13.tmpfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: tmpfs: Initialize sysfs during tmpfs init tmpfs: Fix type for sysfs' casefold attribute libfs: Fix kernel-doc warning in generic_ci_validate_strict_name docs: tmpfs: Add casefold options tmpfs: Expose filesystem features via sysfs tmpfs: Add flag FS_CASEFOLD_FL support for tmpfs dirs tmpfs: Add casefold lookup support libfs: Export generic_ci_ dentry functions unicode: Recreate utf8_parse_version() unicode: Export latest available UTF-8 version number ext4: Use generic_ci_validate_strict_name helper libfs: Create the helper function generic_ci_validate_strict_name()
2024-11-18Merge tag 'vfs-6.13.misc' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull misc vfs updates from Christian Brauner: "Features: - Fixup and improve NLM and kNFSD file lock callbacks Last year both GFS2 and OCFS2 had some work done to make their locking more robust when exported over NFS. Unfortunately, part of that work caused both NLM (for NFS v3 exports) and kNFSD (for NFSv4.1+ exports) to no longer send lock notifications to clients This in itself is not a huge problem because most NFS clients will still poll the server in order to acquire a conflicted lock It's important for NLM and kNFSD that they do not block their kernel threads inside filesystem's file_lock implementations because that can produce deadlocks. We used to make sure of this by only trusting that posix_lock_file() can correctly handle blocking lock calls asynchronously, so the lock managers would only setup their file_lock requests for async callbacks if the filesystem did not define its own lock() file operation However, when GFS2 and OCFS2 grew the capability to correctly handle blocking lock requests asynchronously, they started signalling this behavior with EXPORT_OP_ASYNC_LOCK, and the check for also trusting posix_lock_file() was inadvertently dropped, so now most filesystems no longer produce lock notifications when exported over NFS Fix this by using an fop_flag which greatly simplifies the problem and grooms the way for future uses by both filesystems and lock managers alike - Add a sysctl to delete the dentry when a file is removed instead of making it a negative dentry Commit 681ce8623567 ("vfs: Delete the associated dentry when deleting a file") introduced an unconditional deletion of the associated dentry when a file is removed. However, this led to performance regressions in specific benchmarks, such as ilebench.sum_operations/s, prompting a revert in commit 4a4be1ad3a6e ("Revert "vfs: Delete the associated dentry when deleting a file""). This reintroduces the concept conditionally through a sysctl - Expand the statmount() system call: * Report the filesystem subtype in a new fs_subtype field to e.g., report fuse filesystem subtypes * Report the superblock source in a new sb_source field * Add a new way to return filesystem specific mount options in an option array that returns filesystem specific mount options separated by zero bytes and unescaped. This allows caller's to retrieve filesystem specific mount options and immediately pass them to e.g., fsconfig() without having to unescape or split them * Report security (LSM) specific mount options in a separate security option array. We don't lump them together with filesystem specific mount options as security mount options are generic and most users aren't interested in them The format is the same as for the filesystem specific mount option array - Support relative paths in fsconfig()'s FSCONFIG_SET_STRING command - Optimize acl_permission_check() to avoid costly {g,u}id ownership checks if possible - Use smp_mb__after_spinlock() to avoid full smp_mb() in evict() - Add synchronous wakeup support for ep_poll_callback. Currently, epoll only uses wake_up() to wake up task. But sometimes there are epoll users which want to use the synchronous wakeup flag to give a hint to the scheduler, e.g., the Android binder driver. So add a wake_up_sync() define, and use wake_up_sync() when sync is true in ep_poll_callback() Fixes: - Fix kernel documentation for inode_insert5() and iget5_locked() - Annotate racy epoll check on file->f_ep - Make F_DUPFD_QUERY associative - Avoid filename buffer overrun in initramfs - Don't let statmount() return empty strings - Add a cond_resched() to dump_user_range() to avoid hogging the CPU - Don't query the device logical blocksize multiple times for hfsplus - Make filemap_read() check that the offset is positive or zero Cleanups: - Various typo fixes - Cleanup wbc_attach_fdatawrite_inode() - Add __releases annotation to wbc_attach_and_unlock_inode() - Add hugetlbfs tracepoints - Fix various vfs kernel doc parameters - Remove obsolete TODO comment from io_cancel() - Convert wbc_account_cgroup_owner() to take a folio - Fix comments for BANDWITH_INTERVAL and wb_domain_writeout_add() - Reorder struct posix_acl to save 8 bytes - Annotate struct posix_acl with __counted_by() - Replace one-element array with flexible array member in freevxfs - Use idiomatic atomic64_inc_return() in alloc_mnt_ns()" * tag 'vfs-6.13.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (35 commits) statmount: retrieve security mount options vfs: make evict() use smp_mb__after_spinlock instead of smp_mb statmount: add flag to retrieve unescaped options fs: add the ability for statmount() to report the sb_source writeback: wbc_attach_fdatawrite_inode out of line writeback: add a __releases annoation to wbc_attach_and_unlock_inode fs: add the ability for statmount() to report the fs_subtype fs: don't let statmount return empty strings fs:aio: Remove TODO comment suggesting hash or array usage in io_cancel() hfsplus: don't query the device logical block size multiple times freevxfs: Replace one-element array with flexible array member fs: optimize acl_permission_check() initramfs: avoid filename buffer overrun fs/writeback: convert wbc_account_cgroup_owner to take a folio acl: Annotate struct posix_acl with __counted_by() acl: Realign struct posix_acl to save 8 bytes epoll: Add synchronous wakeup support for ep_poll_callback coredump: add cond_resched() to dump_user_range mm/page-writeback.c: Fix comment of wb_domain_writeout_add() mm/page-writeback.c: Update comment for BANDWIDTH_INTERVAL ...
2024-11-13ext4: prevent an infinite loop in the lazyinit threadMathieu Othacehe1-7/+13
Use ktime_get_ns instead of ktime_get_real_ns when computing the lr_timeout not to be affected by system time jumps. Use a boolean instead of the MAX_JIFFY_OFFSET value to determine whether the next_wakeup value has been set. Comparing elr->lr_next_sched to MAX_JIFFY_OFFSET can cause the lazyinit thread to loop indefinitely. Co-developed-by: Lukas Skupinski <lukas.skupinski@landisgyr.com> Signed-off-by: Lukas Skupinski <lukas.skupinski@landisgyr.com> Signed-off-by: Mathieu Othacehe <othacehe@gnu.org> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20241106134741.26948-2-othacehe@gnu.org Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-11-13ext4: use struct_size() to improve ext4_htree_store_dirent()Thorsten Blum1-3/+2
Inline and use struct_size() to calculate the number of bytes to allocate for new_fn and remove the local variable len. Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev> Link: https://patch.msgid.link/20241105103353.11590-2-thorsten.blum@linux.dev Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-11-13ext4: annotate struct fname with __counted_by()Thorsten Blum1-1/+1
Add the __counted_by compiler attribute to the flexible array member name to improve access bounds-checking via CONFIG_UBSAN_BOUNDS and CONFIG_FORTIFY_SOURCE. Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev> Link: https://patch.msgid.link/20241105101813.10864-2-thorsten.blum@linux.dev Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-11-13ext4: use str_yes_no() helper functionThorsten Blum1-1/+1
Remove hard-coded strings by using the str_yes_no() helper function. Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev> Link: https://patch.msgid.link/20241021100056.5521-2-thorsten.blum@linux.dev Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-11-13ext4: prevent delalloc to nodelalloc on remountNicolas Bretz1-0/+7
Implemented the suggested solution mentioned in the bug https://bugzilla.kernel.org/show_bug.cgi?id=218820 Preventing the disabling of delayed allocation mode on remount. delalloc to nodelalloc not permitted anymore nodelalloc to delalloc permitted, not affected Signed-off-by: Nicolas Bretz <bretznic@gmail.com> Link: https://patch.msgid.link/20241014034143.59779-1-bretznic@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-11-13ext4: cleanup variable name in ext4_fc_del()Dan Carpenter1-2/+2
The variables "&EXT4_SB(inode->i_sb)->s_fc_lock" and "&sbi->s_fc_lock" are the same lock. This function uses a mix of both, which is a bit unsightly and confuses Smatch. Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org> Link: https://patch.msgid.link/96008557-8ff4-44cc-b5e3-ce242212f1a3@stanley.mountain Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-11-13ext4: use string choices helpersR Sundar1-1/+1
Use string choice helpers for better readability and to fix cocci warning Reported-by: kernel test robot <lkp@intel.com> Reported-by: Julia Lawall <julia.lawall@inria.fr> Closes: https://lore.kernel.org/r/202410062256.BoynX3c2-lkp@intel.com/ Signed-off-by: R Sundar <prosunofficial@gmail.com> Link: https://patch.msgid.link/20241007172006.83339-1-prosunofficial@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-11-13ext4: use ERR_CAST to return an error-valued pointerYu Jiaoliang1-2/+2
Instead of directly casting and returning an error-valued pointer, use ERR_CAST to make the error handling more explicit and improve code clarity. Signed-off-by: Yu Jiaoliang <yujiaoliang@vivo.com> Link: https://patch.msgid.link/20240920021440.1959243-1-yujiaoliang@vivo.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-11-13ext4: partial zero eof block on unaligned inode size extensionBrian Foster2-16/+42
Using mapped writes, it's technically possible to expose stale post-eof data on a truncate up operation. Consider the following example: $ xfs_io -fc "pwrite 0 2k" -c "mmap 0 4k" -c "mwrite 2k 2k" \ -c "truncate 8k" -c "pread -v 2k 16" <file> ... 00000800: 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 XXXXXXXXXXXXXXXX ... This shows that the post-eof data written via mwrite lands within EOF after a truncate up. While this is deliberate of the test case, behavior is somewhat unpredictable because writeback does post-eof zeroing, and writeback can occur at any time in the background. For example, an fsync inserted between the mwrite and truncate causes the subsequent read to instead return zeroes. This basically means that there is a race window in this situation between any subsequent extending operation and writeback that dictates whether post-eof data is exposed to the file or zeroed. To prevent this problem, perform partial block zeroing as part of the various inode size extending operations that are susceptible to it. For truncate extension, zero around the original eof similar to how truncate down does partial zeroing of the new eof. For extension via writes and fallocate related operations, zero the newly exposed range of the file to cover any partial zeroing that must occur at the original and new eof blocks. Signed-off-by: Brian Foster <bfoster@redhat.com> Link: https://patch.msgid.link/20240919160741.208162-2-bfoster@redhat.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-11-13ext4: disambiguate the return value of ext4_dio_write_end_io()Jinliang Zheng1-2/+3
The commit 91562895f803 ("ext4: properly sync file size update after O_SYNC direct IO") causes confusion about the meaning of the return value of ext4_dio_write_end_io(). Specifically, when the ext4_handle_inode_extension() operation succeeds, ext4_dio_write_end_io() directly returns count instead of 0. This does not cause a bug in the current kernel, but the semantics of the return value of the ext4_dio_write_end_io() function are wrong, which is likely to introduce bugs in the future code evolution. Signed-off-by: Jinliang Zheng <alexjlzheng@tencent.com> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Link: https://patch.msgid.link/20240919082539.381626-1-alexjlzheng@tencent.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>