summaryrefslogtreecommitdiff
path: root/fs/ext4/inline.c
AgeCommit message (Collapse)AuthorFilesLines
2025-05-14ext4: inline: fix len overflow in ext4_prepare_inline_dataThadeu Lima de Souza Cascardo1-1/+1
When running the following code on an ext4 filesystem with inline_data feature enabled, it will lead to the bug below. fd = open("file1", O_RDWR | O_CREAT | O_TRUNC, 0666); ftruncate(fd, 30); pwrite(fd, "a", 1, (1UL << 40) + 5UL); That happens because write_begin will succeed as when ext4_generic_write_inline_data calls ext4_prepare_inline_data, pos + len will be truncated, leading to ext4_prepare_inline_data parameter to be 6 instead of 0x10000000006. Then, later when write_end is called, we hit: BUG_ON(pos + len > EXT4_I(inode)->i_inline_size); at ext4_write_inline_data. Fix it by using a loff_t type for the len parameter in ext4_prepare_inline_data instead of an unsigned int. [ 44.545164] ------------[ cut here ]------------ [ 44.545530] kernel BUG at fs/ext4/inline.c:240! [ 44.545834] Oops: invalid opcode: 0000 [#1] SMP NOPTI [ 44.546172] CPU: 3 UID: 0 PID: 343 Comm: test Not tainted 6.15.0-rc2-00003-g9080916f4863 #45 PREEMPT(full) 112853fcebfdb93254270a7959841d2c6aa2c8bb [ 44.546523] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014 [ 44.546523] RIP: 0010:ext4_write_inline_data+0xfe/0x100 [ 44.546523] Code: 3c 0e 48 83 c7 48 48 89 de 5b 41 5c 41 5d 41 5e 41 5f 5d e9 e4 fa 43 01 5b 41 5c 41 5d 41 5e 41 5f 5d c3 cc cc cc cc cc 0f 0b <0f> 0b 0f 1f 44 00 00 55 41 57 41 56 41 55 41 54 53 48 83 ec 20 49 [ 44.546523] RSP: 0018:ffffb342008b79a8 EFLAGS: 00010216 [ 44.546523] RAX: 0000000000000001 RBX: ffff9329c579c000 RCX: 0000010000000006 [ 44.546523] RDX: 000000000000003c RSI: ffffb342008b79f0 RDI: ffff9329c158e738 [ 44.546523] RBP: 0000000000000001 R08: 0000000000000001 R09: 0000000000000000 [ 44.546523] R10: 00007ffffffff000 R11: ffffffff9bd0d910 R12: 0000006210000000 [ 44.546523] R13: fffffc7e4015e700 R14: 0000010000000005 R15: ffff9329c158e738 [ 44.546523] FS: 00007f4299934740(0000) GS:ffff932a60179000(0000) knlGS:0000000000000000 [ 44.546523] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 44.546523] CR2: 00007f4299a1ec90 CR3: 0000000002886002 CR4: 0000000000770eb0 [ 44.546523] PKRU: 55555554 [ 44.546523] Call Trace: [ 44.546523] <TASK> [ 44.546523] ext4_write_inline_data_end+0x126/0x2d0 [ 44.546523] generic_perform_write+0x17e/0x270 [ 44.546523] ext4_buffered_write_iter+0xc8/0x170 [ 44.546523] vfs_write+0x2be/0x3e0 [ 44.546523] __x64_sys_pwrite64+0x6d/0xc0 [ 44.546523] do_syscall_64+0x6a/0xf0 [ 44.546523] ? __wake_up+0x89/0xb0 [ 44.546523] ? xas_find+0x72/0x1c0 [ 44.546523] ? next_uptodate_folio+0x317/0x330 [ 44.546523] ? set_pte_range+0x1a6/0x270 [ 44.546523] ? filemap_map_pages+0x6ee/0x840 [ 44.546523] ? ext4_setattr+0x2fa/0x750 [ 44.546523] ? do_pte_missing+0x128/0xf70 [ 44.546523] ? security_inode_post_setattr+0x3e/0xd0 [ 44.546523] ? ___pte_offset_map+0x19/0x100 [ 44.546523] ? handle_mm_fault+0x721/0xa10 [ 44.546523] ? do_user_addr_fault+0x197/0x730 [ 44.546523] ? do_syscall_64+0x76/0xf0 [ 44.546523] ? arch_exit_to_user_mode_prepare+0x1e/0x60 [ 44.546523] ? irqentry_exit_to_user_mode+0x79/0x90 [ 44.546523] entry_SYSCALL_64_after_hwframe+0x55/0x5d [ 44.546523] RIP: 0033:0x7f42999c6687 [ 44.546523] Code: 48 89 fa 4c 89 df e8 58 b3 00 00 8b 93 08 03 00 00 59 5e 48 83 f8 fc 74 1a 5b c3 0f 1f 84 00 00 00 00 00 48 8b 44 24 10 0f 05 <5b> c3 0f 1f 80 00 00 00 00 83 e2 39 83 fa 08 75 de e8 23 ff ff ff [ 44.546523] RSP: 002b:00007ffeae4a7930 EFLAGS: 00000202 ORIG_RAX: 0000000000000012 [ 44.546523] RAX: ffffffffffffffda RBX: 00007f4299934740 RCX: 00007f42999c6687 [ 44.546523] RDX: 0000000000000001 RSI: 000055ea6149200f RDI: 0000000000000003 [ 44.546523] RBP: 00007ffeae4a79a0 R08: 0000000000000000 R09: 0000000000000000 [ 44.546523] R10: 0000010000000005 R11: 0000000000000202 R12: 0000000000000000 [ 44.546523] R13: 00007ffeae4a7ac8 R14: 00007f4299b86000 R15: 000055ea61493dd8 [ 44.546523] </TASK> [ 44.546523] Modules linked in: [ 44.568501] ---[ end trace 0000000000000000 ]--- [ 44.568889] RIP: 0010:ext4_write_inline_data+0xfe/0x100 [ 44.569328] Code: 3c 0e 48 83 c7 48 48 89 de 5b 41 5c 41 5d 41 5e 41 5f 5d e9 e4 fa 43 01 5b 41 5c 41 5d 41 5e 41 5f 5d c3 cc cc cc cc cc 0f 0b <0f> 0b 0f 1f 44 00 00 55 41 57 41 56 41 55 41 54 53 48 83 ec 20 49 [ 44.570931] RSP: 0018:ffffb342008b79a8 EFLAGS: 00010216 [ 44.571356] RAX: 0000000000000001 RBX: ffff9329c579c000 RCX: 0000010000000006 [ 44.571959] RDX: 000000000000003c RSI: ffffb342008b79f0 RDI: ffff9329c158e738 [ 44.572571] RBP: 0000000000000001 R08: 0000000000000001 R09: 0000000000000000 [ 44.573148] R10: 00007ffffffff000 R11: ffffffff9bd0d910 R12: 0000006210000000 [ 44.573748] R13: fffffc7e4015e700 R14: 0000010000000005 R15: ffff9329c158e738 [ 44.574335] FS: 00007f4299934740(0000) GS:ffff932a60179000(0000) knlGS:0000000000000000 [ 44.575027] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 44.575520] CR2: 00007f4299a1ec90 CR3: 0000000002886002 CR4: 0000000000770eb0 [ 44.576112] PKRU: 55555554 [ 44.576338] Kernel panic - not syncing: Fatal exception [ 44.576517] Kernel Offset: 0x1a600000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff) Reported-by: syzbot+fe2a25dae02a207717a0@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=fe2a25dae02a207717a0 Fixes: f19d5870cbf7 ("ext4: add normal write support for inline data") Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@igalia.com> Cc: stable@vger.kernel.org Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Andreas Dilger <adilger@dilger.ca> Link: https://patch.msgid.link/20250415-ext4-prepare-inline-overflow-v1-1-f4c13d900967@igalia.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-05-09ext4: for committing inode, make ext4_fc_track_inode waitHarshad Shirwadkar1-0/+1
If the inode that's being requested to track using ext4_fc_track_inode is being committed, then wait until the inode finishes the commit. Also, add calls to ext4_fc_track_inode at the right places. With this patch, now calling ext4_reserve_inode_write() results in inode being tracked for next fast commit. This ensures that by the time ext4_reserve_inode_write() returns, it is ready to be modified and won't be committed until the corresponding handle is open. A subtle lock ordering requirement with i_data_sem (which is documented in the code) requires that ext4_fc_track_inode() be called before grabbing i_data_sem. So, this patch also adds explicit ext4_fc_track_inode() calls in places where i_data_sem grabbed. Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com> Link: https://patch.msgid.link/20250508175908.1004880-3-harshadshirwadkar@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-04-01Merge tag 'mm-stable-2025-03-30-16-52' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: - The series "Enable strict percpu address space checks" from Uros Bizjak uses x86 named address space qualifiers to provide compile-time checking of percpu area accesses. This has caused a small amount of fallout - two or three issues were reported. In all cases the calling code was found to be incorrect. - The series "Some cleanup for memcg" from Chen Ridong implements some relatively monir cleanups for the memcontrol code. - The series "mm: fixes for device-exclusive entries (hmm)" from David Hildenbrand fixes a boatload of issues which David found then using device-exclusive PTE entries when THP is enabled. More work is needed, but this makes thins better - our own HMM selftests now succeed. - The series "mm: zswap: remove z3fold and zbud" from Yosry Ahmed remove the z3fold and zbud implementations. They have been deprecated for half a year and nobody has complained. - The series "mm: further simplify VMA merge operation" from Lorenzo Stoakes implements numerous simplifications in this area. No runtime effects are anticipated. - The series "mm/madvise: remove redundant mmap_lock operations from process_madvise()" from SeongJae Park rationalizes the locking in the madvise() implementation. Performance gains of 20-25% were observed in one MADV_DONTNEED microbenchmark. - The series "Tiny cleanup and improvements about SWAP code" from Baoquan He contains a number of touchups to issues which Baoquan noticed when working on the swap code. - The series "mm: kmemleak: Usability improvements" from Catalin Marinas implements a couple of improvements to the kmemleak user-visible output. - The series "mm/damon/paddr: fix large folios access and schemes handling" from Usama Arif provides a couple of fixes for DAMON's handling of large folios. - The series "mm/damon/core: fix wrong and/or useless damos_walk() behaviors" from SeongJae Park fixes a few issues with the accuracy of kdamond's walking of DAMON regions. - The series "expose mapping wrprotect, fix fb_defio use" from Lorenzo Stoakes changes the interaction between framebuffer deferred-io and core MM. No functional changes are anticipated - this is preparatory work for the future removal of page structure fields. - The series "mm/damon: add support for hugepage_size DAMOS filter" from Usama Arif adds a DAMOS filter which permits the filtering by huge page sizes. - The series "mm: permit guard regions for file-backed/shmem mappings" from Lorenzo Stoakes extends the guard region feature from its present "anon mappings only" state. The feature now covers shmem and file-backed mappings. - The series "mm: batched unmap lazyfree large folios during reclamation" from Barry Song cleans up and speeds up the unmapping for pte-mapped large folios. - The series "reimplement per-vma lock as a refcount" from Suren Baghdasaryan puts the vm_lock back into the vma. Our reasons for pulling it out were largely bogus and that change made the code more messy. This patchset provides small (0-10%) improvements on one microbenchmark. - The series "Docs/mm/damon: misc DAMOS filters documentation fixes and improves" from SeongJae Park does some maintenance work on the DAMON docs. - The series "hugetlb/CMA improvements for large systems" from Frank van der Linden addresses a pile of issues which have been observed when using CMA on large machines. - The series "mm/damon: introduce DAMOS filter type for unmapped pages" from SeongJae Park enables users of DMAON/DAMOS to filter my the page's mapped/unmapped status. - The series "zsmalloc/zram: there be preemption" from Sergey Senozhatsky teaches zram to run its compression and decompression operations preemptibly. - The series "selftests/mm: Some cleanups from trying to run them" from Brendan Jackman fixes a pile of unrelated issues which Brendan encountered while runnimg our selftests. - The series "fs/proc/task_mmu: add guard region bit to pagemap" from Lorenzo Stoakes permits userspace to use /proc/pid/pagemap to determine whether a particular page is a guard page. - The series "mm, swap: remove swap slot cache" from Kairui Song removes the swap slot cache from the allocation path - it simply wasn't being effective. - The series "mm: cleanups for device-exclusive entries (hmm)" from David Hildenbrand implements a number of unrelated cleanups in this code. - The series "mm: Rework generic PTDUMP configs" from Anshuman Khandual implements a number of preparatoty cleanups to the GENERIC_PTDUMP Kconfig logic. - The series "mm/damon: auto-tune aggregation interval" from SeongJae Park implements a feedback-driven automatic tuning feature for DAMON's aggregation interval tuning. - The series "Fix lazy mmu mode" from Ryan Roberts fixes some issues in powerpc, sparc and x86 lazy MMU implementations. Ryan did this in preparation for implementing lazy mmu mode for arm64 to optimize vmalloc. - The series "mm/page_alloc: Some clarifications for migratetype fallback" from Brendan Jackman reworks some commentary to make the code easier to follow. - The series "page_counter cleanup and size reduction" from Shakeel Butt cleans up the page_counter code and fixes a size increase which we accidentally added late last year. - The series "Add a command line option that enables control of how many threads should be used to allocate huge pages" from Thomas Prescher does that. It allows the careful operator to significantly reduce boot time by tuning the parallalization of huge page initialization. - The series "Fix calculations in trace_balance_dirty_pages() for cgwb" from Tang Yizhou fixes the tracing output from the dirty page balancing code. - The series "mm/damon: make allow filters after reject filters useful and intuitive" from SeongJae Park improves the handling of allow and reject filters. Behaviour is made more consistent and the documention is updated accordingly. - The series "Switch zswap to object read/write APIs" from Yosry Ahmed updates zswap to the new object read/write APIs and thus permits the removal of some legacy code from zpool and zsmalloc. - The series "Some trivial cleanups for shmem" from Baolin Wang does as it claims. - The series "fs/dax: Fix ZONE_DEVICE page reference counts" from Alistair Popple regularizes the weird ZONE_DEVICE page refcount handling in DAX, permittig the removal of a number of special-case checks. - The series "refactor mremap and fix bug" from Lorenzo Stoakes is a preparatoty refactoring and cleanup of the mremap() code. - The series "mm: MM owner tracking for large folios (!hugetlb) + CONFIG_NO_PAGE_MAPCOUNT" from David Hildenbrand reworks the manner in which we determine whether a large folio is known to be mapped exclusively into a single MM. - The series "mm/damon: add sysfs dirs for managing DAMOS filters based on handling layers" from SeongJae Park adds a couple of new sysfs directories to ease the management of DAMON/DAMOS filters. - The series "arch, mm: reduce code duplication in mem_init()" from Mike Rapoport consolidates many per-arch implementations of mem_init() into code generic code, where that is practical. - The series "mm/damon/sysfs: commit parameters online via damon_call()" from SeongJae Park continues the cleaning up of sysfs access to DAMON internal data. - The series "mm: page_ext: Introduce new iteration API" from Luiz Capitulino reworks the page_ext initialization to fix a boot-time crash which was observed with an unusual combination of compile and cmdline options. - The series "Buddy allocator like (or non-uniform) folio split" from Zi Yan reworks the code to split a folio into smaller folios. The main benefit is lessened memory consumption: fewer post-split folios are generated. - The series "Minimize xa_node allocation during xarry split" from Zi Yan reduces the number of xarray xa_nodes which are generated during an xarray split. - The series "drivers/base/memory: Two cleanups" from Gavin Shan performs some maintenance work on the drivers/base/memory code. - The series "Add tracepoints for lowmem reserves, watermarks and totalreserve_pages" from Martin Liu adds some more tracepoints to the page allocator code. - The series "mm/madvise: cleanup requests validations and classifications" from SeongJae Park cleans up some warts which SeongJae observed during his earlier madvise work. - The series "mm/hwpoison: Fix regressions in memory failure handling" from Shuai Xue addresses two quite serious regressions which Shuai has observed in the memory-failure implementation. - The series "mm: reliable huge page allocator" from Johannes Weiner makes huge page allocations cheaper and more reliable by reducing fragmentation. - The series "Minor memcg cleanups & prep for memdescs" from Matthew Wilcox is preparatory work for the future implementation of memdescs. - The series "track memory used by balloon drivers" from Nico Pache introduces a way to track memory used by our various balloon drivers. - The series "mm/damon: introduce DAMOS filter type for active pages" from Nhat Pham permits users to filter for active/inactive pages, separately for file and anon pages. - The series "Adding Proactive Memory Reclaim Statistics" from Hao Jia separates the proactive reclaim statistics from the direct reclaim statistics. - The series "mm/vmscan: don't try to reclaim hwpoison folio" from Jinjiang Tu fixes our handling of hwpoisoned pages within the reclaim code. * tag 'mm-stable-2025-03-30-16-52' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (431 commits) mm/page_alloc: remove unnecessary __maybe_unused in order_to_pindex() x86/mm: restore early initialization of high_memory for 32-bits mm/vmscan: don't try to reclaim hwpoison folio mm/hwpoison: introduce folio_contain_hwpoisoned_page() helper cgroup: docs: add pswpin and pswpout items in cgroup v2 doc mm: vmscan: split proactive reclaim statistics from direct reclaim statistics selftests/mm: speed up split_huge_page_test selftests/mm: uffd-unit-tests support for hugepages > 2M docs/mm/damon/design: document active DAMOS filter type mm/damon: implement a new DAMOS filter type for active pages fs/dax: don't disassociate zero page entries MM documentation: add "Unaccepted" meminfo entry selftests/mm: add commentary about 9pfs bugs fork: use __vmalloc_node() for stack allocation docs/mm: Physical Memory: Populate the "Zones" section xen: balloon: update the NR_BALLOON_PAGES state hv_balloon: update the NR_BALLOON_PAGES state balloon_compaction: update the NR_BALLOON_PAGES state meminfo: add a per node counter for balloon drivers mm: remove references to folio in __memcg_kmem_uncharge_page() ...
2025-03-17ext4: remove redundant function ext4_has_metadata_csumEric Biggers1-1/+1
Since commit f2b4fa19647e ("ext4: switch to using the crc32c library"), ext4_has_metadata_csum() is just an alias for ext4_has_feature_metadata_csum(). ext4_has_feature_metadata_csum() is generated by EXT4_FEATURE_RO_COMPAT_FUNCS and uses the regular naming convention for checking a single ext4 feature. Therefore, remove ext4_has_metadata_csum() and update all its callers to use ext4_has_feature_metadata_csum() directly. Signed-off-by: Eric Biggers <ebiggers@google.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Link: https://patch.msgid.link/20250207031335.42637-1-ebiggers@kernel.org Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-03-17fs: convert block_commit_write() to take a folioMatthew Wilcox (Oracle)1-1/+1
All callers now have a folio, so pass it in instead of converting folio->page->folio. Link: https://lkml.kernel.org/r/20250217192009.437916-1-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17ext4: remove unused input "inode" in ext4_find_dest_deKemeng Shi1-1/+1
Remove unused input "inode" in ext4_find_dest_de. Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Link: https://patch.msgid.link/20250123162050.2114499-4-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-03-13ext4: add ext4_emergency_state() helper functionBaokun Li1-1/+1
Since both SHUTDOWN and EMERGENCY_RO are emergency states of the ext4 file system, and they are checked in similar locations, we have added a helper function, ext4_emergency_state(), to determine whether the current file system is in one of these two emergency states. Then, replace calls to ext4_forced_shutdown() with ext4_emergency_state() in those functions that could potentially trigger write operations. Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Link: https://patch.msgid.link/20250122114130.229709-4-libaokun@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-03-13ext4: Refactor out ext4_try_to_write_inline_data()Julian Sun1-74/+3
Refactor ext4_try_to_write_inline_data() to simplify its implementation by directly invoking ext4_generic_write_inline_data(). Signed-off-by: Julian Sun <sunjunchao2870@gmail.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20250107045730.1837808-1-sunjunchao2870@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-03-13ext4: Replace ext4_da_write_inline_data_begin() with ↵Julian Sun1-89/+9
ext4_generic_write_inline_data(). Replace the call to ext4_da_write_inline_data_begin() with ext4_generic_write_inline_data(), and delete the ext4_da_write_inline_data_begin(). Signed-off-by: Julian Sun <sunjunchao2870@gmail.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20250107045710.1837756-1-sunjunchao2870@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-03-13ext4: Introduce a new helper function ext4_generic_write_inline_data()Julian Sun1-0/+86
A new function, ext4_generic_write_inline_data(), is introduced to provide a generic implementation of the common logic found in ext4_da_write_inline_data_begin() and ext4_try_to_write_inline_data(). This function will be utilized in the subsequent two patches. Signed-off-by: Julian Sun <sunjunchao2870@gmail.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20250107045549.1837589-1-sunjunchao2870@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-09-21Merge tag 'ext4_for_linus-6.12-rc1' of ↵Linus Torvalds1-15/+31
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 updates from Ted Ts'o: "Lots of cleanups and bug fixes this cycle, primarily in the block allocation, extent management, fast commit, and journalling" * tag 'ext4_for_linus-6.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (93 commits) ext4: convert EXT4_B2C(sbi->s_stripe) users to EXT4_NUM_B2C ext4: check stripe size compatibility on remount as well ext4: fix i_data_sem unlock order in ext4_ind_migrate() ext4: remove the special buffer dirty handling in do_journal_get_write_access ext4: fix a potential assertion failure due to improperly dirtied buffer ext4: hoist ext4_block_write_begin and replace the __block_write_begin ext4: persist the new uptodate buffers in ext4_journalled_zero_new_buffers ext4: dax: keep orphan list before truncate overflow allocated blocks ext4: fix error message when rejecting the default hash ext4: save unnecessary indentation in ext4_ext_create_new_leaf() ext4: make some fast commit functions reuse extents path ext4: refactor ext4_swap_extents() to reuse extents path ext4: get rid of ppath in convert_initialized_extent() ext4: get rid of ppath in ext4_ext_handle_unwritten_extents() ext4: get rid of ppath in ext4_ext_convert_to_initialized() ext4: get rid of ppath in ext4_convert_unwritten_extents_endio() ext4: get rid of ppath in ext4_split_convert_extents() ext4: get rid of ppath in ext4_split_extent() ext4: get rid of ppath in ext4_force_split_extent_at() ext4: get rid of ppath in ext4_split_extent_at() ...
2024-09-16Merge tag 'vfs-6.12.file' of ↵Linus Torvalds1-3/+4
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs file updates from Christian Brauner: "This is the work to cleanup and shrink struct file significantly. Right now, (focusing on x86) struct file is 232 bytes. After this series struct file will be 184 bytes aka 3 cacheline and a spare 8 bytes for future extensions at the end of the struct. With struct file being as ubiquitous as it is this should make a difference for file heavy workloads and allow further optimizations in the future. - struct fown_struct was embedded into struct file letting it take up 32 bytes in total when really it shouldn't even be embedded in struct file in the first place. Instead, actual users of struct fown_struct now allocate the struct on demand. This frees up 24 bytes. - Move struct file_ra_state into the union containg the cleanup hooks and move f_iocb_flags out of the union. This closes a 4 byte hole we created earlier and brings struct file to 192 bytes. Which means struct file is 3 cachelines and we managed to shrink it by 40 bytes. - Reorder struct file so that nothing crosses a cacheline. I suspect that in the future we will end up reordering some members to mitigate false sharing issues or just because someone does actually provide really good perf data. - Shrinking struct file to 192 bytes is only part of the work. Files use a slab that is SLAB_TYPESAFE_BY_RCU and when a kmem cache is created with SLAB_TYPESAFE_BY_RCU the free pointer must be located outside of the object because the cache doesn't know what part of the memory can safely be overwritten as it may be needed to prevent object recycling. That has the consequence that SLAB_TYPESAFE_BY_RCU may end up adding a new cacheline. So this also contains work to add a new kmem_cache_create_rcu() function that allows the caller to specify an offset where the freelist pointer is supposed to be placed. Thus avoiding the implicit addition of a fourth cacheline. - And finally this removes the f_version member in struct file. The f_version member isn't particularly well-defined. It is mainly used as a cookie to detect concurrent seeks when iterating directories. But it is also abused by some subsystems for completely unrelated things. It is mostly a directory and filesystem specific thing that doesn't really need to live in struct file and with its wonky semantics it really lacks a specific function. For pipes, f_version is (ab)used to defer poll notifications until a write has happened. And struct pipe_inode_info is used by multiple struct files in their ->private_data so there's no chance of pushing that down into file->private_data without introducing another pointer indirection. But pipes don't rely on f_pos_lock so this adds a union into struct file encompassing f_pos_lock and a pipe specific f_pipe member that pipes can use. This union of course can be extended to other file types and is similar to what we do in struct inode already" * tag 'vfs-6.12.file' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (26 commits) fs: remove f_version pipe: use f_pipe fs: add f_pipe ubifs: store cookie in private data ufs: store cookie in private data udf: store cookie in private data proc: store cookie in private data ocfs2: store cookie in private data input: remove f_version abuse ext4: store cookie in private data ext2: store cookie in private data affs: store cookie in private data fs: add generic_llseek_cookie() fs: use must_set_pos() fs: add must_set_pos() fs: add vfs_setpos_cookie() s390: remove unused f_version ceph: remove unused f_version adi: remove unused f_version mm: Removed @freeptr_offset to prevent doc warning ...
2024-09-09ext4: store cookie in private dataChristian Brauner1-3/+4
Store the cookie to detect concurrent seeks on directories in file->private_data. Link: https://lore.kernel.org/r/20240830-vfs-file-f_version-v1-11-6d3e4816aa7b@kernel.org Acked-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-09-04ext4: fix a potential assertion failure due to improperly dirtied bufferShida Zhang1-3/+4
On an old kernel version(4.19, ext3, data=journal, pagesize=64k), an assertion failure will occasionally be triggered by the line below: ----------- jbd2_journal_commit_transaction { ... J_ASSERT_BH(bh, !buffer_dirty(bh)); /* * The buffer on BJ_Forget list and not jbddirty means ... } ----------- The same condition may also be applied to the lattest kernel version. When blocksize < pagesize and we truncate a file, there can be buffers in the mapping tail page beyond i_size. These buffers will be filed to transaction's BJ_Forget list by ext4_journalled_invalidatepage() during truncation. When the transaction doing truncate starts committing, we can grow the file again. This calls __block_write_begin() which allocates new blocks under these buffers in the tail page we go through the branch: if (buffer_new(bh)) { clean_bdev_bh_alias(bh); if (folio_test_uptodate(folio)) { clear_buffer_new(bh); set_buffer_uptodate(bh); mark_buffer_dirty(bh); continue; } ... } Hence buffers on BJ_Forget list of the committing transaction get marked dirty and this triggers the jbd2 assertion. Teach ext4_block_write_begin() to properly handle files with data journalling by avoiding dirtying them directly. Instead of folio_zero_new_buffers() we use ext4_journalled_zero_new_buffers() which takes care of handling journalling. We also don't need to mark new uptodate buffers as dirty in ext4_block_write_begin(). That will be either done either by block_commit_write() in case of success or by folio_zero_new_buffers() in case of failure. Reported-by: Baolin Liu <liubaolin@kylinos.cn> Suggested-by: Jan Kara <jack@suse.cz> Signed-off-by: Shida Zhang <zhangshida@kylinos.cn> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20240830053739.3588573-4-zhangshida@kylinos.cn Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-09-04ext4: hoist ext4_block_write_begin and replace the __block_write_beginShida Zhang1-5/+5
Using __block_write_begin() make it inconvenient to journal the user data dirty process. We can't tell the block layer maintainer, ‘Hey, we want to trace the dirty user data in ext4, can we add some special code for ext4 in __block_write_begin?’:P So use ext4_block_write_begin() instead. The two functions are basically doing the same thing except for the fscrypt related code. Remove the unnecessary #ifdef since fscrypt_inode_uses_fs_layer_crypto() returns false (and it's known at compile time) when !CONFIG_FS_ENCRYPTION. And hoist the ext4_block_write_begin so that it can be used in other files. Suggested-by: Jan Kara <jack@suse.cz> Suggested-by: Eric Biggers <ebiggers@kernel.org> Signed-off-by: Shida Zhang <zhangshida@kylinos.cn> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20240830053739.3588573-3-zhangshida@kylinos.cn Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-09-04ext4: avoid OOB when system.data xattr changes underneath the filesystemThadeu Lima de Souza Cascardo1-10/+21
When looking up for an entry in an inlined directory, if e_value_offs is changed underneath the filesystem by some change in the block device, it will lead to an out-of-bounds access that KASAN detects as an UAF. EXT4-fs (loop0): mounted filesystem 00000000-0000-0000-0000-000000000000 r/w without journal. Quota mode: none. loop0: detected capacity change from 2048 to 2047 ================================================================== BUG: KASAN: use-after-free in ext4_search_dir+0xf2/0x1c0 fs/ext4/namei.c:1500 Read of size 1 at addr ffff88803e91130f by task syz-executor269/5103 CPU: 0 UID: 0 PID: 5103 Comm: syz-executor269 Not tainted 6.11.0-rc4-syzkaller #0 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2~bpo12+1 04/01/2014 Call Trace: <TASK> __dump_stack lib/dump_stack.c:93 [inline] dump_stack_lvl+0x241/0x360 lib/dump_stack.c:119 print_address_description mm/kasan/report.c:377 [inline] print_report+0x169/0x550 mm/kasan/report.c:488 kasan_report+0x143/0x180 mm/kasan/report.c:601 ext4_search_dir+0xf2/0x1c0 fs/ext4/namei.c:1500 ext4_find_inline_entry+0x4be/0x5e0 fs/ext4/inline.c:1697 __ext4_find_entry+0x2b4/0x1b30 fs/ext4/namei.c:1573 ext4_lookup_entry fs/ext4/namei.c:1727 [inline] ext4_lookup+0x15f/0x750 fs/ext4/namei.c:1795 lookup_one_qstr_excl+0x11f/0x260 fs/namei.c:1633 filename_create+0x297/0x540 fs/namei.c:3980 do_symlinkat+0xf9/0x3a0 fs/namei.c:4587 __do_sys_symlinkat fs/namei.c:4610 [inline] __se_sys_symlinkat fs/namei.c:4607 [inline] __x64_sys_symlinkat+0x95/0xb0 fs/namei.c:4607 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f RIP: 0033:0x7f3e73ced469 Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 21 18 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48 RSP: 002b:00007fff4d40c258 EFLAGS: 00000246 ORIG_RAX: 000000000000010a RAX: ffffffffffffffda RBX: 0032656c69662f2e RCX: 00007f3e73ced469 RDX: 0000000020000200 RSI: 00000000ffffff9c RDI: 00000000200001c0 RBP: 0000000000000000 R08: 00007fff4d40c290 R09: 00007fff4d40c290 R10: 0023706f6f6c2f76 R11: 0000000000000246 R12: 00007fff4d40c27c R13: 0000000000000003 R14: 431bde82d7b634db R15: 00007fff4d40c2b0 </TASK> Calling ext4_xattr_ibody_find right after reading the inode with ext4_get_inode_loc will lead to a check of the validity of the xattrs, avoiding this problem. Reported-by: syzbot+0c2508114d912a54ee79@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=0c2508114d912a54ee79 Fixes: e8e948e7802a ("ext4: let ext4_find_entry handle inline data") Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@igalia.com> Link: https://patch.msgid.link/20240821152324.3621860-5-cascardo@igalia.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-09-04ext4: return error on ext4_find_inline_entryThadeu Lima de Souza Cascardo1-3/+7
In case of errors when reading an inode from disk or traversing inline directory entries, return an error-encoded ERR_PTR instead of returning NULL. ext4_find_inline_entry only caller, __ext4_find_entry already returns such encoded errors. Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@igalia.com> Link: https://patch.msgid.link/20240821152324.3621860-3-cascardo@igalia.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-08-07buffer: Convert __block_write_begin() to take a folioMatthew Wilcox (Oracle)1-3/+3
Almost all callers have a folio now, so change __block_write_begin() to take a folio and remove a call to compound_head(). Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-08-07fs: Convert aops->write_begin to take a folioMatthew Wilcox (Oracle)1-4/+4
Convert all callers from working on a page to working on one page of a folio (support for working on an entire folio can come later). Removes a lot of folio->page->folio conversions. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-06-27ext4: fix uninitialized variable in ext4_inlinedir_to_treeXiaxi Shen1-1/+5
Syzbot has found an uninit-value bug in ext4_inlinedir_to_tree This error happens because ext4_inlinedir_to_tree does not handle the case when ext4fs_dirhash returns an error This can be avoided by checking the return value of ext4fs_dirhash and propagating the error, similar to how it's done with ext4_htree_store_dirent Signed-off-by: Xiaxi Shen <shenxiaxi26@gmail.com> Reported-and-tested-by: syzbot+eaba5abe296837a640c0@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=eaba5abe296837a640c0 Link: https://patch.msgid.link/20240501033017.220000-1-shenxiaxi26@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-12-11mm: add folio_zero_tail() and use it in ext4Matthew Wilcox (Oracle)1-2/+1
Patch series "Add folio_zero_tail() and folio_fill_tail()". I'm trying to make it easier for filesystems with tailpacking / stuffing / inline data to use folios. The primary function here is folio_fill_tail(). You give it a pointer to memory where the data currently is, and it takes care of copying it into the folio at that offset. That works for gfs2 & iomap. Then There's Ext4. Rather than gin up some kind of specialist "Here's a two pointers to two blocks of memory" routine, just let it do its current thing, and let it call folio_zero_tail(), which is also called by folio_fill_tail(). Other filesystems can be converted later; these ones seemed like good examples as they're already partly or completely converted to folios. This patch (of 3): Instead of unmapping the folio after copying the data to it, then mapping it again to zero the tail, provide folio_zero_tail() to zero the tail of an already-mapped folio. [akpm@linux-foundation.org: fix kerneldoc argument ordering] Link: https://lkml.kernel.org/r/20231107212643.3490372-1-willy@infradead.org Link: https://lkml.kernel.org/r/20231107212643.3490372-2-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Andreas Gruenbacher <agruenba@redhat.com> Cc: Darrick J. Wong <djwong@kernel.org> Cc: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-10-18ext4: convert to new timestamp accessorsJeff Layton1-2/+2
Convert to using the new inode timestamp accessor functions. Signed-off-by: Jeff Layton <jlayton@kernel.org> Link: https://lore.kernel.org/r/20231004185347.80880-33-jlayton@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-09-01Merge tag 'ext4_for_linus-6.6-rc1' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 updates from Ted Ts'o: "Many ext4 and jbd2 cleanups and bug fixes: - Cleanups in the ext4 remount code when going to and from read-only - Cleanups in ext4's multiblock allocator - Cleanups in the jbd2 setup/mounting code paths - Performance improvements when appending to a delayed allocation file - Miscellaneous syzbot and other bug fixes" * tag 'ext4_for_linus-6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (60 commits) ext4: fix slab-use-after-free in ext4_es_insert_extent() libfs: remove redundant checks of s_encoding ext4: remove redundant checks of s_encoding ext4: reject casefold inode flag without casefold feature ext4: use LIST_HEAD() to initialize the list_head in mballoc.c ext4: do not mark inode dirty every time when appending using delalloc ext4: rename s_error_work to s_sb_upd_work ext4: add periodic superblock update check ext4: drop dio overwrite only flag and associated warning ext4: add correct group descriptors and reserved GDT blocks to system zone ext4: remove unused function declaration ext4: mballoc: avoid garbage value from err ext4: use sbi instead of EXT4_SB(sb) in ext4_mb_new_blocks_simple() ext4: change the type of blocksize in ext4_mb_init_cache() ext4: fix unttached inode after power cut with orphan file feature enabled jbd2: correct the end of the journal recovery scan range ext4: ext4_get_{dev}_journal return proper error value ext4: cleanup ext4_get_dev_journal() and ext4_get_journal() jbd2: jbd2_journal_init_{dev,inode} return proper error return value jbd2: drop useless error tag in jbd2_journal_wipe() ...
2023-07-30ext4: make ext4_forced_shutdown() take struct super_blockJan Kara1-1/+1
Currently ext4_forced_shutdown() takes struct ext4_sb_info but most callers need to get it from struct super_block anyway. So just pass in struct super_block to save all callers from some boilerplate code. No functional changes. Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230616165109.21695-3-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-07-24ext4: convert to ctime accessor functionsJeff Layton1-2/+2
In later patches, we're going to change how the inode's ctime field is used. Switch to using accessor functions instead of raw accesses of inode->i_ctime. Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Jan Kara <jack@suse.cz> Message-Id: <20230705190309.579783-40-jlayton@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-06-27ext4: make ext4_es_remove_extent() return voidBaokun Li1-10/+2
Now ext4_es_remove_extent() never fails, so make it return void. Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230424033846.4732-10-libaokun1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-06-15ext4: Make ext4_write_inline_data_end() use folioRitesh Harjani1-2/+1
ext4_write_inline_data_end() is completely converted to work with folio. Also all callers of ext4_write_inline_data_end() already works on folio except ext4_da_write_end(). Mostly for consistency and saving few instructions maybe, this patch just converts ext4_da_write_end() to work with folio which makes the last caller of ext4_write_inline_data_end() also converted to work with folio. We then make ext4_write_inline_data_end() take folio instead of page. Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Link: https://lore.kernel.org/r/1bcea771720ff451a5a59b3f1bcd5fae51cb7ce7.1684122756.git.ritesh.list@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-06-15ext4: kill unused function ext4_journalled_write_inline_dataRitesh Harjani1-24/+0
Commit 3f079114bf522 ("ext4: Convert data=journal writeback to use ext4_writepages()") Added support for writeback of journalled data into ext4_writepages() and killed function __ext4_journalled_writepage() which used to call ext4_journalled_write_inline_data() for inline data. This function got left over by mistake. Hence kill it's definition as no one uses it. Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Link: https://lore.kernel.org/r/122b2a8d5e0650686f23ed6da26ed9e04105562b.1684122756.git.ritesh.list@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-05-14ext4: bail out of ext4_xattr_ibody_get() fails for any reasonTheodore Ts'o1-1/+1
In ext4_update_inline_data(), if ext4_xattr_ibody_get() fails for any reason, it's best if we just fail as opposed to stumbling on, especially if the failure is EFSCORRUPTED. Cc: stable@kernel.org Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-05-14ext4: add bounds checking in get_max_inline_xattr_value_size()Theodore Ts'o1-1/+11
Normally the extended attributes in the inode body would have been checked when the inode is first opened, but if someone is writing to the block device while the file system is mounted, it's possible for the inode table to get corrupted. Add bounds checking to avoid reading beyond the end of allocated memory if this happens. Reported-by: syzbot+1966db24521e5f6e23f7@syzkaller.appspotmail.com Link: https://syzkaller.appspot.com/bug?extid=1966db24521e5f6e23f7 Cc: stable@kernel.org Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-05-14ext4: fix deadlock when converting an inline directory in nojournal modeTheodore Ts'o1-1/+2
In no journal mode, ext4_finish_convert_inline_dir() can self-deadlock by calling ext4_handle_dirty_dirblock() when it already has taken the directory lock. There is a similar self-deadlock in ext4_incvert_inline_data_nolock() for data files which we'll fix at the same time. A simple reproducer demonstrating the problem: mke2fs -Fq -t ext2 -O inline_data -b 4k /dev/vdc 64 mount -t ext4 -o dirsync /dev/vdc /vdc cd /vdc mkdir file0 cd file0 touch file0 touch file1 attr -s BurnSpaceInEA -V abcde . touch supercalifragilisticexpialidocious Cc: stable@kernel.org Link: https://lore.kernel.org/r/20230507021608.1290720-1-tytso@mit.edu Reported-by: syzbot+91dccab7c64e2850a4e5@syzkaller.appspotmail.com Link: https://syzkaller.appspot.com/bug?id=ba84cc80a9491d65416bc7877e1650c87530fe8a Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-04-28Merge tag 'mm-stable-2023-04-27-15-30' of ↵Linus Torvalds1-9/+10
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: - Nick Piggin's "shoot lazy tlbs" series, to improve the peformance of switching from a user process to a kernel thread. - More folio conversions from Kefeng Wang, Zhang Peng and Pankaj Raghav. - zsmalloc performance improvements from Sergey Senozhatsky. - Yue Zhao has found and fixed some data race issues around the alteration of memcg userspace tunables. - VFS rationalizations from Christoph Hellwig: - removal of most of the callers of write_one_page() - make __filemap_get_folio()'s return value more useful - Luis Chamberlain has changed tmpfs so it no longer requires swap backing. Use `mount -o noswap'. - Qi Zheng has made the slab shrinkers operate locklessly, providing some scalability benefits. - Keith Busch has improved dmapool's performance, making part of its operations O(1) rather than O(n). - Peter Xu adds the UFFD_FEATURE_WP_UNPOPULATED feature to userfaultd, permitting userspace to wr-protect anon memory unpopulated ptes. - Kirill Shutemov has changed MAX_ORDER's meaning to be inclusive rather than exclusive, and has fixed a bunch of errors which were caused by its unintuitive meaning. - Axel Rasmussen give userfaultfd the UFFDIO_CONTINUE_MODE_WP feature, which causes minor faults to install a write-protected pte. - Vlastimil Babka has done some maintenance work on vma_merge(): cleanups to the kernel code and improvements to our userspace test harness. - Cleanups to do_fault_around() by Lorenzo Stoakes. - Mike Rapoport has moved a lot of initialization code out of various mm/ files and into mm/mm_init.c. - Lorenzo Stoakes removd vmf_insert_mixed_prot(), which was added for DRM, but DRM doesn't use it any more. - Lorenzo has also coverted read_kcore() and vread() to use iterators and has thereby removed the use of bounce buffers in some cases. - Lorenzo has also contributed further cleanups of vma_merge(). - Chaitanya Prakash provides some fixes to the mmap selftesting code. - Matthew Wilcox changes xfs and afs so they no longer take sleeping locks in ->map_page(), a step towards RCUification of pagefaults. - Suren Baghdasaryan has improved mmap_lock scalability by switching to per-VMA locking. - Frederic Weisbecker has reworked the percpu cache draining so that it no longer causes latency glitches on cpu isolated workloads. - Mike Rapoport cleans up and corrects the ARCH_FORCE_MAX_ORDER Kconfig logic. - Liu Shixin has changed zswap's initialization so we no longer waste a chunk of memory if zswap is not being used. - Yosry Ahmed has improved the performance of memcg statistics flushing. - David Stevens has fixed several issues involving khugepaged, userfaultfd and shmem. - Christoph Hellwig has provided some cleanup work to zram's IO-related code paths. - David Hildenbrand has fixed up some issues in the selftest code's testing of our pte state changing. - Pankaj Raghav has made page_endio() unneeded and has removed it. - Peter Xu contributed some rationalizations of the userfaultfd selftests. - Yosry Ahmed has fixed an issue around memcg's page recalim accounting. - Chaitanya Prakash has fixed some arm-related issues in the selftests/mm code. - Longlong Xia has improved the way in which KSM handles hwpoisoned pages. - Peter Xu fixes a few issues with uffd-wp at fork() time. - Stefan Roesch has changed KSM so that it may now be used on a per-process and per-cgroup basis. * tag 'mm-stable-2023-04-27-15-30' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (369 commits) mm,unmap: avoid flushing TLB in batch if PTE is inaccessible shmem: restrict noswap option to initial user namespace mm/khugepaged: fix conflicting mods to collapse_file() sparse: remove unnecessary 0 values from rc mm: move 'mmap_min_addr' logic from callers into vm_unmapped_area() hugetlb: pte_alloc_huge() to replace huge pte_alloc_map() maple_tree: fix allocation in mas_sparse_area() mm: do not increment pgfault stats when page fault handler retries zsmalloc: allow only one active pool compaction context selftests/mm: add new selftests for KSM mm: add new KSM process and sysfs knobs mm: add new api to enable ksm per process mm: shrinkers: fix debugfs file permissions mm: don't check VMA write permissions if the PTE/PMD indicates write permissions migrate_pages_batch: fix statistics for longterm pin retry userfaultfd: use helper function range_in_vma() lib/show_mem.c: use for_each_populated_zone() simplify code mm: correct arg in reclaim_pages()/reclaim_clean_pages_from_list() fs/buffer: convert create_page_buffers to folio_create_buffers fs/buffer: add folio_create_empty_buffers helper ...
2023-04-06ext4: Convert ext4_write_inline_data_end() to use a folioMatthew Wilcox1-14/+15
Convert the incoming page to a folio so that we call compound_head() only once instead of seven times. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Theodore Ts'o <tytso@mit.edu> Link: https://lore.kernel.org/r/20230324180129.1220691-16-willy@infradead.org Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-04-06ext4: Convert ext4_read_inline_page() to ext4_read_inline_folio()Matthew Wilcox1-13/+14
All callers now have a folio, so pass it and use it. The folio may be large, although I doubt we'll want to use a large folio for an inline file. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Theodore Ts'o <tytso@mit.edu> Link: https://lore.kernel.org/r/20230324180129.1220691-15-willy@infradead.org Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-04-06ext4: Convert ext4_da_write_inline_data_begin() to use a folioMatthew Wilcox1-11/+9
Saves a number of calls to compound_head(). Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Link: https://lore.kernel.org/r/20230324180129.1220691-14-willy@infradead.org Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-04-06ext4: Convert ext4_da_convert_inline_data_to_extent() to use a folioMatthew Wilcox1-13/+14
Saves a number of calls to compound_head(). Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Link: https://lore.kernel.org/r/20230324180129.1220691-13-willy@infradead.org Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-04-06ext4: Convert ext4_try_to_write_inline_data() to use a folioMatthew Wilcox1-13/+11
Saves a number of calls to compound_head(). Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Link: https://lore.kernel.org/r/20230324180129.1220691-12-willy@infradead.org Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-04-06ext4: Convert ext4_convert_inline_data_to_extent() to use a folioMatthew Wilcox1-21/+19
Saves a number of calls to compound_head(). Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Link: https://lore.kernel.org/r/20230324180129.1220691-11-willy@infradead.org Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-04-06ext4: Convert ext4_readpage_inline() to take a folioMatthew Wilcox1-7/+7
Use the folio API in this function, saves a few calls to compound_head(). Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Theodore Ts'o <tytso@mit.edu> Link: https://lore.kernel.org/r/20230324180129.1220691-10-willy@infradead.org Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-03-11ext4: move where set the MAY_INLINE_DATA flag is setYe Bin1-1/+0
The only caller of ext4_find_inline_data_nolock() that needs setting of EXT4_STATE_MAY_INLINE_DATA flag is ext4_iget_extra_inode(). In ext4_write_inline_data_end() we just need to update inode->i_inline_off. Since we are going to add one more caller that does not need to set EXT4_STATE_MAY_INLINE_DATA, just move setting of EXT4_STATE_MAY_INLINE_DATA out to ext4_iget_extra_inode(). Signed-off-by: Ye Bin <yebin10@huawei.com> Cc: stable@kernel.org Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230307015253.2232062-2-yebin@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-11-30fs/ext4: replace ternary operator with min()/max() and min_t()Jiangshan Yi1-2/+1
Fix the following coccicheck warning: fs/ext4/inline.c:183: WARNING opportunity for min(). fs/ext4/extents.c:2631: WARNING opportunity for max(). fs/ext4/extents.c:2632: WARNING opportunity for min(). fs/ext4/extents.c:5559: WARNING opportunity for max(). fs/ext4/super.c:6908: WARNING opportunity for min(). min()/max() and min_t() macro is defined in include/linux/minmax.h. It avoids multiple evaluations of the arguments when non-constant and performs strict type-checking. Reported-by: kernel test robot <lkp@intel.com> Suggested-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: Jiangshan Yi <yijiangshan@kylinos.cn> Reviewed-by: Lukas Czerner <lczerner@redhat.com> Link: https://lore.kernel.org/r/20220817025928.612851-1-13667453960@163.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-08-03ext4: correct max_inline_xattr_value_size computingBaokun Li1-0/+3
If the ext4 inode does not have xattr space, 0 is returned in the get_max_inline_xattr_value_size function. Otherwise, the function returns a negative value when the inode does not contain EXT4_STATE_XATTR. Cc: stable@kernel.org Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20220616021358.2504451-4-libaokun1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-08-03ext4: fix reading leftover inlined symlinksZhang Yi1-0/+30
Since commit 6493792d3299 ("ext4: convert symlink external data block mapping to bdev"), create new symlink with inline_data is not supported, but it missing to handle the leftover inlined symlinks, which could cause below error message and fail to read symlink. ls: cannot read symbolic link 'foo': Structure needs cleaning EXT4-fs error (device sda): ext4_map_blocks:605: inode #12: block 2021161080: comm ls: lblock 0 mapped to illegal pblock 2021161080 (length 1) Fix this regression by adding ext4_read_inline_link(), which read the inline data directly and convert it through a kmalloced buffer. Fixes: 6493792d3299 ("ext4: convert symlink external data block mapping to bdev") Cc: stable@kernel.org Reported-by: Torge Matthies <openglfreak@googlemail.com> Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Tested-by: Torge Matthies <openglfreak@googlemail.com> Link: https://lore.kernel.org/r/20220630090100.2769490-1-yi.zhang@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-05-25Merge tag 'folio-5.19' of git://git.infradead.org/users/willy/pagecacheLinus Torvalds1-22/+19
Pull page cache updates from Matthew Wilcox: - Appoint myself page cache maintainer - Fix how scsicam uses the page cache - Use the memalloc_nofs_save() API to replace AOP_FLAG_NOFS - Remove the AOP flags entirely - Remove pagecache_write_begin() and pagecache_write_end() - Documentation updates - Convert several address_space operations to use folios: - is_dirty_writeback - readpage becomes read_folio - releasepage becomes release_folio - freepage becomes free_folio - Change filler_t to require a struct file pointer be the first argument like ->read_folio * tag 'folio-5.19' of git://git.infradead.org/users/willy/pagecache: (107 commits) nilfs2: Fix some kernel-doc comments Appoint myself page cache maintainer fs: Remove aops->freepage secretmem: Convert to free_folio nfs: Convert to free_folio orangefs: Convert to free_folio fs: Add free_folio address space operation fs: Convert drop_buffers() to use a folio fs: Change try_to_free_buffers() to take a folio jbd2: Convert release_buffer_page() to use a folio jbd2: Convert jbd2_journal_try_to_free_buffers to take a folio reiserfs: Convert release_buffer_page() to use a folio fs: Remove last vestiges of releasepage ubifs: Convert to release_folio reiserfs: Convert to release_folio orangefs: Convert to release_folio ocfs2: Convert to release_folio nilfs2: Remove comment about releasepage nfs: Convert to release_folio jfs: Convert to release_folio ...
2022-05-22ext4: fix bug_on in ext4_writepagesYe Bin1-0/+12
we got issue as follows: EXT4-fs error (device loop0): ext4_mb_generate_buddy:1141: group 0, block bitmap and bg descriptor inconsistent: 25 vs 31513 free cls ------------[ cut here ]------------ kernel BUG at fs/ext4/inode.c:2708! invalid opcode: 0000 [#1] PREEMPT SMP KASAN PTI CPU: 2 PID: 2147 Comm: rep Not tainted 5.18.0-rc2-next-20220413+ #155 RIP: 0010:ext4_writepages+0x1977/0x1c10 RSP: 0018:ffff88811d3e7880 EFLAGS: 00010246 RAX: 0000000000000000 RBX: 0000000000000001 RCX: ffff88811c098000 RDX: 0000000000000000 RSI: ffff88811c098000 RDI: 0000000000000002 RBP: ffff888128140f50 R08: ffffffffb1ff6387 R09: 0000000000000000 R10: 0000000000000007 R11: ffffed10250281ea R12: 0000000000000001 R13: 00000000000000a4 R14: ffff88811d3e7bb8 R15: ffff888128141028 FS: 00007f443aed9740(0000) GS:ffff8883aef00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000020007200 CR3: 000000011c2a4000 CR4: 00000000000006e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> do_writepages+0x130/0x3a0 filemap_fdatawrite_wbc+0x83/0xa0 filemap_flush+0xab/0xe0 ext4_alloc_da_blocks+0x51/0x120 __ext4_ioctl+0x1534/0x3210 __x64_sys_ioctl+0x12c/0x170 do_syscall_64+0x3b/0x90 It may happen as follows: 1. write inline_data inode vfs_write new_sync_write ext4_file_write_iter ext4_buffered_write_iter generic_perform_write ext4_da_write_begin ext4_da_write_inline_data_begin -> If inline data size too small will allocate block to write, then mapping will has dirty page ext4_da_convert_inline_data_to_extent ->clear EXT4_STATE_MAY_INLINE_DATA 2. fallocate do_vfs_ioctl ioctl_preallocate vfs_fallocate ext4_fallocate ext4_convert_inline_data ext4_convert_inline_data_nolock ext4_map_blocks -> fail will goto restore data ext4_restore_inline_data ext4_create_inline_data ext4_write_inline_data ext4_set_inode_state -> set inode EXT4_STATE_MAY_INLINE_DATA 3. writepages __ext4_ioctl ext4_alloc_da_blocks filemap_flush filemap_fdatawrite_wbc do_writepages ext4_writepages if (ext4_has_inline_data(inode)) BUG_ON(ext4_test_inode_state(inode, EXT4_STATE_MAY_INLINE_DATA)) The root cause of this issue is we destory inline data until call ext4_writepages under delay allocation mode. But there maybe already convert from inline to extent. To solve this issue, we call filemap_flush first.. Cc: stable@kernel.org Signed-off-by: Ye Bin <yebin10@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20220516122634.1690462-1-yebin10@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-05-11ext4: remove unnecessary type castingsYu Zhe1-3/+3
remove unnecessary void* type castings. Signed-off-by: Yu Zhe <yuzhe@nfschina.com> Link: https://lore.kernel.org/r/20220401081321.73735-1-yuzhe@nfschina.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-05-08fs: Remove aop flags parameter from grab_cache_page_write_begin()Matthew Wilcox (Oracle)1-4/+4
There are no more aop flags left, so remove the parameter. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2022-05-08ext4: Use scoped memory APIs in ext4_write_begin()Matthew Wilcox (Oracle)1-11/+10
Instead of setting AOP_FLAG_NOFS, use memalloc_nofs_save() and memalloc_nofs_restore() to prevent GFP_FS allocations recursing into the filesystem with a journal already started. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Acked-by: Theodore Ts'o <tytso@mit.edu>
2022-05-08ext4: Use scoped memory APIs in ext4_da_write_begin()Matthew Wilcox (Oracle)1-8/+8
Instead of setting AOP_FLAG_NOFS, use memalloc_nofs_save() and memalloc_nofs_restore() to prevent GFP_FS allocations recursing into the filesystem with a journal already started. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Acked-by: Theodore Ts'o <tytso@mit.edu>
2022-05-08ext4: Allow GFP_FS allocations in ext4_da_convert_inline_data_to_extent()Matthew Wilcox (Oracle)1-3/+1
Since commit 8bc1379b82b8, the transaction is stopped before calling ext4_da_convert_inline_data_to_extent(), which means we can do GFP_FS allocations and recurse into the filesystem. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Acked-by: Theodore Ts'o <tytso@mit.edu>