kernel/linux.git/mm/huge_memory.c, branch v6.18.21

mm/huge_memory: fix a folio_split() race condition with folio_try_get()

2026-03-25T10:10:33+00:00

During a pagecache folio split, the values in the related xarray should not be changed from the original folio at xarray split time until all after-split folios are well formed and stored in the xarray. Current use of xas_try_split() in __split_unmapped_folio() lets some after-split folios show up at wrong indices in the xarray. When these misplaced after-split folios are unfrozen, before correct folios are stored via __xa_store(), and grabbed by folio_try_get(), they are returned to userspace at wrong file indices, causing data corruption. More detailed explanation is at the bottom. The reproducer is at: https://github.com/dfinity/thp-madv-remove-test It 1. creates a memfd, 2. forks, 3. in the child process, maps the file with large folios (via shmem code path) and reads the mapped file continuously with 16 threads, 4. in the parent process, uses madvise(MADV_REMOVE) to punch poles in the large folio. Data corruption can be observed without the fix. Basically, data from a wrong page->index is returned. Fix it by using the original folio in xas_try_split() calls, so that folio_try_get() can get the right after-split folios after the original folio is unfrozen. Uniform split, split_huge_page*(), is not affected, since it uses xas_split_alloc() and xas_split() only once and stores the original folio in the xarray. Change xas_split() used in uniform split branch to use the original folio to avoid confusion. Fixes below points to the commit introduces the code, but folio_split() is used in a later commit 7460b470a131f ("mm/truncate: use folio_split() in truncate operation"). More details: For example, a folio f is split non-uniformly into f, f2, f3, f4 like below: +----------------+---------+----+----+ | f | f2 | f3 | f4 | +----------------+---------+----+----+ but the xarray would look like below after __split_unmapped_folio() is done: +----------------+---------+----+----+ | f | f2 | f3 | f3 | +----------------+---------+----+----+ After __split_unmapped_folio(), the code changes the xarray and unfreezes after-split folios: 1. unfreezes f2, __xa_store(f2) 2. unfreezes f3, __xa_store(f3) 3. unfreezes f4, __xa_store(f4), which overwrites the second f3 to f4. 4. unfreezes f. Meanwhile, a parallel filemap_get_entry() can read the second f3 from the xarray and use folio_try_get() on it at step 2 when f3 is unfrozen. Then, f3 is wrongly returned to user. After the fix, the xarray looks like below after __split_unmapped_folio(): +----------------+---------+----+----+ | f | f | f | f | +----------------+---------+----+----+ so that the race window no longer exists. [ziy@nvidia.com: move comment, per David] Link: https://lkml.kernel.org/r/5C9FA053-A4C6-4615-BE05-74E47A6462B3@nvidia.com Link: https://lkml.kernel.org/r/20260302203159.3208341-1-ziy@nvidia.com Fixes: 00527733d0dc ("mm/huge_memory: add two new (not yet used) functions for folio_split()") Signed-off-by: Zi Yan Reported-by: Bas van Dijk Closes: https://lore.kernel.org/all/CAKNNEtw5_kZomhkugedKMPOG-sxs5Q5OLumWJdiWXv+C9Yct0w@mail.gmail.com/ Tested-by: Lance Yang Reviewed-by: Lorenzo Stoakes Reviewed-by: Wei Yang Reviewed-by: Baolin Wang Cc: Barry Song Cc: David Hildenbrand Cc: Dev Jain Cc: Hugh Dickins Cc: Liam Howlett Cc: Matthew Wilcox (Oracle) Cc: Nico Pache Cc: Ryan Roberts Cc: Signed-off-by: Andrew Morton (cherry picked from commit 577a1f495fd78d8fb61b67ac3d3b595b01f6fcb0) Signed-off-by: Zi Yan Signed-off-by: Greg Kroah-Hartman

mm/huge_memory: fix use of NULL folio in move_pages_huge_pmd()

2026-03-25T10:10:30+00:00

commit fae654083bfa409bb2244f390232e2be47f05bfc upstream. move_pages_huge_pmd() handles UFFDIO_MOVE for both normal THPs and huge zero pages. For the huge zero page path, src_folio is explicitly set to NULL, and is used as a sentinel to skip folio operations like lock and rmap. In the huge zero page branch, src_folio is NULL, so folio_mk_pmd(NULL, pgprot) passes NULL through folio_pfn() and page_to_pfn(). With SPARSEMEM_VMEMMAP this silently produces a bogus PFN, installing a PMD pointing to non-existent physical memory. On other memory models it is a NULL dereference. Use page_folio(src_page) to obtain the valid huge zero folio from the page, which was obtained from pmd_page() and remains valid throughout. After commit d82d09e48219 ("mm/huge_memory: mark PMD mappings of the huge zero folio special"), moved huge zero PMDs must remain special so vm_normal_page_pmd() continues to treat them as special mappings. move_pages_huge_pmd() currently reconstructs the destination PMD in the huge zero page branch, which drops PMD state such as pmd_special() on architectures with CONFIG_ARCH_HAS_PTE_SPECIAL. As a result, vm_normal_page_pmd() can treat the moved huge zero PMD as a normal page and corrupt its refcount. Instead of reconstructing the PMD from the folio, derive the destination entry from src_pmdval after pmdp_huge_clear_flush(), then handle the PMD metadata the same way move_huge_pmd() does for moved entries by marking it soft-dirty and clearing uffd-wp. Link: https://lkml.kernel.org/r/a1e787dd-b911-474d-8570-f37685357d86@lucifer.local Fixes: e3981db444a0 ("mm: add folio_mk_pmd()") Signed-off-by: Chris Down Signed-off-by: Lorenzo Stoakes Reviewed-by: Lorenzo Stoakes Tested-by: Lorenzo Stoakes Acked-by: David Hildenbrand (Arm) Cc: Signed-off-by: Andrew Morton Signed-off-by: Greg Kroah-Hartman

mm: thp: deny THP for files on anonymous inodes

2026-03-12T11:09:41+00:00

commit dd085fe9a8ebfc5d10314c60452db38d2b75e609 upstream. file_thp_enabled() incorrectly allows THP for files on anonymous inodes (e.g. guest_memfd and secretmem). These files are created via alloc_file_pseudo(), which does not call get_write_access() and leaves inode->i_writecount at 0. Combined with S_ISREG(inode->i_mode) being true, they appear as read-only regular files when CONFIG_READ_ONLY_THP_FOR_FS is enabled, making them eligible for THP collapse. Anonymous inodes can never pass the inode_is_open_for_write() check since their i_writecount is never incremented through the normal VFS open path. The right thing to do is to exclude them from THP eligibility altogether, since CONFIG_READ_ONLY_THP_FOR_FS was designed for real filesystem files (e.g. shared libraries), not for pseudo-filesystem inodes. For guest_memfd, this allows khugepaged and MADV_COLLAPSE to create large folios in the page cache via the collapse path, but the guest_memfd fault handler does not support large folios. This triggers WARN_ON_ONCE(folio_test_large(folio)) in kvm_gmem_fault_user_mapping(). For secretmem, collapse_file() tries to copy page contents through the direct map, but secretmem pages are removed from the direct map. This can result in a kernel crash: BUG: unable to handle page fault for address: ffff88810284d000 RIP: 0010:memcpy_orig+0x16/0x130 Call Trace: collapse_file hpage_collapse_scan_file madvise_collapse Secretmem is not affected by the crash on upstream as the memory failure recovery handles the failed copy gracefully, but it still triggers confusing false memory failure reports: Memory failure: 0x106d96f: recovery action for clean unevictable LRU page: Recovered Check IS_ANON_FILE(inode) in file_thp_enabled() to deny THP for all anonymous inode files. Link: https://syzkaller.appspot.com/bug?extid=33a04338019ac7e43a44 Link: https://lore.kernel.org/linux-mm/CAEvNRgHegcz3ro35ixkDw39ES8=U6rs6S7iP0gkR9enr7HoGtA@mail.gmail.com Link: https://lkml.kernel.org/r/20260214001535.435626-1-kartikey406@gmail.com Fixes: 7fbb5e188248 ("mm: remove VM_EXEC requirement for THP eligibility") Signed-off-by: Deepanshu Kartikey Reported-by: syzbot+33a04338019ac7e43a44@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=33a04338019ac7e43a44 Tested-by: syzbot+33a04338019ac7e43a44@syzkaller.appspotmail.com Tested-by: Lance Yang Acked-by: David Hildenbrand (Arm) Reviewed-by: Barry Song Reviewed-by: Ackerley Tng Tested-by: Ackerley Tng Reviewed-by: Lorenzo Stoakes Cc: Baolin Wang Cc: Dev Jain Cc: Fangrui Song Cc: Liam Howlett Cc: Nico Pache Cc: Ryan Roberts Cc: Yang Shi Cc: Zi Yan Cc: Signed-off-by: Andrew Morton Signed-off-by: Greg Kroah-Hartman

mm/huge_memory: merge uniform_split_supported() and non_uniform_split_supported()

2026-01-08T09:16:41+00:00

[ Upstream commit 8a0e4bdddd1c998b894d879a1d22f1e745606215 ] uniform_split_supported() and non_uniform_split_supported() share significantly similar logic. The only functional difference is that uniform_split_supported() includes an additional check on the requested @new_order. The reason for this check comes from the following two aspects: * some file system or swap cache just supports order-0 folio * the behavioral difference between uniform/non-uniform split The behavioral difference between uniform split and non-uniform: * uniform split splits folio directly to @new_order * non-uniform split creates after-split folios with orders from folio_order(folio) - 1 to new_order. This means for non-uniform split or !new_order split we should check the file system and swap cache respectively. This commit unifies the logic and merge the two functions into a single combined helper, removing redundant code and simplifying the split support checking mechanism. Link: https://lkml.kernel.org/r/20251106034155.21398-3-richard.weiyang@gmail.com Fixes: c010d47f107f ("mm: thp: split huge page to any lower order pages") Signed-off-by: Wei Yang Reviewed-by: Zi Yan Cc: Zi Yan Cc: "David Hildenbrand (Red Hat)" Cc: Baolin Wang Cc: Barry Song Cc: Dev Jain Cc: Lance Yang Cc: Liam Howlett Cc: Lorenzo Stoakes Cc: Nico Pache Cc: Ryan Roberts Cc: Signed-off-by: Andrew Morton [ split_type => uniform_split and replaced SPLIT_TYPE_NON_UNIFORM checks ] Signed-off-by: Sasha Levin Signed-off-by: Greg Kroah-Hartman

mm/huge_memory: add pmd folio to ds_queue in do_huge_zero_wp_pmd()

2026-01-02T11:57:11+00:00

commit 2a1351cd4176ee1809b0900d386919d03b7652f8 upstream. We add pmd folio into ds_queue on the first page fault in __do_huge_pmd_anonymous_page(), so that we can split it in case of memory pressure. This should be the same for a pmd folio during wp page fault. Commit 1ced09e0331f ("mm: allocate THP on hugezeropage wp-fault") miss to add it to ds_queue, which means system may not reclaim enough memory in case of memory pressure even the pmd folio is under used. Move deferred_split_folio() into map_anon_folio_pmd() to make the pmd folio installation consistent. Link: https://lkml.kernel.org/r/20251008095453.18772-2-richard.weiyang@gmail.com Fixes: 1ced09e0331f ("mm: allocate THP on hugezeropage wp-fault") Signed-off-by: Wei Yang Acked-by: David Hildenbrand Reviewed-by: Lance Yang Reviewed-by: Dev Jain Acked-by: Usama Arif Reviewed-by: Zi Yan Reviewed-by: Baolin Wang Cc: Matthew Wilcox Cc: Signed-off-by: Andrew Morton Signed-off-by: Greg Kroah-Hartman

mm/huge_memory: fix NULL pointer deference when splitting folio

2025-11-24T22:25:17+00:00

Commit c010d47f107f ("mm: thp: split huge page to any lower order pages") introduced an early check on the folio's order via mapping->flags before proceeding with the split work. This check introduced a bug: for shmem folios in the swap cache and truncated folios, the mapping pointer can be NULL. Accessing mapping->flags in this state leads directly to a NULL pointer dereference. This commit fixes the issue by moving the check for mapping != NULL before any attempt to access mapping->flags. Link: https://lkml.kernel.org/r/20251119235302.24773-1-richard.weiyang@gmail.com Fixes: c010d47f107f ("mm: thp: split huge page to any lower order pages") Signed-off-by: Wei Yang Reviewed-by: Zi Yan Acked-by: David Hildenbrand (Red Hat) Reviewed-by: Baolin Wang Cc: Signed-off-by: Andrew Morton

mm/huge_memory: fix folio split check for anon folios in swapcache

2025-11-15T18:52:01+00:00

Both uniform and non uniform split check missed the check to prevent splitting anon folios in swapcache to non-zero order. Splitting anon folios in swapcache to non-zero order can cause data corruption since swapcache only support PMD order and order-0 entries. This can happen when one use split_huge_pages under debugfs to split anon folios in swapcache. In-tree callers do not perform such an illegal operation. Only debugfs interface could trigger it. I will put adding a test case on my TODO list. Fix the check. Link: https://lkml.kernel.org/r/20251105162910.752266-1-ziy@nvidia.com Fixes: 58729c04cf10 ("mm/huge_memory: add buddy allocator like (non-uniform) folio_split()") Signed-off-by: Zi Yan Reported-by: "David Hildenbrand (Red Hat)" Closes: https://lore.kernel.org/all/dc0ecc2c-4089-484f-917f-920fdca4c898@kernel.org/ Acked-by: David Hildenbrand (Red Hat) Cc: Baolin Wang Cc: Barry Song Cc: Dev Jain Cc: Lance Yang Cc: Liam Howlett Cc: Lorenzo Stoakes Cc: Nico Pache Cc: Ryan Roberts Cc: Wei Yang Cc: Signed-off-by: Andrew Morton

mm/huge_memory: initialise the tags of the huge zero folio

2025-11-10T05:19:46+00:00

On arm64 with MTE enabled, a page mapped as Normal Tagged (PROT_MTE) in user space will need to have its allocation tags initialised. This is normally done in the arm64 set_pte_at() after checking the memory attributes. Such page is also marked with the PG_mte_tagged flag to avoid subsequent clearing. Since this relies on having a struct page, pte_special() mappings are ignored. Commit d82d09e48219 ("mm/huge_memory: mark PMD mappings of the huge zero folio special") maps the huge zero folio special and the arm64 set_pmd_at() will no longer zero the tags. There is no guarantee that the tags are zero, especially if parts of this huge page have been previously tagged. It's fairly easy to detect this by regularly dropping the caches to force the reallocation of the huge zero folio. Allocate the huge zero folio with the __GFP_ZEROTAGS flag. In addition, do not warn in the arm64 __access_remote_tags() when reading tags from the huge zero page. I bundled the arm64 change in here as well since they are both related to the commit mapping the huge zero folio as special. [catalin.marinas@arm.com: handle arch mte_zero_clear_page_tags() code issuing MTE instructions] Link: https://lkml.kernel.org/r/aQi8dA_QpXM8XqrE@arm.com Link: https://lkml.kernel.org/r/20251031170133.280742-1-catalin.marinas@arm.com Fixes: d82d09e48219 ("mm/huge_memory: mark PMD mappings of the huge zero folio special") Signed-off-by: Catalin Marinas Acked-by: David Hildenbrand Reviewed-by: Lance Yang Tested-by: Beleswar Padhi Cc: Will Deacon Cc: Mark Brown Cc: Aishwarya TCV Cc: David Hildenbrand (Red Hat) Signed-off-by: Andrew Morton

mm/huge_memory: preserve PG_has_hwpoisoned if a folio is split to >0 order

2025-11-10T05:19:43+00:00

folio split clears PG_has_hwpoisoned, but the flag should be preserved in after-split folios containing pages with PG_hwpoisoned flag if the folio is split to >0 order folios. Scan all pages in a to-be-split folio to determine which after-split folios need the flag. An alternatives is to change PG_has_hwpoisoned to PG_maybe_hwpoisoned to avoid the scan and set it on all after-split folios, but resulting false positive has undesirable negative impact. To remove false positive, caller of folio_test_has_hwpoisoned() and folio_contain_hwpoisoned_page() needs to do the scan. That might be causing a hassle for current and future callers and more costly than doing the scan in the split code. More details are discussed in [1]. This issue can be exposed via: 1. splitting a has_hwpoisoned folio to >0 order from debugfs interface; 2. truncating part of a has_hwpoisoned folio in truncate_inode_partial_folio(). And later accesses to a hwpoisoned page could be possible due to the missing has_hwpoisoned folio flag. This will lead to MCE errors. Link: https://lore.kernel.org/all/CAHbLzkoOZm0PXxE9qwtF4gKR=cpRXrSrJ9V9Pm2DJexs985q4g@mail.gmail.com/ [1] Link: https://lkml.kernel.org/r/20251023030521.473097-1-ziy@nvidia.com Fixes: c010d47f107f ("mm: thp: split huge page to any lower order pages") Signed-off-by: Zi Yan Acked-by: David Hildenbrand Reviewed-by: Yang Shi Reviewed-by: Lorenzo Stoakes Reviewed-by: Lance Yang Reviewed-by: Miaohe Lin Reviewed-by: Baolin Wang Reviewed-by: Wei Yang Cc: Pankaj Raghav Cc: Barry Song Cc: Dev Jain Cc: Jane Chu Cc: Liam Howlett Cc: Luis Chamberalin Cc: Matthew Wilcox (Oracle) Cc: Naoya Horiguchi Cc: Nico Pache Cc: Ryan Roberts Cc: Signed-off-by: Andrew Morton

mm/huge_memory: do not change split_huge_page*() target order silently

2025-11-10T05:19:41+00:00

Page cache folios from a file system that support large block size (LBS) can have minimal folio order greater than 0, thus a high order folio might not be able to be split down to order-0. Commit e220917fa507 ("mm: split a folio in minimum folio order chunks") bumps the target order of split_huge_page*() to the minimum allowed order when splitting a LBS folio. This causes confusion for some split_huge_page*() callers like memory failure handling code, since they expect after-split folios all have order-0 when split succeeds but in reality get min_order_for_split() order folios and give warnings. Fix it by failing a split if the folio cannot be split to the target order. Rename try_folio_split() to try_folio_split_to_order() to reflect the added new_order parameter. Remove its unused list parameter. [The test poisons LBS folios, which cannot be split to order-0 folios, and also tries to poison all memory. The non split LBS folios take more memory than the test anticipated, leading to OOM. The patch fixed the kernel warning and the test needs some change to avoid OOM.] Link: https://lkml.kernel.org/r/20251017013630.139907-1-ziy@nvidia.com Fixes: e220917fa507 ("mm: split a folio in minimum folio order chunks") Signed-off-by: Zi Yan Reported-by: syzbot+e6367ea2fdab6ed46056@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/68d2c943.a70a0220.1b52b.02b3.GAE@google.com/ Reviewed-by: Luis Chamberlain Reviewed-by: Pankaj Raghav Reviewed-by: Wei Yang Acked-by: David Hildenbrand Reviewed-by: Lorenzo Stoakes Reviewed-by: Miaohe Lin Cc: Baolin Wang Cc: Barry Song Cc: David Hildenbrand Cc: Dev Jain Cc: Jane Chu Cc: Lance Yang Cc: Liam Howlett Cc: Mariano Pache Cc: Matthew Wilcox (Oracle) Cc: Naoya Horiguchi Cc: Ryan Roberts Cc: Christian Brauner Cc: Signed-off-by: Andrew Morton