kernel/linux.git/include/linux/hugetlb.h, branch v6.18.36

mm/memory-failure: fix hugetlb_lock AA deadlock in get_huge_page_for_hwpoison

2026-06-19T11:44:02+00:00

commit 3c2d42b8ee345b17a4ba56b0f6492d1ff4c1178e upstream.

Two concurrent madvise(MADV_HWPOISON) calls on the same hugetlb page can trigger a recursive spinlock self-deadlock (AA deadlock) on hugetlb_lock when racing with a concurrent unmap:

  thread#0                                 thread#1
  --------                                 --------
  madvise(folio, MADV_HWPOISON)
    -> poisons the folio successfully
  madvise(folio, MADV_HWPOISON)            unmap(folio)
    try_memory_failure_hugetlb
      get_huge_page_for_hwpoison
        spin_lock_irq(&hugetlb_lock) <- held
        __get_huge_page_for_hwpoison
          hugetlb_update_hwpoison()
            -> MF_HUGETLB_FOLIO_PRE_POISONED
          goto out:
            folio_put() refcount: 1 -> 0
              free_huge_folio()
                spin_lock_irqsave(&hugetlb_lock)
                  -> AA DEADLOCK!

The out: path in __get_huge_page_for_hwpoison() calls folio_put() to drop the GUP reference while the hugetlb_lock is still held by the hugetlb.c wrapper get_huge_page_for_hwpoison(). If concurrent unmap has released the page table mapping reference, folio_put() drops the folio refcount to zero, triggering free_huge_folio() which attempts to re-acquire the non-recursive hugetlb_lock.

Fix this by moving hugetlb_lock acquisition from the hugetlb.c wrapper into get_huge_page_for_hwpoison(). Place spin_unlock_irq() before the folio_put() at the out: label so the folio is always released outside the lock.

[akpm@linux-foundation.org: fix race, rename label per Miaohe]
Link: https://sashiko.dev/#/patchset/20260522010305.4099834-1-mawupeng1@huawei.com
Link: https://lore.kernel.org/f39f405e-4b4b-8f79-70fe-a2b5b62114eb@huawei.com
Link: https://lore.kernel.org/20260522010305.4099834-1-mawupeng1@huawei.com
Fixes: 405ce051236c ("mm/hwpoison: fix race between hugetlb free/demotion and memory_failure_hugetlb()")
Signed-off-by: Wupeng Ma
Acked-by: Oscar Salvador (SUSE)
Acked-by: Muchun Song
Reviewed-by: Kefeng Wang
Acked-by: Miaohe Lin
Cc: David Hildenbrand
Cc: Liam Howlett
Cc: Lorenzo Stoakes
Cc: Michal Hocko
Cc: Mike Rapoport
Cc: Naoya Horiguchi
Cc: Suren Baghdasaryan
Cc: Vlastimil Babka
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Greg Kroah-Hartman
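
The lock-ordering problem described in this commit can be illustrated with a small user-space model. This is only a sketch and not the kernel code: toy_lock, toy_folio and every toy_* function are hypothetical names, and the toy lock records a self-deadlock in a flag instead of spinning forever as a real non-recursive spinlock would.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy non-recursive lock: flags a self-deadlock instead of hanging. */
struct toy_lock {
	bool held;
	bool deadlocked;
};

/* Toy folio: only the fields needed to model the last-reference drop. */
struct toy_folio {
	int refcount;
	bool freed;
};

static struct toy_lock pool_lock; /* stands in for hugetlb_lock */

static void toy_lock_acquire(struct toy_lock *l)
{
	if (l->held) {
		/* Same context takes the lock twice: AA deadlock. */
		l->deadlocked = true;
		return;
	}
	l->held = true;
}

static void toy_lock_release(struct toy_lock *l)
{
	l->held = false;
}

/* Like free_huge_folio() in the description: the free path takes the lock. */
static void toy_free_folio(struct toy_folio *f)
{
	toy_lock_acquire(&pool_lock);
	if (pool_lock.deadlocked)
		return; /* a real spinlock would never return here */
	f->freed = true;
	toy_lock_release(&pool_lock);
}

static void toy_folio_put(struct toy_folio *f)
{
	assert(f->refcount > 0);
	if (--f->refcount == 0)
		toy_free_folio(f);
}

/* Pre-fix ordering: the last reference is dropped while the lock is held. */
bool toy_put_under_lock_deadlocks(void)
{
	struct toy_folio f = { .refcount = 1, .freed = false };

	pool_lock = (struct toy_lock){ 0 };
	toy_lock_acquire(&pool_lock);
	toy_folio_put(&f);              /* 1 -> 0, free path re-takes the lock */
	toy_lock_release(&pool_lock);
	return pool_lock.deadlocked;
}

/* Post-fix ordering: unlock first, then drop the reference. */
bool toy_put_after_unlock_deadlocks(void)
{
	struct toy_folio f = { .refcount = 1, .freed = false };

	pool_lock = (struct toy_lock){ 0 };
	toy_lock_acquire(&pool_lock);
	toy_lock_release(&pool_lock);
	toy_folio_put(&f);              /* free path takes the lock cleanly */
	return pool_lock.deadlocked;
}
```

In this model toy_put_under_lock_deadlocks() reports the self-deadlock, while toy_put_after_unlock_deadlocks() does not, which mirrors the reordering of spin_unlock_irq() and folio_put() that the commit message describes.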

mm/userfaultfd: fix hugetlb fault mutex hash calculation

2026-04-22T11:22:33+00:00

commit 0217c7fb4de4a40cee667eb21901f3204effe5ac upstream.

In mfill_atomic_hugetlb(), linear_page_index() is used to calculate the page index for hugetlb_fault_mutex_hash(). However, linear_page_index() returns the index in PAGE_SIZE units, while hugetlb_fault_mutex_hash() expects the index in huge page units. This mismatch means that different addresses within the same huge page can produce different hash values, leading to the use of different mutexes for the same huge page. This can cause races between faulting threads, which can corrupt the reservation map and trigger the BUG_ON in resv_map_release().

Fix this by introducing hugetlb_linear_page_index(), which returns the page index in huge page granularity, and using it in place of linear_page_index().

Link: https://lkml.kernel.org/r/20260310110526.335749-1-jianhuizzzzz@gmail.com
Fixes: a08c7193e4f1 ("mm/filemap: remove hugetlb special casing in filemap.c")
Signed-off-by: Jianhui Zhou
Reported-by: syzbot+f525fd79634858f478e7@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=f525fd79634858f478e7
Acked-by: SeongJae Park
Reviewed-by: David Hildenbrand (Arm)
Acked-by: Mike Rapoport (Microsoft)
Cc: Jane Chu
Cc: Andrea Arcangeli
Cc: Hugh Dickins
Cc: JonasZhou
Cc: Muchun Song
Cc: Oscar Salvador
Cc: Peter Xu
Cc: SeongJae Park
Cc: Sidhartha Kumar
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Greg Kroah-Hartman
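
The unit mismatch is easy to see with concrete numbers. The following user-space sketch assumes 4 KiB base pages and 2 MiB huge pages; struct toy_vma and the toy_* helpers are hypothetical stand-ins that only mirror the arithmetic described above, not the kernel's actual definitions.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_PAGE_SHIFT  12u /* 4 KiB base pages (assumption) */
#define TOY_HPAGE_SHIFT 21u /* 2 MiB huge pages (assumption) */

/* Minimal stand-in for the VMA fields the index calculation needs. */
struct toy_vma {
	uint64_t vm_start; /* first mapped address */
	uint64_t vm_pgoff; /* file offset of vm_start, in base pages */
};

/* Index in base-page units, the granularity linear_page_index() uses. */
uint64_t toy_linear_page_index(const struct toy_vma *vma, uint64_t addr)
{
	return ((addr - vma->vm_start) >> TOY_PAGE_SHIFT) + vma->vm_pgoff;
}

/* Index in huge-page units, the granularity the mutex hash expects. */
uint64_t toy_hugetlb_linear_page_index(const struct toy_vma *vma,
				       uint64_t addr)
{
	unsigned int order = TOY_HPAGE_SHIFT - TOY_PAGE_SHIFT;

	return ((addr - vma->vm_start) >> TOY_HPAGE_SHIFT) +
	       (vma->vm_pgoff >> order);
}
```

Two addresses 4 KiB apart inside the same huge page get different base-page indices but the same huge-page index, so only the second helper would feed a consistent key into a per-huge-page hash.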

mm/hugetlb: fix excessive IPI broadcasts when unsharing PMD tables using mmu_gather

2026-02-19T15:31:34+00:00

commit 8ce720d5bd91e9dc16db3604aa4b1bf76770a9a1 upstream.

As reported, ever since commit 1013af4f585f ("mm/hugetlb: fix huge_pmd_unshare() vs GUP-fast race") we can end up in some situations where we perform so many IPI broadcasts when unsharing hugetlb PMD page tables that it severely regresses some workloads.

In particular, when we fork()+exit(), or when we munmap() a large area backed by many shared PMD tables, we perform one IPI broadcast per unshared PMD table.

There are two optimizations to be had:

(1) When we process (unshare) multiple such PMD tables, such as during exit(), it is sufficient to send a single IPI broadcast (as long as we respect locking rules) instead of one per PMD table.

    Locking prevents any of these PMD tables from being reused before we drop the lock.

(2) When we are not the last sharer (> 2 users including us), there is no need to send the IPI broadcast. The shared PMD tables cannot become exclusive (fully unshared) before an IPI is broadcast by the last sharer.

    Concurrent GUP-fast could walk into a PMD table just before we unshared it. It could then succeed in grabbing a page from the shared page table even after munmap() etc. succeeded (and suppressed an IPI). But there is no difference compared to GUP-fast just sleeping for a while after grabbing the page and re-enabling IRQs.

    Most importantly, GUP-fast will never walk into page tables that are no longer shared, because the last sharer will issue an IPI broadcast.

    (if ever required, checking whether the PUD changed in GUP-fast after grabbing the page like we do in the PTE case could handle this)

So let's rework PMD sharing TLB flushing + IPI sync to use the mmu_gather infrastructure so we can implement these optimizations and demystify the code at least a bit. Extend the mmu_gather infrastructure to be able to deal with our special hugetlb PMD table sharing implementation.

To make initialization of the mmu_gather easier when working on a single VMA (in particular, when dealing with hugetlb), provide tlb_gather_mmu_vma().

We'll consolidate the handling for (full) unsharing of PMD tables in tlb_unshare_pmd_ptdesc() and tlb_flush_unshared_tables(), and track in "struct mmu_gather" whether we had (full) unsharing of PMD tables.

Because locking is very special (concurrent unsharing+reuse must be prevented), we disallow deferring flushing to tlb_finish_mmu() and instead require an explicit earlier call to tlb_flush_unshared_tables().

From hugetlb code, we call huge_pmd_unshare_flush() where we make sure that the expected lock protecting us from concurrent unsharing+reuse is still held.

Check with a VM_WARN_ON_ONCE() in tlb_finish_mmu() that tlb_flush_unshared_tables() was properly called earlier.

Document it all properly.

Notes about tlb_remove_table_sync_one() interaction with unsharing:

There are two fairly tricky things:

(1) tlb_remove_table_sync_one() is a NOP on architectures without CONFIG_MMU_GATHER_RCU_TABLE_FREE.

    Here, the assumption is that the previous TLB flush would send an IPI to all relevant CPUs. Careful: some architectures like x86 only send IPIs to all relevant CPUs when tlb->freed_tables is set.

    The relevant architectures should be selecting MMU_GATHER_RCU_TABLE_FREE, but x86 might not do that in stable kernels and it might have been problematic before this patch.

    Also, the arch flushing behavior (independent of IPIs) is different when tlb->freed_tables is set. Do we have to enlighten them to also take care of tlb->unshared_tables? So far we didn't care, so hopefully we are fine. Of course, we could be setting tlb->freed_tables as well, but that might then unnecessarily flush too much, because the semantics of tlb->freed_tables are a bit fuzzy.

    This patch changes nothing in this regard.

(2) tlb_remove_table_sync_one() is not a NOP on architectures with CONFIG_MMU_GATHER_RCU_TABLE_FREE that actually don't need a sync.

    Take x86 as an example: in the common case (!pv, !X86_FEATURE_INVLPGB) we still issue IPIs during TLB flushes and don't actually need the second tlb_remove_table_sync_one().

    This optimization can be implemented on top of this, by checking e.g., in tlb_remove_table_sync_one() whether we really need IPIs. But as described in (1), it really must honor tlb->freed_tables then to send IPIs to all relevant CPUs.

Notes on TLB flushing changes:

(1) Flushing for non-shared PMD tables

    We're converting from flush_hugetlb_tlb_range() to tlb_remove_huge_tlb_entry(). Given that we properly initialize the MMU gather in tlb_gather_mmu_vma() to be hugetlb aware, similar to __unmap_hugepage_range(), that should be fine.

(2) Flushing for shared PMD tables

    We're converting from various things (flush_hugetlb_tlb_range(), tlb_flush_pmd_range(), flush_tlb_range()) to tlb_flush_pmd_range().

    tlb_flush_pmd_range() achieves the same that tlb_remove_huge_tlb_entry() would achieve in these scenarios. Note that tlb_remove_huge_tlb_entry() also calls __tlb_remove_tlb_entry(), however that is only implemented on powerpc, which does not support PMD table sharing.

    Similar to (1), tlb_gather_mmu_vma() should make sure that TLB flushing keeps on working as expected.

Further, note that the ptdesc_pmd_pts_dec() in huge_pmd_share() is not a concern, as we are holding the i_mmap_lock the whole time, preventing concurrent unsharing. That ptdesc_pmd_pts_dec() usage will be removed separately as a cleanup later.

There are plenty more cleanups to be had, but they have to wait until this is fixed.

[david@kernel.org: fix kerneldoc]
  Link: https://lkml.kernel.org/r/f223dd74-331c-412d-93fc-69e360a5006c@kernel.org
Link: https://lkml.kernel.org/r/20251223214037.580860-5-david@kernel.org
Fixes: 1013af4f585f ("mm/hugetlb: fix huge_pmd_unshare() vs GUP-fast race")
Signed-off-by: David Hildenbrand (Red Hat)
Reported-by: "Uschakow, Stanislav"
Closes: https://lore.kernel.org/all/4d3878531c76479d9f8ca9789dc6485d@amazon.de/
Tested-by: Laurence Oberman
Acked-by: Harry Yoo
Reviewed-by: Lorenzo Stoakes
Cc: Lance Yang
Cc: Liu Shixin
Cc: Oscar Salvador
Cc: Rik van Riel
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: David Hildenbrand (Arm)
Signed-off-by: Greg Kroah-Hartman
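
The effect of the two optimizations on the number of IPI broadcasts can be approximated with a counting model. This is a rough sketch under stated assumptions, not the mmu_gather implementation: each array entry is taken to be the number of users of one shared PMD table including the caller (so at least 2), a table is treated as fully unshared when the caller leaves and a single user remains, and both toy_* function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Behaviour described as the regression: one IPI broadcast for every
 * PMD table that gets unshared, regardless of who else still shares it.
 */
size_t toy_broadcasts_per_table(const unsigned int *users, size_t n)
{
	(void)users; /* the sharer counts do not matter in this scheme */
	return n;
}

/*
 * Batched behaviour: remember whether any table became exclusive while
 * walking the range, then send at most one broadcast for the whole batch.
 */
size_t toy_broadcasts_batched(const unsigned int *users, size_t n)
{
	bool fully_unshared = false;
	size_t i;

	for (i = 0; i < n; i++) {
		/* Two users including us: one remains, table turns exclusive. */
		if (users[i] == 2)
			fully_unshared = true;
	}
	return fully_unshared ? 1 : 0;
}
```

With four tables of which two become exclusive, the per-table scheme counts four broadcasts and the batched scheme counts one; when every table keeps at least two remaining users, the batched scheme counts none.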

mm/hugetlb: fix hugetlb_pmd_shared()

2026-01-30T09:32:15+00:00

commit ca1a47cd3f5f4c46ca188b1c9a27af87d1ab2216 upstream.

Patch series "mm/hugetlb: fixes for PMD table sharing (incl. using mmu_gather)", v3.

One functional fix, one performance regression fix, and two related comment fixes.

I cleaned up my prototype I recently shared [1] for the performance fix, deferring most of the cleanups I had in the prototype to a later point. While doing that I identified the other things.

The goal of this patch set is to be backported to stable trees "fairly" easily. At least patch #1 and #4.

Patch #1 fixes hugetlb_pmd_shared() not detecting any sharing.
Patch #2 + #3 are simple comment fixes that patch #4 interacts with.
Patch #4 is a fix for the reported performance regression due to excessive IPI broadcasts during fork()+exit().

The last patch is all about TLB flushes, IPIs and mmu_gather. Read: complicated.

There are plenty of cleanups in the future to be had + one reasonable optimization on x86. But that's all out of scope for this series.

Runtime tested, with a focus on fixing the performance regression using the original reproducer [2] on x86.

This patch (of 4):

We switched from (wrongly) using the page count to an independent shared count. Now, shared page tables have a refcount of 1 (excluding speculative references) and instead use ptdesc->pt_share_count to identify sharing.

We didn't convert hugetlb_pmd_shared(), so right now, we would never detect a shared PMD table as such, because sharing/unsharing no longer touches the refcount of a PMD table.

Page migration, like mbind() or migrate_pages(), would allow for migrating folios mapped into such shared PMD tables, even though the folios are not exclusive. In smaps we would account them as "private" although they are "shared", and we would be wrongly setting the PM_MMAP_EXCLUSIVE in the pagemap interface.

Fix it by properly using ptdesc_pmd_is_shared() in hugetlb_pmd_shared().

Link: https://lkml.kernel.org/r/20251223214037.580860-1-david@kernel.org
Link: https://lkml.kernel.org/r/20251223214037.580860-2-david@kernel.org
Link: https://lore.kernel.org/all/8cab934d-4a56-44aa-b641-bfd7e23bd673@kernel.org/ [1]
Link: https://lore.kernel.org/all/8cab934d-4a56-44aa-b641-bfd7e23bd673@kernel.org/ [2]
Fixes: 59d9094df3d7 ("mm: hugetlb: independent PMD page table shared count")
Signed-off-by: David Hildenbrand (Red Hat)
Reviewed-by: Rik van Riel
Reviewed-by: Lance Yang
Tested-by: Lance Yang
Reviewed-by: Harry Yoo
Tested-by: Laurence Oberman
Reviewed-by: Lorenzo Stoakes
Acked-by: Oscar Salvador
Cc: Liu Shixin
Cc: "Uschakow, Stanislav"
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Greg Kroah-Hartman
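
Why the refcount-based check stopped working can be shown with a small model. This is a sketch only: struct toy_ptdesc and the toy_* helpers are hypothetical, and the two predicates merely contrast a refcount test with a share-count test as the description above lays them out.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy page-table descriptor with the two counters discussed above. */
struct toy_ptdesc {
	int refcount;       /* stays at 1 for shared tables in this model */
	int pt_share_count; /* number of additional sharers */
};

/* Sharing only touches the dedicated share count, never the refcount. */
void toy_pmd_share(struct toy_ptdesc *pt)
{
	pt->pt_share_count++;
}

void toy_pmd_unshare(struct toy_ptdesc *pt)
{
	assert(pt->pt_share_count > 0);
	pt->pt_share_count--;
}

/* Stale check: looks at a counter that sharing no longer modifies. */
bool toy_shared_by_refcount(const struct toy_ptdesc *pt)
{
	return pt->refcount > 1;
}

/* Converted check: looks at the counter that sharing does modify. */
bool toy_shared_by_share_count(const struct toy_ptdesc *pt)
{
	return pt->pt_share_count > 0;
}
```

After one toy_pmd_share() call the refcount-based predicate still reports "not shared" while the share-count predicate reports "shared", which is the misdetection that leads to the migration, smaps and pagemap symptoms listed in the commit message.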

mm: hugetlb: directly pass order when allocate a hugetlb folio

2025-09-21T21:22:11+00:00

Pass the order instead of struct hstate to remove the huge_page_order() call from all hugetlb folio allocation paths. Also add order_is_gigantic() to check whether an order is gigantic.

Link: https://lkml.kernel.org/r/20250910133958.301467-4-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang
Acked-by: Oscar Salvador
Reviewed-by: Sidhartha Kumar
Reviewed-by: Jane Chu
Reviewed-by: Zi Yan
Cc: Brendan Jackman
Cc: David Hildenbrand
Cc: Johannes Weiner
Cc: Muchun Song
Cc: Vlastimil Babka
Signed-off-by: Andrew Morton
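
A check on the order alone might look like the sketch below. The threshold is an assumption made for illustration: "gigantic" is taken to mean an order above the largest order the buddy allocator can serve, modelled here as TOY_MAX_PAGE_ORDER = 10, and both TOY_MAX_PAGE_ORDER and toy_order_is_gigantic() are hypothetical names rather than the kernel's definitions.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed largest order the buddy allocator can hand out directly. */
#define TOY_MAX_PAGE_ORDER 10u

/*
 * Decide from the order alone whether an allocation is "gigantic",
 * so callers do not need a struct hstate just to ask this question.
 */
bool toy_order_is_gigantic(unsigned int order)
{
	return order > TOY_MAX_PAGE_ORDER;
}
```

Under these assumptions and 4 KiB base pages, a 2 MiB huge page (order 9) is not gigantic, while a 1 GiB huge page (order 18) is.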

mm: drop hugetlb_free_pgd_range()

2025-07-25T02:12:32+00:00

There are no longer any callers of hugetlb_free_pgd_range().

Link: https://lkml.kernel.org/r/20250716012611.10369-4-anthony.yznaga@oracle.com
Signed-off-by: Anthony Yznaga
Acked-by: Mike Rapoport (Microsoft)
Acked-by: Oscar Salvador
Cc: Alexander Gordeev
Cc: Alexandre Ghiti
Cc: Andreas Larsson
Cc: Anshuman Khandual
Cc: Arnd Bergmann
Cc: Christophe Leroy
Cc: David Hildenbrand
Cc: David S. Miller
Cc: Liam Howlett
Cc: Lorenzo Stoakes
Cc: Michal Hocko
Cc: Muchun Song
Cc: Ryan Roberts
Cc: Suren Baghdasaryan
Cc: Vlastimil Babka
Cc: Will Deacon
Signed-off-by: Andrew Morton

mm/hugetlb: remove prepare_hugepage_range()

2025-07-13T23:38:19+00:00

Only mips and loongarch implemented this API, and all it did was check against stack overflow for either len or addr. That is already done in each arch's arch_get_unmapped_area*() functions, even though the checks may not be 100% identical.

For example, on both architectures there will be a trivial difference in how the stack top is defined. The old code uses STACK_TOP, which may be slightly smaller than TASK_SIZE on either of them, but the hope is that this shouldn't be a problem.

This means the whole API is pretty much obsolete, at least now, so remove it completely.

Link: https://lkml.kernel.org/r/20250627160707.2124580-1-peterx@redhat.com
Signed-off-by: Peter Xu
Reviewed-by: Jason Gunthorpe
Reviewed-by: Oscar Salvador
Acked-by: David Hildenbrand
Reviewed-by: Liam R. Howlett
Reviewed-by: Anshuman Khandual
Cc: Huacai Chen
Cc: Thomas Bogendoerfer
Cc: Muchun Song
Cc: Jann Horn
Cc: Lorenzo Stoakes
Cc: Pedro Falcato
Cc: Vlastimil Babka
Cc: Zi Yan
Signed-off-by: Andrew Morton

mm/memfd: reserve hugetlb folios before allocation

2025-07-10T05:42:15+00:00

When we try to allocate a folio via alloc_hugetlb_folio_reserve(), we need to ensure that there is an active reservation associated with the allocation. Otherwise, our allocation request would fail if there are no active reservations made at that moment against any other allocations. This is because alloc_hugetlb_folio_reserve() checks h->resv_huge_pages before proceeding with the allocation.

Therefore, to address this issue, we just need to make a reservation (by calling hugetlb_reserve_pages()) before we try to allocate the folio. This will also ensure that proper region/subpool accounting is done associated with our allocation.

Link: https://lkml.kernel.org/r/20250618053415.1036185-3-vivek.kasireddy@intel.com
Signed-off-by: Vivek Kasireddy
Cc: Steve Sistare
Cc: Muchun Song
Cc: David Hildenbrand
Cc: Gerd Hoffmann
Cc: Oscar Salvador
Signed-off-by: Andrew Morton
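
The reserve-then-allocate ordering can be modelled with two counters. This is a simplified sketch, not the hugetlb allocator: struct toy_hstate and the toy_* functions are hypothetical, and the only behaviour taken from the description is that allocation from the reserve refuses to proceed when the reserved count is zero.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy pool state: free pages and how many of them are reserved. */
struct toy_hstate {
	long free_huge_pages;
	long resv_huge_pages;
};

/* Reserve n pages if enough unreserved free pages exist. */
bool toy_reserve_pages(struct toy_hstate *h, long n)
{
	if (n < 0 || h->free_huge_pages - h->resv_huge_pages < n)
		return false;
	h->resv_huge_pages += n;
	return true;
}

/* Allocate one page from the reserve; fails without a reservation. */
bool toy_alloc_folio_reserve(struct toy_hstate *h)
{
	if (h->resv_huge_pages == 0 || h->free_huge_pages == 0)
		return false;
	h->resv_huge_pages--;
	h->free_huge_pages--;
	return true;
}
```

In this model an allocation attempted on a pool with free pages but no reservation fails, while the same attempt succeeds once toy_reserve_pages() has been called first.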

mm/hugetlb: make hugetlb_reserve_pages() return nr of entries updated

2025-07-10T05:42:14+00:00

Patch series "mm/memfd: Reserve hugetlb folios before allocation", v4.

There are cases when we try to pin a folio but discover that it has not been faulted-in. So, we try to allocate it in memfd_alloc_folio() but the allocation request may not succeed if there are no active reservations in the system at that instant.

Therefore, making a reservation (by calling hugetlb_reserve_pages()) associated with the allocation will ensure that our request would not fail due to lack of reservations. This will also ensure that proper region/subpool accounting is done with our allocation.

This patch (of 3):

Currently, hugetlb_reserve_pages() returns a bool to indicate whether the reservation map update for the range [from, to] was successful or not. This is not sufficient for the case where the caller needs to determine how many entries were updated for the range.

Therefore, have hugetlb_reserve_pages() return the number of entries updated in the reservation map associated with the range [from, to]. Also, update the callers of hugetlb_reserve_pages() to handle the new return value.

Link: https://lkml.kernel.org/r/20250618053415.1036185-1-vivek.kasireddy@intel.com
Link: https://lkml.kernel.org/r/20250618053415.1036185-2-vivek.kasireddy@intel.com
Signed-off-by: Vivek Kasireddy
Cc: Steve Sistare
Cc: Muchun Song
Cc: David Hildenbrand
Cc: Gerd Hoffmann
Cc: Oscar Salvador
Signed-off-by: Andrew Morton
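
The difference between a boolean result and a count of updated entries can be sketched with a toy reservation map. Everything here is an assumption made for illustration: the map is a fixed array of flags, the range is treated as half-open, a negative return signals an invalid range, and struct toy_resv_map and toy_reserve_range() are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_MAP_SLOTS 16

/* Toy reservation map: one flag per huge-page index. */
struct toy_resv_map {
	bool reserved[TOY_MAP_SLOTS];
};

/*
 * Reserve the indices in [from, to) and return how many entries were
 * newly marked. Entries that were already reserved are not counted.
 * Returns -1 for a range that does not fit the map.
 */
long toy_reserve_range(struct toy_resv_map *map, long from, long to)
{
	long updated = 0;
	long i;

	if (from < 0 || to > TOY_MAP_SLOTS || from > to)
		return -1;

	for (i = from; i < to; i++) {
		if (!map->reserved[i]) {
			map->reserved[i] = true;
			updated++;
		}
	}
	return updated;
}
```

A caller that only saw "success" could not tell a fresh reservation of four entries from an overlapping request that added two, or from a repeat request that added none; the count makes those cases distinguishable.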

mm/hugetlb: unshare page tables during VMA split, not before

2025-06-06T05:02:24+00:00

Currently, __split_vma() triggers hugetlb page table unsharing through vm_ops->may_split(). This happens before the VMA lock and rmap locks are taken, which is too early: it allows racing VMA-locked page faults in our process and racing rmap walks from other processes to cause page tables to be shared again before we actually perform the split.

Fix it by explicitly calling into the hugetlb unshare logic from __split_vma() in the same place where THP splitting also happens. At that point, both the VMA and the rmap(s) are write-locked.

An annoying detail is that we can now call into the helper hugetlb_unshare_pmds() from two different locking contexts:

1. from hugetlb_split(), holding:
    - mmap lock (exclusively)
    - VMA lock
    - file rmap lock (exclusively)
2. hugetlb_unshare_all_pmds(), which I think is designed to be able to call us with only the mmap lock held (in shared mode), but currently only runs while holding mmap lock (exclusively) and VMA lock

Backporting note:
This commit fixes a racy protection that was introduced in commit b30c14cd6102 ("hugetlb: unshare some PMDs when splitting VMAs"); that commit claimed to fix an issue introduced in 5.13, but it should actually also go all the way back.

[jannh@google.com: v2]
  Link: https://lkml.kernel.org/r/20250528-hugetlb-fixes-splitrace-v2-1-1329349bad1a@google.com
Link: https://lkml.kernel.org/r/20250528-hugetlb-fixes-splitrace-v2-0-1329349bad1a@google.com
Link: https://lkml.kernel.org/r/20250527-hugetlb-fixes-splitrace-v1-1-f4136f5ec58a@google.com
Fixes: 39dde65c9940 ("[PATCH] shared page table for hugetlb page")
Signed-off-by: Jann Horn
Cc: Liam Howlett
Reviewed-by: Lorenzo Stoakes
Reviewed-by: Oscar Salvador
Cc: Lorenzo Stoakes
Cc: Vlastimil Babka
Cc: [b30c14cd6102: hugetlb: unshare some PMDs when splitting VMAs]
Cc:
Signed-off-by: Andrew Morton