<feed xmlns='http://www.w3.org/2005/Atom'>
<title>kernel/linux.git/mm, branch v5.10.257</title>
<subtitle>Linux kernel stable tree (mirror)</subtitle>
<id>https://git.radix-linux.su/kernel/linux.git/atom?h=v5.10.257</id>
<link rel='self' href='https://git.radix-linux.su/kernel/linux.git/atom?h=v5.10.257'/>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/'/>
<updated>2026-04-18T08:31:14+00:00</updated>
<entry>
<title>mm/hugetlb: fix excessive IPI broadcasts when unsharing PMD tables using mmu_gather</title>
<updated>2026-04-18T08:31:14+00:00</updated>
<author>
<name>David Hildenbrand (Red Hat)</name>
<email>david@kernel.org</email>
</author>
<published>2026-02-18T13:05:52+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=6495204f8219a57859669d618b25d06176d5a872'/>
<id>urn:sha1:6495204f8219a57859669d618b25d06176d5a872</id>
<content type='text'>
commit 8ce720d5bd91e9dc16db3604aa4b1bf76770a9a1 upstream.

As reported, ever since commit 1013af4f585f ("mm/hugetlb: fix
huge_pmd_unshare() vs GUP-fast race") we can end up in some situations
where we perform so many IPI broadcasts when unsharing hugetlb PMD page
tables that it severely regresses some workloads.

In particular, when we fork()+exit(), or when we munmap() a large
area backed by many shared PMD tables, we perform one IPI broadcast per
unshared PMD table.

There are two optimizations to be had:

(1) When we process (unshare) multiple such PMD tables, such as during
    exit(), it is sufficient to send a single IPI broadcast (as long as
    we respect locking rules) instead of one per PMD table.

    Locking prevents that any of these PMD tables could get reused before
    we drop the lock.

(2) When we are not the last sharer (&gt; 2 users including us), there is
    no need to send the IPI broadcast. The shared PMD tables cannot
    become exclusive (fully unshared) before an IPI will be broadcasted
    by the last sharer.

    Concurrent GUP-fast could walk into a PMD table just before we
    unshared it. It could then succeed in grabbing a page from the
    shared page table even after munmap() etc succeeded (and supressed
    an IPI). But there is not difference compared to GUP-fast just
    sleeping for a while after grabbing the page and re-enabling IRQs.

    Most importantly, GUP-fast will never walk into page tables that are
    no-longer shared, because the last sharer will issue an IPI
    broadcast.

    (if ever required, checking whether the PUD changed in GUP-fast
     after grabbing the page like we do in the PTE case could handle
     this)

So let's rework PMD sharing TLB flushing + IPI sync to use the mmu_gather
infrastructure so we can implement these optimizations and demystify the
code at least a bit. Extend the mmu_gather infrastructure to be able to
deal with our special hugetlb PMD table sharing implementation.

To make initialization of the mmu_gather easier when working on a single
VMA (in particular, when dealing with hugetlb), provide
tlb_gather_mmu_vma().

We'll consolidate the handling for (full) unsharing of PMD tables in
tlb_unshare_pmd_ptdesc() and tlb_flush_unshared_tables(), and track
in "struct mmu_gather" whether we had (full) unsharing of PMD tables.

Because locking is very special (concurrent unsharing+reuse must be
prevented), we disallow deferring flushing to tlb_finish_mmu() and instead
require an explicit earlier call to tlb_flush_unshared_tables().

&gt;From hugetlb code, we call huge_pmd_unshare_flush() where we make sure
that the expected lock protecting us from concurrent unsharing+reuse is
still held.

Check with a VM_WARN_ON_ONCE() in tlb_finish_mmu() that
tlb_flush_unshared_tables() was properly called earlier.

Document it all properly.

Notes about tlb_remove_table_sync_one() interaction with unsharing:

There are two fairly tricky things:

(1) tlb_remove_table_sync_one() is a NOP on architectures without
    CONFIG_MMU_GATHER_RCU_TABLE_FREE.

    Here, the assumption is that the previous TLB flush would send an
    IPI to all relevant CPUs. Careful: some architectures like x86 only
    send IPIs to all relevant CPUs when tlb-&gt;freed_tables is set.

    The relevant architectures should be selecting
    MMU_GATHER_RCU_TABLE_FREE, but x86 might not do that in stable
    kernels and it might have been problematic before this patch.

    Also, the arch flushing behavior (independent of IPIs) is different
    when tlb-&gt;freed_tables is set. Do we have to enlighten them to also
    take care of tlb-&gt;unshared_tables? So far we didn't care, so
    hopefully we are fine. Of course, we could be setting
    tlb-&gt;freed_tables as well, but that might then unnecessarily flush
    too much, because the semantics of tlb-&gt;freed_tables are a bit
    fuzzy.

    This patch changes nothing in this regard.

(2) tlb_remove_table_sync_one() is not a NOP on architectures with
    CONFIG_MMU_GATHER_RCU_TABLE_FREE that actually don't need a sync.

    Take x86 as an example: in the common case (!pv, !X86_FEATURE_INVLPGB)
    we still issue IPIs during TLB flushes and don't actually need the
    second tlb_remove_table_sync_one().

    This optimized can be implemented on top of this, by checking e.g., in
    tlb_remove_table_sync_one() whether we really need IPIs. But as
    described in (1), it really must honor tlb-&gt;freed_tables then to
    send IPIs to all relevant CPUs.

Notes on TLB flushing changes:

(1) Flushing for non-shared PMD tables

    We're converting from flush_hugetlb_tlb_range() to
    tlb_remove_huge_tlb_entry(). Given that we properly initialize the
    MMU gather in tlb_gather_mmu_vma() to be hugetlb aware, similar to
    __unmap_hugepage_range(), that should be fine.

(2) Flushing for shared PMD tables

    We're converting from various things (flush_hugetlb_tlb_range(),
    tlb_flush_pmd_range(), flush_tlb_range()) to tlb_flush_pmd_range().

    tlb_flush_pmd_range() achieves the same that
    tlb_remove_huge_tlb_entry() would achieve in these scenarios.
    Note that tlb_remove_huge_tlb_entry() also calls
    __tlb_remove_tlb_entry(), however that is only implemented on
    powerpc, which does not support PMD table sharing.

    Similar to (1), tlb_gather_mmu_vma() should make sure that TLB
    flushing keeps on working as expected.

Further, note that the ptdesc_pmd_pts_dec() in huge_pmd_share() is not a
concern, as we are holding the i_mmap_lock the whole time, preventing
concurrent unsharing. That ptdesc_pmd_pts_dec() usage will be removed
separately as a cleanup later.

There are plenty more cleanups to be had, but they have to wait until
this is fixed.

[david@kernel.org: fix kerneldoc]
  Link: https://lkml.kernel.org/r/f223dd74-331c-412d-93fc-69e360a5006c@kernel.org
Link: https://lkml.kernel.org/r/20251223214037.580860-5-david@kernel.org
Fixes: 1013af4f585f ("mm/hugetlb: fix huge_pmd_unshare() vs GUP-fast race")
Signed-off-by: David Hildenbrand (Red Hat) &lt;david@kernel.org&gt;
Reported-by: "Uschakow, Stanislav" &lt;suschako@amazon.de&gt;
Closes: https://lore.kernel.org/all/4d3878531c76479d9f8ca9789dc6485d@amazon.de/
Tested-by: Laurence Oberman &lt;loberman@redhat.com&gt;
Acked-by: Harry Yoo &lt;harry.yoo@oracle.com&gt;
Reviewed-by: Lorenzo Stoakes &lt;lorenzo.stoakes@oracle.com&gt;
Cc: Lance Yang &lt;lance.yang@linux.dev&gt;
Cc: Liu Shixin &lt;liushixin2@huawei.com&gt;
Cc: Oscar Salvador &lt;osalvador@suse.de&gt;
Cc: Rik van Riel &lt;riel@surriel.com&gt;
Cc: &lt;stable@vger.kernel.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
(cherry picked from commit 8ce720d5bd91e9dc16db3604aa4b1bf76770a9a1)
[ David: We don't have ptdesc and the wrappers, so work directly on
  page-&gt;pt_share_count and pass "struct page" instead of "struct ptdesc".
  CONFIG_HUGETLB_PMD_PAGE_TABLE_SHARING is still called
  CONFIG_ARCH_WANT_HUGE_PMD_SHARE and is set even without
  CONFIG_HUGETLB_PAGE. We don't have 550a7d60bd5e ("mm, hugepages: add
  mremap() support for hugepage backed vma"), so move_hugetlb_page_tables()
  does not exist. We don't have 40549ba8f8e0 ("hugetlb: use new vma_lock
  for pmd sharing synchronization") and a98a2f0c8ce1 ("mm/rmap: split
  migration into its own so changes in mm/rmap.c looks quite different. We
  don't have 4ddb4d91b82f ("hugetlb: do not update address
  in huge_pmd_unshare"), so huge_pmd_unshare() still gets a pointer to
  an address. tlb_gather_mmu() + tlb_finish_mmu() still consume ranges, so
  also teach tlb_gather_mmu_vma() to forward ranges.  Some smaller
  contextual stuff, in particular, around tlb_gather_mmu_full(). ]
Signed-off-by: David Hildenbrand (Arm) &lt;david@kernel.org&gt;
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</content>
</entry>
<entry>
<title>mm/rmap: fix two comments related to huge_pmd_unshare()</title>
<updated>2026-04-18T08:31:14+00:00</updated>
<author>
<name>David Hildenbrand (Red Hat)</name>
<email>david@kernel.org</email>
</author>
<published>2026-02-18T13:05:51+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=2723af9e1e283cee312946c3d6cb20d2c731c139'/>
<id>urn:sha1:2723af9e1e283cee312946c3d6cb20d2c731c139</id>
<content type='text'>
commit a8682d500f691b6dfaa16ae1502d990aeb86e8be upstream.

PMD page table unsharing no longer touches the refcount of a PMD page
table.  Also, it is not about dropping the refcount of a "PMD page" but
the "PMD page table".

Let's just simplify by saying that the PMD page table was unmapped,
consequently also unmapping the folio that was mapped into this page.

This code should be deduplicated in the future.

Link: https://lkml.kernel.org/r/20251223214037.580860-4-david@kernel.org
Fixes: 59d9094df3d7 ("mm: hugetlb: independent PMD page table shared count")
Signed-off-by: David Hildenbrand (Red Hat) &lt;david@kernel.org&gt;
Reviewed-by: Rik van Riel &lt;riel@surriel.com&gt;
Tested-by: Laurence Oberman &lt;loberman@redhat.com&gt;
Reviewed-by: Lorenzo Stoakes &lt;lorenzo.stoakes@oracle.com&gt;
Acked-by: Oscar Salvador &lt;osalvador@suse.de&gt;
Cc: Liu Shixin &lt;liushixin2@huawei.com&gt;
Cc: Harry Yoo &lt;harry.yoo@oracle.com&gt;
Cc: Lance Yang &lt;lance.yang@linux.dev&gt;
Cc: "Uschakow, Stanislav" &lt;suschako@amazon.de&gt;
Cc: &lt;stable@vger.kernel.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
(cherry picked from commit a8682d500f691b6dfaa16ae1502d990aeb86e8be)
[ David: We don't have 40549ba8f8e0 ("hugetlb: use new vma_lock
  for pmd sharing synchronization") and a98a2f0c8ce1 ("mm/rmap: split
  migration into its own so changes in mm/rmap.c looks quite different. ]
Signed-off-by: David Hildenbrand (Arm) &lt;david@kernel.org&gt;
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</content>
</entry>
<entry>
<title>mm/hugetlb: fix two comments related to huge_pmd_unshare()</title>
<updated>2026-04-18T08:31:14+00:00</updated>
<author>
<name>David Hildenbrand (Red Hat)</name>
<email>david@kernel.org</email>
</author>
<published>2026-02-18T13:05:50+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=44b18bac2b3063784ebb871a922803b11d9c339d'/>
<id>urn:sha1:44b18bac2b3063784ebb871a922803b11d9c339d</id>
<content type='text'>
commit 3937027caecb4f8251e82dd857ba1d749bb5a428 upstream.

Ever since we stopped using the page count to detect shared PMD page
tables, these comments are outdated.

The only reason we have to flush the TLB early is because once we drop the
i_mmap_rwsem, the previously shared page table could get freed (to then
get reallocated and used for other purpose).  So we really have to flush
the TLB before that could happen.

So let's simplify the comments a bit.

The "If we unshared PMDs, the TLB flush was not recorded in mmu_gather."
part introduced as in commit a4a118f2eead ("hugetlbfs: flush TLBs
correctly after huge_pmd_unshare") was confusing: sure it is recorded in
the mmu_gather, otherwise tlb_flush_mmu_tlbonly() wouldn't do anything.
So let's drop that comment while at it as well.

We'll centralize these comments in a single helper as we rework the code
next.

Link: https://lkml.kernel.org/r/20251223214037.580860-3-david@kernel.org
Fixes: 59d9094df3d7 ("mm: hugetlb: independent PMD page table shared count")
Signed-off-by: David Hildenbrand (Red Hat) &lt;david@kernel.org&gt;
Reviewed-by: Rik van Riel &lt;riel@surriel.com&gt;
Tested-by: Laurence Oberman &lt;loberman@redhat.com&gt;
Reviewed-by: Lorenzo Stoakes &lt;lorenzo.stoakes@oracle.com&gt;
Acked-by: Oscar Salvador &lt;osalvador@suse.de&gt;
Reviewed-by: Harry Yoo &lt;harry.yoo@oracle.com&gt;
Cc: Liu Shixin &lt;liushixin2@huawei.com&gt;
Cc: Lance Yang &lt;lance.yang@linux.dev&gt;
Cc: "Uschakow, Stanislav" &lt;suschako@amazon.de&gt;
Cc: &lt;stable@vger.kernel.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
(cherry picked from commit 3937027caecb4f8251e82dd857ba1d749bb5a428)
Signed-off-by: David Hildenbrand (Arm) &lt;david@kernel.org&gt;
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</content>
</entry>
<entry>
<title>mm/hugetlb: fix copy_hugetlb_page_range() to use -&gt;pt_share_count</title>
<updated>2026-04-18T08:31:14+00:00</updated>
<author>
<name>Jane Chu</name>
<email>jane.chu@oracle.com</email>
</author>
<published>2026-02-18T13:05:48+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=ee67c4ac1aed403688dd359918099a745b60fb69'/>
<id>urn:sha1:ee67c4ac1aed403688dd359918099a745b60fb69</id>
<content type='text'>
commit 14967a9c7d247841b0312c48dcf8cd29e55a4cc8 upstream.

commit 59d9094df3d79 ("mm: hugetlb: independent PMD page table shared
count") introduced -&gt;pt_share_count dedicated to hugetlb PMD share count
tracking, but omitted fixing copy_hugetlb_page_range(), leaving the
function relying on page_count() for tracking that no longer works.

When lazy page table copy for hugetlb is disabled, that is, revert commit
bcd51a3c679d ("hugetlb: lazy page table copies in fork()") fork()'ing with
hugetlb PMD sharing quickly lockup -

[  239.446559] watchdog: BUG: soft lockup - CPU#75 stuck for 27s!
[  239.446611] RIP: 0010:native_queued_spin_lock_slowpath+0x7e/0x2e0
[  239.446631] Call Trace:
[  239.446633]  &lt;TASK&gt;
[  239.446636]  _raw_spin_lock+0x3f/0x60
[  239.446639]  copy_hugetlb_page_range+0x258/0xb50
[  239.446645]  copy_page_range+0x22b/0x2c0
[  239.446651]  dup_mmap+0x3e2/0x770
[  239.446654]  dup_mm.constprop.0+0x5e/0x230
[  239.446657]  copy_process+0xd17/0x1760
[  239.446660]  kernel_clone+0xc0/0x3e0
[  239.446661]  __do_sys_clone+0x65/0xa0
[  239.446664]  do_syscall_64+0x82/0x930
[  239.446668]  ? count_memcg_events+0xd2/0x190
[  239.446671]  ? syscall_trace_enter+0x14e/0x1f0
[  239.446676]  ? syscall_exit_work+0x118/0x150
[  239.446677]  ? arch_exit_to_user_mode_prepare.constprop.0+0x9/0xb0
[  239.446681]  ? clear_bhb_loop+0x30/0x80
[  239.446684]  ? clear_bhb_loop+0x30/0x80
[  239.446686]  entry_SYSCALL_64_after_hwframe+0x76/0x7e

There are two options to resolve the potential latent issue:
  1. warn against PMD sharing in copy_hugetlb_page_range(),
  2. fix it.
This patch opts for the second option.
While at it, simplify the comment, the details are not actually relevant
anymore.

Link: https://lkml.kernel.org/r/20250916004520.1604530-1-jane.chu@oracle.com
Fixes: 59d9094df3d7 ("mm: hugetlb: independent PMD page table shared count")
Signed-off-by: Jane Chu &lt;jane.chu@oracle.com&gt;
Reviewed-by: Harry Yoo &lt;harry.yoo@oracle.com&gt;
Acked-by: Oscar Salvador &lt;osalvador@suse.de&gt;
Acked-by: David Hildenbrand &lt;david@redhat.com&gt;
Cc: Jann Horn &lt;jannh@google.com&gt;
Cc: Liu Shixin &lt;liushixin2@huawei.com&gt;
Cc: Muchun Song &lt;muchun.song@linux.dev&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
(cherry picked from commit 14967a9c7d247841b0312c48dcf8cd29e55a4cc8)
[ David: We don't have ptdesc and the wrappers, so work directly on the
  page-&gt;pt_share_count. CONFIG_HUGETLB_PMD_PAGE_TABLE_SHARING is still
  called CONFIG_ARCH_WANT_HUGE_PMD_SHARE. ]
Signed-off-by: David Hildenbrand (Arm) &lt;david@kernel.org&gt;
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</content>
</entry>
<entry>
<title>mm/hugetlb: make detecting shared pte more reliable</title>
<updated>2026-04-18T08:31:14+00:00</updated>
<author>
<name>Miaohe Lin</name>
<email>linmiaohe@huawei.com</email>
</author>
<published>2026-02-18T13:05:47+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=d08af325b2860112593fabb55a486c4796fa85d0'/>
<id>urn:sha1:d08af325b2860112593fabb55a486c4796fa85d0</id>
<content type='text'>
commit 3aa4ed8040e1535d95c03cef8b52cf11bf0d8546 upstream.

If the pagetables are shared, we shouldn't copy or take references.  Since
src could have unshared and dst shares with another vma, huge_pte_none()
is thus used to determine whether dst_pte is shared.  But this check isn't
reliable.  A shared pte could have pte none in pagetable in fact.  The
page count of ptep page should be checked here in order to reliably
determine whether pte is shared.

[lukas.bulwahn@gmail.com: remove unused local variable dst_entry in copy_hugetlb_page_range()]
  Link: https://lkml.kernel.org/r/20220822082525.26071-1-lukas.bulwahn@gmail.com
Link: https://lkml.kernel.org/r/20220816130553.31406-7-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin &lt;linmiaohe@huawei.com&gt;
Signed-off-by: Lukas Bulwahn &lt;lukas.bulwahn@gmail.com&gt;
Reviewed-by: Mike Kravetz &lt;mike.kravetz@oracle.com&gt;
Cc: Muchun Song &lt;songmuchun@bytedance.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
(cherry picked from commit 3aa4ed8040e1535d95c03cef8b52cf11bf0d8546)
[ David: We don't have 4eae4efa2c29 ("hugetlb: do early cow when page
  pinned on src mm", so there are some contextual conflicts. ]
Signed-off-by: David Hildenbrand (Arm) &lt;david@kernel.org&gt;
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</content>
</entry>
<entry>
<title>mm/hugetlb: fix skipping of unsharing of pmd page tables</title>
<updated>2026-04-18T08:31:13+00:00</updated>
<author>
<name>David Hildenbrand (Arm)</name>
<email>david@kernel.org</email>
</author>
<published>2026-02-18T13:05:46+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=47d2253032a13d844de1e8c1ff3a0f3db6dfd8ee'/>
<id>urn:sha1:47d2253032a13d844de1e8c1ff3a0f3db6dfd8ee</id>
<content type='text'>
In the 5.10 backport of commit b30c14cd6102 ("hugetlb: unshare some PMDs
when splitting VMAs") we seemed to have missed that huge_pmd_unshare()
still adjusts the address itself.

For this reason, commit 6dfeaff93be1 ("hugetlb/userfaultfd: unshare all
pmds for hugetlbfs when register wp") explicitly handled this case by
passing a temporary variable instead.

Fix it in 5.10 by doing the same thing.

Fixes: f1082f5f3d02 ("hugetlb: unshare some PMDs when splitting VMAs")
Signed-off-by: David Hildenbrand (Arm) &lt;david@kernel.org&gt;
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</content>
</entry>
<entry>
<title>mm, page_alloc, thp: prevent reclaim for __GFP_THISNODE THP allocations</title>
<updated>2026-03-04T12:20:17+00:00</updated>
<author>
<name>Vlastimil Babka</name>
<email>vbabka@suse.cz</email>
</author>
<published>2025-12-19T16:31:57+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=91e76e324c009609569d50052d45b5182727b13f'/>
<id>urn:sha1:91e76e324c009609569d50052d45b5182727b13f</id>
<content type='text'>
[ Upstream commit 9c9828d3ead69416d731b1238802af31760c823e ]

Since commit cc638f329ef6 ("mm, thp: tweak reclaim/compaction effort of
local-only and all-node allocations"), THP page fault allocations have
settled on the following scheme (from the commit log):

1. local node only THP allocation with no reclaim, just compaction.
2. for madvised VMA's or when synchronous compaction is enabled always - THP
   allocation from any node with effort determined by global defrag setting
   and VMA madvise
3. fallback to base pages on any node

Recent customer reports however revealed we have a gap in step 1 above.
What we have seen is excessive reclaim due to THP page faults on a NUMA
node that's close to its high watermark, while other nodes have plenty of
free memory.

The problem with step 1 is that it promises no reclaim after the
compaction attempt, however reclaim is only avoided for certain compaction
outcomes (deferred, or skipped due to insufficient free base pages), and
not e.g.  when compaction is actually performed but fails (we did see
compact_fail vmstat counter increasing).

THP page faults can therefore exhibit a zone_reclaim_mode-like behavior,
which is not the intention.

Thus add a check for __GFP_THISNODE that corresponds to this exact
situation and prevents continuing with reclaim/compaction once the initial
compaction attempt isn't successful in allocating the page.

Note that commit cc638f329ef6 has not introduced this over-reclaim
possibility; it appears to exist in some form since commit 2f0799a0ffc0
("mm, thp: restore node-local hugepage allocations").  Followup commits
b39d0ee2632d ("mm, page_alloc: avoid expensive reclaim when compaction may
not succeed") and cc638f329ef6 have moved in the right direction, but left
the abovementioned gap.

Link: https://lkml.kernel.org/r/20251219-costly-noretry-thisnode-fix-v1-1-e1085a4a0c34@suse.cz
Fixes: 2f0799a0ffc0 ("mm, thp: restore node-local hugepage allocations")
Signed-off-by: Vlastimil Babka &lt;vbabka@suse.cz&gt;
Acked-by: Michal Hocko &lt;mhocko@suse.com&gt;
Acked-by: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Acked-by: Pedro Falcato &lt;pfalcato@suse.de&gt;
Acked-by: Zi Yan &lt;ziy@nvidia.com&gt;
Cc: Brendan Jackman &lt;jackmanb@google.com&gt;
Cc: "David Hildenbrand (Red Hat)" &lt;david@kernel.org&gt;
Cc: David Rientjes &lt;rientjes@google.com&gt;
Cc: Joshua Hahn &lt;joshua.hahnjy@gmail.com&gt;
Cc: Liam Howlett &lt;liam.howlett@oracle.com&gt;
Cc: Lorenzo Stoakes &lt;lorenzo.stoakes@oracle.com&gt;
Cc: Mike Rapoport &lt;rppt@kernel.org&gt;
Cc: Suren Baghdasaryan &lt;surenb@google.com&gt;
Cc: &lt;stable@vger.kernel.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</content>
</entry>
<entry>
<title>ksm: use range-walk function to jump over holes in scan_get_next_rmap_item</title>
<updated>2026-02-06T15:40:13+00:00</updated>
<author>
<name>Pedro Demarchi Gomes</name>
<email>pedrodemargomes@gmail.com</email>
</author>
<published>2026-01-17T23:38:01+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=220cb3e425e17587f560335924cba9f16a842c64'/>
<id>urn:sha1:220cb3e425e17587f560335924cba9f16a842c64</id>
<content type='text'>
[ Upstream commit f5548c318d6520d4fa3c5ed6003eeb710763cbc5 ]

Currently, scan_get_next_rmap_item() walks every page address in a VMA to
locate mergeable pages.  This becomes highly inefficient when scanning
large virtual memory areas that contain mostly unmapped regions, causing
ksmd to use large amount of cpu without deduplicating much pages.

This patch replaces the per-address lookup with a range walk using
walk_page_range().  The range walker allows KSM to skip over entire
unmapped holes in a VMA, avoiding unnecessary lookups.  This problem was
previously discussed in [1].

Consider the following test program which creates a 32 TiB mapping in the
virtual address space but only populates a single page:

/* 32 TiB */
const size_t size = 32ul * 1024 * 1024 * 1024 * 1024;

int main() {
        char *area = mmap(NULL, size, PROT_READ | PROT_WRITE,
                          MAP_NORESERVE | MAP_PRIVATE | MAP_ANON, -1, 0);

        if (area == MAP_FAILED) {
                perror("mmap() failed\n");
                return -1;
        }

        /* Populate a single page such that we get an anon_vma. */
        *area = 0;

        /* Enable KSM. */
        madvise(area, size, MADV_MERGEABLE);
        pause();
        return 0;
}

$ ./ksm-sparse  &amp;
$ echo 1 &gt; /sys/kernel/mm/ksm/run

Without this patch ksmd uses 100% of the cpu for a long time (more then 1
hour in my test machine) scanning all the 32 TiB virtual address space
that contain only one mapped page.  This makes ksmd essentially deadlocked
not able to deduplicate anything of value.  With this patch ksmd walks
only the one mapped page and skips the rest of the 32 TiB virtual address
space, making the scan fast using little cpu.

Link: https://lkml.kernel.org/r/20251023035841.41406-1-pedrodemargomes@gmail.com
Link: https://lkml.kernel.org/r/20251022153059.22763-1-pedrodemargomes@gmail.com
Link: https://lore.kernel.org/linux-mm/423de7a3-1c62-4e72-8e79-19a6413e420c@redhat.com/ [1]
Fixes: 31dbd01f3143 ("ksm: Kernel SamePage Merging")
Signed-off-by: Pedro Demarchi Gomes &lt;pedrodemargomes@gmail.com&gt;
Co-developed-by: David Hildenbrand &lt;david@redhat.com&gt;
Signed-off-by: David Hildenbrand &lt;david@redhat.com&gt;
Reported-by: craftfever &lt;craftfever@airmail.cc&gt;
Closes: https://lkml.kernel.org/r/020cf8de6e773bb78ba7614ef250129f11a63781@murena.io
Suggested-by: David Hildenbrand &lt;david@redhat.com&gt;
Acked-by: David Hildenbrand &lt;david@redhat.com&gt;
Cc: Chengming Zhou &lt;chengming.zhou@linux.dev&gt;
Cc: xu xin &lt;xu.xin16@zte.com.cn&gt;
Cc: &lt;stable@vger.kernel.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
[ change folio to page, replace pmdp_get_lockless with pmd_read_atomic and pmdp_get with
 READ_ONCE(*pmdp) ]
Signed-off-by: Pedro Demarchi Gomes &lt;pedrodemargomes@gmail.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</content>
</entry>
<entry>
<title>mm/pagewalk: add walk_page_range_vma()</title>
<updated>2026-02-06T15:40:13+00:00</updated>
<author>
<name>David Hildenbrand</name>
<email>david@redhat.com</email>
</author>
<published>2026-01-17T23:38:00+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=c9316144e09475d1bc3b1573d9c2683a7395b532'/>
<id>urn:sha1:c9316144e09475d1bc3b1573d9c2683a7395b532</id>
<content type='text'>
[ Upstream commit e07cda5f232fac4de0925d8a4c92e51e41fa2f6e ]

Let's add walk_page_range_vma(), which is similar to walk_page_vma(),
however, is only interested in a subset of the VMA range.

To be used in KSM code to stop using follow_page() next.

Link: https://lkml.kernel.org/r/20221021101141.84170-8-david@redhat.com
Signed-off-by: David Hildenbrand &lt;david@redhat.com&gt;
Cc: Andrea Arcangeli &lt;aarcange@redhat.com&gt;
Cc: Hugh Dickins &lt;hughd@google.com&gt;
Cc: Jason Gunthorpe &lt;jgg@nvidia.com&gt;
Cc: John Hubbard &lt;jhubbard@nvidia.com&gt;
Cc: Matthew Wilcox (Oracle) &lt;willy@infradead.org&gt;
Cc: Peter Xu &lt;peterx@redhat.com&gt;
Cc: Shuah Khan &lt;shuah@kernel.org&gt;
Cc: Vlastimil Babka &lt;vbabka@suse.cz&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Stable-dep-of: f5548c318d6 ("ksm: use range-walk function to jump over holes in scan_get_next_rmap_item")
Signed-off-by: Pedro Demarchi Gomes &lt;pedrodemargomes@gmail.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</content>
</entry>
<entry>
<title>migrate: correct lock ordering for hugetlb file folios</title>
<updated>2026-02-06T15:40:09+00:00</updated>
<author>
<name>Matthew Wilcox (Oracle)</name>
<email>willy@infradead.org</email>
</author>
<published>2026-01-09T04:13:42+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=e7396d23f9d5739f56cf9ab430c3a169f5508394'/>
<id>urn:sha1:e7396d23f9d5739f56cf9ab430c3a169f5508394</id>
<content type='text'>
commit b7880cb166ab62c2409046b2347261abf701530e upstream.

Syzbot has found a deadlock (analyzed by Lance Yang):

1) Task (5749): Holds folio_lock, then tries to acquire i_mmap_rwsem(read lock).
2) Task (5754): Holds i_mmap_rwsem(write lock), then tries to acquire
folio_lock.

migrate_pages()
  -&gt; migrate_hugetlbs()
    -&gt; unmap_and_move_huge_page()     &lt;- Takes folio_lock!
      -&gt; remove_migration_ptes()
        -&gt; __rmap_walk_file()
          -&gt; i_mmap_lock_read()       &lt;- Waits for i_mmap_rwsem(read lock)!

hugetlbfs_fallocate()
  -&gt; hugetlbfs_punch_hole()           &lt;- Takes i_mmap_rwsem(write lock)!
    -&gt; hugetlbfs_zero_partial_page()
     -&gt; filemap_lock_hugetlb_folio()
      -&gt; filemap_lock_folio()
        -&gt; __filemap_get_folio        &lt;- Waits for folio_lock!

The migration path is the one taking locks in the wrong order according to
the documentation at the top of mm/rmap.c.  So expand the scope of the
existing i_mmap_lock to cover the calls to remove_migration_ptes() too.

This is (mostly) how it used to be after commit c0d0381ade79.  That was
removed by 336bf30eb765 for both file &amp; anon hugetlb pages when it should
only have been removed for anon hugetlb pages.

Link: https://lkml.kernel.org/r/20260109041345.3863089-2-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) &lt;willy@infradead.org&gt;
Fixes: 336bf30eb765 ("hugetlbfs: fix anon huge page migration race")
Reported-by: syzbot+2d9c96466c978346b55f@syzkaller.appspotmail.com
Link: https://lore.kernel.org/all/68e9715a.050a0220.1186a4.000d.GAE@google.com
Debugged-by: Lance Yang &lt;lance.yang@linux.dev&gt;
Acked-by: David Hildenbrand (Red Hat) &lt;david@kernel.org&gt;
Acked-by: Zi Yan &lt;ziy@nvidia.com&gt;
Cc: Alistair Popple &lt;apopple@nvidia.com&gt;
Cc: Byungchul Park &lt;byungchul@sk.com&gt;
Cc: Gregory Price &lt;gourry@gourry.net&gt;
Cc: Jann Horn &lt;jannh@google.com&gt;
Cc: Joshua Hahn &lt;joshua.hahnjy@gmail.com&gt;
Cc: Liam Howlett &lt;liam.howlett@oracle.com&gt;
Cc: Lorenzo Stoakes &lt;lorenzo.stoakes@oracle.com&gt;
Cc: Matthew Brost &lt;matthew.brost@intel.com&gt;
Cc: Rakie Kim &lt;rakie.kim@sk.com&gt;
Cc: Rik van Riel &lt;riel@surriel.com&gt;
Cc: Vlastimil Babka &lt;vbabka@suse.cz&gt;
Cc: Ying Huang &lt;ying.huang@linux.alibaba.com&gt;
Cc: &lt;stable@vger.kernel.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</content>
</entry>
</feed>
