<feed xmlns='http://www.w3.org/2005/Atom'>
<title>kernel/linux.git/mm/userfaultfd.c, branch v6.18.21</title>
<subtitle>Linux kernel stable tree (mirror)</subtitle>
<id>https://git.radix-linux.su/kernel/linux.git/atom?h=v6.18.21</id>
<link rel='self' href='https://git.radix-linux.su/kernel/linux.git/atom?h=v6.18.21'/>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/'/>
<updated>2026-01-30T09:32:28+00:00</updated>
<entry>
<title>mm: fix some typos in mm module</title>
<updated>2026-01-30T09:32:28+00:00</updated>
<author>
<name>jianyun.gao</name>
<email>jianyungao89@gmail.com</email>
</author>
<published>2026-01-26T19:12:20+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=918ba220debc4705e0b2ee3518c15c268c39b84d'/>
<id>urn:sha1:918ba220debc4705e0b2ee3518c15c268c39b84d</id>
<content type='text'>
[ Upstream commit b6c46600bfb28b4be4e9cff7bad4f2cf357e0fb7 ]

Below are some typos in the code comments:

  intevals ==&gt; intervals
  addesses ==&gt; addresses
  unavaliable ==&gt; unavailable
  facor ==&gt; factor
  droping ==&gt; dropping
  exlusive ==&gt; exclusive
  decription ==&gt; description
  confict ==&gt; conflict
  desriptions ==&gt; descriptions
  otherwize ==&gt; otherwise
  vlaue ==&gt; value
  cheching ==&gt; checking
  exisitng ==&gt; existing
  modifed ==&gt; modified
  differenciate ==&gt; differentiate
  refernece ==&gt; reference
  permissons ==&gt; permissions
  indepdenent ==&gt; independent
  spliting ==&gt; splitting

Just fix it.

Link: https://lkml.kernel.org/r/20250929002608.1633825-1-jianyungao89@gmail.com
Signed-off-by: jianyun.gao &lt;jianyungao89@gmail.com&gt;
Reviewed-by: SeongJae Park &lt;sj@kernel.org&gt;
Reviewed-by: Wei Yang &lt;richard.weiyang@gmail.com&gt;
Reviewed-by: Dev Jain &lt;dev.jain@arm.com&gt;
Reviewed-by: Liam R. Howlett &lt;Liam.Howlett@oracle.com&gt;
Acked-by: Chris Li &lt;chrisl@kernel.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Stable-dep-of: 3937027caecb ("mm/hugetlb: fix two comments related to huge_pmd_unshare()")
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</content>
</entry>
<entry>
<title>mm, swap: use unified helper for swap cache look up</title>
<updated>2025-09-21T21:22:22+00:00</updated>
<author>
<name>Kairui Song</name>
<email>kasong@tencent.com</email>
</author>
<published>2025-09-16T16:00:47+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=f28124617f34153b34d7716eaff084e25a2d71fa'/>
<id>urn:sha1:f28124617f34153b34d7716eaff084e25a2d71fa</id>
<content type='text'>
The swap cache lookup helper swap_cache_get_folio currently does readahead
updates as well, so callers that are not doing swapin from any VMA or
mapping are forced to reuse filemap helpers instead, and have to access
the swap cache space directly.

So decouple readahead update with swap cache lookup.  Move the readahead
update part into a standalone helper.  Let the caller call the readahead
update helper if they do readahead.  And convert all swap cache lookups to
use swap_cache_get_folio.

After this commit, there are only three special cases for accessing swap
cache space now: huge memory splitting, migration, and shmem replacing,
because they need to lock the XArray.  The following commits will wrap
their accesses to the swap cache too, with special helpers.

And worth noting, currently dropbehind is not supported for anon folio,
and we will never see a dropbehind folio in swap cache.  The unified
helper can be updated later to handle that.

While at it, add proper kernedoc for touched helpers.

No functional change.

Link: https://lkml.kernel.org/r/20250916160100.31545-3-ryncsn@gmail.com
Signed-off-by: Kairui Song &lt;kasong@tencent.com&gt;
Reviewed-by: Baolin Wang &lt;baolin.wang@linux.alibaba.com&gt;
Reviewed-by: Barry Song &lt;baohua@kernel.org&gt;
Acked-by: David Hildenbrand &lt;david@redhat.com&gt;
Acked-by: Chris Li &lt;chrisl@kernel.org&gt;
Acked-by: Nhat Pham &lt;nphamcs@gmail.com&gt;
Suggested-by: Chris Li &lt;chrisl@kernel.org&gt;
Cc: Baoquan He &lt;bhe@redhat.com&gt;
Cc: "Huang, Ying" &lt;ying.huang@linux.alibaba.com&gt;
Cc: Hugh Dickins &lt;hughd@google.com&gt;
Cc: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Cc: Kemeng Shi &lt;shikemeng@huaweicloud.com&gt;
Cc: kernel test robot &lt;oliver.sang@intel.com&gt;
Cc: Lorenzo Stoakes &lt;lorenzo.stoakes@oracle.com&gt;
Cc: Matthew Wilcox (Oracle) &lt;willy@infradead.org&gt;
Cc: Yosry Ahmed &lt;yosryahmed@google.com&gt;
Cc: Zi Yan &lt;ziy@nvidia.com&gt;
Cc: SeongJae Park &lt;sj@kernel.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>userfaultfd: opportunistic TLB-flush batching for present pages in MOVE</title>
<updated>2025-09-13T23:55:00+00:00</updated>
<author>
<name>Lokesh Gidra</name>
<email>lokeshgidra@google.com</email>
</author>
<published>2025-08-13T19:30:24+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=50944692052b1e76fd211f4da0798ab19d7fc276'/>
<id>urn:sha1:50944692052b1e76fd211f4da0798ab19d7fc276</id>
<content type='text'>
MOVE ioctl's runtime is dominated by TLB-flush cost, which is required for
moving present pages.  Mitigate this cost by opportunistically batching
present contiguous pages for TLB flushing.

Without batching, in our testing on an arm64 Android device with UFFD GC,
which uses MOVE ioctl for compaction, we observed that out of the total
time spent in move_pages_pte(), over 40% is in ptep_clear_flush(), and
~20% in vm_normal_folio().

With batching, the proportion of vm_normal_folio() increases to over 70%
of move_pages_pte() without any changes to vm_normal_folio(). 
Furthermore, time spent within move_pages_pte() is only ~20%, which
includes TLB-flush overhead.

When the GC intensive benchmark, which was used to gather the above
numbers, is run on cuttlefish (qemu android instance on x86_64), the
completion time of the benchmark went down from ~45mins to ~20mins.

Furthermore, system_server, one of the most performance critical system
processes on android, saw over 50% reduction in GC compaction time on an
arm64 android device.

[lokeshgidra@google.com: make calculation of largest extent that can be batched unconditional on length, per Barry]
  Link: https://lkml.kernel.org/r/20250816191123.3601561-1-lokeshgidra@google.com
Link: https://lkml.kernel.org/r/20250813193024.2279805-1-lokeshgidra@google.com
Signed-off-by: Lokesh Gidra &lt;lokeshgidra@google.com&gt;
Acked-by: Peter Xu &lt;peterx@redhat.com&gt;
Reviewed-by: Barry Song &lt;baohua@kernel.org&gt;
Cc: Suren Baghdasaryan &lt;surenb@google.com&gt;
Cc: Kalesh Singh &lt;kaleshsingh@google.com&gt;
Cc: David Hildenbrand &lt;david@redhat.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm/userfaultfd: fix kmap_local LIFO ordering for CONFIG_HIGHPTE</title>
<updated>2025-08-28T05:45:41+00:00</updated>
<author>
<name>Sasha Levin</name>
<email>sashal@kernel.org</email>
</author>
<published>2025-07-31T14:44:31+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=9614d8bee66387501f48718fa306e17f2aa3f2f3'/>
<id>urn:sha1:9614d8bee66387501f48718fa306e17f2aa3f2f3</id>
<content type='text'>
With CONFIG_HIGHPTE on 32-bit ARM, move_pages_pte() maps PTE pages using
kmap_local_page(), which requires unmapping in Last-In-First-Out order.

The current code maps dst_pte first, then src_pte, but unmaps them in the
same order (dst_pte, src_pte), violating the LIFO requirement.  This
causes the warning in kunmap_local_indexed():

  WARNING: CPU: 0 PID: 604 at mm/highmem.c:622 kunmap_local_indexed+0x178/0x17c
  addr \!= __fix_to_virt(FIX_KMAP_BEGIN + idx)

Fix this by reversing the unmap order to respect LIFO ordering.

This issue follows the same pattern as similar fixes:
- commit eca6828403b8 ("crypto: skcipher - fix mismatch between mapping and unmapping order")
- commit 8cf57c6df818 ("nilfs2: eliminate staggered calls to kunmap in nilfs_rename")

Both of which addressed the same fundamental requirement that kmap_local
operations must follow LIFO ordering.

Link: https://lkml.kernel.org/r/20250731144431.773923-1-sashal@kernel.org
Fixes: adef440691ba ("userfaultfd: UFFDIO_MOVE uABI")
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
Acked-by: David Hildenbrand &lt;david@redhat.com&gt;
Reviewed-by: Suren Baghdasaryan &lt;surenb@google.com&gt;
Cc: Andrea Arcangeli &lt;aarcange@redhat.com&gt;
Cc: &lt;stable@vger.kernel.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>userfaultfd: fix a crash in UFFDIO_MOVE when PMD is a migration entry</title>
<updated>2025-08-12T06:00:59+00:00</updated>
<author>
<name>Suren Baghdasaryan</name>
<email>surenb@google.com</email>
</author>
<published>2025-08-06T22:00:22+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=aba6faec0103ed8f169be8dce2ead41fcb689446'/>
<id>urn:sha1:aba6faec0103ed8f169be8dce2ead41fcb689446</id>
<content type='text'>
When UFFDIO_MOVE encounters a migration PMD entry, it proceeds with
obtaining a folio and accessing it even though the entry is swp_entry_t. 
Add the missing check and let split_huge_pmd() handle migration entries. 
While at it also remove unnecessary folio check.

[surenb@google.com: remove extra folio check, per David]
  Link: https://lkml.kernel.org/r/20250807200418.1963585-1-surenb@google.com
Link: https://lkml.kernel.org/r/20250806220022.926763-1-surenb@google.com
Fixes: adef440691ba ("userfaultfd: UFFDIO_MOVE uABI")
Signed-off-by: Suren Baghdasaryan &lt;surenb@google.com&gt;
Reported-by: syzbot+b446dbe27035ef6bd6c2@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/68794b5c.a70a0220.693ce.0050.GAE@google.com/
Reviewed-by: Peter Xu &lt;peterx@redhat.com&gt;
Acked-by: David Hildenbrand &lt;david@redhat.com&gt;
Cc: Andrea Arcangeli &lt;aarcange@redhat.com&gt;
Cc: Lokesh Gidra &lt;lokeshgidra@google.com&gt;
Cc: &lt;stable@vger.kernel.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm: remove redundant pXd_devmap calls</title>
<updated>2025-07-10T05:42:17+00:00</updated>
<author>
<name>Alistair Popple</name>
<email>apopple@nvidia.com</email>
</author>
<published>2025-06-19T08:57:59+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=8a6a984c2e0ea406459b445a3910a454bece3aa1'/>
<id>urn:sha1:8a6a984c2e0ea406459b445a3910a454bece3aa1</id>
<content type='text'>
DAX was the only thing that created pmd_devmap and pud_devmap entries
however it no longer does as DAX pages are now refcounted normally and
pXd_trans_huge() returns true for those.  Therefore checking both
pXd_devmap and pXd_trans_huge() is redundant and the former can be removed
without changing behaviour as it will always be false.

Link: https://lkml.kernel.org/r/d58f089dc16b7feb7c6728164f37dea65d64a0d3.1750323463.git-series.apopple@nvidia.com
Signed-off-by: Alistair Popple &lt;apopple@nvidia.com&gt;
Cc: Balbir Singh &lt;balbirs@nvidia.com&gt;
Cc: Björn Töpel &lt;bjorn@kernel.org&gt;
Cc: Björn Töpel &lt;bjorn@rivosinc.com&gt;
Cc: Christoph Hellwig &lt;hch@lst.de&gt;
Cc: Chunyan Zhang &lt;zhang.lyra@gmail.com&gt;
Cc: Dan Williams &lt;dan.j.williams@intel.com&gt;
Cc: David Hildenbrand &lt;david@redhat.com&gt;
Cc: Deepak Gupta &lt;debug@rivosinc.com&gt;
Cc: Gerald Schaefer &lt;gerald.schaefer@linux.ibm.com&gt;
Cc: Inki Dae &lt;m.szyprowski@samsung.com&gt;
Cc: Jason Gunthorpe &lt;jgg@nvidia.com&gt;
Cc: John Groves &lt;john@groves.net&gt;
Cc: John Hubbard &lt;jhubbard@nvidia.com&gt;
Cc: Lorenzo Stoakes &lt;lorenzo.stoakes@oracle.com&gt;
Cc: Matthew Wilcox (Oracle) &lt;willy@infradead.org&gt;
Cc: Will Deacon &lt;will@kernel.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm: convert pXd_devmap checks to vma_is_dax</title>
<updated>2025-07-10T05:42:15+00:00</updated>
<author>
<name>Alistair Popple</name>
<email>apopple@nvidia.com</email>
</author>
<published>2025-06-19T08:57:53+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=0544f3f78da3d7d2baf546e70773a8d66cdb127e'/>
<id>urn:sha1:0544f3f78da3d7d2baf546e70773a8d66cdb127e</id>
<content type='text'>
Patch series "mm: Remove pXX_devmap page table bit and pfn_t type", v3.

All users of dax now require a ZONE_DEVICE page which is properly
refcounted.  This means there is no longer any need for the PFN_DEV,
PFN_MAP and PFN_SPECIAL flags.  Furthermore the PFN_SG_CHAIN and
PFN_SG_LAST flags never appear to have been used.  It is therefore
possible to remove the pfn_t type and replace any usage with raw pfns.

The remaining users of PFN_DEV have simply passed this to
vmf_insert_mixed() to create pte_devmap() mappings.  It is unclear why
this was the case but presumably to ensure vm_normal_page() does not
return these pages.  These users can be trivially converted to raw pfns
and creating a pXX_special() mapping to ensure vm_normal_page() still
doesn't return these pages.

Now that there are no users of PFN_DEV we can remove the devmap page table
bit and all associated functions and macros, freeing up a software page
table bit.


This patch (of 14):

Currently dax is the only user of pmd and pud mapped ZONE_DEVICE pages. 
Therefore page walkers that want to exclude DAX pages can check pmd_devmap
or pud_devmap.  However soon dax will no longer set PFN_DEV, meaning dax
pages are mapped as normal pages.

Ensure page walkers that currently use pXd_devmap to skip DAX pages
continue to do so by adding explicit checks of the VMA instead.

Remove vma_is_dax() check from mm/userfaultfd.c as validate_move_areas()
will already skip DAX VMA's on account of them not being anonymous.

The check in userfaultfd_must_wait() is also redundant as
vma_can_userfault() should have already filtered out dax vma's.

For HMM the pud_devmap check seems unnecessary as there is no reason we
shouldn't be able to handle any leaf PUD here so remove it also.

Link: https://lkml.kernel.org/r/cover.176965585864cb8d2cf41464b44dcc0471e643a0.1750323463.git-series.apopple@nvidia.com
Link: https://lkml.kernel.org/r/f0611f6f475f48fcdf34c65084a359aefef4cccc.1750323463.git-series.apopple@nvidia.com
Signed-off-by: Alistair Popple &lt;apopple@nvidia.com&gt;
Reviewed-by: Jason Gunthorpe &lt;jgg@nvidia.com&gt;
Reviewed-by: Dan Williams &lt;dan.j.williams@intel.com&gt;
Cc: Balbir Singh &lt;balbirs@nvidia.com&gt;
Cc: Björn Töpel &lt;bjorn@kernel.org&gt;
Cc: Christoph Hellwig &lt;hch@lst.de&gt;
Cc: Chunyan Zhang &lt;zhang.lyra@gmail.com&gt;
Cc: David Hildenbrand &lt;david@redhat.com&gt;
Cc: Deepak Gupta &lt;debug@rivosinc.com&gt;
Cc: Gerald Schaefer &lt;gerald.schaefer@linux.ibm.com&gt;
Cc: Inki Dae &lt;m.szyprowski@samsung.com&gt;
Cc: John Groves &lt;john@groves.net&gt;
Cc: John Hubbard &lt;jhubbard@nvidia.com&gt;
Cc: Lorenzo Stoakes &lt;lorenzo.stoakes@oracle.com&gt;
Cc: Matthew Wilcox (Oracle) &lt;willy@infradead.org&gt;
Cc: Björn Töpel &lt;bjorn@rivosinc.com&gt;
Cc: Will Deacon &lt;will@kernel.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm: update core kernel code to use vm_flags_t consistently</title>
<updated>2025-07-10T05:42:13+00:00</updated>
<author>
<name>Lorenzo Stoakes</name>
<email>lorenzo.stoakes@oracle.com</email>
</author>
<published>2025-06-18T19:42:53+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=bfbe71109fa40e8cc05a0f99e6734b7d76ee00b0'/>
<id>urn:sha1:bfbe71109fa40e8cc05a0f99e6734b7d76ee00b0</id>
<content type='text'>
The core kernel code is currently very inconsistent in its use of
vm_flags_t vs.  unsigned long.  This prevents us from changing the type of
vm_flags_t in the future and is simply not correct, so correct this.

While this results in rather a lot of churn, it is a critical
pre-requisite for a future planned change to VMA flag type.

Additionally, update VMA userland tests to account for the changes.

To make review easier and to break things into smaller parts, driver and
architecture-specific changes is left for a subsequent commit.

The code has been adjusted to cascade the changes across all calling code
as far as is needed.

We will adjust architecture-specific and driver code in a subsequent patch.

Overall, this patch does not introduce any functional change.

Link: https://lkml.kernel.org/r/d1588e7bb96d1ea3fe7b9df2c699d5b4592d901d.1750274467.git.lorenzo.stoakes@oracle.com
Signed-off-by: Lorenzo Stoakes &lt;lorenzo.stoakes@oracle.com&gt;
Acked-by: Kees Cook &lt;kees@kernel.org&gt;
Acked-by: Mike Rapoport (Microsoft) &lt;rppt@kernel.org&gt;
Acked-by: Jan Kara &lt;jack@suse.cz&gt;
Acked-by: Christian Brauner &lt;brauner@kernel.org&gt;
Reviewed-by: Vlastimil Babka &lt;vbabka@suse.cz&gt;
Acked-by: Oscar Salvador &lt;osalvador@suse.de&gt;
Reviewed-by: Pedro Falcato &lt;pfalcato@suse.de&gt;
Acked-by: Zi Yan &lt;ziy@nvidia.com&gt;
Acked-by: David Hildenbrand &lt;david@redhat.com&gt;
Reviewed-by: Anshuman Khandual &lt;anshuman.khandual@arm.com&gt;
Cc: Jann Horn &lt;jannh@google.com&gt;
Cc: Liam R. Howlett &lt;Liam.Howlett@oracle.com&gt;
Cc: Catalin Marinas &lt;catalin.marinas@arm.com&gt;
Cc: Jarkko Sakkinen &lt;jarkko@kernel.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>userfaultfd: remove (VM_)BUG_ON()s</title>
<updated>2025-07-10T05:42:01+00:00</updated>
<author>
<name>Tal Zussman</name>
<email>tz2294@columbia.edu</email>
</author>
<published>2025-06-20T01:24:25+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=31defc3b01d907e3e899de9e7002a6a63998f07a'/>
<id>urn:sha1:31defc3b01d907e3e899de9e7002a6a63998f07a</id>
<content type='text'>
BUG_ON() is deprecated [1].  Convert all the BUG_ON()s and VM_BUG_ON()s to
use VM_WARN_ON_ONCE().

There are a few additional cases that are converted or modified:

- Convert the printk(KERN_WARNING ...) in handle_userfault() to use
  pr_warn().

- Convert the WARN_ON_ONCE()s in move_pages() to use VM_WARN_ON_ONCE(),
  as the relevant conditions are already checked in validate_range() in
  move_pages()'s caller.

- Convert the VM_WARN_ON()'s in move_pages() to VM_WARN_ON_ONCE(). These
  cases should never happen and are similar to those in mfill_atomic()
  and mfill_atomic_hugetlb(), which were previously BUG_ON()s.
  move_pages() was added later than those functions and makes use of
  VM_WARN_ON() as a replacement for the deprecated BUG_ON(), but.
  VM_WARN_ON_ONCE() is likely a better direct replacement.

- Convert the WARN_ON() for !VM_MAYWRITE in userfaultfd_unregister() and
  userfaultfd_register_range() to VM_WARN_ON_ONCE(). This condition is
  enforced in userfaultfd_register() so it should never happen, and can
  be converted to a debug check.

[1] https://www.kernel.org/doc/html/v6.15/process/coding-style.html#use-warn-rather-than-bug

Link: https://lkml.kernel.org/r/20250619-uffd-fixes-v3-3-a7274d3bd5e4@columbia.edu
Signed-off-by: Tal Zussman &lt;tz2294@columbia.edu&gt;
Cc: Al Viro &lt;viro@zeniv.linux.org.uk&gt;
Cc: Andrea Arcangeli &lt;aarcange@redhat.com&gt;
Cc: Christian Brauner &lt;brauner@kernel.org&gt;
Cc: David Hildenbrand &lt;david@redhat.com&gt;
Cc: Jan Kara &lt;jack@suse.cz&gt;
Cc: Jason A. Donenfeld &lt;Jason@zx2c4.com&gt;
Cc: Peter Xu &lt;peterx@redhat.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm: userfaultfd: fix race of userfaultfd_move and swap cache</title>
<updated>2025-06-20T03:48:01+00:00</updated>
<author>
<name>Kairui Song</name>
<email>kasong@tencent.com</email>
</author>
<published>2025-06-04T15:10:38+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=0ea148a799198518d8ebab63ddd0bb6114a103bc'/>
<id>urn:sha1:0ea148a799198518d8ebab63ddd0bb6114a103bc</id>
<content type='text'>
This commit fixes two kinds of races, they may have different results:

Barry reported a BUG_ON in commit c50f8e6053b0, we may see the same
BUG_ON if the filemap lookup returned NULL and folio is added to swap
cache after that.

If another kind of race is triggered (folio changed after lookup) we
may see RSS counter is corrupted:

[  406.893936] BUG: Bad rss-counter state mm:ffff0000c5a9ddc0
type:MM_ANONPAGES val:-1
[  406.894071] BUG: Bad rss-counter state mm:ffff0000c5a9ddc0
type:MM_SHMEMPAGES val:1

Because the folio is being accounted to the wrong VMA.

I'm not sure if there will be any data corruption though, seems no. 
The issues above are critical already.


On seeing a swap entry PTE, userfaultfd_move does a lockless swap cache
lookup, and tries to move the found folio to the faulting vma.  Currently,
it relies on checking the PTE value to ensure that the moved folio still
belongs to the src swap entry and that no new folio has been added to the
swap cache, which turns out to be unreliable.

While working and reviewing the swap table series with Barry, following
existing races are observed and reproduced [1]:

In the example below, move_pages_pte is moving src_pte to dst_pte, where
src_pte is a swap entry PTE holding swap entry S1, and S1 is not in the
swap cache:

CPU1                               CPU2
userfaultfd_move
  move_pages_pte()
    entry = pte_to_swp_entry(orig_src_pte);
    // Here it got entry = S1
    ... &lt; interrupted&gt; ...
                                   &lt;swapin src_pte, alloc and use folio A&gt;
                                   // folio A is a new allocated folio
                                   // and get installed into src_pte
                                   &lt;frees swap entry S1&gt;
                                   // src_pte now points to folio A, S1
                                   // has swap count == 0, it can be freed
                                   // by folio_swap_swap or swap
                                   // allocator's reclaim.
                                   &lt;try to swap out another folio B&gt;
                                   // folio B is a folio in another VMA.
                                   &lt;put folio B to swap cache using S1 &gt;
                                   // S1 is freed, folio B can use it
                                   // for swap out with no problem.
                                   ...
    folio = filemap_get_folio(S1)
    // Got folio B here !!!
    ... &lt; interrupted again&gt; ...
                                   &lt;swapin folio B and free S1&gt;
                                   // Now S1 is free to be used again.
                                   &lt;swapout src_pte &amp; folio A using S1&gt;
                                   // Now src_pte is a swap entry PTE
                                   // holding S1 again.
    folio_trylock(folio)
    move_swap_pte
      double_pt_lock
      is_pte_pages_stable
      // Check passed because src_pte == S1
      folio_move_anon_rmap(...)
      // Moved invalid folio B here !!!

The race window is very short and requires multiple collisions of multiple
rare events, so it's very unlikely to happen, but with a deliberately
constructed reproducer and increased time window, it can be reproduced
easily.

This can be fixed by checking if the folio returned by filemap is the
valid swap cache folio after acquiring the folio lock.

Another similar race is possible: filemap_get_folio may return NULL, but
folio (A) could be swapped in and then swapped out again using the same
swap entry after the lookup.  In such a case, folio (A) may remain in the
swap cache, so it must be moved too:

CPU1                               CPU2
userfaultfd_move
  move_pages_pte()
    entry = pte_to_swp_entry(orig_src_pte);
    // Here it got entry = S1, and S1 is not in swap cache
    folio = filemap_get_folio(S1)
    // Got NULL
    ... &lt; interrupted again&gt; ...
                                   &lt;swapin folio A and free S1&gt;
                                   &lt;swapout folio A re-using S1&gt;
    move_swap_pte
      double_pt_lock
      is_pte_pages_stable
      // Check passed because src_pte == S1
      folio_move_anon_rmap(...)
      // folio A is ignored !!!

Fix this by checking the swap cache again after acquiring the src_pte
lock.  And to avoid the filemap overhead, we check swap_map directly [2].

The SWP_SYNCHRONOUS_IO path does make the problem more complex, but so far
we don't need to worry about that, since folios can only be exposed to the
swap cache in the swap out path, and this is covered in this patch by
checking the swap cache again after acquiring the src_pte lock.

Testing with a simple C program that allocates and moves several GB of
memory did not show any observable performance change.

Link: https://lkml.kernel.org/r/20250604151038.21968-1-ryncsn@gmail.com
Fixes: adef440691ba ("userfaultfd: UFFDIO_MOVE uABI")
Signed-off-by: Kairui Song &lt;kasong@tencent.com&gt;
Closes: https://lore.kernel.org/linux-mm/CAMgjq7B1K=6OOrK2OUZ0-tqCzi+EJt+2_K97TPGoSt=9+JwP7Q@mail.gmail.com/ [1]
Link: https://lore.kernel.org/all/CAGsJ_4yJhJBo16XhiC-nUzSheyX-V3-nFE+tAi=8Y560K8eT=A@mail.gmail.com/ [2]
Reviewed-by: Lokesh Gidra &lt;lokeshgidra@google.com&gt;
Acked-by: Peter Xu &lt;peterx@redhat.com&gt;
Reviewed-by: Suren Baghdasaryan &lt;surenb@google.com&gt;
Reviewed-by: Barry Song &lt;baohua@kernel.org&gt;
Reviewed-by: Chris Li &lt;chrisl@kernel.org&gt;
Cc: Andrea Arcangeli &lt;aarcange@redhat.com&gt;
Cc: David Hildenbrand &lt;david@redhat.com&gt;
Cc: Kairui Song &lt;kasong@tencent.com&gt;
Cc: &lt;stable@vger.kernel.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</content>
</entry>
</feed>
