<feed xmlns='http://www.w3.org/2005/Atom'>
<title>kernel/linux.git/mm/mprotect.c, branch v6.1.174</title>
<subtitle>Linux kernel stable tree (mirror)</subtitle>
<id>https://git.radix-linux.su/kernel/linux.git/atom?h=v6.1.174</id>
<link rel='self' href='https://git.radix-linux.su/kernel/linux.git/atom?h=v6.1.174'/>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/'/>
<updated>2026-02-06T15:44:22+00:00</updated>
<entry>
<title>Revert "mm/mprotect: delete pmd_none_or_clear_bad_unless_trans_huge()"</title>
<updated>2026-02-06T15:44:22+00:00</updated>
<author>
<name>Harry Yoo</name>
<email>harry.yoo@oracle.com</email>
</author>
<published>2026-01-16T03:38:38+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=d729fa456af2ae7734b1930ff280b108204cc14c'/>
<id>urn:sha1:d729fa456af2ae7734b1930ff280b108204cc14c</id>
<content type='text'>
This reverts commit 91750c8a4be42d73b6810a1c35d73c8a3cd0b481 which is
commit 670ddd8cdcbd1d07a4571266ae3517f821728c3a upstream.

While the commit fixes a race condition between NUMA balancing and THP
migration, it causes a NULL-pointer-deref when the pmd temporarily
transitions from pmd_trans_huge() to pmd_none(). Verifying whether the
pmd value has changed under page table lock does not prevent the crash,
as it occurs when acquiring the lock.

Since the original issue addressed by the commit is quite rare and
non-fatal, revert the commit. A better backport solution that more
closely matches the upstream semantics will be provided as a follow-up.

Signed-off-by: Harry Yoo &lt;harry.yoo@oracle.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</content>
</entry>
<entry>
<title>mm/mprotect: delete pmd_none_or_clear_bad_unless_trans_huge()</title>
<updated>2026-01-11T14:19:25+00:00</updated>
<author>
<name>Hugh Dickins</name>
<email>hughd@google.com</email>
</author>
<published>2026-01-06T11:47:14+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=91750c8a4be42d73b6810a1c35d73c8a3cd0b481'/>
<id>urn:sha1:91750c8a4be42d73b6810a1c35d73c8a3cd0b481</id>
<content type='text'>
commit 670ddd8cdcbd1d07a4571266ae3517f821728c3a upstream.

change_pmd_range() had special pmd_none_or_clear_bad_unless_trans_huge(),
required to avoid "bad" choices when setting automatic NUMA hinting under
mmap_read_lock(); but most of that is already covered in pte_offset_map()
now.  change_pmd_range() just wants a pmd_none() check before wasting time
on MMU notifiers, then checks on the read-once _pmd value to work out
what's needed for huge cases.  If change_pte_range() returns -EAGAIN to
retry if pte_offset_map_lock() fails, nothing more special is needed.

Link: https://lkml.kernel.org/r/725a42a9-91e9-c868-925-e3a5fd40bb4f@google.com
Signed-off-by: Hugh Dickins &lt;hughd@google.com&gt;
Cc: Alistair Popple &lt;apopple@nvidia.com&gt;
Cc: Anshuman Khandual &lt;anshuman.khandual@arm.com&gt;
Cc: Axel Rasmussen &lt;axelrasmussen@google.com&gt;
Cc: Christophe Leroy &lt;christophe.leroy@csgroup.eu&gt;
Cc: Christoph Hellwig &lt;hch@infradead.org&gt;
Cc: David Hildenbrand &lt;david@redhat.com&gt;
Cc: "Huang, Ying" &lt;ying.huang@intel.com&gt;
Cc: Ira Weiny &lt;ira.weiny@intel.com&gt;
Cc: Jason Gunthorpe &lt;jgg@ziepe.ca&gt;
Cc: Kirill A. Shutemov &lt;kirill.shutemov@linux.intel.com&gt;
Cc: Lorenzo Stoakes &lt;lstoakes@gmail.com&gt;
Cc: Matthew Wilcox &lt;willy@infradead.org&gt;
Cc: Mel Gorman &lt;mgorman@techsingularity.net&gt;
Cc: Miaohe Lin &lt;linmiaohe@huawei.com&gt;
Cc: Mike Kravetz &lt;mike.kravetz@oracle.com&gt;
Cc: Mike Rapoport (IBM) &lt;rppt@kernel.org&gt;
Cc: Minchan Kim &lt;minchan@kernel.org&gt;
Cc: Naoya Horiguchi &lt;naoya.horiguchi@nec.com&gt;
Cc: Pavel Tatashin &lt;pasha.tatashin@soleen.com&gt;
Cc: Peter Xu &lt;peterx@redhat.com&gt;
Cc: Peter Zijlstra &lt;peterz@infradead.org&gt;
Cc: Qi Zheng &lt;zhengqi.arch@bytedance.com&gt;
Cc: Ralph Campbell &lt;rcampbell@nvidia.com&gt;
Cc: Ryan Roberts &lt;ryan.roberts@arm.com&gt;
Cc: SeongJae Park &lt;sj@kernel.org&gt;
Cc: Song Liu &lt;song@kernel.org&gt;
Cc: Steven Price &lt;steven.price@arm.com&gt;
Cc: Suren Baghdasaryan &lt;surenb@google.com&gt;
Cc: Thomas Hellström &lt;thomas.hellstrom@linux.intel.com&gt;
Cc: Will Deacon &lt;will@kernel.org&gt;
Cc: Yang Shi &lt;shy828301@gmail.com&gt;
Cc: Yu Zhao &lt;yuzhao@google.com&gt;
Cc: Zack Rusin &lt;zackr@vmware.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
[ Background: It was reported that a bad pmd is seen when automatic NUMA
  balancing is marking page table entries as prot_numa:

      [2437548.196018] mm/pgtable-generic.c:50: bad pmd 00000000af22fc02(dffffffe71fbfe02)
      [2437548.235022] Call Trace:
      [2437548.238234]  &lt;TASK&gt;
      [2437548.241060]  dump_stack_lvl+0x46/0x61
      [2437548.245689]  panic+0x106/0x2e5
      [2437548.249497]  pmd_clear_bad+0x3c/0x3c
      [2437548.253967]  change_pmd_range.isra.0+0x34d/0x3a7
      [2437548.259537]  change_p4d_range+0x156/0x20e
      [2437548.264392]  change_protection_range+0x116/0x1a9
      [2437548.269976]  change_prot_numa+0x15/0x37
      [2437548.274774]  task_numa_work+0x1b8/0x302
      [2437548.279512]  task_work_run+0x62/0x95
      [2437548.283882]  exit_to_user_mode_loop+0x1a4/0x1a9
      [2437548.289277]  exit_to_user_mode_prepare+0xf4/0xfc
      [2437548.294751]  ? sysvec_apic_timer_interrupt+0x34/0x81
      [2437548.300677]  irqentry_exit_to_user_mode+0x5/0x25
      [2437548.306153]  asm_sysvec_apic_timer_interrupt+0x16/0x1b

    This is due to a race condition between change_prot_numa() and
    THP migration because the kernel doesn't check is_swap_pmd() and
    pmd_trans_huge() atomically:

    change_prot_numa()                      THP migration
    ======================================================================
    - change_pmd_range()
    -&gt; is_swap_pmd() returns false,
    meaning it's not a PMD migration
    entry.
                                      - do_huge_pmd_numa_page()
                                      -&gt; migrate_misplaced_page() sets
                                         migration entries for the THP.
    - change_pmd_range()
    -&gt; pmd_none_or_clear_bad_unless_trans_huge()
    -&gt; pmd_none() and pmd_trans_huge() returns false
    - pmd_none_or_clear_bad_unless_trans_huge()
    -&gt; pmd_bad() returns true for the migration entry!

  The upstream commit 670ddd8cdcbd ("mm/mprotect: delete
  pmd_none_or_clear_bad_unless_trans_huge()") closes this race condition
  by checking is_swap_pmd() and pmd_trans_huge() atomically.

  Backporting note:
    Unlike the mainline, pte_offset_map_lock() does not check if the pmd
    entry is a migration entry or a hugepage; acquires PTL unconditionally
    instead of returning failure. Therefore, it is necessary to keep the
    !is_swap_pmd() &amp;&amp; !pmd_trans_huge() &amp;&amp; !pmd_devmap() check before
    acquiring the PTL.

    After acquiring the lock, open-code the semantics of
    pte_offset_map_lock() in the mainline kernel; change_pte_range() fails
    if the pmd value has changed. This requires adding pmd_old parameter
    (pmd_t value that is read before calling the function) to
    change_pte_range(). ]

Signed-off-by: Harry Yoo &lt;harry.yoo@oracle.com&gt;
Acked-by: David Hildenbrand (Red Hat) &lt;david@kernel.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</content>
</entry>
<entry>
<title>mm/mprotect: use long for page accountings and retval</title>
<updated>2026-01-11T14:19:25+00:00</updated>
<author>
<name>Peter Xu</name>
<email>peterx@redhat.com</email>
</author>
<published>2026-01-06T11:47:13+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=238ff5056d392fcb1022b20966d6e1c58dade6a1'/>
<id>urn:sha1:238ff5056d392fcb1022b20966d6e1c58dade6a1</id>
<content type='text'>
commit a79390f5d6a78647fd70856bd42b22d994de0ba2 upstream.

Switch to use type "long" for page accountings and retval across the whole
procedure of change_protection().

The change should have shrinked the possible maximum page number to be
half comparing to previous (ULONG_MAX / 2), but it shouldn't overflow on
any system either because the maximum possible pages touched by change
protection should be ULONG_MAX / PAGE_SIZE.

Two reasons to switch from "unsigned long" to "long":

  1. It suites better on count_vm_numa_events(), whose 2nd parameter takes
     a long type.

  2. It paves way for returning negative (error) values in the future.

Currently the only caller that consumes this retval is change_prot_numa(),
where the unsigned long was converted to an int.  Since at it, touching up
the numa code to also take a long, so it'll avoid any possible overflow
too during the int-size convertion.

Link: https://lkml.kernel.org/r/20230104225207.1066932-3-peterx@redhat.com
Signed-off-by: Peter Xu &lt;peterx@redhat.com&gt;
Acked-by: Mike Kravetz &lt;mike.kravetz@oracle.com&gt;
Acked-by: James Houghton &lt;jthoughton@google.com&gt;
Cc: Andrea Arcangeli &lt;aarcange@redhat.com&gt;
Cc: Axel Rasmussen &lt;axelrasmussen@google.com&gt;
Cc: David Hildenbrand &lt;david@redhat.com&gt;
Cc: Muchun Song &lt;songmuchun@bytedance.com&gt;
Cc: Nadav Amit &lt;nadav.amit@gmail.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
[ Adjust context ]
Signed-off-by: Harry Yoo &lt;harry.yoo@oracle.com&gt;
Acked-by: David Hildenbrand (Red Hat) &lt;david@kernel.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</content>
</entry>
<entry>
<title>mm/uffd: fix warning without PTE_MARKER_UFFD_WP compiled in</title>
<updated>2022-10-12T22:56:46+00:00</updated>
<author>
<name>Peter Xu</name>
<email>peterx@redhat.com</email>
</author>
<published>2022-10-01T00:25:55+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=515778e2d790652a38a24554fdb7f21420d91efc'/>
<id>urn:sha1:515778e2d790652a38a24554fdb7f21420d91efc</id>
<content type='text'>
When PTE_MARKER_UFFD_WP not configured, it's still possible to reach pte
marker code and trigger an warning. Add a few CONFIG_PTE_MARKER_UFFD_WP
ifdefs to make sure the code won't be reached when not compiled in.

Link: https://lkml.kernel.org/r/YzeR+R6b4bwBlBHh@x1n
Fixes: b1f9e876862d ("mm/uffd: enable write protection for shmem &amp; hugetlbfs")
Signed-off-by: Peter Xu &lt;peterx@redhat.com&gt;
Reported-by: &lt;syzbot+2b9b4f0895be09a6dec3@syzkaller.appspotmail.com&gt;
Cc: Axel Rasmussen &lt;axelrasmussen@google.com&gt;
Cc: Brian Geffon &lt;bgeffon@google.com&gt;
Cc: Edward Liaw &lt;edliaw@google.com&gt;
Cc: Liu Shixin &lt;liushixin2@huawei.com&gt;
Cc: Mike Kravetz &lt;mike.kravetz@oracle.com&gt;
Cc: &lt;stable@vger.kernel.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm/mprotect: use maple tree navigation instead of VMA linked list</title>
<updated>2022-09-27T02:46:24+00:00</updated>
<author>
<name>Liam R. Howlett</name>
<email>Liam.Howlett@Oracle.com</email>
</author>
<published>2022-09-06T19:49:02+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=70821e0b89dd477109c42a92d571f6dc6f6aa956'/>
<id>urn:sha1:70821e0b89dd477109c42a92d571f6dc6f6aa956</id>
<content type='text'>
Switch to navigating the VMA list with the maple tree operators in
preparation for removing the linked list.

Link: https://lkml.kernel.org/r/20220906194824.2110408-59-Liam.Howlett@oracle.com
Signed-off-by: Liam R. Howlett &lt;Liam.Howlett@Oracle.com&gt;
Acked-by: Vlastimil Babka &lt;vbabka@suse.cz&gt;
Tested-by: Yu Zhao &lt;yuzhao@google.com&gt;
Cc: Catalin Marinas &lt;catalin.marinas@arm.com&gt;
Cc: David Hildenbrand &lt;david@redhat.com&gt;
Cc: David Howells &lt;dhowells@redhat.com&gt;
Cc: Davidlohr Bueso &lt;dave@stgolabs.net&gt;
Cc: "Matthew Wilcox (Oracle)" &lt;willy@infradead.org&gt;
Cc: SeongJae Park &lt;sj@kernel.org&gt;
Cc: Sven Schnelle &lt;svens@linux.ibm.com&gt;
Cc: Will Deacon &lt;will@kernel.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm/demotion: update node_is_toptier to work with memory tiers</title>
<updated>2022-09-27T02:46:12+00:00</updated>
<author>
<name>Aneesh Kumar K.V</name>
<email>aneesh.kumar@linux.ibm.com</email>
</author>
<published>2022-08-18T13:10:41+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=467b171af881282fc627328e6c164f044a6df888'/>
<id>urn:sha1:467b171af881282fc627328e6c164f044a6df888</id>
<content type='text'>
With memory tier support we can have memory only NUMA nodes in the top
tier from which we want to avoid promotion tracking NUMA faults.  Update
node_is_toptier to work with memory tiers.  All NUMA nodes are by default
top tier nodes.  With lower(slower) memory tiers added we consider all
memory tiers above a memory tier having CPU NUMA nodes as a top memory
tier

[sj@kernel.org: include missed header file, memory-tiers.h]
  Link: https://lkml.kernel.org/r/20220820190720.248704-1-sj@kernel.org
[akpm@linux-foundation.org: mm/memory.c needs linux/memory-tiers.h]
[aneesh.kumar@linux.ibm.com: make toptier_distance inclusive upper bound of toptiers]
  Link: https://lkml.kernel.org/r/20220830081457.118960-1-aneesh.kumar@linux.ibm.com
Link: https://lkml.kernel.org/r/20220818131042.113280-10-aneesh.kumar@linux.ibm.com
Signed-off-by: Aneesh Kumar K.V &lt;aneesh.kumar@linux.ibm.com&gt;
Reviewed-by: "Huang, Ying" &lt;ying.huang@intel.com&gt;
Acked-by: Wei Xu &lt;weixugc@google.com&gt;
Cc: Alistair Popple &lt;apopple@nvidia.com&gt;
Cc: Bharata B Rao &lt;bharata@amd.com&gt;
Cc: Dan Williams &lt;dan.j.williams@intel.com&gt;
Cc: Dave Hansen &lt;dave.hansen@intel.com&gt;
Cc: Davidlohr Bueso &lt;dave@stgolabs.net&gt;
Cc: Hesham Almatary &lt;hesham.almatary@huawei.com&gt;
Cc: Jagdish Gediya &lt;jvgediya.oss@gmail.com&gt;
Cc: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Cc: Jonathan Cameron &lt;Jonathan.Cameron@huawei.com&gt;
Cc: Michal Hocko &lt;mhocko@kernel.org&gt;
Cc: Tim Chen &lt;tim.c.chen@intel.com&gt;
Cc: Yang Shi &lt;shy828301@gmail.com&gt;
Cc: SeongJae Park &lt;sj@kernel.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>memory tiering: hot page selection with hint page fault latency</title>
<updated>2022-09-12T03:25:54+00:00</updated>
<author>
<name>Huang Ying</name>
<email>ying.huang@intel.com</email>
</author>
<published>2022-07-13T08:39:51+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=33024536bafd9129f1d16ade0974671c648700ac'/>
<id>urn:sha1:33024536bafd9129f1d16ade0974671c648700ac</id>
<content type='text'>
Patch series "memory tiering: hot page selection", v4.

To optimize page placement in a memory tiering system with NUMA balancing,
the hot pages in the slow memory nodes need to be identified. 
Essentially, the original NUMA balancing implementation selects the mostly
recently accessed (MRU) pages to promote.  But this isn't a perfect
algorithm to identify the hot pages.  Because the pages with quite low
access frequency may be accessed eventually given the NUMA balancing page
table scanning period could be quite long (e.g.  60 seconds).  So in this
patchset, we implement a new hot page identification algorithm based on
the latency between NUMA balancing page table scanning and hint page
fault.  Which is a kind of mostly frequently accessed (MFU) algorithm.

In NUMA balancing memory tiering mode, if there are hot pages in slow
memory node and cold pages in fast memory node, we need to promote/demote
hot/cold pages between the fast and cold memory nodes.

A choice is to promote/demote as fast as possible.  But the CPU cycles and
memory bandwidth consumed by the high promoting/demoting throughput will
hurt the latency of some workload because of accessing inflating and slow
memory bandwidth contention.

A way to resolve this issue is to restrict the max promoting/demoting
throughput.  It will take longer to finish the promoting/demoting.  But
the workload latency will be better.  This is implemented in this patchset
as the page promotion rate limit mechanism.

The promotion hot threshold is workload and system configuration
dependent.  So in this patchset, a method to adjust the hot threshold
automatically is implemented.  The basic idea is to control the number of
the candidate promotion pages to match the promotion rate limit.

We used the pmbench memory accessing benchmark tested the patchset on a
2-socket server system with DRAM and PMEM installed.  The test results are
as follows,

		pmbench score		promote rate
		 (accesses/s)			MB/s
		-------------		------------
base		  146887704.1		       725.6
hot selection     165695601.2		       544.0
rate limit	  162814569.8		       165.2
auto adjustment	  170495294.0                  136.9

From the results above,

With hot page selection patch [1/3], the pmbench score increases about
12.8%, and promote rate (overhead) decreases about 25.0%, compared with
base kernel.

With rate limit patch [2/3], pmbench score decreases about 1.7%, and
promote rate decreases about 69.6%, compared with hot page selection
patch.

With threshold auto adjustment patch [3/3], pmbench score increases about
4.7%, and promote rate decrease about 17.1%, compared with rate limit
patch.

Baolin helped to test the patchset with MySQL on a machine which contains
1 DRAM node (30G) and 1 PMEM node (126G).

sysbench /usr/share/sysbench/oltp_read_write.lua \
......
--tables=200 \
--table-size=1000000 \
--report-interval=10 \
--threads=16 \
--time=120

The tps can be improved about 5%.


This patch (of 3):

To optimize page placement in a memory tiering system with NUMA balancing,
the hot pages in the slow memory node need to be identified.  Essentially,
the original NUMA balancing implementation selects the mostly recently
accessed (MRU) pages to promote.  But this isn't a perfect algorithm to
identify the hot pages.  Because the pages with quite low access frequency
may be accessed eventually given the NUMA balancing page table scanning
period could be quite long (e.g.  60 seconds).  The most frequently
accessed (MFU) algorithm is better.

So, in this patch we implemented a better hot page selection algorithm. 
Which is based on NUMA balancing page table scanning and hint page fault
as follows,

- When the page tables of the processes are scanned to change PTE/PMD
  to be PROT_NONE, the current time is recorded in struct page as scan
  time.

- When the page is accessed, hint page fault will occur.  The scan
  time is gotten from the struct page.  And The hint page fault
  latency is defined as

    hint page fault time - scan time

The shorter the hint page fault latency of a page is, the higher the
probability of their access frequency to be higher.  So the hint page
fault latency is a better estimation of the page hot/cold.

It's hard to find some extra space in struct page to hold the scan time. 
Fortunately, we can reuse some bits used by the original NUMA balancing.

NUMA balancing uses some bits in struct page to store the page accessing
CPU and PID (referring to page_cpupid_xchg_last()).  Which is used by the
multi-stage node selection algorithm to avoid to migrate pages shared
accessed by the NUMA nodes back and forth.  But for pages in the slow
memory node, even if they are shared accessed by multiple NUMA nodes, as
long as the pages are hot, they need to be promoted to the fast memory
node.  So the accessing CPU and PID information are unnecessary for the
slow memory pages.  We can reuse these bits in struct page to record the
scan time.  For the fast memory pages, these bits are used as before.

For the hot threshold, the default value is 1 second, which works well in
our performance test.  All pages with hint page fault latency &lt; hot
threshold will be considered hot.

It's hard for users to determine the hot threshold.  So we don't provide a
kernel ABI to set it, just provide a debugfs interface for advanced users
to experiment.  We will continue to work on a hot threshold automatic
adjustment mechanism.

The downside of the above method is that the response time to the workload
hot spot changing may be much longer.  For example,

- A previous cold memory area becomes hot

- The hint page fault will be triggered.  But the hint page fault
  latency isn't shorter than the hot threshold.  So the pages will
  not be promoted.

- When the memory area is scanned again, maybe after a scan period,
  the hint page fault latency measured will be shorter than the hot
  threshold and the pages will be promoted.

To mitigate this, if there are enough free space in the fast memory node,
the hot threshold will not be used, all pages will be promoted upon the
hint page fault for fast response.

Thanks Zhong Jiang reported and tested the fix for a bug when disabling
memory tiering mode dynamically.

Link: https://lkml.kernel.org/r/20220713083954.34196-1-ying.huang@intel.com
Link: https://lkml.kernel.org/r/20220713083954.34196-2-ying.huang@intel.com
Signed-off-by: "Huang, Ying" &lt;ying.huang@intel.com&gt;
Reviewed-by: Baolin Wang &lt;baolin.wang@linux.alibaba.com&gt;
Tested-by: Baolin Wang &lt;baolin.wang@linux.alibaba.com&gt;
Cc: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Cc: Michal Hocko &lt;mhocko@suse.com&gt;
Cc: Rik van Riel &lt;riel@surriel.com&gt;
Cc: Mel Gorman &lt;mgorman@techsingularity.net&gt;
Cc: Peter Zijlstra &lt;peterz@infradead.org&gt;
Cc: Dave Hansen &lt;dave.hansen@linux.intel.com&gt;
Cc: Yang Shi &lt;shy828301@gmail.com&gt;
Cc: Zi Yan &lt;ziy@nvidia.com&gt;
Cc: Wei Xu &lt;weixugc@google.com&gt;
Cc: osalvador &lt;osalvador@suse.de&gt;
Cc: Shakeel Butt &lt;shakeelb@google.com&gt;
Cc: Zhong Jiang &lt;zhongjiang-ali@linux.alibaba.com&gt;
Cc: Oscar Salvador &lt;osalvador@suse.de&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm/mprotect: only reference swap pfn page if type match</title>
<updated>2022-08-28T21:02:46+00:00</updated>
<author>
<name>Peter Xu</name>
<email>peterx@redhat.com</email>
</author>
<published>2022-08-23T22:11:38+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=3d2f78f08cd8388035ac375e731ec1ac1b79b09d'/>
<id>urn:sha1:3d2f78f08cd8388035ac375e731ec1ac1b79b09d</id>
<content type='text'>
Yu Zhao reported a bug after the commit "mm/swap: Add swp_offset_pfn() to
fetch PFN from swap entry" added a check in swp_offset_pfn() for swap type [1]:

  kernel BUG at include/linux/swapops.h:117!
  CPU: 46 PID: 5245 Comm: EventManager_De Tainted: G S         O L 6.0.0-dbg-DEV #2
  RIP: 0010:pfn_swap_entry_to_page+0x72/0xf0
  Code: c6 48 8b 36 48 83 fe ff 74 53 48 01 d1 48 83 c1 08 48 8b 09 f6
  c1 01 75 7b 66 90 48 89 c1 48 8b 09 f6 c1 01 74 74 5d c3 eb 9e &lt;0f&gt; 0b
  48 ba ff ff ff ff 03 00 00 00 eb ae a9 ff 0f 00 00 75 13 48
  RSP: 0018:ffffa59e73fabb80 EFLAGS: 00010282
  RAX: 00000000ffffffe8 RBX: 0c00000000000000 RCX: ffffcd5440000000
  RDX: 1ffffffffff7a80a RSI: 0000000000000000 RDI: 0c0000000000042b
  RBP: ffffa59e73fabb80 R08: ffff9965ca6e8bb8 R09: 0000000000000000
  R10: ffffffffa5a2f62d R11: 0000030b372e9fff R12: ffff997b79db5738
  R13: 000000000000042b R14: 0c0000000000042b R15: 1ffffffffff7a80a
  FS:  00007f549d1bb700(0000) GS:ffff99d3cf680000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 0000440d035b3180 CR3: 0000002243176004 CR4: 00000000003706e0
  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  Call Trace:
   &lt;TASK&gt;
   change_pte_range+0x36e/0x880
   change_p4d_range+0x2e8/0x670
   change_protection_range+0x14e/0x2c0
   mprotect_fixup+0x1ee/0x330
   do_mprotect_pkey+0x34c/0x440
   __x64_sys_mprotect+0x1d/0x30

It triggers because pfn_swap_entry_to_page() could be called upon e.g. a
genuine swap entry.

Fix it by only calling it when it's a write migration entry where the page*
is used.

[1] https://lore.kernel.org/lkml/CAOUHufaVC2Za-p8m0aiHw6YkheDcrO-C3wRGixwDS32VTS+k1w@mail.gmail.com/

Link: https://lkml.kernel.org/r/20220823221138.45602-1-peterx@redhat.com
Fixes: 6c287605fd56 ("mm: remember exclusively mapped anonymous pages with PG_anon_exclusive")
Signed-off-by: Peter Xu &lt;peterx@redhat.com&gt;
Reported-by: Yu Zhao &lt;yuzhao@google.com&gt;
Tested-by: Yu Zhao &lt;yuzhao@google.com&gt;
Reviewed-by: David Hildenbrand &lt;david@redhat.com&gt;
Cc: "Huang, Ying" &lt;ying.huang@intel.com&gt;
Cc: &lt;stable@vger.kernel.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm/mprotect: fix soft-dirty check in can_change_pte_writable()</title>
<updated>2022-07-30T01:07:18+00:00</updated>
<author>
<name>Peter Xu</name>
<email>peterx@redhat.com</email>
</author>
<published>2022-07-25T14:20:46+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=76aefad628aae152207ee624a7981b9aa1a267d8'/>
<id>urn:sha1:76aefad628aae152207ee624a7981b9aa1a267d8</id>
<content type='text'>
Patch series "mm/mprotect: Fix soft-dirty checks", v4.


This patch (of 3):

The check wanted to make sure when soft-dirty tracking is enabled we won't
grant write bit by accident, as a page fault is needed for dirty tracking.
The intention is correct but we didn't check it right because
VM_SOFTDIRTY set actually means soft-dirty tracking disabled.  Fix it.

There's another thing tricky about soft-dirty is that, we can't check the
vma flag !(vma_flags &amp; VM_SOFTDIRTY) directly but only check it after we
checked CONFIG_MEM_SOFT_DIRTY because otherwise VM_SOFTDIRTY will be
defined as zero, and !(vma_flags &amp; VM_SOFTDIRTY) will constantly return
true.  To avoid misuse, introduce a helper for checking whether vma has
soft-dirty tracking enabled.

We can easily verify this with any exclusive anonymous page, like program
below:

=======8&lt;======
  #include &lt;stdio.h&gt;
  #include &lt;unistd.h&gt;
  #include &lt;stdlib.h&gt;
  #include &lt;assert.h&gt;
  #include &lt;inttypes.h&gt;
  #include &lt;stdint.h&gt;
  #include &lt;sys/types.h&gt;
  #include &lt;sys/mman.h&gt;
  #include &lt;sys/types.h&gt;
  #include &lt;sys/stat.h&gt;
  #include &lt;unistd.h&gt;
  #include &lt;fcntl.h&gt;
  #include &lt;stdbool.h&gt;

  #define BIT_ULL(nr)                   (1ULL &lt;&lt; (nr))
  #define PM_SOFT_DIRTY                 BIT_ULL(55)

  unsigned int psize;
  char *page;

  uint64_t pagemap_read_vaddr(int fd, void *vaddr)
  {
      uint64_t value;
      int ret;

      ret = pread(fd, &amp;value, sizeof(uint64_t),
                  ((uint64_t)vaddr &gt;&gt; 12) * sizeof(uint64_t));
      assert(ret == sizeof(uint64_t));

      return value;
  }

  void clear_refs_write(void)
  {
      int fd = open("/proc/self/clear_refs", O_RDWR);

      assert(fd &gt;= 0);
      write(fd, "4", 2);
      close(fd);
  }

  #define  check_soft_dirty(str, expect)  do {                            \
          bool dirty = pagemap_read_vaddr(fd, page) &amp; PM_SOFT_DIRTY;      \
          if (dirty != expect) {                                          \
              printf("ERROR: %s, soft-dirty=%d (expect: %d)
", str, dirty, expect); \
              exit(-1);                                                   \
          }                                                               \
  } while (0)

  int main(void)
  {
      int fd = open("/proc/self/pagemap", O_RDONLY);

      assert(fd &gt;= 0);
      psize = getpagesize();
      page = mmap(NULL, psize, PROT_READ|PROT_WRITE,
                  MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
      assert(page != MAP_FAILED);

      *page = 1;
      check_soft_dirty("Just faulted in page", 1);
      clear_refs_write();
      check_soft_dirty("Clear_refs written", 0);
      mprotect(page, psize, PROT_READ);
      check_soft_dirty("Marked RO", 0);
      mprotect(page, psize, PROT_READ|PROT_WRITE);
      check_soft_dirty("Marked RW", 0);
      *page = 2;
      check_soft_dirty("Wrote page again", 1);

      munmap(page, psize);
      close(fd);
      printf("Test passed.
");

      return 0;
  }
=======8&lt;======

Here we attach a Fixes to commit 64fe24a3e05e only for easy tracking, as
this patch won't apply to a tree before that point.  However the commit
wasn't the source of problem, but instead 64e455079e1b.  It's just that
after 64fe24a3e05e anonymous memory will also suffer from this problem
with mprotect().

Link: https://lkml.kernel.org/r/20220725142048.30450-1-peterx@redhat.com
Link: https://lkml.kernel.org/r/20220725142048.30450-2-peterx@redhat.com
Fixes: 64e455079e1b ("mm: softdirty: enable write notifications on VMAs after VM_SOFTDIRTY cleared")
Fixes: 64fe24a3e05e ("mm/mprotect: try avoiding write faults for exclusive anonymous pages when changing protection")
Signed-off-by: Peter Xu &lt;peterx@redhat.com&gt;
Reviewed-by: David Hildenbrand &lt;david@redhat.com&gt;
Cc: Nadav Amit &lt;nadav.amit@gmail.com&gt;
Cc: Andrea Arcangeli &lt;aarcange@redhat.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm/mprotect: remove the redundant initialization for error</title>
<updated>2022-07-18T00:14:47+00:00</updated>
<author>
<name>Xiu Jianfeng</name>
<email>xiujianfeng@huawei.com</email>
</author>
<published>2022-07-04T11:41:12+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=48725bbc0c3828bb9e36b632c6bf0326ed292ffb'/>
<id>urn:sha1:48725bbc0c3828bb9e36b632c6bf0326ed292ffb</id>
<content type='text'>
The variable error will be assigned correctly before it is used, the
initialization is redundant, so remove it.

Link: https://lkml.kernel.org/r/20220704114112.163112-1-xiujianfeng@huawei.com
Signed-off-by: Xiu Jianfeng &lt;xiujianfeng@huawei.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</content>
</entry>
</feed>
