<feed xmlns='http://www.w3.org/2005/Atom'>
<title>kernel/linux.git/mm/swapfile.c, branch v6.12.80</title>
<subtitle>Linux kernel stable tree (mirror)</subtitle>
<id>https://git.radix-linux.su/kernel/linux.git/atom?h=v6.12.80</id>
<link rel='self' href='https://git.radix-linux.su/kernel/linux.git/atom?h=v6.12.80'/>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/'/>
<updated>2025-10-06T09:17:52+00:00</updated>
<entry>
<title>mm: swap: check for stable address space before operating on the VMA</title>
<updated>2025-10-06T09:17:52+00:00</updated>
<author>
<name>Charan Teja Kalla</name>
<email>charan.kalla@oss.qualcomm.com</email>
</author>
<published>2025-09-24T18:11:38+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=4e5f060d7347466f77aaff1c0d5a6c4f1fb217ac'/>
<id>urn:sha1:4e5f060d7347466f77aaff1c0d5a6c4f1fb217ac</id>
<content type='text'>
commit 1367da7eb875d01102d2ed18654b24d261ff5393 upstream.

It is possible to hit a zero entry while traversing the vmas in unuse_mm()
called from swapoff path and accessing it causes the OOPS:

Unable to handle kernel NULL pointer dereference at virtual address
0000000000000446--&gt; Loading the memory from offset 0x40 on the
XA_ZERO_ENTRY as address.
Mem abort info:
  ESR = 0x0000000096000005
  EC = 0x25: DABT (current EL), IL = 32 bits
  SET = 0, FnV = 0
  EA = 0, S1PTW = 0
  FSC = 0x05: level 1 translation fault

The issue is manifested from the below race between the fork() on a
process and swapoff:
fork(dup_mmap())			swapoff(unuse_mm)
---------------                         -----------------
1) Identical mtree is built using
   __mt_dup().

2) copy_pte_range()--&gt;
	copy_nonpresent_pte():
       The dst mm is added into the
    mmlist to be visible to the
    swapoff operation.

3) Fatal signal is sent to the parent
process(which is the current during the
fork) thus skip the duplication of the
vmas and mark the vma range with
XA_ZERO_ENTRY as a marker for this process
that helps during exit_mmap().

				     4) swapoff is tried on the
					'mm' added to the 'mmlist' as
					part of the 2.

				     5) unuse_mm(), that iterates
					through the vma's of this 'mm'
					will hit the non-NULL zero entry
					and operating on this zero entry
					as a vma is resulting into the
					oops.

The proper fix would be around not exposing this partially-valid tree to
others when droping the mmap lock, which is being solved with [1].  A
simpler solution would be checking for MMF_UNSTABLE, as it is set if
mm_struct is not fully initialized in dup_mmap().

Thanks to Liam/Lorenzo/David for all the suggestions in fixing this
issue.

Link: https://lkml.kernel.org/r/20250924181138.1762750-1-charan.kalla@oss.qualcomm.com
Link: https://lore.kernel.org/all/20250815191031.3769540-1-Liam.Howlett@oracle.com/ [1]
Fixes: d24062914837 ("fork: use __mt_dup() to duplicate maple tree in dup_mmap()")
Signed-off-by: Charan Teja Kalla &lt;charan.kalla@oss.qualcomm.com&gt;
Suggested-by: David Hildenbrand &lt;david@redhat.com&gt;
Cc: Baoquan He &lt;bhe@redhat.com&gt;
Cc: Barry Song &lt;baohua@kernel.org&gt;
Cc: Chris Li &lt;chrisl@kernel.org&gt;
Cc: Kairui Song &lt;kasong@tencent.com&gt;
Cc: Kemeng Shi &lt;shikemeng@huaweicloud.com&gt;
Cc: Liam Howlett &lt;liam.howlett@oracle.com&gt;
Cc: Lorenzo Stoakes &lt;lorenzo.stoakes@oracle.com&gt;
Cc: Nhat Pham &lt;nphamcs@gmail.com&gt;
Cc: Peng Zhang &lt;zhangpeng.00@bytedance.com&gt;
Cc: &lt;stable@vger.kernel.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</content>
</entry>
<entry>
<title>mm: swap: fix potential buffer overflow in setup_clusters()</title>
<updated>2025-08-15T10:14:14+00:00</updated>
<author>
<name>Kemeng Shi</name>
<email>shikemeng@huaweicloud.com</email>
</author>
<published>2025-05-22T12:25:53+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=91b370800b3f2b3dda244c0ab06719c4971190a5'/>
<id>urn:sha1:91b370800b3f2b3dda244c0ab06719c4971190a5</id>
<content type='text'>
commit 152c1339dc13ad46f1b136e8693de15980750835 upstream.

In setup_swap_map(), we only ensure badpages are in range (0, last_page].
As maxpages might be &lt; last_page, setup_clusters() will encounter a buffer
overflow when a badpage is &gt;= maxpages.

Only call inc_cluster_info_page() for badpage which is &lt; maxpages to fix
the issue.

Link: https://lkml.kernel.org/r/20250522122554.12209-4-shikemeng@huaweicloud.com
Fixes: b843786b0bd0 ("mm: swapfile: fix SSD detection with swapfile on btrfs")
Signed-off-by: Kemeng Shi &lt;shikemeng@huaweicloud.com&gt;
Reviewed-by: Baoquan He &lt;bhe@redhat.com&gt;
Cc: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Cc: Kairui Song &lt;kasong@tencent.com&gt;
Cc: &lt;stable@vger.kernel.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</content>
</entry>
<entry>
<title>mm: swap: correctly use maxpages in swapon syscall to avoid potential deadloop</title>
<updated>2025-08-15T10:14:14+00:00</updated>
<author>
<name>Kemeng Shi</name>
<email>shikemeng@huaweicloud.com</email>
</author>
<published>2025-05-22T12:25:52+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=f7c75406b7e65f8cbf417f44c704393013abd018'/>
<id>urn:sha1:f7c75406b7e65f8cbf417f44c704393013abd018</id>
<content type='text'>
commit 255116c5b0fa2145ede28c2f7b248df5e73834d1 upstream.

We use maxpages from read_swap_header() to initialize swap_info_struct,
however the maxpages might be reduced in setup_swap_extents() and the
si-&gt;max is assigned with the reduced maxpages from the
setup_swap_extents().

Obviously, this could lead to memory waste as we allocated memory based on
larger maxpages, besides, this could lead to a potential deadloop as
following:

1) When calling setup_clusters() with larger maxpages, unavailable
   pages within range [si-&gt;max, larger maxpages) are not accounted with
   inc_cluster_info_page().  As a result, these pages are assumed
   available but can not be allocated.  The cluster contains these pages
   can be moved to frag_clusters list after it's all available pages were
   allocated.

2) When the cluster mentioned in 1) is the only cluster in
   frag_clusters list, cluster_alloc_swap_entry() assume order 0
   allocation will never failed and will enter a deadloop by keep trying
   to allocate page from the only cluster in frag_clusters which contains
   no actually available page.

Call setup_swap_extents() to get the final maxpages before
swap_info_struct initialization to fix the issue.

After this change, span will include badblocks and will become large
value which I think is correct value:
In summary, there are two kinds of swapfile_activate operations.

1. Filesystem style: Treat all blocks logical continuity and find
   usable physical extents in logical range.  In this way, si-&gt;pages will
   be actual usable physical blocks and span will be "1 + highest_block -
   lowest_block".

2. Block device style: Treat all blocks physically continue and only
   one single extent is added.  In this way, si-&gt;pages will be si-&gt;max and
   span will be "si-&gt;pages - 1".  Actually, si-&gt;pages and si-&gt;max is only
   used in block device style and span value is set with si-&gt;pages.  As a
   result, span value in block device style will become a larger value as
   you mentioned.

I think larger value is correct based on:

1. Span value in filesystem style is "1 + highest_block -
   lowest_block" which is the range cover all possible phisical blocks
   including the badblocks.

2. For block device style, si-&gt;pages is the actual usable block number
   and is already in pr_info.  The original span value before this patch
   is also refer to usable block number which is redundant in pr_info.

[shikemeng@huaweicloud.com: ensure si-&gt;pages == si-&gt;max - 1 after setup_swap_extents()]
  Link: https://lkml.kernel.org/r/20250522122554.12209-3-shikemeng@huaweicloud.com
  Link: https://lkml.kernel.org/r/20250718065139.61989-1-shikemeng@huaweicloud.com
Link: https://lkml.kernel.org/r/20250522122554.12209-3-shikemeng@huaweicloud.com
Fixes: 661383c6111a ("mm: swap: relaim the cached parts that got scanned")
Signed-off-by: Kemeng Shi &lt;shikemeng@huaweicloud.com&gt;
Reviewed-by: Baoquan He &lt;bhe@redhat.com&gt;
Cc: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Cc: Kairui Song &lt;kasong@tencent.com&gt;
Cc: &lt;stable@vger.kernel.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</content>
</entry>
<entry>
<title>mm, swap: fix allocation and scanning race with swapoff</title>
<updated>2024-11-14T23:25:07+00:00</updated>
<author>
<name>Kairui Song</name>
<email>kasong@tencent.com</email>
</author>
<published>2024-11-12T08:34:14+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=0ec8bc9e880eb576dc4492e8e0c7153ed0a71031'/>
<id>urn:sha1:0ec8bc9e880eb576dc4492e8e0c7153ed0a71031</id>
<content type='text'>
There are two flags used to synchronize allocation and scanning with
swapoff: SWP_WRITEOK and SWP_SCANNING.

SWP_WRITEOK: Swapoff will first unset this flag, at this point any further
swap allocation or scanning on this device should just abort so no more
new entries will be referencing this device.  Swapoff will then unuse all
existing swap entries.

SWP_SCANNING: This flag is set when device is being scanned.  Swapoff will
wait for all scanner to stop before the final release of the swap device
structures to avoid UAF.  Note this flag is the highest used bit of
si-&gt;flags so it could be added up arithmetically, if there are multiple
scanner.

commit 5f843a9a3a1e ("mm: swap: separate SSD allocation from
scan_swap_map_slots()") ignored SWP_SCANNING and SWP_WRITEOK flags while
separating cluster allocation path from the old allocation path.  Add the
flags back to fix swapoff race.  The race is hard to trigger as si-&gt;lock
prevents most parallel operations, but si-&gt;lock could be dropped for
reclaim or discard.  This issue is found during code review.

This commit fixes this problem.  For SWP_SCANNING, Just like before, set
the flag before scan and remove it afterwards.

For SWP_WRITEOK, there are several places where si-&gt;lock could be dropped,
it will be error-prone and make the code hard to follow if we try to cover
these places one by one.  So just do one check before the real allocation,
which is also very similar like before.  With new cluster allocator it may
waste a bit of time iterating the clusters but won't take long, and
swapoff is not performance sensitive.

Link: https://lkml.kernel.org/r/20241112083414.78174-1-ryncsn@gmail.com
Fixes: 5f843a9a3a1e ("mm: swap: separate SSD allocation from scan_swap_map_slots()")
Reported-by: "Huang, Ying" &lt;ying.huang@intel.com&gt;
Closes: https://lore.kernel.org/linux-mm/87a5es3f1f.fsf@yhuang6-desk2.ccr.corp.intel.com/
Signed-off-by: Kairui Song &lt;kasong@tencent.com&gt;
Cc: Barry Song &lt;v-songbaohua@oppo.com&gt;
Cc: Chris Li &lt;chrisl@kernel.org&gt;
Cc: Hugh Dickins &lt;hughd@google.com&gt;
Cc: Kalesh Singh &lt;kaleshsingh@google.com&gt;
Cc: Ryan Roberts &lt;ryan.roberts@arm.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm: swapfile: fix cluster reclaim work crash on rotational devices</title>
<updated>2024-11-13T00:01:36+00:00</updated>
<author>
<name>Johannes Weiner</name>
<email>hannes@cmpxchg.org</email>
</author>
<published>2024-11-07T14:08:36+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=dcf32ea7ecede94796fb30231b3969d7c838374c'/>
<id>urn:sha1:dcf32ea7ecede94796fb30231b3969d7c838374c</id>
<content type='text'>
syzbot and Daan report a NULL pointer crash in the new full swap cluster
reclaim work:

&gt; Oops: general protection fault, probably for non-canonical address 0xdffffc0000000001: 0000 [#1] PREEMPT SMP KASAN PTI
&gt; KASAN: null-ptr-deref in range [0x0000000000000008-0x000000000000000f]
&gt; CPU: 1 UID: 0 PID: 51 Comm: kworker/1:1 Not tainted 6.12.0-rc6-syzkaller #0
&gt; Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 09/13/2024
&gt; Workqueue: events swap_reclaim_work
&gt; RIP: 0010:__list_del_entry_valid_or_report+0x20/0x1c0 lib/list_debug.c:49
&gt; Code: 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa 48 89 fe 48 83 c7 08 48 83 ec 18 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 &lt;80&gt; 3c 02 00 0f 85 19 01 00 00 48 89 f2 48 8b 4e 08 48 b8 00 00 00
&gt; RSP: 0018:ffffc90000bb7c30 EFLAGS: 00010202
&gt; RAX: dffffc0000000000 RBX: 0000000000000000 RCX: ffff88807b9ae078
&gt; RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000000000000008
&gt; RBP: 0000000000000001 R08: 0000000000000001 R09: 0000000000000000
&gt; R10: 0000000000000001 R11: 000000000000004f R12: dffffc0000000000
&gt; R13: ffffffffffffffb8 R14: ffff88807b9ae000 R15: ffffc90003af1000
&gt; FS:  0000000000000000(0000) GS:ffff8880b8700000(0000) knlGS:0000000000000000
&gt; CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
&gt; CR2: 00007fffaca68fb8 CR3: 00000000791c8000 CR4: 00000000003526f0
&gt; DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
&gt; DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
&gt; Call Trace:
&gt;  &lt;TASK&gt;
&gt;  __list_del_entry_valid include/linux/list.h:124 [inline]
&gt;  __list_del_entry include/linux/list.h:215 [inline]
&gt;  list_move_tail include/linux/list.h:310 [inline]
&gt;  swap_reclaim_full_clusters+0x109/0x460 mm/swapfile.c:748
&gt;  swap_reclaim_work+0x2e/0x40 mm/swapfile.c:779

The syzbot console output indicates a virtual environment where swapfile
is on a rotational device.  In this case, clusters aren't actually used,
and si-&gt;full_clusters is not initialized.  Daan's report is from qemu, so
likely rotational too.

Make sure to only schedule the cluster reclaim work when clusters are
actually in use.

Link: https://lkml.kernel.org/r/20241107142335.GB1172372@cmpxchg.org
Link: https://lore.kernel.org/lkml/672ac50b.050a0220.2edce.1517.GAE@google.com/
Link: https://github.com/systemd/systemd/issues/35044
Fixes: 5168a68eb78f ("mm, swap: avoid over reclaim of full clusters")
Reported-by: syzbot+078be8bfa863cb9e0c6b@syzkaller.appspotmail.com
Signed-off-by: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Reported-by: Daan De Meyer &lt;daan.j.demeyer@gmail.com&gt;
Cc: Kairui Song &lt;ryncsn@gmail.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm, swap: avoid over reclaim of full clusters</title>
<updated>2024-10-31T03:14:11+00:00</updated>
<author>
<name>Kairui Song</name>
<email>kasong@tencent.com</email>
</author>
<published>2024-10-22T17:55:12+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=5168a68eb78fa1c67a8b2d31d0642c7fd866cc12'/>
<id>urn:sha1:5168a68eb78fa1c67a8b2d31d0642c7fd866cc12</id>
<content type='text'>
When running low on usable slots, cluster allocator will try to reclaim
the full clusters aggressively to reclaim HAS_CACHE slots.  This
guarantees that as long as there are any usable slots, HAS_CACHE or not,
the swap device will be usable and workload won't go OOM early.

Before the cluster allocator, swap allocator fails easily if device is
filled up with reclaimable HAS_CACHE slots.  Which can be easily
reproduced with following simple program:

    #include &lt;stdio.h&gt;
    #include &lt;string.h&gt;
    #include &lt;linux/mman.h&gt;
    #include &lt;sys/mman.h&gt;
    #define SIZE 8192UL * 1024UL * 1024UL
    int main(int argc, char **argv) {
        long tmp;
        char *p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        memset(p, 0, SIZE);
        madvise(p, SIZE, MADV_PAGEOUT);
        for (unsigned long i = 0; i &lt; SIZE; ++i)
            tmp += p[i];
        getchar(); /* Pause */
        return 0;
    }

Setup an 8G non ramdisk swap, the first run of the program will swapout 8G
ram successfully.  But run same program again after the first run paused,
the second run can't swapout all 8G memory as now half of the swap device
is pinned by HAS_CACHE.  There was a random scan in the old allocator that
may reclaim part of the HAS_CACHE by luck, but it's unreliable.

The new allocator's added reclaim of full clusters when device is low on
usable slots.  But when multiple CPUs are seeing the device is low on
usable slots at the same time, they ran into a thundering herd problem.

This is an observable problem on large machine with mass parallel
workload, as full cluster reclaim is slower on large swap device and
higher number of CPUs will also make things worse.

Testing using a 128G ZRAM on a 48c96t system.  When the swap device is
very close to full (eg.  124G / 128G), running build linux kernel with
make -j96 in a 1G memory cgroup will hung (not a softlockup though)
spinning in full cluster reclaim for about ~5min before go OOM.

To solve this, split the full reclaim into two parts:

- Instead of do a synchronous aggressively reclaim when device is low,
  do only one aggressively reclaim when device is strictly full with a
  kworker. This still ensures in worst case the device won't be unusable
  because of HAS_CACHE slots.

- To avoid allocation (especially higher order) suffer from HAS_CACHE
  filling up clusters and kworker not responsive enough, do one synchronous
  scan every time the free list is drained, and only scan one cluster. This
  is kind of similar to the random reclaim before, keeps the full clusters
  rotated and has a minimal latency. This should provide a fair reclaim
  strategy suitable for most workloads.

Link: https://lkml.kernel.org/r/20241022175512.10398-1-ryncsn@gmail.com
Fixes: 2cacbdfdee65 ("mm: swap: add a adaptive full cluster cache reclaim")
Signed-off-by: Kairui Song &lt;kasong@tencent.com&gt;
Cc: Barry Song &lt;v-songbaohua@oppo.com&gt;
Cc: Chris Li &lt;chrisl@kernel.org&gt;
Cc: "Huang, Ying" &lt;ying.huang@intel.com&gt;
Cc: Hugh Dickins &lt;hughd@google.com&gt;
Cc: Kalesh Singh &lt;kaleshsingh@google.com&gt;
Cc: Ryan Roberts &lt;ryan.roberts@arm.com&gt;
Cc: Yosry Ahmed &lt;yosryahmed@google.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm/swapfile: skip HugeTLB pages for unuse_vma</title>
<updated>2024-10-17T07:28:11+00:00</updated>
<author>
<name>Liu Shixin</name>
<email>liushixin2@huawei.com</email>
</author>
<published>2024-10-15T01:45:21+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=7528c4fb1237512ee18049f852f014eba80bbe8d'/>
<id>urn:sha1:7528c4fb1237512ee18049f852f014eba80bbe8d</id>
<content type='text'>
I got a bad pud error and lost a 1GB HugeTLB when calling swapoff.  The
problem can be reproduced by the following steps:

 1. Allocate an anonymous 1GB HugeTLB and some other anonymous memory.
 2. Swapout the above anonymous memory.
 3. run swapoff and we will get a bad pud error in kernel message:

  mm/pgtable-generic.c:42: bad pud 00000000743d215d(84000001400000e7)

We can tell that pud_clear_bad is called by pud_none_or_clear_bad in
unuse_pud_range() by ftrace.  And therefore the HugeTLB pages will never
be freed because we lost it from page table.  We can skip HugeTLB pages
for unuse_vma to fix it.

Link: https://lkml.kernel.org/r/20241015014521.570237-1-liushixin2@huawei.com
Fixes: 0fe6e20b9c4c ("hugetlb, rmap: add reverse mapping for hugepage")
Signed-off-by: Liu Shixin &lt;liushixin2@huawei.com&gt;
Acked-by: Muchun Song &lt;muchun.song@linux.dev&gt;
Cc: Naoya Horiguchi &lt;nao.horiguchi@gmail.com&gt;
Cc: &lt;stable@vger.kernel.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm: swap: prevent possible data-race in __try_to_reclaim_swap</title>
<updated>2024-10-17T07:28:11+00:00</updated>
<author>
<name>Jeongjun Park</name>
<email>aha310510@gmail.com</email>
</author>
<published>2024-10-07T07:06:23+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=818f916e3a07bf0c64bbf5e250ad209eebe21c85'/>
<id>urn:sha1:818f916e3a07bf0c64bbf5e250ad209eebe21c85</id>
<content type='text'>
A report [1] was uploaded from syzbot.

In the previous commit 862590ac3708 ("mm: swap: allow cache reclaim to
skip slot cache"), the __try_to_reclaim_swap() function reads offset and
folio-&gt;entry from folio without folio_lock protection.

In the currently reported KCSAN log, it is assumed that the actual
data-race will not occur because the calltrace that does WRITE already
obtains the folio_lock and then writes.

However, the existing __try_to_reclaim_swap() function was already
implemented to perform reads under folio_lock protection [1], and there is
a risk of a data-race occurring through a function other than the one
shown in the KCSAN log.

Therefore, I think it is appropriate to change
read operations for folio to be performed under folio_lock.

[1]

==================================================================
BUG: KCSAN: data-race in __delete_from_swap_cache / __try_to_reclaim_swap

write to 0xffffea0004c90328 of 8 bytes by task 5186 on cpu 0:
 __delete_from_swap_cache+0x1f0/0x290 mm/swap_state.c:163
 delete_from_swap_cache+0x72/0xe0 mm/swap_state.c:243
 folio_free_swap+0x1d8/0x1f0 mm/swapfile.c:1850
 free_swap_cache mm/swap_state.c:293 [inline]
 free_pages_and_swap_cache+0x1fc/0x410 mm/swap_state.c:325
 __tlb_batch_free_encoded_pages mm/mmu_gather.c:136 [inline]
 tlb_batch_pages_flush mm/mmu_gather.c:149 [inline]
 tlb_flush_mmu_free mm/mmu_gather.c:366 [inline]
 tlb_flush_mmu+0x2cf/0x440 mm/mmu_gather.c:373
 zap_pte_range mm/memory.c:1700 [inline]
 zap_pmd_range mm/memory.c:1739 [inline]
 zap_pud_range mm/memory.c:1768 [inline]
 zap_p4d_range mm/memory.c:1789 [inline]
 unmap_page_range+0x1f3c/0x22d0 mm/memory.c:1810
 unmap_single_vma+0x142/0x1d0 mm/memory.c:1856
 unmap_vmas+0x18d/0x2b0 mm/memory.c:1900
 exit_mmap+0x18a/0x690 mm/mmap.c:1864
 __mmput+0x28/0x1b0 kernel/fork.c:1347
 mmput+0x4c/0x60 kernel/fork.c:1369
 exit_mm+0xe4/0x190 kernel/exit.c:571
 do_exit+0x55e/0x17f0 kernel/exit.c:926
 do_group_exit+0x102/0x150 kernel/exit.c:1088
 get_signal+0xf2a/0x1070 kernel/signal.c:2917
 arch_do_signal_or_restart+0x95/0x4b0 arch/x86/kernel/signal.c:337
 exit_to_user_mode_loop kernel/entry/common.c:111 [inline]
 exit_to_user_mode_prepare include/linux/entry-common.h:328 [inline]
 __syscall_exit_to_user_mode_work kernel/entry/common.c:207 [inline]
 syscall_exit_to_user_mode+0x59/0x130 kernel/entry/common.c:218
 do_syscall_64+0xd6/0x1c0 arch/x86/entry/common.c:89
 entry_SYSCALL_64_after_hwframe+0x77/0x7f

read to 0xffffea0004c90328 of 8 bytes by task 5189 on cpu 1:
 __try_to_reclaim_swap+0x9d/0x510 mm/swapfile.c:198
 free_swap_and_cache_nr+0x45d/0x8a0 mm/swapfile.c:1915
 zap_pte_range mm/memory.c:1656 [inline]
 zap_pmd_range mm/memory.c:1739 [inline]
 zap_pud_range mm/memory.c:1768 [inline]
 zap_p4d_range mm/memory.c:1789 [inline]
 unmap_page_range+0xcf8/0x22d0 mm/memory.c:1810
 unmap_single_vma+0x142/0x1d0 mm/memory.c:1856
 unmap_vmas+0x18d/0x2b0 mm/memory.c:1900
 exit_mmap+0x18a/0x690 mm/mmap.c:1864
 __mmput+0x28/0x1b0 kernel/fork.c:1347
 mmput+0x4c/0x60 kernel/fork.c:1369
 exit_mm+0xe4/0x190 kernel/exit.c:571
 do_exit+0x55e/0x17f0 kernel/exit.c:926
 __do_sys_exit kernel/exit.c:1055 [inline]
 __se_sys_exit kernel/exit.c:1053 [inline]
 __x64_sys_exit+0x1f/0x20 kernel/exit.c:1053
 x64_sys_call+0x2d46/0x2d60 arch/x86/include/generated/asm/syscalls_64.h:61
 do_syscall_x64 arch/x86/entry/common.c:52 [inline]
 do_syscall_64+0xc9/0x1c0 arch/x86/entry/common.c:83
 entry_SYSCALL_64_after_hwframe+0x77/0x7f

value changed: 0x0000000000000242 -&gt; 0x0000000000000000

Link: https://lkml.kernel.org/r/20241007070623.23340-1-aha310510@gmail.com
Reported-by: syzbot+fa43f1b63e3aa6f66329@syzkaller.appspotmail.com
Fixes: 862590ac3708 ("mm: swap: allow cache reclaim to skip slot cache")
Signed-off-by: Jeongjun Park &lt;aha310510@gmail.com&gt;
Acked-by: Chris Li &lt;chrisl@kernel.org&gt;
Reviewed-by: Kairui Song &lt;kasong@tencent.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>swap: convert swapon() to use a folio</title>
<updated>2024-09-09T23:38:58+00:00</updated>
<author>
<name>Matthew Wilcox (Oracle)</name>
<email>willy@infradead.org</email>
</author>
<published>2024-08-26T20:21:35+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=97b76796ccd07727b58ed739fcd60f2174d5ac18'/>
<id>urn:sha1:97b76796ccd07727b58ed739fcd60f2174d5ac18</id>
<content type='text'>
Retrieve a folio from the page cache rather than a page.  Saves a couple
of conversions between page &amp; folio.

Link: https://lkml.kernel.org/r/20240826202138.3804238-1-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) &lt;willy@infradead.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm: store zero pages to be swapped out in a bitmap</title>
<updated>2024-09-04T04:15:47+00:00</updated>
<author>
<name>Usama Arif</name>
<email>usamaarif642@gmail.com</email>
</author>
<published>2024-08-23T19:04:39+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=0ca0c24e3211586d44a31b3c4767fe3f43a008a7'/>
<id>urn:sha1:0ca0c24e3211586d44a31b3c4767fe3f43a008a7</id>
<content type='text'>
Patch series "mm: store zero pages to be swapped out in a bitmap", v8.

As shown in the patch series that introduced the zswap same-filled
optimization [1], 10-20% of the pages stored in zswap are same-filled. 
This is also observed across Meta's server fleet.  By using VM counters in
swap_writepage (not included in this patchseries) it was found that less
than 1% of the same-filled pages to be swapped out are non-zero pages.

For conventional swap setup (without zswap), rather than reading/writing
these pages to flash resulting in increased I/O and flash wear, a bitmap
can be used to mark these pages as zero at write time, and the pages can
be filled at read time if the bit corresponding to the page is set.

When using zswap with swap, this also means that a zswap_entry does not
need to be allocated for zero filled pages resulting in memory savings
which would offset the memory used for the bitmap.

A similar attempt was made earlier in [2] where zswap would only track
zero-filled pages instead of same-filled.  This patchseries adds
zero-filled pages optimization to swap (hence it can be used even if zswap
is disabled) and removes the same-filled code from zswap (as only 1% of
the same-filled pages are non-zero), simplifying code.

[1] https://lore.kernel.org/all/20171018104832epcms5p1b2232e2236258de3d03d1344dde9fce0@epcms5p1/
[2] https://lore.kernel.org/lkml/20240325235018.2028408-1-yosryahmed@google.com/


This patch (of 2):

Approximately 10-20% of pages to be swapped out are zero pages [1].
Rather than reading/writing these pages to flash resulting
in increased I/O and flash wear, a bitmap can be used to mark these
pages as zero at write time, and the pages can be filled at
read time if the bit corresponding to the page is set.
With this patch, NVMe writes in Meta server fleet decreased
by almost 10% with conventional swap setup (zswap disabled).

[1] https://lore.kernel.org/all/20171018104832epcms5p1b2232e2236258de3d03d1344dde9fce0@epcms5p1/

Link: https://lkml.kernel.org/r/20240823190545.979059-1-usamaarif642@gmail.com
Link: https://lkml.kernel.org/r/20240823190545.979059-2-usamaarif642@gmail.com
Signed-off-by: Usama Arif &lt;usamaarif642@gmail.com&gt;
Reviewed-by: Chengming Zhou &lt;chengming.zhou@linux.dev&gt;
Reviewed-by: Yosry Ahmed &lt;yosryahmed@google.com&gt;
Reviewed-by: Nhat Pham &lt;nphamcs@gmail.com&gt;
Acked-by: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Cc: David Hildenbrand &lt;david@redhat.com&gt;
Cc: Hugh Dickins &lt;hughd@google.com&gt;
Cc: Matthew Wilcox (Oracle) &lt;willy@infradead.org&gt;
Cc: Shakeel Butt &lt;shakeel.butt@linux.dev&gt;
Cc: Usama Arif &lt;usamaarif642@gmail.com&gt;
Cc: Andi Kleen &lt;ak@linux.intel.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</content>
</entry>
</feed>
