2026-02-07  mm: convert __HAVE_ARCH_TLB_REMOVE_TABLE to CONFIG_HAVE_ARCH_TLB_REMOVE_TABLE config  (Qi Zheng, 6 files, -3/+6)
For architectures that define __HAVE_ARCH_TLB_REMOVE_TABLE, the page tables at the pmd/pud level are generally not of struct ptdesc type, and do not have a pt_rcu_head member, so these architectures cannot support PT_RECLAIM. In preparation for enabling PT_RECLAIM on more architectures, convert __HAVE_ARCH_TLB_REMOVE_TABLE to a CONFIG_HAVE_ARCH_TLB_REMOVE_TABLE config option, so that it can be tested in Kconfig conditionals. Link: https://lkml.kernel.org/r/5ebfa3d4b56e63c6906bda5eccaa9f7194d3a86b.1769515122.git.zhengqi.arch@bytedance.com Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Tested-by: Andreas Larsson <andreas@gaisler.com> [sparc, UP&SMP] Acked-by: Andreas Larsson <andreas@gaisler.com> [sparc] Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org> Cc: Anton Ivanov <anton.ivanov@cambridgegreys.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Helge Deller <deller@gmx.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com> Cc: Johannes Berg <johannes@sipsolutions.net> Cc: Lance Yang <ioworker0@gmail.com> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Magnus Lindholm <linmag7@gmail.com> Cc: Matt Turner <mattst88@gmail.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Richard Henderson <richard.henderson@linaro.org> Cc: Richard Weinberger <richard@nod.at> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Thomas Gleixner <tglx@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: WANG Xuerui <kernel@xen0n.name> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
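A rough illustration of the conversion (schematic, not the actual diff): the per-architecture preprocessor marker becomes a Kconfig symbol that the architecture selects, so both C code and Kconfig can test it.

    /* Before: each architecture's asm/tlb.h defined the marker itself. */
    #define __HAVE_ARCH_TLB_REMOVE_TABLE

    /*
     * After: the architecture does "select HAVE_ARCH_TLB_REMOVE_TABLE"
     * in its Kconfig, and C code tests the generated config symbol:
     */
    #ifdef CONFIG_HAVE_ARCH_TLB_REMOVE_TABLE
    void tlb_remove_table(struct mmu_gather *tlb, void *table);
    #endif

Unlike a bare preprocessor define, the config symbol lets Kconfig express constraints such as "depends on !HAVE_ARCH_TLB_REMOVE_TABLE" for PT_RECLAIM.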
2026-02-07  um: mm: enable MMU_GATHER_RCU_TABLE_FREE  (Qi Zheng, 1 file, -0/+1)
On a 64-bit system, madvise(MADV_DONTNEED) may leave behind a large number of empty PTE page table pages (such as 100GB+). To resolve this problem, first enable MMU_GATHER_RCU_TABLE_FREE in preparation for enabling the PT_RECLAIM feature. Link: https://lkml.kernel.org/r/e2217546504668b8a87a39eb0e378839339a1bb4.1769515122.git.zhengqi.arch@bytedance.com Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Richard Weinberger <richard@nod.at> Cc: Anton Ivanov <anton.ivanov@cambridgegreys.com> Cc: Johannes Berg <johannes@sipsolutions.net> Cc: Andreas Larsson <andreas@gaisler.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand (Red Hat) <david@kernel.org> Cc: Dev Jain <dev.jain@arm.com> Cc: Helge Deller <deller@gmx.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com> Cc: Lance Yang <ioworker0@gmail.com> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Magnus Lindholm <linmag7@gmail.com> Cc: Matt Turner <mattst88@gmail.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Richard Henderson <richard.henderson@linaro.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Thomas Gleixner <tglx@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: WANG Xuerui <kernel@xen0n.name> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-02-07  parisc: mm: enable MMU_GATHER_RCU_TABLE_FREE  (Qi Zheng, 2 files, -2/+3)
On a 64-bit system, madvise(MADV_DONTNEED) may leave behind a large number of empty PTE page table pages (such as 100GB+). To resolve this problem, first enable MMU_GATHER_RCU_TABLE_FREE in preparation for enabling the PT_RECLAIM feature. Link: https://lkml.kernel.org/r/b827939046dbc94bc7c585cdbed8522baab75b15.1769515122.git.zhengqi.arch@bytedance.com Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com> Cc: Helge Deller <deller@gmx.de> Cc: Andreas Larsson <andreas@gaisler.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org> Cc: Anton Ivanov <anton.ivanov@cambridgegreys.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand (Red Hat) <david@kernel.org> Cc: Dev Jain <dev.jain@arm.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Johannes Berg <johannes@sipsolutions.net> Cc: Lance Yang <ioworker0@gmail.com> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Magnus Lindholm <linmag7@gmail.com> Cc: Matt Turner <mattst88@gmail.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Richard Henderson <richard.henderson@linaro.org> Cc: Richard Weinberger <richard@nod.at> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Thomas Gleixner <tglx@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: WANG Xuerui <kernel@xen0n.name> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-02-07  mips: mm: enable MMU_GATHER_RCU_TABLE_FREE  (Qi Zheng, 2 files, -4/+4)
On a 64-bit system, madvise(MADV_DONTNEED) may leave behind a large number of empty PTE page table pages (such as 100GB+). To resolve this problem, first enable MMU_GATHER_RCU_TABLE_FREE in preparation for enabling the PT_RECLAIM feature. Link: https://lkml.kernel.org/r/0d17f00a724f77aaca2da7c847acd490c3a47571.1769515122.git.zhengqi.arch@bytedance.com Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Andreas Larsson <andreas@gaisler.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org> Cc: Anton Ivanov <anton.ivanov@cambridgegreys.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand (Red Hat) <david@kernel.org> Cc: Dev Jain <dev.jain@arm.com> Cc: Helge Deller <deller@gmx.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com> Cc: Johannes Berg <johannes@sipsolutions.net> Cc: Lance Yang <ioworker0@gmail.com> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Magnus Lindholm <linmag7@gmail.com> Cc: Matt Turner <mattst88@gmail.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Richard Henderson <richard.henderson@linaro.org> Cc: Richard Weinberger <richard@nod.at> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Thomas Gleixner <tglx@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: WANG Xuerui <kernel@xen0n.name> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-02-07  LoongArch: mm: enable MMU_GATHER_RCU_TABLE_FREE  (Qi Zheng, 2 files, -4/+4)
On a 64-bit system, madvise(MADV_DONTNEED) may leave behind a large number of empty PTE page table pages (such as 100GB+). To resolve this problem, first enable MMU_GATHER_RCU_TABLE_FREE in preparation for enabling the PT_RECLAIM feature. Link: https://lkml.kernel.org/r/bd1b11bc1a13686aeba81a40194f87b369d62661.1769515122.git.zhengqi.arch@bytedance.com Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: WANG Xuerui <kernel@xen0n.name> Cc: Andreas Larsson <andreas@gaisler.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org> Cc: Anton Ivanov <anton.ivanov@cambridgegreys.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand (Red Hat) <david@kernel.org> Cc: Dev Jain <dev.jain@arm.com> Cc: Helge Deller <deller@gmx.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com> Cc: Johannes Berg <johannes@sipsolutions.net> Cc: Lance Yang <ioworker0@gmail.com> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Magnus Lindholm <linmag7@gmail.com> Cc: Matt Turner <mattst88@gmail.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Richard Henderson <richard.henderson@linaro.org> Cc: Richard Weinberger <richard@nod.at> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Thomas Gleixner <tglx@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-02-07  alpha: mm: enable MMU_GATHER_RCU_TABLE_FREE  (Qi Zheng, 2 files, -3/+4)
On a 64-bit system, madvise(MADV_DONTNEED) may leave behind a large number of empty PTE page table pages (such as 100GB+). To resolve this problem, first enable MMU_GATHER_RCU_TABLE_FREE in preparation for enabling the PT_RECLAIM feature. Link: https://lkml.kernel.org/r/3380f40a89b73c488202c85f9a8abf99fb08543b.1769515122.git.zhengqi.arch@bytedance.com Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Acked-by: Magnus Lindholm <linmag7@gmail.com> [alpha] Cc: Richard Henderson <richard.henderson@linaro.org> Cc: Matt Turner <mattst88@gmail.com> Cc: Andreas Larsson <andreas@gaisler.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org> Cc: Anton Ivanov <anton.ivanov@cambridgegreys.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand (Red Hat) <david@kernel.org> Cc: Dev Jain <dev.jain@arm.com> Cc: Helge Deller <deller@gmx.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com> Cc: Johannes Berg <johannes@sipsolutions.net> Cc: Lance Yang <ioworker0@gmail.com> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Richard Weinberger <richard@nod.at> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Thomas Gleixner <tglx@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: WANG Xuerui <kernel@xen0n.name> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-02-07  mm: change mm/pt_reclaim.c to use asm/tlb.h instead of asm-generic/tlb.h  (Qi Zheng, 1 file, -1/+1)
Patch series "enable PT_RECLAIM on more 64-bit architectures", v4. This series aims to enable PT_RECLAIM on more 64-bit architectures. On a 64-bit system, madvise(MADV_DONTNEED) may cause a large number of empty PTE page table pages (such as 100GB+). To resolve this problem, we need to enable PT_RECLAIM, which depends on MMU_GATHER_RCU_TABLE_FREE. For these architectures that define its own __tlb_remove_table(), since their page tables are not of type struct ptdesc, they cannot be supported PT_RECLAIM. Therefore, this series first enables MMU_GATHER_RCU_TABLE_FREE on all 64-bit architectures, then converts __HAVE_ARCH_TLB_REMOVE_TABLE to CONFIG_HAVE_ARCH_TLB_REMOVE_TABLE config, and finally makes PT_RECLAIM depend on MMU_GATHER_RCU_TABLE_FREE && !HAVE_ARCH_TLB_REMOVE_TABLE. This way, PT_RECLAIM can be enabled by default on most 64-bit architectures. Of course, this will also be enabled on some 32-bit architectures that already support MMU_GATHER_RCU_TABLE_FREE. That's fine, PT_RECLAIM works well on all 32-bit architectures as well. Although the benefit isn't significant, there's still memory that can be reclaimed. Perhaps PT_RECLAIM can be enabled on all 32-bit architectures in the future. This patch (of 8): Generally, the asm/tlb.h will include asm-generic/tlb.h, so change mm/pt_reclaim.c to use asm/tlb.h instead of asm-generic/tlb.h. This is a preparation for enabling CONFIG_PT_RECLAIM on other architectures, such as alpha. Link: https://lkml.kernel.org/r/cover.1769515122.git.zhengqi.arch@bytedance.com Link: https://lkml.kernel.org/r/befca537d10c6bf8d531b1ee0a8af1e3b31352b0.1769515122.git.zhengqi.arch@bytedance.com Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Acked-by: David Hildenbrand (Red Hat) <david@kernel.org> Cc: Andreas Larsson <andreas@gaisler.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Dev Jain <dev.jain@arm.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Lance Yang <ioworker0@gmail.com> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Magnus Lindholm <linmag7@gmail.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Thomas Gleixner <tglx@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Will Deacon <will@kernel.org> Cc: Anton Ivanov <anton.ivanov@cambridgegreys.com> Cc: Helge Deller <deller@gmx.de> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com> Cc: Johannes Berg <johannes@sipsolutions.net> Cc: Matt Turner <mattst88@gmail.com> Cc: Richard Henderson <richard.henderson@linaro.org> Cc: Richard Weinberger <richard@nod.at> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: WANG Xuerui <kernel@xen0n.name> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-02-07  mm/damon/stat: remove __read_mostly from memory_idle_ms_percentiles  (Li RongQing, 1 file, -1/+1)
The 'memory_idle_ms_percentiles' array in DAMON_STAT is updated frequently by the kernel to reflect the latest idle time statistics. Marking it as '__read_mostly' is inappropriate for data that is regularly written to, as it can lead to cache pollution in the read-mostly section. Remove the '__read_mostly' annotation to accurately reflect the variable's usage pattern. Link: https://lkml.kernel.org/r/20260130085603.1814-1-lirongqing@baidu.com Signed-off-by: Li RongQing <lirongqing@baidu.com> Reviewed-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
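The fix itself is a one-line annotation removal; sketched below (the array's element type and length are assumptions for illustration):

    /* before: marked read-mostly despite frequent kernel-side updates */
    static unsigned long memory_idle_ms_percentiles[101] __read_mostly;

    /* after: a plain static, so the frequent writes no longer share
     * cache lines with genuinely read-mostly data */
    static unsigned long memory_idle_ms_percentiles[101];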
2026-02-07  zsmalloc: make common caches global  (Sergey Senozhatsky, 1 file, -56/+50)
Currently, zsmalloc creates kmem caches for handles and zspages for each pool, which may be suboptimal from the memory usage point of view (extra internal fragmentation per pool). Systems that create multiple zsmalloc pools may benefit from shared common zsmalloc caches. Make the handle and zspage kmem caches global. The memory savings depend on the particular setup and data patterns and can be found via slabinfo. Link: https://lkml.kernel.org/r/20260117025406.799428-1-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Reviewed-by: Nhat Pham <nphamcs@gmail.com> Reviewed-by: Yosry Ahmed <yosry.ahmed@linux.dev> Cc: Brian Geffon <bgeffon@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
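A minimal sketch of the idea, with an assumed init hook and the patch's details elided: the handle/zspage caches are created once and shared, instead of per pool in zs_create_pool(). ZS_HANDLE_SIZE and struct zspage are zsmalloc internals; the cache variable and function names here are hypothetical.

    static struct kmem_cache *zs_handle_cache;  /* hypothetical name */
    static struct kmem_cache *zspage_cache;     /* hypothetical name */

    static int __init zs_create_global_caches(void)  /* assumed hook */
    {
            zs_handle_cache = kmem_cache_create("zs_handle",
                            ZS_HANDLE_SIZE, 0, 0, NULL);
            zspage_cache = kmem_cache_create("zspage",
                            sizeof(struct zspage), 0, 0, NULL);
            if (!zs_handle_cache || !zspage_cache)
                    return -ENOMEM;
            return 0;
    }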
2026-02-07  mm: add SPDX id lines to some mm source files  (Tim Bird, 8 files, -13/+8)
Some of the memory management source files are missing SPDX-License-Identifier lines. Add appropriate IDs to these files (mostly GPL-2.0, but one LGPL-2.1). Link: https://lkml.kernel.org/r/20260204213101.1754183-1-tim.bird@sony.com Signed-off-by: Tim Bird <tim.bird@sony.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-02-07  mm/zswap: use %pe to print error pointers  (Sahil Chandna, 1 file, -2/+2)
Use the %pe printk format specifier to report error pointers directly instead of printing PTR_ERR() as a long value. This improves clarity and produces more readable error messages. This instance was flagged by the Coccinelle script (misc/ptr_err_to_pe.cocci) as an opportunity to adopt %pe. Found by: make coccicheck MODE=report M=mm/ No functional change intended. Link: https://lkml.kernel.org/r/581a26f22fb4c6ce04aeb7ee0d703fe64454ac7f.1770230135.git.chandna.sahil@gmail.com Signed-off-by: Sahil Chandna <chandna.sahil@gmail.com> Acked-by: Yosry Ahmed <yosry.ahmed@linux.dev> Acked-by: Nhat Pham <nphamcs@gmail.com> Acked-by: David Hildenbrand (Red Hat) <david@kernel.org> Reviewed-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
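The conversion pattern looks like this (illustrative message text, not the exact zswap lines):

    /* before: prints a raw negative number, e.g. "-12" */
    pr_err("could not alloc compressor: %ld\n", PTR_ERR(acomp));

    /* after: prints the symbolic error name, e.g. "-ENOMEM" */
    pr_err("could not alloc compressor: %pe\n", acomp);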
2026-02-07  mm/vmscan: use %pe to print error pointers  (Sahil Chandna, 1 file, -2/+2)
Use the %pe printk format specifier to report error pointers directly instead of printing PTR_ERR() as a long value. This improves clarity and produces more readable error messages. This instance was flagged by the Coccinelle script (misc/ptr_err_to_pe.cocci) as an opportunity to adopt %pe. Found by: make coccicheck MODE=report M=mm/ No functional change intended. Link: https://lkml.kernel.org/r/80a6643657a60e75ddf48b4869b3e7fdc101f855.1770230135.git.chandna.sahil@gmail.com Signed-off-by: Sahil Chandna <chandna.sahil@gmail.com> Acked-by: David Hildenbrand (Red Hat) <david@kernel.org> Reviewed-by: SeongJae Park <sj@kernel.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Yosry Ahmed <yosry.ahmed@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-02-07  mm/readahead: fix typo in comment  (Wilson Zeng, 1 file, -1/+1)
Fix a typo in a comment: max_readhead -> max_readahead. Link: https://lkml.kernel.org/r/20260127152535.321951-1-cheng20011202@gmail.com Signed-off-by: Wilson Zeng <cheng20011202@gmail.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-02-07  mm: khugepaged: fix NR_FILE_PAGES and NR_SHMEM in collapse_file()  (Shakeel Butt, 1 file, -7/+9)
In Meta's fleet, we observed high-level cgroups showing zero file memcg stats while their descendants had non-zero values. Investigation using drgn revealed that these parent cgroups actually had negative file stats, aggregated from their children. This issue became more frequent after deploying thp-always more widely, pointing to a correlation with THP file collapsing. The root cause is that collapse_file() assumes old folios and the new THP belong to the same node and memcg. When this assumption breaks, stats become skewed. The bug affects not just memcg stats but also per-numa stats, and not just NR_FILE_PAGES but also NR_SHMEM. The assumption breaks in scenarios such as:

1. Small folios allocated on one node while the THP gets allocated on a different node.
2. A package downloader running in one cgroup populates the page cache, while a job in a different cgroup executes the downloaded binary.
3. A file shared between processes in different cgroups, where one process faults in the pages and khugepaged (or madvise(COLLAPSE)) collapses them on behalf of the other.

Fix the accounting by explicitly incrementing stats for the new THP and decrementing stats for the old folios being replaced. Link: https://lkml.kernel.org/r/20260130042925.2797946-1-shakeel.butt@linux.dev Fixes: f3f0e1d2150b ("khugepaged: add support of collapse for tmpfs/shmem pages") Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Dev Jain <dev.jain@arm.com> Reviewed-by: Barry Song <baohua@kernel.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Kiryl Shutsemau <kas@kernel.org> Acked-by: David Hildenbrand (arm) <david@kernel.org> Cc: Lance Yang <lance.yang@linux.dev> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Nico Pache <npache@redhat.com> Cc: Rik van Riel <riel@surriel.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Song Liu <songliubraving@fb.com> Cc: Usama Arif <usamaarif642@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
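Conceptually, the fix stops assuming the new THP shares a node/memcg with the folios it replaces and adjusts the counters against each folio explicitly. A simplified sketch (the surrounding collapse_file() logic is elided; the exact calls in the patch may differ):

    /* credit the new THP against its own node/memcg */
    __lruvec_stat_mod_folio(new_folio, NR_FILE_PAGES, HPAGE_PMD_NR);
    if (is_shmem)
            __lruvec_stat_mod_folio(new_folio, NR_SHMEM, HPAGE_PMD_NR);

    /* ... and debit each replaced folio against where it was accounted */
    __lruvec_stat_mod_folio(old, NR_FILE_PAGES, -folio_nr_pages(old));
    if (is_shmem)
            __lruvec_stat_mod_folio(old, NR_SHMEM, -folio_nr_pages(old));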
2026-02-07  mm: refactor vma_map_pages to use vm_insert_pages  (Justin Green, 1 file, -9/+1)
vma_map_pages currently calls vm_insert_page on each individual page in the mapping, which creates significant overhead because the page table spinlock is taken repeatedly. Instead, we should batch-insert pages using vm_insert_pages, which amortizes the cost of the spinlock. Tested through watching hardware accelerated video on an MTK ChromeOS device. This particular path maps both a V4L2 buffer and a GEM allocated buffer into userspace and converts the contents from one pixel format to another. Both vb2_mmap() and mtk_gem_object_mmap() exercise this pathway. Link: https://lkml.kernel.org/r/20260128225648.2938636-1-greenjustin@chromium.org Signed-off-by: Justin Green <greenjustin@chromium.org> Acked-by: Brian Geffon <bgeffon@google.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Arjun Roy <arjunroy@google.com> Cc: David Hildenbrand <david@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
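The shape of the change, roughly (loop bounds and error handling are assumed for illustration):

    /* before: one page-table-lock round trip per page */
    for (i = 0; i < count; i++) {
            ret = vm_insert_page(vma, uaddr + i * PAGE_SIZE, pages[i]);
            if (ret)
                    return ret;
    }

    /* after: one call; vm_insert_pages() batches insertions under the
     * page table lock. 'num' is in/out: on return it holds the number
     * of pages that were NOT inserted. */
    unsigned long num = count;
    ret = vm_insert_pages(vma, uaddr, pages, &num);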
2026-02-07  mm/damon: unify address range representation with damon_addr_range  (Enze Li, 1 file, -13/+2)
Currently, DAMON defines two identical structures for representing address ranges: damon_system_ram_region and damon_addr_range. Both structures share the same semantic interpretation of a half-open interval [start, end), where the start address is inclusive and the end address is exclusive. This duplication adds unnecessary redundancy and increases maintenance overhead. This patch replaces all uses of damon_system_ram_region with the more generic damon_addr_range structure, ensuring a unified type representation for address ranges within the DAMON subsystem. The change simplifies the codebase, improves readability, and avoids potential inconsistencies in future modifications. Link: https://lkml.kernel.org/r/20260129100845.281734-1-lienze@kylinos.cn Signed-off-by: Enze Li <lienze@kylinos.cn> Reviewed-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
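Both structures describe the same half-open interval, which is what makes the substitution mechanical; damon_addr_range is simply:

    struct damon_addr_range {
            unsigned long start;    /* inclusive */
            unsigned long end;      /* exclusive */
    };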
2026-02-07  mm/cma: replace snprintf with strscpy in cma_new_area  (Thorsten Blum, 1 file, -1/+2)
Replace snprintf("%s", ...) with the faster and more direct strscpy(). Link: https://lkml.kernel.org/r/20260126174516.236968-1-thorsten.blum@linux.dev Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: SeongJae Park <sj@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
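The pattern, roughly (field and constant names taken from mm/cma):

    /* before: printf machinery just to copy a string */
    snprintf(cma->name, CMA_MAX_NAME, "%s", name);

    /* after: a direct, always-NUL-terminated bounded copy */
    strscpy(cma->name, name, CMA_MAX_NAME);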
2026-02-07  mm: zswap: use SG list decompression APIs from zsmalloc  (Yosry Ahmed, 1 file, -30/+19)
Use the new zs_obj_read_sg_*() APIs in zswap_decompress(), instead of the zs_obj_read_*() APIs returning a linear address. The SG list is passed directly to the crypto API, simplifying the logic and dropping the workaround that copies highmem addresses to a buffer. The crypto API should internally linearize the SG list if needed. This avoids the memcpy() in zsmalloc for objects spanning multiple pages, although an equivalent operation will be done internally by acomp/scomp. However, in the future compression algorithms could support handling discontiguous SG lists, completely eliminating the copying for spanning objects. Zsmalloc fills an SG list up to 2 entries in size, so change the input SG list to fit 2 entries. Update the incompressible entries path to use memcpy_from_sglist() to copy the data to the folio. Opportunistically set dlen to PAGE_SIZE in the same code path (rather than at the top of the function) to make it clearer. Drop the goto in zswap_compress() as the code is now simple enough to use an if-else statement instead. Rename 'decomp_ret' to 'ret' and reuse it to hold the intermediate return value of crypto_acomp_decompress(), keeping line lengths manageable. No functional change intended. Link: https://lkml.kernel.org/r/20260121013615.2906368-1-yosry.ahmed@linux.dev Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev> Acked-by: Nhat Pham <nphamcs@gmail.com> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
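A sketch of handing the SG list straight to the acomp API; the zs_obj_read_sg_*() helper names come from the patch, but their exact signatures are assumptions here:

    struct scatterlist input[2];    /* zsmalloc fills at most 2 entries */
    struct scatterlist output;
    int ret;

    sg_init_table(input, 2);
    zs_obj_read_sg_begin(pool, handle, input);  /* assumed signature */

    sg_init_table(&output, 1);
    sg_set_page(&output, page, PAGE_SIZE, 0);

    acomp_request_set_params(req, input, &output, entry->length, PAGE_SIZE);
    ret = crypto_acomp_decompress(req);

    zs_obj_read_sg_end(pool, handle);           /* assumed counterpart */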
2026-02-01  mm, swap: remove no longer needed _swap_info_get  (Kairui Song, 1 file, -41/+6)
After consolidating their callers, there are now only two users of _swap_info_get: folio_free_swap and swp_swapcount. folio_free_swap already holds the folio lock, and the folio must be in the swap cache, so _swap_info_get is redundant there. For swp_swapcount, it should use get_swap_device instead. get_swap_device increases the device ref count, which is actually a bit safer. The only current use is smaps walking, and the performance change here is tiny. And after these changes, _swap_info_get is no longer used, so we can safely remove it. Link: https://lkml.kernel.org/r/20251220-swap-table-p2-v5-19-8862a265a033@tencent.com Signed-off-by: Kairui Song <kasong@tencent.com> Reviewed-by: Baoquan He <bhe@redhat.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Rafael J. Wysocki (Intel) <rafael@kernel.org> Cc: Yosry Ahmed <yosry.ahmed@linux.dev> Cc: Deepanshu Kartikey <kartikey406@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kairui Song <ryncsn@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
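The swp_swapcount() side follows the usual get_swap_device() pattern, which also holds a reference on the device; a sketch with the actual count lookup elided:

    static int swp_swapcount_sketch(swp_entry_t entry)
    {
            struct swap_info_struct *si;
            int count = 0;

            si = get_swap_device(entry);    /* ref'd; NULL if invalid */
            if (!si)
                    return 0;
            /* ... look up the count for swp_offset(entry) ... */
            put_swap_device(si);
            return count;
    }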
2026-02-01  mm, swap: drop the SWAP_HAS_CACHE flag  (Kairui Song, 4 files, -132/+78)
Now, the swap cache is managed by the swap table. All swap cache users are checking the swap table directly to check the swap cache state. SWAP_HAS_CACHE is now just a temporary pin before the first increase from 0 to 1 of a slot's swap count (swap_dup_entries) after swap allocation (folio_alloc_swap), or before the final free of slots pinned by folio in swap cache (put_swap_folio). Drop these two usages. For the first dup, SWAP_HAS_CACHE pinning was hard to kill because it used to have multiple meanings, more than just "a slot is cached". We have just simplified that and defined that the first dup is always done with folio locked in swap cache (folio_dup_swap), so stop checking the SWAP_HAS_CACHE bit and just check the swap cache (swap table) directly, and add a WARN if a swap entry's count is being increased for the first time while the folio is not in swap cache. As for freeing, just let the swap cache free all swap entries of a folio that have a swap count of zero directly upon folio removal. We have also just cleaned up batch freeing to check the swap cache usage using the swap table: a slot with swap cache in the swap table will not be freed until its cache is gone, and no SWAP_HAS_CACHE bit is involved anymore. And besides, the removal of a folio and freeing of the slots are being done in the same critical section now, which should improve the performance. After these two changes, SWAP_HAS_CACHE no longer has any users. Swap cache synchronization is also done by the swap table directly, so using SWAP_HAS_CACHE to pin a slot before adding the cache is also no longer needed. Remove all related logic and helpers. swap_map is now only used for tracking the count, so all swap_map users can just read it directly, ignoring the swap_count helper, which was previously used to filter out the SWAP_HAS_CACHE bit. The idea of dropping SWAP_HAS_CACHE and using the swap table directly was initially from Chris's idea of merging all the metadata usage of all swaps into one place. Link: https://lkml.kernel.org/r/20251220-swap-table-p2-v5-18-8862a265a033@tencent.com Signed-off-by: Kairui Song <kasong@tencent.com> Suggested-by: Chris Li <chrisl@kernel.org> Reviewed-by: Baoquan He <bhe@redhat.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Rafael J. Wysocki (Intel) <rafael@kernel.org> Cc: Yosry Ahmed <yosry.ahmed@linux.dev> Cc: Deepanshu Kartikey <kartikey406@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kairui Song <ryncsn@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-02-01  mm, swap: clean up and improve swap entries freeing  (Kairui Song, 1 file, -142/+96)
There are a few problems with the current freeing of swap entries. When freeing a set of swap entries directly (swap_put_entries_direct, typically from zapping the page table), it scans the whole swap region multiple times. First, it scans the whole region to check if it can be batch freed and if there is any cached folio. Then it does a batch free only if the whole region's swap count equals 1. And if any entry is cached, even if only one, it will have to walk the whole region again to clean up the cache. And if any entry is not in a consistent state with the other entries, it will fall back to order 0 freeing. For example, if only one of them is cached, the batch free will fall back. And the current batch freeing workflow relies on the swap map's SWAP_HAS_CACHE bit for both the continuity check and batch freeing, which isn't compatible with the swap table design. Tidy this up: introduce a new cluster-scoped helper for all swap entry freeing work. It batch-frees all continuous entries and simply starts a new batch if any inconsistent entry is found. This may improve the batch size when the clusters are fragmented. This should also be more robust with more sanity checks, and makes it clear that a slot pinned by swap cache will be cleared upon cache reclaim. And the cache reclaim scan is also now limited to each cluster. If a cluster has any clean swap cache left after putting the swap count, reclaim only that cluster instead of the whole region. And since a folio's entries are always in the same cluster, putting swap entries from a folio can also use the new helper directly. This should be both an optimization and a cleanup, and the new helper is adapted to the swap table. Link: https://lkml.kernel.org/r/20251220-swap-table-p2-v5-17-8862a265a033@tencent.com Signed-off-by: Kairui Song <kasong@tencent.com> Reviewed-by: Baoquan He <bhe@redhat.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Rafael J. Wysocki (Intel) <rafael@kernel.org> Cc: Yosry Ahmed <yosry.ahmed@linux.dev> Cc: Deepanshu Kartikey <kartikey406@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kairui Song <ryncsn@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-02-01  mm, swap: check swap table directly for checking cache  (Kairui Song, 4 files, -36/+56)
Instead of looking at the swap map, check the swap table directly to tell if a swap slot is cached. This prepares for the removal of SWAP_HAS_CACHE. Link: https://lkml.kernel.org/r/20251220-swap-table-p2-v5-16-8862a265a033@tencent.com Signed-off-by: Kairui Song <kasong@tencent.com> Reviewed-by: Baoquan He <bhe@redhat.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Rafael J. Wysocki (Intel) <rafael@kernel.org> Cc: Yosry Ahmed <yosry.ahmed@linux.dev> Cc: Deepanshu Kartikey <kartikey406@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kairui Song <ryncsn@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-02-01  mm, swap: add folio to swap cache directly on allocation  (Kairui Song, 4 files, -129/+105)
The allocator uses SWAP_HAS_CACHE to pin a swap slot upon allocation. SWAP_HAS_CACHE is being deprecated as it caused a lot of confusion. This pinning usage can be dropped by adding the folio to the swap cache directly on allocation. All swap allocations are folio-based now (except for hibernation), so the swap allocator can always take the folio as a parameter. And now both the swap cache (swap table) and swap map are protected by the cluster lock, so scanning the map and inserting the folio can be done in the same critical section. This eliminates the time window in which a slot is pinned by SWAP_HAS_CACHE but has no cache, and avoids touching the lock multiple times. This is both a cleanup and an optimization. Link: https://lkml.kernel.org/r/20251220-swap-table-p2-v5-15-8862a265a033@tencent.com Signed-off-by: Kairui Song <kasong@tencent.com> Reviewed-by: Baoquan He <bhe@redhat.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Rafael J. Wysocki (Intel) <rafael@kernel.org> Cc: Yosry Ahmed <yosry.ahmed@linux.dev> Cc: Deepanshu Kartikey <kartikey406@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kairui Song <ryncsn@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-02-01  mm, swap: cleanup swap entry management workflow  (Kairui Song, 10 files, -98/+196)
The current swap entry allocation/freeing workflow has never had a clear definition. This makes it hard to debug or add new optimizations. This commit introduces a proper definition of how swap entries are allocated and freed. Now, most operations are folio based, so they will never exceed one swap cluster, and we now have a cleaner border between swap and the rest of mm, making it much easier to follow and debug, especially with the newly added sanity checks. This also makes more optimization possible. Swap entries will now mostly be allocated and freed with a folio bound; the folio lock is useful for resolving many swap related races. Now swap allocation (except hibernation) always starts with a folio in the swap cache, and entries get duped/freed protected by the folio lock:

- folio_alloc_swap() - The only allocation entry point now. Context: The folio must be locked. This allocates one or a set of continuous swap slots for a folio and binds them to the folio by adding the folio to the swap cache. The swap slots' swap count starts with a zero value.

- folio_dup_swap() - Increase the swap count of one or more entries. Context: The folio must be locked and in the swap cache. For now, the caller still has to lock the new swap entry owner (e.g., PTL). This increases the ref count of swap entries allocated to a folio. Newly allocated swap slots' counts have to be increased by this helper as the folio gets unmapped (and swap entries get installed).

- folio_put_swap() - Decrease the swap count of one or more entries. Context: The folio must be locked and in the swap cache. For now, the caller still has to lock the new swap entry owner (e.g., PTL). This decreases the ref count of swap entries allocated to a folio. Typically, swapin will decrease the swap count as the folio gets installed back and the swap entry gets uninstalled. This won't remove the folio from the swap cache and free the slot. Lazy freeing of swap cache is helpful for reducing IO. There is already a folio_free_swap() for immediate cache reclaim. This part could be further optimized later.

The above locking constraints could be further relaxed when the swap table is fully implemented. Currently, dup still needs the caller to lock the swap entry container (e.g. PTL), or a concurrent zap may underflow the swap count. Some swap users need to interact with the swap count without involving a folio (e.g. forking/zapping the page table, or mapping truncation without swapin). In such cases, the caller has to ensure there is no race condition on whatever owns the swap count, and use the helpers below:

- swap_put_entries_direct() - Decrease the swap count directly. Context: The caller must lock whatever is referencing the slots to avoid a race. Typically, page table zapping or shmem mapping truncation will need to free swap slots directly. If a slot is cached (has a folio bound), this will also try to release the swap cache.

- swap_dup_entry_direct() - Increase the swap count directly. Context: The caller must lock whatever is referencing the entries to avoid a race, and the entries must already have a swap count > 1. Typically, forking will need to copy the page table and hence needs to increase the swap count of the entries in the table. The page table is locked while referencing the swap entries, so the entries all have a swap count > 1 and can't be freed.

The hibernation subsystem is a bit different, so two special wrappers are provided:

- swap_alloc_hibernation_slot() - Allocate one entry from one device.
- swap_free_hibernation_slot() - Free one entry allocated by the above helper.

All hibernation entries are exclusive to the hibernation subsystem and should not interact with ordinary swap routines. By separating the workflows, it will be possible to bind folios more tightly with the swap cache and get rid of SWAP_HAS_CACHE as a temporary pin. This commit should not introduce any behavior change. [kasong@tencent.com: fix leak, per Chris Mason. Remove WARN_ON, per Lai Yi] Link: https://lkml.kernel.org/r/CAMgjq7AUz10uETVm8ozDWcB3XohkOqf0i33KGrAquvEVvfp5cg@mail.gmail.com [ryncsn@gmail.com: fix KSM copy pages for swapoff, per Chris] Link: https://lkml.kernel.org/r/aXxkANcET3l2Xu6J@KASONG-MC4 Link: https://lkml.kernel.org/r/20251220-swap-table-p2-v5-14-8862a265a033@tencent.com Signed-off-by: Kairui Song <kasong@tencent.com> Signed-off-by: Kairui Song <ryncsn@gmail.com> Acked-by: Rafael J. Wysocki (Intel) <rafael@kernel.org> Reviewed-by: Baoquan He <bhe@redhat.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Yosry Ahmed <yosry.ahmed@linux.dev> Cc: Deepanshu Kartikey <kartikey406@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kairui Song <ryncsn@gmail.com> Cc: Chris Mason <clm@fb.com> Cc: Chris Mason <clm@meta.com> Cc: Lai Yi <yi1.lai@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
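Putting the helpers together, the intended life cycle looks roughly like this; only the helper names come from the commit, the argument lists are assumptions:

    /* swap-out (folio locked throughout): */
    folio_alloc_swap(folio);            /* slots allocated, count == 0,
                                         * folio is now in swap cache */
    folio_dup_swap(folio, ...);         /* count 0 -> 1 as PTEs become
                                         * swap entries (caller holds PTL) */

    /* swap-in (folio locked, found in swap cache): */
    folio_put_swap(folio, ...);         /* count 1 -> 0 as the folio is
                                         * mapped back; cache is freed
                                         * lazily, or via folio_free_swap() */

    /* paths with no folio at hand: */
    swap_dup_entry_direct(entry);       /* e.g. fork copying a swap PTE */
    swap_put_entries_direct(entry, nr); /* e.g. zap or shmem truncate */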
2026-02-01  mm, swap: remove workaround for unsynchronized swap map cache state  (Kairui Song, 3 files, -19/+12)
Remove the "skip if exists" check from commit a65b0e7607ccb ("zswap: make shrinking memcg-aware"). It was needed because there is a tiny time window between setting the SWAP_HAS_CACHE bit and actually adding the folio to the swap cache. If a user is trying to add the folio into the swap cache but another user was interrupted after setting SWAP_HAS_CACHE but hasn't added the folio to the swap cache yet, it might lead to a deadlock. We have moved the bit setting to the same critical section as adding the folio, so this is no longer needed. Remove it and clean it up. Link: https://lkml.kernel.org/r/20251220-swap-table-p2-v5-13-8862a265a033@tencent.com Signed-off-by: Kairui Song <kasong@tencent.com> Reviewed-by: Baoquan He <bhe@redhat.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Rafael J. Wysocki (Intel) <rafael@kernel.org> Cc: Yosry Ahmed <yosry.ahmed@linux.dev> Cc: Deepanshu Kartikey <kartikey406@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kairui Song <ryncsn@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-02-01  mm, swap: use swap cache as the swap in synchronize layer  (Kairui Song, 5 files, -71/+97)
Current swap-in synchronization mostly uses the swap_map's SWAP_HAS_CACHE bit. Whoever sets the bit first does the actual work to swap in a folio. This has been causing many issues as it's just a poor implementation of a bit lock. Raced users have no idea what is pinning a slot, so they have to loop with a schedule_timeout_uninterruptible(1), which is ugly and causes long-tailing or other performance issues. Besides, the abuse of SWAP_HAS_CACHE has been causing many other troubles for synchronization or maintenance. This is the first step toward removing this bit completely. Now all swap-in paths use the swap cache, and both the swap cache and swap map are protected by the cluster lock. So we can just resolve the swap synchronization with the swap cache layer directly using the cluster lock and folio lock. Whoever inserts a folio in the swap cache first does the swap-in work. And because folios are locked during swap operations, other raced swap operations will just wait on the folio lock. SWAP_HAS_CACHE will be removed in a later commit. For now, we still set it for some remaining users. But now we do the bit setting and swap cache folio adding in the same critical section, after the swap cache is ready. No one will have to spin on the SWAP_HAS_CACHE bit anymore. This both simplifies the logic and should improve the performance, eliminating issues like the one solved in commit 01626a1823024 ("mm: avoid unconditional one-tick sleep when swapcache_prepare fails"), or the "skip_if_exists" from commit a65b0e7607ccb ("zswap: make shrinking memcg-aware"), which will be removed very soon. [kasong@tencent.com: fix cgroup v1 accounting issue] Link: https://lkml.kernel.org/r/CAMgjq7CGUnzOVG7uSaYjzw9wD7w2dSKOHprJfaEp4CcGLgE3iw@mail.gmail.com Link: https://lkml.kernel.org/r/20251220-swap-table-p2-v5-12-8862a265a033@tencent.com Signed-off-by: Kairui Song <kasong@tencent.com> Reviewed-by: Baoquan He <bhe@redhat.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Rafael J. Wysocki (Intel) <rafael@kernel.org> Cc: Yosry Ahmed <yosry.ahmed@linux.dev> Cc: Deepanshu Kartikey <kartikey406@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kairui Song <ryncsn@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-02-01  mm, swap: split locked entry duplicating into a standalone helper  (Kairui Song, 1 file, -31/+31)
No feature change; split the common logic into a standalone helper to be reused later. Link: https://lkml.kernel.org/r/20251220-swap-table-p2-v5-11-8862a265a033@tencent.com Signed-off-by: Kairui Song <kasong@tencent.com> Reviewed-by: Baoquan He <bhe@redhat.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Rafael J. Wysocki (Intel) <rafael@kernel.org> Cc: Yosry Ahmed <yosry.ahmed@linux.dev> Cc: Deepanshu Kartikey <kartikey406@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kairui Song <ryncsn@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-02-01  mm, swap: consolidate cluster reclaim and usability check  (Kairui Song, 1 file, -16/+29)
Swap cluster cache reclaim requires releasing the lock, so the cluster may become unusable after the reclaim. To prepare for checking the swap cache using the swap table directly, consolidate the swap cluster reclaim and check logic. With the swap table, we will want to avoid touching the cluster's data entirely, to avoid RCU overhead here. Moving the cluster usability check into the reclaim helper also helps avoid a redundant scan of the slots if the cluster is no longer usable. Also, adjust it very slightly while at it: always scan the whole region during reclaim, and don't skip slots covered by a reclaimed folio. Because the reclaim is lockless, it's possible that new cache lands at any time. And for allocation, we want all caches to be reclaimed to avoid fragmentation. Besides, if the scan offset is not aligned with the size of the reclaimed folio, we might skip some existing cache and fail the reclaim unexpectedly. There should be no observable behavior change. It might slightly improve the fragmentation issue or performance. Link: https://lkml.kernel.org/r/20251220-swap-table-p2-v5-10-8862a265a033@tencent.com Signed-off-by: Kairui Song <kasong@tencent.com> Reviewed-by: Baoquan He <bhe@redhat.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Rafael J. Wysocki (Intel) <rafael@kernel.org> Cc: Yosry Ahmed <yosry.ahmed@linux.dev> Cc: Deepanshu Kartikey <kartikey406@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kairui Song <ryncsn@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-02-01  mm, swap: swap entry of a bad slot should not be considered as swapped out  (Kairui Song, 2 files, -9/+10)
When checking if a swap entry is swapped out, we simply check whether the count value (after masking the flag bits) is larger than 0. But SWAP_MAP_BAD will also be considered a swap count value larger than 0. SWAP_MAP_BAD being treated as a count value larger than 0 is useful for the swap allocator: such slots will be seen as used, so the allocator will skip them. But for the swapped-out check, this isn't correct. There is currently no observable issue. The swapped-out check is only useful for readahead and folio swapped-out status checks. For readahead, the swap cache layer will abort upon checking and updating the swap map. For the folio swapped-out status check, the swap allocator will never allocate an entry in a bad slot to a folio, so that part is fine too. The worst that could happen now is redundant allocation/freeing of folios and wasted CPU time. This also makes it easier to get rid of the swap map checking and updating during folio insertion in the swap cache layer. Link: https://lkml.kernel.org/r/20251220-swap-table-p2-v5-9-8862a265a033@tencent.com Signed-off-by: Kairui Song <kasong@tencent.com> Reviewed-by: Baoquan He <bhe@redhat.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Rafael J. Wysocki (Intel) <rafael@kernel.org> Cc: Yosry Ahmed <yosry.ahmed@linux.dev> Cc: Deepanshu Kartikey <kartikey406@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kairui Song <ryncsn@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
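A minimal sketch of the corrected predicate (the helper name is hypothetical; SWAP_MAP_BAD is the kernel's existing marker for bad slots):

    static bool swap_count_swapped_out(unsigned char count)
    {
            /* bad slots look "used" to the allocator, but they were
             * never handed out, so they hold no swapped-out data */
            if (count == SWAP_MAP_BAD)
                    return false;
            return count > 0;
    }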
2026-02-01  mm/shmem, swap: remove SWAP_MAP_SHMEM  (Nhat Pham, 3 files, -34/+25)
The SWAP_MAP_SHMEM state was introduced in commit aaa468653b4a ("swap_info: note SWAP_MAP_SHMEM") to quickly determine if a swap entry belongs to shmem during swapoff. However, swapoff has since been rewritten in commit b56a2d8af914 ("mm: rid swapoff of quadratic complexity"). Now having a swap count == SWAP_MAP_SHMEM is basically the same as having a swap count == 1, and swap_shmem_alloc() behaves analogously to swap_duplicate(). The only difference of note is that swap_shmem_alloc() does not check for -ENOMEM returned from __swap_duplicate(), but that is OK because shmem never re-duplicates any swap entry it owns. This will still be safe if we use (batched) swap_duplicate() instead. This commit adds swap_duplicate_nr(), the batched variant of swap_duplicate(), and removes the SWAP_MAP_SHMEM state and the associated swap_shmem_alloc() helper to simplify the state machine (both mentally and in terms of actual code). We will also have an extra state/special value that can be repurposed (for swap entries that never get re-duplicated). Link: https://lkml.kernel.org/r/20251220-swap-table-p2-v5-8-8862a265a033@tencent.com Signed-off-by: Kairui Song <kasong@tencent.com> Signed-off-by: Nhat Pham <nphamcs@gmail.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Baoquan He <bhe@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: Rafael J. Wysocki (Intel) <rafael@kernel.org> Cc: Yosry Ahmed <yosry.ahmed@linux.dev> Cc: Deepanshu Kartikey <kartikey406@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kairui Song <ryncsn@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
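The batched helper turns shmem's per-entry work into a single call per large folio; a plausible shape (the exact signature is an assumption):

    /* swap_duplicate(entry) becomes the nr == 1 case */
    int swap_duplicate_nr(swp_entry_t entry, int nr);

    /* shmem, on swapping out a large folio: */
    err = swap_duplicate_nr(entry, folio_nr_pages(folio));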
2026-02-01  mm/shmem: never bypass the swap cache for SWP_SYNCHRONOUS_IO  (Kairui Song, 3 files, -77/+27)
Now that the overhead of the swap cache is trivial to none, bypassing the swap cache is no longer a good optimization. We have removed the cache-bypass swapin for anon memory; now do the same for shmem. Many helpers and functions can be dropped now. The performance may drop slightly because of the co-existence and double update of swap_map and swap table; this will be improved very soon in later commits by partially dropping the swap_map update. Swapin of a 24 GB file with tmpfs, transparent_hugepage_tmpfs=within_size and ZRAM, 3 test runs on my machine:

Before:             5.99s
After this commit:  6.29s
After this series:  6.08s

And later swap table phases will drop the swap_map completely to avoid overhead and reduce memory usage. Link: https://lkml.kernel.org/r/20251219195751.61328-1-ryncsn@gmail.com Signed-off-by: Kairui Song <kasong@tencent.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Baoquan He <bhe@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Rafael J. Wysocki (Intel) <rafael@kernel.org> Cc: Yosry Ahmed <yosry.ahmed@linux.dev> Cc: Deepanshu Kartikey <kartikey406@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-02-01  mm, swap: free the swap cache after folio is mapped  (Kairui Song, 1 file, -10/+11)
Currently, we remove the folio from the swap cache and free the swap cache before mapping the PTE. To reduce repeated faults due to parallel swapins of the same PTE, change it to remove the folio from the swap cache after it is mapped. So new faults from the swap PTE will be much more likely to see the folio in the swap cache and wait on it. This does not eliminate all swapin races: an ongoing swapin fault may still see an empty swap cache. That's harmless, as the PTE is changed before the swap cache is cleared, so it will just return and not trigger any repeated faults. This does help to reduce the chance. Link: https://lkml.kernel.org/r/20251220-swap-table-p2-v5-6-8862a265a033@tencent.com Signed-off-by: Kairui Song <kasong@tencent.com> Reviewed-by: Baoquan He <bhe@redhat.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Rafael J. Wysocki (Intel) <rafael@kernel.org> Cc: Yosry Ahmed <yosry.ahmed@linux.dev> Cc: Deepanshu Kartikey <kartikey406@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kairui Song <ryncsn@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
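The change is purely one of ordering; in outline (the cache-removal helper name here is approximate, not the exact patch):

    /* before:
     *     swap_cache_del_folio(folio);     // folio leaves the cache
     *     set_pte_at(mm, addr, ptep, pte); // then the PTE is mapped
     * after:
     */
    set_pte_at(mm, addr, ptep, pte);        /* map the PTE first */
    swap_cache_del_folio(folio);            /* then drop the cache, so a
                                             * racing fault in between
                                             * still finds the folio */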
2026-02-01  mm, swap: simplify the code and reduce indention  (Kairui Song, 1 file, -46/+43)
Now that the swap cache is always used, multiple swap cache checks are no longer useful; remove them and reduce the code indentation. No behavior change. Link: https://lkml.kernel.org/r/20251220-swap-table-p2-v5-5-8862a265a033@tencent.com Signed-off-by: Kairui Song <kasong@tencent.com> Reviewed-by: Baoquan He <bhe@redhat.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Rafael J. Wysocki (Intel) <rafael@kernel.org> Cc: Yosry Ahmed <yosry.ahmed@linux.dev> Cc: Deepanshu Kartikey <kartikey406@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kairui Song <ryncsn@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-02-01  mm, swap: always try to free swap cache for SWP_SYNCHRONOUS_IO devices  (Kairui Song, 1 file, -2/+16)
Now SWP_SYNCHRONOUS_IO devices are also using swap cache. One side effect is that a folio may stay in swap cache for a longer time due to lazy freeing (vm_swap_full()). This can help save some CPU / IO if folios are being swapped out very frequently right after swapin, hence improving the performance. But the long pinning of swap slots also increases the fragmentation rate of the swap device significantly, and currently, all in-tree SWP_SYNCHRONOUS_IO devices are RAM disks, so it also causes the backing memory to be pinned, increasing the memory pressure. So drop the swap cache immediately for SWP_SYNCHRONOUS_IO devices after swapin finishes. Swap cache has served its role as a synchronization layer to prevent any parallel swap-in from wasting CPU or memory allocation, and the redundant IO is not a major concern for SWP_SYNCHRONOUS_IO devices. Worth noting, without this patch, this series so far can provide a ~30% performance gain for certain workloads like MySQL or kernel compilation, but causes significant regression or OOM when under extreme global pressure. With this patch, we still have a nice performance gain for most workloads, and without introducing any observable regressions. This is a hint that further optimization can be done based on the new unified swapin with swap cache, but for now, just keep the behaviour consistent with before. Link: https://lkml.kernel.org/r/20251220-swap-table-p2-v5-4-8862a265a033@tencent.com Signed-off-by: Kairui Song <kasong@tencent.com> Reviewed-by: Baoquan He <bhe@redhat.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Rafael J. Wysocki (Intel) <rafael@kernel.org> Cc: Yosry Ahmed <yosry.ahmed@linux.dev> Cc: Deepanshu Kartikey <kartikey406@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kairui Song <ryncsn@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
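The gist is a small conditional once swapin completes (a sketch; the exact placement differs in the patch):

    /* after a successful swapin on a synchronous device, don't keep the
     * slot pinned by the swap cache: RAM-backed swap would stay pinned
     * and fragment the device */
    if (data_race(si->flags & SWP_SYNCHRONOUS_IO))
            folio_free_swap(folio);     /* folio is locked here */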
2026-02-01  mm, swap: never bypass the swap cache even for SWP_SYNCHRONOUS_IO  (Kairui Song, 3 files, -85/+85)
Now the overhead of the swap cache is trivial, so bypassing the swap cache is no longer a valid optimization. So unify the swapin path using the swap cache. This changes the swap-in behavior in two observable ways. Readahead is now always disabled for SWP_SYNCHRONOUS_IO devices, which is a huge win for some workloads: We used to rely on `SWP_SYNCHRONOUS_IO && __swap_count(entry) == 1` as the indicator to bypass both the swap cache and readahead; the swap count check made bypassing ineffective in many cases, and it's not a good indicator. The limitation existed because the current swap design made it hard to decouple readahead bypassing and swap cache bypassing. We do want to always bypass readahead for SWP_SYNCHRONOUS_IO devices, but bypassing the swap cache at the same time would cause repeated IO and memory overhead. Now that swap cache bypassing is gone, this swap count check can be dropped. The second change is that this enables large swapin for all swap entries on SWP_SYNCHRONOUS_IO devices. Previously, large swapin was also coupled with swap cache bypassing, so the swap count check also made large swapin less effective. Now this is also improved: large swapin is supported for all SWP_SYNCHRONOUS_IO cases. And to catch potential issues with large swapin, especially with page exclusiveness and the swap cache, more debug sanity checks and comments are added. But overall, the code is simpler. The new helper and routines will be used by other components in later commits too, and it's now possible to rely on the swap cache layer for resolving synchronization issues, which will also be done by a later commit. Worth mentioning that for a large folio workload, this may cause more serious thrashing. This isn't a problem with this commit, but a generic large folio issue. For a 4K workload, this commit improves the performance. Link: https://lkml.kernel.org/r/20251220-swap-table-p2-v5-3-8862a265a033@tencent.com Signed-off-by: Kairui Song <kasong@tencent.com> Reviewed-by: Baoquan He <bhe@redhat.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Rafael J. Wysocki (Intel) <rafael@kernel.org> Cc: Yosry Ahmed <yosry.ahmed@linux.dev> Cc: Deepanshu Kartikey <kartikey406@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kairui Song <ryncsn@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-02-01mm, swap: split swap cache preparation loop into a standalone helperKairui Song1-88/+109
To prepare for the removal of the swap cache bypass swapin path, introduce a new helper that accepts a freshly allocated and charged folio, prepares the folio and the swap map, and then adds the folio to the swap cache.

This doesn't change how the swap cache works yet; we still depend on SWAP_HAS_CACHE in the swap map for synchronization. But all the synchronization hacks now live in this single helper.

No feature change.

Link: https://lkml.kernel.org/r/20251220-swap-table-p2-v5-2-8862a265a033@tencent.com
Signed-off-by: Kairui Song <kasong@tencent.com>
Acked-by: Chris Li <chrisl@kernel.org>
Reviewed-by: Barry Song <baohua@kernel.org>
Reviewed-by: Baoquan He <bhe@redhat.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Nhat Pham <nphamcs@gmail.com>
Cc: Rafael J. Wysocki (Intel) <rafael@kernel.org>
Cc: Yosry Ahmed <yosry.ahmed@linux.dev>
Cc: Deepanshu Kartikey <kartikey406@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kairui Song <ryncsn@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
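A minimal sketch of the helper's shape, assuming a hypothetical name and simplified callee signatures; the real helper lives in mm/swap_state.c and differs in detail:

    /*
     * Hypothetical sketch: take a freshly allocated, charged folio and
     * publish it in the swap cache, keeping the SWAP_HAS_CACHE dance
     * in one place.
     */
    static struct folio *swapin_add_folio(struct folio *folio, swp_entry_t entry)
    {
            if (swapcache_prepare(entry))   /* sets SWAP_HAS_CACHE */
                    return NULL;            /* raced: caller retries */

            __folio_set_locked(folio);
            __folio_set_swapbacked(folio);
            if (add_to_swap_cache(folio, entry, GFP_KERNEL, NULL)) {
                    folio_unlock(folio);
                    /* error path: SWAP_HAS_CACHE cleanup omitted here */
                    return NULL;
            }
            return folio;
    }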
2026-02-01mm, swap: rename __read_swap_cache_async to swap_cache_alloc_folioKairui Song4-19/+39
Patch series "mm, swap: swap table phase II: unify swapin use", v5. This series removes the SWP_SYNCHRONOUS_IO swap cache bypass swapin code and special swap flag bits including SWAP_HAS_CACHE, along with many historical issues. The performance is about ~20% better for some workloads, like Redis with persistence. This also cleans up the code to prepare for later phases, some patches are from a previously posted series. Swap cache bypassing and swap synchronization in general had many issues. Some are solved as workarounds, and some are still there [1]. To resolve them in a clean way, one good solution is to always use swap cache as the synchronization layer [2]. So we have to remove the swap cache bypass swap-in path first. It wasn't very doable due to performance issues, but now combined with the swap table, removing the swap cache bypass path will instead improve the performance, there is no reason to keep it. Now we can rework the swap entry and cache synchronization following the new design. Swap cache synchronization was heavily relying on SWAP_HAS_CACHE, which is the cause of many issues. By dropping the usage of special swap map bits and related workarounds, we get a cleaner code base and prepare for merging the swap count into the swap table in the next step. And swap_map is now only used for swap count, so in the next phase, swap_map can be merged into the swap table, which will clean up more things and start to reduce the static memory usage. Removal of swap_cgroup_ctrl is also doable, but needs to be done after we also simplify the allocation of swapin folios: always use the new swap_cache_alloc_folio helper so the accounting will also be managed by the swap layer by then. Test results: Redis / Valkey bench: ===================== Testing on a ARM64 VM 1.5G memory: Server: valkey-server --maxmemory 2560M Client: redis-benchmark -r 3000000 -n 3000000 -d 1024 -c 12 -P 32 -t get no persistence with BGSAVE Before: 460475.84 RPS 311591.19 RPS After: 451943.34 RPS (-1.9%) 371379.06 RPS (+19.2%) Testing on a x86_64 VM with 4G memory (system components takes about 2G): Server: Client: redis-benchmark -r 3000000 -n 3000000 -d 1024 -c 12 -P 32 -t get no persistence with BGSAVE Before: 306044.38 RPS 102745.88 RPS After: 309645.44 RPS (+1.2%) 125313.28 RPS (+22.0%) The performance is a lot better when persistence is applied. This should apply to many other workloads that involve sharing memory and COW. A slight performance drop was observed for the ARM64 Redis test: We are still using swap_map to track the swap count, which is causing redundant cache and CPU overhead and is not very performance-friendly for some arches. This will be improved once we merge the swap map into the swap table (as already demonstrated previously [3]). vm-scabiity =========== usemem --init-time -O -y -x -n 32 1536M (16G memory, global pressure, simulated PMEM as swap), average result of 6 test run: Before: After: System time: 282.22s 283.47s Sum Throughput: 5677.35 MB/s 5688.78 MB/s Single process Throughput: 176.41 MB/s 176.23 MB/s Free latency: 518477.96 us 521488.06 us Which is almost identical. Build kernel test: ================== Test using ZRAM as SWAP, make -j48, defconfig, on a x86_64 VM with 4G RAM, under global pressure, avg of 32 test run: Before After: System time: 1379.91s 1364.22s (-0.11%) Test using ZSWAP with NVME SWAP, make -j48, defconfig, on a x86_64 VM with 4G RAM, under global pressure, avg of 32 test run: Before After: System time: 1822.52s 1803.33s (-0.11%) Which is almost identical. 
MySQL:
======
sysbench /usr/share/sysbench/oltp_read_only.lua --tables=16 --table-size=1000000 --threads=96 --time=600
(using ZRAM as SWAP, in a 512M memory cgroup, buffer pool set to 3G, 3 test runs with a 180s warm-up).

Before: 318162.18 qps
After:  318512.01 qps (+0.1%)

In conclusion, the results look better or identical in most cases, and especially better for workloads with swap count > 1 on SYNC_IO devices: about a ~20% gain in the tests above. The next phases will start to merge the swap count into the swap table and reduce memory usage.

One more gain here is that we now have better support for THP swapin. Previously, THP swapin was bound to swap cache bypassing, which only works for single-mapped folios. Removing the bypass path also enables THP swapin for all folios. THP swapin is still limited to SYNC_IO devices; that limitation can be removed later. This may cause more serious THP thrashing for certain workloads, but that's not an issue caused by this series; it's a common THP issue to be resolved separately.

This patch (of 19):

__read_swap_cache_async() is widely used to allocate a folio and ensure it is in the swap cache, or to get the folio if one is already there. It's not async, and it doesn't do any read. Rename it to better reflect its usage, and prepare for it to be reworked as part of the new swap cache APIs. Also add some comments to the function. Worth noting that the skip_if_exists argument is a long-standing workaround that will be dropped soon.

Link: https://lkml.kernel.org/r/20251220-swap-table-p2-v5-0-8862a265a033@tencent.com
Link: https://lkml.kernel.org/r/20251220-swap-table-p2-v5-1-8862a265a033@tencent.com
Link: https://lore.kernel.org/linux-mm/CAMgjq7D5qoFEK9Omvd5_Zqs6M+TEoG03+2i_mhuP5CQPSOPrmQ@mail.gmail.com/ [1]
Link: https://lore.kernel.org/linux-mm/20240326185032.72159-1-ryncsn@gmail.com/ [2]
Link: https://lore.kernel.org/linux-mm/20250514201729.48420-1-ryncsn@gmail.com/ [3]
Signed-off-by: Kairui Song <kasong@tencent.com>
Reviewed-by: Yosry Ahmed <yosry.ahmed@linux.dev>
Acked-by: Chris Li <chrisl@kernel.org>
Reviewed-by: Barry Song <baohua@kernel.org>
Reviewed-by: Nhat Pham <nphamcs@gmail.com>
Reviewed-by: Baoquan He <bhe@redhat.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Rafael J. Wysocki (Intel) <rafael@kernel.org>
Cc: Deepanshu Kartikey <kartikey406@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
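The rename itself is mechanical; sketched below with the parameter list this function has carried in recent trees (treat the exact signature as an assumption):

    /* Before: */
    folio = __read_swap_cache_async(entry, gfp_mask, mpol, ilx,
                                    &page_allocated, skip_if_exists);
    /* After - same behaviour, a name that says what it does: */
    folio = swap_cache_alloc_folio(entry, gfp_mask, mpol, ilx,
                                   &page_allocated, skip_if_exists);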
2026-02-01selftests/mm: have the harness run each test category separatelyMark Brown31-1/+152
At present the mm selftests are integrated into the kselftest harness by having it run run_vmtests.sh and letting that script pick its default set of tests to invoke, rather than by telling the kselftest framework about each test program individually, as is more standard. This has some unfortunate interactions with the kselftest harness:

- If any of the tests hangs, the harness kills the entire mm selftests
  run rather than just the individual test, meaning no further tests
  get run.
- The timeout applied by the harness covers the whole run rather than
  an individual test, which frequently leads to the suite not being
  completed in production testing.

Deploy a crude but effective mitigation for these issues by telling the kselftest framework to run each of the test categories that run_vmtests.sh supports separately. Since kselftest really wants to run test programs, this is done by providing a trivial wrapper script for each category that invokes run_vmtests.sh; this is not a thing of great elegance, but it is clear and simple. Since run_vmtests.sh does runtime support detection, scenario enumeration, and setup for many of the tests, we can't consistently tell the framework about the individual test programs. This has the side effect of reordering the tests; hopefully the testing is not overly sensitive to that.

Link: https://lkml.kernel.org/r/20260123-selftests-mm-run-suites-separately-v2-1-3e934edacbfa@kernel.org
Signed-off-by: Mark Brown <broonie@kernel.org>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: David Hildenbrand <david@kernel.org>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Leon Romanovsky <leon@kernel.org>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-02-01percpu: add double free check to pcpu_free_area()Dennis Zhou1-1/+14
Percpu memory provides access via offsets into the percpu address space. Offsets are essentially fixed for the lifetime of a chunk and therefore require all users to be good samaritans. If a user improperly handles the lifetime of a percpu object, it can result in corruption in a couple of ways:

- immediate double free
  - breaks percpu metadata accounting
- free after subsequent allocation
  - corruption due to the multiple owner problem (either the prior
    owner still writes or a future allocation happens)
  - potential for an oops if the percpu pages are reclaimed, as the
    subsequent allocation isn't pinning the pages down
  - can lead to page->private pointers pointing to freed chunks

Sebastian noticed that when this happens, none of the memory debugging facilities add additional information [1]. This patch aims to catch invalid free scenarios within valid chunks. To better guard free_percpu(), we could additionally add a magic number or some tracking facility to the percpu subsystem in a separate patch.

The invalid free check in pcpu_free_area() validates that the allocation's starting bit is set in both alloc_map and bound_map. The alloc_map bit test ensures the area is allocated, while the bound_map bit test checks that we are freeing from the beginning of an allocation. We choose not to check the validity of the offset, as that is already encoded in page->private being a valid chunk. pcpu_stats_area_dealloc() is moved later so it sits only on the happy path and stats are only updated on valid frees.

Link: https://lkml.kernel.org/r/20260123205535.35267-1-dennis@kernel.org
Link: https://lore.kernel.org/lkml/20260119074813.ecAFsGaT@linutronix.de/ [1]
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Christoph Lameter <cl@linux.com>
Cc: Christoph Lameter <cl@gentwo.org>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
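A hedged sketch of the check as described; alloc_map and bound_map are real pcpu_chunk members, but the placement inside pcpu_free_area() and the bit-offset computation are simplified here:

    /*
     * Sketch: reject frees whose starting bit is either not marked
     * allocated (alloc_map) or does not begin an allocation (bound_map).
     */
    int bit_off = off / PCPU_MIN_ALLOC_SIZE;

    if (WARN_ON_ONCE(!test_bit(bit_off, chunk->alloc_map) ||
                     !test_bit(bit_off, chunk->bound_map)))
            return;  /* invalid free: bail before touching metadata/stats */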
2026-02-01hugetlb: increase hugepage reservations when using node-specific ↵Li Zhe1-0/+7
"hugepages=" cmdline Commit 3dfd02c90037 ("hugetlb: increase number of reserving hugepages via cmdline") raised the number of hugepages that can be reserved through the boot-time "hugepages=" parameter for the non-node-specific case, but left the node-specific form of the same parameter unchanged. This patch extends the same optimization to node-specific reservations. When HugeTLB vmemmap optimization (HVO) is enabled and a node cannot satisfy the requested hugepages, the code first releases ordinary struct-page memory of hugepages obtained from the buddy allocator, allowing their struct-page memory to be reclaimed and reused for additional hugepage reservations on that node. This is particularly beneficial for configurations that require identical, large per-node hugepage reservations. On a four-node, 384 GB x86 VM, the patch raises the attainable 2 MiB hugepage reservation from under 374 GB to more than 379 GB. Link: https://lkml.kernel.org/r/20260122035002.79958-1-lizhe.67@bytedance.com Signed-off-by: Li Zhe <lizhe.67@bytedance.com> Reviewed-by: Muchun Song <muchun.song@linux.dev> Acked-by: Oscar Salvador <osalvador@suse.de> Cc: David Hildenbrand <david@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-02-01kasan: remove unnecessary sync argument from start_report()Maninder Singh1-4/+4
Commit 7ce0ea19d50e ("kasan: switch kunit tests to console tracepoints") removed the use of the sync variable, so drop the now-unused extra argument as well.

Link: https://lkml.kernel.org/r/20260122041556.341868-1-maninder1.s@samsung.com
Signed-off-by: Maninder Singh <maninder1.s@samsung.com>
Acked-by: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-02-01mm/pagewalk: use min() to simplify the codezenghongling1-1/+2
Use the min() macro to simplify the function and improve its readability.

[akpm@linux-foundation.org: add newline, per Lorenzo]
Link: https://lkml.kernel.org/r/20260120094932.183697-1-zenghongling@kylinos.cn
Signed-off-by: zenghongling <zenghongling@kylinos.cn>
Acked-by: David Hildenbrand (Red Hat) <david@kernel.org>
Cc: Hongling Zeng <zenghongling@kylinos.cn>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
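Illustrative only - the generic shape of this kind of cleanup, not the literal pagewalk hunk:

    /* Before: open-coded clamping of the walk boundary. */
    next = addr + PAGE_SIZE;
    if (next > end)
            next = end;

    /* After: min() from <linux/minmax.h> says the same thing in one line. */
    next = min(addr + PAGE_SIZE, end);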
2026-02-01mm/vma: add and use vma_assert_stabilised()Lorenzo Stoakes3-7/+54
Sometimes we wish to assert that a VMA is stable, that is, that the VMA cannot be changed underneath us. This is the case if EITHER the VMA lock or the mmap lock is held.

In order to do so, we introduce a new assert, vma_assert_stabilised() - this makes a lockdep assert if lockdep is enabled AND the VMA is read-locked. Currently lockdep tracking for VMA write locks is not implemented, so in that case it suffices to check that we hold either an mmap read or write semaphore.

Note that because the VMA lock uses the non-standard vmlock_dep_map naming convention, we cannot use lockdep_assert_is_write_held(), so we have to open-code this ourselves by lockdep-asserting lock_is_held_type(&vma->vmlock_dep_map, 0).

We have to be careful here - for instance, when merging a VMA, we use the mmap write lock to stabilise the examination of adjacent VMAs, which might simultaneously be VMA read-locked while being faulted in. If we were to assert the VMA read lock using lockdep, we would hit an incorrect lockdep assert.

Also, we have to be careful about asserting that mmap locks are held - if we try to address the above issue by first checking whether the mmap lock is held and, if so, asserting it via lockdep, we may find we were raced by another thread acquiring an mmap read lock simultaneously - one that either we don't own (and thus can be released at any time, so we are not stable) or that was indeed released since we last checked.

So to deal with these complexities we end up with either a precise (if lockdep is enabled) or imprecise (if not) approach. In the first instance we assert via lockdep that the lock is held, and thus that we own it. If we do own it, the check is complete; otherwise we must check for the VMA read lock being held (a VMA write lock implies the mmap write lock, so the mmap lock check suffices for that case). If lockdep is not enabled, we simply check whether the mmap lock is held and risk a false negative (i.e. not asserting when we should).

There are a couple of places in the kernel where we already do this stabilisation check - the anon_vma_name() helper in mm/madvise.c and vma_flag_set_atomic() in include/linux/mm.h - which we update to use vma_assert_stabilised(). This change abstracts these into vma_assert_stabilised(), uses lockdep where possible, and avoids a duplicate check of whether the mmap lock is held. This is also self-documenting and lays the foundations for further VMA stability checks in the code.

The only functional change here is the added lockdep check.

Link: https://lkml.kernel.org/r/6c9e64bb2b56ddb6f806fde9237f8a00cb3a776b.1769198904.git.lorenzo.stoakes@oracle.com
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Waiman Long <longman@redhat.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
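A minimal sketch of the precise/imprecise split described above, using the __vma_lockdep_map() helper from the previous commit; the in-tree implementation differs in detail and this ordering is an assumption:

    static inline void vma_assert_stabilised(struct vm_area_struct *vma)
    {
            /* Precise path: lockdep can tell whether *we* hold the VMA
             * read lock. A VMA write lock implies the mmap write lock,
             * so the mmap lock check below covers that case. */
            if (IS_ENABLED(CONFIG_LOCKDEP) &&
                lock_is_held(__vma_lockdep_map(vma)))
                    return;

            /* Imprecise fallback: checks that the mmap lock is held, not
             * that we hold it - risking a false negative, never a false
             * assert. */
            mmap_assert_locked(vma->vm_mm);
    }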
2026-02-01mm/vma: update vma_assert_locked() to use lockdepLorenzo Stoakes1-8/+48
We can use lockdep to avoid unnecessary work here; otherwise, update the code to logically evaluate all pertinent cases and share code with vma_assert_write_locked(). Make it explicit that we treat the VMA being detached at this point as a bug - this was only implicit before.

Additionally, abstract references to vma->vmlock_dep_map by introducing a helper macro, __vma_lockdep_map(), which accesses this field if lockdep is enabled. Since lock_is_held() is specified as an extern function when lockdep is disabled, we can simply define __vma_lockdep_map() as NULL in that case, and then use IS_ENABLED(CONFIG_LOCKDEP) to avoid ugly ifdeffery.

[lorenzo.stoakes@oracle.com: add helper macro __vma_lockdep_map(), per Vlastimil]
Link: https://lkml.kernel.org/r/7c4b722e-604b-4b20-8e33-03d2f8d55407@lucifer.local
Link: https://lkml.kernel.org/r/538762f079cc4fa76ff8bf30a8a9525a09961451.1769198904.git.lorenzo.stoakes@oracle.com
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Waiman Long <longman@redhat.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
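A sketch of the helper macro as described (illustrative):

    #ifdef CONFIG_LOCKDEP
    #define __vma_lockdep_map(vma)  (&(vma)->vmlock_dep_map)
    #else
    /* lock_is_held() is an extern function when lockdep is off, so NULL
     * compiles fine: callers guard with IS_ENABLED(CONFIG_LOCKDEP) and
     * never actually evaluate it. */
    #define __vma_lockdep_map(vma)  ((struct lockdep_map *)NULL)
    #endif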
2026-02-01mm/vma: improve and document __is_vma_write_locked()Lorenzo Stoakes2-24/+27
We don't actually need to return an output parameter providing the mm sequence number; rather, we can separate that out into another function - __vma_raw_mm_seqnum() - and have any callers which need it invoke that instead.

Access to the raw sequence number requires that we hold the exclusive mmap lock, so that we know we can't race vma_end_write_all(); move the assert to __vma_raw_mm_seqnum() to make this requirement clear.

Also, while we're here, convert all of the VM_BUG_ON_VMA()s to VM_WARN_ON_ONCE_VMA()s, in line with the convention that we do not invoke oopses when we can avoid it.

[lorenzo.stoakes@oracle.com: minor tweaks, per Vlastimil]
Link: https://lkml.kernel.org/r/3fa89c13-232d-4eee-86cc-96caa75c2c67@lucifer.local
Link: https://lkml.kernel.org/r/ef6c415c2d2c03f529dca124ccaed66bc2f60edc.1769198904.git.lorenzo.stoakes@oracle.com
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Waiman Long <longman@redhat.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
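A sketch of the separated-out helper; the sequence-counter field name is an assumption for illustration, but the point is that the assert now lives at the raw access:

    static inline unsigned int __vma_raw_mm_seqnum(struct vm_area_struct *vma)
    {
            /* Raw reads race vma_end_write_all() unless we hold the
             * exclusive mmap lock - assert that here, at the access. */
            mmap_assert_write_locked(vma->vm_mm);
            return vma->vm_mm->mm_lock_seq.sequence;  /* assumed field */
    }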
2026-02-01mm/vma: introduce helper struct + thread through exclusive lock fnsLorenzo Stoakes3-77/+112
It is confusing to have __vma_start_exclude_readers() return 0, 1, or an error (the latter only when waiting for readers in TASK_KILLABLE state), and having the return value stored in a stack variable called 'locked' adds further confusion.

More generally, we do a lot of rather finicky things while acquiring a state in which readers are excluded and while moving out of that state, including tracking whether we are detached and whether an error occurred. We were implementing logic in __vma_start_exclude_readers() that effectively acts as 'if one caller calls us do X, if another then do Y', which is very confusing from a control flow perspective.

Introducing a shared helper object for this state helps us avoid this, as we can now handle the 'an error arose but we're detached' condition correctly in both callers - a warning if not detaching, and treating the situation as if no error arose when a VMA is detaching. This also helps document what's going on and allows us to add some more logical debug asserts.

Also update vma_mark_detached() to add a guard clause for the likely 'already detached' state (given we hold the mmap write lock), and add a comment about ephemeral VMA read lock reference count increments to clarify why we enter/exit an exclusive locked state here.

Finally, separate vma_mark_detached() into its fast-path component and make it inline, then place the slow path for excluding readers in mmap_lock.c.

No functional change intended.

[akpm@linux-foundation.org: fix function naming in comments, add comment per Vlastimil per Lorenzo]
Link: https://lkml.kernel.org/r/7d3084d596c84da10dd374130a5055deba6439c0.1769198904.git.lorenzo.stoakes@oracle.com
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Waiman Long <longman@redhat.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
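An illustrative shape for the shared helper state; the field names are hypothetical and the commit's actual struct may differ:

    /* Threaded through __vma_start/__vma_end_exclude_readers() so both
     * callers can tell 'error while detaching' apart from a real error. */
    struct vma_exclude_readers_state {
            struct vm_area_struct *vma;
            bool detaching;  /* caller intends to detach the VMA */
            bool detached;   /* VMA observed as detached */
            int err;         /* set when a killable wait is interrupted */
    };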
2026-02-01mm/vma: clean up __vma_enter/exit_locked()Lorenzo Stoakes2-20/+42
These functions are very confusing indeed. 'Entering' a lock could be interpreted as acquiring it, but that is not what these functions do. Equally, they don't indicate at all what kind of lock we are 'entering' or 'exiting'. Finally, they are misleading because we invoke them when we already hold a write lock in order to detach a VMA.

These functions simply 'enter' and 'exit' a state in which we hold the EXCLUSIVE lock so that we can either mark the VMA as write-locked or mark it detached. Rename the functions accordingly, and also update __vma_end_exclude_readers() to return the detached state with a __must_check directive, as passing an output pointer for detached state here was clumsy and inconsistent with __vma_start_exclude_readers().

Finally, remove the unnecessary 'inline' directives.

No functional change intended.

Link: https://lkml.kernel.org/r/33273be9389712347d69987c408ca7436f0c1b22.1769198904.git.lorenzo.stoakes@oracle.com
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Waiman Long <longman@redhat.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
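Sketched signature of the change; the name is from this commit, the body is omitted:

    /* Detached state is now a return value callers may not ignore,
     * rather than an output pointer parameter. */
    static __must_check bool __vma_end_exclude_readers(struct vm_area_struct *vma);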
2026-02-01mm/vma: de-duplicate __vma_enter_locked() error pathLorenzo Stoakes1-9/+12
We're doing precisely the same thing that __vma_exit_locked() does, so de-duplicate this code and keep the refcount primitive in one place.

No functional change intended.

Link: https://lkml.kernel.org/r/c9759b593f6a158e984fa87abe2c3cbd368ef825.1769198904.git.lorenzo.stoakes@oracle.com
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Waiman Long <longman@redhat.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-02-01mm/vma: add+use vma lockdep acquire/release definesLorenzo Stoakes2-8/+39
The code is littered with inscrutable and duplicative lockdep incantations; replace these with defines which explain what is going on, and add commentary to explain what we're doing. If lockdep is disabled, these become no-ops. We must use defines so that _RET_IP_ remains meaningful. These are self-documenting and aid readability of the code.

Additionally, instead of using the confusing rwsem_*() forms for something that is emphatically not an rwsem, explicitly use the lock_acquire_shared()/lock_acquire_exclusive() and lock_release() lockdep invocations, since we are doing something rather custom here and these make more sense to use.

No functional change intended.

Link: https://lkml.kernel.org/r/fdae72441949ecf3b4a0ed3510da803e881bb153.1769198904.git.lorenzo.stoakes@oracle.com
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Waiman Long <longman@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
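An illustrative pair of defines in the spirit described - the names here are hypothetical, but lock_acquire_shared() and lock_release() are real lockdep primitives, and using defines keeps _RET_IP_ pointing at the caller:

    #ifdef CONFIG_LOCKDEP
    /* Record taking the VMA lock for reading (shared). */
    #define __vma_lockdep_acquire_read(vma) \
            lock_acquire_shared(&(vma)->vmlock_dep_map, 0, 0, NULL, _RET_IP_)
    /* Record dropping it again. */
    #define __vma_lockdep_release(vma) \
            lock_release(&(vma)->vmlock_dep_map, _RET_IP_)
    #else
    #define __vma_lockdep_acquire_read(vma) do { } while (0)
    #define __vma_lockdep_release(vma)      do { } while (0)
    #endif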
2026-02-01mm/vma: rename is_vma_write_only(), separate out shared refcount putLorenzo Stoakes2-20/+63
The is_vma_writer_only() function is misnamed - it does not determine whether there is only a write lock; it checks for the presence of the VM_REFCNT_EXCLUDE_READERS_FLAG. Really, it is checking whether readers are excluded, with the possibility of a false positive in the case of a detachment (there we expect vma->vm_refcnt to eventually be set to VM_REFCNT_EXCLUDE_READERS_FLAG, whereas for an attached VMA we expect it to eventually be set to VM_REFCNT_EXCLUDE_READERS_FLAG + 1). Rename the function accordingly.

Relatedly, we use the __refcount_dec_and_test() primitive directly in vma_refcount_put(), using the old value to determine what the reference count ought to be after the operation completes (ignoring racing reference count adjustments). Wrap this in a __vma_refcount_put_return() function, which we can then use in vma_mark_detached() and thereby keep the refcount primitive usage abstracted. This function, as the name implies, returns the value after the reference count has been updated. This reduces duplication between the two invocations of the primitive.

Also adjust comments, removing duplicative comments covered elsewhere and adding more to aid understanding.

No functional change intended.

Link: https://lkml.kernel.org/r/32053580bff460eb1092ef780b526cefeb748bad.1769198904.git.lorenzo.stoakes@oracle.com
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Waiman Long <longman@redhat.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
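A minimal sketch of the wrapper, assuming vm_refcnt is a refcount_t; __refcount_dec_and_test() is the real primitive named above, the body is illustrative:

    static inline int __vma_refcount_put_return(struct vm_area_struct *vma)
    {
            int oldcnt;

            /* __refcount_dec_and_test() reports the pre-decrement value
             * via @oldcnt. */
            __refcount_dec_and_test(&vma->vm_refcnt, &oldcnt);

            /* The count after our put, ignoring racing adjustments. */
            return oldcnt - 1;
    }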