summaryrefslogtreecommitdiff
AgeCommit message (Collapse)AuthorFilesLines
2026-04-05mm/sparse: move memory hotplug bits to sparse-vmemmap.cDavid Hildenbrand (Arm)4-309/+310
Let's move all memory hoptplug related code to sparse-vmemmap.c. We only have to expose sparse_index_init(). While at it, drop the definition of sparse_index_init() for !CONFIG_SPARSEMEM, which is unused, and place the declaration in internal.h. Link: https://lkml.kernel.org/r/20260320-sparsemem_cleanups-v2-15-096addc8800d@kernel.org Signed-off-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Cc: Wei Xu <weixugc@google.com> Cc: Yuanchu Xie <yuanchu@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-05mm/sparse: move __section_mark_present() to internal.hDavid Hildenbrand (Arm)2-8/+9
Let's prepare for moving memory hotplug handling from sparse.c to sparse-vmemmap.c by moving __section_mark_present() to internal.h. Link: https://lkml.kernel.org/r/20260320-sparsemem_cleanups-v2-14-096addc8800d@kernel.org Signed-off-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Cc: Wei Xu <weixugc@google.com> Cc: Yuanchu Xie <yuanchu@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-05mm/sparse: move sparse_init_one_section() to internal.hDavid Hildenbrand (Arm)3-25/+23
While at it, convert the BUG_ON to a VM_WARN_ON_ONCE, avoid long lines, and merge sparse_encode_mem_map() into its only caller sparse_init_one_section(). Clarify the comment a bit, pointing at page_to_pfn(). [david@kernel.org: s/VM_WARN_ON/VM_WARN_ON_ONCE/] Link: https://lkml.kernel.org/r/6b04c1a1-74e7-42e8-8523-a40802e5dacc@kernel.org Link: https://lkml.kernel.org/r/20260320-sparsemem_cleanups-v2-13-096addc8800d@kernel.org Signed-off-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Cc: Wei Xu <weixugc@google.com> Cc: Yuanchu Xie <yuanchu@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-05mm/sparse: drop set_section_nid() from sparse_add_section()David Hildenbrand (Arm)1-1/+0
CONFIG_MEMORY_HOTPLUG is CONFIG_SPARSEMEM_VMEMMAP-only. And CONFIG_SPARSEMEM_VMEMMAP implies that NODE_NOT_IN_PAGE_FLAGS cannot be set: see include/linux/page-flags-layout.h ... #elif defined(CONFIG_SPARSEMEM_VMEMMAP) #error "Vmemmap: No space for nodes field in page flags" ... Which implies that the node is always stored in page flags and NODE_NOT_IN_PAGE_FLAGS cannot be set. Therefore, set_section_nid() is a NOP on CONFIG_SPARSEMEM_VMEMMAP. So let's remove the set_section_nid() call to prepare for moving CONFIG_MEMORY_HOTPLUG to mm/sparse-vmemmap.c Link: https://lkml.kernel.org/r/20260320-sparsemem_cleanups-v2-12-096addc8800d@kernel.org Signed-off-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Cc: Wei Xu <weixugc@google.com> Cc: Yuanchu Xie <yuanchu@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-05mm: prepare to move subsection_map_init() to mm/sparse-vmemmap.cDavid Hildenbrand (Arm)4-9/+14
We want to move subsection_map_init() to mm/sparse-vmemmap.c. To prepare for getting rid of subsection_map_init() in mm/sparse.c completely, use a static inline function for !CONFIG_SPARSEMEM_VMEMMAP. While at it, move the declaration to internal.h and rename it to "sparse_init_subsection_map()". Link: https://lkml.kernel.org/r/20260320-sparsemem_cleanups-v2-11-096addc8800d@kernel.org Signed-off-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Cc: Wei Xu <weixugc@google.com> Cc: Yuanchu Xie <yuanchu@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-05mm/sparse: remove CONFIG_MEMORY_HOTPLUG-specific usemap allocation handlingDavid Hildenbrand (Arm)1-99/+1
In 2008, we added through commit 48c906823f39 ("memory hotplug: allocate usemap on the section with pgdat") quite some complexity to try allocating memory for the "usemap" (storing pageblock information per memory section) for a memory section close to the memory of the "pgdat" of the node. The goal was to make memory hotunplug of boot memory more likely to succeed. That commit also added some checks for circular dependencies between two memory sections, whereby two memory sections would contain each others usemap, turning both boot memory sections un-removable. However, in 2010, commit a4322e1bad91 ("sparsemem: Put usemap for one node together") started allocating the usemap for multiple memory sections on the same node in one chunk, effectively grouping all usemap allocations of the same node in a single memblock allocation. We don't really give guarantees about memory hotunplug of boot memory, and with the change in 2010, it is impossible in practice to get any circular dependencies. So let's simply remove this complexity. Link: https://lkml.kernel.org/r/20260320-sparsemem_cleanups-v2-10-096addc8800d@kernel.org Signed-off-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Cc: Wei Xu <weixugc@google.com> Cc: Yuanchu Xie <yuanchu@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-05mm/sparse: remove sparse_decode_mem_map()David Hildenbrand (Arm)2-17/+1
section_deactivate() applies to CONFIG_SPARSEMEM_VMEMMAP only. So we can just use pfn_to_page() (after making sure we have the start PFN of the section), and remove sparse_decode_mem_map(). Link: https://lkml.kernel.org/r/20260320-sparsemem_cleanups-v2-9-096addc8800d@kernel.org Signed-off-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Cc: Wei Xu <weixugc@google.com> Cc: Yuanchu Xie <yuanchu@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-05mm/bootmem_info: avoid using sparse_decode_mem_map()David Hildenbrand (Arm)1-5/+4
With SPARSEMEM_VMEMMAP, we can just do a pfn_to_page(). It is not super clear whether the start_pfn is properly aligned ... so let's just make sure it is properly aligned to the start of the section. We will soon might try to remove the bootmem info completely, for now, just keep it working as is. Link: https://lkml.kernel.org/r/20260320-sparsemem_cleanups-v2-8-096addc8800d@kernel.org Signed-off-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Cc: Wei Xu <weixugc@google.com> Cc: Yuanchu Xie <yuanchu@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-05mm/bootmem_info: remove handling for !CONFIG_SPARSEMEM_VMEMMAPDavid Hildenbrand (Arm)1-37/+0
It is not immediately obvious that CONFIG_HAVE_BOOTMEM_INFO_NODE is only selected from CONFIG_MEMORY_HOTREMOVE, which itself depends on CONFIG_MEMORY_HOTPLUG that ... depends on CONFIG_SPARSEMEM_VMEMMAP. Let's remove the !CONFIG_SPARSEMEM_VMEMMAP leftovers that are dead code. Link: https://lkml.kernel.org/r/20260320-sparsemem_cleanups-v2-7-096addc8800d@kernel.org Signed-off-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Cc: Wei Xu <weixugc@google.com> Cc: Yuanchu Xie <yuanchu@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-05mm/sparse: remove !CONFIG_SPARSEMEM_VMEMMAP leftovers for CONFIG_MEMORY_HOTPLUGDavid Hildenbrand (Arm)1-67/+2
CONFIG_MEMORY_HOTPLUG now depends on CONFIG_SPARSEMEM_VMEMMAP. So let's remove the !CONFIG_SPARSEMEM_VMEMMAP leftovers that are dead code. Adjust the comment above fill_subsection_map() accordingly. Link: https://lkml.kernel.org/r/20260320-sparsemem_cleanups-v2-6-096addc8800d@kernel.org Signed-off-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Cc: Wei Xu <weixugc@google.com> Cc: Yuanchu Xie <yuanchu@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-05mm/memory_hotplug: simplify check_pfn_span()David Hildenbrand (Arm)1-14/+6
We now always have CONFIG_SPARSEMEM_VMEMMAP, so remove the dead code. Link: https://lkml.kernel.org/r/20260320-sparsemem_cleanups-v2-5-096addc8800d@kernel.org Signed-off-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Cc: Wei Xu <weixugc@google.com> Cc: Yuanchu Xie <yuanchu@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-05mm/Kconfig: make CONFIG_MEMORY_HOTPLUG depend on CONFIG_SPARSEMEM_VMEMMAPDavid Hildenbrand (Arm)1-1/+1
Ever since commit f8f03eb5f0f9 ("mm: stop making SPARSEMEM_VMEMMAP user-selectable"), an architecture that supports CONFIG_SPARSEMEM_VMEMMAP (by selecting SPARSEMEM_VMEMMAP_ENABLE) can no longer enable CONFIG_SPARSEMEM without CONFIG_SPARSEMEM_VMEMMAP. Right now, CONFIG_MEMORY_HOTPLUG is guarded by CONFIG_SPARSEMEM. However, CONFIG_ARCH_ENABLE_MEMORY_HOTPLUG is only enabled by * arm64: which selects SPARSEMEM_VMEMMAP_ENABLE * loongarch: which selects SPARSEMEM_VMEMMAP_ENABLE * powerpc (64bit): which selects SPARSEMEM_VMEMMAP_ENABLE * riscv (64bit): which selects SPARSEMEM_VMEMMAP_ENABLE * s390 with SPARSEMEM: which selects SPARSEMEM_VMEMMAP_ENABLE * x86 (64bit): which selects SPARSEMEM_VMEMMAP_ENABLE So, we can make CONFIG_MEMORY_HOTPLUG depend on CONFIG_SPARSEMEM_VMEMMAP without affecting any setups. Link: https://lkml.kernel.org/r/20260320-sparsemem_cleanups-v2-4-096addc8800d@kernel.org Signed-off-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Cc: Wei Xu <weixugc@google.com> Cc: Yuanchu Xie <yuanchu@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-05mm/sparse: remove WARN_ONs from (online|offline)_mem_sections()David Hildenbrand (Arm)1-15/+2
We do not allow offlining of memory with memory holes, and always hotplug memory without holes. Consequently, we cannot end up onlining or offlining memory sections that have holes (including invalid sections). That's also why these WARN_ONs never fired. Let's remove the WARN_ONs along with the TODO regarding double-checking. Link: https://lkml.kernel.org/r/20260320-sparsemem_cleanups-v2-3-096addc8800d@kernel.org Signed-off-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Cc: Wei Xu <weixugc@google.com> Cc: Yuanchu Xie <yuanchu@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-05mm/memory_hotplug: remove for_each_valid_pfn() usageDavid Hildenbrand (Arm)1-2/+2
When offlining memory, we know that the memory range has no holes. Checking for valid pfns is not required. Link: https://lkml.kernel.org/r/20260320-sparsemem_cleanups-v2-2-096addc8800d@kernel.org Signed-off-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Cc: Wei Xu <weixugc@google.com> Cc: Yuanchu Xie <yuanchu@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-05mm/memory_hotplug: fix possible race in scan_movable_pages()David Hildenbrand (Arm)1-3/+8
Patch series "mm: memory hot(un)plug and SPARSEMEM cleanups", v2. Some cleanups around memory hot(un)plug and SPARSEMEM. In essence, we can limit CONFIG_MEMORY_HOTPLUG to CONFIG_SPARSEMEM_VMEMMAP, remove some dead code, and move all the hotplug bits over to mm/sparse-vmemmap.c. Some further/related cleanups around other unnecessary code (memory hole handling and complicated usemap allocation). I have some further sparse.c cleanups lying around, and I'm planning on getting rid of bootmem_info.c entirely. This patch (of 15): If a hugetlb folio gets freed while we are in scan_movable_pages(), folio_nr_pages() could return 0, resulting in or'ing "0 - 1 = -1" to the PFN, resulting in PFN = -1. We're not holding any locks or references that would prevent that. for_each_valid_pfn() would then search for the next valid PFN, and could return a PFN that is outside of the range of the original requested range. do_migrate_page() would then try to migrate quite a big range, which is certainly undesirable. To fix it, simply test for valid folio_nr_pages() values. While at it, as PageHuge() really just does a page_folio() internally, we can just use folio_test_hugetlb() on the folio directly. scan_movable_pages() is expected to be fast, and we try to avoid taking locks or grabbing references. We cannot use folio_try_get() as that does not work for free hugetlb folios. We could grab the hugetlb_lock, but that just adds complexity. The race is unlikely to trigger in practice, so we won't be CCing stable. Link: https://lkml.kernel.org/r/20260320-sparsemem_cleanups-v2-0-096addc8800d@kernel.org Link: https://lkml.kernel.org/r/20260320-sparsemem_cleanups-v2-1-096addc8800d@kernel.org Fixes: 16540dae959d ("mm/hugetlb: mm/memory_hotplug: use a folio in scan_movable_pages()") Signed-off-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Cc: Wei Xu <weixugc@google.com> Cc: Yuanchu Xie <yuanchu@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-05mm/swapfile: remove duplicate include of swap_table.hChen Ni1-1/+0
Remove duplicate inclusion of swap_table.h in swapfile.c to clean up redundant code. Link: https://lkml.kernel.org/r/20260318043849.399266-1-nichen@iscas.ac.cn Signed-off-by: Chen Ni <nichen@iscas.ac.cn> Reviewed-by: SeongJae Park <sj@kernel.org> Cc: Baoquan He <bhe@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: Kairui Song <kasong@tencent.com> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: Nhat Pham <nphamcs@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-05Docs/mm/damon/design: document DAMON actions when TRANSPARENT_HUGEPAGE is offAsier Gutierrez1-2/+6
MADV_HUGEPAGE and MADV_NOHUGEPAGE are guarded and they are not available when compiling the kernel without TRANSPARENT_HUGEPAGE option. The DAMON behaviour is to silently fail [1] in when DAMOS_HUGEPAGE or DAMOS_NOHUGEPAGE are used, but TRANSPARENT_HUGEPAGE is disabled. Update the DAMON documentation to reflect this behaviour. Link: https://lkml.kernel.org/r/20260318035349.88715-1-sj@kernel.org Link: https://lore.kernel.org/66131775-180b-4b9f-b7ce-61a3e077b6e6@huawei-partners.com/ [1] Signed-off-by: Asier Gutierrez <gutierrez.asier@huawei-partners.com> Reviewed-by: SeongJae Park <sj@kernel.org> Signed-off-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-05zram: change scan_slots to return voidSergey Senozhatsky1-9/+5
scan_slots_for_writeback() and scan_slots_for_recompress() work in a "best effort" fashion, if they cannot allocate memory for a new pp-slot candidate they just return and post-processing selects slots that were successfully scanned thus far. scan_slots functions never return errors and their callers never check the return status, so convert them to return void. Link: https://lkml.kernel.org/r/20260317032349.753645-1-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Reviewed-by: SeongJae Park <sj@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-05Docs/mm/damon: document exclusivity of special-purpose modulesLiew Rui Yan4-0/+19
Add a section in design.rst to explain that DAMON special-purpose kernel modules (LRU_SORT, RECLAIM, STAT) run in an exclusive manner and return -EBUSY if another is already running. Update lru_sort.rst, reclaim.rst and stat.rst by adding cross-references to this exclusivity rule at the end of their respective Example sections. This change is motivated from another discussion [1]. Link: https://lkml.kernel.org/r/20260315162945.80994-1-sj@kernel.org Link: https://lore.kernel.org/damon/20260314002119.79742-1-sj@kernel.org/T/#t [1] Signed-off-by: Liew Rui Yan <aethernet65535@gmail.com> Reviewed-by: SeongJae Park <sj@kernel.org> Signed-off-by: SeongJae Park <sj@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Shuah Khan <skhan@linuxfoundation.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-05zram: propagate read_from_bdev_async() errorsSergey Senozhatsky1-7/+8
When read_from_bdev_async() fails to chain bio, for instance fails to allocate request or bio, we need to propagate the error condition so that upper layer is aware of it. zram already does that by setting BLK_STS_IOERR ->bi_status, but only for sync reads. Change async read path to return its error status so that async errors are also handled. Link: https://lkml.kernel.org/r/20260316015354.114465-1-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Suggested-by: Brian Geffon <bgeffon@google.com> Acked-by: Brian Geffon <bgeffon@google.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Richard Chang <richardycc@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-05zram: optimize LZ4 dictionary compression performancegao xu1-3/+26
Calling `LZ4_loadDict()` repeatedly in Zram causes significant overhead due to its internal dictionary pre-processing. This commit introduces a template stream mechanism to pre-process the dictionary only once when the dictionary is initially set or modified. It then efficiently copies this state for subsequent compressions. Verification Test Items: Test Platform: android16-6.12 1. Collect Anonymous Page Dataset 1) Apply the following patch: static bool zram_meta_alloc(struct zram *zram, u64 disksize) if (!huge_class_size) - huge_class_size = zs_huge_class_size(zram->mem_pool); + huge_class_size = 0; 2)Install multiple apps and monkey testing until SwapFree is close to 0. 3)Execute the following command to export data: dd if=/dev/block/zram0 of=/data/samples/zram_dump.img bs=4K 2. Train Dictionary Since LZ4 does not have a dedicated dictionary training tool, the zstd tool can be used for training[1]. The command is as follows: zstd --train /data/samples/* --split=4096 --maxdict=64KB -o /vendor/etc/dict_data 3. Test Code adb shell "dd if=/data/samples/zram_dump.img of=/dev/test_pattern bs=4096 count=131072 conv=fsync" adb shell "swapoff /dev/block/zram0" adb shell "echo 1 > /sys/block/zram0/reset" adb shell "echo lz4 > /sys/block/zram0/comp_algorithm" adb shell "echo dict=/vendor/etc/dict_data > /sys/block/zram0/algorithm_params" adb shell "echo 6G > /sys/block/zram0/disksize" echo "Start Compression" adb shell "taskset 80 dd if=/dev/test_pattern of=/dev/block/zram0 bs=4096 count=131072 conv=fsync" echo. echo "Start Decompression" adb shell "taskset 80 dd if=/dev/block/zram0 of=/dev/output_result bs=4096 count=131072 conv=fsync" echo "mm_stat:" adb shell "cat /sys/block/zram0/mm_stat" echo. Note: To ensure stable test results, it is best to lock the CPU frequency before executing the test. LZ4 supports dictionaries up to 64KB. Below are the test results for compression rates at various dictionary sizes: dict_size base patch 4 KB 156M/s 219M/s 8 KB 136M/s 217M/s 16KB 98M/s 214M/s 32KB 66M/s 225M/s 64KB 38M/s 224M/s When an LZ4 compression dictionary is enabled, compression speed is negatively impacted by the dictionary's size; larger dictionaries result in slower compression. This patch eliminates the influence of dictionary size on compression speed, ensuring consistent performance regardless of dictionary scale. Link: https://lkml.kernel.org/r/698181478c9c4b10aa21b4a847bdc706@honor.com Link: https://github.com/lz4/lz4?tab=readme-ov-file [1] Signed-off-by: gao xu <gaoxu2@honor.com> Acked-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Minchan Kim <minchan@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-05mm/khugepaged: unify khugepaged and madv_collapse with collapse_single_pmd()Nico Pache1-70/+72
The khugepaged daemon and madvise_collapse have two different implementations that do almost the same thing. Create collapse_single_pmd to increase code reuse and create an entry point to these two users. Refactor madvise_collapse and collapse_scan_mm_slot to use the new collapse_single_pmd function. To help reduce confusion around the mmap_locked variable, we rename mmap_locked to lock_dropped in the collapse_scan_mm_slot() function, and remove the redundant mmap_locked in madvise_collapse(); this further unifies the code readiblity. the SCAN_PTE_MAPPED_HUGEPAGE enum is no longer reachable in the madvise_collapse() function, so we drop it from the list of "continuing" enums. This introduces a minor behavioral change that is most likely an undiscovered bug. The current implementation of khugepaged tests collapse_test_exit_or_disable() before calling collapse_pte_mapped_thp, but we weren't doing it in the madvise_collapse case. By unifying these two callers madvise_collapse now also performs this check. We also modify the return value to be SCAN_ANY_PROCESS which properly indicates that this process is no longer valid to operate on. By moving the madvise_collapse writeback-retry logic into the helper function we can also avoid having to revalidate the VMA. We guard the khugepaged_pages_collapsed variable to ensure its only incremented for khugepaged. As requested we also convert a VM_BUG_ON to a VM_WARN_ON. Link: https://lkml.kernel.org/r/20260325114022.444081-6-npache@redhat.com Signed-off-by: Nico Pache <npache@redhat.com> Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Reviewed-by: Lance Yang <lance.yang@linux.dev> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Cc: Alistair Popple <apopple@nvidia.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Barry Song <baohua@kernel.org> Cc: Brendan Jackman <jackmanb@google.com> Cc: Byungchul Park <byungchul@sk.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David Rientjes <rientjes@google.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Gregory Price <gourry@gourry.net> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: Jann Horn <jannh@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Joshua Hahn <joshua.hahnjy@gmail.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nanyong Sun <sunnanyong@huawei.com> Cc: Pedro Falcato <pfalcato@suse.de> Cc: Peter Xu <peterx@redhat.com> Cc: Rafael Aquini <raquini@redhat.com> Cc: Rakie Kim <rakie.kim@sk.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Shivank Garg <shivankg@amd.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Takashi Iwai (SUSE) <tiwai@suse.de> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com> Cc: Usama Arif <usamaarif642@gmail.com> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Will Deacon <will@kernel.org> Cc: Yang Shi <yang@os.amperecomputing.com> Cc: Zach O'Keefe <zokeefe@google.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-05mm/khugepaged: rename hpage_collapse_* to collapse_*Nico Pache2-32/+30
The hpage_collapse functions describe functions used by madvise_collapse and khugepaged. remove the unnecessary hpage prefix to shorten the function name. Link: https://lkml.kernel.org/r/20260325114022.444081-5-npache@redhat.com Signed-off-by: Nico Pache <npache@redhat.com> Reviewed-by: Dev Jain <dev.jain@arm.com> Reviewed-by: Wei Yang <richard.weiyang@gmail.com> Reviewed-by: Lance Yang <lance.yang@linux.dev> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Cc: Alistair Popple <apopple@nvidia.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Barry Song <baohua@kernel.org> Cc: Brendan Jackman <jackmanb@google.com> Cc: Byungchul Park <byungchul@sk.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David Rientjes <rientjes@google.com> Cc: Gregory Price <gourry@gourry.net> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: Jann Horn <jannh@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Joshua Hahn <joshua.hahnjy@gmail.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nanyong Sun <sunnanyong@huawei.com> Cc: Pedro Falcato <pfalcato@suse.de> Cc: Peter Xu <peterx@redhat.com> Cc: Rafael Aquini <raquini@redhat.com> Cc: Rakie Kim <rakie.kim@sk.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Shivank Garg <shivankg@amd.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Takashi Iwai (SUSE) <tiwai@suse.de> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com> Cc: Usama Arif <usamaarif642@gmail.com> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Cc: Yang Shi <yang@os.amperecomputing.com> Cc: Zach O'Keefe <zokeefe@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-05mm/khugepaged: define KHUGEPAGED_MAX_PTES_LIMIT as HPAGE_PMD_NR - 1Nico Pache1-4/+5
The value (HPAGE_PMD_NR - 1) is used often in the khugepaged code to signify the limit of the max_ptes_* values. Add a define for this to increase code readability and reuse. Link: https://lkml.kernel.org/r/20260325114022.444081-4-npache@redhat.com Signed-off-by: Nico Pache <npache@redhat.com> Acked-by: Pedro Falcato <pfalcato@suse.de> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Suggested-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Barry Song <baohua@kernel.org> Cc: Brendan Jackman <jackmanb@google.com> Cc: Byungchul Park <byungchul@sk.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David Rientjes <rientjes@google.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Gregory Price <gourry@gourry.net> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: Jann Horn <jannh@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Joshua Hahn <joshua.hahnjy@gmail.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Lance Yang <lance.yang@linux.dev> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nanyong Sun <sunnanyong@huawei.com> Cc: Peter Xu <peterx@redhat.com> Cc: Rafael Aquini <raquini@redhat.com> Cc: Rakie Kim <rakie.kim@sk.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Shivank Garg <shivankg@amd.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Takashi Iwai (SUSE) <tiwai@suse.de> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com> Cc: Usama Arif <usamaarif642@gmail.com> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Will Deacon <will@kernel.org> Cc: Yang Shi <yang@os.amperecomputing.com> Cc: Zach O'Keefe <zokeefe@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-05mm: introduce is_pmd_order helperNico Pache7-10/+14
In order to add mTHP support to khugepaged, we will often be checking if a given order is (or is not) a PMD order. Some places in the kernel already use this check, so lets create a simple helper function to keep the code clean and readable. Link: https://lkml.kernel.org/r/20260325114022.444081-3-npache@redhat.com Signed-off-by: Nico Pache <npache@redhat.com> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Dev Jain <dev.jain@arm.com> Reviewed-by: Wei Yang <richard.weiyang@gmail.com> Reviewed-by: Lance Yang <lance.yang@linux.dev> Reviewed-by: Barry Song <baohua@kernel.org> Reviewed-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Pedro Falcato <pfalcato@suse.de> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Suggested-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Brendan Jackman <jackmanb@google.com> Cc: Byungchul Park <byungchul@sk.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David Rientjes <rientjes@google.com> Cc: Gregory Price <gourry@gourry.net> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: Jann Horn <jannh@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Joshua Hahn <joshua.hahnjy@gmail.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nanyong Sun <sunnanyong@huawei.com> Cc: Peter Xu <peterx@redhat.com> Cc: Rafael Aquini <raquini@redhat.com> Cc: Rakie Kim <rakie.kim@sk.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Shivank Garg <shivankg@amd.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Takashi Iwai (SUSE) <tiwai@suse.de> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com> Cc: Usama Arif <usamaarif642@gmail.com> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Cc: Yang Shi <yang@os.amperecomputing.com> Cc: Zach O'Keefe <zokeefe@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-05mm: consolidate anonymous folio PTE mapping into helpersNico Pache2-20/+45
Patch series "mm: khugepaged cleanups and mTHP prerequisites", v4. The following series contains cleanups and prerequisites for my work on khugepaged mTHP support [1]. These have been separated out to ease review. The first patch in the series refactors the page fault folio to pte mapping and follows a similar convention as defined by map_anon_folio_pmd_(no)pf(). This not only cleans up the current implementation of do_anonymous_page(), but will allow for reuse later in the khugepaged mTHP implementation. The second patch adds a small is_pmd_order() helper to check if an order is the PMD order. This check is open-coded in a number of places. This patch aims to clean this up and will be used more in the khugepaged mTHP work. The third patch also adds a small DEFINE for (HPAGE_PMD_NR - 1) which is used often across the khugepaged code. The fourth and fifth patch come from the khugepaged mTHP patchset [1]. These two patches include the rename of function prefixes, and the unification of khugepaged and madvise_collapse via a new collapse_single_pmd function. Patch 1: refactor do_anonymous_page into map_anon_folio_pte_(no)pf Patch 2: add is_pmd_order helper Patch 3: Add define for (HPAGE_PMD_NR - 1) Patch 4: Refactor/rename hpage_collapse Patch 5: Refactoring to combine madvise_collapse and khugepaged A big thanks to everyone that has reviewed, tested, and participated in the development process. This patch (of 5): The anonymous page fault handler in do_anonymous_page() open-codes the sequence to map a newly allocated anonymous folio at the PTE level: - construct the PTE entry - add rmap - add to LRU - set the PTEs - update the MMU cache. Introduce two helpers to consolidate this duplicated logic, mirroring the existing map_anon_folio_pmd_nopf() pattern for PMD-level mappings: map_anon_folio_pte_nopf(): constructs the PTE entry, takes folio references, adds anon rmap and LRU. This function also handles the uffd_wp that can occur in the pf variant. The future khugepaged mTHP code calls this to handle mapping the new collapsed mTHP to its folio. map_anon_folio_pte_pf(): extends the nopf variant to handle MM_ANONPAGES counter updates, and mTHP fault allocation statistics for the page fault path. The zero-page read path in do_anonymous_page() is also untangled from the shared setpte label, since it does not allocate a folio and should not share the same mapping sequence as the write path. We can now leave nr_pages undeclared at the function intialization, and use the single page update_mmu_cache function to handle the zero page update. This refactoring will also help reduce code duplication between mm/memory.c and mm/khugepaged.c, and provides a clean API for PTE-level anonymous folio mapping that can be reused by future callers (like khugpeaged mTHP support) Link: https://lkml.kernel.org/r/20260325114022.444081-1-npache@redhat.com Link: https://lkml.kernel.org/r/20260325114022.444081-2-npache@redhat.com Link: https://lore.kernel.org/all/20260122192841.128719-1-npache@redhat.com Signed-off-by: Nico Pache <npache@redhat.com> Suggested-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Reviewed-by: Dev Jain <dev.jain@arm.com> Reviewed-by: Lance Yang <lance.yang@linux.dev> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Cc: Alistair Popple <apopple@nvidia.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Brendan Jackman <jackmanb@google.com> Cc: Byungchul Park <byungchul@sk.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David Rientjes <rientjes@google.com> Cc: Gregory Price <gourry@gourry.net> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: Jann Horn <jannh@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Joshua Hahn <joshua.hahnjy@gmail.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nanyong Sun <sunnanyong@huawei.com> Cc: Pedro Falcato <pfalcato@suse.de> Cc: Peter Xu <peterx@redhat.com> Cc: Rafael Aquini <raquini@redhat.com> Cc: Rakie Kim <rakie.kim@sk.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Shivank Garg <shivankg@amd.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Takashi Iwai (SUSE) <tiwai@suse.de> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com> Cc: Usama Arif <usamaarif642@gmail.com> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Will Deacon <will@kernel.org> Cc: Yang Shi <yang@os.amperecomputing.com> Cc: Zach O'Keefe <zokeefe@google.com> Cc: Zi Yan <ziy@nvidia.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-05mm/userfaultfd: fix hugetlb fault mutex hash calculationJianhui Zhou2-1/+18
In mfill_atomic_hugetlb(), linear_page_index() is used to calculate the page index for hugetlb_fault_mutex_hash(). However, linear_page_index() returns the index in PAGE_SIZE units, while hugetlb_fault_mutex_hash() expects the index in huge page units. This mismatch means that different addresses within the same huge page can produce different hash values, leading to the use of different mutexes for the same huge page. This can cause races between faulting threads, which can corrupt the reservation map and trigger the BUG_ON in resv_map_release(). Fix this by introducing hugetlb_linear_page_index(), which returns the page index in huge page granularity, and using it in place of linear_page_index(). Link: https://lkml.kernel.org/r/20260310110526.335749-1-jianhuizzzzz@gmail.com Fixes: a08c7193e4f1 ("mm/filemap: remove hugetlb special casing in filemap.c") Signed-off-by: Jianhui Zhou <jianhuizzzzz@gmail.com> Reported-by: syzbot+f525fd79634858f478e7@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=f525fd79634858f478e7 Acked-by: SeongJae Park <sj@kernel.org> Reviewed-by: David Hildenbrand (Arm) <david@kernel.org> Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Jane Chu <jane.chu@oracle.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: JonasZhou <JonasZhou@zhaoxin.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Oscar Salvador <osalvador@suse.de> Cc: Peter Xu <peterx@redhat.com> Cc: SeongJae Park <sj@kernel.org> Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-05mm/damon/lru_sort: respect addr_unit on default monitoring region setupSeongJae Park1-6/+0
In the past, damon_set_region_biggest_system_ram_default(), which is the core function for setting the default monitoring target region of DAMON_LRU_SORT, didn't support addr_unit. Hence DAMON_LRU_SORT was silently ignoring the user input for addr_unit when the user doesn't explicitly set the monitoring target regions, and therefore the default target region is being used. No real problem from that ignorance was reported so far. But, the implicit rule is only making things confusing. Also, the default target region setup function is updated to support addr_unit. Hence there is no reason to keep ignoring it. Respect the user-passed addr_unit for the default target monitoring region use case. Link: https://lkml.kernel.org/r/20260311052927.93921-6-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Yang yingliang <yangyingliang@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-05mm/damon/reclaim: respect addr_unit on default monitoring region setupSeongJae Park1-6/+0
In the past, damon_set_region_biggest_system_ram_default(), which is the core function for setting the default monitoring target region of DAMON_RECLAIM, didn't support addr_unit. Hence DAMON_RECLAIM was silently ignoring the user input for addr_unit when the user doesn't explicitly set the monitoring target regions, and therefore the default target region is being used. No real problem from that ignorance was reported so far. But, the implicit rule is only making things confusing. Also, the default target region setup function is updated to support addr_unit. Hence there is no reason to keep ignoring it. Respect the user-passed addr_unit for the default target monitoring region use case. Link: https://lkml.kernel.org/r/20260311052927.93921-5-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Yang yingliang <yangyingliang@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-05mm/damon/core: fix wrong damon_set_regions() argumentSeongJae Park1-1/+1
The third argument is the length of the second parameter. But addr_unit is wrongly being passed. Fix it. Link: https://lkml.kernel.org/r/20260314001854.79623-1-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Yang yingliang <yangyingliang@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-05mm/damon/core: receive addr_unit on ↵SeongJae Park4-3/+7
damon_set_region_biggest_system_ram_default() damon_find_biggest_system_ram() was not supporting addr_unit in the past. Hence, its caller, damon_set_region_biggest_system_ram_default(), was also not supporting addr_unit. The previous commit has updated the inner function to support addr_unit. There is no more reason to not support addr_unit on damon_set_region_biggest_system_ram_default(). Rather, it makes unnecessary inconsistency on support of addr_unit. Update it to receive addr_unit and handle it inside. Link: https://lkml.kernel.org/r/20260311052927.93921-4-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Yang yingliang <yangyingliang@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-05mm/damon/core: support addr_unit on damon_find_biggest_system_ram()SeongJae Park1-11/+23
damon_find_biggest_system_ram() sets an 'unsigned long' variable with 'resource_size_t' value. This is fundamentally wrong. On environments such as ARM 32 bit machines having LPAE (Large Physical Address Extensions), which DAMON supports, the size of 'unsigned long' may be smaller than that of 'resource_size_t'. It is safe, though, since we restrict the walk to be done only up to ULONG_MAX. DAMON supports the address size gap using 'addr_unit'. We didn't add the support to the function, just to make the initial support change small. Now the support is reasonably settled. This kind of gap is only making the code inconsistent and easy to be confused. Add the support of 'addr_unit' to the function, by letting callers pass the 'addr_unit' and handling it in the function. All callers are passing 'addr_unit' 1, though, to keep the old behavior. [sj@kernel.org: verify found biggest system ram] Link: https://lkml.kernel.org/r/20260317144725.88524-1-sj@kernel.org Link: https://lkml.kernel.org/r/20260311052927.93921-3-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Yang yingliang <yangyingliang@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-05mm/damon/core: fix wrong end address assignment on walk_system_ram()SeongJae Park1-1/+1
Patch series "mm/damon: support addr_unit on default monitoring targets for modules". DAMON_RECLAIM and DAMON_LRU_SORT support 'addr_unit' parameters only when the monitoring target address range is explicitly set. This was intentional for making the initial 'addr_unit' support change small. Now 'addr_unit' support is being quite stabilized. Having the corner case of the support is only making the code inconsistent with implicit rules. The inconsistency makes it easy to confuse [1] readers. After all, there is no real reason to keep 'addr_unit' support incomplete. Add the support for the case to improve the readability and more completely support 'addr_unit'. This series is constructed with five patches. The first one (patch 1) fixes a small bug that mistakenly assigns inclusive end address to open end address, which was found from this work. The second and third ones (patches 2 and 3) extend the default monitoring target setting functions in the core layer one by one, to support the 'addr_unit' while making no visible changes. The final two patches (patches 4 and 5) update DAMON_RECLAIM and DAMON_LRU_SORT to support 'addr_unit' for the default monitoring target address ranges, by passing the user input to the core functions. This patch (of 5): 'struct damon_addr_range' and 'struct resource' represent different types of address ranges. 'damon_addr_range' is for end-open ranges ([start, end)). 'resource' is for fully-closed ranges ([start, end]). But walk_system_ram() is assigning resource->end to damon_addr_range->end without the inclusiveness adjustment. As a result, the function returns an address range that is missing the last one byte. The function is being used to find and set the biggest system ram as the default monitoring target for DAMON_RECLAIM and DAMON_LRU_SORT. Missing the last byte of the big range shouldn't be a real problem for the real use cases. That said, the loss is definitely an unintended behavior. Do the correct adjustment. Link: https://lkml.kernel.org/r/20260311052927.93921-1-sj@kernel.org Link: https://lkml.kernel.org/r/20260311052927.93921-2-sj@kernel.org Link: https://lore.kernel.org/20260131015643.79158-1-sj@kernel.org [1] Fixes: 43b0536cb471 ("mm/damon: introduce DAMON-based Reclamation (DAMON_RECLAIM)") Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Yang yingliang <yangyingliang@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-05mm/mremap: check map count under mmap write lock and abstractLorenzo Stoakes (Oracle)1-13/+75
We are checking the mmap count in check_mremap_params(), prior to obtaining an mmap write lock, which means that accesses to current->mm->map_count might race with this field being updated. Resolve this by only checking this field after the mmap write lock is held. Additionally, abstract this check into a helper function with extensive ASCII documentation of what's going on. Link: https://lkml.kernel.org/r/18be0b48eaa8e8804eb745974ee729c3ade0c687.1773249037.git.ljs@kernel.org Signed-off-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Reported-by: Jianzhou Zhao <luckd0g@163.com> Closes: https://lore.kernel.org/all/1a7d4c26.6b46.19cdbe7eaf0.Coremail.luckd0g@163.com/ Reviewed-by: Pedro Falcato <pfalcato@suse.de> Cc: Jann Horn <jannh@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-05mm: abstract reading sysctl_max_map_count, and READ_ONCE()Lorenzo Stoakes (Oracle)9-12/+24
Concurrent reads and writes of sysctl_max_map_count are possible, so we should READ_ONCE() and WRITE_ONCE(). The sysctl procfs logic already enforces WRITE_ONCE(), so abstract the read side with get_sysctl_max_map_count(). While we're here, also move the field to mm/internal.h and add the getter there since only mm interacts with it, there's no need for anybody else to have access. Finally, update the VMA userland tests to reflect the change. Link: https://lkml.kernel.org/r/0715259eb37cbdfde4f9e5db92a20ec7110a1ce5.1773249037.git.ljs@kernel.org Signed-off-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Reviewed-by: Pedro Falcato <pfalcato@suse.de> Cc: Jann Horn <jannh@google.com> Cc: Jianzhou Zhao <luckd0g@163.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-05mm/mremap: correct invalid map count checkLorenzo Stoakes (Oracle)1-16/+12
Patch series "mm: improve map count checks". Firstly, in mremap(), it appears that our map count checks have been overly conservative - there is simply no reason to require that we have headroom of 4 mappings prior to moving the VMA, we only need headroom of 2 VMAs since commit 659ace584e7a ("mmap: don't return ENOMEM when mapcount is temporarily exceeded in munmap()"). Likely the original headroom of 4 mappings was a mistake, and 3 was actually intended. Next, we access sysctl_max_map_count in a number of places without being all that careful about how we do so. We introduce a simple helper that READ_ONCE()'s the field (get_sysctl_max_map_count()) to ensure that the field is accessed correctly. The WRITE_ONCE() side is already handled by the sysctl procfs code in proc_int_conv(). We also move this field to internal.h as there's no reason for anybody else to access it outside of mm. Unfortunately we have to maintain the extern variable, as mmap.c implements the procfs code. Finally, we are accessing current->mm->map_count without holding the mmap write lock, which is also not correct, so this series ensures the lock is head before we access it. We also abstract the check to a helper function, and add ASCII diagrams to explain why we're doing what we're doing. This patch (of 3): We currently check to see, if on moving a VMA when doing mremap(), if it might violate the sys.vm.max_map_count limit. This was introduced in the mists of time prior to 2.6.12. At this point in time, as now, the move_vma() operation would copy the VMA (+1 mapping if not merged), then potentially split the source VMA upon unmap. Prior to commit 659ace584e7a ("mmap: don't return ENOMEM when mapcount is temporarily exceeded in munmap()"), a VMA split would check whether mm->map_count >= sysctl_max_map_count prior to a split before it ran. On unmap of the source VMA, if we are moving a partial VMA, we might split the VMA twice. This would mean, on invocation of split_vma() (as was), we'd check whether mm->map_count >= sysctl_max_map_count with a map count elevated by one, then again with a map count elevated by two, ending up with a map count elevated by three. At this point we'd reduce the map count on unmap. At the start of move_vma(), there was a check that has remained throughout mremap()'s history of mm->map_count >= sysctl_max_map_count - 3 (which implies mm->mmap_count + 4 > sysctl_max_map_count - that is, we must have headroom for 4 additional mappings). After mm->map_count is elevated by 3, it is decremented by one once the unmap completes. The mmap write lock is held, so nothing else will observe mm->map_count > sysctl_max_map_count. It appears this check was always incorrect - it should have either be one of 'mm->map_count > sysctl_max_map_count - 3' or 'mm->map_count >= sysctl_max_map_count - 2'. After commit 659ace584e7a ("mmap: don't return ENOMEM when mapcount is temporarily exceeded in munmap()"), the map count check on split is eliminated in the newly introduced __split_vma(), which the unmap path uses, and has that path check whether mm->map_count >= sysctl_max_map_count. This is valid since, net, an unmap can only cause an increase in map count of 1 (split both sides, unmap middle). Since we only copy a VMA and (if MREMAP_DONTUNMAP is not set) unmap afterwards, the maximum number of additional mappings that will actually be subject to any check will be 2. Therefore, update the check to assert this corrected value. Additionally, update the check introduced by commit ea2c3f6f5545 ("mm,mremap: bail out earlier in mremap_to under map pressure") to account for this. While we're here, clean up the comment prior to that. Link: https://lkml.kernel.org/r/cover.1773249037.git.ljs@kernel.org Link: https://lkml.kernel.org/r/73e218c67dcd197c5331840fb011e2c17155bfb0.1773249037.git.ljs@kernel.org Signed-off-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Reviewed-by: Pedro Falcato <pfalcato@suse.de> Cc: Jann Horn <jannh@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Cc: Jianzhou Zhao <luckd0g@163.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-05kasan: update outdated commentKexin Sun1-4/+4
kmalloc_large() was renamed kmalloc_large_noprof() by commit 7bd230a26648 ("mm/slab: enable slab allocation tagging for kmalloc and friends"), and subsequently renamed __kmalloc_large_noprof() by commit a0a44d9175b3 ("mm, slab: don't wrap internal functions with alloc_hooks()"), making it an internal implementation detail. Large kmalloc allocations are now performed through the public kmalloc() interface directly, making the reference to KMALLOC_MAX_SIZE also stale (KMALLOC_MAX_CACHE_SIZE would be more accurate). Remove the references to kmalloc_large() and KMALLOC_MAX_SIZE, and rephrase the description for large kmalloc allocations. Link: https://lkml.kernel.org/r/20260312053812.1365-1-kexinsun@smail.nju.edu.cn Signed-off-by: Kexin Sun <kexinsun@smail.nju.edu.cn> Suggested-by: Andrey Ryabinin <ryabinin.a.a@gmail.com> Assisted-by: unnamed:deepseek-v3.2 coccinelle Reviewed-by: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Julia Lawall <julia.lawall@inria.fr> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-05mm: migrate: requeue destination folio on deferred split queueUsama Arif1-0/+17
During folio migration, __folio_migrate_mapping() removes the source folio from the deferred split queue, but the destination folio is never re-queued. This causes underutilized THPs to escape the shrinker after NUMA migration, since they silently drop off the deferred split list. Fix this by recording whether the source folio was on the deferred split queue and its partially mapped state before move_to_new_folio() unqueues it, and re-queuing the destination folio after a successful migration if it was. By the time migrate_folio_move() runs, partially mapped folios without a pin have already been split by migrate_pages_batch(). So only two cases remain on the deferred list at this point: 1. Partially mapped folios with a pin (split failed). 2. Fully mapped but potentially underused folios. The recorded partially_mapped state is forwarded to deferred_split_folio() so that the destination folio is correctly re-queued in both cases. Because THPs are removed from the deferred_list, THP shinker cannot split the underutilized THPs in time. As a result, users will show less free memory than before. Link: https://lkml.kernel.org/r/20260312104723.1351321-1-usama.arif@linux.dev Fixes: dafff3f4c850 ("mm: split underused THPs") Signed-off-by: Usama Arif <usama.arif@linux.dev> Reported-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Zi Yan <ziy@nvidia.com> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Acked-by: SeongJae Park <sj@kernel.org> Reviewed-by: Wei Yang <richard.weiyang@gmail.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Byungchul Park <byungchul@sk.com> Cc: Gregory Price <gourry@gourry.net> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Joshua Hahn <joshua.hahnjy@gmail.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Nico Pache <npache@redhat.com> Cc: Rakie Kim <rakie.kim@sk.com> Cc: Ying Huang <ying.huang@linux.alibaba.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-05selftest: memcg: skip memcg_sock test if address family not supportedWaiman Long1-1/+10
The test_memcg_sock test in memcontrol.c sets up an IPv6 socket and send data over it to consume memory and verify that memory.stat.sock and memory.current values are close. On systems where IPv6 isn't enabled or not configured to support SOCK_STREAM, the test_memcg_sock test always fails. When the socket() call fails, there is no way we can test the memory consumption and verify the above claim. I believe it is better to just skip the test in this case instead of reporting a test failure hinting that there may be something wrong with the memcg code. Link: https://lkml.kernel.org/r/20260311200526.885899-1-longman@redhat.com Fixes: 5f8f019380b8 ("selftests: cgroup/memcontrol: add basic test for socket accounting") Signed-off-by: Waiman Long <longman@redhat.com> Acked-by: Michal Koutný <mkoutny@suse.com> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Michal Koutný <mkoutny@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shuah Khan <shuah@kernel.org> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-05selftests/mm: pagemap_ioctl: remove hungarian notationMike Rapoport (Microsoft)1-10/+10
Replace lpBaseAddress with addr and dwRegionSize with size. Link: https://lkml.kernel.org/r/20260311180737.3767545-1-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-05mm: ratelimit min_free_kbytes adjustment messagesBreno Leitao2-4/+4
The "raising min_free_kbytes" pr_info message in set_recommended_min_free_kbytes() and the "min_free_kbytes is not updated to" pr_warn in calculate_min_free_kbytes() can spam the kernel log when called repeatedly. Switch the pr_info in set_recommended_min_free_kbytes() and the pr_warn in calculate_min_free_kbytes() to their _ratelimited variants to prevent the log spam for this message. Link: https://lkml.kernel.org/r/20260317-thp_logs-v7-4-31eb98fa5a8b@debian.org Signed-off-by: Breno Leitao <leitao@debian.org> Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Acked-by: Zi Yan <ziy@nvidia.com> Cc: Barry Song <baohua@kernel.org> Cc: Brendan Jackman <jackmanb@google.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Lance Yang <lance.yang@linux.dev> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nico Pache <npache@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Usama Arif <usamaarif642@gmail.com> Cc: Vlastimil Babka <vbabka@kernel.org> Cc: Wei Yang <richard.weiyang@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-05mm: huge_memory: refactor enabled_store() with set_global_enabled_mode()Breno Leitao1-15/+48
Refactor enabled_store() to use a new set_global_enabled_mode() helper. Introduce a separate enum global_enabled_mode and global_enabled_mode_strings[], mirroring the anon_enabled_mode pattern from the previous commit. A separate enum is necessary because the global THP setting does not support "inherit", only "always", "madvise", and "never". Reusing anon_enabled_mode would leave a NULL gap in the string array, causing sysfs_match_string() to stop early and fail to match entries after the gap. The helper uses the same loop pattern as set_anon_enabled_mode(), iterating over an array of flag bit positions and using test_and_set_bit()/test_and_clear_bit() to track whether the state actually changed. Link: https://lkml.kernel.org/r/20260317-thp_logs-v7-3-31eb98fa5a8b@debian.org Signed-off-by: Breno Leitao <leitao@debian.org> Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Reviewed-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Wei Yang <richard.weiyang@gmail.com> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Cc: Barry Song <baohua@kernel.org> Cc: Brendan Jackman <jackmanb@google.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Lance Yang <lance.yang@linux.dev> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nico Pache <npache@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Usama Arif <usamaarif642@gmail.com> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-05mm: huge_memory: refactor anon_enabled_store() with set_anon_enabled_mode()Breno Leitao1-32/+52
Consolidate the repeated spin_lock/set_bit/clear_bit pattern in anon_enabled_store() into a new set_anon_enabled_mode() helper that loops over an orders[] array, setting the bit for the selected mode and clearing the others. Introduce enum anon_enabled_mode and anon_enabled_mode_strings[] for the per-order anon THP setting. Use sysfs_match_string() with the anon_enabled_mode_strings[] table to replace the if/else chain of sysfs_streq() calls. The helper uses __test_and_set_bit()/__test_and_clear_bit() to track whether the state actually changed, so start_stop_khugepaged() is only called when needed. When the mode is unchanged, set_recommended_min_free_kbytes() is called directly to preserve the watermark recalculation behavior of the original code. Link: https://lkml.kernel.org/r/20260317-thp_logs-v7-2-31eb98fa5a8b@debian.org Signed-off-by: Breno Leitao <leitao@debian.org> Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Brendan Jackman <jackmanb@google.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Lance Yang <lance.yang@linux.dev> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nico Pache <npache@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Usama Arif <usamaarif642@gmail.com> Cc: Vlastimil Babka <vbabka@kernel.org> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-05mm: khugepaged: export set_recommended_min_free_kbytes()Breno Leitao2-1/+6
Patch series "mm: thp: reduce unnecessary start_stop_khugepaged()", v7. Writing to /sys/kernel/mm/transparent_hugepage/enabled causes start_stop_khugepaged() called independent of any change. start_stop_khugepaged() SPAMs the printk ring buffer overflow with the exact same message, even when nothing changes. For instance, if you have a custom vm.min_free_kbytes, just touching /sys/kernel/mm/transparent_hugepage/enabled causes a printk message. Example: # sysctl -w vm.min_free_kbytes=112382 # for i in $(seq 100); do echo never > /sys/kernel/mm/transparent_hugepage/enabled ; done and you have 100 WARN messages like the following, which is pretty dull: khugepaged: min_free_kbytes is not updated to 112381 because user defined value 112382 is preferred A similar message shows up when setting thp to "always": # for i in $(seq 100); do # echo 1024 > /proc/sys/vm/min_free_kbytes # echo always > /sys/kernel/mm/transparent_hugepage/enabled # done And then, we have 100 messages like: khugepaged: raising min_free_kbytes from 1024 to 67584 to help transparent hugepage allocations This is more common when you have a configuration management system that writes the THP configuration without an extra read, assuming that nothing will happen if there is no change in the configuration, but it prints these annoying messages. For instance, at Meta's fleet, ~10K servers were producing 3.5M of these messages per day. Fix this by making the sysfs _store helpers easier to digest and ratelimiting the message. This patch (of 4): Make set_recommended_min_free_kbytes() callable from outside khugepaged.c by removing the static qualifier and adding a declaration in mm/internal.h. This allows callers that change THP settings to recalculate watermarks without going through start_stop_khugepaged(). Link: https://lkml.kernel.org/r/20260317-thp_logs-v7-0-31eb98fa5a8b@debian.org Link: https://lkml.kernel.org/r/20260317-thp_logs-v7-1-31eb98fa5a8b@debian.org Signed-off-by: Breno Leitao <leitao@debian.org> Suggested-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: Zi Yan <ziy@nvidia.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Brendan Jackman <jackmanb@google.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Lance Yang <lance.yang@linux.dev> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nico Pache <npache@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Usama Arif <usamaarif642@gmail.com> Cc: Vlastimil Babka <vbabka@kernel.org> Cc: Wei Yang <richard.weiyang@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-05selftests/damon/sysfs.py: test goal_tuner commitSeongJae Park1-0/+7
Extend the near-full DAMON parameters commit selftest to commit goal_tuner and confirm the internal status is updated as expected. Link: https://lkml.kernel.org/r/20260310010529.91162-12-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-05selftests/damon/drgn_dump_damon_status: support quota goal_tuner dumpingSeongJae Park1-0/+1
Update drgn_dump_damon_status.py, which is being used to dump the in-kernel DAMON status for tests, to dump goal_tuner setup status. Link: https://lkml.kernel.org/r/20260310010529.91162-11-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-05selftests/damon/_damon_sysfs: support goal_tuner setupSeongJae Park1-3/+9
Add support of goal_tuner setup to the test-purpose DAMON sysfs interface control helper, _damon_sysfs.py. Link: https://lkml.kernel.org/r/20260310010529.91162-10-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-05mm/damon/tests/core-kunit: test goal_tuner commitSeongJae Park1-0/+3
Extend damos_commit_quota() kunit test for the newly added goal_tuner parameter. Link: https://lkml.kernel.org/r/20260310010529.91162-9-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-05Docs/ABI/damon: update for goal_tunerSeongJae Park1-0/+6
Update the ABI document for the newly added goal_tuner sysfs file. Link: https://lkml.kernel.org/r/20260310010529.91162-8-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-05Docs/admin-guide/mm/damon/usage: document goal_tuner sysfs fileSeongJae Park1-4/+12
Update the DAMON usage document for the new sysfs file for the goal based quota auto-tuning algorithm selection. Link: https://lkml.kernel.org/r/20260310010529.91162-7-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>