kernel/linux.git/include/linux/swap.h, branch v6.18.21

mm, swap: use the swap table for the swap cache and switch API

2025-09-21T21:22:24+00:00

Introduce basic swap table infrastructures, which are now just a fixed-sized flat array inside each swap cluster, with access wrappers. Each cluster contains a swap table of 512 entries. Each table entry is an opaque atomic long. It could be in 3 types: a shadow type (XA_VALUE), a folio type (pointer), or NULL. In this first step, it only supports storing a folio or shadow, and it is a drop-in replacement for the current swap cache. Convert all swap cache users to use the new sets of APIs. Chris Li has been suggesting using a new infrastructure for swap cache for better performance, and that idea combined well with the swap table as the new backing structure. Now the lock contention range is reduced to 2M clusters, which is much smaller than the 64M address_space. And we can also drop the multiple address_space design. All the internal works are done with swap_cache_get_* helpers. Swap cache lookup is still lock-less like before, and the helper's contexts are same with original swap cache helpers. They still require a pin on the swap device to prevent the backing data from being freed. Swap cache updates are now protected by the swap cluster lock instead of the XArray lock. This is mostly handled internally, but new __swap_cache_* helpers require the caller to lock the cluster. So, a few new cluster access and locking helpers are also introduced. A fully cluster-based unified swap table can be implemented on top of this to take care of all count tracking and synchronization work, with dynamic allocation. It should reduce the memory usage while making the performance even better. Link: https://lkml.kernel.org/r/20250916160100.31545-12-ryncsn@gmail.com Co-developed-by: Chris Li Signed-off-by: Chris Li Signed-off-by: Kairui Song Acked-by: Chris Li Suggested-by: Chris Li Cc: Baolin Wang Cc: Baoquan He Cc: Barry Song Cc: David Hildenbrand Cc: "Huang, Ying" Cc: Hugh Dickins Cc: Johannes Weiner Cc: Kemeng Shi Cc: kernel test robot Cc: Lorenzo Stoakes Cc: Matthew Wilcox (Oracle) Cc: Nhat Pham Cc: Yosry Ahmed Cc: Zi Yan Cc: SeongJae Park Signed-off-by: Andrew Morton

mm, swap: tidy up swap device and cluster info helpers

2025-09-21T21:22:23+00:00

swp_swap_info is the most commonly used helper for retrieving swap info. It has an internal check that may lead to a NULL return value, but almost none of its caller checks the return value, making the internal check pointless. In fact, most of these callers already ensured the entry is valid and never expect a NULL value. Tidy this up and improve the function names. If the caller can make sure the swap entry/type is valid and the device is pinned, use the new introduced __swap_entry_to_info/__swap_type_to_info instead. They have more debug sanity checks and lower overhead as they are inlined. Callers that may expect a NULL value should use swap_entry_to_info/swap_type_to_info instead. No feature change. The rearranged codes should have had no effect, or they should have been hitting NULL de-ref bugs already. Only some new sanity checks are added so potential issues may show up in debug build. The new helpers will be frequently used with swap table later when working with swap cache folios. A locked swap cache folio ensures the entries are valid and stable so these helpers are very helpful. Link: https://lkml.kernel.org/r/20250916160100.31545-8-ryncsn@gmail.com Signed-off-by: Kairui Song Acked-by: Chris Li Reviewed-by: Barry Song Acked-by: David Hildenbrand Suggested-by: Chris Li Cc: Baolin Wang Cc: Baoquan He Cc: "Huang, Ying" Cc: Hugh Dickins Cc: Johannes Weiner Cc: Kemeng Shi Cc: kernel test robot Cc: Lorenzo Stoakes Cc: Matthew Wilcox (Oracle) Cc: Nhat Pham Cc: Yosry Ahmed Cc: Zi Yan Cc: SeongJae Park Signed-off-by: Andrew Morton

mm, swap: rename and move some swap cluster definition and helpers

2025-09-21T21:22:23+00:00

No feature change, move cluster related definitions and helpers to mm/swap.h, also tidy up and add a "swap_" prefix for cluster lock/unlock helpers, so they can be used outside of swap files. And while at it, add kerneldoc. Link: https://lkml.kernel.org/r/20250916160100.31545-7-ryncsn@gmail.com Signed-off-by: Kairui Song Reviewed-by: Baolin Wang Reviewed-by: Barry Song Acked-by: Chris Li Acked-by: David Hildenbrand Suggested-by: Chris Li Acked-by: Nhat Pham Cc: Baoquan He Cc: "Huang, Ying" Cc: Hugh Dickins Cc: Johannes Weiner Cc: Kemeng Shi Cc: kernel test robot Cc: Lorenzo Stoakes Cc: Matthew Wilcox (Oracle) Cc: Yosry Ahmed Cc: Zi Yan Cc: SeongJae Park Signed-off-by: Andrew Morton

Merge branch 'mm-hotfixes-stable' into mm-stable in order to pick up

2025-09-21T21:19:36+00:00

changes required by mm-stable material: hugetlb and damon.

mm: swap.h: Remove deleted field from comments

2025-09-13T23:55:00+00:00

The comment for struct swap_info_struct.lock incorrectly mentions fields that have already been deleted from the structure. Update the comments to accurately reflect the current struct swap_info_struct. There is no functional change. Link: https://lkml.kernel.org/r/20250812-swap-scan-list-v3-2-6d73504d267b@kernel.org Signed-off-by: Chris Li Reviewed-by: Kairui Song Acked-by: Nhat Pham Reviewed-by: Barry Song Cc: Baoquan He Cc: "Huang, Ying" Cc: Kemeng Shi Signed-off-by: Andrew Morton

mm, swap: remove fragment clusters counter

2025-09-13T23:54:44+00:00

It was used for calculating the iteration number when the swap allocator wants to scan the whole fragment list. Now the allocator only scans one fragment cluster at a time, so no one uses this counter anymore. Remove it as a cleanup; the performance change is marginal: Build linux kernel using 10G ZRAM, make -j96, defconfig with 2G cgroup memory limit, on top of tmpfs, 64kB mTHP enabled: Before: sys time: 6278.45s After: sys time: 6176.34s Change to 8G ZRAM: Before: sys time: 5572.85s After: sys time: 5531.49s Link: https://lkml.kernel.org/r/20250806161748.76651-3-ryncsn@gmail.com Signed-off-by: Kairui Song Reviewed-by: Nhat Pham Acked-by: Chris Li Cc: Baoquan He Cc: Barry Song Cc: "Huang, Ying" Cc: Kemeng Shi Signed-off-by: Andrew Morton

mm: folio_may_be_lru_cached() unless folio_test_large()

2025-09-13T20:05:36+00:00

mm/swap.c and mm/mlock.c agree to drain any per-CPU batch as soon as a large folio is added: so collect_longterm_unpinnable_folios() just wastes effort when calling lru_add_drain[_all]() on a large folio. But although there is good reason not to batch up PMD-sized folios, we might well benefit from batching a small number of low-order mTHPs (though unclear how that "small number" limitation will be implemented). So ask if folio_may_be_lru_cached() rather than !folio_test_large(), to insulate those particular checks from future change. Name preferred to "folio_is_batchable" because large folios can well be put on a batch: it's just the per-CPU LRU caches, drained much later, which need care. Marked for stable, to counter the increase in lru_add_drain_all()s from "mm/gup: check ref_count instead of lru before migration". Link: https://lkml.kernel.org/r/57d2eaf8-3607-f318-e0c5-be02dce61ad0@google.com Fixes: 9a4e9f3b2d73 ("mm: update get_user_pages_longterm to migrate pages allocated from CMA region") Signed-off-by: Hugh Dickins Suggested-by: David Hildenbrand Acked-by: David Hildenbrand Cc: "Aneesh Kumar K.V" Cc: Axel Rasmussen Cc: Chris Li Cc: Christoph Hellwig Cc: Jason Gunthorpe Cc: Johannes Weiner Cc: John Hubbard Cc: Keir Fraser Cc: Konstantin Khlebnikov Cc: Li Zhe Cc: Matthew Wilcox (Oracle) Cc: Peter Xu Cc: Rik van Riel Cc: Shivank Garg Cc: Vlastimil Babka Cc: Wei Xu Cc: Will Deacon Cc: yangge Cc: Yuanchu Xie Cc: Yu Zhao Cc: Signed-off-by: Andrew Morton

mm: optimize lru_note_cost() by adding lru_note_cost_unlock_irq()

2025-07-25T02:12:28+00:00

Dropping a lock, just to demand it again for an afterthought, cannot be good if contended: convert lru_note_cost() to lru_note_cost_unlock_irq(). [hughd@google.com: delete unneeded comment] Link: https://lkml.kernel.org/r/dbf9352a-1ed9-a021-c0c7-9309ac73e174@google.com Link: https://lkml.kernel.org/r/21100102-51b6-79d5-03db-1bb7f97fa94c@google.com Signed-off-by: Hugh Dickins Acked-by: Johannes Weiner Reviewed-by: Roman Gushchin Tested-by: Roman Gushchin Reviewed-by: Shakeel Butt Cc: David Hildenbrand Cc: Lorenzo Stoakes Cc: Michal Hocko Signed-off-by: Andrew Morton

mm: introduce per-node proactive reclaim interface

2025-07-20T01:59:53+00:00

This adds support for allowing proactive reclaim in general on a NUMA system. A per-node interface extends support for beyond a memcg-specific interface, respecting the current semantics of memory.reclaim: respecting aging LRU and not supporting artificially triggering eviction on nodes belonging to non-bottom tiers. This patch allows userspace to do: echo "512M swappiness=10" > /sys/devices/system/node/nodeX/reclaim One of the premises for this is to semantically align as best as possible with memory.reclaim. During a brief time memcg did support nodemask until 55ab834a86a9 (Revert "mm: add nodes= arg to memory.reclaim"), for which semantics around reclaim (eviction) vs demotion were not clear, rendering charging expectations to be broken. With this approach: 1. Users who do not use memcg can benefit from proactive reclaim. The memcg interface is not NUMA aware and there are usecases that are focusing on NUMA balancing rather than workload memory footprint. 2. Proactive reclaim on top tiers will trigger demotion, for which memory is still byte-addressable. Reclaiming on the bottom nodes will trigger evicting to swap (the traditional sense of reclaim). This follows the semantics of what is today part of the aging process on tiered memory, mirroring what every other form of reclaim does (reactive and memcg proactive reclaim). Furthermore per-node proactive reclaim is not as susceptible to the memcg charging problem mentioned above. 3. Unlike the nodes= arg, this interface avoids confusing semantics, such as what exactly the user wants when mixing top-tier and low-tier nodes in the nodemask. Further per-node interface is less exposed to "free up memory in my container" usecases, where eviction is intended. 4. Users that *really* want to free up memory can use proactive reclaim on nodes knowingly to be on the bottom tiers to force eviction in a natural way - higher access latencies are still better than swap. If compelled, while no guarantees and perhaps not worth the effort, users could also also potentially follow a ladder-like approach to eventually free up the memory. Alternatively, perhaps an 'evict' option could be added to the parameters for both memory.reclaim and per-node interfaces to force this action unconditionally. [akpm@linux-foundation.org: user_proactive_reclaim(): return -EBUSY on PGDAT_RECLAIM_LOCKED contention, per Roman] [dave@stgolabs.net: memcg && node is also a bogus case, per Shakeel] Link: https://lkml.kernel.org/r/20250717235604.2atyx2aobwowpge3@offworld Link: https://lkml.kernel.org/r/20250623185851.830632-5-dave@stgolabs.net Signed-off-by: Davidlohr Bueso Acked-by: Shakeel Butt Acked-by: Roman Gushchin Cc: Johannes Weiner Cc: Michal Hocko Cc: Yosry Ahmed Signed-off-by: Andrew Morton

mm: fix spelling issue in swap.h

2025-07-13T23:38:23+00:00

recliam -> reclaim Link: https://lkml.kernel.org/r/20250702071235.212794-1-liuqiye2025@163.com Signed-off-by: Xuanye Liu Signed-off-by: Andrew Morton