summaryrefslogtreecommitdiff
path: root/include/linux/swap.h
diff options
context:
space:
mode:
authorRyan Roberts <ryan.roberts@arm.com>2024-04-08 21:39:44 +0300
committerAndrew Morton <akpm@linux-foundation.org>2024-04-26 06:56:37 +0300
commit845982eb264bc64b0c3242ace217fb574f56a299 (patch)
treeafed14e1f0fc3ca6cf5ab3bcb1bb78d529a2ad82 /include/linux/swap.h
parent9faaa0f8168bfcd81469b0724b25ba3093097a08 (diff)
downloadlinux-845982eb264bc64b0c3242ace217fb574f56a299.tar.xz
mm: swap: allow storage of all mTHP orders
Multi-size THP enables performance improvements by allocating large, pte-mapped folios for anonymous memory. However I've observed that on an arm64 system running a parallel workload (e.g. kernel compilation) across many cores, under high memory pressure, the speed regresses. This is due to bottlenecking on the increased number of TLBIs added due to all the extra folio splitting when the large folios are swapped out. Therefore, solve this regression by adding support for swapping out mTHP without needing to split the folio, just like is already done for PMD-sized THP. This change only applies when CONFIG_THP_SWAP is enabled, and when the swap backing store is a non-rotating block device. These are the same constraints as for the existing PMD-sized THP swap-out support. Note that no attempt is made to swap-in (m)THP here - this is still done page-by-page, like for PMD-sized THP. But swapping-out mTHP is a prerequisite for swapping-in mTHP. The main change here is to improve the swap entry allocator so that it can allocate any power-of-2 number of contiguous entries between [1, (1 << PMD_ORDER)]. This is done by allocating a cluster for each distinct order and allocating sequentially from it until the cluster is full. This ensures that we don't need to search the map and we get no fragmentation due to alignment padding for different orders in the cluster. If there is no current cluster for a given order, we attempt to allocate a free cluster from the list. If there are no free clusters, we fail the allocation and the caller can fall back to splitting the folio and allocates individual entries (as per existing PMD-sized THP fallback). The per-order current clusters are maintained per-cpu using the existing infrastructure. This is done to avoid interleving pages from different tasks, which would prevent IO being batched. This is already done for the order-0 allocations so we follow the same pattern. As is done for order-0 per-cpu clusters, the scanner now can steal order-0 entries from any per-cpu-per-order reserved cluster. This ensures that when the swap file is getting full, space doesn't get tied up in the per-cpu reserves. This change only modifies swap to be able to accept any order mTHP. It doesn't change the callers to elide doing the actual split. That will be done in separate changes. Link: https://lkml.kernel.org/r/20240408183946.2991168-6-ryan.roberts@arm.com Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Reviewed-by: "Huang, Ying" <ying.huang@intel.com> Cc: Barry Song <21cnbao@gmail.com> Cc: Barry Song <v-songbaohua@oppo.com> Cc: Chris Li <chrisl@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: Gao Xiang <xiang@kernel.org> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Lance Yang <ioworker0@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Diffstat (limited to 'include/linux/swap.h')
-rw-r--r--include/linux/swap.h8
1 files changed, 7 insertions, 1 deletions
diff --git a/include/linux/swap.h b/include/linux/swap.h
index b69b06733247..6b672a67ae5d 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -268,13 +268,19 @@ struct swap_cluster_info {
*/
#define SWAP_NEXT_INVALID 0
+#ifdef CONFIG_THP_SWAP
+#define SWAP_NR_ORDERS (PMD_ORDER + 1)
+#else
+#define SWAP_NR_ORDERS 1
+#endif
+
/*
* We assign a cluster to each CPU, so each CPU can allocate swap entry from
* its own cluster and swapout sequentially. The purpose is to optimize swapout
* throughput.
*/
struct percpu_cluster {
- unsigned int next; /* Likely next allocation offset */
+ unsigned int next[SWAP_NR_ORDERS]; /* Likely next allocation offset */
};
struct swap_cluster_list {