| author | Ryan Roberts <ryan.roberts@arm.com> | 2026-04-01 13:16:19 +0300 |
|---|---|---|
| committer | Andrew Morton <akpm@linux-foundation.org> | 2026-05-29 07:04:40 +0300 |
| commit | 4aa4abf1f14bd6d0748b7d35a803cc2376a8e20b (patch) | |
| tree | 05bc0aa8d7dc3dbb04b440efba7926e0d5dbb281 /include/linux | |
| parent | 4221aadd720bef7df1268391d6eb1ea1f0476b38 (diff) | |
mm/page_alloc: optimize free_contig_range()
Patch series "mm: Free contiguous order-0 pages efficiently", v6.
A recent change to vmalloc caused some performance benchmark regressions
(see [1]). I'm attempting to fix that (and at the same time significantly
improve beyond the baseline) by freeing a contiguous set of order-0 pages
as a batch.
At the same time I observed that free_contig_range() was essentially doing
the same thing as vfree(), so I've fixed it there too. While at it,
optimize __free_contig_frozen_range() as well.
Check that the contiguous range falls within a single section. If sections
aren't enabled, the compiler optimizes out the if conditions because
memdesc_section() returns 0. See num_pages_contiguous() for more details.
This patch (of 3):
Decompose the range of order-0 pages to be freed into the set of largest
possible power-of-2 size and aligned chunks and free them to the pcp or
buddy. This improves on the previous approach which freed each order-0
page individually in a loop. Testing shows performance to be improved by
more than 10x in some cases.
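The decomposition described above can be illustrated with a small userspace sketch. It is not the kernel implementation: `struct chunk`, `chunk_order()`, `decompose()` and the `SKETCH_MAX_ORDER` cap are hypothetical names chosen for illustration, and the real code operates on struct page and frees to the pcp or buddy rather than recording chunks in an array.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical cap on chunk order; the kernel's real limit may differ. */
#define SKETCH_MAX_ORDER 10

/* Hypothetical record of one aligned power-of-2 chunk. */
struct chunk {
	unsigned long pfn;
	unsigned int order;
};

/* Largest order for which pfn is aligned and the chunk still fits in nr. */
static unsigned int chunk_order(unsigned long pfn, unsigned long nr)
{
	unsigned int order = 0;

	while (order < SKETCH_MAX_ORDER &&
	       !(pfn & ((1UL << (order + 1)) - 1)) &&
	       (1UL << (order + 1)) <= nr)
		order++;

	return order;
}

/* Split [pfn, pfn + nr) into chunks; returns the number of chunks stored. */
static size_t decompose(unsigned long pfn, unsigned long nr,
			struct chunk *out, size_t max)
{
	size_t n = 0;

	while (nr) {
		unsigned int order = chunk_order(pfn, nr);

		assert(n < max);
		out[n++] = (struct chunk){ pfn, order };
		pfn += 1UL << order;
		nr -= 1UL << order;
	}

	return n;
}
```

For example, calling `decompose(5, 14, ...)` on the range [5, 19) stores five chunks: (pfn 5, order 0), (6, 1), (8, 3), (16, 1) and (18, 0), so five free operations replace fourteen.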
Since each page is order-0, we must decrement each page's reference count
individually and only consider the page for freeing as part of a high
order chunk if the reference count goes to zero. Additionally
free_pages_prepare() must be called for each individual order-0 page too,
so that the struct page state and global accounting state can be
appropriately managed. But once this is done, the resulting high order
chunks can be freed as a unit to the pcp or buddy.
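The per-page reference count handling can be modelled in userspace as follows. This is only a sketch under stated assumptions: `struct fake_page`, `struct run` and `put_and_collect_runs()` are hypothetical stand-ins, the refcount is a plain int rather than an atomic, and the per-page free_pages_prepare() step is omitted. It shows how a page that is still referenced elsewhere ends the current run, so that only runs of pages whose count reached zero are candidates for high order freeing.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct page: only a reference count. */
struct fake_page {
	int refcount;
};

/* Hypothetical record of one run of contiguous freeable pages. */
struct run {
	size_t start;
	size_t len;
};

/*
 * Drop one reference on each of pages[0..nr). Pages whose count reaches
 * zero are grouped into maximal contiguous runs. A page that is still
 * referenced ends the current run and is left alone.
 */
static size_t put_and_collect_runs(struct fake_page *pages, size_t nr,
				   struct run *out, size_t max)
{
	size_t nruns = 0, run_len = 0, i;

	for (i = 0; i < nr; i++) {
		if (--pages[i].refcount == 0) {
			run_len++;
			continue;
		}
		if (run_len) {
			assert(nruns < max);
			out[nruns++] = (struct run){ i - run_len, run_len };
			run_len = 0;
		}
	}
	if (run_len) {
		assert(nruns < max);
		out[nruns++] = (struct run){ nr - run_len, run_len };
	}

	return nruns;
}
```

With initial counts {1, 1, 2, 1, 1, 1}, the third page keeps a reference, so the sketch reports two runs: pages 0-1 and pages 3-5. Each run would then be decomposed into aligned power-of-2 chunks and freed.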
This significantly speeds up the free operation but also has the side
benefit that high order blocks are added to the pcp instead of each page
ending up on the pcp order-0 list; memory remains more readily available
in high orders.
vmalloc will shortly become a user of this new optimized
free_contig_range() since it aggressively allocates high order
non-compound pages, but then calls split_page() to end up with contiguous
order-0 pages. These can now be freed much more efficiently.
The execution time of the following function was measured on a
server-class arm64 machine:
static int page_alloc_high_order_test(void)
{
	unsigned int order = HPAGE_PMD_ORDER;
	struct page *page;
	int i;

	for (i = 0; i < 100000; i++) {
		page = alloc_pages(GFP_KERNEL, order);
		if (!page)
			return -1;
		split_page(page, order);
		free_contig_range(page_to_pfn(page), 1UL << order);
	}

	return 0;
}
Execution time before: 4097358 usec
Execution time after: 729831 usec (roughly 5.6x faster)
Perf trace before:
99.63% 0.00% kthreadd [kernel.kallsyms] [.] kthread
|
---kthread
0xffffb33c12a26af8
|
|--98.13%--0xffffb33c12a26060
| |
| |--97.37%--free_contig_range
| | |
| | |--94.93%--___free_pages
| | | |
| | | |--55.42%--__free_frozen_pages
| | | | |
| | | | --43.20%--free_frozen_page_commit
| | | | |
| | | | --35.37%--_raw_spin_unlock_irqrestore
| | | |
| | | |--11.53%--_raw_spin_trylock
| | | |
| | | |--8.19%--__preempt_count_dec_and_test
| | | |
| | | |--5.64%--_raw_spin_unlock
| | | |
| | | |--2.37%--__get_pfnblock_flags_mask.isra.0
| | | |
| | | --1.07%--free_frozen_page_commit
| | |
| | --1.54%--__free_frozen_pages
| |
| --0.77%--___free_pages
|
--0.98%--0xffffb33c12a26078
alloc_pages_noprof
Perf trace after:
8.42% 2.90% kthreadd [kernel.kallsyms] [k] __free_contig_range
|
|--5.52%--__free_contig_range
| |
| |--5.00%--free_prepared_contig_range
| | |
| | |--1.43%--__free_frozen_pages
| | | |
| | | --0.51%--free_frozen_page_commit
| | |
| | |--1.08%--_raw_spin_trylock
| | |
| | --0.89%--_raw_spin_unlock
| |
| --0.52%--free_pages_prepare
|
--2.90%--ret_from_fork
kthread
0xffffae1c12abeaf8
0xffffae1c12abe7a0
|
--2.69%--vfree
__free_contig_range
Link: https://lore.kernel.org/20260401101634.2868165-1-usama.anjum@arm.com
Link: https://lore.kernel.org/20260401101634.2868165-2-usama.anjum@arm.com
Link: https://lore.kernel.org/all/66919a28-bc81-49c9-b68f-dd7c73395a0d@arm.com [1]
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Co-developed-by: Muhammad Usama Anjum <usama.anjum@arm.com>
Signed-off-by: Muhammad Usama Anjum <usama.anjum@arm.com>
Acked-by: David Hildenbrand (Arm) <david@kernel.org>
Acked-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Cc: Brendan Jackman <jackmanb@google.com>
Cc: David Sterba <dsterba@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Liam Howlett <liam@infradead.org>
Cc: Lorenzo Stoakes <ljs@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Nick Terrell <terrelln@fb.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: "Uladzislau Rezki (Sony)" <urezki@gmail.com>
Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Diffstat (limited to 'include/linux')
| -rw-r--r-- | include/linux/gfp.h | 2 |
1 files changed, 2 insertions, 0 deletions
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 51ef13ed756e..87259e309dee 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -467,6 +467,8 @@ void free_contig_frozen_range(unsigned long pfn, unsigned long nr_pages);
 void free_contig_range(unsigned long pfn, unsigned long nr_pages);
 #endif
 
+void __free_contig_range(unsigned long pfn, unsigned long nr_pages);
+
 DEFINE_FREE(free_page, void *, free_page((unsigned long)_T))
 
 #endif /* __LINUX_GFP_H */
