diff options
author | Mike Kravetz <mike.kravetz@oracle.com> | 2023-10-19 05:31:03 +0300 |
---|---|---|
committer | Andrew Morton <akpm@linux-foundation.org> | 2023-10-26 02:47:07 +0300 |
commit | d2cf88c27f51361dfe2982e510a444411d2fc78a (patch) | |
tree | fd556d21c36943361daf217577091854c225bda3 | |
parent | fa8c4f9a665bd0cf5a475327aea4fd179e896216 (diff) | |
download | linux-d2cf88c27f51361dfe2982e510a444411d2fc78a.tar.xz |
hugetlb: optimize update_and_free_pages_bulk to avoid lock cycles
Patch series "Batch hugetlb vmemmap modification operations", v8.
When hugetlb vmemmap optimization was introduced, the overhead of enabling
the option was measured as described in commit 426e5c429d16 [1]. The
summary states that allocating a hugetlb page should be ~2x slower with
optimization and freeing a hugetlb page should be ~2-3x slower. Such
overhead was deemed an acceptable trade off for the memory savings
obtained by freeing vmemmap pages.
It was recently reported that the overhead associated with enabling
vmemmap optimization could be as high as 190x for hugetlb page
allocations. Yes, 190x! Some actual numbers from other environments are:
Bare Metal 8 socket Intel(R) Xeon(R) CPU E7-8895
------------------------------------------------
Unmodified next-20230824, vm.hugetlb_optimize_vmemmap = 0
time echo 500000 > .../hugepages-2048kB/nr_hugepages
real 0m4.119s
time echo 0 > .../hugepages-2048kB/nr_hugepages
real 0m4.477s
Unmodified next-20230824, vm.hugetlb_optimize_vmemmap = 1
time echo 500000 > .../hugepages-2048kB/nr_hugepages
real 0m28.973s
time echo 0 > .../hugepages-2048kB/nr_hugepages
real 0m36.748s
VM with 252 vcpus on host with 2 socket AMD EPYC 7J13 Milan
-----------------------------------------------------------
Unmodified next-20230824, vm.hugetlb_optimize_vmemmap = 0
time echo 524288 > .../hugepages-2048kB/nr_hugepages
real 0m2.463s
time echo 0 > .../hugepages-2048kB/nr_hugepages
real 0m2.931s
Unmodified next-20230824, vm.hugetlb_optimize_vmemmap = 1
time echo 524288 > .../hugepages-2048kB/nr_hugepages
real 2m27.609s
time echo 0 > .../hugepages-2048kB/nr_hugepages
real 2m29.924s
In the VM environment, the slowdown of enabling hugetlb vmemmap optimization
resulted in allocation times being 61x slower.
A quick profile showed that the vast majority of this overhead was due to
TLB flushing. Each time we modify the kernel pagetable we need to flush
the TLB. For each hugetlb that is optimized, there could be potentially
two TLB flushes performed. One for the vmemmap pages associated with the
hugetlb page, and potentially another one if the vmemmap pages are mapped
at the PMD level and must be split. The TLB flushes required for the
kernel pagetable, result in a broadcast IPI with each CPU having to flush
a range of pages, or do a global flush if a threshold is exceeded. So,
the flush time increases with the number of CPUs. In addition, in virtual
environments the broadcast IPI can’t be accelerated by hypervisor
hardware and leads to traps that need to wakeup/IPI all vCPUs which is
very expensive. Because of this the slowdown in virtual environments is
even worse than bare metal as the number of vCPUS/CPUs is increased.
The following series attempts to reduce amount of time spent in TLB
flushing. The idea is to batch the vmemmap modification operations for
multiple hugetlb pages. Instead of doing one or two TLB flushes for each
page, we do two TLB flushes for each batch of pages. One flush after
splitting pages mapped at the PMD level, and another after remapping
vmemmap associated with all hugetlb pages. Results of such batching are
as follows:
Bare Metal 8 socket Intel(R) Xeon(R) CPU E7-8895
------------------------------------------------
next-20230824 + Batching patches, vm.hugetlb_optimize_vmemmap = 0
time echo 500000 > .../hugepages-2048kB/nr_hugepages
real 0m4.719s
time echo 0 > .../hugepages-2048kB/nr_hugepages
real 0m4.245s
next-20230824 + Batching patches, vm.hugetlb_optimize_vmemmap = 1
time echo 500000 > .../hugepages-2048kB/nr_hugepages
real 0m7.267s
time echo 0 > .../hugepages-2048kB/nr_hugepages
real 0m13.199s
VM with 252 vcpus on host with 2 socket AMD EPYC 7J13 Milan
-----------------------------------------------------------
next-20230824 + Batching patches, vm.hugetlb_optimize_vmemmap = 0
time echo 524288 > .../hugepages-2048kB/nr_hugepages
real 0m2.715s
time echo 0 > .../hugepages-2048kB/nr_hugepages
real 0m3.186s
next-20230824 + Batching patches, vm.hugetlb_optimize_vmemmap = 1
time echo 524288 > .../hugepages-2048kB/nr_hugepages
real 0m4.799s
time echo 0 > .../hugepages-2048kB/nr_hugepages
real 0m5.273s
With batching, results are back in the 2-3x slowdown range.
This patch (of 8):
update_and_free_pages_bulk is designed to free a list of hugetlb pages
back to their associated lower level allocators. This may require
allocating vmemmmap pages associated with each hugetlb page. The hugetlb
page destructor must be changed before pages are freed to lower level
allocators. However, the destructor must be changed under the hugetlb
lock. This means there is potentially one lock cycle per page.
Minimize the number of lock cycles in update_and_free_pages_bulk by:
1) allocating necessary vmemmap for all hugetlb pages on the list
2) take hugetlb lock and clear destructor for all pages on the list
3) free all pages on list back to low level allocators
Link: https://lkml.kernel.org/r/20231019023113.345257-1-mike.kravetz@oracle.com
Link: https://lkml.kernel.org/r/20231019023113.345257-2-mike.kravetz@oracle.com
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Acked-by: James Houghton <jthoughton@google.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Barry Song <21cnbao@gmail.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Joao Martins <joao.m.martins@oracle.com>
Cc: Konrad Dybcio <konradybcio@kernel.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Usama Arif <usama.arif@bytedance.com>
Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-rw-r--r-- | mm/hugetlb.c | 39 |
1 files changed, 39 insertions, 0 deletions
diff --git a/mm/hugetlb.c b/mm/hugetlb.c index da6f85b7db88..b839080a2a6b 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -1862,7 +1862,46 @@ static void update_and_free_hugetlb_folio(struct hstate *h, struct folio *folio, static void update_and_free_pages_bulk(struct hstate *h, struct list_head *list) { struct folio *folio, *t_folio; + bool clear_dtor = false; + /* + * First allocate required vmemmmap (if necessary) for all folios on + * list. If vmemmap can not be allocated, we can not free folio to + * lower level allocator, so add back as hugetlb surplus page. + * add_hugetlb_folio() removes the page from THIS list. + * Use clear_dtor to note if vmemmap was successfully allocated for + * ANY page on the list. + */ + list_for_each_entry_safe(folio, t_folio, list, lru) { + if (folio_test_hugetlb_vmemmap_optimized(folio)) { + if (hugetlb_vmemmap_restore(h, &folio->page)) { + spin_lock_irq(&hugetlb_lock); + add_hugetlb_folio(h, folio, true); + spin_unlock_irq(&hugetlb_lock); + } else + clear_dtor = true; + } + } + + /* + * If vmemmmap allocation was performed on any folio above, take lock + * to clear destructor of all folios on list. This avoids the need to + * lock/unlock for each individual folio. + * The assumption is vmemmap allocation was performed on all or none + * of the folios on the list. This is true expect in VERY rare cases. + */ + if (clear_dtor) { + spin_lock_irq(&hugetlb_lock); + list_for_each_entry(folio, list, lru) + __clear_hugetlb_destructor(h, folio); + spin_unlock_irq(&hugetlb_lock); + } + + /* + * Free folios back to low level allocators. vmemmap and destructors + * were taken care of above, so update_and_free_hugetlb_folio will + * not need to take hugetlb lock. + */ list_for_each_entry_safe(folio, t_folio, list, lru) { update_and_free_hugetlb_folio(h, folio, false); cond_resched(); |