KVM: x86/mmu: Process atomically-zapped SPTEs after TLB flush - kernel/linux.git

diff options

author	David Matlack <dmatlack@google.com>	2024-03-07 22:40:59 +0300
committer	Sean Christopherson <seanjc@google.com>	2024-04-09 19:37:47 +0300
commit	aca48556c592189fdcdc68b82bbae442bd08730f (patch)
tree	d06abbb41fd8ca49688fa18cb55b125c605a8026 /tools/perf/scripts/python/export-to-postgresql.py
parent	fec50db7033ea478773b159e0e2efb135270e3b7 (diff)
download	linux-aca48556c592189fdcdc68b82bbae442bd08730f.tar.xz

KVM: x86/mmu: Process atomically-zapped SPTEs after TLB flush

When zapping TDP MMU SPTEs under read-lock, processes zapped SPTEs *after* flushing TLBs and after replacing the special REMOVED_SPTE with '0'. When zapping an SPTE that points to a page table, processing SPTEs after flushing TLBs minimizes contention on the child SPTEs (e.g. vCPUs won't hit write-protection faults via stale, read-only child SPTEs), and processing after replacing REMOVED_SPTE with '0' minimizes the amount of time vCPUs will be blocked by the REMOVED_SPTE. Processing SPTEs after setting the SPTE to '0', i.e. in parallel with the SPTE potentially being replacing with a new SPTE, is safe because KVM does not depend on completing the processing before a new SPTE is installed, and the processing is done on a subset of the page tables that is disconnected from the root, and thus unreachable by other tasks (after the TLB flush). KVM already relies on similar logic, as kvm_mmu_zap_all_fast() can result in KVM processing all SPTEs in a given root after vCPUs create mappings in a new root. In VMs with a large (400+) number of vCPUs, it can take KVM multiple seconds to process a 1GiB region mapped with 4KiB entries, e.g. when disabling dirty logging in a VM backed by 1GiB HugeTLB. During those seconds, if a vCPU accesses the 1GiB region being zapped it will be stalled until KVM finishes processing the SPTE and replaces the REMOVED_SPTE with 0. Re-ordering the processing does speed up the atomic-zaps somewhat, but the main benefit is avoiding blocking vCPU threads. Before: $ ./dirty_log_perf_test -s anonymous_hugetlb_1gb -v 416 -b 1G -e ... Disabling dirty logging time: 509.765146313s $ ./funclatency -m tdp_mmu_zap_spte_atomic msec : count distribution 0 -> 1 : 0 | | 2 -> 3 : 0 | | 4 -> 7 : 0 | | 8 -> 15 : 0 | | 16 -> 31 : 0 | | 32 -> 63 : 0 | | 64 -> 127 : 0 | | 128 -> 255 : 8 |** | 256 -> 511 : 68 |****************** | 512 -> 1023 : 129 |********************************** | 1024 -> 2047 : 151 |****************************************| 2048 -> 4095 : 60 |*************** | After: $ ./dirty_log_perf_test -s anonymous_hugetlb_1gb -v 416 -b 1G -e ... Disabling dirty logging time: 336.516838548s $ ./funclatency -m tdp_mmu_zap_spte_atomic msec : count distribution 0 -> 1 : 0 | | 2 -> 3 : 0 | | 4 -> 7 : 0 | | 8 -> 15 : 0 | | 16 -> 31 : 0 | | 32 -> 63 : 0 | | 64 -> 127 : 0 | | 128 -> 255 : 12 |** | 256 -> 511 : 166 |****************************************| 512 -> 1023 : 101 |************************ | 1024 -> 2047 : 137 |********************************* | Note, KVM's processing of collapsible SPTEs is still extremely slow and can be improved. For example, a significant amount of time is spent calling kvm_set_pfn_{accessed,dirty}() for every last-level SPTE, even when processing SPTEs that all map the same folio. But avoiding blocking vCPUs and contending SPTEs is valuable regardless of how fast KVM can process collapsible SPTEs. Link: https://lore.kernel.org/all/20240320005024.3216282-1-seanjc@google.com Cc: Vipin Sharma <vipinsh@google.com> Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: David Matlack <dmatlack@google.com> Reviewed-by: Vipin Sharma <vipinsh@google.com> Link: https://lore.kernel.org/r/20240307194059.1357377-1-dmatlack@google.com [sean: massage changelog] Signed-off-by: Sean Christopherson <seanjc@google.com>

Diffstat (limited to 'tools/perf/scripts/python/export-to-postgresql.py')

0 files changed, 0 insertions, 0 deletions


context:
space:
mode: