<feed xmlns='http://www.w3.org/2005/Atom'>
<title>starfive-tech/linux.git/virt, branch master</title>
<subtitle>StarFive Tech Linux Kernel for VisionFive (JH7110) boards (mirror)</subtitle>
<id>https://git.radix-linux.su/starfive-tech/linux.git/atom?h=master</id>
<link rel='self' href='https://git.radix-linux.su/starfive-tech/linux.git/atom?h=master'/>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/starfive-tech/linux.git/'/>
<updated>2021-05-07T10:06:22+00:00</updated>
<entry>
<title>kvm: Cap halt polling at kvm-&gt;max_halt_poll_ns</title>
<updated>2021-05-07T10:06:22+00:00</updated>
<author>
<name>David Matlack</name>
<email>dmatlack@google.com</email>
</author>
<published>2021-05-06T15:24:43+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/starfive-tech/linux.git/commit/?id=258785ef08b323bddd844b4926a32c2b2045a1b0'/>
<id>urn:sha1:258785ef08b323bddd844b4926a32c2b2045a1b0</id>
<content type='text'>
When growing halt-polling, there is no check that the poll time exceeds
the per-VM limit. It's possible for vcpu-&gt;halt_poll_ns to grow past
kvm-&gt;max_halt_poll_ns and stay there until a halt which takes longer
than kvm-&gt;max_halt_poll_ns.
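
A minimal sketch of the capped grow path, modeled on
grow_halt_poll_ns() in virt/kvm/kvm_main.c (trace bookkeeping elided;
the surrounding code may differ):

	static void grow_halt_poll_ns(struct kvm_vcpu *vcpu)
	{
		unsigned int val, grow, grow_start;

		val = vcpu-&gt;halt_poll_ns;
		grow_start = READ_ONCE(halt_poll_ns_grow_start);
		grow = READ_ONCE(halt_poll_ns_grow);
		if (!grow)
			return;

		val *= grow;
		if (val &lt; grow_start)
			val = grow_start;

		/* The fix: never grow past the per-VM limit. */
		if (val &gt; vcpu-&gt;kvm-&gt;max_halt_poll_ns)
			val = vcpu-&gt;kvm-&gt;max_halt_poll_ns;

		vcpu-&gt;halt_poll_ns = val;
	}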

Signed-off-by: David Matlack &lt;dmatlack@google.com&gt;
Signed-off-by: Venkatesh Srinivas &lt;venkateshs@chromium.org&gt;
Message-Id: &lt;20210506152442.4010298-1-venkateshs@chromium.org&gt;
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini &lt;pbonzini@redhat.com&gt;
</content>
</entry>
<entry>
<title>kvm: exit halt polling on need_resched() as well</title>
<updated>2021-05-03T15:25:36+00:00</updated>
<author>
<name>Benjamin Segall</name>
<email>bsegall@google.com</email>
</author>
<published>2021-04-29T16:22:34+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/starfive-tech/linux.git/commit/?id=262de4102c7bb8e59f26a967a8ffe8cce85cc537'/>
<id>urn:sha1:262de4102c7bb8e59f26a967a8ffe8cce85cc537</id>
<content type='text'>
single_task_running() is usually a more general condition than
need_resched(), but CFS_BANDWIDTH throttling uses resched_task() even
when there is just one runnable task, in order to get that task to
block. This was causing long need_resched warnings and likely allowed
VMs to overrun their quota while halt polling.
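
A sketch of the adjusted polling loop in kvm_vcpu_block() (simplified;
the ktime bookkeeping around it is assumed):

	do {
		/*
		 * A pending wakeup still ends polling; with this
		 * change, a scheduler resched request (e.g. from
		 * CFS_BANDWIDTH throttling) now ends it as well.
		 */
		if (kvm_vcpu_check_block(vcpu) &lt; 0)
			goto out;
		cur = ktime_get();
	} while (single_task_running() &amp;&amp; !need_resched() &amp;&amp;
		 ktime_before(cur, stop));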

Signed-off-by: Ben Segall &lt;bsegall@google.com&gt;
Signed-off-by: Venkatesh Srinivas &lt;venkateshs@chromium.org&gt;
Message-Id: &lt;20210429162233.116849-1-venkateshs@chromium.org&gt;
Signed-off-by: Paolo Bonzini &lt;pbonzini@redhat.com&gt;
Cc: stable@vger.kernel.org
Reviewed-by: Jim Mattson &lt;jmattson@google.com&gt;
</content>
</entry>
<entry>
<title>KVM: Boost vCPU candidate in user mode which is delivering interrupt</title>
<updated>2021-04-21T16:20:03+00:00</updated>
<author>
<name>Wanpeng Li</name>
<email>wanpengli@tencent.com</email>
</author>
<published>2021-04-16T03:08:10+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/starfive-tech/linux.git/commit/?id=52acd22faa1af8a0514ccd075a6978ac97986425'/>
<id>urn:sha1:52acd22faa1af8a0514ccd075a6978ac97986425</id>
<content type='text'>
Both the lock holder vCPU and a halted IPI receiver are candidates for
boosting. However, the PLE handler was originally designed to deal with
the lock holder preemption problem: Intel PLE occurs when the spinlock
waiter is in kernel mode. That assumption doesn't hold for IPI
receivers, which can be in either kernel or user mode, so a vCPU
candidate in user mode is not boosted even when it should respond to an
IPI. Some benchmarks, such as pbzip2 and swaptions, do TLB shootdowns
in kernel mode while running in user mode most of the time. This can
lead to a long run of continuous PLE events, because the IPI sender
triggers PLE events repeatedly until the receiver is scheduled, yet the
receiver is never a candidate for a boost.

This patch boosts a vCPU candidate in user mode to which an interrupt
is being delivered. We observe pbzip2 running about 10% faster in a
96-vCPU VM in an over-subscribed scenario (the host is a 2-socket,
48-core, 96-thread Intel CLX box). There is no performance regression
for other benchmarks such as Unixbench spawn (which mostly contends a
read/write lock in kernel mode) or ebizzy (which mostly contends a
read/write semaphore and does TLB shootdowns in kernel mode).
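
A sketch of the relaxed candidate filter in kvm_vcpu_on_spin() (the
exact condition ordering in the patch may differ):

	/*
	 * Previously, any vCPU in user mode was skipped when yielding
	 * to kernel mode; now a user-mode vCPU with an interrupt
	 * pending for delivery remains a boost candidate.
	 */
	if (yield_to_kernel_mode &amp;&amp;
	    !kvm_arch_dy_has_pending_interrupt(vcpu) &amp;&amp;
	    !kvm_arch_vcpu_in_kernel(vcpu))
		continue;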

Signed-off-by: Wanpeng Li &lt;wanpengli@tencent.com&gt;
Message-Id: &lt;1618542490-14756-1-git-send-email-wanpengli@tencent.com&gt;
Signed-off-by: Paolo Bonzini &lt;pbonzini@redhat.com&gt;
</content>
</entry>
<entry>
<title>KVM: x86: Support KVM VMs sharing SEV context</title>
<updated>2021-04-21T16:20:02+00:00</updated>
<author>
<name>Nathan Tempelman</name>
<email>natet@google.com</email>
</author>
<published>2021-04-08T22:32:14+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/starfive-tech/linux.git/commit/?id=54526d1fd59338fd6a381dbd806b7ccbae3aa4aa'/>
<id>urn:sha1:54526d1fd59338fd6a381dbd806b7ccbae3aa4aa</id>
<content type='text'>
Add a capability for userspace to mirror SEV encryption context from
one VM to another. On our side, this is intended to support a
Migration Helper vCPU, but it can also be used generically to support
other in-guest workloads scheduled by the host. The intention is for
the primary guest and the mirror to have nearly identical memslots.

The primary benefits of this are that:
1) The VMs do not share KVM contexts (think APIC/MSRs/etc), so they
can't accidentally clobber each other.
2) The VMs can have different memory views, which is necessary for post-copy
migration (the migration vCPUs on the target need to read and write
pages where the primary guest would VMEXIT).

This does not change the threat model for AMD SEV. Any memory involved
is still owned by the primary guest and its initial state is still
attested to through the normal SEV_LAUNCH_* flows. If userspace wanted
to circumvent SEV, they could achieve the same effect by simply attaching
a vCPU to the primary VM.

This patch deliberately leaves userspace in charge of the memslots for the
mirror, as it already has the power to mess with them in the primary guest.

This patch does not support SEV-ES (much less SNP), as it does not
handle handing off attested VMSAs to the mirror.

For additional context, we need a Migration Helper because SEV PSP
migration is far too slow for our live migration on its own. Using
an in-guest migrator lets us speed this up significantly.
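
A sketch of how userspace might enable the new capability on the
mirror VM (the fd variable names here are hypothetical):

	struct kvm_enable_cap cap = {
		.cap = KVM_CAP_VM_COPY_ENC_CONTEXT_FROM,
		.args[0] = primary_vm_fd,  /* VM owning the SEV context */
	};

	if (ioctl(mirror_vm_fd, KVM_ENABLE_CAP, &amp;cap) &lt; 0)
		perror("KVM_ENABLE_CAP(VM_COPY_ENC_CONTEXT_FROM)");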

Signed-off-by: Nathan Tempelman &lt;natet@google.com&gt;
Message-Id: &lt;20210408223214.2582277-1-natet@google.com&gt;
Signed-off-by: Paolo Bonzini &lt;pbonzini@redhat.com&gt;
</content>
</entry>
<entry>
<title>KVM: Add proper lockdep assertion in I/O bus unregister</title>
<updated>2021-04-20T08:18:51+00:00</updated>
<author>
<name>Sean Christopherson</name>
<email>seanjc@google.com</email>
</author>
<published>2021-04-12T22:20:50+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/starfive-tech/linux.git/commit/?id=7c896d375565a032705f64804f8c1189df1f7a89'/>
<id>urn:sha1:7c896d375565a032705f64804f8c1189df1f7a89</id>
<content type='text'>
Convert a comment above kvm_io_bus_unregister_dev() into an actual
lockdep assertion, and opportunistically add curly braces to a multi-line
for-loop.
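
A sketch of the conversion (signature and body abbreviated; the lock is
kvm-&gt;slots_lock, per the comment being replaced):

	int kvm_io_bus_unregister_dev(struct kvm *kvm, enum kvm_bus bus_idx,
				      struct kvm_io_device *dev)
	{
		lockdep_assert_held(&amp;kvm-&gt;slots_lock);	/* was only a comment */

		/* ... unregister logic unchanged ... */
	}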

Signed-off-by: Sean Christopherson &lt;seanjc@google.com&gt;
Message-Id: &lt;20210412222050.876100-4-seanjc@google.com&gt;
Signed-off-by: Paolo Bonzini &lt;pbonzini@redhat.com&gt;
</content>
</entry>
<entry>
<title>KVM: Stop looking for coalesced MMIO zones if the bus is destroyed</title>
<updated>2021-04-20T08:18:51+00:00</updated>
<author>
<name>Sean Christopherson</name>
<email>seanjc@google.com</email>
</author>
<published>2021-04-12T22:20:49+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/starfive-tech/linux.git/commit/?id=5d3c4c79384af06e3c8e25b7770b6247496b4417'/>
<id>urn:sha1:5d3c4c79384af06e3c8e25b7770b6247496b4417</id>
<content type='text'>
Abort the walk of coalesced MMIO zones if kvm_io_bus_unregister_dev()
fails to allocate memory for the new instance of the bus.  If it can't
instantiate a new bus, unregister_dev() destroys all devices _except_ the
target device.  But it doesn't tell the caller that it obliterated the
bus and invoked the destructor for all devices that were on the bus.  In
the coalesced MMIO case, this can result in a dereference of a deleted
list entry, caused by continuing to iterate over coalesced_zones after
entries later in the walk have been deleted.

Opportunistically add curly braces to the for-loop, which encompasses
many lines but sneaks by without braces due to the guts being a single
if statement.
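
A sketch of the resulting caller-side bail-out in the coalesced MMIO
unregister path (condensed; zone_matches() stands in for the real
pio/range check):

	list_for_each_entry_safe(dev, tmp, &amp;kvm-&gt;coalesced_zones, list) {
		if (zone_matches(dev, zone)) {
			r = kvm_io_bus_unregister_dev(kvm, KVM_MMIO_BUS,
						      &amp;dev-&gt;dev);
			kvm_iodevice_destructor(&amp;dev-&gt;dev);

			/*
			 * On failure, unregister destroyed the whole bus
			 * and every other device on it, so the remaining
			 * zones are already gone: stop iterating.
			 */
			if (r)
				break;
		}
	}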

Fixes: f65886606c2d ("KVM: fix memory leak in kvm_io_bus_unregister_dev()")
Cc: stable@vger.kernel.org
Reported-by: Hao Sun &lt;sunhao.th@gmail.com&gt;
Signed-off-by: Sean Christopherson &lt;seanjc@google.com&gt;
Message-Id: &lt;20210412222050.876100-3-seanjc@google.com&gt;
Signed-off-by: Paolo Bonzini &lt;pbonzini@redhat.com&gt;
</content>
</entry>
<entry>
<title>KVM: Destroy I/O bus devices on unregister failure _after_ sync'ing SRCU</title>
<updated>2021-04-20T08:18:50+00:00</updated>
<author>
<name>Sean Christopherson</name>
<email>seanjc@google.com</email>
</author>
<published>2021-04-12T22:20:48+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/starfive-tech/linux.git/commit/?id=2ee3757424be7c1cd1d0bbfa6db29a7edd82a250'/>
<id>urn:sha1:2ee3757424be7c1cd1d0bbfa6db29a7edd82a250</id>
<content type='text'>
If allocating a new instance of an I/O bus fails when unregistering a
device, wait to destroy the device until after all readers are guaranteed
to see the new null bus.  Destroying devices before the bus is nullified
could lead to use-after-free since readers expect the devices on their
reference of the bus to remain valid.
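
A sketch of the corrected ordering on the allocation-failure path
(condensed from kvm_io_bus_unregister_dev()):

	rcu_assign_pointer(kvm-&gt;buses[bus_idx], new_bus);
	synchronize_srcu_expedited(&amp;kvm-&gt;srcu);

	/*
	 * Destroy the old bus's devices only after all SRCU readers
	 * are guaranteed to see the new (possibly NULL) bus pointer.
	 */
	if (!new_bus) {
		pr_err("kvm: failed to shrink bus, removing it completely\n");
		for (j = 0; j &lt; bus-&gt;dev_count; j++) {
			if (j == i)
				continue;
			kvm_iodevice_destructor(bus-&gt;range[j].dev);
		}
	}
	kfree(bus);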

Fixes: f65886606c2d ("KVM: fix memory leak in kvm_io_bus_unregister_dev()")
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson &lt;seanjc@google.com&gt;
Message-Id: &lt;20210412222050.876100-2-seanjc@google.com&gt;
Signed-off-by: Paolo Bonzini &lt;pbonzini@redhat.com&gt;
</content>
</entry>
<entry>
<title>KVM: Take mmu_lock when handling MMU notifier iff the hva hits a memslot</title>
<updated>2021-04-17T12:31:08+00:00</updated>
<author>
<name>Sean Christopherson</name>
<email>seanjc@google.com</email>
</author>
<published>2021-04-02T00:56:56+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/starfive-tech/linux.git/commit/?id=8931a454aea03bab21b3b8fcdc94f674eebd1c5d'/>
<id>urn:sha1:8931a454aea03bab21b3b8fcdc94f674eebd1c5d</id>
<content type='text'>
Defer acquiring mmu_lock in the MMU notifier paths until a "hit" has been
detected in the memslots, i.e. don't take the lock for notifications that
don't affect the guest.

For small VMs, spurious locking is a minor annoyance.  And for "volatile"
setups where the majority of notifications _are_ relevant, this barely
qualifies as an optimization.

But, for large VMs (hundreds of threads) with static setups, e.g. no
page migration, no swapping, etc..., the vast majority of MMU notifier
callbacks will be unrelated to the guest, e.g. will often be in response
to the userspace VMM adjusting its own virtual address space.  In such
large VMs, acquiring mmu_lock can be painful as it blocks vCPUs from
handling page faults.  In some scenarios it can even be "fatal" in the
sense that it causes unacceptable brownouts, e.g. when rebuilding huge
pages after live migration, a significant percentage of vCPUs will be
attempting to handle page faults.

x86's TDP MMU implementation is especially susceptible to spurious
locking due to it taking mmu_lock for read when handling page faults.
Because rwlock is fair, a single writer will stall future readers, while
the writer is itself stalled waiting for in-progress readers to complete.
This is exacerbated by the MMU notifiers often firing multiple times in
quick succession, e.g. moving a page will (always?) invoke three separate
notifiers: .invalidate_range_start(), .invalidate_range_end(), and
.change_pte().  Unnecessarily taking mmu_lock each time means even a
single spurious sequence can be problematic.

Note, this optimizes only the unpaired callbacks.  Optimizing the
.invalidate_range_{start,end}() pairs is more complex and will be done in
a future patch.
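
A sketch of the resulting lock-on-first-hit pattern in
__kvm_handle_hva_range() (condensed; overlaps() stands in for the real
hva-range intersection):

	bool locked = false, ret = false;

	kvm_for_each_memslot(slot, slots) {
		if (!overlaps(slot, range-&gt;start, range-&gt;end))
			continue;

		if (!locked) {
			locked = true;
			KVM_MMU_LOCK(kvm);
		}
		ret |= range-&gt;handler(kvm, &amp;gfn_range);
	}

	if (locked)
		KVM_MMU_UNLOCK(kvm);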

Suggested-by: Ben Gardon &lt;bgardon@google.com&gt;
Signed-off-by: Sean Christopherson &lt;seanjc@google.com&gt;
Message-Id: &lt;20210402005658.3024832-9-seanjc@google.com&gt;
Signed-off-by: Paolo Bonzini &lt;pbonzini@redhat.com&gt;
</content>
</entry>
<entry>
<title>KVM: Move MMU notifier's mmu_lock acquisition into common helper</title>
<updated>2021-04-17T12:31:08+00:00</updated>
<author>
<name>Sean Christopherson</name>
<email>seanjc@google.com</email>
</author>
<published>2021-04-02T00:56:55+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/starfive-tech/linux.git/commit/?id=f922bd9bf33bd5a8c6694927f010f32127810fbf'/>
<id>urn:sha1:f922bd9bf33bd5a8c6694927f010f32127810fbf</id>
<content type='text'>
Acquire and release mmu_lock in the __kvm_handle_hva_range() helper
instead of requiring the caller to do the same.  This paves the way for
future patches to take mmu_lock if and only if an overlapping memslot is
found, without also having to introduce the on_lock() shenanigans used
to manipulate the notifier count and sequence.
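
A heavily condensed sketch of the helper's new shape:

	static bool __kvm_handle_hva_range(struct kvm *kvm,
					   const struct kvm_hva_range *range)
	{
		bool ret = false;

		KVM_MMU_LOCK(kvm);	/* moved in from the callers */

		/* ... walk memslots, invoke range-&gt;handler() on hits ... */

		KVM_MMU_UNLOCK(kvm);	/* released here as well */

		return ret;
	}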

No functional change intended.

Signed-off-by: Sean Christopherson &lt;seanjc@google.com&gt;
Message-Id: &lt;20210402005658.3024832-8-seanjc@google.com&gt;
Signed-off-by: Paolo Bonzini &lt;pbonzini@redhat.com&gt;
</content>
</entry>
<entry>
<title>KVM: Kill off the old hva-based MMU notifier callbacks</title>
<updated>2021-04-17T12:31:08+00:00</updated>
<author>
<name>Sean Christopherson</name>
<email>seanjc@google.com</email>
</author>
<published>2021-04-02T00:56:54+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/starfive-tech/linux.git/commit/?id=b4c5936c47f86295cc76672e8dbeeca8b2379ba6'/>
<id>urn:sha1:b4c5936c47f86295cc76672e8dbeeca8b2379ba6</id>
<content type='text'>
Yank out the hva-based MMU notifier APIs now that all architectures that
use the notifiers have moved to the gfn-based APIs.

No functional change intended.

Signed-off-by: Sean Christopherson &lt;seanjc@google.com&gt;
Message-Id: &lt;20210402005658.3024832-7-seanjc@google.com&gt;
Signed-off-by: Paolo Bonzini &lt;pbonzini@redhat.com&gt;
</content>
</entry>
</feed>
