summaryrefslogtreecommitdiff
path: root/virt/kvm/kvm_main.c
AgeCommit message (Collapse)AuthorFilesLines
2020-01-27KVM: Move running VCPU from ARM to common codePaolo Bonzini1-1/+24
For ring-based dirty log tracking, it will be more efficient to account writes during schedule-out or schedule-in to the currently running VCPU. We would like to do it even if the write doesn't use the current VCPU's address space, as is the case for cached writes (see commit 4e335d9e7ddb, "Revert "KVM: Support vCPU-based gfn->hva cache"", 2017-05-02). Therefore, add a mechanism to track the currently-loaded kvm_vcpu struct. There is already something similar in KVM/ARM; one important difference is that kvm_arch_vcpu_{load,put} have two callers in virt/kvm/kvm_main.c: we have to update both the architecture-independent vcpu_{load,put} and the preempt notifiers. Another change made in the process is to allow using kvm_get_running_vcpu() in preemptible code. This is allowed because preempt notifiers ensure that the value does not change even after the VCPU thread is migrated. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Peter Xu <peterx@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-01-27KVM: Add build-time error check on kvm_run sizePeter Xu1-0/+1
It's already going to reach 2400 Bytes (which is over half of page size on 4K page archs), so maybe it's good to have this build-time check in case it overflows when adding new fields. Signed-off-by: Peter Xu <peterx@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-01-27KVM: Remove kvm_read_guest_atomic()Peter Xu1-11/+0
Remove kvm_read_guest_atomic() because it's not used anywhere. Signed-off-by: Peter Xu <peterx@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-01-27KVM: Move vcpu->run page allocation out of kvm_vcpu_init()Sean Christopherson1-21/+13
Open code the allocation and freeing of the vcpu->run page in kvm_vm_ioctl_create_vcpu() and kvm_vcpu_destroy() respectively. Doing so allows kvm_vcpu_init() to be a pure init function and eliminates kvm_vcpu_uninit() entirely. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Reviewed-by: Cornelia Huck <cohuck@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-01-27KVM: Move putting of vcpu->pid to kvm_vcpu_destroy()Sean Christopherson1-6/+7
Move the putting of vcpu->pid to kvm_vcpu_destroy(). vcpu->pid is guaranteed to be NULL when kvm_vcpu_uninit() is called in the error path of kvm_vm_ioctl_create_vcpu(), e.g. it is explicitly nullified by kvm_vcpu_init() and is only changed by KVM_RUN. No functional change intended. Acked-by: Christoffer Dall <christoffer.dall@arm.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Reviewed-by: Cornelia Huck <cohuck@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-01-27KVM: Drop kvm_arch_vcpu_init() and kvm_arch_vcpu_uninit()Sean Christopherson1-14/+2
Remove kvm_arch_vcpu_init() and kvm_arch_vcpu_uninit() now that all arch specific implementations are nops. Acked-by: Christoffer Dall <christoffer.dall@arm.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Reviewed-by: Cornelia Huck <cohuck@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-01-27KVM: Drop kvm_arch_vcpu_setup()Sean Christopherson1-5/+0
Remove kvm_arch_vcpu_setup() now that all arch specific implementations are nops. Acked-by: Christoffer Dall <christoffer.dall@arm.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Reviewed-by: Cornelia Huck <cohuck@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-01-27KVM: Move initialization of preempt notifier to kvm_vcpu_init()Sean Christopherson1-2/+1
Initialize the preempt notifier immediately in kvm_vcpu_init() to pave the way for removing kvm_arch_vcpu_setup(), i.e. to allow arch specific code to call vcpu_load() during kvm_arch_vcpu_create(). Back when preemption support was added, the location of the call to init the preempt notifier was perfectly sane. The overall vCPU creation flow featured a single arch specific hook and the preempt notifer was used immediately after its initialization (by vcpu_load()). E.g.: vcpu = kvm_arch_ops->vcpu_create(kvm, n); if (IS_ERR(vcpu)) return PTR_ERR(vcpu); preempt_notifier_init(&vcpu->preempt_notifier, &kvm_preempt_ops); vcpu_load(vcpu); r = kvm_mmu_setup(vcpu); vcpu_put(vcpu); if (r < 0) goto free_vcpu; Today, the call to preempt_notifier_init() is sandwiched between two arch specific calls, kvm_arch_vcpu_create() and kvm_arch_vcpu_setup(), which needlessly forces x86 (and possibly others?) to split its vCPU creation flow. Init the preempt notifier prior to any arch specific call so that each arch can independently decide how best to organize its creation flow. Acked-by: Christoffer Dall <christoffer.dall@arm.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Reviewed-by: Cornelia Huck <cohuck@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-01-27KVM: Unexport kvm_vcpu_cache and kvm_vcpu_{un}init()Sean Christopherson1-6/+3
Unexport kvm_vcpu_cache and kvm_vcpu_{un}init() and make them static now that they are referenced only in kvm_main.c. Acked-by: Christoffer Dall <christoffer.dall@arm.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Reviewed-by: Cornelia Huck <cohuck@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-01-27KVM: Move vcpu alloc and init invocation to common codeSean Christopherson1-3/+18
Now that all architectures tightly couple vcpu allocation/free with the mandatory calls to kvm_{un}init_vcpu(), move the sequences verbatim to common KVM code. Move both allocation and initialization in a single patch to eliminate thrash in arch specific code. The bisection benefits of moving the two pieces in separate patches is marginal at best, whereas the odds of introducing a transient arch specific bug are non-zero. Acked-by: Christoffer Dall <christoffer.dall@arm.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Reviewed-by: Cornelia Huck <cohuck@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-01-24KVM: Introduce kvm_vcpu_destroy()Sean Christopherson1-0/+6
Add kvm_vcpu_destroy() and wire up all architectures to call the common function instead of their arch specific implementation. The common destruction function will be used by future patches to move allocation and initialization of vCPUs to common KVM code, i.e. to free resources that are allocated by arch agnostic code. No functional change intended. Acked-by: Christoffer Dall <christoffer.dall@arm.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Reviewed-by: Cornelia Huck <cohuck@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-01-24KVM: Add kvm_arch_vcpu_precreate() to handle pre-allocation issuesSean Christopherson1-0/+4
Add a pre-allocation arch hook to handle checks that are currently done by arch specific code prior to allocating the vCPU object. This paves the way for moving the allocation to common KVM code. Acked-by: Christoffer Dall <christoffer.dall@arm.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Reviewed-by: Cornelia Huck <cohuck@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-01-21kvm: Refactor handling of VM debugfs filesMilan Pandurov1-71/+71
We can store reference to kvm_stats_debugfs_item instead of copying its values to kvm_stat_data. This allows us to remove duplicated code and usage of temporary kvm_stat_data inside vm_stat_get et al. Signed-off-by: Milan Pandurov <milanpa@amazon.de> Reviewed-by: Alexander Graf <graf@amazon.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-01-21KVM: Fix some writing mistakesMiaohe Lin1-1/+1
Fix some writing mistakes in the comments. Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-01-21KVM: Fix some grammar mistakesMiaohe Lin1-1/+1
Fix some grammar mistakes in the comments. Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-01-21KVM: Fix some wrong function names in commentMiaohe Lin1-1/+1
Fix some wrong function names in comment. mmu_check_roots is a typo for mmu_check_root, vmcs_read_any should be vmcs12_read_any and so on. Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-01-08KVM: get rid of var page in kvm_set_pfn_dirty()Miaohe Lin1-5/+2
We can get rid of unnecessary var page in kvm_set_pfn_dirty() , thus make code style similar with kvm_set_pfn_accessed(). Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-11-23KVM: Fix jump label out_free_* in kvm_init()Miaohe Lin1-4/+3
The jump label out_free_1 and out_free_2 deal with the same stuff, so git rid of one and rename the label out_free_0a to retain the label name order. Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-11-21Merge branch 'kvm-tsx-ctrl' into HEADPaolo Bonzini1-30/+179
Conflicts: arch/x86/kvm/vmx/vmx.c
2019-11-21Merge tag 'kvmarm-5.5' of ↵Paolo Bonzini1-3/+3
git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD KVM/arm updates for Linux 5.5: - Allow non-ISV data aborts to be reported to userspace - Allow injection of data aborts from userspace - Expose stolen time to guests - GICv4 performance improvements - vgic ITS emulation fixes - Simplify FWB handling - Enable halt pool counters - Make the emulated timer PREEMPT_RT compliant Conflicts: include/uapi/linux/kvm.h
2019-11-15KVM: remember position in kvm->vcpus arrayRadim Krčmář1-2/+3
Fetching an index for any vcpu in kvm->vcpus array by traversing the entire array everytime is costly. This patch remembers the position of each vcpu in kvm->vcpus array by storing it in vcpus_idx under kvm_vcpu structure. Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-11-15KVM: Add a comment describing the /dev/kvm no_compat handlingMarc Zyngier1-0/+7
Add a comment explaining the rational behind having both no_compat open and ioctl callbacks to fend off compat tasks. Signed-off-by: Marc Zyngier <maz@kernel.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-11-13KVM: Forbid /dev/kvm being opened by a compat task when CONFIG_KVM_COMPAT=nMarc Zyngier1-1/+7
On a system without KVM_COMPAT, we prevent IOCTLs from being issued by a compat task. Although this prevents most silly things from happening, it can still confuse a 32bit userspace that is able to open the kvm device (the qemu test suite seems to be pretty mad with this behaviour). Take a more radical approach and return a -ENODEV to the compat task. Reported-by: Peter Maydell <peter.maydell@linaro.org> Signed-off-by: Marc Zyngier <maz@kernel.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-11-13Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds1-14/+34
Pull kvm fixes from Paolo Bonzini: "Fix unwinding of KVM_CREATE_VM failure, VT-d posted interrupts, DAX/ZONE_DEVICE, and module unload/reload" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: KVM: MMU: Do not treat ZONE_DEVICE pages as being reserved KVM: VMX: Introduce pi_is_pir_empty() helper KVM: VMX: Do not change PID.NDST when loading a blocked vCPU KVM: VMX: Consider PID.PIR to determine if vCPU has pending interrupts KVM: VMX: Fix comment to specify PID.ON instead of PIR.ON KVM: X86: Fix initialization of MSR lists KVM: fix placement of refcount initialization KVM: Fix NULL-ptr deref after kvm_create_vm fails
2019-11-12KVM: MMU: Do not treat ZONE_DEVICE pages as being reservedSean Christopherson1-3/+23
Explicitly exempt ZONE_DEVICE pages from kvm_is_reserved_pfn() and instead manually handle ZONE_DEVICE on a case-by-case basis. For things like page refcounts, KVM needs to treat ZONE_DEVICE pages like normal pages, e.g. put pages grabbed via gup(). But for flows such as setting A/D bits or shifting refcounts for transparent huge pages, KVM needs to to avoid processing ZONE_DEVICE pages as the flows in question lack the underlying machinery for proper handling of ZONE_DEVICE pages. This fixes a hang reported by Adam Borowski[*] in dev_pagemap_cleanup() when running a KVM guest backed with /dev/dax memory, as KVM straight up doesn't put any references to ZONE_DEVICE pages acquired by gup(). Note, Dan Williams proposed an alternative solution of doing put_page() on ZONE_DEVICE pages immediately after gup() in order to simplify the auditing needed to ensure is_zone_device_page() is called if and only if the backing device is pinned (via gup()). But that approach would break kvm_vcpu_{un}map() as KVM requires the page to be pinned from map() 'til unmap() when accessing guest memory, unlike KVM's secondary MMU, which coordinates with mmu_notifier invalidations to avoid creating stale page references, i.e. doesn't rely on pages being pinned. [*] http://lkml.kernel.org/r/20190919115547.GA17963@angband.pl Reported-by: Adam Borowski <kilobyte@angband.pl> Analyzed-by: David Hildenbrand <david@redhat.com> Acked-by: Dan Williams <dan.j.williams@intel.com> Cc: stable@vger.kernel.org Fixes: 3565fce3a659 ("mm, x86: get_user_pages() for dax mappings") Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-11-11KVM: fix placement of refcount initializationPaolo Bonzini1-2/+2
Reported by syzkaller: ============================= WARNING: suspicious RCU usage ----------------------------- ./include/linux/kvm_host.h:536 suspicious rcu_dereference_check() usage! other info that might help us debug this: rcu_scheduler_active = 2, debug_locks = 1 no locks held by repro_11/12688. stack backtrace: Call Trace: dump_stack+0x7d/0xc5 lockdep_rcu_suspicious+0x123/0x170 kvm_dev_ioctl+0x9a9/0x1260 [kvm] do_vfs_ioctl+0x1a1/0xfb0 ksys_ioctl+0x6d/0x80 __x64_sys_ioctl+0x73/0xb0 do_syscall_64+0x108/0xaa0 entry_SYSCALL_64_after_hwframe+0x49/0xbe Commit a97b0e773e4 (kvm: call kvm_arch_destroy_vm if vm creation fails) sets users_count to 1 before kvm_arch_init_vm(), however, if kvm_arch_init_vm() fails, we need to decrease this count. By moving it earlier, we can push the decrease to out_err_no_arch_destroy_vm without introducing yet another error label. syzkaller source: https://syzkaller.appspot.com/x/repro.c?x=15209b84e00000 Reported-by: syzbot+75475908cd0910f141ee@syzkaller.appspotmail.com Fixes: a97b0e773e49 ("kvm: call kvm_arch_destroy_vm if vm creation fails") Cc: Jim Mattson <jmattson@google.com> Analyzed-by: Wanpeng Li <wanpengli@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-11-11KVM: Fix NULL-ptr deref after kvm_create_vm failsPaolo Bonzini1-9/+9
Reported by syzkaller: kasan: CONFIG_KASAN_INLINE enabled kasan: GPF could be caused by NULL-ptr deref or user memory access general protection fault: 0000 [#1] PREEMPT SMP KASAN CPU: 0 PID: 14727 Comm: syz-executor.3 Not tainted 5.4.0-rc4+ #0 RIP: 0010:kvm_coalesced_mmio_init+0x5d/0x110 arch/x86/kvm/../../../virt/kvm/coalesced_mmio.c:121 Call Trace: kvm_dev_ioctl_create_vm arch/x86/kvm/../../../virt/kvm/kvm_main.c:3446 [inline] kvm_dev_ioctl+0x781/0x1490 arch/x86/kvm/../../../virt/kvm/kvm_main.c:3494 vfs_ioctl fs/ioctl.c:46 [inline] file_ioctl fs/ioctl.c:509 [inline] do_vfs_ioctl+0x196/0x1150 fs/ioctl.c:696 ksys_ioctl+0x62/0x90 fs/ioctl.c:713 __do_sys_ioctl fs/ioctl.c:720 [inline] __se_sys_ioctl fs/ioctl.c:718 [inline] __x64_sys_ioctl+0x6e/0xb0 fs/ioctl.c:718 do_syscall_64+0xca/0x5d0 arch/x86/entry/common.c:290 entry_SYSCALL_64_after_hwframe+0x49/0xbe Commit 9121923c457d ("kvm: Allocate memslots and buses before calling kvm_arch_init_vm") moves memslots and buses allocations around, however, if kvm->srcu/irq_srcu fails initialization, NULL will be returned instead of error code, NULL will not be intercepted in kvm_dev_ioctl_create_vm() and be dereferenced by kvm_coalesced_mmio_init(), this patch fixes it. Moving the initialization is required anyway to avoid an incorrect synchronize_srcu that was also reported by syzkaller: wait_for_completion+0x29c/0x440 kernel/sched/completion.c:136 __synchronize_srcu+0x197/0x250 kernel/rcu/srcutree.c:921 synchronize_srcu_expedited kernel/rcu/srcutree.c:946 [inline] synchronize_srcu+0x239/0x3e8 kernel/rcu/srcutree.c:997 kvm_page_track_unregister_notifier+0xe7/0x130 arch/x86/kvm/page_track.c:212 kvm_mmu_uninit_vm+0x1e/0x30 arch/x86/kvm/mmu.c:5828 kvm_arch_destroy_vm+0x4a2/0x5f0 arch/x86/kvm/x86.c:9579 kvm_create_vm arch/x86/kvm/../../../virt/kvm/kvm_main.c:702 [inline] so do it. Reported-by: syzbot+89a8060879fa0bd2db4f@syzkaller.appspotmail.com Reported-by: syzbot+e27e7027eb2b80e44225@syzkaller.appspotmail.com Fixes: 9121923c457d ("kvm: Allocate memslots and buses before calling kvm_arch_init_vm") Cc: Jim Mattson <jmattson@google.com> Cc: Wanpeng Li <wanpengli@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-11-04kvm: x86: mmu: Recovery of shattered NX large pagesJunaid Shahid1-0/+28
The page table pages corresponding to broken down large pages are zapped in FIFO order, so that the large page can potentially be recovered, if it is not longer being used for execution. This removes the performance penalty for walking deeper EPT page tables. By default, one large page will last about one hour once the guest reaches a steady state. Signed-off-by: Junaid Shahid <junaids@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2019-11-04kvm: Add helper function for creating VM worker threadsJunaid Shahid1-0/+84
Add a function to create a kernel thread associated with a given VM. In particular, it ensures that the worker thread inherits the priority and cgroups of the calling thread. Signed-off-by: Junaid Shahid <junaids@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2019-10-31kvm: call kvm_arch_destroy_vm if vm creation failsJim Mattson1-5/+7
In kvm_create_vm(), if we've successfully called kvm_arch_init_vm(), but then fail later in the function, we need to call kvm_arch_destroy_vm() so that it can do any necessary cleanup (like freeing memory). Fixes: 44a95dae1d229a ("KVM: x86: Detect and Initialize AVIC support") Signed-off-by: John Sperbeck <jsperbeck@google.com> Signed-off-by: Jim Mattson <jmattson@google.com> Reviewed-by: Junaid Shahid <junaids@google.com> [Remove dependency on "kvm: Don't clear reference count on kvm_create_vm() error path" which was not committed. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-10-25kvm: Allocate memslots and buses before calling kvm_arch_init_vmJim Mattson1-19/+21
This reorganization will allow us to call kvm_arch_destroy_vm in the event that kvm_create_vm fails after calling kvm_arch_init_vm. Suggested-by: Junaid Shahid <junaids@google.com> Signed-off-by: Jim Mattson <jmattson@google.com> Reviewed-by: Junaid Shahid <junaids@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-10-22KVM: Add separate helper for putting borrowed reference to kvmSean Christopherson1-2/+14
Add a new helper, kvm_put_kvm_no_destroy(), to handle putting a borrowed reference[*] to the VM when installing a new file descriptor fails. KVM expects the refcount to remain valid in this case, as the in-progress ioctl() has an explicit reference to the VM. The primary motiviation for the helper is to document that the 'kvm' pointer is still valid after putting the borrowed reference, e.g. to document that doing mutex(&kvm->lock) immediately after putting a ref to kvm isn't broken. [*] When exposing a new object to userspace via a file descriptor, e.g. a new vcpu, KVM grabs a reference to itself (the VM) prior to making the object visible to userspace to avoid prematurely freeing the VM in the scenario where userspace immediately closes file descriptor. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-10-22KVM: Don't shrink/grow vCPU halt_poll_ns if host side polling is disabledWanpeng Li1-13/+16
Don't waste cycles to shrink/grow vCPU halt_poll_ns if host side polling is disabled. Acked-by: Marcelo Tosatti <mtosatti@redhat.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Wanpeng Li <wanpengli@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-10-21KVM: Allow kvm_device_ops to be constSteven Price1-3/+3
Currently a kvm_device_ops structure cannot be const without triggering compiler warnings. However the structure doesn't need to be written to and, by marking it const, it can be read-only in memory. Add some more const keywords to allow this. Reviewed-by: Andrew Jones <drjones@redhat.com> Signed-off-by: Steven Price <steven.price@arm.com> Signed-off-by: Marc Zyngier <maz@kernel.org>
2019-09-30kvm: x86, powerpc: do not allow clearing largepages debugfs entryPaolo Bonzini1-3/+7
The largepages debugfs entry is incremented/decremented as shadow pages are created or destroyed. Clearing it will result in an underflow, which is harmless to KVM but ugly (and could be misinterpreted by tools that use debugfs information), so make this particular statistic read-only. Cc: kvm-ppc@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-08-18KVM: Call kvm_arch_vcpu_blocking early into the blocking sequenceMarc Zyngier1-4/+3
When a vpcu is about to block by calling kvm_vcpu_block, we call back into the arch code to allow any form of synchronization that may be required at this point (SVN stops the AVIC, ARM synchronises the VMCR and enables GICv4 doorbells). But this synchronization comes in quite late, as we've potentially waited for halt_poll_ns to expire. Instead, let's move kvm_arch_vcpu_blocking() to the beginning of kvm_vcpu_block(), which on ARM has several benefits: - VMCR gets synchronised early, meaning that any interrupt delivered during the polling window will be evaluated with the correct guest PMR - GICv4 doorbells are enabled, which means that any guest interrupt directly injected during that window will be immediately recognised Tang Nianyao ran some tests on a GICv4 machine to evaluate such change, and reported up to a 10% improvement for netperf: <quote> netperf result: D06 as server, intel 8180 server as client with change: package 512 bytes - 5500 Mbits/s package 64 bytes - 760 Mbits/s without change: package 512 bytes - 5000 Mbits/s package 64 bytes - 710 Mbits/s </quote> Acked-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Marc Zyngier <maz@kernel.org>
2019-08-09kvm: remove unnecessary PageReserved checkPaolo Bonzini1-2/+1
The same check is already done in kvm_is_reserved_pfn. Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-08-05KVM: no need to check return value of debugfs_create functionsGreg KH1-16/+5
When calling debugfs functions, there is no need to ever check the return value. The function can work or not, but the code logic should never do something different based on this. Also, when doing this, change kvm_arch_create_vcpu_debugfs() to return void instead of an integer, as we should not care at all about if this function actually does anything or not. Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: "Radim Krčmář" <rkrcmar@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: <x86@kernel.org> Cc: <kvm@vger.kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-08-05KVM: remove kvm_arch_has_vcpu_debugfs()Paolo Bonzini1-3/+2
There is no need for this function as all arches have to implement kvm_arch_create_vcpu_debugfs() no matter what. A #define symbol let us actually simplify the code. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-08-05KVM: Fix leak vCPU's VMCS value into other pCPUWanpeng Li1-1/+24
After commit d73eb57b80b (KVM: Boost vCPUs that are delivering interrupts), a five years old bug is exposed. Running ebizzy benchmark in three 80 vCPUs VMs on one 80 pCPUs Skylake server, a lot of rcu_sched stall warning splatting in the VMs after stress testing: INFO: rcu_sched detected stalls on CPUs/tasks: { 4 41 57 62 77} (detected by 15, t=60004 jiffies, g=899, c=898, q=15073) Call Trace: flush_tlb_mm_range+0x68/0x140 tlb_flush_mmu.part.75+0x37/0xe0 tlb_finish_mmu+0x55/0x60 zap_page_range+0x142/0x190 SyS_madvise+0x3cd/0x9c0 system_call_fastpath+0x1c/0x21 swait_active() sustains to be true before finish_swait() is called in kvm_vcpu_block(), voluntarily preempted vCPUs are taken into account by kvm_vcpu_on_spin() loop greatly increases the probability condition kvm_arch_vcpu_runnable(vcpu) is checked and can be true, when APICv is enabled the yield-candidate vCPU's VMCS RVI field leaks(by vmx_sync_pir_to_irr()) into spinning-on-a-taken-lock vCPU's current VMCS. This patch fixes it by checking conservatively a subset of events. Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Marc Zyngier <Marc.Zyngier@arm.com> Cc: stable@vger.kernel.org Fixes: 98f4a1467 (KVM: add kvm_arch_vcpu_runnable() test to kvm_vcpu_on_spin() loop) Signed-off-by: Wanpeng Li <wanpengli@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-08-05KVM: Check preempted_in_kernel for involuntary preemptionWanpeng Li1-3/+4
preempted_in_kernel is updated in preempt_notifier when involuntary preemption ocurrs, it can be stale when the voluntarily preempted vCPUs are taken into account by kvm_vcpu_on_spin() loop. This patch lets it just check preempted_in_kernel for involuntary preemption. Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Wanpeng Li <wanpengli@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-07-20KVM: Boost vCPUs that are delivering interruptsWanpeng Li1-4/+8
Inspired by commit 9cac38dd5d (KVM/s390: Set preempted flag during vcpu wakeup and interrupt delivery), we want to also boost not just lock holders but also vCPUs that are delivering interrupts. Most smp_call_function_many calls are synchronous, so the IPI target vCPUs are also good yield candidates. This patch introduces vcpu->ready to boost vCPUs during wakeup and interrupt delivery time; unlike s390 we do not reuse vcpu->preempted so that voluntarily preempted vCPUs are taken into account by kvm_vcpu_on_spin, but vmx_vcpu_pi_put is not affected (VT-d PI handles voluntary preemption separately, in pi_pre_block). Testing on 80 HT 2 socket Xeon Skylake server, with 80 vCPUs VM 80GB RAM: ebizzy -M vanilla boosting improved 1VM 21443 23520 9% 2VM 2800 8000 180% 3VM 1800 3100 72% Testing on my Haswell desktop 8 HT, with 8 vCPUs VM 8GB RAM, two VMs, one running ebizzy -M, the other running 'stress --cpu 2': w/ boosting + w/o pv sched yield(vanilla) vanilla boosting improved 1570 4000 155% w/ boosting + w/ pv sched yield(vanilla) vanilla boosting improved 1844 5157 179% w/o boosting, perf top in VM: 72.33% [kernel] [k] smp_call_function_many 4.22% [kernel] [k] call_function_i 3.71% [kernel] [k] async_page_fault w/ boosting, perf top in VM: 38.43% [kernel] [k] smp_call_function_many 6.31% [kernel] [k] async_page_fault 6.13% libc-2.23.so [.] __memcpy_avx_unaligned 4.88% [kernel] [k] call_function_interrupt Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Paul Mackerras <paulus@ozlabs.org> Cc: Marc Zyngier <maz@kernel.org> Signed-off-by: Wanpeng Li <wanpengli@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-07-11Merge tag 'kvm-arm-for-5.3' of ↵Paolo Bonzini1-4/+1
git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD KVM/arm updates for 5.3 - Add support for chained PMU counters in guests - Improve SError handling - Handle Neoverse N1 erratum #1349291 - Allow side-channel mitigation status to be migrated - Standardise most AArch64 system register accesses to msr_s/mrs_s - Fix host MPIDR corruption on 32bit
2019-07-10KVM: Properly check if "page" is valid in kvm_vcpu_unmapKarimAllah Ahmed1-1/+1
The field "page" is initialized to KVM_UNMAPPED_PAGE when it is not used (i.e. when the memory lives outside kernel control). So this check will always end up using kunmap even for memremap regions. Fixes: e45adf665a53 ("KVM: Introduce a new guest mapping API") Cc: stable@vger.kernel.org Signed-off-by: KarimAllah Ahmed <karahmed@amazon.de> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-06-19treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 499Thomas Gleixner1-4/+1
Based on 1 normalized pattern(s): this work is licensed under the terms of the gnu gpl version 2 see the copying file in the top level directory extracted by the scancode license scanner the SPDX license identifier GPL-2.0-only has been chosen to replace the boilerplate/reference in 35 file(s). Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org> Reviewed-by: Enrico Weigelt <info@metux.net> Reviewed-by: Allison Randal <allison@lohutok.net> Cc: linux-spdx@vger.kernel.org Link: https://lkml.kernel.org/r/20190604081206.797835076@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-05kvm: Convert kvm_lock to a mutexJunaid Shahid1-15/+15
It doesn't seem as if there is any particular need for kvm_lock to be a spinlock, so convert the lock to a mutex so that sleepable functions (in particular cond_resched()) can be called while holding it. Signed-off-by: Junaid Shahid <junaids@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-06-04KVM: Directly return result from kvm_arch_check_processor_compat()Sean Christopherson1-3/+6
Add a wrapper to invoke kvm_arch_check_processor_compat() so that the boilerplate ugliness of checking virtualization support on all CPUs is hidden from the arch specific code. x86's implementation in particular is quite heinous, as it unnecessarily propagates the out-param pattern into kvm_x86_ops. While the x86 specific issue could be resolved solely by changing kvm_x86_ops, make the change for all architectures as returning a value directly is prettier and technically more robust, e.g. s390 doesn't set the out param, which could lead to subtle breakage in the (highly unlikely) scenario where the out-param was not pre-initialized by the caller. Opportunistically annotate svm_check_processor_compat() with __init. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Reviewed-by: Cornelia Huck <cohuck@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-05-28KVM: s390: Do not report unusabled IDs via KVM_CAP_MAX_VCPU_IDThomas Huth1-2/+0
KVM_CAP_MAX_VCPU_ID is currently always reporting KVM_MAX_VCPU_ID on all architectures. However, on s390x, the amount of usable CPUs is determined during runtime - it is depending on the features of the machine the code is running on. Since we are using the vcpu_id as an index into the SCA structures that are defined by the hardware (see e.g. the sca_add_vcpu() function), it is not only the amount of CPUs that is limited by the hard- ware, but also the range of IDs that we can use. Thus KVM_CAP_MAX_VCPU_ID must be determined during runtime on s390x, too. So the handling of KVM_CAP_MAX_VCPU_ID has to be moved from the common code into the architecture specific code, and on s390x we have to return the same value here as for KVM_CAP_MAX_VCPUS. This problem has been discovered with the kvm_create_max_vcpus selftest. With this change applied, the selftest now passes on s390x, too. Reviewed-by: Andrew Jones <drjones@redhat.com> Reviewed-by: Cornelia Huck <cohuck@redhat.com> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Thomas Huth <thuth@redhat.com> Message-Id: <20190523164309.13345-9-thuth@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
2019-05-28kvm: fix compile on s390 part 2Christian Borntraeger1-0/+2
We also need to fence the memunmap part. Fixes: e45adf665a53 ("KVM: Introduce a new guest mapping API") Fixes: d30b214d1d0a (kvm: fix compilation on s390) Cc: Michal Kubecek <mkubecek@suse.cz> Cc: KarimAllah Ahmed <karahmed@amazon.de> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
2019-05-24kvm: fix compilation on s390Paolo Bonzini1-0/+2
s390 does not have memremap, even though in this particular case it would be useful. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>