summaryrefslogtreecommitdiff
path: root/virt/kvm
AgeCommit message (Collapse)AuthorFilesLines
2019-09-23KVM: coalesced_mmio: add bounds checkingMatt Delco1-7/+10
commit b60fe990c6b07ef6d4df67bc0530c7c90a62623a upstream. The first/last indexes are typically shared with a user app. The app can change the 'last' index that the kernel uses to store the next result. This change sanity checks the index before using it for writing to a potentially arbitrary address. This fixes CVE-2019-14821. Fixes: 5f94c1741bdc ("KVM: Add coalesced MMIO support (common part)") Signed-off-by: Matt Delco <delco@chromium.org> Signed-off-by: Jim Mattson <jmattson@google.com> Reported-by: syzbot+983c866c3dd6efa3662a@syzkaller.appspotmail.com [Use READ_ONCE. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> [bwh: Backported to 3.16: - Use ACCESS_ONCE() instead of READ_ONCE() - kvm_coalesced_mmio_zone::pio field is not supported] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2019-08-13KVM: Reject device ioctls from processes other than the VM's creatorSean Christopherson1-0/+3
commit ddba91801aeb5c160b660caed1800eb3aef403f8 upstream. KVM's API requires thats ioctls must be issued from the same process that created the VM. In other words, userspace can play games with a VM's file descriptors, e.g. fork(), SCM_RIGHTS, etc..., but only the creator can do anything useful. Explicitly reject device ioctls that are issued by a process other than the VM's creator, and update KVM's API documentation to extend its requirements to device ioctls. Fixes: 852b6d57dc7f ("kvm: add device control API") Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2019-04-04kvm: Disallow wraparound in kvm_gfn_to_hva_cache_initJim Mattson1-19/+21
commit f1b9dd5eb86cec1fcf66aad17e7701d98d024a9a upstream. Previously, in the case where (gpa + len) wrapped around, the entire region was not validated, as the comment claimed. It doesn't actually seem that wraparound should be allowed here at all. Furthermore, since some callers don't check the return code from this function, it seems prudent to clear ghc->memslot in the event of an error. Fixes: 8f964525a121f ("KVM: Allow cross page reads and writes from cached translations.") Reported-by: Cfir Cohen <cfir@google.com> Signed-off-by: Jim Mattson <jmattson@google.com> Reviewed-by: Cfir Cohen <cfir@google.com> Reviewed-by: Marc Orr <marcorr@google.com> Cc: Andrew Honig <ahonig@google.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> [bwh: Backported to 3.16: adjust context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2019-03-25kvm: fix kvm_ioctl_create_device() reference counting (CVE-2019-6974)Jann Horn1-1/+2
commit cfa39381173d5f969daf43582c95ad679189cbc9 upstream. kvm_ioctl_create_device() does the following: 1. creates a device that holds a reference to the VM object (with a borrowed reference, the VM's refcount has not been bumped yet) 2. initializes the device 3. transfers the reference to the device to the caller's file descriptor table 4. calls kvm_get_kvm() to turn the borrowed reference to the VM into a real reference The ownership transfer in step 3 must not happen before the reference to the VM becomes a proper, non-borrowed reference, which only happens in step 4. After step 3, an attacker can close the file descriptor and drop the borrowed reference, which can cause the refcount of the kvm object to drop to zero. This means that we need to grab a reference for the device before anon_inode_getfd(), otherwise the VM can disappear from under us. Fixes: 852b6d57dc7f ("kvm: add device control API") Signed-off-by: Jann Horn <jannh@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2019-03-25KVM: use after free in kvm_ioctl_create_device()Dan Carpenter1-1/+1
commit a0f1d21c1ccb1da66629627a74059dd7f5ac9c61 upstream. We should move the ops->destroy(dev) after the list_del(&dev->vm_node) so that we don't use "dev" after freeing it. Fixes: a28ebea2adc4 ("KVM: Protect device ops->create and list_add with kvm->lock") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2019-03-25KVM: Protect device ops->create and list_add with kvm->lockChristoffer Dall2-10/+14
commit a28ebea2adc4a2bef5989a5a181ec238f59fbcad upstream. KVM devices were manipulating list data structures without any form of synchronization, and some implementations of the create operations also suffered from a lack of synchronization. Now when we've split the xics create operation into create and init, we can hold the kvm->lock mutex while calling the create operation and when manipulating the devices list. The error path in the generic code gets slightly ugly because we have to take the mutex again and delete the device from the list, but holding the mutex during anon_inode_getfd or releasing/locking the mutex in the common non-error path seemed wrong. Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Acked-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> [bwh: Backported to 3.16: - Drop change to a failure path that doesn't exist in kvm_vgic_create() - Adjust filename, context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2019-03-25KVM: PPC: Move xics_debugfs_init out of createChristoffer Dall1-0/+3
commit 023e9fddc3616b005c3753fc1bb6526388cd7a30 upstream. As we are about to hold the kvm->lock during the create operation on KVM devices, we should move the call to xics_debugfs_init into its own function, since holding a mutex over extended amounts of time might not be a good idea. Introduce an init operation on the kvm_device_ops struct which cannot fail and call this, if configured, after the device has been created. Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2018-06-17KVM: mmu: Fix overlap between public and private memslotsWanpeng Li1-2/+1
commit b28676bb8ae4569cced423dc2a88f7cb319d5379 upstream. Reported by syzkaller: pte_list_remove: ffff9714eb1f8078 0->BUG ------------[ cut here ]------------ kernel BUG at arch/x86/kvm/mmu.c:1157! invalid opcode: 0000 [#1] SMP RIP: 0010:pte_list_remove+0x11b/0x120 [kvm] Call Trace: drop_spte+0x83/0xb0 [kvm] mmu_page_zap_pte+0xcc/0xe0 [kvm] kvm_mmu_prepare_zap_page+0x81/0x4a0 [kvm] kvm_mmu_invalidate_zap_all_pages+0x159/0x220 [kvm] kvm_arch_flush_shadow_all+0xe/0x10 [kvm] kvm_mmu_notifier_release+0x6c/0xa0 [kvm] ? kvm_mmu_notifier_release+0x5/0xa0 [kvm] __mmu_notifier_release+0x79/0x110 ? __mmu_notifier_release+0x5/0x110 exit_mmap+0x15a/0x170 ? do_exit+0x281/0xcb0 mmput+0x66/0x160 do_exit+0x2c9/0xcb0 ? __context_tracking_exit.part.5+0x4a/0x150 do_group_exit+0x50/0xd0 SyS_exit_group+0x14/0x20 do_syscall_64+0x73/0x1f0 entry_SYSCALL64_slow_path+0x25/0x25 The reason is that when creates new memslot, there is no guarantee for new memslot not overlap with private memslots. This can be triggered by the following program: #include <fcntl.h> #include <pthread.h> #include <setjmp.h> #include <signal.h> #include <stddef.h> #include <stdint.h> #include <stdio.h> #include <stdlib.h> #include <string.h> #include <sys/ioctl.h> #include <sys/stat.h> #include <sys/syscall.h> #include <sys/types.h> #include <unistd.h> #include <linux/kvm.h> long r[16]; int main() { void *p = valloc(0x4000); r[2] = open("/dev/kvm", 0); r[3] = ioctl(r[2], KVM_CREATE_VM, 0x0ul); uint64_t addr = 0xf000; ioctl(r[3], KVM_SET_IDENTITY_MAP_ADDR, &addr); r[6] = ioctl(r[3], KVM_CREATE_VCPU, 0x0ul); ioctl(r[3], KVM_SET_TSS_ADDR, 0x0ul); ioctl(r[6], KVM_RUN, 0); ioctl(r[6], KVM_RUN, 0); struct kvm_userspace_memory_region mr = { .slot = 0, .flags = KVM_MEM_LOG_DIRTY_PAGES, .guest_phys_addr = 0xf000, .memory_size = 0x4000, .userspace_addr = (uintptr_t) p }; ioctl(r[3], KVM_SET_USER_MEMORY_REGION, &mr); return 0; } This patch fixes the bug by not adding a new memslot even if it overlaps with private memslots. Reported-by: Dmitry Vyukov <dvyukov@google.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Eric Biggers <ebiggers3@gmail.com> Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com> [bwh: Backported to 3.16: adjust context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2017-10-12vfio: New external user group/file matchAlex Williamson1-8/+19
commit 5d6dee80a1e94cc284d03e06d930e60e8d3ecf7d upstream. At the point where the kvm-vfio pseudo device wants to release its vfio group reference, we can't always acquire a new reference to make that happen. The group can be in a state where we wouldn't allow a new reference to be added. This new helper function allows a caller to match a file to a group to facilitate this. Given a file and group, report if they match. Thus the caller needs to already have a group reference to match to the file. This allows the deletion of a group without acquiring a new reference. Signed-off-by: Alex Williamson <alex.williamson@redhat.com> Reviewed-by: Eric Auger <eric.auger@redhat.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Tested-by: Eric Auger <eric.auger@redhat.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2017-07-18KVM: kvm_io_bus_unregister_dev() should never failDavid Hildenbrand2-18/+25
commit 90db10434b163e46da413d34db8d0e77404cc645 upstream. No caller currently checks the return value of kvm_io_bus_unregister_dev(). This is evil, as all callers silently go on freeing their device. A stale reference will remain in the io_bus, getting at least used again, when the iobus gets teared down on kvm_destroy_vm() - leading to use after free errors. There is nothing the callers could do, except retrying over and over again. So let's simply remove the bus altogether, print an error and make sure no one can access this broken bus again (returning -ENOMEM on any attempt to access it). Fixes: e93f8a0f821e ("KVM: convert io_bus to SRCU") Reported-by: Dmitry Vyukov <dvyukov@google.com> Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com> Signed-off-by: David Hildenbrand <david@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> [bwh: Backported to 3.16: - Drop changes to kvm_io_bus_get_dev() - Adjust context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2017-07-18KVM: x86: clear bus pointer when destroyedPeter Xu1-1/+11
commit df630b8c1e851b5e265dc2ca9c87222e342c093b upstream. When releasing the bus, let's clear the bus pointers to mark it out. If any further device unregister happens on this bus, we know that we're done if we found the bus being released already. Signed-off-by: Peter Xu <peterx@redhat.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2016-11-20KVM: nVMX: Fix memory corruption when using VMCS shadowingJim Mattson1-0/+2
commit 2f1fe81123f59271bddda673b60116bde9660385 upstream. When freeing the nested resources of a vcpu, there is an assumption that the vcpu's vmcs01 is the current VMCS on the CPU that executes nested_release_vmcs12(). If this assumption is violated, the vcpu's vmcs01 may be made active on multiple CPUs at the same time, in violation of Intel's specification. Moreover, since the vcpu's vmcs01 is not VMCLEARed on every CPU on which it is active, it can linger in a CPU's VMCS cache after it has been freed and potentially repurposed. Subsequent eviction from the CPU's VMCS cache on a capacity miss can result in memory corruption. It is not sufficient for vmx_free_vcpu() to call vmx_load_vmcs01(). If the vcpu in question was last loaded on a different CPU, it must be migrated to the current CPU before calling vmx_load_vmcs01(). Signed-off-by: Jim Mattson <jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2016-08-23kvm: Fix irq route entries exceeding KVM_MAX_IRQ_ROUTESXiubo Li1-1/+1
commit caf1ff26e1aa178133df68ac3d40815fed2187d9 upstream. These days, we experienced one guest crash with 8 cores and 3 disks, with qemu error logs as bellow: qemu-system-x86_64: /build/qemu-2.0.0/kvm-all.c:984: kvm_irqchip_commit_routes: Assertion `ret == 0' failed. And then we found one patch(bdf026317d) in qemu tree, which said could fix this bug. Execute the following script will reproduce the BUG quickly: irq_affinity.sh ======================================================================== vda_irq_num=25 vdb_irq_num=27 while [ 1 ] do for irq in {1,2,4,8,10,20,40,80} do echo $irq > /proc/irq/$vda_irq_num/smp_affinity echo $irq > /proc/irq/$vdb_irq_num/smp_affinity dd if=/dev/vda of=/dev/zero bs=4K count=100 iflag=direct dd if=/dev/vdb of=/dev/zero bs=4K count=100 iflag=direct done done ======================================================================== The following qemu log is added in the qemu code and is displayed when this bug reproduced: kvm_irqchip_commit_routes: max gsi: 1008, nr_allocated_irq_routes: 1024, irq_routes->nr: 1024, gsi_count: 1024. That's to say when irq_routes->nr == 1024, there are 1024 routing entries, but in the kernel code when routes->nr >= 1024, will just return -EINVAL; The nr is the number of the routing entries which is in of [1 ~ KVM_MAX_IRQ_ROUTES], not the index in [0 ~ KVM_MAX_IRQ_ROUTES - 1]. This patch fix the BUG above. Signed-off-by: Xiubo Li <lixiubo@cmss.chinamobile.com> Signed-off-by: Wei Tang <tangwei@cmss.chinamobile.com> Signed-off-by: Zhang Zhuoyu <zhangzhuoyu@cmss.chinamobile.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2016-08-23KVM: irqfd: fix NULL pointer dereference in kvm_irq_map_gsiPaolo Bonzini1-1/+1
commit c622a3c21ede892e370b56e1ceb9eb28f8bbda6b upstream. Found by syzkaller: BUG: unable to handle kernel NULL pointer dereference at 0000000000000120 IP: [<ffffffffa0797202>] kvm_irq_map_gsi+0x12/0x90 [kvm] PGD 6f80b067 PUD b6535067 PMD 0 Oops: 0000 [#1] SMP CPU: 3 PID: 4988 Comm: a.out Not tainted 4.4.9-300.fc23.x86_64 #1 [...] Call Trace: [<ffffffffa0795f62>] irqfd_update+0x32/0xc0 [kvm] [<ffffffffa0796c7c>] kvm_irqfd+0x3dc/0x5b0 [kvm] [<ffffffffa07943f4>] kvm_vm_ioctl+0x164/0x6f0 [kvm] [<ffffffff81241648>] do_vfs_ioctl+0x298/0x480 [<ffffffff812418a9>] SyS_ioctl+0x79/0x90 [<ffffffff817a1062>] tracesys_phase2+0x84/0x89 Code: b5 71 a7 e0 5b 41 5c 41 5d 5d f3 c3 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 8b 8f 10 2e 00 00 31 c0 48 89 e5 <39> 91 20 01 00 00 76 6a 48 63 d2 48 8b 94 d1 28 01 00 00 48 85 RIP [<ffffffffa0797202>] kvm_irq_map_gsi+0x12/0x90 [kvm] RSP <ffff8800926cbca8> CR2: 0000000000000120 Testcase: #include <unistd.h> #include <sys/syscall.h> #include <string.h> #include <stdint.h> #include <linux/kvm.h> #include <fcntl.h> #include <sys/ioctl.h> long r[26]; int main() { memset(r, -1, sizeof(r)); r[2] = open("/dev/kvm", 0); r[3] = ioctl(r[2], KVM_CREATE_VM, 0); struct kvm_irqfd ifd; ifd.fd = syscall(SYS_eventfd2, 5, 0); ifd.gsi = 3; ifd.flags = 2; ifd.resamplefd = ifd.fd; r[25] = ioctl(r[3], KVM_IRQFD, &ifd); return 0; } Reported-by: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> [bwh: Backported to 3.16: adjust context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2016-05-01KVM: fix spin_lock_init order on x86Paolo Bonzini1-10/+11
commit e9ad4ec8379ad1ba6f68b8ca1c26b50b5ae0a327 upstream. Moving the initialization earlier is needed in 4.6 because kvm_arch_init_vm is now using mmu_lock, causing lockdep to complain: [ 284.440294] INFO: trying to register non-static key. [ 284.445259] the code is fine but needs lockdep annotation. [ 284.450736] turning off the locking correctness validator. ... [ 284.528318] [<ffffffff810aecc3>] lock_acquire+0xd3/0x240 [ 284.533733] [<ffffffffa0305aa0>] ? kvm_page_track_register_notifier+0x20/0x60 [kvm] [ 284.541467] [<ffffffff81715581>] _raw_spin_lock+0x41/0x80 [ 284.546960] [<ffffffffa0305aa0>] ? kvm_page_track_register_notifier+0x20/0x60 [kvm] [ 284.554707] [<ffffffffa0305aa0>] kvm_page_track_register_notifier+0x20/0x60 [kvm] [ 284.562281] [<ffffffffa02ece70>] kvm_mmu_init_vm+0x20/0x30 [kvm] [ 284.568381] [<ffffffffa02dbf7a>] kvm_arch_init_vm+0x1ea/0x200 [kvm] [ 284.574740] [<ffffffffa02bff3f>] kvm_dev_ioctl+0xbf/0x4d0 [kvm] However, it also helps fixing a preexisting problem, which is why this patch is also good for stable kernels: kvm_create_vm was incrementing current->mm->mm_count but not decrementing it at the out_err label (in case kvm_init_mmu_notifier failed). The new initialization order makes it possible to add the required mmdrop without adding a new error label. Reported-by: Borislav Petkov <bp@alien8.de> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2016-03-08KVM: async_pf: do not warn on page allocation failuresChristian Borntraeger1-1/+1
commit d7444794a02ff655eda87e3cc54e86b940e7736f upstream. In async_pf we try to allocate with NOWAIT to get an element quickly or fail. This code also handle failures gracefully. Lets silence potential page allocation failures under load. qemu-system-s39: page allocation failure: order:0,mode:0x2200000 [...] Call Trace: ([<00000000001146b8>] show_trace+0xf8/0x148) [<000000000011476a>] show_stack+0x62/0xe8 [<00000000004a36b8>] dump_stack+0x70/0x98 [<0000000000272c3a>] warn_alloc_failed+0xd2/0x148 [<000000000027709e>] __alloc_pages_nodemask+0x94e/0xb38 [<00000000002cd36a>] new_slab+0x382/0x400 [<00000000002cf7ac>] ___slab_alloc.constprop.30+0x2dc/0x378 [<00000000002d03d0>] kmem_cache_alloc+0x160/0x1d0 [<0000000000133db4>] kvm_setup_async_pf+0x6c/0x198 [<000000000013dee8>] kvm_arch_vcpu_ioctl_run+0xd48/0xd58 [<000000000012fcaa>] kvm_vcpu_ioctl+0x372/0x690 [<00000000002f66f6>] do_vfs_ioctl+0x3be/0x510 [<00000000002f68ec>] SyS_ioctl+0xa4/0xb8 [<0000000000781c5e>] system_call+0xd6/0x264 [<000003ffa24fa06a>] 0x3ffa24fa06a Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com> Reviewed-by: Dominik Dingel <dingel@linux.vnet.ibm.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
2015-10-06kvm: fix zero length mmio searchingJason Wang1-2/+17
commit 8f4216c7d28976f7ec1b2bcbfa0a9f787133c45e upstream. Currently, if we had a zero length mmio eventfd assigned on KVM_MMIO_BUS. It will never be found by kvm_io_bus_cmp() since it always compares the kvm_io_range() with the length that guest wrote. This will cause e.g for vhost, kick will be trapped by qemu userspace instead of vhost. Fixing this by using zero length if an iodevice is zero length. Cc: Gleb Natapov <gleb@kernel.org> Cc: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Jason Wang <jasowang@redhat.com> Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
2015-10-06kvm: fix double free for fast mmio eventfdJason Wang1-18/+25
commit eefd6b06b17c5478e7c24bea6f64beaa2c431ca6 upstream. We register wildcard mmio eventfd on two buses, once for KVM_MMIO_BUS and once on KVM_FAST_MMIO_BUS but with a single iodev instance. This will lead to an issue: kvm_io_bus_destroy() knows nothing about the devices on two buses pointing to a single dev. Which will lead to double free[1] during exit. Fix this by allocating two instances of iodevs then registering one on KVM_MMIO_BUS and another on KVM_FAST_MMIO_BUS. CPU: 1 PID: 2894 Comm: qemu-system-x86 Not tainted 3.19.0-26-generic #28-Ubuntu Hardware name: LENOVO 2356BG6/2356BG6, BIOS G7ET96WW (2.56 ) 09/12/2013 task: ffff88009ae0c4b0 ti: ffff88020e7f0000 task.ti: ffff88020e7f0000 RIP: 0010:[<ffffffffc07e25d8>] [<ffffffffc07e25d8>] ioeventfd_release+0x28/0x60 [kvm] RSP: 0018:ffff88020e7f3bc8 EFLAGS: 00010292 RAX: dead000000200200 RBX: ffff8801ec19c900 RCX: 000000018200016d RDX: ffff8801ec19cf80 RSI: ffffea0008bf1d40 RDI: ffff8801ec19c900 RBP: ffff88020e7f3bd8 R08: 000000002fc75a01 R09: 000000018200016d R10: ffffffffc07df6ae R11: ffff88022fc75a98 R12: ffff88021e7cc000 R13: ffff88021e7cca48 R14: ffff88021e7cca50 R15: ffff8801ec19c880 FS: 00007fc1ee3e6700(0000) GS:ffff88023e240000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f8f389d8000 CR3: 000000023dc13000 CR4: 00000000001427e0 Stack: ffff88021e7cc000 0000000000000000 ffff88020e7f3be8 ffffffffc07e2622 ffff88020e7f3c38 ffffffffc07df69a ffff880232524160 ffff88020e792d80 0000000000000000 ffff880219b78c00 0000000000000008 ffff8802321686a8 Call Trace: [<ffffffffc07e2622>] ioeventfd_destructor+0x12/0x20 [kvm] [<ffffffffc07df69a>] kvm_put_kvm+0xca/0x210 [kvm] [<ffffffffc07df818>] kvm_vcpu_release+0x18/0x20 [kvm] [<ffffffff811f69f7>] __fput+0xe7/0x250 [<ffffffff811f6bae>] ____fput+0xe/0x10 [<ffffffff81093f04>] task_work_run+0xd4/0xf0 [<ffffffff81079358>] do_exit+0x368/0xa50 [<ffffffff81082c8f>] ? recalc_sigpending+0x1f/0x60 [<ffffffff81079ad5>] do_group_exit+0x45/0xb0 [<ffffffff81085c71>] get_signal+0x291/0x750 [<ffffffff810144d8>] do_signal+0x28/0xab0 [<ffffffff810f3a3b>] ? do_futex+0xdb/0x5d0 [<ffffffff810b7028>] ? __wake_up_locked_key+0x18/0x20 [<ffffffff810f3fa6>] ? SyS_futex+0x76/0x170 [<ffffffff81014fc9>] do_notify_resume+0x69/0xb0 [<ffffffff817cb9af>] int_signal+0x12/0x17 Code: 5d c3 90 0f 1f 44 00 00 55 48 89 e5 53 48 89 fb 48 83 ec 08 48 8b 7f 20 e8 06 d6 a5 c0 48 8b 43 08 48 8b 13 48 89 df 48 89 42 08 <48> 89 10 48 b8 00 01 10 00 00 RIP [<ffffffffc07e25d8>] ioeventfd_release+0x28/0x60 [kvm] RSP <ffff88020e7f3bc8> Cc: Gleb Natapov <gleb@kernel.org> Cc: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Jason Wang <jasowang@redhat.com> Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
2015-10-06kvm: factor out core eventfd assign/deassign logicJason Wang1-35/+50
commit 85da11ca587c8eb73993a1b503052391a73586f9 upstream. This patch factors out core eventfd assign/deassign logic and leaves the argument checking and bus index selection to callers. Cc: Gleb Natapov <gleb@kernel.org> Cc: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Jason Wang <jasowang@redhat.com> Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
2015-10-06kvm: don't try to register to KVM_FAST_MMIO_BUS for non mmio eventfdJason Wang1-2/+2
commit 8453fecbecae26edb3f278627376caab05d9a88d upstream. We only want zero length mmio eventfd to be registered on KVM_FAST_MMIO_BUS. So check this explicitly when arg->len is zero to make sure this. Cc: Gleb Natapov <gleb@kernel.org> Cc: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Jason Wang <jasowang@redhat.com> Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
2015-05-20arm/arm64: KVM: Keep elrsr/aisr in sync with software modelChristoffer Dall1-0/+10
commit ae705930fca6322600690df9dc1c7d0516145a93 upstream. There is an interesting bug in the vgic code, which manifests itself when the KVM run loop has a signal pending or needs a vmid generation rollover after having disabled interrupts but before actually switching to the guest. In this case, we flush the vgic as usual, but we sync back the vgic state and exit to userspace before entering the guest. The consequence is that we will be syncing the list registers back to the software model using the GICH_ELRSR and GICH_EISR from the last execution of the guest, potentially overwriting a list register containing an interrupt. This showed up during migration testing where we would capture a state where the VM has masked the arch timer but there were no interrupts, resulting in a hung test. Cc: Marc Zyngier <marc.zyngier@arm.com> Reported-by: Alex Bennee <alex.bennee@linaro.org> Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org> Signed-off-by: Alex Bennée <alex.bennee@linaro.org> Acked-by: Marc Zyngier <marc.zyngier@arm.com> Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org> Signed-off-by: Shannon Zhao <shannon.zhao@linaro.org> [ luis: backported to 3.16: used shannon's backport to 3.14 ] Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
2015-05-20arm/arm64: KVM: Require in-kernel vgic for the arch timersChristoffer Dall1-8/+22
commit 05971120fca43e0357789a14b3386bb56eef2201 upstream. It is curently possible to run a VM with architected timers support without creating an in-kernel VGIC, which will result in interrupts from the virtual timer going nowhere. To address this issue, move the architected timers initialization to the time when we run a VCPU for the first time, and then only initialize (and enable) the architected timers if we have a properly created and initialized in-kernel VGIC. When injecting interrupts from the virtual timer to the vgic, the current setup should ensure that this never calls an on-demand init of the VGIC, which is the only call path that could return an error from kvm_vgic_inject_irq(), so capture the return value and raise a warning if there's an error there. We also change the kvm_timer_init() function from returning an int to be a void function, since the function always succeeds. Reviewed-by: Marc Zyngier <marc.zyngier@arm.com> Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org> Signed-off-by: Shannon Zhao <shannon.zhao@linaro.org> [ luis: backported to 3.16: adjusted context ] Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
2015-05-20arm/arm64: KVM: vgic: Fix error code in kvm_vgic_create()Christoffer Dall1-4/+4
commit 6b50f54064a02b77a7b990032b80234fee59bcd6 upstream. If we detect another vCPU is running we just exit and return 0 as if we succesfully created the VGIC, but the VGIC wouldn't actual be created. This shouldn't break in-kernel behavior because the kernel will not observe the failed the attempt to create the VGIC, but userspace could be rightfully confused. Cc: Andre Przywara <andre.przywara@arm.com> Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Shannon Zhao <shannon.zhao@linaro.org> Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
2015-05-20arm/arm64: KVM: Fix set_clear_sgi_pend_reg offsetChristoffer Dall1-2/+2
commit 0fea6d7628ed6e25a9ee1b67edf7c859718d39e8 upstream. The sgi values calculated in read_set_clear_sgi_pend_reg() and write_set_clear_sgi_pend_reg() were horribly incorrectly multiplied by 4 with catastrophic results in that subfunctions ended up overwriting memory not allocated for the expected purpose. This showed up as bugs in kfree() and the kernel complaining a lot of you turn on memory debugging. This addresses: http://marc.info/?l=kvm&m=141164910007868&w=2 Reported-by: Shannon Zhao <zhaoshenglong@huawei.com> Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org> Signed-off-by: Shannon Zhao <shannon.zhao@linaro.org> Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
2015-05-20KVM: ARM: vgic: plug irq injection raceMarc Zyngier1-1/+2
commit 71afaba4a2e98bb7bdeba5078370ab43d46e67a1 upstream. As it stands, nothing prevents userspace from injecting an interrupt before the guest's GIC is actually initialized. This goes unnoticed so far (as everything is pretty much statically allocated), but ends up exploding in a spectacular way once we switch to a more dynamic allocation (the GIC data structure isn't there yet). The fix is to test for the "ready" flag in the VGIC distributor before trying to inject the interrupt. Note that in order to avoid breaking userspace, we have to ignore what is essentially an error. Signed-off-by: Marc Zyngier <marc.zyngier@arm.com> Acked-by: Christoffer Dall <christoffer.dall@linaro.org> Signed-off-by: Shannon Zhao <shannon.zhao@linaro.org> [ luis: backported to 3.16: used shannon's backport to 3.14 ] Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
2015-05-20KVM: vgic: return int instead of bool when checking I/O rangesWill Deacon1-1/+1
commit 1fa451bcc67fa921a04c5fac8dbcde7844d54512 upstream. vgic_ioaddr_overlap claims to return a bool, but in reality it returns an int. Shut sparse up by fixing the type signature. Cc: Christoffer Dall <christoffer.dall@linaro.org> Cc: Marc Zyngier <marc.zyngier@arm.com> Signed-off-by: Will Deacon <will.deacon@arm.com> Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org> Signed-off-by: Shannon Zhao <shannon.zhao@linaro.org> Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
2015-05-06KVM: use slowpath for cross page cached accessesRadim Krčmář1-2/+2
commit ca3f0874723fad81d0c701b63ae3a17a408d5f25 upstream. kvm_write_guest_cached() does not mark all written pages as dirty and code comments in kvm_gfn_to_hva_cache_init() talk about NULL memslot with cross page accesses. Fix all the easy way. The check is '<= 1' to have the same result for 'len = 0' cache anywhere in the page. (nr_pages_needed is 0 on page boundary.) Fixes: 8f964525a121 ("KVM: Allow cross page reads and writes from cached translations.") Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> Message-Id: <20150408121648.GA3519@potion.brq.redhat.com> Reviewed-by: Wanpeng Li <wanpeng.li@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
2015-04-21kvm: avoid page allocation failure in kvm_set_memory_region()Igor Mammedov1-7/+7
commit 744961341d472db6272ed9b42319a90f5a2aa7c4 upstream. KVM guest can fail to startup with following trace on host: qemu-system-x86: page allocation failure: order:4, mode:0x40d0 Call Trace: dump_stack+0x47/0x67 warn_alloc_failed+0xee/0x150 __alloc_pages_direct_compact+0x14a/0x150 __alloc_pages_nodemask+0x776/0xb80 alloc_kmem_pages+0x3a/0x110 kmalloc_order+0x13/0x50 kmemdup+0x1b/0x40 __kvm_set_memory_region+0x24a/0x9f0 [kvm] kvm_set_ioapic+0x130/0x130 [kvm] kvm_set_memory_region+0x21/0x40 [kvm] kvm_vm_ioctl+0x43f/0x750 [kvm] Failure happens when attempting to allocate pages for 'struct kvm_memslots', however it doesn't have to be present in physically contiguous (kmalloc-ed) address space, change allocation to kvm_kvzalloc() so that it will be vmalloc-ed when its size is more then a page. Signed-off-by: Igor Mammedov <imammedo@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
2015-04-21kvm: commonize allocation of the new memory slotsPaolo Bonzini1-17/+11
commit f2a81036516e2b97c07c49dd6d51d36bfa43593d upstream. The two kmemdup invocations can be unified. I find that the new placement of the comment makes it easier to see what happens. Reviewed-by: Igor Mammedov <imammedo@redhat.com> Reviewed-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
2015-04-21KVM: remove redundant assignments in __kvm_set_memory_regionChristian Borntraeger1-3/+0
commit f2a25160887e00434ce1361007009120e1fecbda upstream. __kvm_set_memory_region sets r to EINVAL very early. Doing it again is not necessary. The same is true later on, where r is assigned -ENOMEM twice. Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
2014-11-05kvm: fix excessive pages un-pinning in kvm_iommu_map error path.Quentin Casasnovas1-4/+4
commit 3d32e4dbe71374a6780eaf51d719d76f9a9bf22f upstream. The third parameter of kvm_unpin_pages() when called from kvm_iommu_map_pages() is wrong, it should be the number of pages to un-pin and not the page size. This error was facilitated with an inconsistent API: kvm_pin_pages() takes a size, but kvn_unpin_pages() takes a number of pages, so fix the problem by matching the two. This was introduced by commit 350b8bd ("kvm: iommu: fix the third parameter of kvm_iommu_put_pages (CVE-2014-3601)"), which fixes the lack of un-pinning for pages intended to be un-pinned (i.e. memory leak) but unfortunately potentially aggravated the number of pages we un-pin that should have stayed pinned. As far as I understand though, the same practical mitigations apply. This issue was found during review of Red Hat 6.6 patches to prepare Ksplice rebootless updates. Thanks to Vegard for his time on a late Friday evening to help me in understanding this code. Fixes: 350b8bd ("kvm: iommu: fix the third parameter of... (CVE-2014-3601)") Signed-off-by: Quentin Casasnovas <quentin.casasnovas@oracle.com> Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com> Signed-off-by: Jamie Iles <jamie.iles@oracle.com> Reviewed-by: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
2014-10-30kvm: don't take vcpu mutex for obviously invalid vcpu ioctlsDavid Matlack1-0/+4
commit 2ea75be3219571d0ec009ce20d9971e54af96e09 upstream. vcpu ioctls can hang the calling thread if issued while a vcpu is running. However, invalid ioctls can happen when userspace tries to probe the kind of file descriptors (e.g. isatty() calls ioctl(TCGETS)); in that case, we know the ioctl is going to be rejected as invalid anyway and we can fail before trying to take the vcpu mutex. This patch does not change functionality, it just makes invalid ioctls fail faster. Signed-off-by: David Matlack <dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-10-30KVM: do not bias the generation number in kvm_current_mmio_generationPaolo Bonzini1-0/+7
commit 00f034a12fdd81210d58116326d92780aac5c238 upstream. The next patch will give a meaning (a la seqcount) to the low bit of the generation number. Ensure that it matches between kvm->memslots->generation and kvm_current_mmio_generation(). Reviewed-by: David Matlack <dmatlack@google.com> Reviewed-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-10-30kvm: fix potentially corrupt mmio cacheDavid Matlack1-7/+16
commit ee3d1570b58677885b4552bce8217fda7b226a68 upstream. vcpu exits and memslot mutations can run concurrently as long as the vcpu does not aquire the slots mutex. Thus it is theoretically possible for memslots to change underneath a vcpu that is handling an exit. If we increment the memslot generation number again after synchronize_srcu_expedited(), vcpus can safely cache memslot generation without maintaining a single rcu_dereference through an entire vm exit. And much of the x86/kvm code does not maintain a single rcu_dereference of the current memslots during each exit. We can prevent the following case: vcpu (CPU 0) | thread (CPU 1) --------------------------------------------+-------------------------- 1 vm exit | 2 srcu_read_unlock(&kvm->srcu) | 3 decide to cache something based on | old memslots | 4 | change memslots | (increments generation) 5 | synchronize_srcu(&kvm->srcu); 6 retrieve generation # from new memslots | 7 tag cache with new memslot generation | 8 srcu_read_unlock(&kvm->srcu) | ... | <action based on cache occurs even | though the caching decision was based | on the old memslots> | ... | <action *continues* to occur until next | memslot generation change, which may | be never> | | By incrementing the generation after synchronizing with kvm->srcu readers, we ensure that the generation retrieved in (6) will become invalid soon after (8). Keeping the existing increment is not strictly necessary, but we do keep it and just move it for consistency from update_memslots to install_new_memslots. It invalidates old cached MMIOs immediately, instead of having to wait for the end of synchronize_srcu_expedited, which makes the code more clearly correct in case CPU 1 is preempted right after synchronize_srcu() returns. To avoid halving the generation space in SPTEs, always presume that the low bit of the generation is zero when reconstructing a generation number out of an SPTE. This effectively disables MMIO caching in SPTEs during the call to synchronize_srcu_expedited. Using the low bit this way is somewhat like a seqcount---where the protected thing is a cache, and instead of retrying we can simply punt if we observe the low bit to be 1. Signed-off-by: David Matlack <dmatlack@google.com> Reviewed-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Reviewed-by: David Matlack <dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-09-06kvm: iommu: fix the third parameter of kvm_iommu_put_pages (CVE-2014-3601)Michael S. Tsirkin1-9/+10
commit 350b8bdd689cd2ab2c67c8a86a0be86cfa0751a7 upstream. The third parameter of kvm_iommu_put_pages is wrong, It should be 'gfn - slot->base_gfn'. By making gfn very large, malicious guest or userspace can cause kvm to go to this error path, and subsequently to pass a huge value as size. Alternatively if gfn is small, then pages would be pinned but never unpinned, causing host memory leak and local DOS. Passing a reasonable but large value could be the most dangerous case, because it would unpin a page that should have stayed pinned, and thus allow the device to DMA into arbitrary memory. However, this cannot happen because of the condition that can trigger the error: - out of memory (where you can't allocate even a single page) should not be possible for the attacker to trigger - when exceeding the iommu's address space, guest pages after gfn will also exceed the iommu's address space, and inside kvm_iommu_put_pages() the iommu_iova_to_phys() will fail. The page thus would not be unpinned at all. Reported-by: Jack Morgenstein <jackm@mellanox.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-09-06KVM: x86: always exit on EOIs for interrupts listed in the IOAPIC redir tablePaolo Bonzini1-4/+3
commit 0f6c0a740b7d3e1f3697395922d674000f83d060 upstream. Currently, the EOI exit bitmap (used for APICv) does not include interrupts that are masked. However, this can cause a bug that manifests as an interrupt storm inside the guest. Alex Williamson reported the bug and is the one who really debugged this; I only wrote the patch. :) The scenario involves a multi-function PCI device with OHCI and EHCI USB functions and an audio function, all assigned to the guest, where both USB functions use legacy INTx interrupts. As soon as the guest boots, interrupts for these devices turn into an interrupt storm in the guest; the host does not see the interrupt storm. Basically the EOI path does not work, and the guest continues to see the interrupt over and over, even after it attempts to mask it at the APIC. The bug is only visible with older kernels (RHEL6.5, based on 2.6.32 with not many changes in the area of APIC/IOAPIC handling). Alex then tried forcing bit 59 (corresponding to the USB functions' IRQ) on in the eoi_exit_bitmap and TMR, and things then work. What happens is that VFIO asserts IRQ11, then KVM recomputes the EOI exit bitmap. It does not have set bit 59 because the RTE was masked, so the IOAPIC never sees the EOI and the interrupt continues to fire in the guest. My guess was that the guest is masking the interrupt in the redirection table in the interrupt routine, i.e. while the interrupt is set in a LAPIC's ISR, The simplest fix is to ignore the masking state, we would rather have an unnecessary exit rather than a missed IRQ ACK and anyway IOAPIC interrupts are not as performance-sensitive as for example MSIs. Alex tested this patch and it fixed his bug. [Thanks to Alex for his precise description of the problem and initial debugging effort. A lot of the text above is based on emails exchanged with him.] Reported-by: Alex Williamson <alex.williamson@redhat.com> Tested-by: Alex Williamson <alex.williamson@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-07-30kvm: arm64: vgic: fix hyp panic with 64k pages on juno platformWill Deacon1-4/+20
If the physical address of GICV isn't page-aligned, then we end up creating a stage-2 mapping of the page containing it, which causes us to map neighbouring memory locations directly into the guest. As an example, consider a platform with GICV at physical 0x2c02f000 running a 64k-page host kernel. If qemu maps this into the guest at 0x80010000, then guest physical addresses 0x80010000 - 0x8001efff will map host physical region 0x2c020000 - 0x2c02efff. Accesses to these physical regions may cause UNPREDICTABLE behaviour, for example, on the Juno platform this will cause an SError exception to EL3, which brings down the entire physical CPU resulting in RCU stalls / HYP panics / host crashing / wasted weeks of debugging. SBSA recommends that systems alias the 4k GICV across the bounding 64k region, in which case GICV physical could be described as 0x2c020000 in the above scenario. This patch fixes the problem by failing the vgic probe if the physical base address or the size of GICV aren't page-aligned. Note that this generated a warning in dmesg about freeing enabled IRQs, so I had to move the IRQ enabling later in the probe. Cc: Christoffer Dall <christoffer.dall@linaro.org> Cc: Marc Zyngier <marc.zyngier@arm.com> Cc: Gleb Natapov <gleb@kernel.org> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Joel Schopp <joel.schopp@amd.com> Cc: Don Dutile <ddutile@redhat.com> Acked-by: Peter Maydell <peter.maydell@linaro.org> Acked-by: Joel Schopp <joel.schopp@amd.com> Acked-by: Marc Zyngier <marc.zyngier@arm.com> Signed-off-by: Will Deacon <will.deacon@arm.com> Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
2014-06-13Merge branch 'sched-core-for-linus' of ↵Linus Torvalds1-2/+2
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull more scheduler updates from Ingo Molnar: "Second round of scheduler changes: - try-to-wakeup and IPI reduction speedups, from Andy Lutomirski - continued power scheduling cleanups and refactorings, from Nicolas Pitre - misc fixes and enhancements" * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/deadline: Delete extraneous extern for to_ratio() sched/idle: Optimize try-to-wake-up IPI sched/idle: Simplify wake_up_idle_cpu() sched/idle: Clear polling before descheduling the idle thread sched, trace: Add a tracepoint for IPI-less remote wakeups cpuidle: Set polling in poll_idle sched: Remove redundant assignment to "rt_rq" in update_curr_rt(...) sched: Rename capacity related flags sched: Final power vs. capacity cleanups sched: Remove remaining dubious usage of "power" sched: Let 'struct sched_group_power' care about CPU capacity sched/fair: Disambiguate existing/remaining "capacity" usage sched/fair: Change "has_capacity" to "has_free_capacity" sched/fair: Remove "power" from 'struct numa_stats' sched: Fix signedness bug in yield_to() sched/fair: Use time_after() in record_wakee() sched/balancing: Reduce the rate of needless idle load balancing sched/fair: Fix unlocked reads of some cfs_b->quota/period
2014-06-05sched: Fix signedness bug in yield_to()Dan Carpenter1-2/+2
yield_to() is supposed to return -ESRCH if there is no task to yield to, but because the type is bool that is the same as returning true. The only place I see which cares is kvm_vcpu_on_spin(). Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Raghavendra <raghavendra.kt@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Gleb Natapov <gleb@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: kvm@vger.kernel.org Link: http://lkml.kernel.org/r/20140523102042.GA7267@mwanda Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-06-04Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm into nextLinus Torvalds5-50/+95
Pull KVM updates from Paolo Bonzini: "At over 200 commits, covering almost all supported architectures, this was a pretty active cycle for KVM. Changes include: - a lot of s390 changes: optimizations, support for migration, GDB support and more - ARM changes are pretty small: support for the PSCI 0.2 hypercall interface on both the guest and the host (the latter acked by Catalin) - initial POWER8 and little-endian host support - support for running u-boot on embedded POWER targets - pretty large changes to MIPS too, completing the userspace interface and improving the handling of virtualized timer hardware - for x86, a larger set of changes is scheduled for 3.17. Still, we have a few emulator bugfixes and support for running nested fully-virtualized Xen guests (para-virtualized Xen guests have always worked). And some optimizations too. The only missing architecture here is ia64. It's not a coincidence that support for KVM on ia64 is scheduled for removal in 3.17" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (203 commits) KVM: add missing cleanup_srcu_struct KVM: PPC: Book3S PR: Rework SLB switching code KVM: PPC: Book3S PR: Use SLB entry 0 KVM: PPC: Book3S HV: Fix machine check delivery to guest KVM: PPC: Book3S HV: Work around POWER8 performance monitor bugs KVM: PPC: Book3S HV: Make sure we don't miss dirty pages KVM: PPC: Book3S HV: Fix dirty map for hugepages KVM: PPC: Book3S HV: Put huge-page HPTEs in rmap chain for base address KVM: PPC: Book3S HV: Fix check for running inside guest in global_invalidates() KVM: PPC: Book3S: Move KVM_REG_PPC_WORT to an unused register number KVM: PPC: Book3S: Add ONE_REG register names that were missed KVM: PPC: Add CAP to indicate hcall fixes KVM: PPC: MPIC: Reset IRQ source private members KVM: PPC: Graciously fail broken LE hypercalls PPC: ePAPR: Fix hypercall on LE guest KVM: PPC: BOOK3S: Remove open coded make_dsisr in alignment handler KVM: PPC: BOOK3S: Always use the saved DAR value PPC: KVM: Make NX bit available with magic page KVM: PPC: Disable NX for old magic page using guests KVM: PPC: BOOK3S: HV: Add mixed page-size support for guest ...
2014-06-03KVM: add missing cleanup_srcu_structPaolo Bonzini1-0/+1
Reported-by: hrg <hrgstephen@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-05-05kvm/irqchip: Speed up KVM_SET_GSI_ROUTINGChristian Borntraeger4-39/+50
When starting lots of dataplane devices the bootup takes very long on Christian's s390 with irqfd patches. With larger setups he is even able to trigger some timeouts in some components. Turns out that the KVM_SET_GSI_ROUTING ioctl takes very long (strace claims up to 0.1 sec) when having multiple CPUs. This is caused by the synchronize_rcu and the HZ=100 of s390. By changing the code to use a private srcu we can speed things up. This patch reduces the boot time till mounting root from 8 to 2 seconds on my s390 guest with 100 disks. Uses of hlist_for_each_entry_rcu, hlist_add_head_rcu, hlist_del_init_rcu are fine because they do not have lockdep checks (hlist_for_each_entry_rcu uses rcu_dereference_raw rather than rcu_dereference, and write-sides do not do rcu lockdep at all). Note that we're hardly relying on the "sleepable" part of srcu. We just want SRCU's faster detection of grace periods. Testing was done by Andrew Theurer using netperf tests STREAM, MAERTS and RR. The difference between results "before" and "after" the patch has mean -0.2% and standard deviation 0.6%. Using a paired t-test on the data points says that there is a 2.5% probability that the patch is the cause of the performance difference (rather than a random fluctuation). (Restricting the t-test to RR, which is the most likely to be affected, changes the numbers to respectively -0.3% mean, 0.7% stdev, and 8% probability that the numbers actually say something about the patch. The probability increases mostly because there are fewer data points). Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Michael S. Tsirkin <mst@redhat.com> Tested-by: Christian Borntraeger <borntraeger@de.ibm.com> # s390 Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-04-30Merge tag 'kvm-arm-for-3.15-rc4' of ↵Paolo Bonzini1-7/+8
git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into kvm-master First round of KVM/ARM Fixes for 3.15 Includes vgic fixes, a possible kernel corruption bug due to misalignment of pages and disabling of KVM in KConfig on big-endian systems, because the last one breaks the build.
2014-04-29KVM: ARM: vgic: Fix the overlap check action about setting the GICD & GICC ↵Haibin Wang1-2/+3
base address. Currently below check in vgic_ioaddr_overlap will always succeed, because the vgic dist base and vgic cpu base are still kept UNDEF after initialization. The code as follows will be return forever. if (IS_VGIC_ADDR_UNDEF(dist) || IS_VGIC_ADDR_UNDEF(cpu)) return 0; So, before invoking the vgic_ioaddr_overlap, it needs to set the corresponding base address firstly. Signed-off-by: Haibin Wang <wanghaibin.wang@huawei.com> Acked-by: Marc Zyngier <marc.zyngier@arm.com> Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
2014-04-28KVM: async_pf: change async_pf_execute() to use get_user_pages(tsk => NULL)Oleg Nesterov1-1/+1
async_pf_execute() passes tsk == current to gup(), this is doesn't hurt but unnecessary and misleading. "tsk" is only used to account the number of faults and current is the random workqueue thread. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Suggested-by: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-04-28KVM: async_pf: kill the unnecessary use_mm/unuse_mm async_pf_execute()Oleg Nesterov1-2/+0
async_pf_execute() has no reasons to adopt apf->mm, gup(current, mm) should work just fine even if current has another or NULL ->mm. Recently kvm_async_page_present_sync() was added insedie the "use_mm" section, but it seems that it doesn't need current->mm too. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Reviewed-by: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-04-28KVM: arm/arm64: vgic: fix GICD_ICFGR register accessesAndre Przywara1-5/+4
Since KVM internally represents the ICFGR registers by stuffing two of them into one word, the offset for accessing the internal representation and the one for the MMIO based access are different. So keep the original offset around, but adjust the internal array offset by one bit. Reported-by: Haibin Wang <wanghaibin.wang@huawei.com> Signed-off-by: Andre Przywara <andre.przywara@arm.com> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com> Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
2014-04-28KVM: async_pf: mm->mm_users can not pin apf->mmOleg Nesterov1-4/+4
get_user_pages(mm) is simply wrong if mm->mm_users == 0 and exit_mmap/etc was already called (or is in progress), mm->mm_count can only pin mm->pgd and mm_struct itself. Change kvm_setup_async_pf/async_pf_execute to inc/dec mm->mm_users. kvm_create_vm/kvm_destroy_vm play with ->mm_count too but this case looks fine at first glance, it seems that this ->mm is only used to verify that current->mm == kvm->mm. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-04-28KVM: ARM: vgic: Fix sgi dispatch problemHaibin Wang1-0/+1
When dispatch SGI(mode == 0), that is the vcpu of VM should send sgi to the cpu which the target_cpus list. So, there must add the "break" to branch of case 0. Cc: <stable@vger.kernel.org> # 3.10+ Signed-off-by: Haibin Wang <wanghaibin.wang@huawei.com> Acked-by: Marc Zyngier <marc.zyngier@arm.com> Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
2014-04-28kvm: Use pci_enable_msix_exact() instead of pci_enable_msix()Alexander Gordeev1-1/+2
As result of deprecation of MSI-X/MSI enablement functions pci_enable_msix() and pci_enable_msi_block() all drivers using these two interfaces need to be updated to use the new pci_enable_msi_range() or pci_enable_msi_exact() and pci_enable_msix_range() or pci_enable_msix_exact() interfaces. Signed-off-by: Alexander Gordeev <agordeev@redhat.com> Cc: Gleb Natapov <gleb@kernel.org> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: kvm@vger.kernel.org Cc: linux-pci@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>