path: root/drivers/vhost/vhost.c
Age | Commit message | Author | Files | Lines
2021-08-26 | vhost: Fix the calculation in vhost_overflow() | Xie Yongji | 1 | -2/+8
[ Upstream commit f7ad318ea0ad58ebe0e595e59aed270bb643b29b ]
This fixes the incorrect integer-overflow calculation when the last address of the iova range is 0xffffffff.
Fixes: ec33d031a14b ("vhost: detect 32 bit integer wrap around")
Reported-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Link: https://lore.kernel.org/r/20210728130756.97-2-xieyongji@bytedance.com
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
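The fixed check must avoid computing uaddr + size directly, since that sum itself wraps when the range ends at the very top of the address space (last address 0xffffffff on a 32-bit host). A minimal standalone sketch of the overflow-safe form, with illustrative names:

    #include <stdbool.h>
    #include <stdint.h>
    #include <limits.h>

    /* Sketch: reject [uaddr, uaddr + size) ranges that don't fit in an
     * unsigned long (32 bits on 32-bit hosts). Rearranged as
     * "uaddr > ULONG_MAX - size + 1" so the arithmetic itself cannot
     * overflow, even when the range ends exactly at ULONG_MAX. */
    static bool range_overflows(uint64_t uaddr, uint64_t size)
    {
        if (uaddr > ULONG_MAX || size > ULONG_MAX)
            return true;
        if (!size)
            return false;
        return uaddr > ULONG_MAX - size + 1;
    }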
2021-04-07 | vhost: Fix vhost_vq_reset() | Laurent Vivier | 1 | -1/+1
[ Upstream commit beb691e69f4dec7bfe8b81b509848acfd1f0dbf9 ]
vhost_reset_is_le() is vhost_init_is_le(), and in the case of cross-endian legacy, vhost_init_is_le() depends on vq->user_be. vq->user_be is set by vhost_disable_cross_endian(). But in vhost_vq_reset(), we have:

    vhost_reset_is_le(vq);
    vhost_disable_cross_endian(vq);

So user_be is used before being set. To fix that, reverse the order of the two lines, as there is no other dependency between them.
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Link: https://lore.kernel.org/r/20210312140913.788592-1-lvivier@redhat.com
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-09-16 | vhost: make sure log_num < in_num | yongduan | 1 | -2/+2
commit 060423bfdee3f8bc6e2c1bac97de24d5415e2bc4 upstream.
The code assumes log_num < in_num everywhere, and that is true as long as in_num is incremented by the descriptor iov count and log_num by 1. However, this breaks if there is a zero-sized descriptor. As a result, if a malicious guest creates a vring desc with desc.len = 0, it may cause the host kernel to crash by overflowing the log array. This bug can be triggered during VM migration. There is no need to log when desc.len = 0, so just don't increment log_num in this case.
Fixes: 3a4d5c94e959 ("vhost_net: a kernel-level virtio server")
Cc: stable@vger.kernel.org
Reviewed-by: Lidong Chen <lidongchen@tencent.com>
Signed-off-by: ruippan <ruippan@tencent.com>
Signed-off-by: yongduan <yongduan@tencent.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Tyler Hicks <tyhicks@canonical.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
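Schematically, the hardened descriptor walk only records a dirty-log entry when the descriptor actually translated to at least one iovec; the names below are illustrative, not the exact kernel code:

    /* ret = number of iovecs this descriptor translated to; a guest
     * supplying desc.len = 0 yields ret == 0. Logging only when ret is
     * non-zero keeps the invariant log_num < in_num. */
    if (access == VHOST_ACCESS_WO) {
            in_num += ret;
            if (log && ret) {        /* was: if (log) */
                    log[log_num].addr = desc_addr;
                    log[log_num].len  = desc_len;
                    ++log_num;
            }
    }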
2019-07-10 | vhost: introduce vhost_exceeds_weight() | Jason Wang | 1 | -1/+19
commit e82b9b0727ff6d665fff2d326162b460dded554d upstream.
We used to have vhost_exceeds_weight() for vhost-net to:
- prevent the vhost kthread from hogging the cpu
- balance the time spent between TX and RX
This function could be useful for vsock and scsi as well, so move it to vhost.c. A device must specify a weight, which counts the number of requests; it can also specify a byte_weight, which counts the number of bytes that have been processed.
Signed-off-by: Jason Wang <jasowang@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Balbir Singh <sblbir@amzn.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
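A minimal sketch of the weight idea (an assumed shape; per the changelog, the real helper also requeues the work so processing resumes later instead of monopolizing the kthread):

    /* Illustrative only: stop the service loop once it has handled
     * either too many requests or too many bytes in one run. */
    static bool exceeds_weight(int pkts, int total_len,
                               int weight, int byte_weight)
    {
            return pkts >= weight || total_len >= byte_weight;
    }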
2019-04-27 | vhost: reject zero size iova range | Jason Wang | 1 | -1/+5
[ Upstream commit 813dbeb656d6c90266f251d8bd2b02d445afa63f ]
We used to accept a zero-size iova range, which will lead to an infinite loop in translate_desc(). Fix this by failing the request in this case.
Reported-by: syzbot+d21e6e297322a900c128@syzkaller.appspotmail.com
Fixes: 6b1e6cc7 ("vhost: new device IOTLB API")
Signed-off-by: Jason Wang <jasowang@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-02-23 | vhost: correctly check the return value of translate_desc() in log_used() | Jason Wang | 1 | -1/+1
[ Upstream commit 816db7663565cd23f74ed3d5c9240522e3fb0dda ]
On failure, translate_desc() returns a negative value; otherwise it returns the number of iovs. So we should fail when the return value is negative instead of blindly checking it against zero. Detected by CoverityScan, CID# 1442593: Control flow issues (DEADCODE).
Fixes: cc5e71075947 ("vhost: log dirty page correctly")
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Reported-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
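The corrected pattern, sketched with assumed variable names:

    /* translate_desc() returns an iovec count (>= 0) on success and a
     * negative errno on failure, so only a negative value is an error. */
    ret = translate_desc(vq, addr, len, iov, iov_count, VHOST_ACCESS_WO);
    if (ret < 0)            /* not: if (!ret) */
            return ret;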
2019-01-31 | vhost: log dirty page correctly | Jason Wang | 1 | -14/+83
[ Upstream commit cc5e710759470bc7f3c61d11fd54586f15fdbdf4 ]
The vhost dirty page logging API is designed to sync through GPA, but we try to log GIOVA when device IOTLB is enabled. This is wrong and may lead to missing data after migration. To solve this issue, when logging with device IOTLB enabled, we will:
1) reuse the device IOTLB translation result of the GIOVA->HVA mapping to get the HVA; for a writable descriptor, get the HVA through the iovec; for a used ring update, translate its GIOVA to an HVA
2) traverse the GPA->HVA mapping to get the possible GPA and log through GPA; note that this reverse mapping is not guaranteed to be unique, so we should log each possible GPA in this case
This fixes the failure of scp to the guest during migration. In -next, we will probably support passing GIOVA->GPA instead of GIOVA->HVA.
Fixes: 6b1e6cc7855b ("vhost: new device IOTLB API")
Reported-by: Jintack Lim <jintack@cs.columbia.edu>
Cc: Jintack Lim <jintack@cs.columbia.edu>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
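The reverse lookup in step 2) amounts to scanning the GPA->HVA memory table for every region containing the HVA; a standalone sketch with assumed field names:

    #include <stdint.h>

    struct mem_region { uint64_t guest_phys_addr, userspace_addr, size; };

    void log_dirty_gpa(uint64_t gpa, uint64_t len);  /* hypothetical hook */

    /* Log every guest-physical address this HVA may correspond to:
     * the HVA->GPA reverse mapping is not guaranteed to be unique. */
    static void log_write_hva(const struct mem_region *r, int nregions,
                              uint64_t hva, uint64_t len)
    {
        for (int i = 0; i < nregions; i++, r++)
            if (hva >= r->userspace_addr &&
                hva - r->userspace_addr < r->size)
                log_dirty_gpa(r->guest_phys_addr +
                              (hva - r->userspace_addr), len);
    }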
2019-01-09 | vhost: make sure used idx is seen before log in vhost_add_used_n() | Jason Wang | 1 | -0/+2
[ Upstream commit 841df922417eb82c835e93d4b93eb6a68c99d599 ]
We miss a write barrier that guarantees the used idx is updated and seen before the log. This lets userspace sync and copy the used ring before the used idx is updated. Fix this by adding a barrier before log_write().
Fixes: 8dd014adfea6f ("vhost-net: mergeable buffers support")
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
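In kernel terms the fix is a single smp_wmb() between the two stores; the surrounding code is paraphrased here:

    /* Paraphrased ordering sketch, not the exact kernel code. */
    vq->used->idx = new_used_idx;          /* store 1: publish used idx */
    smp_wmb();                             /* order store 1 before store 2 */
    log_write(vq->log_base, used_idx_addr, /* store 2: mark page dirty,   */
              sizeof(vq->used->idx));      /* so userspace copies a ring
                                              whose idx is already current */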
2018-11-04 | vhost: Fix Spectre V1 vulnerability | Jason Wang | 1 | -0/+2
[ Upstream commit ff002269a4ee9c769dbf9365acef633ebcbd6cbe ]
The idx in vhost_vring_ioctl() was controlled by userspace, hence a potential exploitation of the Spectre variant 1 vulnerability. Fix this by sanitizing idx before using it to index d->vqs.
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
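The usual kernel idiom for sanitizing such an index clamps it with array_index_nospec() after the bounds check, so it cannot be used out of bounds even speculatively; a sketch of the pattern (error code assumed):

    #include <linux/nospec.h>

    /* idx came from a userspace-controlled ioctl argument. */
    if (idx >= d->nvqs)
            return -ENOBUFS;
    idx = array_index_nospec(idx, d->nvqs);  /* clamp before use */
    vq = d->vqs[idx];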
2018-09-15 | vhost: correctly check the iova range when waking virtqueue | Jason Wang | 1 | -1/+1
[ Upstream commit 2d66f997f0545c8f7fc5cf0b49af1decb35170e7 ]
We don't wake up the virtqueue if the first byte of the pending iova range is the last byte of the range we just got updated. This leads a virtqueue to wait for an IOTLB update forever. Fix this by correcting the check and waking up the virtqueue in this case.
Fixes: 6b1e6cc7855b ("vhost: new device IOTLB API")
Reported-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Tested-by: Peter Xu <peterx@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
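Because both ranges are inclusive, the overlap test must use >=, not >, or a one-byte overlap at the boundary is missed; a standalone illustration:

    #include <stdbool.h>
    #include <stdint.h>

    /* Inclusive ranges [a_start, a_last] and [b_start, b_last] overlap
     * iff each starts no later than the other ends. Writing '>' instead
     * of '>=' in the second comparison drops exactly the case where the
     * updated range ends on the byte the pending range starts at. */
    static bool iova_ranges_overlap(uint64_t a_start, uint64_t a_last,
                                    uint64_t b_start, uint64_t b_last)
    {
        return a_start <= b_last && a_last >= b_start;
    }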
2018-08-22 | vhost: reset metadata cache when initializing new IOTLB | Jason Wang | 1 | -3/+6
[ Upstream commit b13f9c6364373a1b9f71e9846dc4fb199296f926 ]
We need to reset the metadata cache during new IOTLB initialization; otherwise stale pointers to the previous IOTLB may still be accessed, which will lead to a use-after-free.
Reported-by: syzbot+c51e6736a1bf614b3272@syzkaller.appspotmail.com
Fixes: f88949138058 ("vhost: introduce O(1) vq metadata cache")
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-06-26 | vhost: fix info leak due to uninitialized memory | Michael S. Tsirkin | 1 | -0/+3
commit 670ae9caaca467ea1bfd325cb2a5c98ba87f94ad upstream.
struct vhost_msg within struct vhost_msg_node is copied to userspace. Unfortunately it turns out on 64 bit systems vhost_msg has padding after type which gcc doesn't initialize, leaking 4 uninitialized bytes to userspace. This padding also unfortunately means 32 bit users of this interface are broken on a 64 bit kernel which will need to be fixed separately.
Fixes: CVE-2018-1118
Cc: stable@vger.kernel.org
Reported-by: Kevin Easton <kevin@guarana.org>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reported-by: syzbot+87cfa083e727a224754b@syzkaller.appspotmail.com
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
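Field-wise initialization leaves compiler-inserted padding holding stale bytes, so a struct headed to userspace has to be cleared wholesale first; a standalone sketch of the problem and the fix:

    #include <string.h>
    #include <stdint.h>

    struct msg {
        uint32_t type;     /* 4 bytes ...                               */
                           /* ... then 4 bytes of invisible padding on
                              ABIs that 8-byte-align the next member    */
        uint64_t payload;
    };

    static void prepare_msg(struct msg *m, uint32_t type, uint64_t payload)
    {
        memset(m, 0, sizeof(*m));  /* zero padding too, before any copy
                                      to userspace; assigning the fields
                                      alone leaves the hole uninitialized */
        m->type = type;
        m->payload = payload;
    }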
2018-06-11 | vhost: synchronize IOTLB message with dev cleanup | Jason Wang | 1 | -0/+3
[ Upstream commit 1b15ad683ab42a203f98b67045b40720e99d0e9a ]
DaeRyong Jeong reports a race between vhost_dev_cleanup() and vhost_process_iotlb_msg(). Thread interleaving (in both the VHOST_IOTLB_UPDATE and VHOST_IOTLB_INVALIDATE cases):

    CPU0 (vhost_process_iotlb_msg)      CPU1 (vhost_dev_cleanup)
    =====                               =====
                                        vhost_umem_clean(dev->iotlb);
    if (!dev->iotlb) {
            ret = -EFAULT;
            break;
    }
                                        dev->iotlb = NULL;

The reason is that we don't synchronize between them; fix this by protecting vhost_process_iotlb_msg() with the dev mutex.
Reported-by: DaeRyong Jeong <threeearcat@gmail.com>
Fixes: 6b1e6cc7855b0 ("vhost: new device IOTLB API")
Signed-off-by: Jason Wang <jasowang@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
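The fix serializes the message handler against cleanup with the device mutex; schematically (paraphrased shape, not the exact diff):

    mutex_lock(&dev->mutex);   /* cleanup can no longer free and NULL
                                  dev->iotlb between our check and use */
    if (!dev->iotlb) {
            ret = -EFAULT;
            goto out;
    }
    /* ... handle VHOST_IOTLB_UPDATE / VHOST_IOTLB_INVALIDATE ... */
    out:
    mutex_unlock(&dev->mutex);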
2018-04-19 | vhost: Fix vhost_copy_to_user() | Eric Auger | 1 | -1/+1
[ Upstream commit 7ced6c98c7ab7a1f6743931e28671b833af79b1e ]
vhost_copy_to_user is used to copy vring used elements to userspace. We should use VHOST_ADDR_USED instead of VHOST_ADDR_DESC.
Fixes: f88949138058 ("vhost: introduce O(1) vq metadata cache")
Signed-off-by: Eric Auger <eric.auger@redhat.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-04-19 | vhost: fix vhost_vq_access_ok() log check | Stefan Hajnoczi | 1 | -3/+5
[ Upstream commit d14d2b78090c7de0557362b26a4ca591aa6a9faa ]
Commit d65026c6c62e7d9616c8ceb5a53b68bcdc050525 ("vhost: validate log when IOTLB is enabled") introduced a regression. The logic was originally:

    if (vq->iotlb)
        return 1;
    return A && B;

After the patch the short-circuit logic for A was inverted:

    if (A || vq->iotlb)
        return A;
    return B;

This patch fixes the regression by rewriting the checks in the obvious way, no longer returning A when vq->iotlb is non-NULL (which is hard to understand).
Reported-by: syzbot+65a84dde0214b0387ccd@syzkaller.appspotmail.com
Cc: Jason Wang <jasowang@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-04-12 | vhost: validate log when IOTLB is enabled | Jason Wang | 1 | -8/+6
[ Upstream commit d65026c6c62e7d9616c8ceb5a53b68bcdc050525 ]
The vq log_base is the userspace address of a bitmap and has nothing to do with the IOTLB, so it needs to be validated unconditionally. Otherwise we may use 0 as log_base, which may pin pages and lead to unexpected results (e.g. triggering a BUG_ON() in set_bit_to_user()).
Fixes: 6b1e6cc7855b0 ("vhost: new device IOTLB API")
Reported-by: syzbot+6304bf97ef436580fede@syzkaller.appspotmail.com
Signed-off-by: Jason Wang <jasowang@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-04-12 | vhost: correctly remove wait queue during poll failure | Jason Wang | 1 | -2/+1
[ Upstream commit dc6455a71c7fc5117977e197f67f71b49f27baba ]
We tried to remove the vq poll from the wait queue without first checking whether it was in a list, which can lead to a double free. Fix this by switching to vhost_poll_stop(), which zeroes poll->wqh after removing the poll from the waitqueue, making sure it won't be freed twice.
Cc: Darren Kenny <darren.kenny@oracle.com>
Reported-by: syzbot+c0272972b01b872e604a@syzkaller.appspotmail.com
Fixes: 2b8b328b61c79 ("vhost_net: handle polling errors when setting backend")
Signed-off-by: Jason Wang <jasowang@redhat.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-02-25 | vhost: use mutex_lock_nested() in vhost_dev_lock_vqs() | Jason Wang | 1 | -1/+1
commit e9cb4239134c860e5f92c75bf5321bd377bb505b upstream.
We used to call mutex_lock() in vhost_dev_lock_vqs(), which tries to hold the mutexes of all virtqueues. This may confuse lockdep into reporting a possible deadlock because it sees us trying to hold locks belonging to the same class. Switch to mutex_lock_nested() to avoid the false positive.
Fixes: 6b1e6cc7855b0 ("vhost: new device IOTLB API")
Reported-by: syzbot+dbb7c1161485e61b0241@syzkaller.appspotmail.com
Signed-off-by: Jason Wang <jasowang@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
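The idiom gives each vq mutex its own lockdep subclass; a paraphrased sketch of the locking loop:

    /* Taking N mutexes of the same lock class in a loop. Passing the
     * loop index as the lockdep subclass marks these as distinct,
     * ordered acquisitions, silencing the false deadlock report. */
    for (i = 0; i < d->nvqs; ++i)
            mutex_lock_nested(&d->vqs[i]->mutex, i);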
2017-09-09 | lib/interval_tree: fast overlap detection | Davidlohr Bueso | 1 | -1/+1
Allow interval trees to quickly check for overlaps to avoid unnecessary tree lookups in interval_tree_iter_first(). As of this patch, all interval tree flavors will require using an 'rb_root_cached' so that we can have the leftmost node easily available. While most users will make use of this feature, those with special functions (in addition to the generic insert, delete, search calls) will avoid using the cached option as they can do funky things with insertions -- for example, vma_interval_tree_insert_after().
[jglisse@redhat.com: fix deadlock from typo vm_lock_anon_vma()]
Link: http://lkml.kernel.org/r/20170808225719.20723-1-jglisse@redhat.com
Link: http://lkml.kernel.org/r/20170719014603.19029-12-dave@stgolabs.net
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Acked-by: Christian König <christian.koenig@amd.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Doug Ledford <dledford@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Cc: David Airlie <airlied@linux.ie>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Christian Benvenuti <benve@cisco.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-30 | Revert "vhost: cache used event for better performance" | Jason Wang | 1 | -22/+6
This reverts commit 809ecb9bca6a9424ccd392d67e368160f8b76c92, since it was reported to break vhost_net. We want to cache the used event and use it to check for notification. The assumption was that the guest won't move the event idx back, but this can in fact happen when the 16 bit index wraps around after 64K entries.
Signed-off-by: Jason Wang <jasowang@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
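That wrap-around is why the virtio ring's event-index comparison is written entirely in 16-bit modular arithmetic; the canonical check, mirroring vring_need_event() from the virtio_ring uapi header:

    #include <stdint.h>

    /* All subtraction is mod 2^16, so the result stays correct while
     * new_idx wraps past 65535, as long as fewer than 2^16 entries are
     * processed between checks. */
    static inline int need_event(uint16_t event_idx, uint16_t new_idx,
                                 uint16_t old_idx)
    {
        return (uint16_t)(new_idx - event_idx - 1) <
               (uint16_t)(new_idx - old_idx);
    }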
2017-06-20 | sched/wait: Rename wait_queue_t => wait_queue_entry_t | Ingo Molnar | 1 | -1/+1
Rename:

    wait_queue_t => wait_queue_entry_t

'wait_queue_t' was always a slight misnomer: its name implies that it's a "queue", but in reality it's a queue *entry*. The 'real' queue is the wait queue head, which had to carry the name. Start sorting this out by renaming it to 'wait_queue_entry_t'. This also allows the real structure name 'struct __wait_queue' to lose its double underscore and become 'struct wait_queue_entry', which is the more canonical nomenclature for such data types.
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-05-09 | mm: support __GFP_REPEAT in kvmalloc_node for >32kB | Michal Hocko | 1 | -12/+3
vhost code uses __GFP_REPEAT when allocating vhost_virtqueue resp. vhost_vsock because it would really like to prefer kmalloc to the vmalloc fallback - see 23cc5a991c7a ("vhost-net: extend device allocation to vmalloc") for more context. Michael Tsirkin has also noted:
 "__GFP_REPEAT overhead is during allocation time. Using vmalloc means all accesses are slowed down. Allocation is not on data path, accesses are."
The same applies to other vhost_kvzalloc users. Let's teach kvmalloc_node to handle __GFP_REPEAT properly. There are two things to be careful about. First, we should prevent the OOM killer and so have to involve __GFP_NORETRY by default; secondly, we override __GFP_REPEAT for !costly order requests, as __GFP_REPEAT is ignored for !costly orders. Supporting __GFP_REPEAT-like semantics for !costly requests is possible but would require changes in the page allocator; that is out of scope of this patch. This patch shouldn't introduce any functional change.
Link: http://lkml.kernel.org/r/20170306103032.2540-3-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Cc: David Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
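The kmalloc-with-vmalloc-fallback shape under discussion, as a hedged sketch (flag choice paraphrased from the changelog; __GFP_REPEAT itself was later renamed __GFP_RETRY_MAYFAIL upstream):

    /* Try the physically contiguous allocator first, without waking the
     * OOM killer or retrying hard, then fall back to vmalloc. This is
     * the pattern kvmalloc_node() encapsulates. */
    void *buf = kmalloc(size, GFP_KERNEL | __GFP_NOWARN | __GFP_NORETRY);
    if (!buf)
            buf = vmalloc(size);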
2017-03-03 | Merge branch 'WIP.sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip | Linus Torvalds | 1 | -0/+2
Pull sched.h split-up from Ingo Molnar:
 "The point of these changes is to significantly reduce the <linux/sched.h> header footprint, to speed up the kernel build and to have a cleaner header structure. After these changes the new <linux/sched.h>'s typical preprocessed size goes down from a previous ~0.68 MB (~22K lines) to ~0.45 MB (~15K lines), which is around 40% faster to build on typical configs.

  Not much changed from the last version (-v2) posted three weeks ago: I eliminated quirks, backmerged fixes plus I rebased it to an upstream SHA1 from yesterday that includes most changes queued up in -next plus all sched.h changes that were pending from Andrew.

  I've re-tested the series both on x86 and on cross-arch defconfigs, and did a bisectability test at a number of random points. I tried to test as many build configurations as possible, but some build breakage is probably still left - but it should be mostly limited to architectures that have no cross-compiler binaries available on kernel.org, and non-default configurations"

* 'WIP.sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (146 commits)
  sched/headers: Clean up <linux/sched.h>
  sched/headers: Remove #ifdefs from <linux/sched.h>
  sched/headers: Remove the <linux/topology.h> include from <linux/sched.h>
  sched/headers, hrtimer: Remove the <linux/wait.h> include from <linux/hrtimer.h>
  sched/headers, x86/apic: Remove the <linux/pm.h> header inclusion from <asm/apic.h>
  sched/headers, timers: Remove the <linux/sysctl.h> include from <linux/timer.h>
  sched/headers: Remove <linux/magic.h> from <linux/sched/task_stack.h>
  sched/headers: Remove <linux/sched.h> from <linux/sched/init.h>
  sched/core: Remove unused prefetch_stack()
  sched/headers: Remove <linux/rculist.h> from <linux/sched.h>
  sched/headers: Remove the 'init_pid_ns' prototype from <linux/sched.h>
  sched/headers: Remove <linux/signal.h> from <linux/sched.h>
  sched/headers: Remove <linux/rwsem.h> from <linux/sched.h>
  sched/headers: Remove the runqueue_is_locked() prototype
  sched/headers: Remove <linux/sched.h> from <linux/sched/hotplug.h>
  sched/headers: Remove <linux/sched.h> from <linux/sched/debug.h>
  sched/headers: Remove <linux/sched.h> from <linux/sched/nohz.h>
  sched/headers: Remove <linux/sched.h> from <linux/sched/stat.h>
  sched/headers: Remove the <linux/gfp.h> include from <linux/sched.h>
  sched/headers: Remove <linux/rtmutex.h> from <linux/sched.h>
  ...
2017-03-03 | Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost | Linus Torvalds | 1 | -41/+132
Pull vhost updates from Michael Tsirkin:
 "virtio, vhost: optimizations, fixes

  Looks like a quiet cycle for vhost/virtio, just a couple of minor tweaks. Most notable is automatic interrupt affinity for blk and scsi. Hopefully other devices are not far behind"

* tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost:
  virtio-console: avoid DMA from stack
  vhost: introduce O(1) vq metadata cache
  virtio_scsi: use virtio IRQ affinity
  virtio_blk: use virtio IRQ affinity
  blk-mq: provide a default queue mapping for virtio device
  virtio: provide a method to get the IRQ affinity mask for a virtqueue
  virtio: allow drivers to request IRQ affinity when creating VQs
  virtio_pci: simplify MSI-X setup
  virtio_pci: don't duplicate the msix_enable flag in struct pci_dev
  virtio_pci: use shared interrupts for virtqueues
  virtio_pci: remove struct virtio_pci_vq_info
  vhost: try avoiding avail index access when getting descriptor
  virtio_mmio: expose header to userspace
2017-03-02 | sched/headers: Prepare to move signal wakeup & sigpending methods from <linux/sched.h> into <linux/sched/signal.h> | Ingo Molnar | 1 | -0/+1
Fix up affected files that include this signal functionality via sched.h.
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02 | sched/headers: Prepare for new header dependencies before moving code to <linux/sched/mm.h> | Ingo Molnar | 1 | -0/+1
We are going to split <linux/sched/mm.h> out of <linux/sched.h>, which will have to be picked up from other headers and a couple of .c files. Create a trivial placeholder <linux/sched/mm.h> file that just maps to <linux/sched.h> to make this patch obviously correct and bisectable. The APIs that are going to be moved first are:

    mm_alloc()
    __mmdrop()
    mmdrop()
    mmdrop_async_fn()
    mmdrop_async()
    mmget_not_zero()
    mmput()
    mmput_async()
    get_task_mm()
    mm_access()
    mm_release()

Include the new header in the files that are going to need it.
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02 | vhost: introduce O(1) vq metadata cache | Jason Wang | 1 | -26/+110
When device IOTLB is enabled, all address translations are stored in an interval tree. The O(log N) search time can be slow for virtqueue metadata (avail, used and descriptors), since these are accessed much more often than other addresses. So this patch introduces an O(1) array which points to the interval tree nodes that store the translations of vq metadata. The array is updated during vq IOTLB prefetching and reset on each invalidation and TLB update. Each time we want to access vq metadata, this small array is queried before the interval tree. This is sufficient for static mappings but not dynamic mappings; we could do optimizations on top. Tests were done with l2fwd in guest (2M hugepage):

        noiommu     | before         | after
    tx  1.32Mpps    | 1.06Mpps(82%)  | 1.30Mpps(98%)
    rx  2.33Mpps    | 1.46Mpps(63%)  | 2.29Mpps(98%)

We can almost reach the same performance as noiommu mode.
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
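A schematic of the lookup path (names and shapes are assumptions for illustration, not vhost's exact code):

    #include <stdint.h>
    #include <stddef.h>

    enum { VQ_ADDR_DESC, VQ_ADDR_AVAIL, VQ_ADDR_USED, VQ_NUM_ADDRS };

    struct itree_node { uint64_t start, last; /* translation payload */ };

    struct itree_node *itree_search(uint64_t addr); /* hypothetical O(log N) */

    static struct itree_node *meta_cache[VQ_NUM_ADDRS]; /* O(1) front end */

    /* Consult the per-metadata-type slot first; fall back to the tree
     * and refill. Slots are cleared on every invalidation/TLB update,
     * so a stale hit is impossible. */
    static struct itree_node *vq_meta_lookup(int type, uint64_t addr)
    {
        struct itree_node *n = meta_cache[type];

        if (n && addr >= n->start && addr <= n->last)
            return n;                  /* O(1) hit for desc/avail/used */
        n = itree_search(addr);
        meta_cache[type] = n;
        return n;
    }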
2017-02-27 | vhost: try avoiding avail index access when getting descriptor | Jason Wang | 1 | -16/+23
If the last avail idx is not equal to the cached avail idx, we're sure there are still available buffers in the virtqueue, so there's no need to re-read the avail idx. Skip this to avoid unnecessary userspace memory access and memory barriers. A pktgen test shows about 3% improvement on rx pps.
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
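The shape of the shortcut, paraphrased with an assumed accessor:

    /* Only re-read the ring's avail index from userspace when our
     * cached copy says we have consumed everything we knew about. */
    if (vq->avail_idx == vq->last_avail_idx) {
            if (get_avail_idx(vq, &vq->avail_idx))  /* hypothetical fetch */
                    return -EFAULT;
            if (vq->avail_idx == vq->last_avail_idx)
                    return vq->num;                 /* ring still empty */
    }
    /* else: the cached indices already prove a buffer is available */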
2017-02-08 | Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net | David S. Miller | 1 | -6/+4
The conflict was an interaction between a bug fix in the netvsc driver in 'net' and an optimization of the RX path in 'net-next'.
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-04 | vhost: fix initialization for vq->is_le | Halil Pasic | 1 | -6/+4
Currently, under certain circumstances vhost_init_is_le does just a part of the initialization job and depends on vhost_reset_is_le being called too. For this reason vhost_vq_init_access used to call vhost_reset_is_le when vq->private_data is NULL. This is not only counter-intuitive, but also a real problem because it breaks vhost_net. The bug was introduced to vhost_net with commit 2751c9882b94 ("vhost: cross-endian support for legacy devices"). The symptom is corruption of the vq's used.idx field (virtio) after VHOST_NET_SET_BACKEND was issued as a part of the vhost shutdown on a vq with pending descriptors. Let us make sure the outcome of vhost_init_is_le never depends on the state it is actually supposed to initialize, and fix virtio_net by removing the reset from vhost_vq_init_access. With the above, there is no reason for vhost_reset_is_le to do just half of the job; let us make vhost_reset_is_le reinitialize is_le.
Signed-off-by: Halil Pasic <pasic@linux.vnet.ibm.com>
Reported-by: Michael A. Tebolt <miket@us.ibm.com>
Reported-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Fixes: commit 2751c9882b94 ("vhost: cross-endian support for legacy devices")
Cc: <stable@vger.kernel.org>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Greg Kurz <groug@kaod.org>
Tested-by: Michael A. Tebolt <miket@us.ibm.com>
2017-01-19 | vhost: better detection of available buffers | Jason Wang | 1 | -2/+6
This patch makes several tweaks to vhost_vq_avail_empty() for better performance:
- check the cached avail index first, which avoids userspace memory access
- use unlikely() for the failure of userspace access
- check vq->last_avail_idx instead of the cached avail index as the last step
This patch is needed for batching support, which needs to peek at whether there are still available buffers in the ring.
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-16 | Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs | Linus Torvalds | 1 | -2/+1
Pull vfs updates from Al Viro:
 - more ->d_init() stuff (work.dcache)
 - pathname resolution cleanups (work.namei)
 - a few missing iov_iter primitives - copy_from_iter_full() and friends. Either copy the full requested amount, advance the iterator and return true, or fail, return false and do _not_ advance the iterator. Quite a few open-coded callers converted (and became more readable and harder to fuck up that way) (work.iov_iter)
 - several assorted patches, the big one being logfs removal

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  logfs: remove from tree
  vfs: fix put_compat_statfs64() does not handle errors
  namei: fold should_follow_link() with the step into not-followed link
  namei: pass both WALK_GET and WALK_MORE to should_follow_link()
  namei: invert WALK_PUT logics
  namei: shift interpretation of LOOKUP_FOLLOW inside should_follow_link()
  namei: saner calling conventions for mountpoint_last()
  namei.c: get rid of user_path_parent()
  switch getfrag callbacks to ..._full() primitives
  make skb_add_data,{_nocache}() and skb_copy_to_page_nocache() advance only on success
  [iov_iter] new primitives - copy_from_iter_full() and friends
  don't open-code file_inode()
  ceph: switch to use of ->d_init()
  ceph: unify dentry_operations instances
  lustre: switch to use of ->d_init()
2016-12-16 | vhost: cache used event for better performance | Jason Wang | 1 | -6/+22
When the event index is enabled, we need to fetch the used event from userspace memory each time. This userspace fetch (with its memory barrier) can sometimes be avoided by 1) caching the used event and 2) observing that if the used event is ahead of new, and the old-to-new update does not cross it, there is no need to notify the guest. This is useful for heavy tx load: e.g. a guest pktgen test with the Linux driver shows ~3.5% improvement.
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2016-12-15 | vhost: add missing __user annotations | Michael S. Tsirkin | 1 | -5/+5
Several vhost functions were missing __user annotations on pointers, causing sparse warnings. Fix this up. sparse also warns about vhost_process_iotlb_msg which is local and should be static. Fix that up as well.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2016-12-15 | vhost: make interval tree static inline | Michael S. Tsirkin | 1 | -1/+1
vhost_umem_interval_tree is only used locally within vhost.c, mark it static. As some functions generated go unused, this triggers warnings unless we also mark it inline.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2016-12-08 | vhost: remove unnecessary smp_mb from vhost_work_queue | Peng Tao | 1 | -1/+1
test_and_set_bit() already implies a memory barrier.
Signed-off-by: Peng Tao <bergwolf@gmail.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-05 | [iov_iter] new primitives - copy_from_iter_full() and friends | Al Viro | 1 | -2/+1
copy_from_iter_full(), copy_from_iter_full_nocache() and csum_and_copy_from_iter_full() - counterparts of copy_from_iter() et al., advancing the iterator only in case of a successful full copy and returning whether it had been successful or not. Convert some obvious users. *NOTE* - do not blindly assume that something is a good candidate for those unless you are sure that not advancing iov_iter in the failure case is the right thing here. Anything that does short read/short write kind of stuff (or is in a loop, etc.) is unlikely to be a good one.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
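Typical caller shape with the new primitive (paraphrased; hdr and the error value are illustrative):

    struct header hdr;   /* illustrative type */

    /* The old style returned a byte count, forcing callers to compare
     * against the requested size and rewind the iterator on a short
     * copy. The _full() variant folds that into one boolean: */
    if (!copy_from_iter_full(&hdr, sizeof(hdr), from))
            return -EFAULT;  /* failed: the iterator was NOT advanced */
    /* success: exactly sizeof(hdr) bytes were consumed from 'from' */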
2016-08-02 | vhost: detect 32 bit integer wrap around | Michael S. Tsirkin | 1 | -2/+14
Detect and fail early if long wrap around is triggered.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2016-08-02 | vhost: new device IOTLB API | Jason Wang | 1 | -43/+593
This patch tries to implement a device IOTLB for vhost. This could be used with a userspace (qemu) implementation of DMA remapping to emulate an IOMMU for the guest. The idea is simple: cache the translations in a software device IOTLB (implemented as an interval tree) in vhost, and use the vhost_net file descriptor for reporting IOTLB misses and IOTLB updates/invalidations. When vhost meets an IOTLB miss, the fault address, size and access can be read from the file. After userspace finishes the translation, it writes the translated address to the vhost_net file to update the device IOTLB. When device IOTLB is enabled by setting VIRTIO_F_IOMMU_PLATFORM, all vq addresses set by ioctl are treated as iova instead of virtual addresses, and accesses can only be done through the IOTLB instead of direct userspace memory access. Before each round of vq processing, all vq metadata is prefetched into the device IOTLB to make sure no translation fault happens during vq processing. In most cases, virtqueues are contiguous even in virtual address space. The IOTLB translation for the virtqueue itself may make it a little slower; we might add a fast path cache on top of this patch.
Signed-off-by: Jason Wang <jasowang@redhat.com>
[mst: use virtio feature bit: VHOST_F_DEVICE_IOTLB -> VIRTIO_F_IOMMU_PLATFORM ]
[mst: fix build warnings ]
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
[ weiyj.lk: missing unlock on error ]
Signed-off-by: Wei Yongjun <weiyj.lk@gmail.com>
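The userspace side of the miss/update loop then looks schematically like this (a sketch based purely on the flow described above; the exact uapi framing of the messages is elided, and viommu_translate is a hypothetical helper):

    struct vhost_iotlb_msg miss, reply;

    read(vhost_fd, &miss, sizeof(miss));   /* kernel reports an IOTLB miss */
    if (miss.type == VHOST_IOTLB_MISS) {
            reply.type  = VHOST_IOTLB_UPDATE;
            reply.iova  = miss.iova;
            reply.uaddr = viommu_translate(miss.iova, &reply.size);
            reply.perm  = miss.perm;
            write(vhost_fd, &reply, sizeof(reply)); /* fill device IOTLB */
    }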
2016-08-02 | vhost: convert pre sorted vhost memory array to interval tree | Jason Wang | 1 | -81/+101
The current pre-sorted memory region array has some limitations for the future device IOTLB conversion:
1) it needs extra work for adding and removing a single region, and is expected to be slow because of sorting or memory re-allocation
2) it needs extra work to remove a large range which may intersect several regions of different sizes
3) it needs tricks to implement a replacement policy like LRU
To overcome the above shortcomings, this patch converts it to an interval tree, which can easily address the above issues with almost no extra work. The patch could be used to:
- extend the current API and only let userspace send diffs of the memory table
- simplify the device IOTLB implementation
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
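For reference, the kernel's generic interval-tree API that this builds on looks roughly like this in use (a sketch; signatures shown as of the later rb_root_cached conversion noted further up this log):

    #include <linux/interval_tree.h>

    struct rb_root_cached root = RB_ROOT_CACHED;
    struct interval_tree_node node;

    /* Insert one region as an inclusive [start, last] range. */
    node.start = region_start;
    node.last  = region_start + region_size - 1;
    interval_tree_insert(&node, &root);

    /* O(log N) stabbing query: first node overlapping [addr, addr]. */
    struct interval_tree_node *hit =
            interval_tree_iter_first(&root, addr, addr);

    /* Removing a single region is symmetric and cheap. */
    interval_tree_remove(&node, &root);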
2016-08-02 | vhost: introduce vhost memory accessors | Jason Wang | 1 | -15/+35
This patch introduces vhost memory accessors which were just wrappers for userspace address access helpers. This is a requirement for the vhost device iotlb implementation, which will add iotlb translations in those accessors.
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2016-08-01 | vhost: lockless enqueuing | Jason Wang | 1 | -27/+25
We use a spinlock to synchronize the work list now, which may cause unnecessary contention. So this patch switches to an llist to remove this contention. Pktgen tests show about 5% improvement:

    Before: ~1300000 pps
    After:  ~1370000 pps

Signed-off-by: Jason Wang <jasowang@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
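Both sides of the llist pattern, paraphrased from how the kernel's llist API is typically used for this:

    /* Producer (any context): lock-free push onto the work list. */
    llist_add(&work->node, &dev->work_list);

    /* Consumer (worker thread): atomically take the whole list. */
    struct llist_node *batch = llist_del_all(&dev->work_list);

    /* llist yields LIFO order; restore submission (FIFO) order. */
    batch = llist_reverse_order(batch);

    llist_for_each_entry_safe(work, next, batch, node)
            work->fn(work);   /* run each queued vhost work item */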
2016-08-01 | vhost: simplify work flushing | Jason Wang | 1 | -32/+21
We used to implement work flushing by tracking a queued seq, a done seq, and the number of flushes in progress. This patch simplifies this by implementing flushing through another kind of vhost work with a completion. This will be used by the lockless enqueuing patch.
Signed-off-by: Jason Wang <jasowang@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2016-03-11 | vhost_net: basic polling support | Jason Wang | 1 | -0/+14
This patch tries to poll for newly added tx buffers or the socket receive queue for a while at the end of tx/rx processing. The maximum time spent on polling is specified through a new kind of vring ioctl.
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2016-03-11 | vhost: introduce vhost_vq_avail_empty() | Jason Wang | 1 | -0/+14
This patch introduces a helper which returns true if we're sure that the available ring is empty for a specific vq. When we're not sure, e.g. on vq access failure, it returns false instead. This can be used by busy polling code to decide when to exit the busy loop.
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
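A standalone sketch of the helper's contract (paraphrased names; the accessor stands in for the real userspace read):

    #include <stdbool.h>
    #include <stdint.h>

    struct vq { uint16_t last_avail_idx; /* ... */ };

    int read_avail_idx(struct vq *vq, uint16_t *idx); /* hypothetical;
                                                         nonzero on failure */

    /* True only when we can PROVE the avail ring is empty; on any doubt
     * (e.g. a failed userspace read) say "not empty", so a busy-polling
     * caller keeps going rather than acting on bad data. */
    static bool vq_avail_empty(struct vq *vq)
    {
        uint16_t avail_idx;

        if (read_avail_idx(vq, &avail_idx))
            return false;
        return avail_idx == vq->last_avail_idx;
    }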
2016-03-11 | vhost: introduce vhost_has_work() | Jason Wang | 1 | -0/+7
This patch introduces a helper which gives a hint about whether or not there's work queued in the work list. This can be used by busy polling code to decide when to exit the busy loop.
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2016-03-02 | vhost: rename vhost_init_used() | Greg Kurz | 1 | -2/+2
Looking at how callers use this, maybe we should just rename init_used to vhost_vq_init_access. The _used suffix was a hint that we access the vq used ring. But maybe what callers care about is that it must be called after access_ok. Also, this function manipulates the vq->is_le field which isn't related to the vq used ring. This patch simply renames vhost_init_used() to vhost_vq_init_access() as suggested by Michael. No behaviour change.
Signed-off-by: Greg Kurz <gkurz@linux.vnet.ibm.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2016-03-02 | vhost: rename cross-endian helpers | Greg Kurz | 1 | -6/+24
The default use case for vhost is when the host and the vring have the same endianness (default native endianness). But there are cases where they differ and vhost should byteswap when accessing the vring.

The first case is when the host is big endian and the vring belongs to a virtio 1.0 device, which is always little endian. This is covered by the vq->is_le field. This field is initialized when userspace calls the VHOST_SET_FEATURES ioctl. It is reset when the device stops. We already have a vhost_init_is_le() helper, but the reset operation is opencoded as follows:

    vq->is_le = virtio_legacy_is_little_endian();

It isn't clear that we are resetting vq->is_le here. This patch moves the code to a helper with a more explicit name.

The other case where we may have to byteswap is when the architecture can switch endianness at runtime (bi-endian). If endianness differs in the host and in the guest, then legacy devices need to be used in cross-endian mode. This mode is available with CONFIG_VHOST_CROSS_ENDIAN_LEGACY=y, which introduces a vq->user_be field. Userspace may enable cross-endian mode by calling the SET_VRING_ENDIAN ioctl before the device is started. The cross-endian mode is disabled when the device is stopped. The current names of the helpers that manipulate vq->user_be are unclear. This patch renames those helpers to clearly show that this is cross-endian stuff and with explicit enable/disable semantics. No behaviour change.
Signed-off-by: Greg Kurz <gkurz@linux.vnet.ibm.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2016-03-02 | vhost: fix error path in vhost_init_used() | Greg Kurz | 1 | -4/+11
We don't want side effects: if something fails, we roll back vq->is_le to its previous value.
Signed-off-by: Greg Kurz <gkurz@linux.vnet.ibm.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2015-12-07 | vhost: replace % with & on data path | Michael S. Tsirkin | 1 | -3/+3
We know vring num is a power of 2, so use & to mask the high bits.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
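The transformation in question, with an assumed ring-size field:

    /* vq->num is guaranteed to be a power of two, so the masking form
     * gives identical results while avoiding an integer divide on the
     * hot path: */
    unsigned int before = (last_used + i) % vq->num;        /* integer divide */
    unsigned int after  = (last_used + i) & (vq->num - 1);  /* single AND     */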