kernel/linux.git/drivers/vfio, branch v4.14.85

vfio/type1: Fix task tracking for QEMU vCPU hotplug

2018-08-03T05:50:23+00:00

[ Upstream commit 48d8476b41eed63567dd2f0ad125c895b9ac648a ] MAP_DMA ioctls might be called from various threads within a process, for example when using QEMU, the vCPU threads are often generating these calls and we therefore take a reference to that vCPU task. However, QEMU also supports vCPU hotplug on some machines and the task that called MAP_DMA may have exited by the time UNMAP_DMA is called, resulting in the mm_struct pointer being NULL and thus a failure to match against the existing mapping. To resolve this, we instead take a reference to the thread group_leader, which has the same mm_struct and resource limits, but is less likely exit, at least in the QEMU case. A difficulty here is guaranteeing that the capabilities of the group_leader match that of the calling thread, which we resolve by tracking CAP_IPC_LOCK at the time of calling rather than at an indeterminate time in the future. Potentially this also results in better efficiency as this is now recorded once per MAP_DMA ioctl. Reported-by: Xu Yandong Signed-off-by: Alex Williamson Signed-off-by: Sasha Levin Signed-off-by: Greg Kroah-Hartman

vfio/mdev: Check globally for duplicate devices

2018-08-03T05:50:22+00:00

[ Upstream commit 002fe996f67f4f46d8917b14cfb6e4313c20685a ] When we create an mdev device, we check for duplicates against the parent device and return -EEXIST if found, but the mdev device namespace is global since we'll link all devices from the bus. We do catch this later in sysfs_do_create_link_sd() to return -EEXIST, but with it comes a kernel warning and stack trace for trying to create duplicate sysfs links, which makes it an undesirable response. Therefore we should really be looking for duplicates across all mdev parent devices, or as implemented here, against our mdev device list. Using mdev_list to prevent duplicates means that we can remove mdev_parent.lock, but in order not to serialize mdev device creation and removal globally, we add mdev_device.active which allows UUIDs to be reserved such that we can drop the mdev_list_lock before the mdev device is fully in place. Two behavioral notes; first, mdev_parent.lock had the side-effect of serializing mdev create and remove ops per parent device. This was an implementation detail, not an intentional guarantee provided to the mdev vendor drivers. Vendor drivers can trivially provide this serialization internally if necessary. Second, review comments note the new -EAGAIN behavior when the device, and in particular the remove attribute, becomes visible in sysfs. If a remove is triggered prior to completion of mdev_device_create() the user will see a -EAGAIN error. While the errno is different, receiving an error during this period is not, the previous implementation returned -ENODEV for the same condition. Furthermore, the consistency to the user is improved in the case where mdev_device_remove_ops() returns error. Previously concurrent calls to mdev_device_remove() could see the device disappear with -ENODEV and return in the case of error. Now a user would see -EAGAIN while the device is in this transitory state. Reviewed-by: Kirti Wankhede Reviewed-by: Cornelia Huck Acked-by: Halil Pasic Acked-by: Zhenyu Wang Signed-off-by: Alex Williamson Signed-off-by: Sasha Levin Signed-off-by: Greg Kroah-Hartman

vfio: platform: Fix reset module leak in error path

2018-08-03T05:50:22+00:00

[ Upstream commit 28a68387888997e8a7fa57940ea5d55f2e16b594 ] If the IOMMU group setup fails, the reset module is not released. Fixes: b5add544d677d363 ("vfio, platform: make reset driver a requirement by default") Signed-off-by: Geert Uytterhoeven Reviewed-by: Eric Auger Reviewed-by: Simon Horman Acked-by: Eric Auger Signed-off-by: Alex Williamson Signed-off-by: Sasha Levin Signed-off-by: Greg Kroah-Hartman

KVM: PPC: Check if IOMMU page is contained in the pinned physical page

2018-07-28T05:55:41+00:00

commit 76fa4975f3ed12d15762bc979ca44078598ed8ee upstream. A VM which has: - a DMA capable device passed through to it (eg. network card); - running a malicious kernel that ignores H_PUT_TCE failure; - capability of using IOMMU pages bigger that physical pages can create an IOMMU mapping that exposes (for example) 16MB of the host physical memory to the device when only 64K was allocated to the VM. The remaining 16MB - 64K will be some other content of host memory, possibly including pages of the VM, but also pages of host kernel memory, host programs or other VMs. The attacking VM does not control the location of the page it can map, and is only allowed to map as many pages as it has pages of RAM. We already have a check in drivers/vfio/vfio_iommu_spapr_tce.c that an IOMMU page is contained in the physical page so the PCI hardware won't get access to unassigned host memory; however this check is missing in the KVM fastpath (H_PUT_TCE accelerated code). We were lucky so far and did not hit this yet as the very first time when the mapping happens we do not have tbl::it_userspace allocated yet and fall back to the userspace which in turn calls VFIO IOMMU driver, this fails and the guest does not retry, This stores the smallest preregistered page size in the preregistered region descriptor and changes the mm_iommu_xxx API to check this against the IOMMU page size. This calculates maximum page size as a minimum of the natural region alignment and compound page size. For the page shift this uses the shift returned by find_linux_pte() which indicates how the page is mapped to the current userspace - if the page is huge and this is not a zero, then it is a leaf pte and the page is mapped within the range. Fixes: 121f80ba68f1 ("KVM: PPC: VFIO: Add in-kernel acceleration for VFIO") Cc: stable@vger.kernel.org # v4.12+ Signed-off-by: Alexey Kardashevskiy Reviewed-by: David Gibson Signed-off-by: Michael Ellerman Signed-off-by: Alexey Kardashevskiy Signed-off-by: Greg Kroah-Hartman

vfio/spapr: Use IOMMU pageshift rather than pagesize

2018-07-25T09:25:08+00:00

commit 1463edca6734d42ab4406fa2896e20b45478ea36 upstream. The size is always equal to 1 page so let's use this. Later on this will be used for other checks which use page shifts to check the granularity of access. This should cause no behavioral change. Cc: stable@vger.kernel.org # v4.12+ Reviewed-by: David Gibson Acked-by: Alex Williamson Signed-off-by: Alexey Kardashevskiy Signed-off-by: Michael Ellerman Signed-off-by: Greg Kroah-Hartman

vfio/pci: Fix potential Spectre v1

2018-07-25T09:25:08+00:00

commit 0e714d27786ce1fb3efa9aac58abc096e68b1c2a upstream. info.index can be indirectly controlled by user-space, hence leading to a potential exploitation of the Spectre variant 1 vulnerability. This issue was detected with the help of Smatch: drivers/vfio/pci/vfio_pci.c:734 vfio_pci_ioctl() warn: potential spectre issue 'vdev->region' Fix this by sanitizing info.index before indirectly using it to index vdev->region Notice that given that speculation windows are large, the policy is to kill the speculation on the first load and not worry if it can be completed with a dependent load/store [1]. [1] https://marc.info/?l=linux-kernel&m=152449131114778&w=2 Cc: stable@vger.kernel.org Signed-off-by: Gustavo A. R. Silva Signed-off-by: Alex Williamson Signed-off-by: Greg Kroah-Hartman

vfio: Use get_user_pages_longterm correctly

2018-07-11T14:29:15+00:00

commit bb94b55af3461e26b32f0e23d455abeae0cfca5d upstream. The patch noted in the fixes below converted get_user_pages_fast() to get_user_pages_longterm(), however the two calls differ in a few ways. First _fast() is documented to not require the mmap_sem, while _longterm() is documented to need it. Hold the mmap sem as required. Second, _fast accepts an 'int write' while _longterm uses 'unsigned int gup_flags', so the expression '!!(prot & IOMMU_WRITE)' is only working by luck as FOLL_WRITE is currently == 0x1. Use the expected FOLL_WRITE constant instead. Fixes: 94db151dc892 ("vfio: disable filesystem-dax page pinning") Cc: Signed-off-by: Jason Gunthorpe Acked-by: Dan Williams Signed-off-by: Alex Williamson Signed-off-by: Greg Kroah-Hartman

vfio/pci: Virtualize Maximum Read Request Size

2018-04-24T07:36:34+00:00

commit cf0d53ba4947aad6e471491d5b20a567cbe92e56 upstream. MRRS defines the maximum read request size a device is allowed to make. Drivers will often increase this to allow more data transfer with a single request. Completions to this request are bound by the MPS setting for the bus. Aside from device quirks (none known), it doesn't seem to make sense to set an MRRS value less than MPS, yet this is a likely scenario given that user drivers do not have a system-wide view of the PCI topology. Virtualize MRRS such that the user can set MRRS >= MPS, but use MPS as the floor value that we'll write to hardware. Signed-off-by: Alex Williamson Signed-off-by: Greg Kroah-Hartman

vfio: disable filesystem-dax page pinning

2018-03-09T06:41:06+00:00

commit 94db151dc89262bfa82922c44e8320cea2334667 upstream. Filesystem-DAX is incompatible with 'longterm' page pinning. Without page cache indirection a DAX mapping maps filesystem blocks directly. This means that the filesystem must not modify a file's block map while any page in a mapping is pinned. In order to prevent the situation of userspace holding of filesystem operations indefinitely, disallow 'longterm' Filesystem-DAX mappings. RDMA has the same conflict and the plan there is to add a 'with lease' mechanism to allow the kernel to notify userspace that the mapping is being torn down for block-map maintenance. Perhaps something similar can be put in place for vfio. Note that xfs and ext4 still report: "DAX enabled. Warning: EXPERIMENTAL, use at your own risk" ...at mount time, and resolving the dax-dma-vs-truncate problem is one of the last hurdles to remove that designation. Acked-by: Alex Williamson Cc: Michal Hocko Cc: kvm@vger.kernel.org Cc: Reported-by: Haozhong Zhang Tested-by: Haozhong Zhang Fixes: d475c6346a38 ("dax,ext2: replace XIP read and write with DAX I/O") Reviewed-by: Christoph Hellwig Signed-off-by: Dan Williams Signed-off-by: Greg Kroah-Hartman

vfio/pci: Virtualize Maximum Payload Size

2017-12-25T13:26:29+00:00

[ Upstream commit 523184972b282cd9ca17a76f6ca4742394856818 ] With virtual PCI-Express chipsets, we now see userspace/guest drivers trying to match the physical MPS setting to a virtual downstream port. Of course a lone physical device surrounded by virtual interconnects cannot make a correct decision for a proper MPS setting. Instead, let's virtualize the MPS control register so that writes through to hardware are disallowed. Userspace drivers like QEMU assume they can write anything to the device and we'll filter out anything dangerous. Since mismatched MPS can lead to AER and other faults, let's add it to the kernel side rather than relying on userspace virtualization to handle it. Signed-off-by: Alex Williamson Reviewed-by: Eric Auger Signed-off-by: Sasha Levin Signed-off-by: Greg Kroah-Hartman