summaryrefslogtreecommitdiff
AgeCommit message (Collapse)AuthorFilesLines
2022-12-02vfio-iommufd: Allow iommufd to be used in place of a container fdJason Gunthorpe4-16/+82
This makes VFIO_GROUP_SET_CONTAINER accept both a vfio container FD and an iommufd. In iommufd mode an IOAS will exist after the SET_CONTAINER, but it will not be attached to any groups. For VFIO this means that the VFIO_GROUP_GET_STATUS and VFIO_GROUP_FLAGS_VIABLE works subtly differently. With the container FD the iommu_group_claim_dma_owner() is done during SET_CONTAINER but for IOMMUFD this is done during VFIO_GROUP_GET_DEVICE_FD. Meaning that VFIO_GROUP_FLAGS_VIABLE could be set but GET_DEVICE_FD will fail due to viability. As GET_DEVICE_FD can fail for many reasons already this is not expected to be a meaningful difference. Reorganize the tests for if the group has an assigned container or iommu into a vfio_group_has_iommu() function and consolidate all the duplicated WARN_ON's etc related to this. Call container functions only if a container is actually present on the group. Link: https://lore.kernel.org/r/5-v4-42cd2eb0e3eb+335a-vfio_iommufd_jgg@nvidia.com Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Alex Williamson <alex.williamson@redhat.com> Tested-by: Alex Williamson <alex.williamson@redhat.com> Tested-by: Nicolin Chen <nicolinc@nvidia.com> Tested-by: Yi Liu <yi.l.liu@intel.com> Tested-by: Lixiao Yang <lixiao.yang@intel.com> Tested-by: Matthew Rosato <mjrosato@linux.ibm.com> Tested-by: Yu He <yu.he@intel.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2022-12-02vfio: Use IOMMU_CAP_ENFORCE_CACHE_COHERENCY for vfio_file_enforced_coherent()Jason Gunthorpe3-17/+19
iommufd doesn't establish the iommu_domains until after the device FD is opened, even if the container has been set. This design is part of moving away from the group centric iommu APIs. This is fine, except that the normal sequence of establishing the kvm wbinvd won't work: group = open("/dev/vfio/XX") ioctl(group, VFIO_GROUP_SET_CONTAINER) ioctl(kvm, KVM_DEV_VFIO_GROUP_ADD) ioctl(group, VFIO_GROUP_GET_DEVICE_FD) As the domains don't start existing until GET_DEVICE_FD. Further, GET_DEVICE_FD requires that KVM_DEV_VFIO_GROUP_ADD already be done as that is what sets the group->kvm and thus device->kvm for the driver to use during open. Now that we have device centric cap ops and the new IOMMU_CAP_ENFORCE_CACHE_COHERENCY we know what the iommu_domain will be capable of without having to create it. Use this to compute vfio_file_enforced_coherent() and resolve the ordering problems. VFIO always tries to upgrade domains to enforce cache coherency, it never attaches a device that supports enforce cache coherency to a less capable domain, so the cap test is a sufficient proxy for the ultimate outcome. iommufd also ensures that devices that set the cap will be connected to enforcing domains. Link: https://lore.kernel.org/r/4-v4-42cd2eb0e3eb+335a-vfio_iommufd_jgg@nvidia.com Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Alex Williamson <alex.williamson@redhat.com> Tested-by: Alex Williamson <alex.williamson@redhat.com> Tested-by: Nicolin Chen <nicolinc@nvidia.com> Tested-by: Yi Liu <yi.l.liu@intel.com> Tested-by: Lixiao Yang <lixiao.yang@intel.com> Tested-by: Matthew Rosato <mjrosato@linux.ibm.com> Tested-by: Yu He <yu.he@intel.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2022-12-02vfio: Rename vfio_device_assign/unassign_container()Jason Gunthorpe3-13/+11
These functions don't really assign anything anymore, they just increment some refcounts and do a sanity check. Call them vfio_group_[un]use_container() Link: https://lore.kernel.org/r/3-v4-42cd2eb0e3eb+335a-vfio_iommufd_jgg@nvidia.com Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Yi Liu <yi.l.liu@intel.com> Reviewed-by: Alex Williamson <alex.williamson@redhat.com> Tested-by: Alex Williamson <alex.williamson@redhat.com> Tested-by: Nicolin Chen <nicolinc@nvidia.com> Tested-by: Yi Liu <yi.l.liu@intel.com> Tested-by: Lixiao Yang <lixiao.yang@intel.com> Tested-by: Matthew Rosato <mjrosato@linux.ibm.com> Tested-by: Yu He <yu.he@intel.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2022-12-02vfio: Move vfio_device_assign_container() into vfio_device_first_open()Jason Gunthorpe2-15/+13
The only thing this function does is assert the group has an assigned container and incrs refcounts. The overall model we have is that once a container_users refcount is incremented it cannot be de-assigned from the group - vfio_group_ioctl_unset_container() will fail and the group FD cannot be closed. Thus we do not need to check this on every device FD open, just the first. Reorganize the code so that only the first open and last close manages the container. Link: https://lore.kernel.org/r/2-v4-42cd2eb0e3eb+335a-vfio_iommufd_jgg@nvidia.com Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Yi Liu <yi.l.liu@intel.com> Reviewed-by: Alex Williamson <alex.williamson@redhat.com> Tested-by: Alex Williamson <alex.williamson@redhat.com> Tested-by: Nicolin Chen <nicolinc@nvidia.com> Tested-by: Yi Liu <yi.l.liu@intel.com> Tested-by: Lixiao Yang <lixiao.yang@intel.com> Tested-by: Matthew Rosato <mjrosato@linux.ibm.com> Tested-by: Yu He <yu.he@intel.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2022-12-02vfio: Move vfio_device driver open/close code to a functionJason Gunthorpe1-42/+53
This error unwind is getting complicated. Move all the code into two pair'd function. The functions should be called when the open_count == 1 after incrementing/before decrementing. Link: https://lore.kernel.org/r/1-v4-42cd2eb0e3eb+335a-vfio_iommufd_jgg@nvidia.com Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Yi Liu <yi.l.liu@intel.com> Reviewed-by: Alex Williamson <alex.williamson@redhat.com> Tested-by: Alex Williamson <alex.williamson@redhat.com> Tested-by: Yi Liu <yi.l.liu@intel.com> Tested-by: Nicolin Chen <nicolinc@nvidia.com> Tested-by: Lixiao Yang <lixiao.yang@intel.com> Tested-by: Matthew Rosato <mjrosato@linux.ibm.com> Tested-by: Yu He <yu.he@intel.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2022-12-02vfio/ap: Validate iova during dma_unmap and trigger irq disableMatthew Rosato1-1/+17
Currently, each mapped iova is stashed in its associated vfio_ap_queue; when we get an unmap request, validate that it matches with one or more of these stashed values before attempting unpins. Each stashed iova represents IRQ that was enabled for a queue. Therefore, if a match is found, trigger IRQ disable for this queue to ensure that underlying firmware will no longer try to use the associated pfn after the page is unpinned. IRQ disable will also handle the associated unpin. Link: https://lore.kernel.org/r/20221202135402.756470-3-yi.l.liu@intel.com Reviewed-by: Tony Krowiak <akrowiak@linux.ibm.com> Signed-off-by: Matthew Rosato <mjrosato@linux.ibm.com> Signed-off-by: Yi Liu <yi.l.liu@intel.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2022-12-02i915/gvt: Move gvt mapping cache initialization to intel_vgpu_init_dev()Yi Liu1-4/+14
vfio container registers .dma_unmap() callback after the device is opened. So it's fine for mdev drivers to initialize internal mapping cache in .open_device(). See vfio_device_container_register(). Now with iommufd an access ops with an unmap callback is registered when the device is bound to iommufd which is before .open_device() is called. This implies gvt's .dma_unmap() could be called before its internal mapping cache is initialized. The fix is moving gvt mapping cache initialization to vGPU init. While at it also move ptable initialization together. Link: https://lore.kernel.org/r/20221202135402.756470-2-yi.l.liu@intel.com Reviewed-by: Zhi Wang <zhi.a.wang@intel.com> Reviewed-by: Zhenyu Wang <zhenyuw@linux.intel.com> Signed-off-by: Yi Liu <yi.l.liu@intel.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2022-12-01Merge patch series "IOMMUFD Generic interface"Jason Gunthorpe40-37/+10452
Jason Gunthorpe <jgg@nvidia.com> says: ================== iommufd is the user API to control the IOMMU subsystem as it relates to managing IO page tables that point at user space memory. It takes over from drivers/vfio/vfio_iommu_type1.c (aka the VFIO container) which is the VFIO specific interface for a similar idea. We see a broad need for extended features, some being highly IOMMU device specific: - Binding iommu_domain's to PASID/SSID - Userspace IO page tables, for ARM, x86 and S390 - Kernel bypassed invalidation of user page tables - Re-use of the KVM page table in the IOMMU - Dirty page tracking in the IOMMU - Runtime Increase/Decrease of IOPTE size - PRI support with faults resolved in userspace Many of these HW features exist to support VM use cases - for instance the combination of PASID, PRI and Userspace IO Page Tables allows an implementation of DMA Shared Virtual Addressing (vSVA) within a guest. Dirty tracking enables VM live migration with SRIOV devices and PASID support allow creating "scalable IOV" devices, among other things. As these features are fundamental to a VM platform they need to be uniformly exposed to all the driver families that do DMA into VMs, which is currently VFIO and VDPA. The pre-v1 series proposed re-using the VFIO type 1 data structure, however it was suggested that if we are doing this big update then we should also come with an improved data structure that solves the limitations that VFIO type1 has. Notably this addresses: - Multiple IOAS/'containers' and multiple domains inside a single FD - Single-pin operation no matter how many domains and containers use a page - A fine grained locking scheme supporting user managed concurrency for multi-threaded map/unmap - A pre-registration mechanism to optimize vIOMMU use cases by pre-pinning pages - Extended ioctl API that can manage these new objects and exposes domains directly to user space - domains are sharable between subsystems, eg VFIO and VDPA The bulk of this code is a new data structure design to track how the IOVAs are mapped to PFNs. iommufd intends to be general and consumable by any driver that wants to DMA to userspace. From a driver perspective it can largely be dropped in in-place of iommu_attach_device() and provides a uniform full feature set to all consumers. As this is a larger project this series is the first step. This series provides the iommfd "generic interface" which is designed to be suitable for applications like DPDK and VMM flows that are not optimized to specific HW scenarios. It is close to being a drop in replacement for the existing VFIO type 1 and supports existing qemu based VM flows. Several follow-on series are being prepared: - Patches integrating with qemu in native mode: https://github.com/yiliu1765/qemu/commits/qemu-iommufd-6.0-rc2 - A completed integration with VFIO now exists that covers "emulated" mdev use cases now, and can pass testing with qemu/etc in compatability mode: https://github.com/jgunthorpe/linux/commits/vfio_iommufd - A draft providing system iommu dirty tracking on top of iommufd, including iommu driver implementations: https://github.com/jpemartins/linux/commits/x86-iommufd This pairs with patches for providing a similar API to support VFIO-device tracking to give a complete vfio solution: https://lore.kernel.org/kvm/20220901093853.60194-1-yishaih@nvidia.com/ - Userspace page tables aka 'nested translation' for ARM and Intel iommu drivers: https://github.com/nicolinc/iommufd/commits/iommufd_nesting - "device centric" vfio series to expose the vfio_device FD directly as a normal cdev, and provide an extended API allowing dynamically changing the IOAS binding: https://github.com/yiliu1765/iommufd/commits/iommufd-v6.0-rc2-nesting-0901 - Drafts for PASID and PRI interfaces are included above as well Overall enough work is done now to show the merit of the new API design and at least draft solutions to many of the main problems. Several people have contributed directly to this work: Eric Auger, Joao Martins, Kevin Tian, Lu Baolu, Nicolin Chen, Yi L Liu. Many more have participated in the discussions that lead here, and provided ideas. Thanks to all! The v1/v2 iommufd series has been used to guide a large amount of preparatory work that has now been merged. The general theme is to organize things in a way that makes injecting iommufd natural: - VFIO live migration support with mlx5 and hisi_acc drivers. These series need a dirty tracking solution to be really usable. https://lore.kernel.org/kvm/20220224142024.147653-1-yishaih@nvidia.com/ https://lore.kernel.org/kvm/20220308184902.2242-1-shameerali.kolothum.thodi@huawei.com/ - Significantly rework the VFIO gvt mdev and remove struct mdev_parent_ops https://lore.kernel.org/lkml/20220411141403.86980-1-hch@lst.de/ - Rework how PCIe no-snoop blocking works https://lore.kernel.org/kvm/0-v3-2cf356649677+a32-intel_no_snoop_jgg@nvidia.com/ - Consolidate dma ownership into the iommu core code https://lore.kernel.org/linux-iommu/20220418005000.897664-1-baolu.lu@linux.intel.com/ - Make all vfio driver interfaces use struct vfio_device consistently https://lore.kernel.org/kvm/0-v4-8045e76bf00b+13d-vfio_mdev_no_group_jgg@nvidia.com/ - Remove the vfio_group from the kvm/vfio interface https://lore.kernel.org/kvm/0-v3-f7729924a7ea+25e33-vfio_kvm_no_group_jgg@nvidia.com/ - Simplify locking in vfio https://lore.kernel.org/kvm/0-v2-d035a1842d81+1bf-vfio_group_locking_jgg@nvidia.com/ - Remove the vfio notifiter scheme that faces drivers https://lore.kernel.org/kvm/0-v4-681e038e30fd+78-vfio_unmap_notif_jgg@nvidia.com/ - Improve the driver facing API for vfio pin/unpin pages to make the presence of struct page clear https://lore.kernel.org/kvm/20220723020256.30081-1-nicolinc@nvidia.com/ - Clean up in the Intel IOMMU driver https://lore.kernel.org/linux-iommu/20220301020159.633356-1-baolu.lu@linux.intel.com/ https://lore.kernel.org/linux-iommu/20220510023407.2759143-1-baolu.lu@linux.intel.com/ https://lore.kernel.org/linux-iommu/20220514014322.2927339-1-baolu.lu@linux.intel.com/ https://lore.kernel.org/linux-iommu/20220706025524.2904370-1-baolu.lu@linux.intel.com/ https://lore.kernel.org/linux-iommu/20220702015610.2849494-1-baolu.lu@linux.intel.com/ - Rework s390 vfio drivers https://lore.kernel.org/kvm/20220707135737.720765-1-farman@linux.ibm.com/ - Normalize vfio ioctl handling https://lore.kernel.org/kvm/0-v2-0f9e632d54fb+d6-vfio_ioctl_split_jgg@nvidia.com/ - VFIO API for dirty tracking (aka dma logging) managed inside a PCI device, with mlx5 implementation https://lore.kernel.org/kvm/20220901093853.60194-1-yishaih@nvidia.com - Introduce a struct device sysfs presence for struct vfio_device https://lore.kernel.org/kvm/20220901143747.32858-1-kevin.tian@intel.com/ - Complete restructuring the vfio mdev model https://lore.kernel.org/kvm/20220822062208.152745-1-hch@lst.de/ - Isolate VFIO container code in preperation for iommufd to provide an alternative implementation of it all https://lore.kernel.org/kvm/0-v1-a805b607f1fb+17b-vfio_container_split_jgg@nvidia.com - Simplify and consolidate iommu_domain/device compatability checking https://lore.kernel.org/linux-iommu/cover.1666042872.git.nicolinc@nvidia.com/ - Align iommu SVA support with the domain-centric model https://lore.kernel.org/all/20221031005917.45690-1-baolu.lu@linux.intel.com/ This is about 233 patches applied since March, thank you to everyone involved in all this work! Currently there are a number of supporting series still in progress: - DMABUF exporter support for VFIO to allow PCI P2P with VFIO https://lore.kernel.org/r/0-v2-472615b3877e+28f7-vfio_dma_buf_jgg@nvidia.com - Start to provide iommu_domain ops for POWER https://lore.kernel.org/all/20220714081822.3717693-1-aik@ozlabs.ru/ However, these are not necessary for this series to advance. Syzkaller coverage has been merged and is now running in the syzbot environment on linux-next: https://github.com/google/syzkaller/pull/3515 https://github.com/google/syzkaller/pull/3521 ================== Link: https://lore.kernel.org/r/0-v6-a196d26f289e+11787-iommufd_jgg@nvidia.com Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2022-12-01iommufd: Add a selftestJason Gunthorpe7-0/+2530
Cover the essential functionality of the iommufd with a directed test from userspace. This aims to achieve reasonable functional coverage using the in-kernel self test framework. A second test does a failure injection sweep of the success paths to study error unwind behaviors. This allows achieving high coverage of the corner cases in pages.c. The selftest requires CONFIG_IOMMUFD_TEST to be enabled, and several huge pages which may require: echo 4 > /proc/sys/vm/nr_hugepages Link: https://lore.kernel.org/r/19-v6-a196d26f289e+11787-iommufd_jgg@nvidia.com Tested-by: Nicolin Chen <nicolinc@nvidia.com> Tested-by: Matthew Rosato <mjrosato@linux.ibm.com> # s390 Tested-by: Yi Liu <yi.l.liu@intel.com> Tested-by: Eric Auger <eric.auger@redhat.com> # aarch64 Signed-off-by: Nicolin Chen <nicolinc@nvidia.com> Signed-off-by: Yi Liu <yi.l.liu@intel.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2022-12-01iommufd: Add additional invariant assertionsJason Gunthorpe4-2/+70
These are on performance paths so we protect them using the CONFIG_IOMMUFD_TEST to not take a hit during normal operation. These are useful when running the test suite and syzkaller to find data structure inconsistencies early. Link: https://lore.kernel.org/r/18-v6-a196d26f289e+11787-iommufd_jgg@nvidia.com Tested-by: Yi Liu <yi.l.liu@intel.com> Tested-by: Matthew Rosato <mjrosato@linux.ibm.com> # s390 Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2022-12-01iommufd: Add some fault injection pointsJason Gunthorpe2-0/+29
This increases the coverage the fail_nth test gets, as well as via syzkaller. Link: https://lore.kernel.org/r/17-v6-a196d26f289e+11787-iommufd_jgg@nvidia.com Tested-by: Matthew Rosato <mjrosato@linux.ibm.com> # s390 Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2022-12-01iommufd: Add kernel support for testing iommufdJason Gunthorpe10-0/+1061
Provide a mock kernel module for the iommu_domain that allows it to run without any HW and the mocking provides a way to directly validate that the PFNs loaded into the iommu_domain are correct. This exposes the access kAPI toward userspace to allow userspace to explore the functionality of pages.c and io_pagetable.c The mock also simulates the rare case of PAGE_SIZE > iommu page size as the mock will operate at a 2K iommu page size. This allows exercising all of the calculations to support this mismatch. This is also intended to support syzkaller exploring the same space. However, it is an unusually invasive config option to enable all of this. The config option should not be enabled in a production kernel. Link: https://lore.kernel.org/r/16-v6-a196d26f289e+11787-iommufd_jgg@nvidia.com Tested-by: Matthew Rosato <mjrosato@linux.ibm.com> # s390 Tested-by: Eric Auger <eric.auger@redhat.com> # aarch64 Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2022-12-01iommufd: vfio container FD ioctl compatibilityJason Gunthorpe6-6/+534
iommufd can directly implement the /dev/vfio/vfio container IOCTLs by mapping them into io_pagetable operations. A userspace application can test against iommufd and confirm compatibility then simply make a small change to open /dev/iommu instead of /dev/vfio/vfio. For testing purposes /dev/vfio/vfio can be symlinked to /dev/iommu and then all applications will use the compatibility path with no code changes. A later series allows /dev/vfio/vfio to be directly provided by iommufd, which allows the rlimit mode to work the same as well. This series just provides the iommufd side of compatibility. Actually linking this to VFIO_SET_CONTAINER is a followup series, with a link in the cover letter. Internally the compatibility API uses a normal IOAS object that, like vfio, is automatically allocated when the first device is attached. Userspace can also query or set this IOAS object directly using the IOMMU_VFIO_IOAS ioctl. This allows mixing and matching new iommufd only features while still using the VFIO style map/unmap ioctls. While this is enough to operate qemu, it has a few differences: - Resource limits rely on memory cgroups to bound what userspace can do instead of the module parameter dma_entry_limit. - VFIO P2P is not implemented. The DMABUF patches for vfio are a start at a solution where iommufd would import a special DMABUF. This is to avoid further propogating the follow_pfn() security problem. - A full audit for pedantic compatibility details (eg errnos, etc) has not yet been done - powerpc SPAPR is left out, as it is not connected to the iommu_domain framework. It seems interest in SPAPR is minimal as it is currently non-working in v6.1-rc1. They will have to convert to the iommu subsystem framework to enjoy iommfd. The following are not going to be implemented and we expect to remove them from VFIO type1: - SW access 'dirty tracking'. As discussed in the cover letter this will be done in VFIO. - VFIO_TYPE1_NESTING_IOMMU https://lore.kernel.org/all/0-v1-0093c9b0e345+19-vfio_no_nesting_jgg@nvidia.com/ - VFIO_DMA_MAP_FLAG_VADDR https://lore.kernel.org/all/Yz777bJZjTyLrHEQ@nvidia.com/ Link: https://lore.kernel.org/r/15-v6-a196d26f289e+11787-iommufd_jgg@nvidia.com Tested-by: Nicolin Chen <nicolinc@nvidia.com> Tested-by: Yi Liu <yi.l.liu@intel.com> Tested-by: Lixiao Yang <lixiao.yang@intel.com> Tested-by: Matthew Rosato <mjrosato@linux.ibm.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Eric Auger <eric.auger@redhat.com> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2022-12-01iommufd: Add kAPI toward external drivers for kernel accessJason Gunthorpe5-3/+377
Kernel access is the mode that VFIO "mdevs" use. In this case there is no struct device and no IOMMU connection. iommufd acts as a record keeper for accesses and returns the actual struct pages back to the caller to use however they need. eg with kmap or the DMA API. Each caller must create a struct iommufd_access with iommufd_access_create(), similar to how iommufd_device_bind() works. Using this struct the caller can access blocks of IOVA using iommufd_access_pin_pages() or iommufd_access_rw(). Callers must provide a callback that immediately unpins any IOVA being used within a range. This happens if userspace unmaps the IOVA under the pin. The implementation forwards the access requests directly to the iopt infrastructure that manages the iopt_pages_access. Link: https://lore.kernel.org/r/14-v6-a196d26f289e+11787-iommufd_jgg@nvidia.com Reviewed-by: Kevin Tian <kevin.tian@intel.com> Tested-by: Nicolin Chen <nicolinc@nvidia.com> Tested-by: Yi Liu <yi.l.liu@intel.com> Tested-by: Lixiao Yang <lixiao.yang@intel.com> Tested-by: Matthew Rosato <mjrosato@linux.ibm.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2022-12-01iommufd: Add kAPI toward external drivers for physical devicesJason Gunthorpe5-0/+437
Add the four functions external drivers need to connect physical DMA to the IOMMUFD: iommufd_device_bind() / iommufd_device_unbind() Register the device with iommufd and establish security isolation. iommufd_device_attach() / iommufd_device_detach() Connect a bound device to a page table Binding a device creates a device object ID in the uAPI, however the generic API does not yet provide any IOCTLs to manipulate them. Link: https://lore.kernel.org/r/13-v6-a196d26f289e+11787-iommufd_jgg@nvidia.com Reviewed-by: Kevin Tian <kevin.tian@intel.com> Tested-by: Nicolin Chen <nicolinc@nvidia.com> Tested-by: Yi Liu <yi.l.liu@intel.com> Tested-by: Lixiao Yang <lixiao.yang@intel.com> Tested-by: Matthew Rosato <mjrosato@linux.ibm.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2022-12-01iommufd: Add a HW pagetable objectJason Gunthorpe5-0/+97
The hw_pagetable object exposes the internal struct iommu_domain's to userspace. An iommu_domain is required when any DMA device attaches to an IOAS to control the io page table through the iommu driver. For compatibility with VFIO the hw_pagetable is automatically created when a DMA device is attached to the IOAS. If a compatible iommu_domain already exists then the hw_pagetable associated with it is used for the attachment. In the initial series there is no iommufd uAPI for the hw_pagetable object. The next patch provides driver facing APIs for IO page table attachment that allows drivers to accept either an IOAS or a hw_pagetable ID and for the driver to return the hw_pagetable ID that was auto-selected from an IOAS. The expectation is the driver will provide uAPI through its own FD for attaching its device to iommufd. This allows userspace to learn the mapping of devices to iommu_domains and to override the automatic attachment. The future HW specific interface will allow userspace to create hw_pagetable objects using iommu_domains with IOMMU driver specific parameters. This infrastructure will allow linking those domains to IOAS's and devices. Link: https://lore.kernel.org/r/12-v6-a196d26f289e+11787-iommufd_jgg@nvidia.com Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Eric Auger <eric.auger@redhat.com> Tested-by: Nicolin Chen <nicolinc@nvidia.com> Tested-by: Yi Liu <yi.l.liu@intel.com> Tested-by: Lixiao Yang <lixiao.yang@intel.com> Tested-by: Matthew Rosato <mjrosato@linux.ibm.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2022-12-01iommufd: IOCTLs for the io_pagetableJason Gunthorpe5-1/+731
Connect the IOAS to its IOCTL interface. This exposes most of the functionality in the io_pagetable to userspace. This is intended to be the core of the generic interface that IOMMUFD will provide. Every IOMMU driver should be able to implement an iommu_domain that is compatible with this generic mechanism. It is also designed to be easy to use for simple non virtual machine monitor users, like DPDK: - Universal simple support for all IOMMUs (no PPC special path) - An IOVA allocator that considers the aperture and the allowed/reserved ranges - io_pagetable allows any number of iommu_domains to be connected to the IOAS - Automatic allocation and re-use of iommu_domains Along with room in the design to add non-generic features to cater to specific HW functionality. Link: https://lore.kernel.org/r/11-v6-a196d26f289e+11787-iommufd_jgg@nvidia.com Tested-by: Nicolin Chen <nicolinc@nvidia.com> Tested-by: Yi Liu <yi.l.liu@intel.com> Tested-by: Lixiao Yang <lixiao.yang@intel.com> Tested-by: Matthew Rosato <mjrosato@linux.ibm.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Eric Auger <eric.auger@redhat.com> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2022-12-01iommufd: Data structure to provide IOVA to PFN mappingJason Gunthorpe5-0/+1295
This is the remainder of the IOAS data structure. Provide an object called an io_pagetable that is composed of iopt_areas pointing at iopt_pages, along with a list of iommu_domains that mirror the IOVA to PFN map. At the top this is a simple interval tree of iopt_areas indicating the map of IOVA to iopt_pages. An xarray keeps track of a list of domains. Based on the attached domains there is a minimum alignment for areas (which may be smaller than PAGE_SIZE), an interval tree of reserved IOVA that can't be mapped and an IOVA of allowed IOVA that can always be mappable. The concept of an 'access' refers to something like a VFIO mdev that is accessing the IOVA and using a 'struct page *' for CPU based access. Externally an API is provided that matches the requirements of the IOCTL interface for map/unmap and domain attachment. The API provides a 'copy' primitive to establish a new IOVA map in a different IOAS from an existing mapping by re-using the iopt_pages. This is the basic mechanism to provide single pinning. This is designed to support a pre-registration flow where userspace would setup an dummy IOAS with no domains, map in memory and then establish an access to pin all PFNs into the xarray. Copy can then be used to create new IOVA mappings in a different IOAS, with iommu_domains attached. Upon copy the PFNs will be read out of the xarray and mapped into the iommu_domains, avoiding any pin_user_pages() overheads. Link: https://lore.kernel.org/r/10-v6-a196d26f289e+11787-iommufd_jgg@nvidia.com Tested-by: Nicolin Chen <nicolinc@nvidia.com> Tested-by: Yi Liu <yi.l.liu@intel.com> Tested-by: Lixiao Yang <lixiao.yang@intel.com> Tested-by: Matthew Rosato <mjrosato@linux.ibm.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Signed-off-by: Yi Liu <yi.l.liu@intel.com> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2022-12-01iommufd: Algorithms for PFN storageJason Gunthorpe2-0/+917
The iopt_pages which represents a logical linear list of full PFNs held in different storage tiers. Each area points to a slice of exactly one iopt_pages, and each iopt_pages can have multiple areas and accesses. The three storage tiers are managed to meet these objectives: - If no iommu_domain or in-kerenel access exists then minimal memory should be consumed by iomufd - If a page has been pinned then an iopt_pages will not pin it again - If an in-kernel access exists then the xarray must provide the backing storage to avoid allocations on domain removals - Otherwise any iommu_domain will be used for storage In a common configuration with only an iommu_domain the iopt_pages does not allocate significant memory itself. The external interface for pages has several logical operations: iopt_area_fill_domain() will load the PFNs from storage into a single domain. This is used when attaching a new domain to an existing IOAS. iopt_area_fill_domains() will load the PFNs from storage into multiple domains. This is used when creating a new IOVA map in an existing IOAS iopt_pages_add_access() creates an iopt_pages_access that tracks an in-kernel access of PFNs. This is some external driver that might be accessing the IOVA using the CPU, or programming PFNs with the DMA API. ie a VFIO mdev. iopt_pages_rw_access() directly perform a memcpy on the PFNs, without the overhead of iopt_pages_add_access() iopt_pages_fill_xarray() will load PFNs into the xarray and return a 'struct page *' array. It is used by iopt_pages_access's to extract PFNs for in-kernel use. iopt_pages_fill_from_xarray() is a fast path when it is known the xarray is already filled. As an iopt_pages can be referred to in slices by many areas and accesses it uses interval trees to keep track of which storage tiers currently hold the PFNs. On a page-by-page basis any request for a PFN will be satisfied from one of the storage tiers and the PFN copied to target domain/array. Unfill actions are similar, on a page by page basis domains are unmapped, xarray entries freed or struct pages fully put back. Significant complexity is required to fully optimize all of these data motions. The implementation calculates the largest consecutive range of same-storage indexes and operates in blocks. The accumulation of PFNs always generates the largest contiguous PFN range possible to optimize and this gathering can cross storage tier boundaries. For cases like 'fill domains' care is taken to avoid duplicated work and PFNs are read once and pushed into all domains. The map/unmap interaction with the iommu_domain always works in contiguous PFN blocks. The implementation does not require or benefit from any split/merge optimization in the iommu_domain driver. This design suggests several possible improvements in the IOMMU API that would greatly help performance, particularly a way for the driver to map and read the pfns lists instead of working with one driver call per page to read, and one driver call per contiguous range to store. Link: https://lore.kernel.org/r/9-v6-a196d26f289e+11787-iommufd_jgg@nvidia.com Reviewed-by: Kevin Tian <kevin.tian@intel.com> Tested-by: Nicolin Chen <nicolinc@nvidia.com> Tested-by: Yi Liu <yi.l.liu@intel.com> Tested-by: Lixiao Yang <lixiao.yang@intel.com> Tested-by: Matthew Rosato <mjrosato@linux.ibm.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2022-12-01iommufd: PFN handling for iopt_pagesJason Gunthorpe7-1/+1262
The top of the data structure provides an IO Address Space (IOAS) that is similar to a VFIO container. The IOAS allows map/unmap of memory into ranges of IOVA called iopt_areas. Multiple IOMMU domains (IO page tables) and in-kernel accesses (like VFIO mdevs) can be attached to the IOAS to access the PFNs that those IOVA areas cover. The IO Address Space (IOAS) datastructure is composed of: - struct io_pagetable holding the IOVA map - struct iopt_areas representing populated portions of IOVA - struct iopt_pages representing the storage of PFNs - struct iommu_domain representing each IO page table in the system IOMMU - struct iopt_pages_access representing in-kernel accesses of PFNs (ie VFIO mdevs) - struct xarray pinned_pfns holding a list of pages pinned by in-kernel accesses This patch introduces the lowest part of the datastructure - the movement of PFNs in a tiered storage scheme: 1) iopt_pages::pinned_pfns xarray 2) Multiple iommu_domains 3) The origin of the PFNs, i.e. the userspace pointer PFN have to be copied between all combinations of tiers, depending on the configuration. The interface is an iterator called a 'pfn_reader' which determines which tier each PFN is stored and loads it into a list of PFNs held in a struct pfn_batch. Each step of the iterator will fill up the pfn_batch, then the caller can use the pfn_batch to send the PFNs to the required destination. Repeating this loop will read all the PFNs in an IOVA range. The pfn_reader and pfn_batch also keep track of the pinned page accounting. While PFNs are always stored and accessed as full PAGE_SIZE units the iommu_domain tier can store with a sub-page offset/length to support IOMMUs with a smaller IOPTE size than PAGE_SIZE. Link: https://lore.kernel.org/r/8-v6-a196d26f289e+11787-iommufd_jgg@nvidia.com Reviewed-by: Kevin Tian <kevin.tian@intel.com> Tested-by: Nicolin Chen <nicolinc@nvidia.com> Tested-by: Yi Liu <yi.l.liu@intel.com> Tested-by: Lixiao Yang <lixiao.yang@intel.com> Tested-by: Matthew Rosato <mjrosato@linux.ibm.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2022-12-01kernel/user: Allow user_struct::locked_vm to be usable for iommufdJason Gunthorpe2-1/+2
Following the pattern of io_uring, perf, skb, and bpf, iommfd will use user->locked_vm for accounting pinned pages. Ensure the value is included in the struct and export free_uid() as iommufd is modular. user->locked_vm is the good accounting to use for ulimit because it is per-user, and the security sandboxing of locked pages is not supposed to be per-process. Other places (vfio, vdpa and infiniband) have used mm->pinned_vm and/or mm->locked_vm for accounting pinned pages, but this is only per-process and inconsistent with the new FOLL_LONGTERM users in the kernel. Concurrent work is underway to try to put this in a cgroup, so everything can be consistent and the kernel can provide a FOLL_LONGTERM limit that actually provides security. Link: https://lore.kernel.org/r/7-v6-a196d26f289e+11787-iommufd_jgg@nvidia.com Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Eric Auger <eric.auger@redhat.com> Tested-by: Nicolin Chen <nicolinc@nvidia.com> Tested-by: Yi Liu <yi.l.liu@intel.com> Tested-by: Lixiao Yang <lixiao.yang@intel.com> Tested-by: Matthew Rosato <mjrosato@linux.ibm.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2022-12-01iommufd: File descriptor, context, kconfig and makefilesJason Gunthorpe10-1/+571
This is the basic infrastructure of a new miscdevice to hold the iommufd IOCTL API. It provides: - A miscdevice to create file descriptors to run the IOCTL interface over - A table based ioctl dispatch and centralized extendable pre-validation step - An xarray mapping userspace ID's to kernel objects. The design has multiple inter-related objects held within in a single IOMMUFD fd - A simple usage count to build a graph of object relations and protect against hostile userspace racing ioctls The only IOCTL provided in this patch is the generic 'destroy any object by handle' operation. Link: https://lore.kernel.org/r/6-v6-a196d26f289e+11787-iommufd_jgg@nvidia.com Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Eric Auger <eric.auger@redhat.com> Tested-by: Nicolin Chen <nicolinc@nvidia.com> Tested-by: Yi Liu <yi.l.liu@intel.com> Tested-by: Lixiao Yang <lixiao.yang@intel.com> Tested-by: Matthew Rosato <mjrosato@linux.ibm.com> Signed-off-by: Yi Liu <yi.l.liu@intel.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2022-12-01iommufd: Document overview of iommufdKevin Tian2-0/+224
Add iommufd into the documentation tree, and supply initial documentation. Much of this is linked from code comments by kdoc. Link: https://lore.kernel.org/r/5-v6-a196d26f289e+11787-iommufd_jgg@nvidia.com Reviewed-by: Bagas Sanjaya <bagasdotme@gmail.com> Reviewed-by: Eric Auger <eric.auger@redhat.com> Signed-off-by: Kevin Tian <kevin.tian@intel.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2022-11-29scripts/kernel-doc: support EXPORT_SYMBOL_NS_GPL() with -exportJason Gunthorpe1-4/+8
Parse EXPORT_SYMBOL_NS_GPL() in addition to EXPORT_SYMBOL_GPL() for use with the -export flag. Link: https://lore.kernel.org/r/4-v6-a196d26f289e+11787-iommufd_jgg@nvidia.com Acked-by: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2022-11-29interval-tree: Add a utility to iterate over spans in an interval treeJason Gunthorpe4-0/+195
The span iterator travels over the indexes of the interval_tree, not the nodes, and classifies spans of indexes as either 'used' or 'hole'. 'used' spans are fully covered by nodes in the tree and 'hole' spans have no node intersecting the span. This is done greedily such that spans are maximally sized and every iteration step switches between used/hole. As an example a trivial allocator can be written as: for (interval_tree_span_iter_first(&span, itree, 0, ULONG_MAX); !interval_tree_span_iter_done(&span); interval_tree_span_iter_next(&span)) if (span.is_hole && span.last_hole - span.start_hole >= allocation_size - 1) return span.start_hole; With all the tricky boundary conditions handled by the library code. The following iommufd patches have several algorithms for its overlapping node interval trees that are significantly simplified with this kind of iteration primitive. As it seems generally useful, put it into lib/. Link: https://lore.kernel.org/r/3-v6-a196d26f289e+11787-iommufd_jgg@nvidia.com Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Eric Auger <eric.auger@redhat.com> Tested-by: Nicolin Chen <nicolinc@nvidia.com> Tested-by: Yi Liu <yi.l.liu@intel.com> Tested-by: Lixiao Yang <lixiao.yang@intel.com> Tested-by: Matthew Rosato <mjrosato@linux.ibm.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2022-11-29iommu: Add device-centric DMA ownership interfacesLu Baolu2-26/+107
These complement the group interfaces used by VFIO and are for use by iommufd. The main difference is that multiple devices in the same group can all share the ownership by passing the same ownership pointer. Move the common code into shared functions. Link: https://lore.kernel.org/r/2-v6-a196d26f289e+11787-iommufd_jgg@nvidia.com Tested-by: Nicolin Chen <nicolinc@nvidia.com> Tested-by: Yi Liu <yi.l.liu@intel.com> Tested-by: Lixiao Yang <lixiao.yang@intel.com> Tested-by: Matthew Rosato <mjrosato@linux.ibm.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Eric Auger <eric.auger@redhat.com> Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2022-11-29iommu: Add IOMMU_CAP_ENFORCE_CACHE_COHERENCYJason Gunthorpe3-5/+18
This queries if a domain linked to a device should expect to support enforce_cache_coherency() so iommufd can negotiate the rules for when a domain should be shared or not. For iommufd a device that declares IOMMU_CAP_ENFORCE_CACHE_COHERENCY will not be attached to a domain that does not support it. Link: https://lore.kernel.org/r/1-v6-a196d26f289e+11787-iommufd_jgg@nvidia.com Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Tested-by: Nicolin Chen <nicolinc@nvidia.com> Tested-by: Yi Liu <yi.l.liu@intel.com> Tested-by: Lixiao Yang <lixiao.yang@intel.com> Tested-by: Matthew Rosato <mjrosato@linux.ibm.com> Tested-by: Yu He <yu.he@intel.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2022-11-03Merge tag 'for-joerg' of ↵Joerg Roedel16-56/+60
git://git.kernel.org/pub/scm/linux/kernel/git/jgg/iommufd into core iommu: Define EINVAL as device/domain incompatibility This series is to replace the previous EMEDIUMTYPE patch in a VFIO series: https://lore.kernel.org/kvm/Yxnt9uQTmbqul5lf@8bytes.org/ The purpose is to regulate all existing ->attach_dev callback functions to use EINVAL exclusively for an incompatibility error between a device and a domain. This allows VFIO and IOMMUFD to detect such a soft error, and then try a different domain with the same device. Among all the patches, the first two are preparatory changes. And then one patch to update kdocs and another three patches for the enforcement effort. Link: https://lore.kernel.org/r/cover.1666042872.git.nicolinc@nvidia.com
2022-11-03iommu: Rename iommu-sva-lib.{c,h}Lu Baolu9-11/+11
Rename iommu-sva-lib.c[h] to iommu-sva.c[h] as it contains all code for SVA implementation in iommu core. Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com> Reviewed-by: Jean-Philippe Brucker <jean-philippe@linaro.org> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Tested-by: Zhangfei Gao <zhangfei.gao@linaro.org> Tested-by: Tony Zhu <tony.zhu@intel.com> Link: https://lore.kernel.org/r/20221031005917.45690-14-baolu.lu@linux.intel.com Signed-off-by: Joerg Roedel <jroedel@suse.de>
2022-11-03iommu: Per-domain I/O page fault handlingLu Baolu1-59/+9
Tweak the I/O page fault handling framework to route the page faults to the domain and call the page fault handler retrieved from the domain. This makes the I/O page fault handling framework possible to serve more usage scenarios as long as they have an IOMMU domain and install a page fault handler in it. Some unused functions are also removed to avoid dead code. The iommu_get_domain_for_dev_pasid() which retrieves attached domain for a {device, PASID} pair is used. It will be used by the page fault handling framework which knows {device, PASID} reported from the iommu driver. We have a guarantee that the SVA domain doesn't go away during IOPF handling, because unbind() won't free the domain until all the pending page requests have been flushed from the pipeline. The drivers either call iopf_queue_flush_dev() explicitly, or in stall case, the device driver is required to flush all DMAs including stalled transactions before calling unbind(). This also renames iopf_handle_group() to iopf_handler() to avoid confusing. Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com> Reviewed-by: Jean-Philippe Brucker <jean-philippe@linaro.org> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Tested-by: Zhangfei Gao <zhangfei.gao@linaro.org> Tested-by: Tony Zhu <tony.zhu@intel.com> Link: https://lore.kernel.org/r/20221031005917.45690-13-baolu.lu@linux.intel.com Signed-off-by: Joerg Roedel <jroedel@suse.de>
2022-11-03iommu: Prepare IOMMU domain for IOPFLu Baolu5-0/+80
This adds some mechanisms around the iommu_domain so that the I/O page fault handling framework could route a page fault to the domain and call the fault handler from it. Add pointers to the page fault handler and its private data in struct iommu_domain. The fault handler will be called with the private data as a parameter once a page fault is routed to the domain. Any kernel component which owns an iommu domain could install handler and its private parameter so that the page fault could be further routed and handled. This also prepares the SVA implementation to be the first consumer of the per-domain page fault handling model. The I/O page fault handler for SVA is copied to the SVA file with mmget_not_zero() added before mmap_read_lock(). Suggested-by: Jean-Philippe Brucker <jean-philippe@linaro.org> Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com> Reviewed-by: Jean-Philippe Brucker <jean-philippe@linaro.org> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Tested-by: Zhangfei Gao <zhangfei.gao@linaro.org> Tested-by: Tony Zhu <tony.zhu@intel.com> Link: https://lore.kernel.org/r/20221031005917.45690-12-baolu.lu@linux.intel.com Signed-off-by: Joerg Roedel <jroedel@suse.de>
2022-11-03iommu: Remove SVA related callbacks from iommu opsLu Baolu7-121/+0
These ops'es have been deprecated. There's no need for them anymore. Remove them to avoid dead code. Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com> Reviewed-by: Jean-Philippe Brucker <jean-philippe@linaro.org> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Yi Liu <yi.l.liu@intel.com> Tested-by: Zhangfei Gao <zhangfei.gao@linaro.org> Tested-by: Tony Zhu <tony.zhu@intel.com> Link: https://lore.kernel.org/r/20221031005917.45690-11-baolu.lu@linux.intel.com Signed-off-by: Joerg Roedel <jroedel@suse.de>
2022-11-03iommu/sva: Refactoring iommu_sva_bind/unbind_device()Lu Baolu3-111/+134
The existing iommu SVA interfaces are implemented by calling the SVA specific iommu ops provided by the IOMMU drivers. There's no need for any SVA specific ops in iommu_ops vector anymore as we can achieve this through the generic attach/detach_dev_pasid domain ops. This refactors the IOMMU SVA interfaces implementation by using the iommu_attach/detach_device_pasid interfaces and align them with the concept of the SVA iommu domain. Put the new SVA code in the SVA related file in order to make it self-contained. Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com> Reviewed-by: Jean-Philippe Brucker <jean-philippe@linaro.org> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Tested-by: Zhangfei Gao <zhangfei.gao@linaro.org> Tested-by: Tony Zhu <tony.zhu@intel.com> Link: https://lore.kernel.org/r/20221031005917.45690-10-baolu.lu@linux.intel.com Signed-off-by: Joerg Roedel <jroedel@suse.de>
2022-11-03arm-smmu-v3/sva: Add SVA domain supportLu Baolu3-0/+90
Add support for SVA domain allocation and provide an SVA-specific iommu_domain_ops. This implementation is based on the existing SVA code. Possible cleanup and refactoring are left for incremental changes later. Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com> Reviewed-by: Jean-Philippe Brucker <jean-philippe@linaro.org> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Tested-by: Zhangfei Gao <zhangfei.gao@linaro.org> Link: https://lore.kernel.org/r/20221031005917.45690-9-baolu.lu@linux.intel.com Signed-off-by: Joerg Roedel <jroedel@suse.de>
2022-11-03iommu/vt-d: Add SVA domain supportLu Baolu3-0/+82
Add support for SVA domain allocation and provide an SVA-specific iommu_domain_ops. This implementation is based on the existing SVA code. Possible cleanup and refactoring are left for incremental changes later. The VT-d driver will also need to support setting a DMA domain to a PASID of device. Current SVA implementation uses different data structures to track the domain and device PASID relationship. That's the reason why we need to check the domain type in remove_dev_pasid callback. Eventually we'll consolidate the data structures and remove the need of domain type check. Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Yi Liu <yi.l.liu@intel.com> Tested-by: Tony Zhu <tony.zhu@intel.com> Link: https://lore.kernel.org/r/20221031005917.45690-8-baolu.lu@linux.intel.com Signed-off-by: Joerg Roedel <jroedel@suse.de>
2022-11-03iommu: Add IOMMU SVA domain supportLu Baolu2-2/+43
The SVA iommu_domain represents a hardware pagetable that the IOMMU hardware could use for SVA translation. This adds some infrastructures to support SVA domain in the iommu core. It includes: - Extend the iommu_domain to support a new IOMMU_DOMAIN_SVA domain type. The IOMMU drivers that support allocation of the SVA domain should provide its own SVA domain specific iommu_domain_ops. - Add a helper to allocate an SVA domain. The iommu_domain_free() is still used to free an SVA domain. The report_iommu_fault() should be replaced by the new iommu_report_device_fault(). Leave the existing fault handler with the existing users and the newly added SVA members excludes it. Suggested-by: Jean-Philippe Brucker <jean-philippe@linaro.org> Suggested-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com> Reviewed-by: Jean-Philippe Brucker <jean-philippe@linaro.org> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Yi Liu <yi.l.liu@intel.com> Tested-by: Zhangfei Gao <zhangfei.gao@linaro.org> Tested-by: Tony Zhu <tony.zhu@intel.com> Link: https://lore.kernel.org/r/20221031005917.45690-7-baolu.lu@linux.intel.com Signed-off-by: Joerg Roedel <jroedel@suse.de>
2022-11-03iommu: Add attach/detach_dev_pasid iommu interfacesLu Baolu2-4/+169
Attaching an IOMMU domain to a PASID of a device is a generic operation for modern IOMMU drivers which support PASID-granular DMA address translation. Currently visible usage scenarios include (but not limited): - SVA (Shared Virtual Address) - kernel DMA with PASID - hardware-assist mediated device This adds the set_dev_pasid domain ops for setting the domain onto a PASID of a device and remove_dev_pasid iommu ops for removing any setup on a PASID of device. This also adds interfaces for device drivers to attach/detach/retrieve a domain for a PASID of a device. If multiple devices share a single group, it's fine as long the fabric always routes every TLP marked with a PASID to the host bridge and only the host bridge. For example, ACS achieves this universally and has been checked when pci_enable_pasid() is called. As we can't reliably tell the source apart in a group, all the devices in a group have to be considered as the same source, and mapped to the same PASID table. The DMA ownership is about the whole device (more precisely, iommu group), including the RID and PASIDs. When the ownership is converted, the pasid array must be empty. This also adds necessary checks in the DMA ownership interfaces. Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com> Reviewed-by: Jean-Philippe Brucker <jean-philippe@linaro.org> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Yi Liu <yi.l.liu@intel.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Tested-by: Zhangfei Gao <zhangfei.gao@linaro.org> Tested-by: Tony Zhu <tony.zhu@intel.com> Link: https://lore.kernel.org/r/20221031005917.45690-6-baolu.lu@linux.intel.com Signed-off-by: Joerg Roedel <jroedel@suse.de>
2022-11-03PCI: Enable PASID only when ACS RR & UF enabled on upstream pathLu Baolu1-0/+3
The Requester ID/Process Address Space ID (PASID) combination identifies an address space distinct from the PCI bus address space, e.g., an address space defined by an IOMMU. But the PCIe fabric routes Memory Requests based on the TLP address, ignoring any PASID (PCIe r6.0, sec 2.2.10.4), so a TLP with PASID that SHOULD go upstream to the IOMMU may instead be routed as a P2P Request if its address falls in a bridge window. To ensure that all Memory Requests with PASID are routed upstream, only enable PASID if ACS P2P Request Redirect and Upstream Forwarding are enabled for the path leading to the device. Suggested-by: Jason Gunthorpe <jgg@nvidia.com> Suggested-by: Kevin Tian <kevin.tian@intel.com> Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com> Acked-by: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Tested-by: Tony Zhu <tony.zhu@intel.com> Link: https://lore.kernel.org/r/20221031005917.45690-5-baolu.lu@linux.intel.com Signed-off-by: Joerg Roedel <jroedel@suse.de>
2022-11-03iommu: Remove SVM_FLAG_SUPERVISOR_MODE supportLu Baolu10-97/+25
The current kernel DMA with PASID support is based on the SVA with a flag SVM_FLAG_SUPERVISOR_MODE. The IOMMU driver binds the kernel memory address space to a PASID of the device. The device driver programs the device with kernel virtual address (KVA) for DMA access. There have been security and functional issues with this approach: - The lack of IOTLB synchronization upon kernel page table updates. (vmalloc, module/BPF loading, CONFIG_DEBUG_PAGEALLOC etc.) - Other than slight more protection, using kernel virtual address (KVA) has little advantage over physical address. There are also no use cases yet where DMA engines need kernel virtual addresses for in-kernel DMA. This removes SVM_FLAG_SUPERVISOR_MODE support from the IOMMU interface. The device drivers are suggested to handle kernel DMA with PASID through the kernel DMA APIs. The drvdata parameter in iommu_sva_bind_device() and all callbacks is not needed anymore. Cleanup them as well. Link: https://lore.kernel.org/linux-iommu/20210511194726.GP1002214@nvidia.com/ Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com> Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Jean-Philippe Brucker <jean-philippe@linaro.org> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Fenghua Yu <fenghua.yu@intel.com> Tested-by: Zhangfei Gao <zhangfei.gao@linaro.org> Tested-by: Tony Zhu <tony.zhu@intel.com> Link: https://lore.kernel.org/r/20221031005917.45690-4-baolu.lu@linux.intel.com Signed-off-by: Joerg Roedel <jroedel@suse.de>
2022-11-03iommu: Add max_pasids field in struct dev_iommuLu Baolu2-0/+22
Use this field to save the number of PASIDs that a device is able to consume. It is a generic attribute of a device and lifting it into the per-device dev_iommu struct could help to avoid the boilerplate code in various IOMMU drivers. Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Yi Liu <yi.l.liu@intel.com> Tested-by: Zhangfei Gao <zhangfei.gao@linaro.org> Tested-by: Tony Zhu <tony.zhu@intel.com> Link: https://lore.kernel.org/r/20221031005917.45690-3-baolu.lu@linux.intel.com Signed-off-by: Joerg Roedel <jroedel@suse.de>
2022-11-03iommu: Add max_pasids field in struct iommu_deviceLu Baolu4-2/+12
Use this field to keep the number of supported PASIDs that an IOMMU hardware is able to support. This is a generic attribute of an IOMMU and lifting it into the per-IOMMU device structure makes it possible to allocate a PASID for device without calls into the IOMMU drivers. Any iommu driver that supports PASID related features should set this field before enabling them on the devices. In the Intel IOMMU driver, intel_iommu_sm is moved to CONFIG_INTEL_IOMMU enclave so that the pasid_supported() helper could be used in dmar.c without compilation errors. Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com> Reviewed-by: Jean-Philippe Brucker <jean-philippe@linaro.org> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Yi Liu <yi.l.liu@intel.com> Tested-by: Zhangfei Gao <zhangfei.gao@linaro.org> Tested-by: Tony Zhu <tony.zhu@intel.com> Link: https://lore.kernel.org/r/20221031005917.45690-2-baolu.lu@linux.intel.com Signed-off-by: Joerg Roedel <jroedel@suse.de>
2022-11-01iommu: Propagate return value in ->attach_dev callback functionsNicolin Chen2-2/+2
The mtk_iommu and virtio drivers have places in the ->attach_dev callback functions that return hardcode errnos instead of the returned values, but callers of these ->attach_dv callback functions may care. Propagate them directly without the extra conversions. Link: https://lore.kernel.org/r/ca8c5a447b87002334f83325f28823008b4ce420.1666042873.git.nicolinc@nvidia.com Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Jean-Philippe Brucker <jean-philippe@linaro.org> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Yong Wu <yong.wu@mediatek.com> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2022-11-01iommu: Use EINVAL for incompatible device/domain in ->attach_devNicolin Chen9-35/+9
Following the new rules in include/linux/iommu.h kdocs, update all drivers ->attach_dev callback functions to return EINVAL in the failure paths that are related to domain incompatibility. Also, drop adjacent error prints to prevent a kernel log spam. Link: https://lore.kernel.org/r/f52a07f7320da94afe575c9631340d0019a203a7.1666042873.git.nicolinc@nvidia.com Reviewed-by: Jean-Philippe Brucker <jean-philippe@linaro.org> Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2022-11-01iommu: Regulate EINVAL in ->attach_dev callback functionsNicolin Chen6-9/+11
Following the new rules in include/linux/iommu.h kdocs, EINVAL now can be used to indicate that domain and device are incompatible by a caller that treats it as a soft failure and tries attaching to another domain. On the other hand, there are ->attach_dev callback functions returning it for obvious device-specific errors. They will result in some inefficiency in the caller handling routine. Update these places to corresponding errnos following the new rules. Link: https://lore.kernel.org/r/5924c03bea637f05feb2a20d624bae086b555ec5.1666042872.git.nicolinc@nvidia.com Reviewed-by: Jean-Philippe Brucker <jean-philippe@linaro.org> Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2022-11-01iommu: Add return value rules to attach_dev op and APIsNicolin Chen2-0/+36
Cases like VFIO wish to attach a device to an existing domain that was not allocated specifically from the device. This raises a condition where the IOMMU driver can fail the domain attach because the domain and device are incompatible with each other. This is a soft failure that can be resolved by using a different domain. Provide a dedicated errno EINVAL from the IOMMU driver during attach that the reason why the attach failed is because of domain incompatibility. VFIO can use this to know that the attach is a soft failure and it should continue searching. Otherwise, the attach will be a hard failure and VFIO will return the code to userspace. Update kdocs to add rules of return value to the attach_dev op and APIs. Link: https://lore.kernel.org/r/bd56d93c18621104a0fa1b0de31e9b760b81b769.1666042872.git.nicolinc@nvidia.com Suggested-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com> Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2022-11-01iommu/amd: Drop unnecessary checks in amd_iommu_attach_device()Nicolin Chen1-10/+2
The same checks are done in amd_iommu_probe_device(). If any of them fails there, then the device won't get a group, so there's no way for it to even reach amd_iommu_attach_device anymore. Link: https://lore.kernel.org/r/c054654a81f2b675c73108fe4bf10e45335a721a.1666042872.git.nicolinc@nvidia.com Suggested-by: Robin Murphy <robin.murphy@arm.com> Cc: Joerg Roedel <joro@8bytes.org> Cc: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com> Reviewed-by: Vasant Hegde <vasant.hegde@amd.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2022-10-31Linux 6.1-rc3v6.1-rc3Linus Torvalds1-1/+1
2022-10-30Merge tag 'fbdev-for-6.1-rc3' of ↵Linus Torvalds11-38/+47
git://git.kernel.org/pub/scm/linux/kernel/git/deller/linux-fbdev Pull fbdev fixes from Helge Deller: "A use-after-free bugfix in the smscufx driver and various minor error path fixes, smaller build fixes, sysfs fixes and typos in comments in the stifb, sisfb, da8xxfb, xilinxfb, sm501fb, gbefb and cyber2000fb drivers" * tag 'fbdev-for-6.1-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/linux-fbdev: fbdev: cyber2000fb: fix missing pci_disable_device() fbdev: sisfb: use explicitly signed char fbdev: smscufx: Fix several use-after-free bugs fbdev: xilinxfb: Make xilinxfb_release() return void fbdev: sisfb: fix repeated word in comment fbdev: gbefb: Convert sysfs snprintf to sysfs_emit fbdev: sm501fb: Convert sysfs snprintf to sysfs_emit fbdev: stifb: Fall back to cfb_fillrect() on 32-bit HCRX cards fbdev: da8xx-fb: Fix error handling in .remove() fbdev: MIPS supports iomem addresses
2022-10-30Merge tag 'char-misc-6.1-rc3' of ↵Linus Torvalds17-91/+175
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc Pull char/misc fixes from Greg KH: "Some small driver fixes for 6.1-rc3. They include: - iio driver bugfixes - counter driver bugfixes - coresight bugfixes, including a revert and then a second fix to get it right. All of these have been in linux-next with no reported problems" * tag 'char-misc-6.1-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: (21 commits) misc: sgi-gru: use explicitly signed char coresight: cti: Fix hang in cti_disable_hw() Revert "coresight: cti: Fix hang in cti_disable_hw()" counter: 104-quad-8: Fix race getting function mode and direction counter: microchip-tcb-capture: Handle Signal1 read and Synapse coresight: cti: Fix hang in cti_disable_hw() coresight: Fix possible deadlock with lock dependency counter: ti-ecap-capture: fix IS_ERR() vs NULL check counter: Reduce DEFINE_COUNTER_ARRAY_POLARITY() to defining counter_array iio: bmc150-accel-core: Fix unsafe buffer attributes iio: adxl367: Fix unsafe buffer attributes iio: adxl372: Fix unsafe buffer attributes iio: at91-sama5d2_adc: Fix unsafe buffer attributes iio: temperature: ltc2983: allocate iio channels once tools: iio: iio_utils: fix digit calculation iio: adc: stm32-adc: fix channel sampling time init iio: adc: mcp3911: mask out device ID in debug prints iio: adc: mcp3911: use correct id bits iio: adc: mcp3911: return proper error code on failure to allocate trigger iio: adc: mcp3911: fix sizeof() vs ARRAY_SIZE() bug ...
2022-10-30Merge tag 'usb-6.1-rc3' of ↵Linus Torvalds15-133/+169
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb Pull USB fixes from Greg KH: "A few small USB fixes for 6.1-rc3. Include in here are: - MAINTAINERS update, including a big one for the USB gadget subsystem. Many thanks to Felipe for all of the years of hard work he has done on this codebase, it was greatly appreciated. - dwc3 driver fixes for reported problems. - xhci driver fixes for reported problems. - typec driver fixes for minor issues - uvc gadget driver change, and then revert as it wasn't relevant for 6.1-final, as it is a new feature and people are still reviewing and modifying it. All of these have been in the linux-next tree with no reported issues" * tag 'usb-6.1-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb: usb: dwc3: gadget: Don't set IMI for no_interrupt usb: dwc3: gadget: Stop processing more requests on IMI Revert "usb: gadget: uvc: limit isoc_sg to super speed gadgets" xhci: Remove device endpoints from bandwidth list when freeing the device xhci-pci: Set runtime PM as default policy on all xHC 1.2 or later devices xhci: Add quirk to reset host back to default state at shutdown usb: xhci: add XHCI_SPURIOUS_SUCCESS to ASM1042 despite being a V0.96 controller usb: dwc3: st: Rely on child's compatible instead of name usb: gadget: uvc: limit isoc_sg to super speed gadgets usb: bdc: change state when port disconnected usb: typec: ucsi: acpi: Implement resume callback usb: typec: ucsi: Check the connection on resume usb: gadget: aspeed: Fix probe regression usb: gadget: uvc: fix sg handling during video encode usb: gadget: uvc: fix sg handling in error case usb: gadget: uvc: fix dropped frame after missed isoc usb: dwc3: gadget: Don't delay End Transfer on delayed_status usb: dwc3: Don't switch OTG -> peripheral if extcon is present MAINTAINERS: Update maintainers for broadcom USB MAINTAINERS: move USB gadget and phy entries under the main USB entry