path: root/drivers/iommu/intel
Age | Commit message | Author | Files | Lines

20 hours | iommu/vt-d: Fix __domain_mapping()'s usage of switch_to_super_page() | Eugene Koira | 1 | -1/+6

commit dce043c07ca1ac19cfbe2844a6dc71e35c322353 upstream.

switch_to_super_page() assumes the memory range it's working on is aligned to the target large page level. Unfortunately, __domain_mapping() doesn't take this into account when using it, and will pass unaligned ranges ultimately freeing a PTE range larger than expected.

Take for example a mapping with the following iov_pfn range [0x3fe400, 0x4c0600), which should be backed by the following mappings:

   iov_pfn [0x3fe400, 0x3fffff] covered by 2MiB pages
   iov_pfn [0x400000, 0x4bffff] covered by 1GiB pages
   iov_pfn [0x4c0000, 0x4c05ff] covered by 2MiB pages

Under this circumstance, __domain_mapping() will pass [0x400000, 0x4c05ff] to switch_to_super_page() at a 1 GiB granularity, which will in turn free PTEs all the way to iov_pfn 0x4fffff.

Mitigate this by rounding down the iov_pfn range passed to switch_to_super_page() in __domain_mapping() to the target large page level. Additionally add range alignment checks to switch_to_super_page.

Fixes: 9906b9352a35 ("iommu/vt-d: Avoid duplicate removing in __domain_mapping()")
Signed-off-by: Eugene Koira <eugkoira@amazon.com>
Cc: stable@vger.kernel.org
Reviewed-by: Nicolas Saenz Julienne <nsaenz@amazon.com>
Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
Link: https://lore.kernel.org/r/20250826143816.38686-1-eugkoira@amazon.com
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

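A rough sketch of the alignment logic this fix describes, assuming 4KiB base pages and the VT-d 9-bit-per-level page-table stride; lvl_pages() and trim_to_level() are illustrative names, not the driver's actual helpers:

    #define VTD_LEVEL_STRIDE 9      /* 512 entries per table level */

    /* pages covered by one entry at 'level' (level 2 = 2MiB, level 3 = 1GiB) */
    static unsigned long lvl_pages(int level)
    {
            return 1UL << ((level - 1) * VTD_LEVEL_STRIDE);
    }

    /* Trim an inclusive [start_pfn, end_pfn] range inward so both ends sit
     * on 'level'-sized boundaries before the range is handed to
     * switch_to_super_page(). */
    static void trim_to_level(unsigned long *start_pfn,
                              unsigned long *end_pfn, int level)
    {
            unsigned long align = lvl_pages(level);

            *start_pfn = ALIGN(*start_pfn, align);           /* round start up */
            *end_pfn = ALIGN_DOWN(*end_pfn + 1, align) - 1;  /* round end down */
    }

Applied at the 1GiB level to the example above, [0x400000, 0x4c05ff] becomes [0x400000, 0x4bffff], so no PTEs beyond the 1GiB-aligned region are touched.
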
7 days | iommu/vt-d: Make iotlb_sync_map a static property of dmar_domain | Lu Baolu | 1 | -14/+29

[ Upstream commit cee686775f9cd4eae31f3c1f7ec24b2048082667 ]

Commit 12724ce3fe1a ("iommu/vt-d: Optimize iotlb_sync_map for non-caching/non-RWBF modes") dynamically set iotlb_sync_map. This causes synchronization issues due to lack of locking on map and attach paths, racing iommufd userspace operations.

Invalidation changes must precede device attachment to ensure all flushes complete before hardware walks page tables, preventing coherence issues.

Make domain->iotlb_sync_map static, set once during domain allocation. If an IOMMU requires iotlb_sync_map but the domain lacks it, attach is rejected. This won't reduce domain sharing: RWBF and shadowing page table caching are legacy uses with legacy hardware. Mixed configs (some IOMMUs in caching mode, others not) are unlikely in real-world scenarios.

Fixes: 12724ce3fe1a ("iommu/vt-d: Optimize iotlb_sync_map for non-caching/non-RWBF modes")
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Link: https://lore.kernel.org/r/20250721051657.1695788-1-baolu.lu@linux.intel.com
Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>

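A minimal sketch of the attach-time compatibility check described here, assuming a new dmar_domain iotlb_sync_map flag and the existing cap_caching_mode()/cap_rwbf() capability helpers:

    /* Does this IOMMU need map-time flushes? (caching mode or RWBF) */
    static bool iommu_requires_iotlb_sync_map(struct intel_iommu *iommu)
    {
            return cap_caching_mode(iommu->cap) || cap_rwbf(iommu->cap);
    }

    /* At attach: a domain allocated without iotlb_sync_map cannot be used
     * on an IOMMU that requires it, so the attach is rejected instead of
     * mutating the (now immutable) domain property. */
    static int paging_domain_compat_sync_map(struct dmar_domain *domain,
                                             struct intel_iommu *iommu)
    {
            if (iommu_requires_iotlb_sync_map(iommu) && !domain->iotlb_sync_map)
                    return -EINVAL;
            return 0;
    }
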
7 days | iommu/vt-d: Split paging_domain_compatible() | Jason Gunthorpe | 1 | -12/+54

[ Upstream commit 85cfaacc99377a63e47412eeef66eff77197acea ]

Make First/Second stage specific functions that follow the same pattern in intel_iommu_domain_alloc_first/second_stage() for computing EOPNOTSUPP. This makes the code easier to understand as if we couldn't create a domain with the parameters for this IOMMU instance then we certainly are not compatible with it.

Check superpage support directly against the per-stage cap bits and the pgsize_bitmap.

Add a note that the force_snooping is read without locking. The locking needs to cover the compatible check and the add of the device to the list.

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/7-v3-dbbe6f7e7ae3+124ffe-vtd_prep_jgg@nvidia.com
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Link: https://lore.kernel.org/r/20250714045028.958850-10-baolu.lu@linux.intel.com
Signed-off-by: Will Deacon <will@kernel.org>
Stable-dep-of: cee686775f9c ("iommu/vt-d: Make iotlb_sync_map a static property of dmar_domain")
Signed-off-by: Sasha Levin <sashal@kernel.org>

7 days | iommu/vt-d: Create unique domain ops for each stage | Jason Gunthorpe | 5 | -24/+58

[ Upstream commit b33125296b5047115469b8a3b74c0fdbf4976548 ]

Use the domain ops pointer to tell what kind of domain it is instead of the internal use_first_level indication. This also protects against wrongly using a SVA/nested/IDENTITY/BLOCKED domain type in places they should not be.

The only remaining uses of use_first_level outside the paging domain are in paging_domain_compatible() and intel_iommu_enforce_cache_coherency().

Thus, remove the useless sets of use_first_level in intel_svm_domain_alloc() and intel_iommu_domain_alloc_nested(). None of the unique ops for these domain types ever reference it on their call chains.

Add a WARN_ON() check in domain_context_mapping_one() as it only works with second stage.

This is preparation for iommupt which will have different ops for each of the stages.

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/5-v3-dbbe6f7e7ae3+124ffe-vtd_prep_jgg@nvidia.com
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Link: https://lore.kernel.org/r/20250714045028.958850-8-baolu.lu@linux.intel.com
Signed-off-by: Will Deacon <will@kernel.org>
Stable-dep-of: cee686775f9c ("iommu/vt-d: Make iotlb_sync_map a static property of dmar_domain")
Signed-off-by: Sasha Levin <sashal@kernel.org>

7 days | iommu/vt-d: Split intel_iommu_domain_alloc_paging_flags() | Jason Gunthorpe | 1 | -42/+58

[ Upstream commit b9434ba97c44f5744ea537adfd1f9f3fe102681c ]

Create stage specific functions that check the stage specific conditions if each stage can be supported.

Have intel_iommu_domain_alloc_paging_flags() call both stages in sequence until one does not return EOPNOTSUPP and prefer to use the first stage if available and suitable for the requested flags.

Move second stage only operations like nested_parent and dirty_tracking into the second stage function for clarity.

Move initialization of the iommu_domain members into paging_domain_alloc(). Drop initialization of domain->owner as the callers all do it.

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/4-v3-dbbe6f7e7ae3+124ffe-vtd_prep_jgg@nvidia.com
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Link: https://lore.kernel.org/r/20250714045028.958850-7-baolu.lu@linux.intel.com
Signed-off-by: Will Deacon <will@kernel.org>
Stable-dep-of: cee686775f9c ("iommu/vt-d: Make iotlb_sync_map a static property of dmar_domain")
Signed-off-by: Sasha Levin <sashal@kernel.org>

2025-08-28 | iommu: Remove ops.pgsize_bitmap from drivers that don't use it | Jason Gunthorpe | 1 | -1/+0

[ Upstream commit 8901812485de1356e3757958af40fe0d3a48e986 ]

These drivers all set the domain->pgsize_bitmap in their domain_alloc_paging() functions, so the ops value is never used. Delete it.

Reviewed-by: Sven Peter <sven@svenpeter.dev> # for Apple DART
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Tomasz Jeznach <tjeznach@rivosinc.com> # for RISC-V
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Link: https://lore.kernel.org/r/3-v2-68a2e1ba507c+1fb-iommu_rm_ops_pgsize_jgg@nvidia.com
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Stable-dep-of: 72b6f7cd89ce ("iommu/virtio: Make instance lookup robust")
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

2025-08-20 | iommu/vt-d: Optimize iotlb_sync_map for non-caching/non-RWBF modes | Lu Baolu | 2 | -1/+21

commit 12724ce3fe1a3d8f30d56e48b4f272d8860d1970 upstream.

The iotlb_sync_map iommu ops allows drivers to perform necessary cache flushes when new mappings are established. For the Intel iommu driver, this callback specifically serves two purposes:

- To flush caches when a second-stage page table is attached to a device whose iommu is operating in caching mode (CAP_REG.CM==1).
- To explicitly flush internal write buffers to ensure updates to memory-resident remapping structures are visible to hardware (CAP_REG.RWBF==1).

However, in scenarios where neither caching mode nor the RWBF flag is active, the cache_tag_flush_range_np() helper, which is called in the iotlb_sync_map path, effectively becomes a no-op.

Despite being a no-op, cache_tag_flush_range_np() involves iterating through all cache tags of the IOMMUs attached to the domain, protected by a spinlock. This unnecessary execution path introduces overhead, leading to a measurable I/O performance regression. On systems with NVMes under the same bridge, performance was observed to drop from approximately ~6150 MiB/s down to ~4985 MiB/s.

Introduce a flag in the dmar_domain structure. This flag will only be set when iotlb_sync_map is required (i.e., when CM or RWBF is set). The cache_tag_flush_range_np() is called only for domains where this flag is set. This flag, once set, is immutable, given that there won't be mixed configurations in real-world scenarios where some IOMMUs in a system operate in caching mode while others do not. Theoretically, the immutability of this flag does not impact functionality.

Reported-by: Ioanna Alifieraki <ioanna-maria.alifieraki@canonical.com>
Closes: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2115738
Link: https://lore.kernel.org/r/20250701171154.52435-1-ioanna-maria.alifieraki@canonical.com
Fixes: 129dab6e1286 ("iommu/vt-d: Use cache_tag_flush_range_np() in iotlb_sync_map")
Cc: stable@vger.kernel.org
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Link: https://lore.kernel.org/r/20250703031545.3378602-1-baolu.lu@linux.intel.com
Link: https://lore.kernel.org/r/20250714045028.958850-3-baolu.lu@linux.intel.com
Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

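A sketch of the resulting fast path, assuming the flag lives in struct dmar_domain and gates the existing cache_tag_flush_range_np() helper; the callback shape follows the iommu_ops iotlb_sync_map signature:

    static int intel_iommu_iotlb_sync_map(struct iommu_domain *domain,
                                          unsigned long iova, size_t size)
    {
            struct dmar_domain *dmar_domain = to_dmar_domain(domain);

            /* Skip the per-tag walk (and its spinlock) entirely when
             * neither caching mode nor RWBF requires a map-time flush. */
            if (dmar_domain->iotlb_sync_map)
                    cache_tag_flush_range_np(dmar_domain, iova, iova + size - 1);

            return 0;
    }
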
2025-08-15 | iommu/vt-d: Fix UAF on sva unbind with pending IOPFs | Lu Baolu | 1 | -1/+1

[ Upstream commit f0b9d31c6edd50a6207489cd1bd4ddac814b9cd2 ]

Commit 17fce9d2336d ("iommu/vt-d: Put iopf enablement in domain attach path") disables IOPF on device by removing the device from its IOMMU's IOPF queue when the last IOPF-capable domain is detached from the device. Unfortunately, it did this in a wrong place where there are still pending IOPFs. As a result, a use-after-free error is potentially triggered and eventually a kernel panic with a kernel trace similar to the following:

 refcount_t: underflow; use-after-free.
 WARNING: CPU: 3 PID: 313 at lib/refcount.c:28 refcount_warn_saturate+0xd8/0xe0
 Workqueue: iopf_queue/dmar0-iopfq iommu_sva_handle_iopf
 Call Trace:
  <TASK>
  iopf_free_group+0xe/0x20
  process_one_work+0x197/0x3d0
  worker_thread+0x23a/0x350
  ? rescuer_thread+0x4a0/0x4a0
  kthread+0xf8/0x230
  ? finish_task_switch.isra.0+0x81/0x260
  ? kthreads_online_cpu+0x110/0x110
  ? kthreads_online_cpu+0x110/0x110
  ret_from_fork+0x13b/0x170
  ? kthreads_online_cpu+0x110/0x110
  ret_from_fork_asm+0x11/0x20
  </TASK>
 ---[ end trace 0000000000000000 ]---

The intel_pasid_tear_down_entry() function is responsible for blocking hardware from generating new page faults and flushing all in-flight ones. Therefore, moving iopf_for_domain_remove() after this function should resolve this.

Fixes: 17fce9d2336d ("iommu/vt-d: Put iopf enablement in domain attach path")
Reported-by: Ethan Milon <ethan.milon@eviden.com>
Closes: https://lore.kernel.org/r/e8b37f3e-8539-40d4-8993-43a1f3ffe5aa@eviden.com
Suggested-by: Ethan Milon <ethan.milon@eviden.com>
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Link: https://lore.kernel.org/r/20250723072045.1853328-1-baolu.lu@linux.intel.com
Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>

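The corrected ordering, as a fragment; intel_pasid_tear_down_entry() is the real helper named above, while the exact signature of iopf_for_domain_remove() is assumed here for illustration:

    /* Tear down the PASID entry first: hardware stops generating new
     * faults and all in-flight ones are flushed ... */
    intel_pasid_tear_down_entry(iommu, dev, pasid, false);
    /* ... only then is it safe to drop the device from the IOPF queue. */
    iopf_for_domain_remove(old_domain, dev);
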
2025-08-15 | iommu/vt-d: Fix missing PASID in dev TLB flush with cache_tag_flush_all | Ethan Milon | 1 | -17/+1

[ Upstream commit 3141153816bf4f0257747bd4dda176d38f1a9a49 ]

The function cache_tag_flush_all() was originally implemented with incorrect device TLB invalidation logic that does not handle PASID, in commit c4d27ffaa8eb ("iommu/vt-d: Add cache tag invalidation helpers")

This causes regressions where full address space TLB invalidations occur with a PASID attached, such as during transparent hugepage unmapping in SVA configurations or when calling iommu_flush_iotlb_all(). In these cases, the device receives a TLB invalidation that lacks PASID.

This incorrect logic was later extracted into cache_tag_flush_devtlb_all(), in commit 3297d047cd7f ("iommu/vt-d: Refactor IOTLB and Dev-IOTLB flush for batching")

The fix replaces the call to cache_tag_flush_devtlb_all() with cache_tag_flush_devtlb_psi(), which properly handles PASID.

Fixes: 4f609dbff51b ("iommu/vt-d: Use cache helpers in arch_invalidate_secondary_tlbs")
Fixes: 4e589a53685c ("iommu/vt-d: Use cache_tag_flush_all() in flush_iotlb_all")
Signed-off-by: Ethan Milon <ethan.milon@eviden.com>
Link: https://lore.kernel.org/r/20250708214821.30967-1-ethan.milon@eviden.com
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Link: https://lore.kernel.org/r/20250714045028.958850-11-baolu.lu@linux.intel.com
Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>

2025-08-15 | iommu/vt-d: Do not wipe out the page table NID when devices detach | Jason Gunthorpe | 1 | -1/+0

[ Upstream commit 5c3687d5789cfff8d285a2c76bceb47f145bf01f ]

The NID is used to control which NUMA node memory for the page table is allocated from. It should be a permanent property of the page table when it was allocated and not change during attach/detach of devices.

Reviewed-by: Wei Wang <wei.w.wang@intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/3-v3-dbbe6f7e7ae3+124ffe-vtd_prep_jgg@nvidia.com
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Fixes: 7c204426b818 ("iommu/vt-d: Add domain_alloc_paging support")
Link: https://lore.kernel.org/r/20250714045028.958850-6-baolu.lu@linux.intel.com
Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>

2025-07-04 | iommu/vt-d: Assign devtlb cache tag on ATS enablement | Lu Baolu | 3 | -4/+14

Commit <4f1492efb495> ("iommu/vt-d: Revert ATS timing change to fix boot failure") placed the enabling of ATS in the probe_finalize callback. This occurs after the default domain attachment, which is when the ATS cache tag is assigned. Consequently, the device TLB cache tag is missed when the domain is attached, leading to the device TLB not being invalidated in the iommu_unmap paths.

Fix this by assigning the CACHE_TAG_DEVTLB cache tag when ATS is enabled.

Fixes: 4f1492efb495 ("iommu/vt-d: Revert ATS timing change to fix boot failure")
Cc: stable@vger.kernel.org
Suggested-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Tested-by: Shuicheng Lin <shuicheng.lin@intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Link: https://lore.kernel.org/r/20250625050135.3129955-1-baolu.lu@linux.intel.com
Link: https://lore.kernel.org/r/20250628100351.3198955-2-baolu.lu@linux.intel.com
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>

2025-05-23 | Merge branches 'fixes', 'apple/dart', 'arm/smmu/updates', 'arm/smmu/bindings', 'fsl/pamu', 'mediatek', 'renesas/ipmmu', 's390', 'intel/vt-d', 'amd/amd-vi' and 'core' into next | Joerg Roedel | 10 | -182/+207

2025-05-23 | iommu/vt-d: Restore context entry setup order for aliased devices | Lu Baolu | 3 | -2/+14

Commit 2031c469f816 ("iommu/vt-d: Add support for static identity domain") changed the context entry setup during domain attachment from a set-and-check policy to a clear-and-reset approach. This inadvertently introduced a regression affecting PCI aliased devices behind PCIe-to-PCI bridges.

Specifically, keyboard and touchpad stopped working on several Apple Macbooks with below messages:

 kernel: platform pxa2xx-spi.3: Adding to iommu group 20
 kernel: input: Apple SPI Keyboard as /devices/pci0000:00/0000:00:1e.3/pxa2xx-spi.3/spi_master/spi2/spi-APP000D:00/input/input0
 kernel: DMAR: DRHD: handling fault status reg 3
 kernel: DMAR: [DMA Read NO_PASID] Request device [00:1e.3] fault addr 0xffffa000 [fault reason 0x06] PTE Read access is not set
 kernel: DMAR: DRHD: handling fault status reg 3
 kernel: DMAR: [DMA Read NO_PASID] Request device [00:1e.3] fault addr 0xffffa000 [fault reason 0x06] PTE Read access is not set
 kernel: applespi spi-APP000D:00: Error writing to device: 01 0e 00 00
 kernel: DMAR: DRHD: handling fault status reg 3
 kernel: DMAR: [DMA Read NO_PASID] Request device [00:1e.3] fault addr 0xffffa000 [fault reason 0x06] PTE Read access is not set
 kernel: DMAR: DRHD: handling fault status reg 3
 kernel: applespi spi-APP000D:00: Error writing to device: 01 0e 00 00

Fix this by restoring the previous context setup order.

Fixes: 2031c469f816 ("iommu/vt-d: Add support for static identity domain")
Closes: https://lore.kernel.org/all/4dada48a-c5dd-4c30-9c85-5b03b0aa01f0@bfh.ch/
Cc: stable@vger.kernel.org
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Yi Liu <yi.l.liu@intel.com>
Link: https://lore.kernel.org/r/20250514060523.2862195-1-baolu.lu@linux.intel.com
Link: https://lore.kernel.org/r/20250520075849.755012-2-baolu.lu@linux.intel.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>

2025-05-16 | iommu/vt-d: Change dmar_ats_supported() to return boolean | Wei Wang | 1 | -9/+10

According to "Function return values and names" in coding-style.rst, the dmar_ats_supported() function should return a boolean instead of an integer. Also, rename "ret" to "supported" to be more straightforward.

Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Reviewed-by: Yi Liu <yi.l.liu@intel.com>
Link: https://lore.kernel.org/r/20250509140021.4029303-3-wei.w.wang@intel.com
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>

2025-05-16 | iommu/vt-d: Eliminate pci_physfn() in dmar_find_matched_satc_unit() | Wei Wang | 1 | -1/+0

The function dmar_find_matched_satc_unit() contains a duplicate call to pci_physfn(). This call is unnecessary as pci_physfn() has already been invoked by the caller. Removing the redundant call simplifies the code and improves efficiency a bit.

Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Link: https://lore.kernel.org/r/20250509140021.4029303-2-wei.w.wang@intel.com
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>

2025-05-16 | iommu/vt-d: Replace spin_lock with mutex to protect domain ida | Lu Baolu | 3 | -8/+7

The domain ID allocator is currently protected by a spin_lock. However, ida_alloc_range can potentially block if it needs to allocate memory to grow its internal structures. Replace the spin_lock with a mutex which allows sleep on block. Thus, the memory allocation flags can be updated from GFP_ATOMIC to GFP_KERNEL to allow blocking memory allocations if necessary.

Introduce a new mutex, did_lock, specifically for protecting the domain ida. The existing spinlock will remain for protecting other intel_iommu fields.

Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20250430021135.2370244-3-baolu.lu@linux.intel.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>

2025-05-16 | iommu/vt-d: Use ida to manage domain id | Lu Baolu | 3 | -68/+34

Switch the intel iommu driver to use the ida mechanism for managing domain IDs, replacing the previous fixed-size bitmap.

The previous approach allocated a bitmap large enough to cover the maximum number of domain IDs supported by the hardware, regardless of the actual number of domains in use. This led to unnecessary memory consumption, especially on systems supporting a large number of iommu units but only utilizing a small number of domain IDs.

The ida allocator dynamically manages the allocation and freeing of integer IDs, only consuming memory for the IDs that are currently in use. This significantly optimizes memory usage compared to the fixed-size bitmap.

Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20250430021135.2370244-2-baolu.lu@linux.intel.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>

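A self-contained sketch of the scheme, using the kernel's IDA API (ida_alloc_range()/ida_free()); the mutex mirrors the did_lock introduced in the companion patch above, so allocations may sleep with GFP_KERNEL. Names are illustrative:

    #include <linux/idr.h>
    #include <linux/mutex.h>

    static DEFINE_IDA(domain_ida);          /* per-IOMMU in the real driver */
    static DEFINE_MUTEX(did_lock);

    static int alloc_domain_id(unsigned int min_did, unsigned int max_did)
    {
            int did;

            mutex_lock(&did_lock);
            /* Memory is consumed only for IDs currently in use. */
            did = ida_alloc_range(&domain_ida, min_did, max_did, GFP_KERNEL);
            mutex_unlock(&did_lock);

            return did;     /* >= 0 on success, -ENOMEM/-ENOSPC on failure */
    }

    static void free_domain_id(int did)
    {
            mutex_lock(&did_lock);
            ida_free(&domain_ida, did);
            mutex_unlock(&did_lock);
    }
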
2025-05-16 | iommu/vt-d: Restore WO permissions on second-level paging entries | Jason Gunthorpe | 1 | -2/+1

VT-D HW can do WO permissions on the second-stage but not the first-stage page table formats. The commit eea53c581688 ("iommu/vt-d: Remove WO permissions on second-level paging entries") wanted to make this uniform for VT-D by disabling the support for WO permissions in the second-stage.

This isn't consistent with how other drivers are working. Instead if the underlying HW can support WO, it should. For instance AMD already supports WO on its second stage (v1) format and not its first (v2).

If WO support needs to be discoverable it should be done through an iommu_domain capability flag.

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Link: https://lore.kernel.org/r/0-v1-c26553717e90+65f-iommu_vtd_ss_wo_jgg@nvidia.com
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>

2025-05-16 | iommu: make inclusion of intel directory conditional | Rolf Eike Beer | 1 | -5/+2

Nothing in there is active if CONFIG_INTEL_IOMMU is not enabled, so the whole directory can depend on that switch as well.

Fixes: ab65ba57e3ac ("iommu/vt-d: Move Kconfig and Makefile bits down into intel directory")
Signed-off-by: Rolf Eike Beer <eb@emlix.com>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Link: https://lore.kernel.org/r/3818749.MHq7AAxBmi@devpool92.emlix.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>

2025-04-28 | iommu: Remove iommu_dev_enable/disable_feature() | Lu Baolu | 1 | -25/+0

No external drivers use these interfaces anymore. Furthermore, no existing iommu drivers implement anything in the callbacks. Remove them to avoid dead code.

Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Tested-by: Zhangfei Gao <zhangfei.gao@linaro.org>
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Link: https://lore.kernel.org/r/20250418080130.1844424-9-baolu.lu@linux.intel.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>

2025-04-28 | iommu/vt-d: Put iopf enablement in domain attach path | Lu Baolu | 4 | -10/+90

Update iopf enablement in the driver to use the new method, similar to the arm-smmu-v3 driver. Enable iopf support when any domain with an iopf_handler is attached, and disable it when the domain is removed. Place all the logic for controlling the PRI and iopf queue in the domain set/remove/replace paths.

Keep track of the number of domains set to the device and PASIDs that require iopf. When the first domain requiring iopf is attached, add the device to the iopf queue and enable PRI. When the last domain is removed, remove it from the iopf queue and disable PRI.

Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Yi Liu <yi.l.liu@intel.com>
Tested-by: Zhangfei Gao <zhangfei.gao@linaro.org>
Link: https://lore.kernel.org/r/20250418080130.1844424-4-baolu.lu@linux.intel.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>

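A sketch of the first-on/last-off counting described above, assuming an iopf_refcount field in the per-device info and the generic iopf_queue_add_device()/iopf_queue_remove_device() helpers:

    static int iopf_enable_for_dev(struct device_domain_info *info)
    {
            int ret;

            if (info->iopf_refcount++)      /* already enabled by another domain */
                    return 0;

            ret = iopf_queue_add_device(info->iommu->iopf_queue, info->dev);
            if (ret)
                    info->iopf_refcount--;
            /* PRI enablement on the PCI device would follow here. */
            return ret;
    }

    static void iopf_disable_for_dev(struct device_domain_info *info)
    {
            if (--info->iopf_refcount)      /* still required by another domain */
                    return;

            iopf_queue_remove_device(info->iommu->iopf_queue, info->dev);
            /* ... and PRI is disabled once the last user is gone. */
    }
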
2025-04-28 | iommu: Remove IOMMU_DEV_FEAT_SVA | Jason Gunthorpe | 1 | -6/+0

None of the drivers implement anything here anymore, remove the dead code.

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Yi Liu <yi.l.liu@intel.com>
Tested-by: Zhangfei Gao <zhangfei.gao@linaro.org>
Link: https://lore.kernel.org/r/20250418080130.1844424-3-baolu.lu@linux.intel.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>

2025-04-28 | iommu/vt-d: Apply quirk_iommu_igfx for 8086:0044 (QM57/QS57) | Mingcong Bai | 1 | -1/+3

On the Lenovo ThinkPad X201, when Intel VT-d is enabled in the BIOS, the kernel boots with errors related to DMAR, the graphical interface appeared quite choppy, and the system resets erratically within a minute after it booted:

 DMAR: DRHD: handling fault status reg 3
 DMAR: [DMA Write NO_PASID] Request device [00:02.0] fault addr 0xb97ff000 [fault reason 0x05] PTE Write access is not set

Upon comparing boot logs with VT-d on/off, I found that the Intel Calpella quirk (`quirk_calpella_no_shadow_gtt()') correctly applied the igfx IOMMU disable/quirk correctly:

 pci 0000:00:00.0: DMAR: BIOS has allocated no shadow GTT; disabling IOMMU for graphics

Whereas with VT-d on, it went into the "else" branch, which then triggered the DMAR handling fault above:

 ... else if (!disable_igfx_iommu) {
         /* we have to ensure the gfx device is idle before we flush */
         pci_info(dev, "Disabling batched IOTLB flush on Ironlake\n");
         iommu_set_dma_strict();
 }

Now, this is not exactly scientific, but moving 0x0044 to quirk_iommu_igfx seems to have fixed the aforementioned issue. Running a few `git blame' runs on the function, I have found that the quirk was originally introduced as a fix specific to ThinkPad X201:

 commit 9eecabcb9a92 ("intel-iommu: Abort IOMMU setup for igfx if BIOS gave no shadow GTT space")

Which was later revised twice to the "else" branch we saw above:

 - 2011: commit 6fbcfb3e467a ("intel-iommu: Workaround IOTLB hang on Ironlake GPU")
 - 2024: commit ba00196ca41c ("iommu/vt-d: Decouple igfx_off from graphic identity mapping")

I'm uncertain whether further testing on this particular laptop was done in 2011 and (honestly I'm not sure) 2024, but I would be happy to do some distro-specific testing if that's what would be required to verify this patch.

P.S., I also see IDs 0x0040, 0x0062, and 0x006a listed under the same `quirk_calpella_no_shadow_gtt()' quirk, but I'm not sure how similar these chipsets are (if they share the same issue with VT-d or even, indeed, if this issue is specific to a bug in the Lenovo BIOS). With regards to 0x0062, it seems to be a Centrino wireless card, but not a chipset?

I have also listed a couple (distro and kernel) bug reports below as references (some of them are from 7-8 years ago!), as they seem to be similar issues found on different Westmere/Ironlake, Haswell, and Broadwell hardware setups.

Cc: stable@vger.kernel.org
Fixes: 6fbcfb3e467a ("intel-iommu: Workaround IOTLB hang on Ironlake GPU")
Fixes: ba00196ca41c ("iommu/vt-d: Decouple igfx_off from graphic identity mapping")
Link: https://groups.google.com/g/qubes-users/c/4NP4goUds2c?pli=1
Link: https://bugs.archlinux.org/task/65362
Link: https://bbs.archlinux.org/viewtopic.php?id=230323
Reported-by: Wenhao Sun <weiguangtwk@outlook.com>
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=197029
Signed-off-by: Mingcong Bai <jeffbai@aosc.io>
Link: https://lore.kernel.org/r/20250415133330.12528-1-jeffbai@aosc.io
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>

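The mechanical shape of the proposed change, assuming it mirrors how the other igfx devices are already registered via DECLARE_PCI_FIXUP_HEADER():

    /* Before: 0x0044 handled by the Calpella shadow-GTT quirk */
    DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x0044, quirk_calpella_no_shadow_gtt);

    /* After: treat the QM57/QS57 igfx like the other quirk_iommu_igfx devices */
    DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x0044, quirk_iommu_igfx);
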
2025-04-17 | iommu/vt-d: Revert ATS timing change to fix boot failure | Lu Baolu | 1 | -12/+19

Commit <5518f239aff1> ("iommu/vt-d: Move scalable mode ATS enablement to probe path") changed the PCI ATS enablement logic to run earlier, specifically before the default domain attachment. On some client platforms, this change resulted in boot failures, causing the kernel to panic with the following message and call trace:

 Kernel panic - not syncing: DMAR hardware is malfunctioning
 CPU: 0 UID: 0 PID: 1 Comm: swapper/0 Not tainted 6.14.0-rc3+ #175
 Call Trace:
  <TASK>
  dump_stack_lvl+0x6f/0xb0
  dump_stack+0x10/0x16
  panic+0x10a/0x2b7
  iommu_enable_translation.cold+0xc/0xc
  intel_iommu_init+0xe39/0xec0
  ? trace_hardirqs_on+0x1e/0xd0
  ? __pfx_pci_iommu_init+0x10/0x10
  pci_iommu_init+0xd/0x40
  do_one_initcall+0x5b/0x390
  kernel_init_freeable+0x26d/0x2b0
  ? __pfx_kernel_init+0x10/0x10
  kernel_init+0x15/0x120
  ret_from_fork+0x35/0x60
  ? __pfx_kernel_init+0x10/0x10
  ret_from_fork_asm+0x1a/0x30
 RIP: 1f0f:0x0
 Code: Unable to access opcode bytes at 0xffffffffffffffd6.
 RSP: 0000:0000000000000000 EFLAGS: 841f0f2e66 ORIG_RAX: 1f0f2e6600000000
 RAX: 0000000000000000 RBX: 1f0f2e6600000000 RCX: 2e66000000000084
 RDX: 0000000000841f0f RSI: 000000841f0f2e66 RDI: 00841f0f2e660000
 RBP: 00841f0f2e660000 R08: 00841f0f2e660000 R09: 000000841f0f2e66
 R10: 0000000000841f0f R11: 2e66000000000084 R12: 000000841f0f2e66
 R13: 0000000000841f0f R14: 2e66000000000084 R15: 1f0f2e6600000000
  </TASK>
 ---[ end Kernel panic - not syncing: DMAR hardware is malfunctioning ]---

Fix this by reverting the timing change for ATS enablement introduced by the offending commit and restoring the previous behavior.

Fixes: 5518f239aff1 ("iommu/vt-d: Move scalable mode ATS enablement to probe path")
Reported-by: Jarkko Nikula <jarkko.nikula@linux.intel.com>
Closes: https://lore.kernel.org/linux-iommu/01b9c72f-460d-4f77-b696-54c6825babc9@linux.intel.com/
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Tested-by: Jarkko Nikula <jarkko.nikula@linux.intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Link: https://lore.kernel.org/r/20250416073608.1799578-1-baolu.lu@linux.intel.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>

2025-04-17 | iommu/vtd: Remove iommu_alloc_pages_node() | Jason Gunthorpe | 4 | -10/+11

Intel is the only thing that uses this now, convert to the size versions, trying to avoid PAGE_SHIFT.

Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/23-v4-c8663abbb606+3f7-iommu_pages_jgg@nvidia.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>

2025-04-17 | iommu/pages: Remove iommu_alloc_page_node() | Jason Gunthorpe | 2 | -6/+10

Use iommu_alloc_pages_node_sz() instead. AMD and Intel are both using 4K pages for these structures since those drivers only work on 4K PAGE_SIZE. riscv is also spec'd to use SZ_4K.

Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Tested-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/21-v4-c8663abbb606+3f7-iommu_pages_jgg@nvidia.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>

2025-04-17 | iommu: Update various drivers to pass in lg2sz instead of order to iommu pages | Jason Gunthorpe | 1 | -3/+3

Convert most of the places calling get_order() as an argument to the iommu-pages allocator into order_base_2() or the _sz flavour instead. These places already have an exact size, there is no particular reason to use order here.

Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Tested-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/19-v4-c8663abbb606+3f7-iommu_pages_jgg@nvidia.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>

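A worked example of the conversion, with an assumed _sz allocator signature taking a byte count; note that get_order() yields a page order while order_base_2() yields the lg2 of the size itself:

    /* size = 4096 bytes, PAGE_SIZE = 4096:
     *   get_order(4096)    == 0   (one page, order 0)
     *   order_base_2(4096) == 12  (lg2 of the byte size)
     */
    table = iommu_alloc_pages_node(nid, gfp, get_order(size));   /* old */
    table = iommu_alloc_pages_node_sz(nid, gfp, size);           /* new */
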
2025-04-17 | iommu: Change iommu_iotlb_gather to use iommu_page_list | Jason Gunthorpe | 1 | -12/+12

This converts the remaining places using list of pages to the new API. The Intel free path was shared with its gather path, so it is converted at the same time.

Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Tested-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/11-v4-c8663abbb606+3f7-iommu_pages_jgg@nvidia.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>

2025-04-17 | iommu/pages: Remove iommu_free_page() | Jason Gunthorpe | 3 | -10/+10

Use iommu_free_pages() instead.

Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Mostafa Saleh <smostafa@google.com>
Tested-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/6-v4-c8663abbb606+3f7-iommu_pages_jgg@nvidia.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>

2025-04-17 | iommu/pages: Remove the order argument to iommu_free_pages() | Jason Gunthorpe | 4 | -7/+5

Now that we have a folio under the allocation iommu_free_pages() can know the order of the original allocation and do the correct thing to free it. The next patch will rename iommu_free_page() to iommu_free_pages() so we have naming consistency with iommu_alloc_pages_node().

Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Mostafa Saleh <smostafa@google.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Tested-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/5-v4-c8663abbb606+3f7-iommu_pages_jgg@nvidia.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>

2025-04-11 | iommu/vt-d: Remove an unnecessary call set_dma_ops() | Petr Tesarik | 1 | -1/+0

Do not touch per-device DMA ops when the driver has been converted to use the dma-iommu API.

Fixes: c588072bba6b ("iommu/vt-d: Convert intel iommu driver to the iommu ops")
Signed-off-by: Petr Tesarik <ptesarik@suse.com>
Link: https://lore.kernel.org/r/20250403165605.278541-1-ptesarik@suse.com
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>

2025-04-11 | iommu/vt-d: Wire up irq_ack() to irq_move_irq() for posted MSIs | Sean Christopherson | 1 | -14/+15

Set the posted MSI irq_chip's irq_ack() hook to irq_move_irq() instead of a dummy/empty callback so that posted MSIs process pending changes to the IRQ's SMP affinity. Failure to honor a pending set-affinity results in userspace being unable to change the effective affinity of the IRQ, as IRQD_SETAFFINITY_PENDING is never cleared and so irq_set_affinity_locked() always defers moving the IRQ.

The issue is most easily reproducible by setting /proc/irq/xx/smp_affinity multiple times in quick succession, as only the first update is likely to be handled in process context.

Fixes: ed1e48ea4370 ("iommu/vt-d: Enable posted mode for device MSIs")
Cc: Robert Lippert <rlippert@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Reported-by: Wentao Yang <wentaoyang@google.com>
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20250321194249.1217961-1-seanjc@google.com
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>

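A sketch of the wiring; irq_move_irq() is the generic kernel helper that applies a pending affinity change, while the chip and member names here follow the description above rather than the exact driver source:

    static struct irq_chip intel_ir_chip_post_msi = {
            .name           = "INTEL-IR-POST",
            /* Apply any pending SMP affinity change when the IRQ is acked,
             * instead of the previous dummy callback that dropped it. */
            .irq_ack        = irq_move_irq,
            /* ... remaining callbacks unchanged ... */
    };
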
2025-04-02 | Merge tag 'for-linus-iommufd' of git://git.kernel.org/pub/scm/linux/kernel/git/jgg/iommufd | Linus Torvalds | 2 | -2/+3

Pull iommufd updates from Jason Gunthorpe:
 "Two significant new items:

  - Allow reporting IOMMU HW events to userspace when the events are clearly linked to a device. This is linked to the VIOMMU object and is intended to be used by a VMM to forward HW events to the virtual machine as part of emulating a vIOMMU. ARM SMMUv3 is the first driver to use this mechanism. Like the existing fault events the data is delivered through a simple FD returning event records on read().

  - PASID support in VFIO. The "Process Address Space ID" is a PCI feature that allows the device to tag all PCI DMA operations with an ID. The IOMMU will then use the ID to select a unique translation for those DMAs. This is part of Intel's vIOMMU support as VT-D HW requires the hypervisor to manage each PASID entry. The support is generic so any VFIO user could attach any translation to a PASID, and the support should work on ARM SMMUv3 as well. AMD requires additional driver work.

  Some minor updates, along with fixes:

  - Prevent using nested parents with fault's, no driver support today

  - Put a single "cookie_type" value in the iommu_domain to indicate what owns the various opaque owner fields"

* tag 'for-linus-iommufd' of git://git.kernel.org/pub/scm/linux/kernel/git/jgg/iommufd: (49 commits)
  iommufd: Test attach before detaching pasid
  iommufd: Fix iommu_vevent_header tables markup
  iommu: Convert unreachable() to BUG()
  iommufd: Balance veventq->num_events inc/dec
  iommufd: Initialize the flags of vevent in iommufd_viommu_report_event()
  iommufd/selftest: Add coverage for reporting max_pasid_log2 via IOMMU_HW_INFO
  iommufd: Extend IOMMU_GET_HW_INFO to report PASID capability
  vfio: VFIO_DEVICE_[AT|DE]TACH_IOMMUFD_PT support pasid
  vfio-iommufd: Support pasid [at|de]tach for physical VFIO devices
  ida: Add ida_find_first_range()
  iommufd/selftest: Add coverage for iommufd pasid attach/detach
  iommufd/selftest: Add test ops to test pasid attach/detach
  iommufd/selftest: Add a helper to get test device
  iommufd/selftest: Add set_dev_pasid in mock iommu
  iommufd: Allow allocating PASID-compatible domain
  iommu/vt-d: Add IOMMU_HWPT_ALLOC_PASID support
  iommufd: Enforce PASID-compatible domain for RID
  iommufd: Support pasid attach/replace
  iommufd: Enforce PASID-compatible domain in PASID path
  iommufd/device: Add pasid_attach array to track per-PASID attach
  ...

2025-03-25 | iommu/vt-d: Add IOMMU_HWPT_ALLOC_PASID support | Yi Liu | 2 | -2/+3

Intel iommu driver just treats it as a nop since Intel VT-d does not have special requirement on domains attached to either the PASID or RID of a PASID-capable device.

Link: https://patch.msgid.link/r/20250321171940.7213-14-yi.l.liu@intel.com
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>

2025-03-20 | Merge branches 'apple/dart', 'arm/smmu/updates', 'arm/smmu/bindings', 'rockchip', 's390', 'core', 'intel/vt-d' and 'amd/amd-vi' into next | Joerg Roedel | 6 | -225/+172

2025-03-20 | iommu/vt-d: Fix possible circular locking dependency | Lu Baolu | 1 | -0/+2

We have recently seen report of lockdep circular lock dependency warnings on platforms like Skylake and Kabylake:

 ======================================================
 WARNING: possible circular locking dependency detected
 6.14.0-rc6-CI_DRM_16276-gca2c04fe76e8+ #1 Not tainted
 ------------------------------------------------------
 swapper/0/1 is trying to acquire lock:
 ffffffff8360ee48 (iommu_probe_device_lock){+.+.}-{3:3}, at: iommu_probe_device+0x1d/0x70

 but task is already holding lock:
 ffff888102c7efa8 (&device->physical_node_lock){+.+.}-{3:3}, at: intel_iommu_init+0xe75/0x11f0

 which lock already depends on the new lock.

 the existing dependency chain (in reverse order) is:

 -> #6 (&device->physical_node_lock){+.+.}-{3:3}:
        __mutex_lock+0xb4/0xe40
        mutex_lock_nested+0x1b/0x30
        intel_iommu_init+0xe75/0x11f0
        pci_iommu_init+0x13/0x70
        do_one_initcall+0x62/0x3f0
        kernel_init_freeable+0x3da/0x6a0
        kernel_init+0x1b/0x200
        ret_from_fork+0x44/0x70
        ret_from_fork_asm+0x1a/0x30

 -> #5 (dmar_global_lock){++++}-{3:3}:
        down_read+0x43/0x1d0
        enable_drhd_fault_handling+0x21/0x110
        cpuhp_invoke_callback+0x4c6/0x870
        cpuhp_issue_call+0xbf/0x1f0
        __cpuhp_setup_state_cpuslocked+0x111/0x320
        __cpuhp_setup_state+0xb0/0x220
        irq_remap_enable_fault_handling+0x3f/0xa0
        apic_intr_mode_init+0x5c/0x110
        x86_late_time_init+0x24/0x40
        start_kernel+0x895/0xbd0
        x86_64_start_reservations+0x18/0x30
        x86_64_start_kernel+0xbf/0x110
        common_startup_64+0x13e/0x141

 -> #4 (cpuhp_state_mutex){+.+.}-{3:3}:
        __mutex_lock+0xb4/0xe40
        mutex_lock_nested+0x1b/0x30
        __cpuhp_setup_state_cpuslocked+0x67/0x320
        __cpuhp_setup_state+0xb0/0x220
        page_alloc_init_cpuhp+0x2d/0x60
        mm_core_init+0x18/0x2c0
        start_kernel+0x576/0xbd0
        x86_64_start_reservations+0x18/0x30
        x86_64_start_kernel+0xbf/0x110
        common_startup_64+0x13e/0x141

 -> #3 (cpu_hotplug_lock){++++}-{0:0}:
        __cpuhp_state_add_instance+0x4f/0x220
        iova_domain_init_rcaches+0x214/0x280
        iommu_setup_dma_ops+0x1a4/0x710
        iommu_device_register+0x17d/0x260
        intel_iommu_init+0xda4/0x11f0
        pci_iommu_init+0x13/0x70
        do_one_initcall+0x62/0x3f0
        kernel_init_freeable+0x3da/0x6a0
        kernel_init+0x1b/0x200
        ret_from_fork+0x44/0x70
        ret_from_fork_asm+0x1a/0x30

 -> #2 (&domain->iova_cookie->mutex){+.+.}-{3:3}:
        __mutex_lock+0xb4/0xe40
        mutex_lock_nested+0x1b/0x30
        iommu_setup_dma_ops+0x16b/0x710
        iommu_device_register+0x17d/0x260
        intel_iommu_init+0xda4/0x11f0
        pci_iommu_init+0x13/0x70
        do_one_initcall+0x62/0x3f0
        kernel_init_freeable+0x3da/0x6a0
        kernel_init+0x1b/0x200
        ret_from_fork+0x44/0x70
        ret_from_fork_asm+0x1a/0x30

 -> #1 (&group->mutex){+.+.}-{3:3}:
        __mutex_lock+0xb4/0xe40
        mutex_lock_nested+0x1b/0x30
        __iommu_probe_device+0x24c/0x4e0
        probe_iommu_group+0x2b/0x50
        bus_for_each_dev+0x7d/0xe0
        iommu_device_register+0xe1/0x260
        intel_iommu_init+0xda4/0x11f0
        pci_iommu_init+0x13/0x70
        do_one_initcall+0x62/0x3f0
        kernel_init_freeable+0x3da/0x6a0
        kernel_init+0x1b/0x200
        ret_from_fork+0x44/0x70
        ret_from_fork_asm+0x1a/0x30

 -> #0 (iommu_probe_device_lock){+.+.}-{3:3}:
        __lock_acquire+0x1637/0x2810
        lock_acquire+0xc9/0x300
        __mutex_lock+0xb4/0xe40
        mutex_lock_nested+0x1b/0x30
        iommu_probe_device+0x1d/0x70
        intel_iommu_init+0xe90/0x11f0
        pci_iommu_init+0x13/0x70
        do_one_initcall+0x62/0x3f0
        kernel_init_freeable+0x3da/0x6a0
        kernel_init+0x1b/0x200
        ret_from_fork+0x44/0x70
        ret_from_fork_asm+0x1a/0x30

 other info that might help us debug this:

 Chain exists of:
   iommu_probe_device_lock --> dmar_global_lock --> &device->physical_node_lock

 Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(&device->physical_node_lock);
                               lock(dmar_global_lock);
                               lock(&device->physical_node_lock);
  lock(iommu_probe_device_lock);

 *** DEADLOCK ***

This driver uses a global lock to protect the list of enumerated DMA remapping units. It is necessary due to the driver's support for dynamic addition and removal of remapping units at runtime.

Two distinct code paths require iteration over this remapping unit list:

- Device registration and probing: the driver iterates the list to register each remapping unit with the upper layer IOMMU framework and subsequently probe the devices managed by that unit.
- Global configuration: Upper layer components may also iterate the list to apply configuration changes.

The lock acquisition order between these two code paths was reversed. This caused lockdep warnings, indicating a risk of deadlock. Fix this warning by releasing the global lock before invoking upper layer interfaces for device registration.

Fixes: b150654f74bf ("iommu/vt-d: Fix suspicious RCU usage")
Closes: https://lore.kernel.org/linux-iommu/SJ1PR11MB612953431F94F18C954C4A9CB9D32@SJ1PR11MB6129.namprd11.prod.outlook.com/
Tested-by: Chaitanya Kumar Borah <chaitanya.kumar.borah@intel.com>
Cc: stable@vger.kernel.org
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Link: https://lore.kernel.org/r/20250317035714.1041549-1-baolu.lu@linux.intel.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>

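A fragment illustrating the fix as described: dmar_global_lock is dropped across the call into the IOMMU core, so probe-side locks are never nested inside it. The surrounding context is assumed:

    up_write(&dmar_global_lock);
    ret = iommu_device_register(&iommu->iommu, &intel_iommu_ops, NULL);
    down_write(&dmar_global_lock);
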
2025-03-20 | iommu/vt-d: Don't clobber posted vCPU IRTE when host IRQ affinity changes | Sean Christopherson | 1 | -10/+15

Don't overwrite an IRTE that is posting IRQs to a vCPU with a posted MSI entry if the host IRQ affinity happens to change. If/when the IRTE is reverted back to "host mode", it will be reconfigured as a posted MSI or remapped entry as appropriate.

Drop the "mode" field, which doesn't differentiate between posted MSIs and posted vCPUs, in favor of a dedicated posted_vcpu flag. Note! The two posted_{msi,vcpu} flags are intentionally not mutually exclusive; an IRTE can transition between posted MSI and posted vCPU.

Fixes: ed1e48ea4370 ("iommu/vt-d: Enable posted mode for device MSIs")
Cc: stable@vger.kernel.org
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/20250315025135.2365846-3-seanjc@google.com
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>

2025-03-20 | iommu/vt-d: Put IRTE back into posted MSI mode if vCPU posting is disabled | Sean Christopherson | 1 | -6/+13

Add a helper to take care of reconfiguring an IRTE to deliver IRQs to the host, i.e. not to a vCPU, and use the helper when an IRTE's vCPU affinity is nullified, i.e. when KVM puts an IRTE back into "host" mode. Because posted MSIs use an ephemeral IRTE, using modify_irte() puts the IRTE into full remapped mode, i.e. unintentionally disables posted MSIs on the IRQ.

Fixes: ed1e48ea4370 ("iommu/vt-d: Enable posted mode for device MSIs")
Cc: stable@vger.kernel.org
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/20250315025135.2365846-2-seanjc@google.com
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>

2025-03-10 | iommu/vt-d: Cleanup intel_context_flush_present() | Lu Baolu | 3 | -38/+10

The intel_context_flush_present() is called in places where either the scalable mode is disabled, or scalable mode is enabled but all PASID entries are known to be non-present. In these cases, the flush_domains path within intel_context_flush_present() will never execute. This dead code is therefore removed.

Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Tested-by: Zhangfei Gao <zhangfei.gao@linaro.org>
Link: https://lore.kernel.org/r/20250228092631.3425464-7-baolu.lu@linux.intel.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>

2025-03-10 | iommu/vt-d: Move PRI enablement in probe path | Lu Baolu | 4 | -90/+55

Update PRI enablement to use the new method, similar to the amd iommu driver. Enable PRI in the device probe path and disable it when the device is released. PRI is enabled throughout the device's iommu lifecycle.

The infrastructure for the iommu subsystem to handle iopf requests is created during iopf enablement and released during iopf disablement. All invalid page requests from the device are automatically handled by the iommu subsystem if iopf is not enabled. Add iopf_refcount to track the iopf enablement.

Convert the return type of intel_iommu_disable_iopf() to void, as there is no way to handle a failure when disabling this feature. Make intel_iommu_enable/disable_iopf() helpers global, as they will be used beyond the current file in the subsequent patch.

The iopf_refcount is not protected by any lock. This is acceptable, as there is no concurrent access to it in the current code. The following patch will address this by moving it to the domain attach/detach paths, which are protected by the iommu group mutex.

Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Tested-by: Zhangfei Gao <zhangfei.gao@linaro.org>
Link: https://lore.kernel.org/r/20250228092631.3425464-6-baolu.lu@linux.intel.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>

2025-03-10 | iommu/vt-d: Move scalable mode ATS enablement to probe path | Lu Baolu | 1 | -24/+27

Device ATS is currently enabled when a domain is attached to the device and disabled when the domain is detached. This creates a limitation: when the IOMMU is operating in scalable mode and IOPF is enabled, the device's domain cannot be changed.

The previous code enables ATS when a domain is set to a device's RID and disables it during RID domain switch. So, if a PASID is set with a domain requiring PRI, ATS should remain enabled until the domain is removed. During the PASID domain's lifecycle, if the RID's domain changes, PRI will be disrupted because it depends on ATS, which is disabled when the blocking domain is set for the device's RID.

Remove this limitation by moving ATS enablement to the device probe path.

Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Tested-by: Zhangfei Gao <zhangfei.gao@linaro.org>
Link: https://lore.kernel.org/r/20250228092631.3425464-5-baolu.lu@linux.intel.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>

2025-03-10 | iommu/vt-d: Check if SVA is supported when attaching the SVA domain | Jason Gunthorpe | 2 | -36/+44

Attach of a SVA domain should fail if SVA is not supported, move the check for SVA support out of IOMMU_DEV_FEAT_SVA and into attach. Also check when allocating a SVA domain to match other drivers.

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Yi Liu <yi.l.liu@intel.com>
Tested-by: Zhangfei Gao <zhangfei.gao@linaro.org>
Link: https://lore.kernel.org/r/20250228092631.3425464-3-baolu.lu@linux.intel.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>

2025-03-10 | iommu/vt-d: Use virt_to_phys() | Jason Gunthorpe | 2 | -20/+2

If all the inlines are unwound virt_to_dma_pfn() is simply:

  return page_to_pfn(virt_to_page(p)) << (PAGE_SHIFT - VTD_PAGE_SHIFT);

Which can be re-arranged to:

  (page_to_pfn(virt_to_page(p)) << PAGE_SHIFT) >> VTD_PAGE_SHIFT

The only caller is:

  ((uint64_t)virt_to_dma_pfn(tmp_page) << VTD_PAGE_SHIFT)

re-arranged to:

  ((page_to_pfn(virt_to_page(tmp_page)) << PAGE_SHIFT) >> VTD_PAGE_SHIFT) << VTD_PAGE_SHIFT

Which simplifies to:

  page_to_pfn(virt_to_page(tmp_page)) << PAGE_SHIFT

That is the same as virt_to_phys(tmp_page), so just remove all of this.

Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/8-v3-e797f4dc6918+93057-iommu_pages_jgg@nvidia.com
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>

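At the call site this reduces to a one-line change; the DMA_PTE_READ/WRITE bits are shown for context and assumed from the surrounding code:

    /* before */
    pteval = ((uint64_t)virt_to_dma_pfn(tmp_page) << VTD_PAGE_SHIFT) | DMA_PTE_READ | DMA_PTE_WRITE;
    /* after */
    pteval = virt_to_phys(tmp_page) | DMA_PTE_READ | DMA_PTE_WRITE;
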
2025-03-10 | iommu/vt-d: Fix system hang on reboot -f | Yunhui Cui | 1 | -7/+10

We found that executing the command ./a.out &;reboot -f (where a.out is a program that only executes a while(1) infinite loop) can probabilistically cause the system to hang in the intel_iommu_shutdown() function, rendering it unresponsive. Through analysis, we identified that the factors contributing to this issue are as follows:

1. The reboot -f command does not prompt the kernel to notify the application layer to perform cleanup actions, allowing the application to continue running.

2. When the kernel reaches the intel_iommu_shutdown() function, only the BSP (Bootstrap Processor) CPU is operational in the system.

3. During the execution of intel_iommu_shutdown(), the function down_write(&dmar_global_lock) causes the process to sleep and be scheduled out.

4. At this point, the processor's interrupt flag is not cleared, so interrupts can still be accepted. However, only legacy devices and NMI (Non-Maskable Interrupt) interrupts could come in, as other interrupt routing has already been disabled. If no legacy or NMI interrupts occur at this stage, the scheduler will not be able to run.

5. If the application scheduled at this time is executing a while(1)-type loop, it will be unable to be preempted, leading to an infinite loop and causing the system to become unresponsive.

To resolve this issue, the intel_iommu_shutdown() function should not execute down_write(), which can potentially cause the process to be scheduled out. Furthermore, since only the BSP is running during the later stages of the reboot, there is no need for protection against parallel access to the DMAR (DMA Remapping) unit. Therefore, the following lines could be removed:

  down_write(&dmar_global_lock);
  up_write(&dmar_global_lock);

After testing, the issue has been resolved.

Fixes: 6c3a44ed3c55 ("iommu/vt-d: Turn off translations at shutdown")
Co-developed-by: Ethan Zhao <haifeng.zhao@linux.intel.com>
Signed-off-by: Ethan Zhao <haifeng.zhao@linux.intel.com>
Signed-off-by: Yunhui Cui <cuiyunhui@bytedance.com>
Link: https://lore.kernel.org/r/20250303062421.17929-1-cuiyunhui@bytedance.com
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>

2025-02-28 | iommu/vt-d: Fix suspicious RCU usage | Lu Baolu | 2 | -0/+8

Commit <d74169ceb0d2> ("iommu/vt-d: Allocate DMAR fault interrupts locally") moved the call to enable_drhd_fault_handling() to a code path that does not hold any lock while traversing the drhd list. Fix it by ensuring the dmar_global_lock lock is held when traversing the drhd list.

Without this fix, the following warning is triggered:

 =============================
 WARNING: suspicious RCU usage
 6.14.0-rc3 #55 Not tainted
 -----------------------------
 drivers/iommu/intel/dmar.c:2046 RCU-list traversed in non-reader section!!

 other info that might help us debug this:

 rcu_scheduler_active = 1, debug_locks = 1
 2 locks held by cpuhp/1/23:
 #0: ffffffff84a67c50 (cpu_hotplug_lock){++++}-{0:0}, at: cpuhp_thread_fun+0x87/0x2c0
 #1: ffffffff84a6a380 (cpuhp_state-up){+.+.}-{0:0}, at: cpuhp_thread_fun+0x87/0x2c0

 stack backtrace:
 CPU: 1 UID: 0 PID: 23 Comm: cpuhp/1 Not tainted 6.14.0-rc3 #55
 Call Trace:
  <TASK>
  dump_stack_lvl+0xb7/0xd0
  lockdep_rcu_suspicious+0x159/0x1f0
  ? __pfx_enable_drhd_fault_handling+0x10/0x10
  enable_drhd_fault_handling+0x151/0x180
  cpuhp_invoke_callback+0x1df/0x990
  cpuhp_thread_fun+0x1ea/0x2c0
  smpboot_thread_fn+0x1f5/0x2e0
  ? __pfx_smpboot_thread_fn+0x10/0x10
  kthread+0x12a/0x2d0
  ? __pfx_kthread+0x10/0x10
  ret_from_fork+0x4a/0x60
  ? __pfx_kthread+0x10/0x10
  ret_from_fork_asm+0x1a/0x30
  </TASK>

Holding the lock in enable_drhd_fault_handling() triggers a lockdep splat about a possible deadlock between dmar_global_lock and cpu_hotplug_lock. This is avoided by not holding dmar_global_lock when calling iommu_device_register(), which initiates the device probe process.

Fixes: d74169ceb0d2 ("iommu/vt-d: Allocate DMAR fault interrupts locally")
Reported-and-tested-by: Ido Schimmel <idosch@nvidia.com>
Closes: https://lore.kernel.org/linux-iommu/Zx9OwdLIc_VoQ0-a@shredder.mtl.com/
Tested-by: Breno Leitao <leitao@debian.org>
Cc: stable@vger.kernel.org
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Link: https://lore.kernel.org/r/20250218022422.2315082-1-baolu.lu@linux.intel.com
Tested-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>

2025-02-28 | iommu/vt-d: Remove device comparison in context_setup_pass_through_cb | Jerry Snitselaar | 1 | -3/+0

Remove the device comparison check in context_setup_pass_through_cb. pci_for_each_dma_alias already makes a decision on whether the callback function should be called for a device. With the check in place it will fail to create context entries for aliases as it walks up to the root bus.

Fixes: 2031c469f816 ("iommu/vt-d: Add support for static identity domain")
Closes: https://lore.kernel.org/linux-iommu/82499eb6-00b7-4f83-879a-e97b4144f576@linux.intel.com/
Cc: stable@vger.kernel.org
Signed-off-by: Jerry Snitselaar <jsnitsel@redhat.com>
Link: https://lore.kernel.org/r/20250224180316.140123-1-jsnitsel@redhat.com
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>

2025-02-14 | iommu/vt-d: Make intel_iommu_drain_pasid_prq() cover faults for RID | Lu Baolu | 1 | -1/+3

This driver supports page faults on PCI RID since commit <9f831c16c69e> ("iommu/vt-d: Remove the pasid present check in prq_event_thread") by allowing the reporting of page faults with the pasid_present field cleared to the upper layer for further handling. The fundamental assumption here is that the detach or replace operations act as a fence for page faults. This implies that all pending page faults associated with a specific RID or PASID are flushed when a domain is detached or replaced from a device RID or PASID.

However, the intel_iommu_drain_pasid_prq() helper does not correctly handle faults for RID. This leads to faults potentially remaining pending in the iommu hardware queue even after the domain is detached, thereby violating the aforementioned assumption.

Fix this issue by extending intel_iommu_drain_pasid_prq() to cover faults for RID.

Fixes: 9f831c16c69e ("iommu/vt-d: Remove the pasid present check in prq_event_thread")
Cc: stable@vger.kernel.org
Suggested-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Link: https://lore.kernel.org/r/20250121023150.815972-1-baolu.lu@linux.intel.com
Reviewed-by: Yi Liu <yi.l.liu@intel.com>
Link: https://lore.kernel.org/r/20250211005512.985563-2-baolu.lu@linux.intel.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>

2025-01-24 | Merge tag 'for-linus-iommufd' of git://git.kernel.org/pub/scm/linux/kernel/git/jgg/iommufd | Linus Torvalds | 1 | -2/+1

Pull iommufd updates from Jason Gunthorpe:
 "No major functionality this cycle:

  - iommufd part of the domain_alloc_paging_flags() conversion

  - Move IOMMU_HWPT_FAULT_ID_VALID processing out of drivers

  - Increase a timeout waiting for other threads to drop transient refcounts that syzkaller was hitting

  - Fix a UBSAN hit in iova_bitmap due to shift out of bounds

  - Add missing cleanup of fault events during FD shutdown, fixing a memory leak

  - Improve the fault delivery flow to have a smaller locking critical region that does not include copy_to_user()

  - Fix 32 bit ABI breakage due to missed implicit padding, and fix the stack memory leakage"

* tag 'for-linus-iommufd' of git://git.kernel.org/pub/scm/linux/kernel/git/jgg/iommufd:
  iommufd: Fix struct iommu_hwpt_pgfault init and padding
  iommufd/fault: Use a separate spinlock to protect fault->deliver list
  iommufd/fault: Destroy response and mutex in iommufd_fault_destroy()
  iommufd: Keep OBJ/IOCTL lists in an alphabetical order
  iommufd/iova_bitmap: Fix shift-out-of-bounds in iova_bitmap_offset_to_index()
  iommu: iommufd: fix WARNING in iommufd_device_unbind
  iommufd: Deal with IOMMU_HWPT_FAULT_ID_VALID in iommufd core
  iommufd/selftest: Remove domain_alloc_paging()

2025-01-24 | Merge tag 'iommu-updates-v6.14' of git://git.kernel.org/pub/scm/linux/kernel/git/iommu/linux | Linus Torvalds | 8 | -391/+53

Pull iommu updates from Joerg Roedel:
 "Core changes:
  - PASID support for the blocked_domain

 ARM-SMMU Updates:
  - SMMUv2:
    - Implement per-client prefetcher configuration on Qualcomm SoCs
    - Support for the Adreno SMMU on Qualcomm's SDM670 SOC
  - SMMUv3:
    - Pretty-printing of event records
    - Drop the ->domain_alloc_paging implementation in favour of domain_alloc_paging_flags(flags==0)
  - IO-PGTable:
    - Generalisation of the page-table walker to enable external walkers (e.g. for debugging unexpected page-faults from the GPU)
    - Minor fix for handling concatenated PGDs at stage-2 with 16KiB pages
  - Misc:
    - Clean-up device probing and replace the crufty probe-deferral hack with a more robust implementation of arm_smmu_get_by_fwnode()
    - Device-tree binding updates for a bunch of Qualcomm platforms

 Intel VT-d Updates:
  - Remove domain_alloc_paging()
  - Remove capability audit code
  - Draining PRQ in sva unbind path when FPD bit set
  - Link cache tags of same iommu unit together

 AMD-Vi Updates:
  - Use CMPXCHG128 to update DTE
  - Cleanups of the domain_alloc_paging() path

 RiscV IOMMU:
  - Platform MSI support
  - Shutdown support

 Rockchip IOMMU:
  - Add DT bindings for Rockchip RK3576

 More smaller fixes and cleanups"

* tag 'iommu-updates-v6.14' of git://git.kernel.org/pub/scm/linux/kernel/git/iommu/linux: (66 commits)
  iommu: Use str_enable_disable-like helpers
  iommu/amd: Fully decode all combinations of alloc_paging_flags
  iommu/amd: Move the nid to pdom_setup_pgtable()
  iommu/amd: Change amd_iommu_pgtable to use enum protection_domain_mode
  iommu/amd: Remove type argument from do_iommu_domain_alloc() and related
  iommu/amd: Remove dev == NULL checks
  iommu/amd: Remove domain_alloc()
  iommu/amd: Remove unused amd_iommu_domain_update()
  iommu/riscv: Fixup compile warning
  iommu/arm-smmu-v3: Add missing #include of linux/string_choices.h
  iommu/arm-smmu-v3: Use str_read_write helper w/ logs
  iommu/io-pgtable-arm: Add way to debug pgtable walk
  iommu/io-pgtable-arm: Re-use the pgtable walk for iova_to_phys
  iommu/io-pgtable-arm: Make pgtable walker more generic
  iommu/arm-smmu: Add ACTLR data and support for qcom_smmu_500
  iommu/arm-smmu: Introduce ACTLR custom prefetcher settings
  iommu/arm-smmu: Add support for PRR bit setup
  iommu/arm-smmu: Refactor qcom_smmu structure to include single pointer
  iommu/arm-smmu: Re-enable context caching in smmu reset operation
  iommu/vt-d: Link cache tags of same iommu unit together
  ...

2025-01-22 | Merge tag 'irq-core-2025-01-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip | Linus Torvalds | 1 | -1/+0

Pull interrupt subsystem updates from Thomas Gleixner:

 - Consolidate the machine_kexec_mask_interrupts() by providing a generic implementation and replacing the copy & pasta orgy in the relevant architectures.

 - Prevent unconditional operations on interrupt chips during kexec shutdown, which can trigger warnings in certain cases when the underlying interrupt has been shut down before.

 - Make the enforcement of interrupt handling in interrupt context unconditionally available, so that it actually works for non x86 related interrupt chips. The earlier enablement for ARM GIC chips set the required chip flag, but did not notice that the check was hidden behind a config switch which is not selected by ARM[64].

 - Decrapify the handling of deferred interrupt affinity setting. Some interrupt chips require that affinity changes are made from the context of handling an interrupt to avoid certain race conditions. For x86 this was the default, but with interrupt remapping this requirement was lifted and a flag was introduced which tells the core code that affinity changes can be done in any context. Unrestricted affinity changes are the default for the majority of interrupt chips. RISCV has the requirement to add the deferred mode to one of it's interrupt controllers, but with the original implementation this would require to add the any context flag to all other RISC-V interrupt chips. That's backwards, so reverse the logic and require that chips, which need the deferred mode have to be marked accordingly. That avoids chasing the 'sane' chips and marking them.

 - Add multi-node support to the Loongarch AVEC interrupt controller driver.

 - The usual tiny cleanups, fixes and improvements all over the place.

* tag 'irq-core-2025-01-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  genirq/generic_chip: Export irq_gc_mask_disable_and_ack_set()
  genirq/timings: Add kernel-doc for a function parameter
  genirq: Remove IRQ_MOVE_PCNTXT and related code
  x86/apic: Convert to IRQCHIP_MOVE_DEFERRED
  genirq: Provide IRQCHIP_MOVE_DEFERRED
  hexagon: Remove GENERIC_PENDING_IRQ leftover
  ARC: Remove GENERIC_PENDING_IRQ
  genirq: Remove handle_enforce_irqctx() wrapper
  genirq: Make handle_enforce_irqctx() unconditionally available
  irqchip/loongarch-avec: Add multi-nodes topology support
  irqchip/ts4800: Replace seq_printf() by seq_puts()
  irqchip/ti-sci-inta : Add module build support
  irqchip/ti-sci-intr: Add module build support
  irqchip/irq-brcmstb-l2: Replace brcmstb_l2_mask_and_ack() by generic function
  irqchip: keystone: Use syscon_regmap_lookup_by_phandle_args
  genirq/kexec: Prevent redundant IRQ masking by checking state before shutdown
  kexec: Consolidate machine_kexec_mask_interrupts() implementation
  genirq: Reuse irq_thread_fn() for forced thread case
  genirq: Move irq_thread_fn() further up in the code