kernel/linux.git/drivers/iommu, branch v6.18.21

iommu/sva: Fix crash in iommu_sva_unbind_device()

2026-03-25T10:10:45+00:00

[ Upstream commit 06e14c36e20b48171df13d51b89fe67c594ed07a ] domain->mm->iommu_mm can be freed by iommu_domain_free(): iommu_domain_free() mmdrop() __mmdrop() mm_pasid_drop() After iommu_domain_free() returns, accessing domain->mm->iommu_mm may dereference a freed mm structure, leading to a crash. Fix this by moving the code that accesses domain->mm->iommu_mm to before the call to iommu_domain_free(). Fixes: e37d5a2d60a3 ("iommu/sva: invalidate stale IOTLB entries for kernel address space") Signed-off-by: Lizhi Hou Reviewed-by: Jason Gunthorpe Reviewed-by: Yi Liu Reviewed-by: Vasant Hegde Reviewed-by: Lu Baolu Signed-off-by: Joerg Roedel Signed-off-by: Sasha Levin

iommu/vt-d: Only handle IOPF for SVA when PRI is supported

2026-03-25T10:10:34+00:00

commit 39c20c4e83b9f78988541d829aa34668904e54a0 upstream. In intel_svm_set_dev_pasid(), the driver unconditionally manages the IOPF handling during a domain transition. However, commit a86fb7717320 ("iommu/vt-d: Allow SVA with device-specific IOPF") introduced support for SVA on devices that handle page faults internally without utilizing the PCI PRI. On such devices, the IOMMU-side IOPF infrastructure is not required. Calling iopf_for_domain_replace() on these devices is incorrect and can lead to unexpected failures during PASID attachment or unwinding. Add a check for info->pri_supported to ensure that the IOPF queue logic is only invoked for devices that actually rely on the IOMMU's PRI-based fault handling. Fixes: 17fce9d2336d ("iommu/vt-d: Put iopf enablement in domain attach path") Cc: stable@vger.kernel.org Suggested-by: Kevin Tian Reviewed-by: Kevin Tian Signed-off-by: Lu Baolu Link: https://lore.kernel.org/r/20260310075520.295104-1-baolu.lu@linux.intel.com Signed-off-by: Joerg Roedel Signed-off-by: Greg Kroah-Hartman

iommu/vt-d: Fix intel iommu iotlb sync hardlockup and retry

2026-03-25T10:10:34+00:00

commit fe89277c9ceb0d6af0aa665bcf24a41d8b1b79cd upstream. During the qi_check_fault process after an IOMMU ITE event, requests at odd-numbered positions in the queue are set to QI_ABORT, only satisfying single-request submissions. However, qi_submit_sync now supports multiple simultaneous submissions, and can't guarantee that the wait_desc will be at an odd-numbered position. Therefore, if an item times out, IOMMU can't re-initiate the request, resulting in an infinite polling wait. This modifies the process by setting the status of all requests already fetched by IOMMU and recorded as QI_IN_USE status (including wait_desc requests) to QI_ABORT, thus enabling multiple requests to be resubmitted. Fixes: 8a1d82462540 ("iommu/vt-d: Multiple descriptors per qi_submit_sync()") Cc: stable@vger.kernel.org Signed-off-by: Guanghui Feng Tested-by: Shuai Xue Reviewed-by: Shuai Xue Reviewed-by: Samiullah Khawaja Link: https://lore.kernel.org/r/20260306101516.3885775-1-guanghuifeng@linux.alibaba.com Signed-off-by: Lu Baolu Fixes: 8a1d82462540 ("iommu/vt-d: Multiple descriptors per qi_submit_sync()") Signed-off-by: Joerg Roedel Signed-off-by: Greg Kroah-Hartman

iommu/vt-d: Skip dev-iotlb flush for inaccessible PCIe device without scalable mode

2026-03-12T11:09:29+00:00

[ Upstream commit 42662d19839f34735b718129ea200e3734b07e50 ] PCIe endpoints with ATS enabled and passed through to userspace (e.g., QEMU, DPDK) can hard-lock the host when their link drops, either by surprise removal or by a link fault. Commit 4fc82cd907ac ("iommu/vt-d: Don't issue ATS Invalidation request when device is disconnected") adds pci_dev_is_disconnected() to devtlb_invalidation_with_pasid() so ATS invalidation is skipped only when the device is being safely removed, but it applies only when Intel IOMMU scalable mode is enabled. With scalable mode disabled or unsupported, a system hard-lock occurs when a PCIe endpoint's link drops because the Intel IOMMU waits indefinitely for an ATS invalidation that cannot complete. Call Trace: qi_submit_sync qi_flush_dev_iotlb __context_flush_dev_iotlb.part.0 domain_context_clear_one_cb pci_for_each_dma_alias device_block_translation blocking_domain_attach_dev iommu_deinit_device __iommu_group_remove_device iommu_release_device iommu_bus_notifier blocking_notifier_call_chain bus_notify device_del pci_remove_bus_device pci_stop_and_remove_bus_device pciehp_unconfigure_device pciehp_disable_slot pciehp_handle_presence_or_link_change pciehp_ist Commit 81e921fd3216 ("iommu/vt-d: Fix NULL domain on device release") adds intel_pasid_teardown_sm_context() to intel_iommu_release_device(), which calls qi_flush_dev_iotlb() and can also hard-lock the system when a PCIe endpoint's link drops. Call Trace: qi_submit_sync qi_flush_dev_iotlb __context_flush_dev_iotlb.part.0 intel_context_flush_no_pasid device_pasid_table_teardown pci_pasid_table_teardown pci_for_each_dma_alias intel_pasid_teardown_sm_context intel_iommu_release_device iommu_deinit_device __iommu_group_remove_device iommu_release_device iommu_bus_notifier blocking_notifier_call_chain bus_notify device_del pci_remove_bus_device pci_stop_and_remove_bus_device pciehp_unconfigure_device pciehp_disable_slot pciehp_handle_presence_or_link_change pciehp_ist Sometimes the endpoint loses connection without a link-down event (e.g., due to a link fault); killing the process (virsh destroy) then hard-locks the host. Call Trace: qi_submit_sync qi_flush_dev_iotlb __context_flush_dev_iotlb.part.0 domain_context_clear_one_cb pci_for_each_dma_alias device_block_translation blocking_domain_attach_dev __iommu_attach_device __iommu_device_set_domain __iommu_group_set_domain_internal iommu_detach_group vfio_iommu_type1_detach_group vfio_group_detach_container vfio_group_fops_release __fput pci_dev_is_disconnected() only covers safe-removal paths; pci_device_is_present() tests accessibility by reading vendor/device IDs and internally calls pci_dev_is_disconnected(). On a ConnectX-5 (8 GT/s, x2) this costs ~70 µs. Since __context_flush_dev_iotlb() is only called on {attach,release}_dev paths (not hot), add pci_device_is_present() there to skip inaccessible devices and avoid the hard-lock. Fixes: 37764b952e1b ("iommu/vt-d: Global devTLB flush when present context entry changed") Fixes: 81e921fd3216 ("iommu/vt-d: Fix NULL domain on device release") Cc: stable@vger.kernel.org Signed-off-by: Jinhui Guo Link: https://lore.kernel.org/r/20251211035946.2071-2-guojinhui.liam@bytedance.com Signed-off-by: Lu Baolu Signed-off-by: Joerg Roedel Signed-off-by: Sasha Levin

iommu/arm-smmu-v3: Do not set disable_ats unless vSTE is Translate

2026-03-04T12:21:14+00:00

[ Upstream commit a45dd34663025c75652b27e384e91c9c05ba1d80 ] A vSTE may have three configuration types: Abort, Bypass, and Translate. An Abort vSTE wouldn't enable ATS, but the other two might. It makes sense for a Transalte vSTE to rely on the guest vSTE.EATS field. For a Bypass vSTE, it would end up with an S2-only physical STE, similar to an attachment to a regular S2 domain. However, the nested case always disables ATS following the Bypass vSTE, while the regular S2 case always enables ATS so long as arm_smmu_ats_supported(master) == true. Note that ATS is needed for certain VM centric workloads and historically non-vSMMU cases have relied on this automatic enablement. So, having the nested case behave differently causes problems. To fix that, add a condition to disable_ats, so that it might enable ATS for a Bypass vSTE, aligning with the regular S2 case. Fixes: f27298a82ba0 ("iommu/arm-smmu-v3: Allow ATS for IOMMU_DOMAIN_NESTED") Cc: stable@vger.kernel.org Suggested-by: Jason Gunthorpe Signed-off-by: Nicolin Chen Reviewed-by: Pranjal Shrivastava Reviewed-by: Jason Gunthorpe Signed-off-by: Will Deacon Signed-off-by: Sasha Levin

iommu/arm-smmu-v3: Mark EATS_TRANS safe when computing the update sequence

2026-03-04T12:21:14+00:00

[ Upstream commit 7cad800485956a263318930613f8f4a084af8c70 ] If VM wants to toggle EATS_TRANS off at the same time as changing the CFG, hypervisor will see EATS change to 0 and insert a V=0 breaking update into the STE even though the VM did not ask for that. In bare metal, EATS_TRANS is ignored by CFG=ABORT/BYPASS, which is why this does not cause a problem until we have the nested case where CFG is always a variation of S2 trans that does use EATS_TRANS. Relax the rules for EATS_TRANS sequencing, we don't need it to be exact as the enclosing code will always disable ATS at the PCI device when changing EATS_TRANS. This ensures there are no ATS transactions that can race with an EATS_TRANS change so we don't need to carefully sequence these bits. Fixes: 1e8be08d1c91 ("iommu/arm-smmu-v3: Support IOMMU_DOMAIN_NESTED") Cc: stable@vger.kernel.org Signed-off-by: Jason Gunthorpe Reviewed-by: Shuai Xue Signed-off-by: Nicolin Chen Signed-off-by: Will Deacon Signed-off-by: Sasha Levin

iommu/arm-smmu-v3: Mark STE MEV safe when computing the update sequence

2026-03-04T12:21:14+00:00

[ Upstream commit f3c1d372dbb8e5a86923f20db66deabef42bfc9d ] Nested CD tables set the MEV bit to try to reduce multi-fault spamming on the hypervisor. Since MEV is in STE word 1 this causes a breaking update sequence that is not required and impacts real workloads. For the purposes of STE updates the value of MEV doesn't matter, if it is set/cleared early or late it just results in a change to the fault reports that must be supported by the kernel anyhow. The spec says: Note: Software must expect, and be able to deal with, coalesced fault records even when MEV == 0. So mark STE MEV safe when computing the update sequence, to avoid creating a breaking update. Fixes: da0c56520e88 ("iommu/arm-smmu-v3: Set MEV bit in nested STE for DoS mitigations") Cc: stable@vger.kernel.org Signed-off-by: Jason Gunthorpe Reviewed-by: Shuai Xue Reviewed-by: Mostafa Saleh Reviewed-by: Pranjal Shrivastava Signed-off-by: Nicolin Chen Signed-off-by: Will Deacon Signed-off-by: Sasha Levin

iommu/arm-smmu-v3: Add update_safe bits to fix STE update sequence

2026-03-04T12:21:14+00:00

[ Upstream commit 2781f2a930abb5d27f80b8afbabfa19684833b65 ] C_BAD_STE was observed when updating nested STE from an S1-bypass mode to an S1DSS-bypass mode. As both modes enabled S2, the used bit is slightly different than the normal S1-bypass and S1DSS-bypass modes. As a result, fields like MEV and EATS in S2's used list marked the word1 as a critical word that requested a STE.V=0. This breaks a hitless update. However, both MEV and EATS aren't critical in terms of STE update. One controls the merge of the events and the other controls the ATS that is managed by the driver at the same time via pci_enable_ats(). Add an arm_smmu_get_ste_update_safe() to allow STE update algorithm to relax those fields, avoiding the STE update breakages. After this change, entry_set has no caller checking its return value, so change it to void. Note that this change is required by both MEV and EATS fields, which were introduced in different kernel versions. So add get_update_safe() first. MEV and EATS will be added to arm_smmu_get_ste_update_safe() separately. Fixes: 1e8be08d1c91 ("iommu/arm-smmu-v3: Support IOMMU_DOMAIN_NESTED") Cc: stable@vger.kernel.org Signed-off-by: Jason Gunthorpe Reviewed-by: Shuai Xue Reviewed-by: Mostafa Saleh Reviewed-by: Pranjal Shrivastava Signed-off-by: Nicolin Chen Signed-off-by: Will Deacon Signed-off-by: Sasha Levin

iommu/vt-d: Flush piotlb for SVM and Nested domain

2026-03-04T12:21:12+00:00

[ Upstream commit 04b1b069f151e793767755f58b51670bff00cbc1 ] Besides the paging domains that use FS, SVM and Nested domains need to use piotlb invalidation descriptor as well. Fixes: b33125296b50 ("iommu/vt-d: Create unique domain ops for each stage") Cc: stable@vger.kernel.org Signed-off-by: Yi Liu Reviewed-by: Kevin Tian Link: https://lore.kernel.org/r/20251223065824.6164-1-yi.l.liu@intel.com Signed-off-by: Lu Baolu Signed-off-by: Joerg Roedel Signed-off-by: Sasha Levin

iommu/vt-d: Flush dev-IOTLB only when PCIe device is accessible in scalable mode

2026-03-04T12:21:12+00:00

[ Upstream commit 10e60d87813989e20eac1f3eda30b3bae461e7f9 ] Commit 4fc82cd907ac ("iommu/vt-d: Don't issue ATS Invalidation request when device is disconnected") relies on pci_dev_is_disconnected() to skip ATS invalidation for safely-removed devices, but it does not cover link-down caused by faults, which can still hard-lock the system. For example, if a VM fails to connect to the PCIe device, "virsh destroy" is executed to release resources and isolate the fault, but a hard-lockup occurs while releasing the group fd. Call Trace: qi_submit_sync qi_flush_dev_iotlb intel_pasid_tear_down_entry device_block_translation blocking_domain_attach_dev __iommu_attach_device __iommu_device_set_domain __iommu_group_set_domain_internal iommu_detach_group vfio_iommu_type1_detach_group vfio_group_detach_container vfio_group_fops_release __fput Although pci_device_is_present() is slower than pci_dev_is_disconnected(), it still takes only ~70 µs on a ConnectX-5 (8 GT/s, x2) and becomes even faster as PCIe speed and width increase. Besides, devtlb_invalidation_with_pasid() is called only in the paths below, which are far less frequent than memory map/unmap. 1. mm-struct release 2. {attach,release}_dev 3. set/remove PASID 4. dirty-tracking setup The gain in system stability far outweighs the negligible cost of using pci_device_is_present() instead of pci_dev_is_disconnected() to decide when to skip ATS invalidation, especially under GDR high-load conditions. Fixes: 4fc82cd907ac ("iommu/vt-d: Don't issue ATS Invalidation request when device is disconnected") Cc: stable@vger.kernel.org Signed-off-by: Jinhui Guo Link: https://lore.kernel.org/r/20251211035946.2071-3-guojinhui.liam@bytedance.com Signed-off-by: Lu Baolu Signed-off-by: Joerg Roedel Signed-off-by: Sasha Levin