summaryrefslogtreecommitdiff
path: root/drivers/iommu
AgeCommit message (Collapse)AuthorFilesLines
2026-03-13iommu/vt-d: Skip dev-iotlb flush for inaccessible PCIe device without ↵Jinhui Guo1-0/+8
scalable mode [ Upstream commit 42662d19839f34735b718129ea200e3734b07e50 ] PCIe endpoints with ATS enabled and passed through to userspace (e.g., QEMU, DPDK) can hard-lock the host when their link drops, either by surprise removal or by a link fault. Commit 4fc82cd907ac ("iommu/vt-d: Don't issue ATS Invalidation request when device is disconnected") adds pci_dev_is_disconnected() to devtlb_invalidation_with_pasid() so ATS invalidation is skipped only when the device is being safely removed, but it applies only when Intel IOMMU scalable mode is enabled. With scalable mode disabled or unsupported, a system hard-lock occurs when a PCIe endpoint's link drops because the Intel IOMMU waits indefinitely for an ATS invalidation that cannot complete. Call Trace: qi_submit_sync qi_flush_dev_iotlb __context_flush_dev_iotlb.part.0 domain_context_clear_one_cb pci_for_each_dma_alias device_block_translation blocking_domain_attach_dev iommu_deinit_device __iommu_group_remove_device iommu_release_device iommu_bus_notifier blocking_notifier_call_chain bus_notify device_del pci_remove_bus_device pci_stop_and_remove_bus_device pciehp_unconfigure_device pciehp_disable_slot pciehp_handle_presence_or_link_change pciehp_ist Commit 81e921fd3216 ("iommu/vt-d: Fix NULL domain on device release") adds intel_pasid_teardown_sm_context() to intel_iommu_release_device(), which calls qi_flush_dev_iotlb() and can also hard-lock the system when a PCIe endpoint's link drops. Call Trace: qi_submit_sync qi_flush_dev_iotlb __context_flush_dev_iotlb.part.0 intel_context_flush_no_pasid device_pasid_table_teardown pci_pasid_table_teardown pci_for_each_dma_alias intel_pasid_teardown_sm_context intel_iommu_release_device iommu_deinit_device __iommu_group_remove_device iommu_release_device iommu_bus_notifier blocking_notifier_call_chain bus_notify device_del pci_remove_bus_device pci_stop_and_remove_bus_device pciehp_unconfigure_device pciehp_disable_slot pciehp_handle_presence_or_link_change pciehp_ist Sometimes the endpoint loses connection without a link-down event (e.g., due to a link fault); killing the process (virsh destroy) then hard-locks the host. Call Trace: qi_submit_sync qi_flush_dev_iotlb __context_flush_dev_iotlb.part.0 domain_context_clear_one_cb pci_for_each_dma_alias device_block_translation blocking_domain_attach_dev __iommu_attach_device __iommu_device_set_domain __iommu_group_set_domain_internal iommu_detach_group vfio_iommu_type1_detach_group vfio_group_detach_container vfio_group_fops_release __fput pci_dev_is_disconnected() only covers safe-removal paths; pci_device_is_present() tests accessibility by reading vendor/device IDs and internally calls pci_dev_is_disconnected(). On a ConnectX-5 (8 GT/s, x2) this costs ~70 µs. Since __context_flush_dev_iotlb() is only called on {attach,release}_dev paths (not hot), add pci_device_is_present() there to skip inaccessible devices and avoid the hard-lock. Fixes: 37764b952e1b ("iommu/vt-d: Global devTLB flush when present context entry changed") Fixes: 81e921fd3216 ("iommu/vt-d: Fix NULL domain on device release") Cc: stable@vger.kernel.org Signed-off-by: Jinhui Guo <guojinhui.liam@bytedance.com> Link: https://lore.kernel.org/r/20251211035946.2071-2-guojinhui.liam@bytedance.com Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com> Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-04iommu/vt-d: Flush dev-IOTLB only when PCIe device is accessible in scalable modeJinhui Guo1-1/+1
[ Upstream commit 10e60d87813989e20eac1f3eda30b3bae461e7f9 ] Commit 4fc82cd907ac ("iommu/vt-d: Don't issue ATS Invalidation request when device is disconnected") relies on pci_dev_is_disconnected() to skip ATS invalidation for safely-removed devices, but it does not cover link-down caused by faults, which can still hard-lock the system. For example, if a VM fails to connect to the PCIe device, "virsh destroy" is executed to release resources and isolate the fault, but a hard-lockup occurs while releasing the group fd. Call Trace: qi_submit_sync qi_flush_dev_iotlb intel_pasid_tear_down_entry device_block_translation blocking_domain_attach_dev __iommu_attach_device __iommu_device_set_domain __iommu_group_set_domain_internal iommu_detach_group vfio_iommu_type1_detach_group vfio_group_detach_container vfio_group_fops_release __fput Although pci_device_is_present() is slower than pci_dev_is_disconnected(), it still takes only ~70 µs on a ConnectX-5 (8 GT/s, x2) and becomes even faster as PCIe speed and width increase. Besides, devtlb_invalidation_with_pasid() is called only in the paths below, which are far less frequent than memory map/unmap. 1. mm-struct release 2. {attach,release}_dev 3. set/remove PASID 4. dirty-tracking setup The gain in system stability far outweighs the negligible cost of using pci_device_is_present() instead of pci_dev_is_disconnected() to decide when to skip ATS invalidation, especially under GDR high-load conditions. Fixes: 4fc82cd907ac ("iommu/vt-d: Don't issue ATS Invalidation request when device is disconnected") Cc: stable@vger.kernel.org Signed-off-by: Jinhui Guo <guojinhui.liam@bytedance.com> Link: https://lore.kernel.org/r/20251211035946.2071-3-guojinhui.liam@bytedance.com Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com> Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-04iommu/amd: move wait_on_sem() out of spinlockAnkit Soni1-8/+17
[ Upstream commit d2a0cac10597068567d336e85fa3cbdbe8ca62bf ] With iommu.strict=1, the existing completion wait path can cause soft lockups under stressed environment, as wait_on_sem() busy-waits under the spinlock with interrupts disabled. Move the completion wait in iommu_completion_wait() out of the spinlock. wait_on_sem() only polls the hardware-updated cmd_sem and does not require iommu->lock, so holding the lock during the busy wait unnecessarily increases contention and extends the time with interrupts disabled. Signed-off-by: Ankit Soni <Ankit.Soni@amd.com> Reviewed-by: Vasant Hegde <vasant.hegde@amd.com> Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-04iommu/arm-smmu-v3: Improve CMDQ lock fairness and efficiencyAlexander Grest1-10/+21
[ Upstream commit df180b1a4cc51011c5f8c52c7ec02ad2e42962de ] The SMMU CMDQ lock is highly contentious when there are multiple CPUs issuing commands and the queue is nearly full. The lock has the following states: - 0: Unlocked - >0: Shared lock held with count - INT_MIN+N: Exclusive lock held, where N is the # of shared waiters - INT_MIN: Exclusive lock held, no shared waiters When multiple CPUs are polling for space in the queue, they attempt to grab the exclusive lock to update the cons pointer from the hardware. If they fail to get the lock, they will spin until either the cons pointer is updated by another CPU. The current code allows the possibility of shared lock starvation if there is a constant stream of CPUs trying to grab the exclusive lock. This leads to severe latency issues and soft lockups. Consider the following scenario where CPU1's attempt to acquire the shared lock is starved by CPU2 and CPU0 contending for the exclusive lock. CPU0 (exclusive) | CPU1 (shared) | CPU2 (exclusive) | `cmdq->lock` -------------------------------------------------------------------------- trylock() //takes | | | 0 | shared_lock() | | INT_MIN | fetch_inc() | | INT_MIN | no return | | INT_MIN + 1 | spins // VAL >= 0 | | INT_MIN + 1 unlock() | spins... | | INT_MIN + 1 set_release(0) | spins... | | 0 see[NOTE] (done) | (sees 0) | trylock() // takes | 0 | *exits loop* | cmpxchg(0, INT_MIN) | 0 | | *cuts in* | INT_MIN | cmpxchg(0, 1) | | INT_MIN | fails // != 0 | | INT_MIN | spins // VAL >= 0 | | INT_MIN | *starved* | | INT_MIN [NOTE] The current code resets the exclusive lock to 0 regardless of the state of the lock. This causes two problems: 1. It opens the possibility of back-to-back exclusive locks and the downstream effect of starving shared lock. 2. The count of shared lock waiters are lost. To mitigate this, we release the exclusive lock by only clearing the sign bit while retaining the shared lock waiter count as a way to avoid starving the shared lock waiters. Also deleted cmpxchg loop while trying to acquire the shared lock as it is not needed. The waiters can see the positive lock count and proceed immediately after the exclusive lock is released. Exclusive lock is not starved in that submitters will try exclusive lock first when new spaces become available. Reviewed-by: Mostafa Saleh <smostafa@google.com> Reviewed-by: Nicolin Chen <nicolinc@nvidia.com> Signed-off-by: Alexander Grest <Alexander.Grest@microsoft.com> Signed-off-by: Jacob Pan <jacob.pan@linux.microsoft.com> Signed-off-by: Will Deacon <will@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-04iommu/vt-d: Clear Present bit before tearing down PASID entryLu Baolu2-1/+19
[ Upstream commit 75ed00055c059dedc47b5daaaa2f8a7a019138ff ] The Intel VT-d Scalable Mode PASID table entry consists of 512 bits (64 bytes). When tearing down an entry, the current implementation zeros the entire 64-byte structure immediately using multiple 64-bit writes. Since the IOMMU hardware may fetch these 64 bytes using multiple internal transactions (e.g., four 128-bit bursts), updating or zeroing the entire entry while it is active (P=1) risks a "torn" read. If a hardware fetch occurs simultaneously with the CPU zeroing the entry, the hardware could observe an inconsistent state, leading to unpredictable behavior or spurious faults. Follow the "Guidance to Software for Invalidations" in the VT-d spec (Section 6.5.3.3) by implementing the recommended ownership handshake: 1. Clear only the 'Present' (P) bit of the PASID entry. 2. Use a dma_wmb() to ensure the cleared bit is visible to hardware before proceeding. 3. Execute the required invalidation sequence (PASID cache, IOTLB, and Device-TLB flush) to ensure the hardware has released all cached references. 4. Only after the flushes are complete, zero out the remaining fields of the PASID entry. Also, add a dma_wmb() in pasid_set_present() to ensure that all other fields of the PASID entry are visible to the hardware before the Present bit is set. Fixes: 0bbeb01a4faf ("iommu/vt-d: Manage scalalble mode PASID tables") Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com> Reviewed-by: Dmytro Maluka <dmaluka@chromium.org> Reviewed-by: Samiullah Khawaja <skhawaja@google.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Link: https://lore.kernel.org/r/20260120061816.2132558-2-baolu.lu@linux.intel.com Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-04iommu/vt-d: Avoid draining PRQ in sva mm release pathLu Baolu1-1/+2
[ Upstream commit dda2b8c3c6ccc50deae65cc75f246577348e2ec5 ] When a PASID is used for SVA by a device, it's possible that the PASID entry is cleared before the device flushes all ongoing DMA requests and removes the SVA domain. This can occur when an exception happens and the process terminates before the device driver stops DMA and calls the iommu driver to unbind the PASID. There's no need to drain the PRQ in the mm release path. Instead, the PRQ will be drained in the SVA unbind path. Unfortunately, commit c43e1ccdebf2 ("iommu/vt-d: Drain PRQs when domain removed from RID") changed this behavior by unconditionally draining the PRQ in intel_pasid_tear_down_entry(). This can lead to a potential sleeping-in-atomic-context issue. Smatch static checker warning: drivers/iommu/intel/prq.c:95 intel_iommu_drain_pasid_prq() warn: sleeping in atomic context To avoid this issue, prevent draining the PRQ in the SVA mm release path and restore the previous behavior. Fixes: c43e1ccdebf2 ("iommu/vt-d: Drain PRQs when domain removed from RID") Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Closes: https://lore.kernel.org/linux-iommu/c5187676-2fa2-4e29-94e0-4a279dc88b49@stanley.mountain/ Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Link: https://lore.kernel.org/r/20241212021529.1104745-1-baolu.lu@linux.intel.com Signed-off-by: Joerg Roedel <jroedel@suse.de> Stable-dep-of: 75ed00055c05 ("iommu/vt-d: Clear Present bit before tearing down PASID entry") Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-04iommu/vt-d: Drain PRQs when domain removed from RIDLu Baolu3-18/+10
[ Upstream commit c43e1ccdebf2c950545fdf12c5796ad6f7bad7ee ] As this iommu driver now supports page faults for requests without PASID, page requests should be drained when a domain is removed from the RID2PASID entry. This results in the intel_iommu_drain_pasid_prq() call being moved to intel_pasid_tear_down_entry(). This indicates that when a translation is removed from any PASID entry and the PRI has been enabled on the device, page requests are drained in the domain detachment path. The intel_iommu_drain_pasid_prq() helper has been modified to support sending device TLB invalidation requests for both PASID and non-PASID cases. Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com> Reviewed-by: Yi Liu <yi.l.liu@intel.com> Link: https://lore.kernel.org/r/20241101045543.70086-1-baolu.lu@linux.intel.com Signed-off-by: Joerg Roedel <jroedel@suse.de> Stable-dep-of: 75ed00055c05 ("iommu/vt-d: Clear Present bit before tearing down PASID entry") Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-04iommu/vt-d: Separate page request queue from SVMJoel Granados5-419/+424
[ Upstream commit 4d5440957641fb5652cbef8df6183baf473cab6b ] IO page faults are no longer dependent on CONFIG_INTEL_IOMMU_SVM. Move all Page Request Queue (PRQ) functions that handle prq events to a new file in drivers/iommu/intel/prq.c. The page_req_des struct is now declared in drivers/iommu/intel/prq.c. No functional changes are intended. This is a preparation patch to enable the use of IO page faults outside the SVM/PASID use cases. Signed-off-by: Joel Granados <joel.granados@kernel.org> Link: https://lore.kernel.org/r/20241015-jag-iopfv8-v4-1-b696ca89ba29@kernel.org Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com> Signed-off-by: Joerg Roedel <jroedel@suse.de> Stable-dep-of: 75ed00055c05 ("iommu/vt-d: Clear Present bit before tearing down PASID entry") Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-04iommu/vt-d: Flush cache for PASID table before using itDmytro Maluka1-3/+4
[ Upstream commit 22d169bdd2849fe6bd18c2643742e1c02be6451c ] When writing the address of a freshly allocated zero-initialized PASID table to a PASID directory entry, do that after the CPU cache flush for this PASID table, not before it, to avoid the time window when this PASID table may be already used by non-coherent IOMMU hardware while its contents in RAM is still some random old data, not zero-initialized. Fixes: 194b3348bdbb ("iommu/vt-d: Fix PASID directory pointer coherency") Signed-off-by: Dmytro Maluka <dmaluka@chromium.org> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Link: https://lore.kernel.org/r/20251221123508.37495-1-dmaluka@chromium.org Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com> Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-02-19iommu/arm-smmu-qcom: do not register driver in probe()Danilo Krummrich4-5/+52
commit ed1ac3c977dd6b119405fa36dd41f7151bd5b4de upstream. Commit 0b4eeee2876f ("iommu/arm-smmu-qcom: Register the TBU driver in qcom_smmu_impl_init") intended to also probe the TBU driver when CONFIG_ARM_SMMU_QCOM_DEBUG is disabled, but also moved the corresponding platform_driver_register() call into qcom_smmu_impl_init() which is called from arm_smmu_device_probe(). However, it neither makes sense to register drivers from probe() callbacks of other drivers, nor does the driver core allow registering drivers with a device lock already being held. The latter was revealed by commit dc23806a7c47 ("driver core: enforce device_lock for driver_match_device()") leading to a deadlock condition described in [1]. Additionally, it was noted by Robin that the current approach is potentially racy with async probe [2]. Hence, fix this by registering the qcom_smmu_tbu_driver from module_init(). Unfortunately, due to the vendoring of the driver, this requires an indirection through arm-smmu-impl.c. Reported-by: Mark Brown <broonie@kernel.org> Closes: https://lore.kernel.org/lkml/7ae38e31-ef31-43ad-9106-7c76ea0e8596@sirena.org.uk/ Link: https://lore.kernel.org/lkml/DFU7CEPUSG9A.1KKGVW4HIPMSH@kernel.org/ [1] Link: https://lore.kernel.org/lkml/0c0d3707-9ea5-44f9-88a1-a65c62e3df8d@arm.com/ [2] Fixes: dc23806a7c47 ("driver core: enforce device_lock for driver_match_device()") Fixes: 0b4eeee2876f ("iommu/arm-smmu-qcom: Register the TBU driver in qcom_smmu_impl_init") Acked-by: Robin Murphy <robin.murphy@arm.com> Tested-by: Bjorn Andersson <andersson@kernel.org> Reviewed-by: Bjorn Andersson <andersson@kernel.org> Acked-by: Konrad Dybcio <konradybcio@kernel.org> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Tested-by: Ioana Ciornei <ioana.ciornei@nxp.com> #LX2160ARDB Tested-by: Wang Jiayue <akaieurus@gmail.com> Reviewed-by: Wang Jiayue <akaieurus@gmail.com> Tested-by: Mark Brown <broonie@kernel.org> Acked-by: Joerg Roedel <joerg.roedel@amd.com> Link: https://patch.msgid.link/20260121141215.29658-1-dakr@kernel.org Signed-off-by: Danilo Krummrich <dakr@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-01-11Revert "iommu/amd: Skip enabling command/event buffers for kdump"Greg Kroah-Hartman1-19/+9
This reverts commit 3344716ddee01d7caa35b470d30fd87781c661b1 which is commit 9be15fbfc6c5c89c22cf6e209f66ea43ee0e58bb upstream. This causes problems in older kernel trees as SNP host kdump is not supported in them, so drop it from the stable branches. Reported-by: Ashish Kalra <ashish.kalra@amd.com> Link: https://lore.kernel.org/r/dacdff7f-0606-4ed5-b056-2de564404d51@amd.com Cc: Vasant Hegde <vasant.hegde@amd.com> Cc: Sairaj Kodilkar <sarunkod@amd.com> Cc: Joerg Roedel <joerg.roedel@amd.com> Cc: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-01-08iommu: disable SVA when CONFIG_X86 is setLu Baolu1-0/+3
commit 72f98ef9a4be30d2a60136dd6faee376f780d06c upstream. Patch series "Fix stale IOTLB entries for kernel address space", v7. This proposes a fix for a security vulnerability related to IOMMU Shared Virtual Addressing (SVA). In an SVA context, an IOMMU can cache kernel page table entries. When a kernel page table page is freed and reallocated for another purpose, the IOMMU might still hold stale, incorrect entries. This can be exploited to cause a use-after-free or write-after-free condition, potentially leading to privilege escalation or data corruption. This solution introduces a deferred freeing mechanism for kernel page table pages, which provides a safe window to notify the IOMMU to invalidate its caches before the page is reused. This patch (of 8): In the IOMMU Shared Virtual Addressing (SVA) context, the IOMMU hardware shares and walks the CPU's page tables. The x86 architecture maps the kernel's virtual address space into the upper portion of every process's page table. Consequently, in an SVA context, the IOMMU hardware can walk and cache kernel page table entries. The Linux kernel currently lacks a notification mechanism for kernel page table changes, specifically when page table pages are freed and reused. The IOMMU driver is only notified of changes to user virtual address mappings. This can cause the IOMMU's internal caches to retain stale entries for kernel VA. Use-After-Free (UAF) and Write-After-Free (WAF) conditions arise when kernel page table pages are freed and later reallocated. The IOMMU could misinterpret the new data as valid page table entries. The IOMMU might then walk into attacker-controlled memory, leading to arbitrary physical memory DMA access or privilege escalation. This is also a Write-After-Free issue, as the IOMMU will potentially continue to write Accessed and Dirty bits to the freed memory while attempting to walk the stale page tables. Currently, SVA contexts are unprivileged and cannot access kernel mappings. However, the IOMMU will still walk kernel-only page tables all the way down to the leaf entries, where it realizes the mapping is for the kernel and errors out. This means the IOMMU still caches these intermediate page table entries, making the described vulnerability a real concern. Disable SVA on x86 architecture until the IOMMU can receive notification to flush the paging cache before freeing the CPU kernel page table pages. Link: https://lkml.kernel.org/r/20251022082635.2462433-1-baolu.lu@linux.intel.com Link: https://lkml.kernel.org/r/20251022082635.2462433-2-baolu.lu@linux.intel.com Fixes: 26b25a2b98e4 ("iommu: Bind process address spaces to devices") Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com> Suggested-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Betkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Jean-Philippe Brucker <jean-philippe@linaro.org> Cc: Joerg Roedel <joro@8bytes.org> Cc: Kevin Tian <kevin.tian@intel.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mike Rapoport <rppt@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Robin Murohy <robin.murphy@arm.com> Cc: Thomas Gleinxer <tglx@linutronix.de> Cc: "Uladzislau Rezki (Sony)" <urezki@gmail.com> Cc: Vasant Hegde <vasant.hegde@amd.com> Cc: Vinicius Costa Gomes <vinicius.gomes@intel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Cc: Yi Lai <yi1.lai@intel.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-01-08iommu/tegra: fix device leak on probe_device()Johan Hovold1-3/+2
commit c08934a61201db8f1d1c66fcc63fb2eb526b656d upstream. Make sure to drop the reference taken to the iommu platform device when looking up its driver data during probe_device(). Note that commit 9826e393e4a8 ("iommu/tegra-smmu: Fix missing put_device() call in tegra_smmu_find") fixed the leak in an error path, but the reference is still leaking on success. Fixes: 891846516317 ("memory: Add NVIDIA Tegra memory controller support") Cc: stable@vger.kernel.org # 3.19: 9826e393e4a8 Cc: Miaoqian Lin <linmq006@gmail.com> Acked-by: Robin Murphy <robin.murphy@arm.com> Acked-by: Thierry Reding <treding@nvidia.com> Signed-off-by: Johan Hovold <johan@kernel.org> Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-01-08iommu/sun50i: fix device leak on of_xlate()Johan Hovold1-0/+2
commit f916109bf53864605d10bf6f4215afa023a80406 upstream. Make sure to drop the reference taken to the iommu platform device when looking up its driver data during of_xlate(). Fixes: 4100b8c229b3 ("iommu: Add Allwinner H6 IOMMU driver") Cc: stable@vger.kernel.org # 5.8 Cc: Maxime Ripard <mripard@kernel.org> Acked-by: Robin Murphy <robin.murphy@arm.com> Signed-off-by: Johan Hovold <johan@kernel.org> Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-01-08iommu/qcom: fix device leak on of_xlate()Johan Hovold1-6/+4
commit 6a3908ce56e6879920b44ef136252b2f0c954194 upstream. Make sure to drop the reference taken to the iommu platform device when looking up its driver data during of_xlate(). Note that commit e2eae09939a8 ("iommu/qcom: add missing put_device() call in qcom_iommu_of_xlate()") fixed the leak in a couple of error paths, but the reference is still leaking on success and late failures. Fixes: 0ae349a0f33f ("iommu/qcom: Add qcom_iommu") Cc: stable@vger.kernel.org # 4.14: e2eae09939a8 Cc: Rob Clark <robin.clark@oss.qualcomm.com> Cc: Yu Kuai <yukuai3@huawei.com> Acked-by: Robin Murphy <robin.murphy@arm.com> Signed-off-by: Johan Hovold <johan@kernel.org> Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-01-08iommu/omap: fix device leaks on probe_device()Johan Hovold2-3/+1
commit b5870691065e6bbe6ba0650c0412636c6a239c5a upstream. Make sure to drop the references taken to the iommu platform devices when looking up their driver data during probe_device(). Note that the arch data device pointer added by commit 604629bcb505 ("iommu/omap: add support for late attachment of iommu devices") has never been used. Remove it to underline that the references are not needed. Fixes: 9d5018deec86 ("iommu/omap: Add support to program multiple iommus") Fixes: 7d6827748d54 ("iommu/omap: Fix iommu archdata name for DT-based devices") Cc: stable@vger.kernel.org # 3.18 Cc: Suman Anna <s-anna@ti.com> Acked-by: Robin Murphy <robin.murphy@arm.com> Signed-off-by: Johan Hovold <johan@kernel.org> Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-01-08iommu/mediatek: fix device leak on of_xlate()Johan Hovold1-0/+2
commit b3f1ee18280363ef17f82b564fc379ceba9ec86f upstream. Make sure to drop the reference taken to the iommu platform device when looking up its driver data during of_xlate(). Fixes: 0df4fabe208d ("iommu/mediatek: Add mt8173 IOMMU driver") Cc: stable@vger.kernel.org # 4.6 Acked-by: Robin Murphy <robin.murphy@arm.com> Reviewed-by: Yong Wu <yong.wu@mediatek.com> Signed-off-by: Johan Hovold <johan@kernel.org> Reviewed-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com> Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-01-08iommu/mediatek-v1: fix device leaks on probe()Johan Hovold1-5/+18
commit 46207625c9f33da0e43bb4ae1e91f0791b6ed633 upstream. Make sure to drop the references taken to the larb devices during probe on probe failure (e.g. probe deferral) and on driver unbind. Fixes: b17336c55d89 ("iommu/mediatek: add support for mtk iommu generation one HW") Cc: stable@vger.kernel.org # 4.8 Cc: Honghui Zhang <honghui.zhang@mediatek.com> Acked-by: Robin Murphy <robin.murphy@arm.com> Signed-off-by: Johan Hovold <johan@kernel.org> Reviewed-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com> Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-01-08iommu/mediatek-v1: fix device leak on probe_device()Johan Hovold1-0/+2
commit c77ad28bfee0df9cbc719eb5adc9864462cfb65b upstream. Make sure to drop the reference taken to the iommu platform device when looking up its driver data during probe_device(). Fixes: b17336c55d89 ("iommu/mediatek: add support for mtk iommu generation one HW") Cc: stable@vger.kernel.org # 4.8 Cc: Honghui Zhang <honghui.zhang@mediatek.com> Acked-by: Robin Murphy <robin.murphy@arm.com> Reviewed-by: Yong Wu <yong.wu@mediatek.com> Signed-off-by: Johan Hovold <johan@kernel.org> Reviewed-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com> Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-01-08iommu/ipmmu-vmsa: fix device leak on of_xlate()Johan Hovold1-0/+2
commit 80aa518452c4aceb9459f9a8e3184db657d1b441 upstream. Make sure to drop the reference taken to the iommu platform device when looking up its driver data during of_xlate(). Fixes: 7b2d59611fef ("iommu/ipmmu-vmsa: Replace local utlb code with fwspec ids") Cc: stable@vger.kernel.org # 4.14 Cc: Magnus Damm <damm+renesas@opensource.se> Acked-by: Robin Murphy <robin.murphy@arm.com> Signed-off-by: Johan Hovold <johan@kernel.org> Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-01-08iommu/exynos: fix device leak on of_xlate()Johan Hovold1-6/+3
commit 05913cc43cb122f9afecdbe775115c058b906e1b upstream. Make sure to drop the reference taken to the iommu platform device when looking up its driver data during of_xlate(). Note that commit 1a26044954a6 ("iommu/exynos: add missing put_device() call in exynos_iommu_of_xlate()") fixed the leak in a couple of error paths, but the reference is still leaking on success. Fixes: aa759fd376fb ("iommu/exynos: Add callback for initializing devices from device tree") Cc: stable@vger.kernel.org # 4.2: 1a26044954a6 Cc: Yu Kuai <yukuai3@huawei.com> Acked-by: Robin Murphy <robin.murphy@arm.com> Acked-by: Marek Szyprowski <m.szyprowski@samsung.com> Signed-off-by: Johan Hovold <johan@kernel.org> Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-01-08iommu/apple-dart: fix device leak on of_xlate()Johan Hovold1-0/+2
commit a6eaa872c52a181ae9a290fd4e40c9df91166d7a upstream. Make sure to drop the reference taken to the iommu platform device when looking up its driver data during of_xlate(). Fixes: 46d1fb072e76 ("iommu/dart: Add DART iommu driver") Cc: stable@vger.kernel.org # 5.15 Cc: Sven Peter <sven@kernel.org> Acked-by: Robin Murphy <robin.murphy@arm.com> Signed-off-by: Johan Hovold <johan@kernel.org> Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-01-08iommu/amd: Propagate the error code returned by __modify_irte_ga() in ↵Jinhui Guo1-1/+1
modify_irte_ga() commit 2381a1b40be4b286062fb3cf67dd7f005692aa2a upstream. The return type of __modify_irte_ga() is int, but modify_irte_ga() treats it as a bool. Casting the int to bool discards the error code. To fix the issue, change the type of ret to int in modify_irte_ga(). Fixes: 57cdb720eaa5 ("iommu/amd: Do not flush IRTE when only updating isRun and destination fields") Cc: stable@vger.kernel.org Signed-off-by: Jinhui Guo <guojinhui.liam@bytedance.com> Reviewed-by: Vasant Hegde <vasant.hegde@amd.com> Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-01-08iommu/amd: Fix pci_segment memleak in alloc_pci_segment()Jinhui Guo1-3/+12
commit 75ba146c2674ba49ed8a222c67f9abfb4a4f2a4f upstream. Fix a memory leak of struct amd_iommu_pci_segment in alloc_pci_segment() when system memory (or contiguous memory) is insufficient. Fixes: 04230c119930 ("iommu/amd: Introduce per PCI segment device table") Fixes: eda797a27795 ("iommu/amd: Introduce per PCI segment rlookup table") Fixes: 99fc4ac3d297 ("iommu/amd: Introduce per PCI segment alias_table") Cc: stable@vger.kernel.org Signed-off-by: Jinhui Guo <guojinhui.liam@bytedance.com> Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-01-08iommu/mediatek: fix use-after-free on probe deferralJohan Hovold1-7/+18
commit de83d4617f9fe059623e97acf7e1e10d209625b5 upstream. The driver is dropping the references taken to the larb devices during probe after successful lookup as well as on errors. This can potentially lead to a use-after-free in case a larb device has not yet been bound to its driver so that the iommu driver probe defers. Fix this by keeping the references as expected while the iommu driver is bound. Fixes: 26593928564c ("iommu/mediatek: Add error path for loop of mm_dts_parse") Cc: stable@vger.kernel.org Cc: Yong Wu <yong.wu@mediatek.com> Acked-by: Robin Murphy <robin.murphy@arm.com> Signed-off-by: Johan Hovold <johan@kernel.org> Reviewed-by: Yong Wu <yong.wu@mediatek.com> Reviewed-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com> Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-01-08x86/msi: Make irq_retrigger() functional for posted MSIThomas Gleixner1-4/+4
[ Upstream commit 0edc78b82bea85e1b2165d8e870a5c3535919695 ] Luigi reported that retriggering a posted MSI interrupt does not work correctly. The reason is that the retrigger happens at the vector domain by sending an IPI to the actual vector on the target CPU. That works correctly exactly once because the posted MSI interrupt chip does not issue an EOI as that's only required for the posted MSI notification vector itself. As a consequence the vector becomes stale in the ISR, which not only affects this vector but also any lower priority vector in the affected APIC because the ISR bit is not cleared. Luigi proposed to set the vector in the remap PIR bitmap and raise the posted MSI notification vector. That works, but that still does not cure a related problem: If there is ever a stray interrupt on such a vector, then the related APIC ISR bit becomes stale due to the lack of EOI as described above. Unlikely to happen, but if it happens it's not debuggable at all. So instead of playing games with the PIR, this can be actually solved for both cases by: 1) Keeping track of the posted interrupt vector handler state 2) Implementing a posted MSI specific irq_ack() callback which checks that state. If the posted vector handler is inactive it issues an EOI, otherwise it delegates that to the posted handler. This is correct versus affinity changes and concurrent events on the posted vector as the actual handler invocation is serialized through the interrupt descriptor lock. Fixes: ed1e48ea4370 ("iommu/vt-d: Enable posted mode for device MSIs") Reported-by: Luigi Rizzo <lrizzo@google.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Luigi Rizzo <lrizzo@google.com> Cc: stable@vger.kernel.org Link: https://patch.msgid.link/20251125214631.044440658@linutronix.de Closes: https://lore.kernel.org/lkml/20251124104836.3685533-1-lrizzo@google.com [ DEFINE_PER_CPU_CACHE_HOT => DEFINE_PER_CPU ] Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-01-08iommufd/selftest: Check for overflow in IOMMU_TEST_OP_ADD_RESERVEDJason Gunthorpe1-1/+7
[ Upstream commit e6a973af11135439de32ece3b9cbe3bfc043bea8 ] syzkaller found it could overflow math in the test infrastructure and cause a WARN_ON by corrupting the reserved interval tree. This only effects test kernels with CONFIG_IOMMUFD_TEST. Validate the user input length in the test ioctl. Fixes: f4b20bb34c83 ("iommufd: Add kernel support for testing iommufd") Link: https://patch.msgid.link/r/0-v1-cd99f6049ba5+51-iommufd_syz_add_resv_jgg@nvidia.com Reviewed-by: Samiullah Khawaja <skhawaja@google.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Tested-by: Yi Liu <yi.l.liu@intel.com> Reported-by: syzbot+57fdb0cf6a0c5d1f15a2@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/69368129.a70a0220.38f243.008f.GAE@google.com Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-12-18iommu/arm-smmu-qcom: Enable use of all SMR groups when running bare-metalStephan Gerhold1-10/+17
[ Upstream commit 5583a55e074b33ccd88ac0542fd7cd656a7e2c8c ] Some platforms (e.g. SC8280XP and X1E) support more than 128 stream matching groups. This is more than what is defined as maximum by the ARM SMMU architecture specification. Commit 122611347326 ("iommu/arm-smmu-qcom: Limit the SMR groups to 128") disabled use of the additional groups because they don't exhibit the same behavior as the architecture supported ones. It seems like this is just another quirk of the hypervisor: When running bare-metal without the hypervisor, the additional groups appear to behave just like all others. The boot firmware uses some of the additional groups, so ignoring them in this situation leads to stream match conflicts whenever we allocate a new SMR group for the same SID. The workaround exists primarily because the bypass quirk detection fails when using a S2CR register from the additional matching groups, so let's perform the test with the last reliable S2CR (127) and then limit the number of SMR groups only if we detect that we are running below the hypervisor (because of the bypass quirk). Fixes: 122611347326 ("iommu/arm-smmu-qcom: Limit the SMR groups to 128") Signed-off-by: Stephan Gerhold <stephan.gerhold@linaro.org> Signed-off-by: Will Deacon <will@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-12-18iommu/arm-smmu-v3: Fix error check in arm_smmu_alloc_cd_tablesRyan Huang1-1/+1
[ Upstream commit 5941f0e0c1e0be03ebc15b461f64208f5250d3d9 ] In arm_smmu_alloc_cd_tables(), the error check following the dma_alloc_coherent() for cd_table->l2.l1tab incorrectly tests cd_table->l2.l2ptrs. This means an allocation failure for l1tab goes undetected, causing the function to return 0 (success) erroneously. Correct the check to test cd_table->l2.l1tab. Fixes: e3b1be2e73db ("iommu/arm-smmu-v3: Reorganize struct arm_smmu_ctx_desc_cfg") Signed-off-by: Daniel Mentz <danielmentz@google.com> Signed-off-by: Ryan Huang <tzukui@google.com> Reviewed-by: Nicolin Chen <nicolinc@nvidia.com> Reviewed-by: Pranjal Shrivastava <praan@google.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Will Deacon <will@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-12-18iommu/vt-d: Fix unused invalidation hint in qi_desc_iotlbAashish Sharma1-1/+1
[ Upstream commit 6b38a108eeb3936b21643191db535a35dd7c890b ] Invalidation hint (ih) in the function 'qi_desc_iotlb' is initialized to zero and never used. It is embedded in the 0th bit of the 'addr' parameter. Get the correct 'ih' value from there. Fixes: f701c9f36bcb ("iommu/vt-d: Factor out invalidation descriptor composition") Signed-off-by: Aashish Sharma <aashish@aashishsharma.net> Link: https://lore.kernel.org/r/20251009010903.1323979-1-aashish@aashishsharma.net Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com> Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-11-24iommufd: Make vfio_compat's unmap succeed if the range is already emptyJason Gunthorpe2-9/+7
[ Upstream commit afb47765f9235181fddc61c8633b5a8cfae29fd2 ] iommufd returns ENOENT when attempting to unmap a range that is already empty, while vfio type1 returns success. Fix vfio_compat to match. Fixes: d624d6652a65 ("iommufd: vfio container FD ioctl compatibility") Link: https://patch.msgid.link/r/0-v1-76be45eff0be+5d-iommufd_unmap_compat_jgg@nvidia.com Reviewed-by: Nicolin Chen <nicolinc@nvidia.com> Reviewed-by: Alex Mastro <amastro@fb.com> Reported-by: Alex Mastro <amastro@fb.com> Closes: https://lore.kernel.org/r/aP0S5ZF9l3sWkJ1G@devgpu012.nha5.facebook.com Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-11-13iommufd: Don't overflow during division for dirty trackingJason Gunthorpe1-3/+2
commit cb30dfa75d55eced379a42fd67bd5fb7ec38555e upstream. If pgshift is 63 then BITS_PER_TYPE(*bitmap->bitmap) * pgsize will overflow to 0 and this triggers divide by 0. In this case the index should just be 0, so reorganize things to divide by shift and avoid hitting any overflows. Link: https://patch.msgid.link/r/0-v1-663679b57226+172-iommufd_dirty_div0_jgg@nvidia.com Cc: stable@vger.kernel.org Fixes: 58ccf0190d19 ("vfio: Add an IOVA bitmap support") Reviewed-by: Joao Martins <joao.m.martins@oracle.com> Reviewed-by: Nicolin Chen <nicolinc@nvidia.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reported-by: syzbot+093a8a8b859472e6c257@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=093a8a8b859472e6c257 Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-11-13iommu/vt-d: Replace snprintf with scnprintf in dmar_latency_snapshot()Seyediman Seyedarab3-17/+8
[ Upstream commit 75c02a037609f34db17e91be195cedb33b61bae0 ] snprintf() returns the number of bytes that would have been written, not the number actually written. Using this for offset tracking can cause buffer overruns if truncation occurs. Replace snprintf() with scnprintf() to ensure the offset stays within bounds. Since scnprintf() never returns a negative value, and zero is not possible in this context because 'bytes' starts at 0 and 'size - bytes' is DEBUG_BUFFER_SIZE in the first call, which is large enough to hold the string literals used, the return value is always positive. An integer overflow is also completely out of reach here due to the small and fixed buffer size. The error check in latency_show_one() is therefore unnecessary. Remove it and make dmar_latency_snapshot() return void. Signed-off-by: Seyediman Seyedarab <ImanDevel@gmail.com> Link: https://lore.kernel.org/r/20250731225048.131364-1-ImanDevel@gmail.com Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com> Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-11-13iommu/apple-dart: Clear stream error indicator bits for T8110 DARTsHector Martin1-0/+5
[ Upstream commit ecf6508923f87e4597228f70cc838af3d37f6662 ] These registers exist and at least on the t602x variant the IRQ only clears when theses are cleared. Signed-off-by: Hector Martin <marcan@marcan.st> Signed-off-by: Janne Grunau <j@jannau.net> Reviewed-by: Sven Peter <sven@kernel.org> Reviewed-by: Neal Gompa <neal@gompa.dev> Link: https://lore.kernel.org/r/20250826-dart-t8110-stream-error-v1-1-e33395112014@jannau.net Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-11-13iommu/amd: Skip enabling command/event buffers for kdumpAshish Kalra1-9/+19
[ Upstream commit 9be15fbfc6c5c89c22cf6e209f66ea43ee0e58bb ] After a panic if SNP is enabled in the previous kernel then the kdump kernel boots with IOMMU SNP enforcement still enabled. IOMMU command buffers and event buffer registers remain locked and exclusive to the previous kernel. Attempts to enable command and event buffers in the kdump kernel will fail, as hardware ignores writes to the locked MMIO registers as per AMD IOMMU spec Section 2.12.2.1. Skip enabling command buffers and event buffers for kdump boot as they are already enabled in the previous kernel. Reviewed-by: Vasant Hegde <vasant.hegde@amd.com> Tested-by: Sairaj Kodilkar <sarunkod@amd.com> Signed-off-by: Ashish Kalra <ashish.kalra@amd.com> Link: https://lore.kernel.org/r/576445eb4f168b467b0fc789079b650ca7c5b037.1756157913.git.ashish.kalra@amd.com Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-11-02iommu/vt-d: Avoid use of NULL after WARN_ON_ONCEKees Bakker1-3/+4
[ Upstream commit 60f030f7418d3f1d94f2fb207fe3080e1844630b ] There is a WARN_ON_ONCE to catch an unlikely situation when domain_remove_dev_pasid can't find the `pasid`. In case it nevertheless happens we must avoid using a NULL pointer. Signed-off-by: Kees Bakker <kees@ijzerbout.nl> Link: https://lore.kernel.org/r/20241218201048.E544818E57E@bout3.ijzerbout.nl Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com> Signed-off-by: Joerg Roedel <jroedel@suse.de> Signed-off-by: Amelia Crate <acrate@waldn.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-10-19iommu/vt-d: PRS isn't usable if PDS isn't supportedLu Baolu1-1/+1
commit 5ef7e24c742038a5d8c626fdc0e3a21834358341 upstream. The specification, Section 7.10, "Software Steps to Drain Page Requests & Responses," requires software to submit an Invalidation Wait Descriptor (inv_wait_dsc) with the Page-request Drain (PD=1) flag set, along with the Invalidation Wait Completion Status Write flag (SW=1). It then waits for the Invalidation Wait Descriptor's completion. However, the PD field in the Invalidation Wait Descriptor is optional, as stated in Section 6.5.2.9, "Invalidation Wait Descriptor": "Page-request Drain (PD): Remapping hardware implementations reporting Page-request draining as not supported (PDS = 0 in ECAP_REG) treat this field as reserved." This implies that if the IOMMU doesn't support the PDS capability, software can't drain page requests and group responses as expected. Do not enable PCI/PRI if the IOMMU doesn't support PDS. Reported-by: Joel Granados <joel.granados@kernel.org> Closes: https://lore.kernel.org/r/20250909-jag-pds-v1-1-ad8cba0e494e@kernel.org Fixes: 66ac4db36f4c ("iommu/vt-d: Add page request draining support") Cc: stable@vger.kernel.org Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com> Link: https://lore.kernel.org/r/20250915062946.120196-1-baolu.lu@linux.intel.com Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-10-15iommu/vt-d: Disallow dirty tracking if incoherent page walkLu Baolu1-1/+2
[ Upstream commit 57f55048e564dedd8a4546d018e29d6bbfff0a7e ] Dirty page tracking relies on the IOMMU atomically updating the dirty bit in the paging-structure entry. For this operation to succeed, the paging- structure memory must be coherent between the IOMMU and the CPU. In another word, if the iommu page walk is incoherent, dirty page tracking doesn't work. The Intel VT-d specification, Section 3.10 "Snoop Behavior" states: "Remapping hardware encountering the need to atomically update A/EA/D bits in a paging-structure entry that is not snooped will result in a non- recoverable fault." To prevent an IOMMU from being incorrectly configured for dirty page tracking when it is operating in an incoherent mode, mark SSADS as supported only when both ecap_slads and ecap_smpwc are supported. Fixes: f35f22cc760e ("iommu/vt-d: Access/Dirty bit support for SS domains") Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/20250924083447.123224-1-baolu.lu@linux.intel.com Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-10-15iommu/vt-d: debugfs: Fix legacy mode page table dump logicVineeth Pillai (Google)1-2/+15
[ Upstream commit fbe6070c73badca726e4ff7877320e6c62339917 ] In legacy mode, SSPTPTR is ignored if TT is not 00b or 01b. SSPTPTR maybe uninitialized or zero in that case and may cause oops like: Oops: general protection fault, probably for non-canonical address 0xf00087d3f000f000: 0000 [#1] SMP NOPTI CPU: 2 UID: 0 PID: 786 Comm: cat Not tainted 6.16.0 #191 PREEMPT(voluntary) Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.17.0-5.fc42 04/01/2014 RIP: 0010:pgtable_walk_level+0x98/0x150 RSP: 0018:ffffc90000f279c0 EFLAGS: 00010206 RAX: 0000000040000000 RBX: ffffc90000f27ab0 RCX: 000000000000001e RDX: 0000000000000003 RSI: f00087d3f000f000 RDI: f00087d3f0010000 RBP: ffffc90000f27a00 R08: ffffc90000f27a98 R09: 0000000000000002 R10: 0000000000000000 R11: 0000000000000000 R12: f00087d3f000f000 R13: 0000000000000000 R14: 0000000040000000 R15: ffffc90000f27a98 FS: 0000764566dcb740(0000) GS:ffff8881f812c000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000764566d44000 CR3: 0000000109d81003 CR4: 0000000000772ef0 PKRU: 55555554 Call Trace: <TASK> pgtable_walk_level+0x88/0x150 domain_translation_struct_show.isra.0+0x2d9/0x300 dev_domain_translation_struct_show+0x20/0x40 seq_read_iter+0x12d/0x490 ... Avoid walking the page table if TT is not 00b or 01b. Fixes: 2b437e804566 ("iommu/vt-d: debugfs: Support dumping a specified page table") Signed-off-by: Vineeth Pillai (Google) <vineeth@bitbyteword.org> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Link: https://lore.kernel.org/r/20250814163153.634680-1-vineeth@bitbyteword.org Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com> Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-10-02iommufd: Fix race during abort for file descriptorsJason Gunthorpe2-6/+32
[ Upstream commit 4e034bf045b12852a24d5d33f2451850818ba0c1 ] fput() doesn't actually call file_operations release() synchronously, it puts the file on a work queue and it will be released eventually. This is normally fine, except for iommufd the file and the iommufd_object are tied to gether. The file has the object as it's private_data and holds a users refcount, while the object is expected to remain alive as long as the file is. When the allocation of a new object aborts before installing the file it will fput() the file and then go on to immediately kfree() the obj. This causes a UAF once the workqueue completes the fput() and tries to decrement the users refcount. Fix this by putting the core code in charge of the file lifetime, and call __fput_sync() during abort to ensure that release() is called before kfree. __fput_sync() is a bit too tricky to open code in all the object implementations. Instead the objects tell the core code where the file pointer is and the core will take care of the life cycle. If the object is successfully allocated then the file will hold a users refcount and the iommufd_object cannot be destroyed. It is worth noting that close(); ioctl(IOMMU_DESTROY); doesn't have an issue because close() is already using a synchronous version of fput(). The UAF looks like this: BUG: KASAN: slab-use-after-free in iommufd_eventq_fops_release+0x45/0xc0 drivers/iommu/iommufd/eventq.c:376 Write of size 4 at addr ffff888059c97804 by task syz.0.46/6164 CPU: 0 UID: 0 PID: 6164 Comm: syz.0.46 Not tainted syzkaller #0 PREEMPT(full) Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 08/18/2025 Call Trace: <TASK> __dump_stack lib/dump_stack.c:94 [inline] dump_stack_lvl+0x116/0x1f0 lib/dump_stack.c:120 print_address_description mm/kasan/report.c:378 [inline] print_report+0xcd/0x630 mm/kasan/report.c:482 kasan_report+0xe0/0x110 mm/kasan/report.c:595 check_region_inline mm/kasan/generic.c:183 [inline] kasan_check_range+0x100/0x1b0 mm/kasan/generic.c:189 instrument_atomic_read_write include/linux/instrumented.h:96 [inline] atomic_fetch_sub_release include/linux/atomic/atomic-instrumented.h:400 [inline] __refcount_dec include/linux/refcount.h:455 [inline] refcount_dec include/linux/refcount.h:476 [inline] iommufd_eventq_fops_release+0x45/0xc0 drivers/iommu/iommufd/eventq.c:376 __fput+0x402/0xb70 fs/file_table.c:468 task_work_run+0x14d/0x240 kernel/task_work.c:227 resume_user_mode_work include/linux/resume_user_mode.h:50 [inline] exit_to_user_mode_loop+0xeb/0x110 kernel/entry/common.c:43 exit_to_user_mode_prepare include/linux/irq-entry-common.h:225 [inline] syscall_exit_to_user_mode_work include/linux/entry-common.h:175 [inline] syscall_exit_to_user_mode include/linux/entry-common.h:210 [inline] do_syscall_64+0x41c/0x4c0 arch/x86/entry/syscall_64.c:100 entry_SYSCALL_64_after_hwframe+0x77/0x7f Link: https://patch.msgid.link/r/1-v1-02cd136829df+31-iommufd_syz_fput_jgg@nvidia.com Cc: stable@vger.kernel.org Fixes: 07838f7fd529 ("iommufd: Add iommufd fault object") Reviewed-by: Nicolin Chen <nicolinc@nvidia.com> Reviewed-by: Nirmoy Das <nirmoyd@nvidia.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Tested-by: Nicolin Chen <nicolinc@nvidia.com> Reported-by: syzbot+80620e2d0d0a33b09f93@syzkaller.appspotmail.com Closes: https://lore.kernel.org/r/68c8583d.050a0220.2ff435.03a2.GAE@google.com Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> [ applied eventq.c changes to fault.c, drop veventq bits ] Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-09-25iommu/amd/pgtbl: Fix possible race while increase page table levelVasant Hegde2-4/+22
commit 1e56310b40fd2e7e0b9493da9ff488af145bdd0c upstream. The AMD IOMMU host page table implementation supports dynamic page table levels (up to 6 levels), starting with a 3-level configuration that expands based on IOVA address. The kernel maintains a root pointer and current page table level to enable proper page table walks in alloc_pte()/fetch_pte() operations. The IOMMU IOVA allocator initially starts with 32-bit address and onces its exhuasted it switches to 64-bit address (max address is determined based on IOMMU and device DMA capability). To support larger IOVA, AMD IOMMU driver increases page table level. But in unmap path (iommu_v1_unmap_pages()), fetch_pte() reads pgtable->[root/mode] without lock. So its possible that in exteme corner case, when increase_address_space() is updating pgtable->[root/mode], fetch_pte() reads wrong page table level (pgtable->mode). It does compare the value with level encoded in page table and returns NULL. This will result is iommu_unmap ops to fail and upper layer may retry/log WARN_ON. CPU 0 CPU 1 ------ ------ map pages unmap pages alloc_pte() -> increase_address_space() iommu_v1_unmap_pages() -> fetch_pte() pgtable->root = pte (new root value) READ pgtable->[mode/root] Reads new root, old mode Updates mode (pgtable->mode += 1) Since Page table level updates are infrequent and already synchronized with a spinlock, implement seqcount to enable lock-free read operations on the read path. Fixes: 754265bcab7 ("iommu/amd: Fix race in increase_address_space()") Reported-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com> Cc: stable@vger.kernel.org Cc: Joao Martins <joao.m.martins@oracle.com> Cc: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com> Signed-off-by: Vasant Hegde <vasant.hegde@amd.com> Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-09-25iommu/vt-d: Fix __domain_mapping()'s usage of switch_to_super_page()Eugene Koira1-1/+6
commit dce043c07ca1ac19cfbe2844a6dc71e35c322353 upstream. switch_to_super_page() assumes the memory range it's working on is aligned to the target large page level. Unfortunately, __domain_mapping() doesn't take this into account when using it, and will pass unaligned ranges ultimately freeing a PTE range larger than expected. Take for example a mapping with the following iov_pfn range [0x3fe400, 0x4c0600), which should be backed by the following mappings: iov_pfn [0x3fe400, 0x3fffff] covered by 2MiB pages iov_pfn [0x400000, 0x4bffff] covered by 1GiB pages iov_pfn [0x4c0000, 0x4c05ff] covered by 2MiB pages Under this circumstance, __domain_mapping() will pass [0x400000, 0x4c05ff] to switch_to_super_page() at a 1 GiB granularity, which will in turn free PTEs all the way to iov_pfn 0x4fffff. Mitigate this by rounding down the iov_pfn range passed to switch_to_super_page() in __domain_mapping() to the target large page level. Additionally add range alignment checks to switch_to_super_page. Fixes: 9906b9352a35 ("iommu/vt-d: Avoid duplicate removing in __domain_mapping()") Signed-off-by: Eugene Koira <eugkoira@amazon.com> Cc: stable@vger.kernel.org Reviewed-by: Nicolas Saenz Julienne <nsaenz@amazon.com> Reviewed-by: David Woodhouse <dwmw@amazon.co.uk> Link: https://lore.kernel.org/r/20250826143816.38686-1-eugkoira@amazon.com Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com> Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-08-28iommu/amd: Avoid stack buffer overflow from kernel cmdlineKees Cook1-2/+2
[ Upstream commit 8503d0fcb1086a7cfe26df67ca4bd9bd9e99bdec ] While the kernel command line is considered trusted in most environments, avoid writing 1 byte past the end of "acpiid" if the "str" argument is maximum length. Reported-by: Simcha Kosman <simcha.kosman@cyberark.com> Closes: https://lore.kernel.org/all/AS8P193MB2271C4B24BCEDA31830F37AE84A52@AS8P193MB2271.EURP193.PROD.OUTLOOK.COM Fixes: b6b26d86c61c ("iommu/amd: Add a length limitation for the ivrs_acpihid command-line parameter") Signed-off-by: Kees Cook <kees@kernel.org> Reviewed-by: Ankit Soni <Ankit.Soni@amd.com> Link: https://lore.kernel.org/r/20250804154023.work.970-kees@kernel.org Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-08-28iommu/arm-smmu-v3: Fix smmu_domain->nr_ats_masters decrementNicolin Chen1-1/+1
commit 685ca577b408ffd9c5a4057a2acc0cd3e6978b36 upstream. The arm_smmu_attach_commit() updates master->ats_enabled before calling arm_smmu_remove_master_domain() that is supposed to clean up everything in the old domain, including the old domain's nr_ats_masters. So, it is supposed to use the old ats_enabled state of the device, not an updated state. This isn't a problem if switching between two domains where: - old ats_enabled = false; new ats_enabled = false - old ats_enabled = true; new ats_enabled = true but can fail cases where: - old ats_enabled = false; new ats_enabled = true (old domain should keep the counter but incorrectly decreased it) - old ats_enabled = true; new ats_enabled = false (old domain needed to decrease the counter but incorrectly missed it) Update master->ats_enabled after arm_smmu_remove_master_domain() to fix this. Fixes: 7497f4211f4f ("iommu/arm-smmu-v3: Make changing domains be hitless for ATS") Cc: stable@vger.kernel.org Signed-off-by: Nicolin Chen <nicolinc@nvidia.com> Acked-by: Will Deacon <will@kernel.org> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Pranjal Shrivastava <praan@google.com> Link: https://lore.kernel.org/r/20250801030127.2006979-1-nicolinc@nvidia.com Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-08-20iommufd: Prevent ALIGN() overflowJason Gunthorpe1-16/+25
commit b42497e3c0e74db061eafad41c0cd7243c46436b upstream. When allocating IOVA the candidate range gets aligned to the target alignment. If the range is close to ULONG_MAX then the ALIGN() can wrap resulting in a corrupted iova. Open code the ALIGN() using get_add_overflow() to prevent this. This simplifies the checks as we don't need to check for length earlier either. Consolidate the two copies of this code under a single helper. This bug would allow userspace to create a mapping that overlaps with some other mapping or a reserved range. Cc: stable@vger.kernel.org Fixes: 51fe6141f0f6 ("iommufd: Data structure to provide IOVA to PFN mapping") Reported-by: syzbot+c2f65e2801743ca64e08@syzkaller.appspotmail.com Closes: https://lore.kernel.org/r/685af644.a00a0220.2e5631.0094.GAE@google.com Reviewed-by: Yi Liu <yi.l.liu@intel.com> Reviewed-by: Nicolin Chen <nicolinc@nvidia.com> Link: https://patch.msgid.link/all/1-v1-7b4a16fc390b+10f4-iommufd_alloc_overflow_jgg@nvidia.com/ Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-08-20iommufd: Report unmapped bytes in the error path of iopt_unmap_iova_rangeNicolin Chen1-2/+5
commit b23e09f9997771b4b739c1c694fa832b5fa2de02 upstream. There are callers that read the unmapped bytes even when rc != 0. Thus, do not forget to report it in the error path too. Fixes: 8d40205f6093 ("iommufd: Add kAPI toward external drivers for kernel access") Link: https://patch.msgid.link/r/e2b61303bbc008ba1a4e2d7c2a2894749b59fdac.1752126748.git.nicolinc@nvidia.com Cc: stable@vger.kernel.org Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-08-20iommu/arm-smmu-qcom: Add SM6115 MDSS compatibleAlexey Klimov1-0/+1
commit f7fa8520f30373ce99c436c4d57c76befdacbef3 upstream. Add the SM6115 MDSS compatible to clients compatible list, as it also needs that workaround. Without this workaround, for example, QRB4210 RB2 which is based on SM4250/SM6115 generates a lot of smmu unhandled context faults during boot: arm_smmu_context_fault: 116854 callbacks suppressed arm-smmu c600000.iommu: Unhandled context fault: fsr=0x402, iova=0x5c0ec600, fsynr=0x320021, cbfrsynra=0x420, cb=5 arm-smmu c600000.iommu: FSR = 00000402 [Format=2 TF], SID=0x420 arm-smmu c600000.iommu: FSYNR0 = 00320021 [S1CBNDX=50 PNU PLVL=1] arm-smmu c600000.iommu: Unhandled context fault: fsr=0x402, iova=0x5c0d7800, fsynr=0x320021, cbfrsynra=0x420, cb=5 arm-smmu c600000.iommu: FSR = 00000402 [Format=2 TF], SID=0x420 and also failed initialisation of lontium lt9611uxc, gpu and dpu is observed: (binding MDSS components triggered by lt9611uxc have failed) ------------[ cut here ]------------ !aspace WARNING: CPU: 6 PID: 324 at drivers/gpu/drm/msm/msm_gem_vma.c:130 msm_gem_vma_init+0x150/0x18c [msm] Modules linked in: ... (long list of modules) CPU: 6 UID: 0 PID: 324 Comm: (udev-worker) Not tainted 6.15.0-03037-gaacc73ceeb8b #4 PREEMPT Hardware name: Qualcomm Technologies, Inc. QRB4210 RB2 (DT) pstate: 80000005 (Nzcv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--) pc : msm_gem_vma_init+0x150/0x18c [msm] lr : msm_gem_vma_init+0x150/0x18c [msm] sp : ffff80008144b280 ... Call trace: msm_gem_vma_init+0x150/0x18c [msm] (P) get_vma_locked+0xc0/0x194 [msm] msm_gem_get_and_pin_iova_range+0x4c/0xdc [msm] msm_gem_kernel_new+0x48/0x160 [msm] msm_gpu_init+0x34c/0x53c [msm] adreno_gpu_init+0x1b0/0x2d8 [msm] a6xx_gpu_init+0x1e8/0x9e0 [msm] adreno_bind+0x2b8/0x348 [msm] component_bind_all+0x100/0x230 msm_drm_bind+0x13c/0x3d0 [msm] try_to_bring_up_aggregate_device+0x164/0x1d0 __component_add+0xa4/0x174 component_add+0x14/0x20 dsi_dev_attach+0x20/0x34 [msm] dsi_host_attach+0x58/0x98 [msm] devm_mipi_dsi_attach+0x34/0x90 lt9611uxc_attach_dsi.isra.0+0x94/0x124 [lontium_lt9611uxc] lt9611uxc_probe+0x540/0x5fc [lontium_lt9611uxc] i2c_device_probe+0x148/0x2a8 really_probe+0xbc/0x2c0 __driver_probe_device+0x78/0x120 driver_probe_device+0x3c/0x154 __driver_attach+0x90/0x1a0 bus_for_each_dev+0x68/0xb8 driver_attach+0x24/0x30 bus_add_driver+0xe4/0x208 driver_register+0x68/0x124 i2c_register_driver+0x48/0xcc lt9611uxc_driver_init+0x20/0x1000 [lontium_lt9611uxc] do_one_initcall+0x60/0x1d4 do_init_module+0x54/0x1fc load_module+0x1748/0x1c8c init_module_from_file+0x74/0xa0 __arm64_sys_finit_module+0x130/0x2f8 invoke_syscall+0x48/0x104 el0_svc_common.constprop.0+0xc0/0xe0 do_el0_svc+0x1c/0x28 el0_svc+0x2c/0x80 el0t_64_sync_handler+0x10c/0x138 el0t_64_sync+0x198/0x19c ---[ end trace 0000000000000000 ]--- msm_dpu 5e01000.display-controller: [drm:msm_gpu_init [msm]] *ERROR* could not allocate memptrs: -22 msm_dpu 5e01000.display-controller: failed to load adreno gpu platform a400000.remoteproc:glink-edge:apr:service@7:dais: Adding to iommu group 19 msm_dpu 5e01000.display-controller: failed to bind 5900000.gpu (ops a3xx_ops [msm]): -22 msm_dpu 5e01000.display-controller: adev bind failed: -22 lt9611uxc 0-002b: failed to attach dsi to host lt9611uxc 0-002b: probe with driver lt9611uxc failed with error -22 Suggested-by: Bjorn Andersson <andersson@kernel.org> Reviewed-by: Dmitry Baryshkov <dmitry.baryshkov@oss.qualcomm.com> Fixes: 3581b7062cec ("drm/msm/disp/dpu1: add support for display on SM6115") Cc: stable@vger.kernel.org Signed-off-by: Alexey Klimov <alexey.klimov@linaro.org> Link: https://lore.kernel.org/r/20250613173238.15061-1-alexey.klimov@linaro.org Signed-off-by: Will Deacon <will@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-08-20iommu/vt-d: Optimize iotlb_sync_map for non-caching/non-RWBF modesLu Baolu2-1/+21
commit 12724ce3fe1a3d8f30d56e48b4f272d8860d1970 upstream. The iotlb_sync_map iommu ops allows drivers to perform necessary cache flushes when new mappings are established. For the Intel iommu driver, this callback specifically serves two purposes: - To flush caches when a second-stage page table is attached to a device whose iommu is operating in caching mode (CAP_REG.CM==1). - To explicitly flush internal write buffers to ensure updates to memory- resident remapping structures are visible to hardware (CAP_REG.RWBF==1). However, in scenarios where neither caching mode nor the RWBF flag is active, the cache_tag_flush_range_np() helper, which is called in the iotlb_sync_map path, effectively becomes a no-op. Despite being a no-op, cache_tag_flush_range_np() involves iterating through all cache tags of the iommu's attached to the domain, protected by a spinlock. This unnecessary execution path introduces overhead, leading to a measurable I/O performance regression. On systems with NVMes under the same bridge, performance was observed to drop from approximately ~6150 MiB/s down to ~4985 MiB/s. Introduce a flag in the dmar_domain structure. This flag will only be set when iotlb_sync_map is required (i.e., when CM or RWBF is set). The cache_tag_flush_range_np() is called only for domains where this flag is set. This flag, once set, is immutable, given that there won't be mixed configurations in real-world scenarios where some IOMMUs in a system operate in caching mode while others do not. Theoretically, the immutability of this flag does not impact functionality. Reported-by: Ioanna Alifieraki <ioanna-maria.alifieraki@canonical.com> Closes: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2115738 Link: https://lore.kernel.org/r/20250701171154.52435-1-ioanna-maria.alifieraki@canonical.com Fixes: 129dab6e1286 ("iommu/vt-d: Use cache_tag_flush_range_np() in iotlb_sync_map") Cc: stable@vger.kernel.org Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Link: https://lore.kernel.org/r/20250703031545.3378602-1-baolu.lu@linux.intel.com Link: https://lore.kernel.org/r/20250714045028.958850-3-baolu.lu@linux.intel.com Signed-off-by: Will Deacon <will@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-08-15iommu/amd: Fix geometry.aperture_end for V2 tablesJason Gunthorpe1-2/+15
[ Upstream commit 8637afa79cfa6123f602408cfafe8c9a73620ff1 ] The AMD IOMMU documentation seems pretty clear that the V2 table follows the normal CPU expectation of sign extension. This is shown in Figure 25: AMD64 Long Mode 4-Kbyte Page Address Translation Where bits Sign-Extend [63:57] == [56]. This is typical for x86 which would have three regions in the page table: lower, non-canonical, upper. The manual describes that the V1 table does not sign extend in section 2.2.4 Sharing AMD64 Processor and IOMMU Page Tables GPA-to-SPA Further, Vasant has checked this and indicates the HW has an addtional behavior that the manual does not yet describe. The AMDv2 table does not have the sign extended behavior when attached to PASID 0, which may explain why this has gone unnoticed. The iommu domain geometry does not directly support sign extended page tables. The driver should report only one of the lower/upper spaces. Solve this by removing the top VA bit from the geometry to use only the lower space. This will also make the iommu_domain work consistently on all PASID 0 and PASID != 1. Adjust dma_max_address() to remove the top VA bit. It now returns: 5 Level: Before 0x1ffffffffffffff After 0x0ffffffffffffff 4 Level: Before 0xffffffffffff After 0x7fffffffffff Fixes: 11c439a19466 ("iommu/amd/pgtbl_v2: Fix domain max address") Link: https://lore.kernel.org/all/8858d4d6-d360-4ef0-935c-bfd13ea54f42@amd.com/ Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Vasant Hegde <vasant.hegde@amd.com> Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com> Link: https://lore.kernel.org/r/0-v2-0615cc99b88a+1ce-amdv2_geo_jgg@nvidia.com Signed-off-by: Will Deacon <will@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-08-15iommu/amd: Enable PASID and ATS capabilities in the correct orderEaswar Hariharan1-1/+1
[ Upstream commit c694bc8b612ddd0dd70e122a00f39cb1e2e6927f ] Per the PCIe spec, behavior of the PASID capability is undefined if the value of the PASID Enable bit changes while the Enable bit of the function's ATS control register is Set. Unfortunately, pdev_enable_caps() does exactly that by ordering enabling ATS for the device before enabling PASID. Cc: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com> Cc: Vasant Hegde <vasant.hegde@amd.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Jerry Snitselaar <jsnitsel@redhat.com> Fixes: eda8c2860ab679 ("iommu/amd: Enable device ATS/PASID/PRI capabilities independently") Signed-off-by: Easwar Hariharan <eahariha@linux.microsoft.com> Reviewed-by: Vasant Hegde <vasant.hegde@amd.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/20250703155433.6221-1-eahariha@linux.microsoft.com Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Sasha Levin <sashal@kernel.org>