summaryrefslogtreecommitdiff
path: root/drivers/gpu/drm/amd/amdgpu
AgeCommit message (Collapse)AuthorFilesLines
2 daysdrm/amdgpu: prevent immediate PASID reuse caseEric Huang3-13/+34
commit 14b81abe7bdc25f8097906fc2f91276ffedb2d26 upstream. PASID resue could cause interrupt issue when process immediately runs into hw state left by previous process exited with the same PASID, it's possible that page faults are still pending in the IH ring buffer when the process exits and frees up its PASID. To prevent the case, it uses idr cyclic allocator same as kernel pid's. Signed-off-by: Eric Huang <jinhuieric.huang@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit 8f1de51f49be692de137c8525106e0fce2d1912d) Cc: stable@vger.kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2 daysdrm/amdgpu: Fix fence put before wait in amdgpu_amdkfd_submit_ibSrinivasan Shanmugam1-2/+2
[ Upstream commit 7150850146ebfa4ca998f653f264b8df6f7f85be ] amdgpu_amdkfd_submit_ib() submits a GPU job and gets a fence from amdgpu_ib_schedule(). This fence is used to wait for job completion. Currently, the code drops the fence reference using dma_fence_put() before calling dma_fence_wait(). If dma_fence_put() releases the last reference, the fence may be freed before dma_fence_wait() is called. This can lead to a use-after-free. Fix this by waiting on the fence first and releasing the reference only after dma_fence_wait() completes. Fixes the below: drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c:697 amdgpu_amdkfd_submit_ib() warn: passing freed memory 'f' (line 696) Fixes: 9ae55f030dc5 ("drm/amdgpu: Follow up change to previous drm scheduler change.") Cc: Felix Kuehling <Felix.Kuehling@amd.com> Cc: Dan Carpenter <dan.carpenter@linaro.org> Cc: Christian König <christian.koenig@amd.com> Cc: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit 8b9e5259adc385b61a6590a13b82ae0ac2bd3482) Signed-off-by: Sasha Levin <sashal@kernel.org>
2 daysdrm/amdgpu: fix gpu idle power consumption issue for gfx v12Yang Wang1-1/+4
[ Upstream commit a6571045cf06c4aa749b4801382ae96650e2f0e1 ] Older versions of the MES firmware may cause abnormal GPU power consumption. When performing inference tasks on the GPU (e.g., with Ollama using ROCm), the GPU may show abnormal power consumption in idle state and incorrect GPU load information. This issue has been fixed in firmware version 0x8b and newer. Closes: https://github.com/ROCm/ROCm/issues/5706 Signed-off-by: Yang Wang <kevinyang.wang@amd.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit 4e22a5fe6ea6e0b057e7f246df4ac3ff8bfbc46a) Signed-off-by: Sasha Levin <sashal@kernel.org>
10 daysdrm/amdgpu/mmhub4.1.0: add bounds checking for cidAlex Deucher1-1/+2
commit 3cdd405831d8cc50a5eae086403402697bb98a4a upstream. The value should never exceed the array size as those are the only values the hardware is expected to return, but add checks anyway. Reviewed-by: Benjamin Cheng <benjamin.cheng@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit 04f063d85090f5dd0c671010ce88ee49d9dcc8ed) Cc: stable@vger.kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
10 daysdrm/amdgpu/mmhub3.0: add bounds checking for cidAlex Deucher1-1/+2
commit cdb82ecbeccb55fae75a3c956b605f7801a30db1 upstream. The value should never exceed the array size as those are the only values the hardware is expected to return, but add checks anyway. Reviewed-by: Benjamin Cheng <benjamin.cheng@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit f14f27bbe2a3ed7af32d5f6eaf3f417139f45253) Cc: stable@vger.kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
10 daysdrm/amdgpu/mmhub3.0.2: add bounds checking for cidAlex Deucher1-1/+2
commit e5e6d67b1ce9764e67aef2d0eef9911af53ad99a upstream. The value should never exceed the array size as those are the only values the hardware is expected to return, but add checks anyway. Reviewed-by: Benjamin Cheng <benjamin.cheng@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit 1441f52c7f6ae6553664aa9e3e4562f6fc2fe8ea) Cc: stable@vger.kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
10 daysdrm/amdgpu/mmhub3.0.1: add bounds checking for cidAlex Deucher1-1/+2
commit 5d4e88bcfef29569a1db224ef15e28c603666c6d upstream. The value should never exceed the array size as those are the only values the hardware is expected to return, but add checks anyway. Reviewed-by: Benjamin Cheng <benjamin.cheng@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit 5f76083183363c4528a4aaa593f5d38c28fe7d7b) Cc: stable@vger.kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
10 daysdrm/amdgpu/mmhub2.3: add bounds checking for cidAlex Deucher1-1/+2
commit a54403a534972af5d9ba5aaa3bb6ead612500ec6 upstream. The value should never exceed the array size as those are the only values the hardware is expected to return, but add checks anyway. Reviewed-by: Benjamin Cheng <benjamin.cheng@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit 89cd90375c19fb45138990b70e9f4ba4806f05c4) Cc: stable@vger.kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
10 daysdrm/amdgpu/mmhub2.0: add bounds checking for cidAlex Deucher1-3/+6
commit 0b26edac4ac5535df1f63e6e8ab44c24fe1acad7 upstream. The value should never exceed the array size as those are the only values the hardware is expected to return, but add checks anyway. Reviewed-by: Benjamin Cheng <benjamin.cheng@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit e064cef4b53552602bb6ac90399c18f662f3cacd) Cc: stable@vger.kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
10 daysdrm/amdgpu/gmc9.0: add bounds checking for cidAlex Deucher1-7/+14
commit f39e1270277f4b06db0b2c6ec9405b6dd766fb13 upstream. The value should never exceed the array size as those are the only values the hardware is expected to return, but add checks anyway. Cc: Benjamin Cheng <benjamin.cheng@amd.com> Reviewed-by: Benjamin Cheng <benjamin.cheng@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit e14d468304832bcc4a082d95849bc0a41b18ddea) Cc: stable@vger.kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
10 daysdrm/amdgpu: Add basic validation for RAS headerLijo Lazar1-2/+18
commit 5df0d6addb7e9b6f71f7162d1253762a5be9138e upstream. If RAS header read from EEPROM is corrupted, it could result in trying to allocate huge memory for reading the records. Add some validation to header fields. Signed-off-by: Lijo Lazar <lijo.lazar@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> [ RAS_TABLE_VER_V3 is not supported in v6.6.y. ] Signed-off-by: Alva Lan <alvalan9@foxmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
10 daysdrm/amd: Set num IP blocks to 0 if discovery failsMario Limonciello2-2/+4
commit 3646ff28780b4c52c5b5081443199e7a430110e5 upstream. If discovery has failed for any reason (such as no support for a block) then there is no need to unwind all the IP blocks in fini. In this condition there can actually be failures during the unwind too. Reset num_ip_blocks to zero during failure path and skip the unnecessary cleanup path. Suggested-by: Lijo Lazar <lijo.lazar@amd.com> Reviewed-by: Lijo Lazar <lijo.lazar@amd.com> Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit fae5984296b981c8cc3acca35b701c1f332a6cd8) Cc: stable@vger.kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
10 daysdrm/amdgpu: Fix use-after-free race in VM acquireAlysa Liu1-1/+5
commit 2c1030f2e84885cc58bffef6af67d5b9d2e7098f upstream. Replace non-atomic vm->process_info assignment with cmpxchg() to prevent race when parent/child processes sharing a drm_file both try to acquire the same VM after fork(). Reviewed-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com> Signed-off-by: Alysa Liu <Alysa.Liu@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit c7c573275ec20db05be769288a3e3bb2250ec618) Cc: stable@vger.kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
10 daysdrm/amd: Disable MES LR compute W/AMario Limonciello2-10/+0
commit 6b0d812971370c64b837a2db4275410f478272fe upstream. A workaround was introduced in commit 1fb710793ce2 ("drm/amdgpu: Enable MES lr_compute_wa by default") to help with some hangs observed in gfx1151. This WA didn't fully fix the issue. It was actually fixed by adjusting the VGPR size to the correct value that matched the hardware in commit b42f3bf9536c ("drm/amdkfd: bump minimum vgpr size for gfx1151"). There are reports of instability on other products with newer GC microcode versions, and I believe they're caused by this workaround. As we don't need the workaround any more, remove it. Fixes: b42f3bf9536c ("drm/amdkfd: bump minimum vgpr size for gfx1151") Acked-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit 9973e64bd6ee7642860a6f3b6958cbf14e89cabd) Cc: stable@vger.kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
10 daysdrm/amdgpu/vcn5: Add SMU dpm interface typesguttula1-0/+4
[ Upstream commit a5fe1a54513196e4bc8f9170006057dc31e7155e ] This will set AMDGPU_VCN_SMU_DPM_INTERFACE_* smu_type based on soc type and fixing ring timeout issue seen for DPM enabled case. Signed-off-by: sguttula <suresh.guttula@amd.com> Reviewed-by: Pratik Vishwakarma <Pratik.Vishwakarma@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit f0f23c315b38c55e8ce9484cf59b65811f350630) Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-13drm/amd: Fix hang on amdgpu unload by using pci_dev_is_disconnected()Mario Limonciello1-2/+2
[ Upstream commit f7afda7fcd169a9168695247d07ad94cf7b9798f ] The commit 6a23e7b4332c ("drm/amd: Clean up kfd node on surprise disconnect") introduced early KFD cleanup when drm_dev_is_unplugged() returns true. However, this causes hangs during normal module unload (rmmod amdgpu). The issue occurs because drm_dev_unplug() is called in amdgpu_pci_remove() for all removal scenarios, not just surprise disconnects. This was done intentionally in commit 39934d3ed572 ("Revert "drm/amdgpu: TA unload messages are not actually sent to psp when amdgpu is uninstalled"") to fix IGT PCI software unplug test failures. As a result, drm_dev_is_unplugged() returns true even during normal module unload, triggering the early KFD cleanup inappropriately. The correct check should distinguish between: - Actual surprise disconnect (eGPU unplugged): pci_dev_is_disconnected() returns true - Normal module unload (rmmod): pci_dev_is_disconnected() returns false Replace drm_dev_is_unplugged() with pci_dev_is_disconnected() to ensure the early cleanup only happens during true hardware disconnect events. Cc: stable@vger.kernel.org Reported-by: Cal Peake <cp@absolutedigital.net> Closes: https://lore.kernel.org/all/b0c22deb-c0fa-3343-33cf-fd9a77d7db99@absolutedigital.net/ Fixes: 6a23e7b4332c ("drm/amd: Clean up kfd node on surprise disconnect") Acked-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-13drm/amdgpu: Fix locking bugs in error pathsBart Van Assche1-5/+7
[ Upstream commit 480ad5f6ead4a47b969aab6618573cd6822bb6a4 ] Do not unlock psp->ras_context.mutex if it has not been locked. This has been detected by the Clang thread-safety analyzer. Cc: Alex Deucher <alexander.deucher@amd.com> Cc: Christian König <christian.koenig@amd.com> Cc: YiPeng Chai <YiPeng.Chai@amd.com> Cc: Hawking Zhang <Hawking.Zhang@amd.com> Cc: amd-gfx@lists.freedesktop.org Fixes: b3fb79cda568 ("drm/amdgpu: add mutex to protect ras shared memory") Acked-by: Christian König <christian.koenig@amd.com> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit 6fa01b4335978051d2cd80841728fd63cc597970) Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-13drm/amdgpu: Replace kzalloc + copy_from_user with memdup_userThorsten Blum1-14/+6
[ Upstream commit 99eeb8358e6cdb7050bf2956370c15dcceda8c7e ] Replace kzalloc() followed by copy_from_user() with memdup_user() to improve and simplify ta_if_load_debugfs_write() and ta_if_invoke_debugfs_write(). No functional changes intended. Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com> Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Stable-dep-of: 480ad5f6ead4 ("drm/amdgpu: Fix locking bugs in error paths") Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-13drm/amdgpu: Unlock a mutex before destroying itBart Van Assche1-0/+1
[ Upstream commit 5e0bcc7b88bcd081aaae6f481b10d9ab294fcb69 ] Mutexes must be unlocked before these are destroyed. This has been detected by the Clang thread-safety analyzer. Cc: Alex Deucher <alexander.deucher@amd.com> Cc: Christian König <christian.koenig@amd.com> Cc: Yang Wang <kevinyang.wang@amd.com> Cc: Hawking Zhang <Hawking.Zhang@amd.com> Cc: amd-gfx@lists.freedesktop.org Fixes: f5e4cc8461c4 ("drm/amdgpu: implement RAS ACA driver framework") Reviewed-by: Yang Wang <kevinyang.wang@amd.com> Acked-by: Christian König <christian.koenig@amd.com> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit 270258ba320beb99648dceffb67e86ac76786e55) Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-04drm/amdgpu: keep vga memory on MacBooks with switchable graphicsAlex Deucher1-0/+10
[ Upstream commit 096bb75e13cc508d3915b7604e356bcb12b17766 ] On Intel MacBookPros with switchable graphics, when the iGPU is enabled, the address of VRAM gets put at 0 in the dGPU's virtual address space. This is non-standard and seems to cause issues with the cursor if it ends up at 0. We have the framework to reserve memory at 0 in the address space, so enable it here if the vram start address is 0. Reviewed-and-tested-by: Mario Kleiner <mario.kleiner.de@gmail.com> Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/4302 Cc: stable@vger.kernel.org Cc: Mario Kleiner <mario.kleiner.de@gmail.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-04drm/amdgpu: fix sync handling in amdgpu_dma_buf_move_notifyPierre-Eric Pelloux-Prayer1-1/+8
[ Upstream commit b18fc0ab837381c1a6ef28386602cd888f2d9edf ] Invalidating a dmabuf will impact other users of the shared BO. In the scenario where process A moves the BO, it needs to inform process B about the move and process B will need to update its page table. The commit fixes a synchronisation bug caused by the use of the ticket: it made amdgpu_vm_handle_moved behave as if updating the page table immediately was correct but in this case it's not. An example is the following scenario, with 2 GPUs and glxgears running on GPU0 and Xorg running on GPU1, on a system where P2P PCI isn't supported: glxgears: export linear buffer from GPU0 and import using GPU1 submit frame rendering to GPU0 submit tiled->linear blit Xorg: copy of linear buffer The sequence of jobs would be: drm_sched_job_run # GPU0, frame rendering drm_sched_job_queue # GPU0, blit drm_sched_job_done # GPU0, frame rendering drm_sched_job_run # GPU0, blit move linear buffer for GPU1 access # amdgpu_dma_buf_move_notify -> update pt # GPU0 It this point the blit job on GPU0 is still running and would likely produce a page fault. Cc: stable@vger.kernel.org Fixes: a448cb003edc ("drm/amdgpu: implement amdgpu_gem_prime_move_notify v2") Signed-off-by: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-04drm/amdgpu: Adjust usleep_range in fence waitCe Sun1-1/+1
[ Upstream commit 3ee1c72606bd2842f0f377fd4b118362af0323ae ] Tune the sleep interval in the PSP fence wait loop from 10-100us to 60-100us.This adjustment results in an overall wait window of 1.2s (60us * 20000 iterations) to 2 seconds (100us * 20000 iterations), which guarantees that we can retrieve the correct fence value Signed-off-by: Ce Sun <cesun102@amd.com> Reviewed-by: Lijo Lazar <lijo.lazar@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-04drm/amdgpu: Skip loading SDMA_RS64 in VFYuBiao Wang1-0/+1
[ Upstream commit 39c21b81112321cbe1267b02c77ecd2161ce19aa ] VFs use the PF SDMA ucode and are unable to load SDMA_RS64. Signed-off-by: YuBiao Wang <YuBiao.Wang@amd.com> Signed-off-by: Victor Skvortsov <Victor.Skvortsov@amd.com> Reviewed-by: Gavin Wan <gavin.wan@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-04drm/amdgpu: avoid a warning in timedout job handlerAlex Deucher1-1/+2
[ Upstream commit c8cf9ddc549fb93cb5a35f3fe23487b1e6707e74 ] Only set an error on the fence if the fence is not signalled. We can end up with a warning if the per queue reset path signals the fence and sets an error as part of the reset, but fails to recover. Reviewed-by: Timur Kristóf <timur.kristof@gmail.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-04drm/amdgpu: add support for HDP IP version 6.1.1Tim Huang1-0/+1
[ Upstream commit e2fd14f579b841f54a9b7162fef15234d8c0627a ] This initializes HDP IP version 6.1.1. Reviewed-by: Mario Limonciello <mario.limonciello@amd.com> Signed-off-by: Tim Huang <tim.huang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-04drm/amdgpu: fix NULL pointer issue buffer funcsLikun Gao1-1/+2
[ Upstream commit 9877a865d62c9c3e0f4cc369dc9ca9f7f24f5ee9 ] If SDMA block not enabled, buffer_funcs will not initialize, fix the null pointer issue if buffer_funcs not initialized. Signed-off-by: Likun Gao <Likun.Gao@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-04drm/amdgpu: Fix memory leak in amdgpu_ras_init()Zilin Guan1-1/+1
[ Upstream commit ee41e5b63c8210525c936ee637a2c8d185ce873c ] When amdgpu_nbio_ras_sw_init() fails in amdgpu_ras_init(), the function returns directly without freeing the allocated con structure, leading to a memory leak. Fix this by jumping to the release_con label to properly clean up the allocated memory before returning the error code. Compile tested only. Issue found using a prototype static analysis tool and code review. Fixes: fdc94d3a8c88 ("drm/amdgpu: Rework pcie_bif ras sw_init") Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Zilin Guan <zilin@seu.edu.cn> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-04drm/amdgpu: Use kvfree instead of kfree in amdgpu_gmc_get_nps_memranges()Zilin Guan1-1/+1
[ Upstream commit 0c44d61945c4a80775292d96460aa2f22e62f86c ] amdgpu_discovery_get_nps_info() internally allocates memory for ranges using kvcalloc(), which may use vmalloc() for large allocation. Using kfree() to release vmalloc memory will lead to a memory corruption. Use kvfree() to safely handle both kmalloc and vmalloc allocations. Compile tested only. Issue found using a prototype static analysis tool and code review. Fixes: b194d21b9bcc ("drm/amdgpu: Use NPS ranges from discovery table") Reviewed-by: Lijo Lazar <lijo.lazar@amd.com> Signed-off-by: Zilin Guan <zilin@seu.edu.cn> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-04drm/amdgpu: Fix memory leak in amdgpu_acpi_enumerate_xcc()Zilin Guan1-1/+3
[ Upstream commit c9be63d565789b56ca7b0197e2cb78a3671f95a8 ] In amdgpu_acpi_enumerate_xcc(), if amdgpu_acpi_dev_init() returns -ENOMEM, the function returns directly without releasing the allocated xcc_info, resulting in a memory leak. Fix this by ensuring that xcc_info is properly freed in the error paths. Compile tested only. Issue found using a prototype static analysis tool and code review. Fixes: 4d5275ab0b18 ("drm/amdgpu: Add parsing of acpi xcc objects") Reviewed-by: Lijo Lazar <lijo.lazar@amd.com> Signed-off-by: Zilin Guan <zilin@seu.edu.cn> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-04drm/amdgpu: Use explicit VCN instance 0 in SR-IOV initSrinivasan Shanmugam1-22/+23
[ Upstream commit af26fa751c2eef66916acbf0d3c3e9159da56186 ] vcn_v2_0_start_sriov() declares a local variable "i" initialized to zero and uses it only as the instance index in SOC15_REG_OFFSET(UVD, i, ...). The value is never changed and all other fields are taken from adev->vcn.inst[0], so this path only ever programs VCN instance 0. This triggered a Smatch: warn: iterator 'i' not incremented Replace the dummy iterator with an explicit instance index of 0 in SOC15_REG_OFFSET() calls. Fixes: dd26858a9cd8 ("drm/amdgpu: implement initialization part on VCN2.0 for SRIOV") Reported by: Dan Carpenter <dan.carpenter@linaro.org> Cc: darlington Opara <darlington.opara@amd.com> Cc: Jinage Zhao <jiange.zhao@amd.com> Cc: Monk Liu <Monk.Liu@amd.com> Cc: Emily Deng <Emily.Deng@amd.com> Cc: Christian König <christian.koenig@amd.com> Cc: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com> Reviewed-by: Emily Deng <Emily.Deng@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-04drm/amd: Drop "amdgpu kernel modesetting enabled" messageMario Limonciello (AMD)1-1/+0
[ Upstream commit 8644084a74a4573278d6f454c6638ccd5965f4e2 ] The behavior for amdgpu was changed with commit e00e5c223878 ("drm/amdgpu: adjust drm_firmware_drivers_only() handling") to potentially allow loading even if nomodeset was set, so the message is no longer accurate. Just drop it to avoid confusion. Fixes: e00e5c223878 ("drm/amdgpu: adjust drm_firmware_drivers_only() handling") Signed-off-by: Mario Limonciello (AMD) <superm1@kernel.org> Reviewed-by: Aurabindo Pillai <aurabindo.pillai@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-02-11drm/amd/pm: Disable MMIO access during SMU Mode 1 resetPerry Yuan1-0/+3
[ Upstream commit 0de604d0357d0d22cbf03af1077d174b641707b6 ] During Mode 1 reset, the ASIC undergoes a reset cycle and becomes temporarily inaccessible via PCIe. Any attempt to access MMIO registers during this window (e.g., from interrupt handlers or other driver threads) can result in uncompleted PCIe transactions, leading to NMI panics or system hangs. To prevent this, set the `no_hw_access` flag to true immediately after triggering the reset. This signals other driver components to skip register accesses while the device is offline. A memory barrier `smp_mb()` is added to ensure the flag update is globally visible to all cores before the driver enters the sleep/wait state. Signed-off-by: Perry Yuan <perry.yuan@amd.com> Reviewed-by: Yifan Zhang <yifan1.zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit 7edb503fe4b6d67f47d8bb0dfafb8e699bb0f8a4) Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-02-11Revert "drm/amd: Check if ASPM is enabled from PCIe subsystem"Bert Karwatzki1-3/+0
commit 243b467dea1735fed904c2e54d248a46fa417a2d upstream. This reverts commit 7294863a6f01248d72b61d38478978d638641bee. This commit was erroneously applied again after commit 0ab5d711ec74 ("drm/amd: Refactor `amdgpu_aspm` to be evaluated per device") removed it, leading to very hard to debug crashes, when used with a system with two AMD GPUs of which only one supports ASPM. Link: https://lore.kernel.org/linux-acpi/20251006120944.7880-1-spasswolf@web.de/ Link: https://github.com/acpica/acpica/issues/1060 Fixes: 0ab5d711ec74 ("drm/amd: Refactor `amdgpu_aspm` to be evaluated per device") Signed-off-by: Bert Karwatzki <spasswolf@web.de> Reviewed-by: Christian König <christian.koenig@amd.com> Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org> Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit 97a9689300eb2b393ba5efc17c8e5db835917080) Cc: stable@vger.kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-02-06drm/amdgpu/gfx11: adjust KGQ reset sequenceAlex Deucher1-21/+24
[ Upstream commit 3eb46fbb601f9a0b4df8eba79252a0a85e983044 ] Kernel gfx queues do not need to be reinitialized or remapped after a reset. This fixes queue reset failures on APUs. v2: preserve init and remap for MMIO case. Fixes: b3e9bfd86658 ("drm/amdgpu/gfx11: add ring reset callbacks") Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/4789 Reviewed-by: Timur Kristóf <timur.kristof@gmail.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit b340ff216fdabfe71ba0cdd47e9835a141d08e10) Cc: stable@vger.kernel.org Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-02-06drm/amdgpu: Fix cond_exec handling in amdgpu_ib_schedule()Alex Deucher1-2/+3
commit b1defcdc4457649db236415ee618a7151e28788c upstream. The EXEC_COUNT field must be > 0. In the gfx shadow handling we always emit a cond_exec packet after the gfx_shadow packet, but the EXEC_COUNT never gets patched. This leads to a hang when we try and reset queues on gfx11 APUs. Fixes: c68cbbfd54c6 ("drm/amdgpu: cleanup conditional execution") Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/4789 Reviewed-by: Jesse Zhang <Jesse.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit ba205ac3d6e83f56c4f824f23f1b4522cb844ff3) Cc: stable@vger.kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-02-06drm/amdgpu: fix NULL pointer dereference in amdgpu_gmc_filter_faults_removeJon Doron1-1/+6
commit 8b1ecc9377bc641533cd9e76dfa3aee3cd04a007 upstream. On APUs such as Raven and Renoir (GC 9.1.0, 9.2.2, 9.3.0), the ih1 and ih2 interrupt ring buffers are not initialized. This is by design, as these secondary IH rings are only available on discrete GPUs. See vega10_ih_sw_init() which explicitly skips ih1/ih2 initialization when AMD_IS_APU is set. However, amdgpu_gmc_filter_faults_remove() unconditionally uses ih1 to get the timestamp of the last interrupt entry. When retry faults are enabled on APUs (noretry=0), this function is called from the SVM page fault recovery path, resulting in a NULL pointer dereference when amdgpu_ih_decode_iv_ts_helper() attempts to access ih->ring[]. The crash manifests as: BUG: kernel NULL pointer dereference, address: 0000000000000004 RIP: 0010:amdgpu_ih_decode_iv_ts_helper+0x22/0x40 [amdgpu] Call Trace: amdgpu_gmc_filter_faults_remove+0x60/0x130 [amdgpu] svm_range_restore_pages+0xae5/0x11c0 [amdgpu] amdgpu_vm_handle_fault+0xc8/0x340 [amdgpu] gmc_v9_0_process_interrupt+0x191/0x220 [amdgpu] amdgpu_irq_dispatch+0xed/0x2c0 [amdgpu] amdgpu_ih_process+0x84/0x100 [amdgpu] This issue was exposed by commit 1446226d32a4 ("drm/amdgpu: Remove GC HW IP 9.3.0 from noretry=1") which changed the default for Renoir APU from noretry=1 to noretry=0, enabling retry fault handling and thus exercising the buggy code path. Fix this by adding a check for ih1.ring_size before attempting to use it. Also restore the soft_ih support from commit dd299441654f ("drm/amdgpu: Rework retry fault removal"). This is needed if the hardware doesn't support secondary HW IH rings. v2: additional updates (Alex) Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3814 Fixes: dd299441654f ("drm/amdgpu: Rework retry fault removal") Reviewed-by: Timur Kristóf <timur.kristof@gmail.com> Reviewed-by: Philip Yang <Philip.Yang@amd.com> Signed-off-by: Jon Doron <jond@wiz.io> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit 6ce8d536c80aa1f059e82184f0d1994436b1d526) Cc: stable@vger.kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-02-06drm/amdgpu/gfx12: fix wptr reset in KGQ initAlex Deucher1-1/+1
commit 9077d32a4b570fa20500aa26e149981c366c965d upstream. wptr is a 64 bit value and we need to update the full value, not just 32 bits. Align with what we already do for KCQs. Reviewed-by: Timur Kristóf <timur.kristof@gmail.com> Reviewed-by: Jesse Zhang <jesse.zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit a2918f958d3f677ea93c0ac257cb6ba69b7abb7c) Cc: stable@vger.kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-02-06drm/amdgpu/gfx11: fix wptr reset in KGQ initAlex Deucher1-1/+1
commit b1f810471c6a6bd349f7f9f2f2fed96082056d46 upstream. wptr is a 64 bit value and we need to update the full value, not just 32 bits. Align with what we already do for KCQs. Reviewed-by: Timur Kristóf <timur.kristof@gmail.com> Reviewed-by: Jesse Zhang <jesse.zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit 1f16866bdb1daed7a80ca79ae2837a9832a74fbc) Cc: stable@vger.kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-02-06drm/amdgpu/gfx10: fix wptr reset in KGQ initAlex Deucher1-1/+1
commit cc4f433b14e05eaa4a98fd677b836e9229422387 upstream. wptr is a 64 bit value and we need to update the full value, not just 32 bits. Align with what we already do for KCQs. Reviewed-by: Timur Kristóf <timur.kristof@gmail.com> Reviewed-by: Jesse Zhang <jesse.zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit e80b1d1aa1073230b6c25a1a72e88f37e425ccda) Cc: stable@vger.kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-02-06drm/amdgpu/soc21: fix xclk for APUsAlex Deucher1-1/+7
commit e7fbff9e7622a00c2b53cb14df481916f0019742 upstream. The reference clock is supposed to be 100Mhz, but it appears to actually be slightly lower (99.81Mhz). Closes: https://gitlab.freedesktop.org/mesa/mesa/-/issues/14451 Reviewed-by: Jesse Zhang <Jesse.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit 637fee3954d4bd509ea9d95ad1780fc174489860) Cc: stable@vger.kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-01-30drm/amdgpu: remove frame cntl for gfx v12Likun Gao1-12/+0
commit 10343253328e0dbdb465bff709a2619a08fe01ad upstream. Remove emit_frame_cntl function for gfx v12, which is not support. Signed-off-by: Likun Gao <Likun.Gao@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit 5aaa5058dec5bfdcb24c42fe17ad91565a3037ca) Cc: stable@vger.kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-01-23drm/amd: Clean up kfd node on surprise disconnectMario Limonciello (AMD)1-0/+8
commit 28695ca09d326461f8078332aa01db516983e8a2 upstream. When an eGPU is unplugged the KFD topology should also be destroyed for that GPU. This never happens because the fini_sw callbacks never get to run. Run them manually before calling amdgpu_device_ip_fini_early() when a device has already been disconnected. This location is intentionally chosen to make sure that the kfd locking refcount doesn't get incremented unintentionally. Cc: kent.russell@amd.com Closes: https://community.frame.work/t/amd-egpu-on-linux/8691/33 Signed-off-by: Mario Limonciello (AMD) <superm1@kernel.org> Reviewed-by: Kent Russell <kent.russell@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit 6a23e7b4332c10f8b56c33a9c5431b52ecff9aab) Cc: stable@vger.kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-01-17drm/amdgpu: Fix query for VPE block_type and ip_countAlan Liu1-0/+6
commit 72d7f4573660287f1b66c30319efecd6fcde92ee upstream. [Why] Query for VPE block_type and ip_count is missing. [How] Add VPE case in ip_block_type and hw_ip_count query. Reviewed-by: Lang Yu <lang.yu@amd.com> Signed-off-by: Alan Liu <haoping.liu@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit a6ea0a430aca5932b9c75d8e38deeb45665dd2ae) Cc: stable@vger.kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-01-11drm/amdgpu: Forward VMID reservation errorsNatalie Vock1-2/+4
[ Upstream commit 8defb4f081a5feccc3ea8372d0c7af3522124e1f ] Otherwise userspace may be fooled into believing it has a reserved VMID when in reality it doesn't, ultimately leading to GPU hangs when SPM is used. Fixes: 80e709ee6ecc ("drm/amdgpu: add option params to enforce process isolation between graphics and compute") Cc: stable@vger.kernel.org Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Natalie Vock <natalie.vock@gmx.de> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> [ adapted 3-argument amdgpu_vmid_alloc_reserved(adev, vm, vmhub) call to 2-argument version and added separate error check to preserve reserved_vmid tracking logic. ] Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-01-08drm/amdgpu/gmc11: add amdgpu_vm_handle_fault() handlingAlex Deucher1-0/+27
commit 3f2289b56cd98f5741056bdb6e521324eff07ce5 upstream. We need to call amdgpu_vm_handle_fault() on page fault on all gfx9 and newer parts to properly update the page tables, not just for recoverable page faults. Cc: stable@vger.kernel.org Reviewed-by: Timur Kristóf <timur.kristof@gmail.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-01-08drm/amdgpu: add missing lock to amdgpu_ttm_access_memory_sdmaPierre-Eric Pelloux-Prayer1-0/+2
commit 4fa944255be521b1bbd9780383f77206303a3a5c upstream. Users of ttm entities need to hold the gtt_window_lock before using them to guarantee proper ordering of jobs. Cc: stable@vger.kernel.org Fixes: cb5cc4f573e1 ("drm/amdgpu: improve debug VRAM access performance using sdma") Signed-off-by: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-01-08drm/amdgpu/gmc12: add amdgpu_vm_handle_fault() handlingAlex Deucher1-0/+27
commit ff28ff98db6a8eeb469e02fb8bd1647b353232a9 upstream. We need to call amdgpu_vm_handle_fault() on page fault on all gfx9 and newer parts to properly update the page tables, not just for recoverable page faults. Cc: stable@vger.kernel.org Reviewed-by: Timur Kristóf <timur.kristof@gmail.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-01-08Revert "drm/amd: Skip power ungate during suspend for VPE"Mario Limonciello (AMD)1-2/+1
commit 3925683515e93844be204381d2d5a1df5de34f31 upstream. Skipping power ungate exposed some scenarios that will fail like below: ``` amdgpu: Register(0) [regVPEC_QUEUE_RESET_REQ] failed to reach value 0x00000000 != 0x00000001n amdgpu 0000:c1:00.0: amdgpu: VPE queue reset failed ... amdgpu: [drm] *ERROR* wait_for_completion_timeout timeout! ``` The underlying s2idle issue that prompted this commit is going to be fixed in BIOS. This reverts commit 2a6c826cfeedd7714611ac115371a959ead55bda. Fixes: 2a6c826cfeed ("drm/amd: Skip power ungate during suspend for VPE") Cc: stable@vger.kernel.org Signed-off-by: Mario Limonciello (AMD) <superm1@kernel.org> Acked-by: Alex Deucher <alexander.deucher@amd.com> Reported-by: Konstantin <answer2019@yandex.ru> Closes: https://bugzilla.kernel.org/show_bug.cgi?id=220812 Reported-by: Matthew Schwartz <matthew.schwartz@linux.dev> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-12-07drm/amd/amdgpu: reserve vm invalidation engine for uni_mesMichael Chen1-0/+3
commit 971fb57429df5aa4e6efc796f7841e0d10b1e83c upstream. Reserve vm invalidation engine 6 when uni_mes enabled. It is used in processing tlb flush request from host. Signed-off-by: Michael Chen <michael.chen@amd.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Reviewed-by: Shaoyun liu <Shaoyun.liu@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit 873373739b9b150720ea2c5390b4e904a4d21505) Cc: stable@vger.kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-12-07drm/amdgpu: fix cyan_skillfish2 gpu info fw handlingAlex Deucher1-0/+2
[ Upstream commit 7fa666ab07ba9e08f52f357cb8e1aad753e83ac6 ] If the board supports IP discovery, we don't need to parse the gpu info firmware. Backport to 6.18. Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/4721 Fixes: fa819e3a7c1e ("drm/amdgpu: add support for cyan skillfish gpu_info") Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit 5427e32fa3a0ba9a016db83877851ed277b065fb) Signed-off-by: Sasha Levin <sashal@kernel.org>