kernel/linux.git/drivers/gpu/drm/amd/amdkfd, branch v6.13.6

drm/amdkfd: Preserve cp_hqd_pq_control on update_mqd

2025-03-07T17:27:06+00:00

commit 3502ab5022bb5ef1edd063bdb6465a8bf3b46e66 upstream. When userspace applications call AMDKFD_IOC_UPDATE_QUEUE. Preserve bitfields that do not need to be modified as they contain flags to track queue states that are used by CP FW. Signed-off-by: David Yat Sin Reviewed-by: Jay Cornwall Signed-off-by: Alex Deucher (cherry picked from commit 8150827990b709ab5a40c46c30d21b7f7b9e9440) Cc: stable@vger.kernel.org Signed-off-by: Greg Kroah-Hartman

drm/amdkfd: Ensure consistent barrier state saved in gfx12 trap handler

2025-02-27T12:34:13+00:00

[ Upstream commit d584198a6fe4c51f4aa88ad72f258f8961a0f11c ] It is possible for some waves in a workgroup to finish their save sequence before the group leader has had time to capture the workgroup barrier state. When this happens, having those waves exit do impact the barrier state. As a consequence, the state captured by the group leader is invalid, and is eventually incorrectly restored. This patch proposes to have all waves in a workgroup wait for each other at the end of their save sequence (just before calling s_endpgm_saved). Signed-off-by: Lancelot SIX Reviewed-by: Jay Cornwall Signed-off-by: Alex Deucher Cc: stable@vger.kernel.org # 6.12.x Signed-off-by: Sasha Levin

drm/amdkfd: Move gfx12 trap handler to separate file

2025-02-27T12:34:13+00:00

[ Upstream commit 62498e797aeb2bfa92a823ee1a8253f96d1cbe3f ] gfx12 derivatives will have substantially different trap handler implementations from gfx10/gfx11. Add a separate source file for gfx12+ and remove unneeded conditional code. No functional change. v2: Revert copyright date to 2018, minor comment fixes Signed-off-by: Jay Cornwall Reviewed-by: Lancelot Six Cc: Jonathan Kim Acked-by: Alex Deucher Signed-off-by: Alex Deucher Stable-dep-of: d584198a6fe4 ("drm/amdkfd: Ensure consistent barrier state saved in gfx12 trap handler") Signed-off-by: Sasha Levin

amdkfd: properly free gang_ctx_bo when failed to init user queue

2025-02-21T13:10:50+00:00

[ Upstream commit a33f7f9660705fb2ecf3467b2c48965564f392ce ] The destructor of a gtt bo is declared as void amdgpu_amdkfd_free_gtt_mem(struct amdgpu_device *adev, void **mem_obj); Which takes void** as the second parameter. GCC allows passing void* to the function because void* can be implicitly casted to any other types, so it can pass compiling. However, passing this void* parameter into the function's execution process(which expects void** and dereferencing void**) will result in errors. Signed-off-by: Zhu Lingshan Reviewed-by: Felix Kuehling Fixes: fb91065851cd ("drm/amdkfd: Refactor queue wptr_bo GART mapping") Signed-off-by: Alex Deucher Signed-off-by: Sasha Levin

drm/amdkfd: Block per-queue reset when halt_if_hws_hang=1

2025-02-17T10:36:19+00:00

commit f214b7beb00621b983e67ce97477afc3ab4b38f4 upstream. The purpose of halt_if_hws_hang is to preserve GPU state for driver debugging when queue preemption fails. Issuing per-queue reset may kill wavefronts which caused the preemption failure. Signed-off-by: Jay Cornwall Reviewed-by: Jonathan Kim Signed-off-by: Alex Deucher Cc: stable@vger.kernel.org # 6.12.x Signed-off-by: Greg Kroah-Hartman

drm/amdkfd: only flush the validate MES contex

2025-02-17T10:36:19+00:00

commit 9078a5bfa21e78ae68b6d7c365d1b92f26720c55 upstream. The following page fault was observed duringthe KFD process release. In this particular error case, the HIP test (./MemcpyPerformance -h) does not require the queue. As a result, the process_context_addr was not assigned when the KFD process was released, ultimately leading to this page fault during the execution of the function kfd_process_dequeue_from_all_devices(). [345962.294891] amdgpu 0000:03:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:153 vmid:0 pasid:0) [345962.295333] amdgpu 0000:03:00.0: amdgpu: in page starting at address 0x0000000000000000 from client 10 [345962.295775] amdgpu 0000:03:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00000B33 [345962.296097] amdgpu 0000:03:00.0: amdgpu: Faulty UTCL2 client ID: CPC (0x5) [345962.296394] amdgpu 0000:03:00.0: amdgpu: MORE_FAULTS: 0x1 [345962.296633] amdgpu 0000:03:00.0: amdgpu: WALKER_ERROR: 0x1 [345962.296876] amdgpu 0000:03:00.0: amdgpu: PERMISSION_FAULTS: 0x3 [345962.297135] amdgpu 0000:03:00.0: amdgpu: MAPPING_ERROR: 0x1 [345962.297377] amdgpu 0000:03:00.0: amdgpu: RW: 0x0 [345962.297682] amdgpu 0000:03:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:169 vmid:0 pasid:0) Signed-off-by: Prike Liang Reviewed-by: Jonathan Kim Signed-off-by: Alex Deucher Cc: stable@vger.kernel.org Signed-off-by: Greg Kroah-Hartman

drm/amdkfd: Queue interrupt work to different CPU

2025-02-17T10:35:59+00:00

[ Upstream commit 34db5a32617d102e8042151bb87590e43c97132e ] For CPX mode, each KFD node has interrupt worker to process ih_fifo to send events to user space. Currently all interrupt workers of same adev queue to same CPU, all workers execution are actually serialized and this cause KFD ih_fifo overflow when CPU usage is high. Use per-GPU unbounded highpri queue with number of workers equals to number of partitions, let queue_work select the next CPU round robin among the local CPUs of same NUMA. Signed-off-by: Philip Yang Reviewed-by: Felix Kuehling Signed-off-by: Alex Deucher Signed-off-by: Sasha Levin

drm/amdkfd: wq_release signals dma_fence only when available

2025-01-06T20:17:23+00:00

kfd_process_wq_release() signals eviction fence by dma_fence_signal() which wanrs if dma_fence is NULL. kfd_process->ef is initialized by kfd_process_device_init_vm() through ioctl. That means the fence is NULL for a new created kfd_process, and close a kfd_process right after open it will trigger the warning. This commit conditionally signals the eviction fence in kfd_process_wq_release() only when it is available. [ 503.660882] WARNING: CPU: 0 PID: 9 at drivers/dma-buf/dma-fence.c:467 dma_fence_signal+0x74/0xa0 [ 503.782940] Workqueue: kfd_process_wq kfd_process_wq_release [amdgpu] [ 503.789640] RIP: 0010:dma_fence_signal+0x74/0xa0 [ 503.877620] Call Trace: [ 503.880066] [ 503.882168] ? __warn+0xcd/0x260 [ 503.885407] ? dma_fence_signal+0x74/0xa0 [ 503.889416] ? report_bug+0x288/0x2d0 [ 503.893089] ? handle_bug+0x53/0xa0 [ 503.896587] ? exc_invalid_op+0x14/0x50 [ 503.900424] ? asm_exc_invalid_op+0x16/0x20 [ 503.904616] ? dma_fence_signal+0x74/0xa0 [ 503.908626] kfd_process_wq_release+0x6b/0x370 [amdgpu] [ 503.914081] process_one_work+0x654/0x10a0 [ 503.918186] worker_thread+0x6c3/0xe70 [ 503.921943] ? srso_alias_return_thunk+0x5/0xfbef5 [ 503.926735] ? srso_alias_return_thunk+0x5/0xfbef5 [ 503.931527] ? __kthread_parkme+0x82/0x140 [ 503.935631] ? __pfx_worker_thread+0x10/0x10 [ 503.939904] kthread+0x2a8/0x380 [ 503.943132] ? __pfx_kthread+0x10/0x10 [ 503.946882] ret_from_fork+0x2d/0x70 [ 503.950458] ? __pfx_kthread+0x10/0x10 [ 503.954210] ret_from_fork_asm+0x1a/0x30 [ 503.958142] [ 503.960328] ---[ end trace 0000000000000000 ]--- Fixes: 967d226eaae8 ("dma-buf: add WARN_ON() illegal dma-fence signaling") Signed-off-by: Zhu Lingshan Reviewed-by: Felix Kuehling Signed-off-by: Alex Deucher (cherry picked from commit 2774ef7625adb5fb9e9265c26a59dca7b8fd171e) Cc: stable@vger.kernel.org

drm/amdkfd: fixed page fault when enable MES shader debugger

2025-01-06T20:13:37+00:00

Initialize the process context address before setting the shader debugger. [ 260.781212] amdgpu 0000:03:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:32 vmid:0 pasid:0) [ 260.781236] amdgpu 0000:03:00.0: amdgpu: in page starting at address 0x0000000000000000 from client 10 [ 260.781255] amdgpu 0000:03:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00040A40 [ 260.781270] amdgpu 0000:03:00.0: amdgpu: Faulty UTCL2 client ID: CPC (0x5) [ 260.781284] amdgpu 0000:03:00.0: amdgpu: MORE_FAULTS: 0x0 [ 260.781296] amdgpu 0000:03:00.0: amdgpu: WALKER_ERROR: 0x0 [ 260.781308] amdgpu 0000:03:00.0: amdgpu: PERMISSION_FAULTS: 0x4 [ 260.781320] amdgpu 0000:03:00.0: amdgpu: MAPPING_ERROR: 0x0 [ 260.781332] amdgpu 0000:03:00.0: amdgpu: RW: 0x1 [ 260.782017] amdgpu 0000:03:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:32 vmid:0 pasid:0) [ 260.782039] amdgpu 0000:03:00.0: amdgpu: in page starting at address 0x0000000000000000 from client 10 [ 260.782058] amdgpu 0000:03:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00040A41 [ 260.782073] amdgpu 0000:03:00.0: amdgpu: Faulty UTCL2 client ID: CPC (0x5) [ 260.782087] amdgpu 0000:03:00.0: amdgpu: MORE_FAULTS: 0x1 [ 260.782098] amdgpu 0000:03:00.0: amdgpu: WALKER_ERROR: 0x0 [ 260.782110] amdgpu 0000:03:00.0: amdgpu: PERMISSION_FAULTS: 0x4 [ 260.782122] amdgpu 0000:03:00.0: amdgpu: MAPPING_ERROR: 0x0 [ 260.782137] amdgpu 0000:03:00.0: amdgpu: RW: 0x1 [ 260.782155] amdgpu 0000:03:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:32 vmid:0 pasid:0) [ 260.782166] amdgpu 0000:03:00.0: amdgpu: in page starting at address 0x0000000000000000 from client 10 Fixes: 438b39ac74e2 ("drm/amdkfd: pause autosuspend when creating pdd") Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3849 Signed-off-by: Jesse Zhang Reviewed-by: Alex Deucher Signed-off-by: Alex Deucher (cherry picked from commit 5b231f5bc9ff02ec5737f2ec95cdf15ac95088e9) Cc: stable@vger.kernel.org

drm/amdkfd: pause autosuspend when creating pdd

2024-12-10T15:26:18+00:00

When using MES creating a pdd will require talking to the GPU to setup the relevant context. The code here forgot to wake up the GPU in case it was in suspend, this causes KVM to EFAULT for passthrough GPU for example. This issue can be masked if the GPU was woken up by other things (e.g. opening the KMS node) first and have not yet gone to sleep. v4: do the allocation of proc_ctx_bo in a lazy fashion when the first queue is created in a process (Felix) Signed-off-by: Jesse Zhang Reviewed-by: Yunxiang Li Signed-off-by: Alex Deucher Cc: stable@vger.kernel.org