kernel/linux.git/drivers/gpu/drm/amd/amdkfd, branch v6.6.36

Revert "drm/amdkfd: fix gfx_target_version for certain 11.0.3 devices"

2024-06-16T11:47:38+00:00

commit dd2b75fd9a79bf418e088656822af06fc253dbe3 upstream. This reverts commit 28ebbb4981cb1fad12e0b1227dbecc88810b1ee8. Revert this commit as apparently the LLVM code to take advantage of this never landed. Reviewed-by: Feifei Xu Signed-off-by: Alex Deucher Cc: Feifei Xu Signed-off-by: Greg Kroah-Hartman

drm/amdkfd: Flush the process wq before creating a kfd_process

2024-06-12T09:11:29+00:00

[ Upstream commit f5b9053398e70a0c10aa9cb4dd5910ab6bc457c5 ] There is a race condition when re-creating a kfd_process for a process. This has been observed when a process under the debugger executes exec(3). In this scenario: - The process executes exec. - This will eventually release the process's mm, which will cause the kfd_process object associated with the process to be freed (kfd_process_free_notifier decrements the reference count to the kfd_process to 0). This causes kfd_process_ref_release to enqueue kfd_process_wq_release to the kfd_process_wq. - The debugger receives the PTRACE_EVENT_EXEC notification, and tries to re-enable AMDGPU traps (KFD_IOC_DBG_TRAP_ENABLE). - When handling this request, KFD tries to re-create a kfd_process. This eventually calls kfd_create_process and kobject_init_and_add. At this point the call to kobject_init_and_add can fail because the old kfd_process.kobj has not been freed yet by kfd_process_wq_release. This patch proposes to avoid this race by making sure to drain kfd_process_wq before creating a new kfd_process object. This way, we know that any cleanup task is done executing when we reach kobject_init_and_add. Signed-off-by: Lancelot SIX Reviewed-by: Felix Kuehling Signed-off-by: Alex Deucher Signed-off-by: Sasha Levin

drm/amdkfd: Add VRAM accounting for SVM migration

2024-06-12T09:11:23+00:00

[ Upstream commit 1e214f7faaf5d842754cd5cfcd76308bfedab3b5 ] Do VRAM accounting when doing migrations to vram to make sure there is enough available VRAM and migrating to VRAM doesn't evict other possible non-unified memory BOs. If migrating to VRAM fails, driver can fall back to using system memory seamlessly. Signed-off-by: Mukul Joshi Reviewed-by: Felix Kuehling Signed-off-by: Alex Deucher Signed-off-by: Sasha Levin

drm/amdkfd: don't allow mapping the MMIO HDP page with large pages

2024-05-17T10:02:34+00:00

commit be4a2a81b6b90d1a47eaeaace4cc8e2cb57b96c7 upstream. We don't get the right offset in that case. The GPU has an unused 4K area of the register BAR space into which you can remap registers. We remap the HDP flush registers into this space to allow userspace (CPU or GPU) to flush the HDP when it updates VRAM. However, on systems with >4K pages, we end up exposing PAGE_SIZE of MMIO space. Fixes: d8e408a82704 ("drm/amdkfd: Expose HDP registers to user space") Reviewed-by: Felix Kuehling Signed-off-by: Alex Deucher Cc: stable@vger.kernel.org Signed-off-by: Greg Kroah-Hartman

amd/amdkfd: sync all devices to wait all processes being evicted

2024-05-17T10:02:16+00:00

[ Upstream commit d06af584be5a769d124b7302b32a033e9559761d ] If there are more than one device doing reset in parallel, the first device will call kfd_suspend_all_processes() to evict all processes on all devices, this call takes time to finish. other device will start reset and recover without waiting. if the process has not been evicted before doing recover, it will be restored, then caused page fault. Signed-off-by: Zhigang Luo Reviewed-by: Felix Kuehling Signed-off-by: Alex Deucher Signed-off-by: Sasha Levin

drm/amdkfd: range check cp bad op exception interrupts

2024-05-17T10:02:11+00:00

[ Upstream commit 0cac183b98d8a8c692c98e8dba37df15a9e9210d ] Due to a CP interrupt bug, bad packet garbage exception codes are raised. Do a range check so that the debugger and runtime do not receive garbage codes. Update the user api to guard exception code type checking as well. Signed-off-by: Jonathan Kim Tested-by: Jesse Zhang Reviewed-by: Felix Kuehling Signed-off-by: Alex Deucher Signed-off-by: Sasha Levin

drm/amdkfd: Check cgroup when returning DMABuf info

2024-05-17T10:02:11+00:00

[ Upstream commit 9d7993a7ab9651afd5fb295a4992e511b2b727aa ] Check cgroup permissions when returning DMA-buf info and based on cgroup info return the GPU id of the GPU that have access to the BO. Signed-off-by: Mukul Joshi Reviewed-by: Felix Kuehling Signed-off-by: Alex Deucher Signed-off-by: Sasha Levin

drm/amdkfd: Fix memory leak in create_process failure

2024-04-27T15:11:42+00:00

commit 18921b205012568b45760753ad3146ddb9e2d4e2 upstream. Fix memory leak due to a leaked mmget reference on an error handling code path that is triggered when attempting to create KFD processes while a GPU reset is in progress. Fixes: 0ab2d7532b05 ("drm/amdkfd: prepare per-process debug enable and disable") CC: Xiaogang Chen Signed-off-by: Felix Kuehling Tested-by: Harish Kasiviswanthan Reviewed-by: Mukul Joshi Signed-off-by: Alex Deucher Cc: stable@vger.kernel.org Signed-off-by: Greg Kroah-Hartman

drm/amdkfd: Reset GPU on queue preemption failure

2024-04-17T09:19:34+00:00

commit 8bdfb4ea95ca738d33ef71376c21eba20130f2eb upstream. Currently, with F32 HWS GPU reset is only when unmap queue fails. However, if compute queue doesn't repond to preemption request in time unmap will return without any error. In this case, only preemption error is logged and Reset is not triggered. Call GPU reset in this case also. Reviewed-by: Alex Deucher Signed-off-by: Harish Kasiviswanathan Reviewed-by: Mukul Joshi Signed-off-by: Alex Deucher Cc: stable@vger.kernel.org Signed-off-by: Greg Kroah-Hartman

amdkfd: use calloc instead of kzalloc to avoid integer overflow

2024-04-13T11:07:28+00:00

commit 3b0daecfeac0103aba8b293df07a0cbaf8b43f29 upstream. This uses calloc instead of doing the multiplication which might overflow. Cc: stable@vger.kernel.org Signed-off-by: Dave Airlie Signed-off-by: Greg Kroah-Hartman