kernel/linux.git/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c, branch v6.11.8

drm/amdkfd: Fix resource leak in criu restore queue

2024-10-10T10:03:30+00:00

[ Upstream commit aa47fe8d3595365a935921a90d00bc33ee374728 ] To avoid memory leaks, release q_extra_data when exiting the restore queue. v2: Correct the proto (Alex) Signed-off-by: Jesse Zhang Reviewed-by: Tim Huang Signed-off-by: Alex Deucher Signed-off-by: Sasha Levin

drm/amdkfd: amdkfd_free_gtt_mem clear the correct pointer

2024-10-10T10:03:24+00:00

[ Upstream commit c86ad39140bbcb9dc75a10046c2221f657e8083b ] Pass pointer reference to amdgpu_bo_unref to clear the correct pointer, otherwise amdgpu_bo_unref clear the local variable, the original pointer not set to NULL, this could cause use-after-free bug. Signed-off-by: Philip Yang Reviewed-by: Felix Kuehling Acked-by: Christian König Signed-off-by: Alex Deucher Signed-off-by: Sasha Levin

drm/amdgpu: add lock in kfd_process_dequeue_from_device

2024-06-14T20:17:11+00:00

We need to take the reset domain lock before talking to MES. While in this case we can take the lock inside the mes helper. We can't do so for most other mes helpers since they are used during reset. So for consistency sake we add the lock here. Signed-off-by: Yunxiang Li Reviewed-by: Felix Kuehling Signed-off-by: Alex Deucher

drm/amdgpu/kfd: remove is_hws_hang and is_resetting

2024-06-14T20:15:58+00:00

is_hws_hang and is_resetting serves pretty much the same purpose and they all duplicates the work of the reset_domain lock, just check that directly instead. This also eliminate a few bugs listed below and get rid of dqm->ops.pre_reset. kfd_hws_hang did not need to avoid scheduling another reset. If the on-going reset decided to skip GPU reset we have a bad time, otherwise the extra reset will get cancelled anyway. remove_queue_mes forgot to check is_resetting flag compared to the pre-MES path unmap_queue_cpsch, so it did not block hw access during reset correctly. Signed-off-by: Yunxiang Li Reviewed-by: Felix Kuehling Signed-off-by: Alex Deucher

drm/amdgpu: Add gfx v9_4_4 ip block

2024-05-02T19:49:16+00:00

Add gfx v9_4_4 ip block support Signed-off-by: Hawking Zhang Reviewed-by: Le Ma Signed-off-by: Alex Deucher

drm/amdkfd: update SIMD distribution algo for GFXIP 9.4.2 onwards

2024-02-15T19:18:44+00:00

In certain cooperative group dispatch scenarios the default SPI resource allocation may cause reduced per-CU workgroup occupancy. Set COMPUTE_RESOURCE_LIMITS.FORCE_SIMD_DIST=1 to mitigate soft hang scenarions. Reviewed-by: Felix Kuehling Suggested-by: Joseph Greathouse Signed-off-by: Rajneesh Bhardwaj Signed-off-by: Alex Deucher

drm/amdkfd: only flush mes process context if mes support is there

2023-12-15T17:15:13+00:00

Fix up on mes process context flush to prevent non-mes devices from spamming error messages or running into undefined behaviour during process termination. Fixes: bd33bb1409b4 ("drm/amdkfd: fix mes set shader debugger process management") Signed-off-by: Jonathan Kim Reviewed-by: Eric Huang Signed-off-by: Alex Deucher

drm/amdkfd: fix mes set shader debugger process management

2023-12-13T21:07:43+00:00

MES provides the driver a call to explicitly flush stale process memory within the MES to avoid a race condition that results in a fatal memory violation. When SET_SHADER_DEBUGGER is called, the driver passes a memory address that represents a process context address MES uses to keep track of future per-process calls. Normally, MES will purge its process context list when the last queue has been removed. The driver, however, can call SET_SHADER_DEBUGGER regardless of whether a queue has been added or not. If SET_SHADER_DEBUGGER has been called with no queues as the last call prior to process termination, the passed process context address will still reside within MES. On a new process call to SET_SHADER_DEBUGGER, the driver may end up passing an identical process context address value (based on per-process gpu memory address) to MES but is now pointing to a new allocated buffer object during KFD process creation. Since the MES is unaware of this, access of the passed address points to the stale object within MES and triggers a fatal memory violation. The solution is for KFD to explicitly flush the process context address from MES on process termination. Note that the flush call and the MES debugger calls use the same MES interface but are separated as KFD calls to avoid conflicting with each other. Signed-off-by: Jonathan Kim Tested-by: Alice Wong Reviewed-by: Eric Huang Signed-off-by: Alex Deucher

drm/amdkfd: Free gang_ctx_bo and wptr_bo in pqm_uninit

2023-11-29T21:49:23+00:00

[Why] Memory leaks of gang_ctx_bo and wptr_bo. [How] Free gang_ctx_bo and wptr_bo in pqm_uninit. v2: add a common function pqm_clean_queue_resource to free queue's resources. v3: reset pdd->pqd.num_gws when destorying GWS queue. Reviewed-by: Felix Kuehling Signed-off-by: ZhenGuo Yin Signed-off-by: Alex Deucher

drm/amdkfd: get doorbell's absolute offset based on the db_size

2023-10-09T21:02:34+00:00

Here, Adding db_size in byte to find the doorbell's absolute offset for both 32-bit and 64-bit doorbell sizes. So that doorbell offset will be aligned based on the doorbell size. v2: - Addressed the review comment from Felix. v3: - Adding doorbell_size as parameter to get db absolute offset. v4: Squash the two patches into one. Cc: Christian Koenig Cc: Alex Deucher Reviewed-by: Felix Kuehling Signed-off-by: Shashank Sharma Signed-off-by: Arvind Yadav Signed-off-by: Alex Deucher