kernel/linux.git/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c, branch v6.12.80

drm/amdgpu: Fix fence put before wait in amdgpu_amdkfd_submit_ib

2026-04-02T11:09:40+00:00

[ Upstream commit 7150850146ebfa4ca998f653f264b8df6f7f85be ] amdgpu_amdkfd_submit_ib() submits a GPU job and gets a fence from amdgpu_ib_schedule(). This fence is used to wait for job completion. Currently, the code drops the fence reference using dma_fence_put() before calling dma_fence_wait(). If dma_fence_put() releases the last reference, the fence may be freed before dma_fence_wait() is called. This can lead to a use-after-free. Fix this by waiting on the fence first and releasing the reference only after dma_fence_wait() completes. Fixes the below: drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c:697 amdgpu_amdkfd_submit_ib() warn: passing freed memory 'f' (line 696) Fixes: 9ae55f030dc5 ("drm/amdgpu: Follow up change to previous drm scheduler change.") Cc: Felix Kuehling Cc: Dan Carpenter Cc: Christian König Cc: Alex Deucher Signed-off-by: Srinivasan Shanmugam Reviewed-by: Christian König Signed-off-by: Alex Deucher (cherry picked from commit 8b9e5259adc385b61a6590a13b82ae0ac2bd3482) Signed-off-by: Sasha Levin

drm/amdgpu: disable gfxoff with the compute workload on gfx12

2025-01-23T16:23:04+00:00

commit 90505894c4ed581318836b792c57723df491cb91 upstream. Disable gfxoff with the compute workload on gfx12. This is a workaround for the opencl test failure. Signed-off-by: Kenneth Feng Acked-by: Alex Deucher Signed-off-by: Alex Deucher (cherry picked from commit 2affe2bbc997b3920045c2c434e480c81a5f9707) Cc: stable@vger.kernel.org # 6.12.x Signed-off-by: Greg Kroah-Hartman

drm/amdgpu: Fix map/unmap queue logic

2024-12-05T13:01:56+00:00

[ Upstream commit fa31798582882740f2b13d19e1bd43b4ef918e2f ] In current logic, it calls ring_alloc followed by a ring_test. ring_test in turn will call another ring_alloc. This is illegal usage as a ring_alloc is expected to be closed properly with a ring_commit. Change to commit the map/unmap queue packet first followed by a ring_test. Add a comment about the usage of ring_test. Also, reorder the current pre-condition checks of job hang or kiq ring scheduler not ready. Without them being met, it is not useful to attempt ring or memory allocations. Fixes tag refers to the original patch which introduced this issue which then got carried over into newer code. Signed-off-by: Lijo Lazar Reviewed-by: Le Ma Signed-off-by: Alex Deucher Fixes: 6c10b5cc4eaa ("drm/amdgpu: Remove duplicate code in gfx_v8_0.c") Signed-off-by: Sasha Levin

drm/amdgpu: Retire query_utcl2_poison_status callback

2024-08-23T14:53:16+00:00

Driver switches to interrupt source id to identify utcl2 poison event. polling interface is not needed. Signed-off-by: Hawking Zhang Reviewed-by: Tao Zhou Signed-off-by: Alex Deucher

drm/amdkfd: APIs to stop/start KFD scheduling

2024-08-21T02:07:28+00:00

Provide amdgpu_amdkfd_stop_sched() for amdgpu to stop KFD scheduling compute work on HIQ. amdgpu_amdkfd_start_sched() resumes the scheduling. When amdgpu_amdkfd_stop_sched is called, KFD will unmap queues from runlist. If users send ioctls to KFD to create queues, they'll be added but those queues won't be mapped to runlist (so not scheduled) until amdgpu_amdkfd_start_sched is called. v2: fix build (Alex) Signed-off-by: Amber Lin Signed-off-by: Alex Deucher

drm/amdkfd: amdkfd_free_gtt_mem clear the correct pointer

2024-07-23T21:34:44+00:00

Pass pointer reference to amdgpu_bo_unref to clear the correct pointer, otherwise amdgpu_bo_unref clear the local variable, the original pointer not set to NULL, this could cause use-after-free bug. Signed-off-by: Philip Yang Reviewed-by: Felix Kuehling Acked-by: Christian König Signed-off-by: Alex Deucher

drm/amdkfd: add reset cause in gpu pre-reset smi event

2024-06-05T15:25:14+00:00

reset cause is requested by customer as additional info for gpu reset smi event. v2: integerate reset sources suggested by Lijo Lazar Signed-off-by: Eric Huang Reviewed-by: Lijo Lazar Signed-off-by: Alex Deucher

drm/amdkfd: Let VRAM allocations go to GTT domain on small APUs

2024-05-08T19:17:04+00:00

Small APUs(i.e., consumer, embedded products) usually have a small carveout device memory which can't satisfy most compute workloads memory allocation requirements. We can't even run a Basic MNIST Example with a default 512MB carveout. https://github.com/pytorch/examples/tree/main/mnist. Error Log: "torch.cuda.OutOfMemoryError: HIP out of memory. Tried to allocate 84.00 MiB. GPU 0 has a total capacity of 512.00 MiB of which 0 bytes is free. Of the allocated memory 103.83 MiB is allocated by PyTorch, and 22.17 MiB is reserved by PyTorch but unallocated" Though we can change BIOS settings to enlarge carveout size, which is inflexible and may bring complaint. On the other hand, the memory resource can't be effectively used between host and device. The solution is MI300A approach, i.e., let VRAM allocations go to GTT. Then device and host can flexibly and effectively share memory resource. v2: Report local_mem_size_private as 0. (Felix) Signed-off-by: Lang Yu Reviewed-by: Felix Kuehling Signed-off-by: Alex Deucher

drm/amdgpu: prepare to handle pasid poison consumption

2024-04-26T21:22:42+00:00

Prepare to handle pasid poison consumption. Signed-off-by: YiPeng Chai Reviewed-by: Tao Zhou Signed-off-by: Alex Deucher

drm/amdgpu: make reset method configurable for RAS poison

2024-03-20T17:38:15+00:00

Each RAS block has different requirement for gpu reset in poison consumption handling. Add support for mmhub RAS poison consumption handling. v2: remove the mmhub poison support for kfd int v10. Signed-off-by: Tao Zhou Reviewed-by: Hawking Zhang Signed-off-by: Alex Deucher