kernel/linux.git/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c, branch v6.19.11

drm/amdgpu: Fix use-after-free race in VM acquire

2026-03-19T15:15:23+00:00

commit 2c1030f2e84885cc58bffef6af67d5b9d2e7098f upstream. Replace non-atomic vm->process_info assignment with cmpxchg() to prevent race when parent/child processes sharing a drm_file both try to acquire the same VM after fork(). Reviewed-by: Harish Kasiviswanathan Signed-off-by: Alysa Liu Signed-off-by: Alex Deucher (cherry picked from commit c7c573275ec20db05be769288a3e3bb2250ec618) Cc: stable@vger.kernel.org Signed-off-by: Greg Kroah-Hartman
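
The race described above can be illustrated with a userspace sketch. It is not the kernel patch: it uses C11 atomic_compare_exchange_strong() as a stand-in for the kernel's cmpxchg(), and struct vm, struct process_info and vm_acquire() are hypothetical names chosen for illustration. The point is that claiming the pointer and checking whether it was already claimed happen in one atomic step, so two racing callers cannot both succeed.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel structures. */
struct process_info {
    int id;
};

struct vm {
    _Atomic(struct process_info *) process_info;
};

/*
 * Try to attach info to the VM. Returns 0 if this caller claimed the VM,
 * -1 if another caller had already installed its own process_info.
 * A plain "if (!vm->process_info) vm->process_info = info;" would let two
 * racing callers both pass the check and both assign.
 */
static int vm_acquire(struct vm *vm, struct process_info *info)
{
    struct process_info *expected = NULL;

    if (!atomic_compare_exchange_strong(&vm->process_info, &expected, info))
        return -1;  /* lost the race; expected now holds the winner */
    return 0;
}
```

In this sketch the loser learns who won through expected and can release whatever it allocated instead of overwriting a pointer that is already in use.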

drm/amdgpu: Fix double deletion of validate_list

2026-02-03T22:24:21+00:00

If amdgpu_amdkfd_gpuvm_free_memory_of_gpu() fails after kgd_mem is removed from validate_list, the mem handle still lingers in the KFD idr. This means that when the process is terminated, kfd_process_free_outstanding_kfd_bos() will call amdgpu_amdkfd_gpuvm_free_memory_of_gpu() again, resulting in double deletion. To avoid this: (a) check whether the list is empty before deleting from it, and (b) rearrange amdgpu_amdkfd_gpuvm_free_memory_of_gpu() so that it can safely be called again if it returns failure the first time. Signed-off-by: Harish Kasiviswanathan Reviewed-by: Philip Yang Signed-off-by: Alex Deucher (cherry picked from commit 6ba60345f45eaf7cb4f89105d26083a4b9fd1cba)
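
Point (a) can be sketched in userspace as follows. This is an assumption-laden illustration, not the driver code: struct list_node and the node_* helpers are hypothetical, modelled loosely on a circular doubly linked list where an unlinked node points at itself. Checking for that state before unlinking makes the removal idempotent, so a second call after an earlier failure does no harm.

```c
#include <stdbool.h>

/* Hypothetical minimal circular doubly linked list node. */
struct list_node {
    struct list_node *next, *prev;
};

/* An initialised, unlinked node points at itself. */
static void node_init(struct list_node *n)
{
    n->next = n;
    n->prev = n;
}

static bool node_is_unlinked(const struct list_node *n)
{
    return n->next == n;
}

static void node_add_tail(struct list_node *n, struct list_node *head)
{
    n->prev = head->prev;
    n->next = head;
    head->prev->next = n;
    head->prev = n;
}

/*
 * Remove n from whatever list it is on. Returns true if it was linked,
 * false if it had already been removed. Re-initialising the node after
 * unlinking is what makes a repeated call safe.
 */
static bool node_remove_once(struct list_node *n)
{
    if (node_is_unlinked(n))
        return false;
    n->prev->next = n->next;
    n->next->prev = n->prev;
    node_init(n);
    return true;
}
```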

drm/amdkfd: Don't clear PT after process killed

2025-11-04T16:53:22+00:00

If the process is killed, the VM entity is stopped. Submitting a PT update job then triggers the error message "*ERROR* Trying to push to a killed entity", and the job will not execute. Suggested-by: Christian König Signed-off-by: Philip Yang Reviewed-by: Christian König Signed-off-by: Alex Deucher

drm/amdgpu: Fix vram_usage underflow

2025-10-20T22:25:22+00:00

vram_usage was subtracting non-vram memory size, which caused it to become negative. Signed-off-by: Alysa Liu Reviewed-by: Harish Kasiviswanathan Signed-off-by: Alex Deucher
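
A minimal sketch of the accounting rule, under stated assumptions: the enum, struct mem_usage and the account_* helpers are hypothetical and do not reflect the driver's actual data structures. It only shows the invariant implied by the message, namely that the counter is decremented for exactly the allocations that incremented it.

```c
#include <stdint.h>

/* Hypothetical memory domains and usage counter. */
enum mem_domain {
    DOMAIN_VRAM,
    DOMAIN_GTT
};

struct mem_usage {
    int64_t vram_usage;
};

/* Only VRAM allocations count towards vram_usage. */
static void account_alloc(struct mem_usage *u, enum mem_domain d, int64_t size)
{
    if (d == DOMAIN_VRAM)
        u->vram_usage += size;
}

/*
 * The free path must apply the same domain filter as the alloc path.
 * Subtracting unconditionally would drive the counter negative as soon
 * as a non-VRAM buffer is freed.
 */
static void account_free(struct mem_usage *u, enum mem_domain d, int64_t size)
{
    if (d == DOMAIN_VRAM)
        u->vram_usage -= size;
}
```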

drm/amdgpu: update the functions to use amdgpu version of hmm

2025-10-13T18:14:36+00:00

At times we need a bo reference for hmm. For that, add a new struct amdgpu_hmm_range which holds an optional bo member and an hmm_range. Use amdgpu_hmm_range instead of hmm_range, and make the bo an optional argument so that callers can either have the bo reference taken for them or handle it explicitly. Signed-off-by: Sunil Khatri Reviewed-by: Christian König Signed-off-by: Alex Deucher

drm/amdgpu: clean up amdgpu hmm range functions

2025-10-13T18:14:28+00:00

Clean up the amdgpu hmm range functions so that each has a clearer definition.
a. Split amdgpu_ttm_tt_get_user_pages_done into two:
   1. amdgpu_hmm_range_valid: check whether the user pages are valid and update the seq num.
   2. amdgpu_hmm_range_free: clean up the hmm range and pfn memory.
b. amdgpu_ttm_tt_get_user_pages_done and amdgpu_ttm_tt_discard_user_pages are similar functions, so remove discard and use amdgpu_hmm_range_free directly to clean up the hmm range and pfn memory.
Suggested-by: Christian König Signed-off-by: Sunil Khatri Reviewed-by: Christian König Signed-off-by: Alex Deucher

drm/amdgpu: use user provided hmm_range buffer in amdgpu_ttm_tt_get_user_pages

2025-10-13T18:14:28+00:00

Update amdgpu_ttm_tt_get_user_pages, all dependent functions and their callers to use a caller-allocated hmm_range buffer instead of having the hmm layer allocate the buffer. This is needed to make hmm_range pointers easily accessible without going through the bo, which is a requirement for the userqueue to lock the userptrs effectively. Signed-off-by: Sunil Khatri Reviewed-by: Christian König Signed-off-by: Alex Deucher

drm/amdgpu: use atomic functions with memory barriers for vm fault info

2025-10-13T18:14:15+00:00

The atomic variable vm_fault_info_updated is used to synchronize access to adev->gmc.vm_fault_info between the interrupt handler and get_vm_fault_info(). The default atomic functions like atomic_set() and atomic_read() do not provide memory barriers. This allows for CPU instruction reordering, meaning the memory accesses to vm_fault_info and the vm_fault_info_updated flag are not guaranteed to occur in the intended order. This creates a race condition that can lead to inconsistent or stale data being used. The previous implementation, which used an explicit mb(), was incomplete and inefficient. It failed to account for all potential CPU reorderings, such as the access of vm_fault_info being reordered before the atomic_read of the flag. This approach is also more verbose and less performant than using the proper atomic functions with acquire/release semantics. Fix this by switching to atomic_set_release() and atomic_read_acquire(). These functions provide the necessary acquire and release semantics, which act as memory barriers to ensure the correct order of operations. It is also more efficient and idiomatic than using explicit full memory barriers. Fixes: b97dfa27ef3a ("drm/amdgpu: save vm fault information for amdkfd") Cc: stable@vger.kernel.org Signed-off-by: Gui-Dong Han Signed-off-by: Felix Kuehling Reviewed-by: Felix Kuehling Signed-off-by: Alex Deucher
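
The ordering argument can be sketched in userspace with C11 atomics, where atomic_store_explicit(..., memory_order_release) and atomic_load_explicit(..., memory_order_acquire) play the roles of the kernel's atomic_set_release() and atomic_read_acquire(). The struct layout and the names publish_fault() and consume_fault() are hypothetical; this is an illustration of the pattern, not the driver code.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical payload guarded by the flag below. */
struct fault_info {
    unsigned long addr;
    unsigned int status;
};

static struct fault_info shared_info;
static atomic_int info_updated;   /* 0: slot free, 1: payload published */

/* Writer side (the interrupt handler in the description above). */
static void publish_fault(unsigned long addr, unsigned int status)
{
    /* Acquire: do not overwrite until the reader's copy is complete. */
    if (atomic_load_explicit(&info_updated, memory_order_acquire))
        return;  /* previous fault not consumed yet, drop this one */

    shared_info.addr = addr;
    shared_info.status = status;

    /* Release: the payload stores above become visible before the flag. */
    atomic_store_explicit(&info_updated, 1, memory_order_release);
}

/* Reader side. Returns true and fills *out if a fault was published. */
static bool consume_fault(struct fault_info *out)
{
    /* Acquire: the payload reads below cannot move before this load. */
    if (!atomic_load_explicit(&info_updated, memory_order_acquire))
        return false;

    *out = shared_info;

    /* Release: the copy above completes before the slot is handed back. */
    atomic_store_explicit(&info_updated, 0, memory_order_release);
    return true;
}
```

Each release store pairs with the acquire load on the other side, which is what rules out the reordering the message describes, where vm_fault_info is read before the flag.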

drm/amdkfd: Fix kfd process ref leaking when userptr unmapping

2025-10-07T18:09:06+00:00

kfd_lookup_process_by_pid holds the kfd process reference to ensure the process doesn't get destroyed while the segfault event is sent to user space. Calling kfd_lookup_process_by_pid as a function parameter leaks the kfd process refcount and misses the NULL pointer check if the app process is already destroyed. Fixes: 2d274bf7099b ("amd/amdkfd: Trigger segfault for early userptr unmmapping") Signed-off-by: Philip Yang Reviewed-by: Harish Kasiviswanathan Signed-off-by: Alex Deucher
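
The leak pattern can be sketched with a toy reference count. Everything here is hypothetical (struct proc, lookup_get(), proc_put(), send_event(), notify_segfault()) and stands in for the kernel helpers only in shape: a lookup that returns a referenced object or NULL, and a matching put.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical refcounted process object. */
struct proc {
    int refcount;
    int events;
};

/* Returns the process with an extra reference, or NULL if it is gone. */
static struct proc *lookup_get(struct proc *entry)
{
    if (!entry)
        return NULL;
    entry->refcount++;
    return entry;
}

static void proc_put(struct proc *p)
{
    p->refcount--;
}

static void send_event(struct proc *p)
{
    p->events++;
}

/*
 * Leaky form: send_event(lookup_get(entry));
 * The returned pointer is never stored, so it can be neither NULL-checked
 * nor released. Keeping it in a local variable fixes both problems.
 */
static bool notify_segfault(struct proc *entry)
{
    struct proc *p = lookup_get(entry);

    if (!p)
        return false;   /* process already destroyed */
    send_event(p);
    proc_put(p);        /* drop the reference taken by the lookup */
    return true;
}
```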

drm/amdgpu: use hmm_pfns instead of array of pages

2025-09-23T14:22:31+00:00

We don't need to allocate a local array of pages to hold the pages returned by hmm. Instead, we can use the hmm_range structure itself to get to hmm_pfns and obtain the required pages directly. This avoids a lot of alloc/free calls. Signed-off-by: Sunil Khatri Suggested-by: Christian König Reviewed-by: Christian König Acked-by: Felix Kuehling Signed-off-by: Alex Deucher