kernel/linux.git/drivers/gpu/drm/amd/amdkfd, branch v7.2-rc1

drm/amdkfd: Use exclusive bounds for SVM split alignment checks

2026-06-17T22:35:32+00:00

SVM ranges use inclusive page indices: prange->last is the last page in the range. The split-remap logic introduced by commit 448ee45353ef ("drm/amdkfd: Use huge page size to check split svm range alignment") uses ALIGN_DOWN(prange->last, 512) to determine whether the original range can contain a 2MB huge-page mapping. That aligns the last page itself down. Thus a range ending one page before the next 2MB boundary is classified as if the final 2MB block did not exist. When such a range is split inside that final block, the split head or tail can be left off the remap list even though it was derived from an original range that may have PMD mappings. Use prange->last + 1 as the exclusive upper bound when computing the original range's last 2MB-aligned boundary. Then use the actual split boundary for the head and tail alignment checks: tail->start for a tail split, and new_start for a head split. new_start is equivalent to head->last + 1 and directly names the exclusive end of the split head. Using head->last for the head-side check can both remap a head that ends exactly one page before a 2MB boundary and miss a head whose split boundary is one page after such a boundary. Philip Yang pointed out in the review of the original change that this condition should use head->last + 1 or new_start. Xiaogang Chen identified the inclusive-last cause and posted the candidate fix in the regression thread. With the culprit change active and the local revert not applied, the unchanged C/HSA reproducer completes 10/10 runs with this change on an RX 7600 XT. Fixes: 448ee45353ef ("drm/amdkfd: Use huge page size to check split svm range alignment") Closes: https://gitlab.freedesktop.org/drm/amd/-/work_items/4914 Link: https://lore.kernel.org/stable/IA1PR12MB85172F7FE9157C092EDA46A0E3112@IA1PR12MB8517.namprd12.prod.outlook.com/ Link: https://lore.kernel.org/all/32ce2b72-aa16-4202-9f99-92e3cd4408bc@amd.com/ Suggested-by: Xiaogang Chen Acked-by: Alex Deucher Signed-off-by: Gerhard Schwanzer Signed-off-by: Alex Deucher (cherry picked from commit a60ea15807126b148a328051636977a33ad0e9bb) Cc: stable@vger.kernel.org

drm/amdkfd: Use memdup_array_user to copy data from/to user space at kfd ioctls

2026-06-17T22:24:53+00:00

Several kfd ioctls need transfer array data from/to user space. Kfd driver uses kmalloc_array with user provided size. That can oversize alloc or 32-bit wrap with hostile value. Replace it by memdup_array_user that does overflow checking and allocates through dedicated slab caches, also physical continuous as kmalloc. Signed-off-by: Xiaogang Chen Reviewed-by: Alex Deucher Signed-off-by: Alex Deucher (cherry picked from commit 4eca4742eb215951f9739ffe0122d179d545a7a4)

drm/amdkfd: check find_first_zero_bit before __set_bit on kfd->doorbell_bitmap

2026-06-17T22:24:47+00:00

If inx from find_first_zero_bit is beyond range not need set doorbell_bitmap. Signed-off-by: Xiaogang Chen Reviewed-by: Alex Deucher Signed-off-by: Alex Deucher (cherry picked from commit 2664ce9143d174651a793d96a6a2326050c4f45a)

drm/amdkfd: Let driver decide buffer size at AMDKFD_IOC_GET_DMABUF_INFO ioctl

2026-06-17T22:24:38+00:00

amdkfd driver needs allocate buffer to return bo metadata to user space. The buffer size is controlled by user currently. It is a potential security issue that hostile value (e.g. 2 GiB) lets any render-group user trigger order-MAX allocation/OOM in kernel context. This patch first finds bo metadata size. If the size is smaller than user provided value drive can safely allocate buffer in kernel space and copy to user space buffer. If not, driver will let user know, not allocate and copy. User will redo with new buffer in user space. This patch lets driver decide buffer allocation size to avoid potential hostile size from user space. Signed-off-by: Xiaogang Chen Reviewed-by: Alex Deucher Signed-off-by: Alex Deucher (cherry picked from commit f54ce9e8cbd3abe0eda3a285f54dc4f572fe589a)

drm/amdkfd: Fix NULL deref during sysfs teardown

2026-06-17T22:21:27+00:00

Move kfd_process_remove_sysfs() earlier in kfd_process_wq_release() so that all sysfs/procfs entries are removed before tearing down PDDs and dropping lead_thread. The per-process sysfs attributes are backed by struct kfd_process_device, and their show/store callbacks dereference PDD fields. Since sysfs removal waits for active callbacks to complete, removing these entries first closes a race where userspace reads sdma_* and stats_* files after PDD teardown. Previously this cleanup ran after kfd_process_destroy_pdds(), which resets p->n_pdds to 0. This meant kfd_process_remove_sysfs() could no longer walk the PDD array, so the per-PDD sysfs cleanup did not run as intended. This race caused NULL pointer dereferences observed in kfd_sdma_activity_worker and kfd_procfs_stats_show. Also harden kfd_process_remove_sysfs() against partially initialized or already-freed objects: - Check kobj_queues before removing PASID and deleting it - Guard kobj_stats and kobj_counters before use These checks prevent invalid dereferences during cleanup. Cc: Felix Kuehling Cc: Alex Deucher Signed-off-by: Geoffrey McRae Reviewed-by: Felix Kuehling Signed-off-by: Alex Deucher (cherry picked from commit 674c692702341fed321720b4b92036c5934fb485)

drm/amdkfd: fix list_del corruption in kfd_criu_resume_svm

2026-06-17T22:19:37+00:00

The cleanup tail of kfd_criu_resume_svm() walks svms->criu_svm_metadata_list and kfree()s each struct criu_svm_metadata without removing it from the list. The list head is left pointing at freed kmalloc-96 objects. A second AMDKFD_IOC_CRIU_OP from the same process re-enters: list_empty() reads the dangling ->next (use-after-free), the loop walks freed entries, and each is kfree()'d again (double-free). This is reachable by an unprivileged render-group user via /dev/kfd with no capabilities required. Add list_del() before the kfree() so the list is properly emptied. The list_for_each_entry_safe() iterator already caches the next pointer, so unlinking during the walk is safe. Fixes: 2a909ae71871 ("drm/amdkfd: CRIU resume shared virtual memory ranges") Reviewed-by: Alex Deucher Signed-off-by: Mario Limonciello Signed-off-by: Alex Deucher (cherry picked from commit 6322d278a298e2c1430b9d2697743d3a04b788b1)

drm/amdkfd: Fix SMI event PID reporting for containers

2026-06-17T22:12:00+00:00

SMI events were reporting incorrect PIDs in containerized environments, causing test failures where container processes expected to see their namespace-local PIDs but instead received global host PIDs. The issue had two root causes: 1. Event functions were called from kernel context (page fault handlers, migration workers) where 'current' refers to the kernel worker thread, not the userspace GPU process that triggered the event. 2. PID conversion used task_tgid_vnr() which returns the PID in the caller's namespace (init namespace for kernel threads), not the task's own namespace. This patch updates the SMI event interface: - Change 8 event function signatures to accept task_struct pointer instead of pid_t, allowing proper namespace-aware PID conversion - Convert PIDs using task_tgid_nr_ns(task, task_active_pid_ns(task)) which returns the PID as the process sees it via getpid() - Update 10 call sites to pass p->lead_thread (the GPU process) instead of p->lead_thread->pid or current (kernel worker) This ensures SMI events report container-local PIDs, which is critical for containerized GPU workloads to correctly correlate events with their processes. Tested-by: Andrew Martin Assisted-by: Claude:Sonnet 4-5 Signed-off-by: Andrew Martin Reviewed-by: Harish Kasiviswanathan Signed-off-by: Alex Deucher (cherry picked from commit 60271ec06e04ba5d69d68714f3abdf637d86c257)

drm/amdkfd: Properly acquire queue buffers in CRIU restore

2026-06-17T22:08:14+00:00

When kfd_queue_acquire_buffers() was split off from set_queue_properties_from_user(), set_queue_properties_from_criu() was missed. Thus, set_queue_properties_from_criu() is not filling out the buffer fields of queue_properties, which can come up when subsequent code expects them to be non-null. Add the proper call to kfd_queue_acquire_buffers(), and also use the right cast types in set_queue_properties_from_criu() (which were missed at the same time) Signed-off-by: David Francis Reviewed-by: Kent Russell Signed-off-by: Alex Deucher (cherry picked from commit 88ed96abbbe27b70193544fbc1ee06448c274714)

drm/amdkfd: always resume_all after suspend_all

2026-06-04T19:38:08+00:00

Need to restore any good queues even if the suspend_all failed for some. Always run remove_queue as that will schedule a GPU reset is removing the queue fails. v2: move resume_all after remove Fixes: eb067d65c33e ("drm/amdkfd: Update BadOpcode Interrupt handling with MES") Reviewed-by: Amber Lin Signed-off-by: Alex Deucher

drm/amdkfd: Fix infinite loop parsing CRAT with zero subtype length

2026-06-04T19:27:41+00:00

Malformed ACPI CRAT tables can advertise a zero or undersized subtype length. The parser then fails to advance the cursor and loops forever while the remaining image still looks large enough for a generic header. Validate sub_type_hdr->length on each iteration before parsing or advancing. Return -EINVAL and warn when length is zero or smaller than the generic subtype header. Signed-off-by: Yongqiang Sun Acked-by: Alex Deucher Signed-off-by: Alex Deucher