kernel/linux.git/drivers/gpu/drm/amd/amdkfd, branch v6.6.12

drm/amdkfd: get doorbell's absolute offset based on the db_size

2023-12-13T17:45:11+00:00

[ Upstream commit 367a0af43373d4f791cc8b466a659ecf5aa52377 ]
Add db_size (in bytes) to the calculation of the doorbell's absolute offset, for both 32-bit and 64-bit doorbell sizes, so that the doorbell offset is aligned to the doorbell size.
v2: Addressed the review comment from Felix.
v3: Added doorbell_size as a parameter to get the doorbell's absolute offset.
v4: Squashed the two patches into one.
Cc: Christian Koenig
Cc: Alex Deucher
Reviewed-by: Felix Kuehling
Signed-off-by: Shashank Sharma
Signed-off-by: Arvind Yadav
Signed-off-by: Alex Deucher
Signed-off-by: Sasha Levin
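The arithmetic described in this commit message can be illustrated with a small user-space sketch. It assumes the absolute offset is counted in 32-bit (dword) units and that each doorbell occupies db_size bytes rounded up to whole dwords; the helper name and its parameters are hypothetical and are not taken from the kernel source.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper (not the kernel function): returns a doorbell's
 * absolute offset in 32-bit dword units.
 *
 *   bo_offset_bytes - assumed byte offset of the doorbell buffer on the BAR
 *   db_index        - index of the doorbell inside that buffer
 *   db_size_bytes   - size of one doorbell in bytes (4 or 8)
 */
static uint32_t db_abs_offset_dw(uint64_t bo_offset_bytes,
                                 uint32_t db_index,
                                 uint32_t db_size_bytes)
{
    /* Dwords occupied by one doorbell, rounded up. */
    uint32_t dw_per_db = (db_size_bytes + 3u) / 4u;

    return (uint32_t)(bo_offset_bytes / 4u) + db_index * dw_per_db;
}
```

With a buffer at byte offset 4096, index 3 lands on dword 1027 for 32-bit doorbells and on dword 1030 for 64-bit doorbells, so 64-bit doorbells stay aligned to their own size.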

drm/amdkfd: Fix shift out-of-bounds issue

2023-11-28T17:19:41+00:00

[ Upstream commit 282c1d793076c2edac6c3db51b7e8ed2b41d60a5 ]
[ 567.613292] shift exponent 255 is too large for 64-bit type 'long unsigned int'
[ 567.614498] CPU: 5 PID: 238 Comm: kworker/5:1 Tainted: G OE 6.2.0-34-generic #34~22.04.1-Ubuntu
[ 567.614502] Hardware name: AMD Splinter/Splinter-RPL, BIOS WS43927N_871 09/25/2023
[ 567.614504] Workqueue: events send_exception_work_handler [amdgpu]
[ 567.614748] Call Trace:
[ 567.614750]
[ 567.614753] dump_stack_lvl+0x48/0x70
[ 567.614761] dump_stack+0x10/0x20
[ 567.614763] __ubsan_handle_shift_out_of_bounds+0x156/0x310
[ 567.614769] ? srso_alias_return_thunk+0x5/0x7f
[ 567.614773] ? update_sd_lb_stats.constprop.0+0xf2/0x3c0
[ 567.614780] svm_range_split_by_granularity.cold+0x2b/0x34 [amdgpu]
[ 567.615047] ? srso_alias_return_thunk+0x5/0x7f
[ 567.615052] svm_migrate_to_ram+0x185/0x4d0 [amdgpu]
[ 567.615286] do_swap_page+0x7b6/0xa30
[ 567.615291] ? srso_alias_return_thunk+0x5/0x7f
[ 567.615294] ? __free_pages+0x119/0x130
[ 567.615299] handle_pte_fault+0x227/0x280
[ 567.615303] __handle_mm_fault+0x3c0/0x720
[ 567.615311] handle_mm_fault+0x119/0x330
[ 567.615314] ? lock_mm_and_find_vma+0x44/0x250
[ 567.615318] do_user_addr_fault+0x1a9/0x640
[ 567.615323] exc_page_fault+0x81/0x1b0
[ 567.615328] asm_exc_page_fault+0x27/0x30
[ 567.615332] RIP: 0010:__get_user_8+0x1c/0x30
Signed-off-by: Jesse Zhang
Suggested-by: Philip Yang
Reviewed-by: Yifan Zhang
Signed-off-by: Alex Deucher
Signed-off-by: Sasha Levin
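The UBSAN report above is triggered by a shift whose exponent (255) is not smaller than the width of the shifted type, which is undefined behaviour in C. The following user-space sketch shows one generic way to guard such a shift. It is an illustration only and is not claimed to be the change made by the upstream commit; the function name is hypothetical.

```c
#include <assert.h>
#include <limits.h>

/*
 * Hypothetical guard: returns 1UL << exponent, or 0 when the exponent is
 * not smaller than the bit width of unsigned long. A valid shift of 1UL
 * never yields 0, so 0 is an unambiguous "invalid exponent" marker.
 */
static unsigned long shl_or_zero(unsigned int exponent)
{
    const unsigned int width = sizeof(unsigned long) * CHAR_BIT;

    if (exponent >= width)
        return 0;   /* shifting here would be undefined behaviour */

    return 1UL << exponent;
}
```

A caller would treat a return value of 0 as an error and skip the computation that depends on the shifted size.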

drm/amdkfd: Fix a race condition of vram buffer unref in svm code

2023-11-28T17:19:40+00:00

[ Upstream commit 709c348261618da7ed89d6c303e2ceb9e453ba74 ]
The prange->svm_bo unref can happen both in the mmu callback and in a callback after migration to system RAM. Both are async calls running in different tasks. Synchronize the svm_bo unref operation to avoid random "use-after-free".
Signed-off-by: Xiaogang Chen
Reviewed-by: Philip Yang
Reviewed-by: Jesse Zhang
Tested-by: Jesse Zhang
Signed-off-by: Alex Deucher
Signed-off-by: Sasha Levin
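The hazard described here is that two independent paths may both try to drop the same reference. The sketch below shows one generic way to make such a drop happen at most once, using a C11 atomic exchange to claim the pointer. This is a swapped-in technique chosen for a self-contained user-space example; it is not claimed to be how the upstream commit synchronizes the unref, and all type and function names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-ins for a reference-counted buffer and its owner. */
struct buf   { atomic_int refs; };
struct range { _Atomic(struct buf *) bo; };

/*
 * Drops the range's reference at most once, even if two paths call this.
 * The exchange hands the pointer to exactly one caller; every other
 * caller sees NULL and does nothing. Returns 1 if a reference was dropped.
 */
static int range_unref_bo(struct range *r)
{
    struct buf *bo = atomic_exchange(&r->bo, (struct buf *)NULL);

    if (!bo)
        return 0;

    atomic_fetch_sub(&bo->refs, 1);
    return 1;
}

/* Calls the unref twice, as two racing callbacks would; returns final refs. */
static int demo_double_unref(void)
{
    struct buf b;
    struct range r;

    atomic_init(&b.refs, 1);
    atomic_init(&r.bo, &b);

    range_unref_bo(&r);
    range_unref_bo(&r);   /* second call finds NULL and is a no-op */

    return atomic_load(&b.refs);
}
```

Without the claim step, the second call would decrement the count a second time and could touch a buffer that has already been freed.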

drm/amdkfd: ratelimited SQ interrupt messages

2023-11-28T17:19:39+00:00

[ Upstream commit 37fb87910724f21a1f27a75743d4f9accdee77fb ]
No functional change. Use the ratelimited version of pr_ to avoid overflowing the dmesg buffer.
Signed-off-by: Harish Kasiviswanathan
Reviewed-by: Philip Yang
Signed-off-by: Alex Deucher
Signed-off-by: Sasha Levin
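To show why rate limiting keeps a log buffer from overflowing, here is a minimal user-space sketch of an interval/burst limiter. It is not the kernel's implementation; the struct, the function names, and the interval and burst values are all assumptions made for illustration, and time is passed in explicitly so the behaviour is deterministic.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical limiter: at most `burst` messages per `interval` time units. */
struct ratelimit {
    uint64_t interval;      /* window length, arbitrary time units */
    unsigned burst;         /* messages allowed per window */
    uint64_t window_start;  /* start time of the current window */
    unsigned printed;       /* messages let through in this window */
    unsigned suppressed;    /* messages dropped so far */
};

/* Returns true if a message arriving at time `now` may be printed. */
static bool ratelimit_allow(struct ratelimit *rl, uint64_t now)
{
    if (now - rl->window_start >= rl->interval) {
        rl->window_start = now;   /* open a new window */
        rl->printed = 0;
    }
    if (rl->printed < rl->burst) {
        rl->printed++;
        return true;
    }
    rl->suppressed++;
    return false;
}

/* Simulates 100 messages per tick for 10 ticks; returns how many got out. */
static unsigned demo_flood(void)
{
    struct ratelimit rl = { .interval = 5, .burst = 10 };
    unsigned emitted = 0;

    for (uint64_t t = 0; t < 10; t++)
        for (int i = 0; i < 100; i++)
            if (ratelimit_allow(&rl, t))
                emitted++;

    return emitted;
}
```

In this simulation 1000 messages are offered across two windows, and only 20 reach the log; the rest are counted as suppressed.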

drm/amdkfd: Handle errors from svm validate and map

2023-11-20T10:59:10+00:00

[ Upstream commit eb3c357bcb286e89386e89302061fe717fe4e562 ]
If a new range is split into multiple pranges with max_svm_range_pages alignment and added to update_list, svm validate and map should keep going after an error to make sure the prange->mapped_to_gpu flag is up to date for the whole range.
svm validate and map sets prange->mapped_to_gpu after mapping to GPUs successfully, and otherwise clears the prange->mapped_to_gpu flag (for the update mapping case) instead of setting an error flag, so the redundant error flag can be removed to simplify the code.
Refactor to remove the goto and update the prange->mapped_to_gpu flag inside svm_range_lock, to guarantee that we always evict queues or unmap from GPUs if there are invalid ranges.
After svm validate and map returns the error -EAGAIN, the caller's retry will update the mapping for the whole range again.
Fixes: c22b04407097 ("drm/amdkfd: flag added to handle errors from svm validate and map")
Signed-off-by: Philip Yang
Reviewed-by: Felix Kuehling
Tested-by: James Zhu
Signed-off-by: Alex Deucher
Signed-off-by: Sasha Levin

drm/amdkfd: Remove svm range validated_once flag

2023-11-20T10:59:10+00:00

[ Upstream commit c99b16128082de519975aa147d9da3e40380de67 ]
The validated_once flag is not used after the prefault was removed. The prefault was needed to ensure that all system memory pages were validated at least once before mapping or migrating the range to GPU.
Signed-off-by: Philip Yang
Reviewed-by: Felix Kuehling
Signed-off-by: Alex Deucher
Stable-dep-of: eb3c357bcb28 ("drm/amdkfd: Handle errors from svm validate and map")
Signed-off-by: Sasha Levin

drm/amdkfd: fix some race conditions in vram buffer alloc/free of svm code

2023-11-20T10:59:10+00:00

[ Upstream commit 7bfaa160caed8192f8262c4638f552cad94bcf5a ]
This patch fixes:
1: The ref count of prange's svm_bo is decreased by an async call from hmm. When waiting for prange's svm_bo to be released, we should also wait for prange->svm_bo to become NULL; otherwise prange->svm_bo may be set to NULL after a new vram buffer has been allocated.
2: While waiting in a while loop for prange's svm_bo to be released, the current task should be rescheduled to give other tasks an opportunity to run, especially the workqueue task that handles the svm_bo ref release; otherwise we may hit a soft lockup.
Signed-off-by: Xiaogang.Chen
Reviewed-by: Felix Kuehling
Signed-off-by: Alex Deucher
Signed-off-by: Sasha Levin

drm/amdkfd: Use gpu_offset for user queue's wptr

2023-09-20T21:30:42+00:00

Directly using the tbo's start address misses the domain start offset. Use gpu_offset instead.
Signed-off-by: YuBiao Wang
Reviewed-by: Christian König
Signed-off-by: Alex Deucher
Cc: stable@vger.kernel.org
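The difference between the two addresses can be shown with simple arithmetic. The sketch below assumes that a buffer's placement is stored as a page index inside its memory domain and that the GPU-visible address is the domain's start address plus that page offset; the helper names, the 4 KiB page size, and the example domain base are hypothetical and are not taken from the driver.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12   /* assumed 4 KiB pages */

/* Address computed from the placement alone: relative to the domain. */
static uint64_t sketch_raw_start(uint64_t start_page)
{
    return start_page << SKETCH_PAGE_SHIFT;
}

/* Address the GPU needs: domain base plus the domain-relative offset. */
static uint64_t sketch_gpu_offset(uint64_t domain_start, uint64_t start_page)
{
    return domain_start + sketch_raw_start(start_page);
}
```

For a hypothetical domain starting at 0x8000000000 and a buffer placed at page 2, the domain-relative address is 0x2000 while the GPU address is 0x8000002000. Using the former where the latter is expected points at the wrong location whenever the domain does not start at zero.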

drm/amdkfd: Insert missing TLB flush on GFX10 and later

2023-09-12T21:45:40+00:00

Heavy-weight TLB flush is required after unmap on all GPUs for correctness and security.
Signed-off-by: Harish Kasiviswanathan
Reviewed-by: Felix Kuehling
Signed-off-by: Alex Deucher
Cc: stable@vger.kernel.org

drm/amdkfd: Checkpoint and restore queues on GFX11

2023-09-11T22:22:38+00:00

The code in kfd_mqd_manager_v11.c to support CRIU dump and restore of queue state was missing. Added it; it should be equivalent to kfd_mqd_manager_v10.c.
CC: Felix Kuehling
Reviewed-by: Harish Kasiviswanathan
Acked-by: Alex Deucher
Signed-off-by: David Francis
Signed-off-by: Alex Deucher