kernel/linux.git/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c, branch v6.12.80

drm/amdgpu: avoid a warning in timedout job handler

2026-03-04T12:21:02+00:00

[ Upstream commit c8cf9ddc549fb93cb5a35f3fe23487b1e6707e74 ] Only set an error on the fence if the fence is not signalled. We can end up with a warning if the per queue reset path signals the fence and sets an error as part of the reset, but fails to recover. Reviewed-by: Timur Kristóf Signed-off-by: Alex Deucher Signed-off-by: Sasha Levin

drm/amdgpu: switch job hw_fence to amdgpu_fence

2025-07-06T09:01:47+00:00

commit ebe43542702c3d15d1a1d95e8e13b1b54076f05a upstream. Use the amdgpu fence container so we can store additional data in the fence. This also fixes the start_time handling for MCBP since we were casting the fence to an amdgpu_fence and it wasn't. Fixes: 3f4c175d62d8 ("drm/amdgpu: MCBP based on DRM scheduler (v9)") Reviewed-by: Christian König Signed-off-by: Alex Deucher (cherry picked from commit bf1cd14f9e2e1fdf981eed273ddd595863f5288c) Cc: stable@vger.kernel.org Signed-off-by: Greg Kroah-Hartman

drm/amdgpu: don't access invalid sched

2024-12-27T13:02:11+00:00

[ Upstream commit a93b1020eb9386d7da11608477121b10079c076a ] Since 2320c9e6a768 ("drm/sched: memset() 'job' in drm_sched_job_init()") accessing job->base.sched can produce unexpected results as the initialisation of (*job)->base.sched done in amdgpu_job_alloc is overwritten by the memset. This commit fixes an issue when a CS would fail validation and would be rejected after job->num_ibs is incremented. In this case, amdgpu_ib_free(ring->adev, ...) will be called, which would crash the machine because the ring value is bogus. To fix this, pass a NULL pointer to amdgpu_ib_free(): we can do this because the device is actually not used in this function. The next commit will remove the ring argument completely. Fixes: 2320c9e6a768 ("drm/sched: memset() 'job' in drm_sched_job_init()") Signed-off-by: Pierre-Eric Pelloux-Prayer Reviewed-by: Alex Deucher Reviewed-by: Christian König Signed-off-by: Alex Deucher (cherry picked from commit 2ae520cb12831d264ceb97c61f72c59d33c0dbd7) Signed-off-by: Sasha Levin

drm/amdgpu: skip coredump after job timeout in SRIOV

2024-09-25T16:55:52+00:00

VF FLR will be triggered by host driver before job timeout, hence the error status of GPU get cleared. Performing a coredump here is unnecessary. Signed-off-by: ZhenGuo Yin Acked-by: Alex Deucher Signed-off-by: Alex Deucher

drm/amdgpu: Move the dumping log out of for loop

2024-08-29T17:39:19+00:00

log message "Dumping IP State Completed" needs to be logged only once when state dumping is complete. Hence moving it out of the for loop. Signed-off-by: Sunil Khatri Acked-by: Trigger Huang Signed-off-by: Alex Deucher

drm/amdgpu: Do core dump immediately when job tmo

2024-08-29T17:39:00+00:00

Do the coredump immediately after a job timeout to get a closer representation of GPU's error status. V2: This will skip printing vram_lost as the GPU reset is not happened yet (Alex) V3: Unconditionally call the core dump as we care about all the reset functions(soft-recovery and queue reset and full adapter reset, Alex) V4: Do the dump after adev->job_hang = true (Sunil) Signed-off-by: Trigger Huang Acked-by: Sunil Khatri Reviewed-by: Alex Deucher Signed-off-by: Alex Deucher

Merge tag 'amd-drm-next-6.12-2024-08-26' of https://gitlab.freedesktop.org/agd5f/linux into drm-next

2024-08-27T12:33:12+00:00

amd-drm-next-6.12-2024-08-26: amdgpu: - SDMA devcoredump support - DCN 4.0.1 updates - DC SUBVP fixes - Refactor OPP in DC - Refactor MMHUBBUB in DC - DC DML 2.1 updates - DC FAMS2 updates - RAS updates - GFX12 updates - VCN 4.0.3 updates - JPEG 4.0.3 updates - Enable wave kill (soft recovery) for compute queues - Clean up CP error interrupt handling - Enable CP bad opcode interrupts - VCN 4.x fixes - VCN 5.x fixes - GPU reset fixes - Fix vbios embedded EDID size handling - SMU 14.x updates - Misc code cleanups and spelling fixes - VCN devcoredump support - ISP MFD i2c support - DC vblank fixes - GFX 12 fixes - PSR fixes - Convert vbios embedded EDID to drm_edid - DCN 3.5 updates - DMCUB updates - Cursor fixes - Overdrive support for SMU 14.x - GFX CP padding optimizations - DCC fixes - DSC fixes - Preliminary per queue reset infrastructure - Initial per queue reset support for GFX 9 - Initial per queue reset support for GFX 7, 8 - DCN 3.2 fixes - DP MST fixes - SR-IOV fixes - GFX 9.4.3/4 devcoredump support - Add process isolation framework - Enable process isolation support for GFX 9.4.3/4 - Take IOMMU remapping into account for P2P DMA checks amdkfd: - CRIU fixes - Improved input validation for user queues - HMM fix - Enable process isolation support for GFX 9.4.3/4 - Initial per queue reset support for GFX 9 - Allow users to target recommended SDMA engines radeon: - remove .load and drm_dev_alloc - Fix vbios embedded EDID size handling - Convert vbios embedded EDID to drm_edid - Use GEM references instead of TTM - r100 cp init cleanup - Fix potential overflows in evergreen CS offset tracking UAPI: - KFD support for targetting queues on recommended SDMA engines Proposed userspace: https://github.com/ROCm/ROCR-Runtime/commit/2f588a24065f41c208c3701945e20be746d8faf7 https://github.com/ROCm/ROCR-Runtime/commit/eb30a5bbc7719c6ffcf2d2dd2878bc53a47b3f30 drm/buddy: - Add start address support for trim function From: Alex Deucher Link: https://patchwork.freedesktop.org/patch/msgid/20240826201528.55307-1-alexander.deucher@amd.com

drm/amdgpu: increase the reset counter for the queue reset

2024-08-16T18:17:56+00:00

Update the reset counter for the amdgpu_cs_query_reset_state() Acked-by: Vitaly Prosyak Signed-off-by: Prike Liang Reviewed-by: Alex Deucher Signed-off-by: Alex Deucher

drm/amdgpu: add per ring reset support (v5)

2024-08-16T18:17:52+00:00

If a specific job is hung, try and reset just the ring associated with the job. v2: move to amdgpu_job.c v3: fix drm_sched_stop() handling when ring reset fails v4: drop unnecessary amdgpu_fence_driver_clear_job_fences() and drm_sched_increase_karma() v5: rework sched_stop handling Acked-by: Vitaly Prosyak Signed-off-by: Alex Deucher

drm/amdgpu: Forward soft recovery errors to userspace

2024-08-07T22:20:07+00:00

As we discussed before[1], soft recovery should be forwarded to userspace, or we can get into a really bad state where apps will keep submitting hanging command buffers cascading us to a hard reset. 1: https://lore.kernel.org/all/bf23d5ed-9a6b-43e7-84ee-8cbfd0d60f18@froggi.es/ Signed-off-by: Joshua Ashton Reviewed-by: Marek Olšák Signed-off-by: Christian König Signed-off-by: Alex Deucher (cherry picked from commit 434967aadbbbe3ad9103cc29e9a327de20fdba01) Cc: stable@vger.kernel.org