kernel/linux.git/drivers/gpu/drm/amd/amdgpu, branch v6.6.47

drm/amdgpu: Forward soft recovery errors to userspace

2024-08-14T11:58:53+00:00

commit 829798c789f567ef6ba4b084c15b7b5f3bd98d51 upstream. As we discussed before[1], soft recovery should be forwarded to userspace, or we can get into a really bad state where apps will keep submitting hanging command buffers cascading us to a hard reset. 1: https://lore.kernel.org/all/bf23d5ed-9a6b-43e7-84ee-8cbfd0d60f18@froggi.es/ Signed-off-by: Joshua Ashton Reviewed-by: Marek Olšák Signed-off-by: Christian König Signed-off-by: Alex Deucher (cherry picked from commit 434967aadbbbe3ad9103cc29e9a327de20fdba01) Cc: stable@vger.kernel.org Signed-off-by: Greg Kroah-Hartman

drm/amdgpu: Add lock around VF RLCG interface

2024-08-14T11:58:45+00:00

[ Upstream commit e864180ee49b4d30e640fd1e1d852b86411420c9 ] flush_gpu_tlb may be called from another thread while device_gpu_recover is running. Both of these threads access registers through the VF RLCG interface during VF Full Access. Add a lock around this interface to prevent race conditions between these threads. Signed-off-by: Victor Skvortsov Reviewed-by: Zhigang Luo Signed-off-by: Alex Deucher Signed-off-by: Sasha Levin

drm/admgpu: fix dereferencing null pointer context

2024-08-14T11:58:45+00:00

[ Upstream commit 030ffd4d43b433bc6671d9ec34fc12c59220b95d ] When user space sets an invalid ta type, the pointer context will be empty. So it need to check the pointer context before using it Signed-off-by: Jesse Zhang Suggested-by: Tim Huang Reviewed-by: Tim Huang Signed-off-by: Alex Deucher Signed-off-by: Sasha Levin

drm/amdgpu: Fix the null pointer dereference to ras_manager

2024-08-14T11:58:45+00:00

[ Upstream commit 4c11d30c95576937c6c35e6f29884761f2dddb43 ] Check ras_manager before using it Signed-off-by: Ma Jun Reviewed-by: Lijo Lazar Signed-off-by: Alex Deucher Signed-off-by: Sasha Levin

drm/amdgpu: fix potential resource leak warning

2024-08-14T11:58:44+00:00

[ Upstream commit 22a5daaec0660dd19740c4c6608b78f38760d1e6 ] Clear resource leak warning that when the prepare fails, the allocated amdgpu job object will never be released. Signed-off-by: Tim Huang Reviewed-by: Christian König Signed-off-by: Alex Deucher Signed-off-by: Sasha Levin

drm/amd/amdgpu: Fix uninitialized variable warnings

2024-08-03T06:54:29+00:00

commit df65aabef3c0327c23b840ab5520150df4db6b5f upstream. Return 0 to avoid returning an uninitialized variable r. Cc: stable@vger.kernel.org Fixes: 230dd6bb6117 ("drm/amd/amdgpu: implement mode2 reset on smu_v13_0_10") Signed-off-by: Ma Ke Signed-off-by: Alex Deucher (cherry picked from commit 6472de66c0aa18d50a4b5ca85f8272e88a737676) Signed-off-by: Greg Kroah-Hartman

drm/amdgpu: reset vm state machine after gpu reset(vram lost)

2024-08-03T06:54:29+00:00

commit 5659b0c93a1ea02c662a030b322093203f299185 upstream. [Why] Page table of compute VM in the VRAM will lost after gpu reset. VRAM won't be restored since compute VM has no shadows. [How] Use higher 32-bit of vm->generation to record a vram_lost_counter. Reset the VM state machine when vm->genertaion is not equal to the new generation token. v2: Check vm->generation instead of calling drm_sched_entity_error in amdgpu_vm_validate. v3: Use new generation token instead of vram_lost_counter for check. Signed-off-by: ZhenGuo Yin Reviewed-by: Christian König Signed-off-by: Alex Deucher Cc: stable@vger.kernel.org (cherry picked from commit 47c0388b0589cb481c294dcb857d25a214c46eb3) Signed-off-by: Greg Kroah-Hartman

drm/amdgpu/sdma5.2: Update wptr registers as well as doorbell

2024-08-03T06:54:28+00:00

commit a03ebf116303e5d13ba9a2b65726b106cb1e96f6 upstream. We seem to have a case where SDMA will sometimes miss a doorbell if GFX is entering the powergating state when the doorbell comes in. To workaround this, we can update the wptr via MMIO, however, this is only safe because we disallow gfxoff in begin_ring() for SDMA 5.2 and then allow it again in end_ring(). Enable this workaround while we are root causing the issue with the HW team. Bug: https://gitlab.freedesktop.org/drm/amd/-/issues/3440 Tested-by: Friedrich Vock Signed-off-by: Alex Deucher Cc: stable@vger.kernel.org (cherry picked from commit f2ac52634963fc38e4935e11077b6f7854e5d700) Signed-off-by: Greg Kroah-Hartman

drm/amdgpu: Remove GC HW IP 9.3.0 from noretry=1

2024-08-03T06:53:46+00:00

[ Upstream commit 1446226d32a45bb7c4f63195a59be8c08defe658 ] The following commit updated gmc->noretry from 0 to 1 for GC HW IP 9.3.0: commit 5f3854f1f4e2 ("drm/amdgpu: add more cases to noretry=1") This causes the device to hang when a page fault occurs, until the device is rebooted. Instead, revert back to gmc->noretry=0 so the device is still responsive. Fixes: 5f3854f1f4e2 ("drm/amdgpu: add more cases to noretry=1") Signed-off-by: Tim Van Patten Signed-off-by: Alex Deucher Signed-off-by: Sasha Levin

drm/amdgpu: Check if NBIO funcs are NULL in amdgpu_device_baco_exit

2024-08-03T06:53:46+00:00

[ Upstream commit 0cdb3f9740844b9d95ca413e3fcff11f81223ecf ] The special case for VM passthrough doesn't check adev->nbio.funcs before dereferencing it. If GPUs that don't have an NBIO block are passed through, this leads to a NULL pointer dereference on startup. Signed-off-by: Friedrich Vock Fixes: 1bece222eabe ("drm/amdgpu: Clear doorbell interrupt status for Sienna Cichlid") Cc: Alex Deucher Cc: Christian König Acked-by: Christian König Signed-off-by: Alex Deucher Signed-off-by: Sasha Levin