author    | Lijo Lazar <lijo.lazar@amd.com>          | 2024-10-24 08:31:57 +0300
committer | Alex Deucher <alexander.deucher@amd.com> | 2024-12-10 18:26:46 +0300
commit    | e1ee2111ca48169a9fdc5075f7863f5d4d591e2f (patch)
tree      | 487517237aa6a8c5587ef88e717accf3a72878b2 /drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
parent    | 0eecff79e49f8ce5475e1b4d968f26263587be66 (diff)
download  | linux-e1ee2111ca48169a9fdc5075f7863f5d4d591e2f.tar.xz
drm/amdgpu: Prefer RAS recovery for scheduler hang
Before scheduling a recovery due to a scheduler/job hang, check whether a RAS
error has been detected. If so, choose RAS recovery to handle the situation.
A scheduler/job hang could be a side effect of a RAS error. In such cases, it
is required to go through the RAS error recovery process. In certain cases, a
RAS error recovery process can also avoid a full device reset.
An error state is maintained in the RAS context to identify the affected
block. The fatal error state uses an otherwise unused block id. Set the block
id when an error is detected. If the interrupt handler detected a poison
error, it is not required to look for a fatal error, so skip the fatal error
check in such cases.
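The commit message describes the state tracking only at a high level. As a rough, simplified sketch of the idea (plain C rather than kernel code; the type names, the bitmask layout, and the reserved "fatal" slot one past the last block id are illustrative assumptions, not the actual amdgpu implementation), the per-device error state could be a bitmask indexed by RAS block id, with poison reports skipping the fatal-error probe:

#include <stdbool.h>
#include <stdint.h>

/* Hypothetical block ids standing in for the driver's RAS block enum. */
enum ras_block {
	RAS_BLOCK_UMC,
	RAS_BLOCK_SDMA,
	RAS_BLOCK_GFX,
	RAS_BLOCK_COUNT,
	/* Fatal error borrows an otherwise unused id past the last block. */
	RAS_BLOCK_FATAL = RAS_BLOCK_COUNT,
};

/* Hypothetical RAS context: one bit per block, plus the fatal bit. */
struct ras_context {
	uint64_t err_state;	/* bit N set => block N reported an error */
};

/* Placeholder for a real fatal-error probe. */
static bool probe_fatal_error(void)
{
	return false;
}

static void ras_set_err_state(struct ras_context *ras, enum ras_block blk)
{
	ras->err_state |= 1ull << blk;
}

/*
 * Interrupt path: record which block raised the error. If the handler
 * already identified a poison error, the separate fatal-error check is
 * skipped, as described in the commit message.
 */
static void ras_handle_interrupt(struct ras_context *ras, enum ras_block blk,
				 bool poison)
{
	ras_set_err_state(ras, blk);
	if (poison)
		return;
	if (probe_fatal_error())
		ras_set_err_state(ras, RAS_BLOCK_FATAL);
}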
Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Diffstat (limited to 'drivers/gpu/drm/amd/amdgpu/amdgpu_device.c')
-rw-r--r-- | drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 15
1 file changed, 13 insertions(+), 2 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index d272d95dd5b2..97d3e5f29638 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -5181,7 +5181,7 @@ static int amdgpu_device_reset_sriov(struct amdgpu_device *adev,
 	if (r)
 		return r;
 
-	amdgpu_ras_set_fed(adev, false);
+	amdgpu_ras_clear_err_state(adev);
 	amdgpu_irq_gpu_reset_resume_helper(adev);
 
 	/* some sw clean up VF needs to do before recover */
@@ -5484,7 +5484,7 @@ int amdgpu_device_reinit_after_reset(struct amdgpu_reset_context *reset_context)
 		amdgpu_set_init_level(tmp_adev, init_level);
 		if (full_reset) {
 			/* post card */
-			amdgpu_ras_set_fed(tmp_adev, false);
+			amdgpu_ras_clear_err_state(tmp_adev);
 			r = amdgpu_device_asic_init(tmp_adev);
 			if (r) {
 				dev_warn(tmp_adev->dev, "asic atom init failed!");
@@ -5818,6 +5818,17 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
 	int retry_limit = AMDGPU_MAX_RETRY_LIMIT;
 
 	/*
+	 * If it reaches here because of hang/timeout and a RAS error is
+	 * detected at the same time, let RAS recovery take care of it.
+	 */
+	if (amdgpu_ras_is_err_state(adev, AMDGPU_RAS_BLOCK__ANY) &&
+	    reset_context->src != AMDGPU_RESET_SRC_RAS) {
+		dev_dbg(adev->dev,
+			"Gpu recovery from source: %d yielding to RAS error recovery handling",
+			reset_context->src);
+		return 0;
+	}
+	/*
 	 * Special case: RAS triggered and full reset isn't supported
 	 */
 	need_emergency_restart = amdgpu_ras_need_emergency_restart(adev);
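The helpers called above (amdgpu_ras_clear_err_state(), amdgpu_ras_is_err_state(), AMDGPU_RAS_BLOCK__ANY) are defined outside amdgpu_device.c, so they do not appear in this hunk. Continuing the simplified sketch from the commit message section, and assuming the same bitmask representation (the names below are illustrative stand-ins, not the driver's actual definitions), the query and clear operations might look roughly like:

/* Hypothetical "any block" selector for the query below. */
#define RAS_BLOCK_ANY	(-1)

/* Query: with RAS_BLOCK_ANY, report whether any block recorded an error. */
static bool ras_is_err_state(const struct ras_context *ras, int blk)
{
	if (blk == RAS_BLOCK_ANY)
		return ras->err_state != 0;
	return (ras->err_state & (1ull << blk)) != 0;
}

/* Clear all recorded error state, e.g. once a reset has completed. */
static void ras_clear_err_state(struct ras_context *ras)
{
	ras->err_state = 0;
}

In the hunk above, the "an error is recorded and the reset request did not come from RAS itself" check is what lets amdgpu_device_gpu_recover() return early and leave recovery to the RAS error handling path, while the two amdgpu_ras_clear_err_state() calls reset that state on the SR-IOV and full-reset reinit paths.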