kernel/linux.git/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.h, branch v6.18.21

drm/amdgpu: Introduce VF critical region check for RAS poison injection

2025-09-18T13:43:02+00:00

The SRIOV guest send requet to host to check whether the poison injection address is in VF critical region or not via mabox. Signed-off-by: Xiang Liu Reviewed-by: Shravan Kumar Gande Signed-off-by: Alex Deucher

drm/amdgpu: Implement unrecoverable error message handling for VFs

2025-05-07T21:43:13+00:00

This notification may arrive in VF mailbox while polling for response from another event. This patches covers the following scenarios: - If VF is already in RMA state, then do not attempt to contact the host. Host will ignore the VF after sending the notification. - If the notification is detected during polling, then set the RMA status, and return error to caller. - If the notification arrives by interrupt, then set the RMA status and queue a reset. This reset will fail and VF will stop runtime services. Reviewed-by: Shravan Kumar Gande Signed-off-by: Victor Skvortsov Signed-off-by: Ellen Pan Signed-off-by: Alex Deucher

drm/amdgpu: Implement Runtime Bad Page query for VFs

2025-05-07T21:41:49+00:00

Host will send a notification when new bad pages are available. Uopn guest request, the first 256 bad page addresses will be placed into the PF2VF region. Guest should pause the PF2VF worker thread while the copy is in progress. Reviewed-by: Shravan Kumar Gande Signed-off-by: Victor Skvortsov Signed-off-by: Ellen Pan Signed-off-by: Alex Deucher

drm/amdgpu: Update headers for CPER support on SRIOV

2025-03-05T15:46:40+00:00

Update amdgv_sriovmsg.h and mxgpu_nv.h to add new definitions for CPER support on VFs. PMFW ACA messages are not available on VFs, and VFs must query CPERs from host. Signed-off-by: Tony Yi Reviewed-by: Tao Zhou Reviewed-by: Hawking Zhang Signed-off-by: Alex Deucher

drm/amdgpu: Update SRIOV Exchange Headers for RAS Telemetry Support

2024-11-11T16:55:01+00:00

The SRIOV PF/VF Data exchange is extended by 64KB for VF RAS Telemetry data. Add Host RAS Telemetry enable capabilities bitfields. Add a new VF msg REQ_RAS_ERROR_COUNT, the host response data will be populated in the RAS Telemetry region. Signed-off-by: Victor Skvortsov Reviewed-by: Zhigang Luo Signed-off-by: Alex Deucher

drm/amd/sriov: extend NV_MAILBOX_POLL_MSG_TIMEDOUT

2024-08-13T16:12:51+00:00

on MI300/MI308 UBB products, when doing mode1 reset, since 1 gpu need to wait all 8 gpus finish mode1 reset and then do re-init. As observed, sometimes the gpu which triggered the reset need to wait 15s for all gpus to finish. If poll msg timeout, guest driver will send the reset message again, and may mess up the following reinit sequence on other gpus. So extend the time to cover the maximum time needed to recover. Signed-off-by: Victor Zhao Acked-by: Alex Deucher Signed-off-by: Alex Deucher

drm/amdgpu: process RAS fatal error MB notification

2024-06-27T21:31:37+00:00

For RAS error scenario, VF guest driver will check mailbox and set fed flag to avoid unnecessary HW accesses. additionally, poll for reset completion message first to avoid accidentally spamming multiple reset requests to host. v2: add another mailbox check for handling case where kfd detects timeout first v3: set host_flr bit and use wait_for_reset Signed-off-by: Vignesh Chander Reviewed-by: Zhigang Luo Signed-off-by: Alex Deucher

drm/amdgpu: Add RAS_POISON_READY host response message

2024-01-25T19:58:03+00:00

In a non-FLR page avoidance scenario, the host driver will provide the bad pages in the pf2vf exchange region. Adding a new host response message to indicate when the pf2vf exchange region has been updated. Signed-off-by: Victor Skvortsov Reviewed-by: Hawking Zhang Signed-off-by: Alex Deucher

drm/amdgpu: add RAS poison consumption handler for NV SRIOV

2022-12-15T17:18:19+00:00

Send handling request to host. Signed-off-by: Tao Zhou Reviewed-by: Hawking Zhang Signed-off-by: Alex Deucher

drm/amdgpu: Add MB_REQ_MSG_READY_TO_RESET response when VF get FLR notification.

2021-08-16T19:17:57+00:00

When guest received FLR notification from host, it would lock adapter into reset state. There will be no more job submission and hardware access after that. Then it should send a response to host that it has prepared for host reset. Signed-off-by: Jiange Zhao Signed-off-by: Peng Ju Zhou Reviewed-by: Emily.Deng Signed-off-by: Alex Deucher