kernel/linux.git/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c, branch v6.19.11

drm/amdgpu: Add lock to serialize sriov command execution

2025-11-14T16:28:07+00:00

Add lock to serialize sriov command execution. Signed-off-by: YiPeng Chai Reviewed-by: Hawking Zhang Signed-off-by: Alex Deucher

drm/amdgpu: Add SRIOV crit_region_version support

2025-10-20T22:27:58+00:00

1. Added enum amd_sriov_crit_region_version to support multi versions 2. Added logic in SRIOV mailbox to regonize crit_region version during req_gpu_init_data Signed-off-by: Ellen Pan Reviewed-by: Alex Deucher Signed-off-by: Alex Deucher

drm/amdgpu: Introduce VF critical region check for RAS poison injection

2025-09-18T13:43:02+00:00

The SRIOV guest send requet to host to check whether the poison injection address is in VF critical region or not via mabox. Signed-off-by: Xiang Liu Reviewed-by: Shravan Kumar Gande Signed-off-by: Alex Deucher

drm/amdgpu: refactor bad_page_work for corner case handling

2025-08-15T17:07:30+00:00

When a poison is consumed on the guest before the guest receives the host's poison creation msg, a corner case may occur to have poison_handler complete processing earlier than it should to cause the guest to hang waiting for the req_bad_pages reply during a VF FLR, resulting in the VM becoming inaccessible in stress tests. To fix this issue, this patch refactored the mailbox sequence by seperating the bad_page_work into two parts req_bad_pages_work and handle_bad_pages_work. Old sequence: 1.Stop data exchange work 2.Guest sends MB_REQ_RAS_BAD_PAGES to host and keep polling for IDH_RAS_BAD_PAGES_READY 3.If the IDH_RAS_BAD_PAGES_READY arrives within timeout limit, re-init the data exchange region for updated bad page info else timeout with error message New sequence: req_bad_pages_work: 1.Stop data exhange work 2.Guest sends MB_REQ_RAS_BAD_PAGES to host Once Guest receives IDH_RAS_BAD_PAGES_READY event handle_bad_pages_work: 3.re-init the data exchange region for updated bad page info Signed-off-by: Chenglei Xie Reviewed-by: Shravan Kumar Gande Signed-off-by: Alex Deucher

drm/amdgpu: Implement unrecoverable error message handling for VFs

2025-05-07T21:43:13+00:00

This notification may arrive in VF mailbox while polling for response from another event. This patches covers the following scenarios: - If VF is already in RMA state, then do not attempt to contact the host. Host will ignore the VF after sending the notification. - If the notification is detected during polling, then set the RMA status, and return error to caller. - If the notification arrives by interrupt, then set the RMA status and queue a reset. This reset will fail and VF will stop runtime services. Reviewed-by: Shravan Kumar Gande Signed-off-by: Victor Skvortsov Signed-off-by: Ellen Pan Signed-off-by: Alex Deucher

drm/amdgpu: Implement Runtime Bad Page query for VFs

2025-05-07T21:41:49+00:00

Host will send a notification when new bad pages are available. Uopn guest request, the first 256 bad page addresses will be placed into the PF2VF region. Guest should pause the PF2VF worker thread while the copy is in progress. Reviewed-by: Shravan Kumar Gande Signed-off-by: Victor Skvortsov Signed-off-by: Ellen Pan Signed-off-by: Alex Deucher

drm/amdgpu: Add support for CPERs on virtualization

2025-03-05T15:47:03+00:00

Add support for CPERs on VFs. VFs do not receive PMFW messages directly; as such, they need to query them from the host. To avoid hitting host event guard, CPER queries need to be rate limited. CPER queries share the same RAS telemetry buffer as error count query, so a mutex protecting the shared buffer was added as well. For readability, the amdgpu_detect_virtualization was refactored into multiple individual functions. Signed-off-by: Tony Yi Reviewed-by: Tao Zhou Reviewed-by: Hawking Zhang Signed-off-by: Alex Deucher

drm/amdgpu: Add msg handlers for SRIOV RAS Telemetry

2024-11-11T16:55:08+00:00

Add message handlers for RAS telemetry. Signed-off-by: Victor Skvortsov Reviewed-by: Zhigang Luo Signed-off-by: Alex Deucher

drm/amdgpu: process RAS fatal error MB notification

2024-06-27T21:31:37+00:00

For RAS error scenario, VF guest driver will check mailbox and set fed flag to avoid unnecessary HW accesses. additionally, poll for reset completion message first to avoid accidentally spamming multiple reset requests to host. v2: add another mailbox check for handling case where kfd detects timeout first v3: set host_flr bit and use wait_for_reset Signed-off-by: Vignesh Chander Reviewed-by: Zhigang Luo Signed-off-by: Alex Deucher

drm/amdgpu: Use dev_ prints for virtualization as it supports multi adapter

2024-06-27T21:30:39+00:00

So we can get clearer per device logging. Signed-off-by: Vignesh Chander Reviewed-by: Zhigang Luo Signed-off-by: Alex Deucher