kernel/linux.git/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c, branch v6.12.80

drm/amdgpu: process RAS fatal error MB notification

2024-06-27T21:31:37+00:00

In the RAS error scenario, the VF guest driver checks the mailbox and sets the fed flag to avoid unnecessary HW accesses. Additionally, it polls for the reset completion message first to avoid accidentally spamming the host with multiple reset requests. v2: add another mailbox check to handle the case where kfd detects the timeout first. v3: set the host_flr bit and use wait_for_reset. Signed-off-by: Vignesh Chander Reviewed-by: Zhigang Luo Signed-off-by: Alex Deucher

drm/amdgpu: Use dev_ prints for virtualization as it supports multi adapter

2024-06-27T21:30:39+00:00

This gives clearer per-device logging. Signed-off-by: Vignesh Chander Reviewed-by: Zhigang Luo Signed-off-by: Alex Deucher

drm/amdgpu: fix sriov host flr handler

2024-06-14T20:15:58+00:00

We send back the ready-to-reset message before we stop anything. This is wrong. Move it to the point where we are actually ready for the FLR to happen. In the current state, since we take tens of seconds to stop everything, it is very likely that the host will give up waiting and reset the GPU before we send ready, so the behavior would be the same as before. But this gets rid of the hack with reset_domain locking and also lets the host tell how slow ready-to-reset actually is. The ready-to-reset speed can be improved later. Signed-off-by: Yunxiang Li Acked-by: Christian König Reviewed-by: Emily Deng Signed-off-by: Alex Deucher

drm/amdgpu: Add reset_context flag for host FLR

2024-05-02T19:40:50+00:00

There are other reset sources that pass NULL as the job pointer, such as amdgpu_amdkfd_reset_work. Therefore, using the job pointer to check whether the FLR comes from the host does not work. Add a flag in reset_context to explicitly mark a host-triggered reset, and set this flag when we receive the host reset notification. Signed-off-by: Yunxiang Li Reviewed-by: Emily Deng Reviewed-by: Zhigang Luo Signed-off-by: Alex Deucher
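
A minimal sketch of the idea, not the driver's code: the reset context carries an explicit flag bit, so a NULL job pointer no longer has to stand in for "the host asked for this". All `sketch_` names and the bit layout below are hypothetical and only illustrate the commit message.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag bits; the real driver defines its own. */
enum sketch_reset_flag {
	SKETCH_NEED_FULL_RESET = 0,
	SKETCH_HOST_FLR = 1,
};

/* Hypothetical, trimmed-down reset context. */
struct sketch_reset_context {
	unsigned long flags;
	const void *job;	/* NULL for several unrelated reset sources */
};

void sketch_set_flag(struct sketch_reset_context *ctx,
		     enum sketch_reset_flag bit)
{
	ctx->flags |= 1UL << bit;
}

/* Old heuristic: treats every job-less reset as host triggered. */
bool sketch_host_flr_by_job(const struct sketch_reset_context *ctx)
{
	return ctx->job == NULL;
}

/* New check: true only when the notification path set the flag. */
bool sketch_host_flr_by_flag(const struct sketch_reset_context *ctx)
{
	return (ctx->flags >> SKETCH_HOST_FLR) & 1UL;
}
```

When two contexts both hold a NULL job, the job-based check cannot tell them apart, while the flag-based check is true only for the one that was marked on the host notification path.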

drm/amdgpu: Fix two reset triggered in a row

2024-05-02T19:40:44+00:00

Sometimes a hung GPU causes multiple reset sources to schedule resets. The second source will be able to trigger an unnecessary reset if it schedules after we call amdgpu_device_stop_pending_resets. Move amdgpu_device_stop_pending_resets to after the reset is done. Since at this point the GPU is supposedly in a good state, any reset scheduled after this point would be a legitimate reset. Remove the unnecessary and incorrect checks for amdgpu_in_reset that were kind of serving this purpose. Signed-off-by: Yunxiang Li Reviewed-by: Lijo Lazar Signed-off-by: Alex Deucher
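
The ordering argument can be illustrated with a toy model. This is a sketch under simplifying assumptions: every name is hypothetical, and `late_requests` stands in for other sources that react to the same hang while the first reset is still running.

```c
/* Hypothetical device state: resets that ran and resets still queued. */
struct sketch_dev {
	int resets_run;
	int resets_queued;
};

/* One reset pass; late_requests arrive while it is in progress. */
void sketch_run_reset(struct sketch_dev *dev, int late_requests)
{
	dev->resets_run++;
	dev->resets_queued += late_requests;
}

/* Cancel everything that is queued right now. */
void sketch_stop_pending(struct sketch_dev *dev)
{
	dev->resets_queued = 0;
}

/* Old order: cancel first, so late requests survive the reset. */
void sketch_recover_old(struct sketch_dev *dev, int late_requests)
{
	sketch_stop_pending(dev);
	sketch_run_reset(dev, late_requests);
}

/* New order: cancel after the reset, dropping the stale requests. */
void sketch_recover_new(struct sketch_dev *dev, int late_requests)
{
	sketch_run_reset(dev, late_requests);
	sketch_stop_pending(dev);
}

/* Run whatever is still queued, as a work queue eventually would. */
void sketch_drain(struct sketch_dev *dev)
{
	while (dev->resets_queued > 0) {
		dev->resets_queued--;
		dev->resets_run++;
	}
}
```

In this model, one late request leads to a second, redundant reset under the old order and to none under the new order.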

drm/amdgpu: update vf to pf message retry from 2 to 5

2024-05-02T19:40:38+00:00

Increase the retry count so that the host has enough time to complete the reset. Signed-off-by: Zhigang Luo Reviewed-by: Hawking Zhang Signed-off-by: Alex Deucher
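
A rough sketch of why the limit matters, assuming a bounded "send, then poll" loop. The fake host and all `sketch_` names are hypothetical, and treating the limit as a total attempt count is an assumption of this example rather than something the commit message states.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the host: it answers successfully only
 * once it has been asked ready_after times. */
struct sketch_host {
	int ready_after;
	int requests_seen;
};

/* One "send request, poll for reply" round against the fake host. */
bool sketch_try_once(struct sketch_host *host)
{
	host->requests_seen++;
	return host->requests_seen >= host->ready_after;
}

/* Repeat the round until it succeeds or the limit is reached. */
bool sketch_request_with_retry(struct sketch_host *host, int max_attempts)
{
	for (int attempt = 0; attempt < max_attempts; attempt++) {
		if (sketch_try_once(host))
			return true;
	}
	return false;
}
```

With a host that needs four rounds before it can answer, a limit of 2 gives up early, while a limit of 5 succeeds on the fourth round.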

drm/amdgpu: remove virt_init_data_exchange from poison consumption handler

2024-04-19T03:47:26+00:00

The host will initiate an FLR for all poison consumption. The guest should wait for the FLR message and only then re-init data exchange. Signed-off-by: Zhigang Luo Reviewed-by: Hawking Zhang Signed-off-by: Alex Deucher

drm/amdgpu: trigger flr_work if reading pf2vf data failed

2024-03-20T17:38:13+00:00

If reading pf2vf data fails 30 times in a row, something is wrong, and flr_work needs to be triggered to recover. Also use dev_err to print the error message so that it shows which device has the issue, and add a warning message if waiting for IDH_FLR_NOTIFICATION_CMPL times out. Signed-off-by: Zhigang Luo Acked-by: Hawking Zhang Signed-off-by: Alex Deucher
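
The "30 consecutive failures" rule can be sketched as a small counter. This is an illustration only: the struct, the function, and the choice to clear the counter after requesting recovery are assumptions of the example, and `flr_requests` merely counts how often the real driver would schedule flr_work.

```c
#include <stdbool.h>

/* Threshold taken from the commit message. */
#define SKETCH_MAX_READ_FAILURES 30

/* Hypothetical per-VF bookkeeping. */
struct sketch_vf {
	unsigned int consecutive_failures;
	unsigned int flr_requests;	/* times recovery was requested */
};

/* Call once per periodic pf2vf read with its outcome. */
void sketch_on_pf2vf_read(struct sketch_vf *vf, bool read_ok)
{
	if (read_ok) {
		/* Any success breaks the streak. */
		vf->consecutive_failures = 0;
		return;
	}

	vf->consecutive_failures++;
	if (vf->consecutive_failures >= SKETCH_MAX_READ_FAILURES) {
		/* Stand-in for scheduling flr_work. */
		vf->flr_requests++;
		/* Assumption: start a fresh streak after requesting recovery. */
		vf->consecutive_failures = 0;
	}
}
```

Because a successful read clears the counter, scattered failures never add up to a recovery request; only an unbroken run of 30 does.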

drm/amdgpu: Skip virt_exchange_init on SDMA poison consumption

2024-03-20T17:37:38+00:00

The host will initiate an FLR in the SDMA poison consumption scenario. The guest should wait for the FLR message and only then re-init data exchange. Signed-off-by: Victor Skvortsov Reviewed-by: Zhigang Luo Signed-off-by: Alex Deucher

drm/amdgpu: Add RAS_POISON_READY host response message

2024-01-25T19:58:03+00:00

In a non-FLR page avoidance scenario, the host driver will provide the bad pages in the pf2vf exchange region. Add a new host response message to indicate when the pf2vf exchange region has been updated. Signed-off-by: Victor Skvortsov Reviewed-by: Hawking Zhang Signed-off-by: Alex Deucher