From fbb9933666e31f84c62e9620e9ec4d220ee31ab4 Mon Sep 17 00:00:00 2001
From: Saeed Mahameed
Date: Mon, 17 Nov 2025 23:42:08 +0200
Subject: net/mlx5: Abort new commands if all command slots are stalled

In case of a FW issue, FW might be not responding to FW commands,
causing kernel lockout for a long period of time, e.g. rtnl_lock
held while ethtool is trying to collect stats waiting for FW to
respond to multiple commands, when all of them will timeout.

While there's no immediate indication of the FW lockout, we can safely
assume that something is wrong when all command slots are busy and in a
timeout state and no FW completion was received on any of them.

In such case, start immediately failing new commands.

Signed-off-by: Saeed Mahameed
Reviewed-by: Moshe Shemesh
Signed-off-by: Tariq Toukan
Link: https://patch.msgid.link/1763415729-1238421-5-git-send-email-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski
---
 include/linux/mlx5/driver.h | 1 +
 1 file changed, 1 insertion(+)

(limited to 'include/linux')

diff --git a/include/linux/mlx5/driver.h b/include/linux/mlx5/driver.h
index 046396269ccf..7aec53371cf0 100644
--- a/include/linux/mlx5/driver.h
+++ b/include/linux/mlx5/driver.h
@@ -819,6 +819,7 @@ typedef void (*mlx5_cmd_cbk_t)(int status, void *context);
 
 enum {
 	MLX5_CMD_ENT_STATE_PENDING_COMP,
+	MLX5_CMD_ENT_STATE_TIMEDOUT,
 };
 
 struct mlx5_cmd_work_ent {
-- 
cgit v1.2.3
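The diff above only shows the new state flag, not the code that consumes it. As a reading aid, here is a minimal user-space sketch of the check the commit message describes: refuse new commands once every slot is busy and marked timed out. It is not the driver's implementation. Every `sketch_` name, the slot count, the choice of `-EIO` and `-EBUSY`, and the treatment of the enum values as bit numbers in a per-entry state word are assumptions made for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_MAX_SLOTS 4 /* hypothetical; the real count comes from the device */

/* Same shape as the enum in the patch; assumed to be bit numbers. */
enum {
	SKETCH_ENT_STATE_PENDING_COMP,
	SKETCH_ENT_STATE_TIMEDOUT,
};

struct sketch_slot {
	bool busy;           /* slot currently holds an in-flight command */
	unsigned long state; /* bitmask of SKETCH_ENT_STATE_* bits */
};

struct sketch_cmd {
	struct sketch_slot slots[SKETCH_MAX_SLOTS];
};

/* True only if every slot is busy and has already timed out. */
static bool sketch_all_slots_stalled(const struct sketch_cmd *cmd)
{
	for (size_t i = 0; i < SKETCH_MAX_SLOTS; i++) {
		const struct sketch_slot *s = &cmd->slots[i];

		if (!s->busy || !(s->state & (1UL << SKETCH_ENT_STATE_TIMEDOUT)))
			return false;
	}
	return true;
}

/* Mark a slot as timed out; it stays busy until a completion arrives. */
static void sketch_timeout(struct sketch_cmd *cmd, size_t slot)
{
	assert(slot < SKETCH_MAX_SLOTS);
	cmd->slots[slot].state |= 1UL << SKETCH_ENT_STATE_TIMEDOUT;
}

/* A completion frees the slot and clears its state bits. */
static void sketch_complete(struct sketch_cmd *cmd, size_t slot)
{
	assert(slot < SKETCH_MAX_SLOTS);
	cmd->slots[slot].busy = false;
	cmd->slots[slot].state = 0;
}

/*
 * Returns the slot index on success. Fails fast with -EIO (an assumed
 * error code) when all slots are stalled, and with -EBUSY when slots are
 * merely full; a real driver would more likely block in the latter case.
 */
static int sketch_submit(struct sketch_cmd *cmd)
{
	if (sketch_all_slots_stalled(cmd))
		return -EIO;

	for (size_t i = 0; i < SKETCH_MAX_SLOTS; i++) {
		if (!cmd->slots[i].busy) {
			cmd->slots[i].busy = true;
			cmd->slots[i].state = 1UL << SKETCH_ENT_STATE_PENDING_COMP;
			return (int)i;
		}
	}
	return -EBUSY;
}
```

In this sketch a single completion on any slot is enough to leave the stalled state, which matches the commit message's condition that no FW completion was received on any of the slots.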