summaryrefslogtreecommitdiff
path: root/drivers/net/ipvlan/ipvlan_core.c
diff options
context:
space:
mode:
authorJakub Kicinski <kuba@kernel.org>2025-08-27 03:24:18 +0300
committerJakub Kicinski <kuba@kernel.org>2025-08-27 03:24:18 +0300
commitdded99427d1a1e753e9554ebc9f27604154277ca (patch)
treed4f5f425e3f1ef240c3b0d846f7eaca8aca7a157 /drivers/net/ipvlan/ipvlan_core.c
parenta0f849c1cc6df0db9083b4c81c05a5456b1ed0fb (diff)
parent2d5ccb93bbb4f161ccb677ce0b2ebcfe4a089d62 (diff)
downloadlinux-dded99427d1a1e753e9554ebc9f27604154277ca.tar.xz
Merge branch 'expose-burst-period-for-devlink-health-reporter'
Mark Bloch says: ==================== Expose burst period for devlink health reporter Shahar writes: -------------------------------------------------------------------------- Currently, the devlink health reporter initiates the grace period immediately after recovering an error, which blocks further recovery attempts until the grace period concludes. Since additional errors are not generally expected during this short interval, any new error reported during the grace period is not only rejected but also causes the reporter to enter an error state that requires manual intervention. This approach poses a problem in scenarios where a single root cause triggers multiple related errors in quick succession - for example, a PCI issue affecting multiple hardware queues. Because these errors are closely related and occur rapidly, it is more effective to handle them together rather than handling only the first one reported and blocking any subsequent recovery attempts. Furthermore, setting the reporter to an error state in this context can be misleading, as these multiple errors are manifestations of a single underlying issue, making it unlike the general case where additional errors are not expected during the grace period. To resolve this, introduce a configurable burst period attribute to the devlink health reporter. This period starts when the first error is recovered and lasts for a user-defined duration. Once this error burst period expires, the grace period begins. After the grace period ends, a new reported error will start the same flow again. Timeline summary: ----|--------|------------------------------/----------------------/-- error is error is burst period grace period reported recovered (recoveries allowed) (recoveries blocked) With burst period, create a time window during which recovery attempts are permitted, allowing all reported errors to be handled sequentially before the grace period starts. Once the grace period begins, it prevents any further error recoveries until it ends. When burst period is set to 0, current behavior is preserved. Design alternatives considered: 1. Recover all queues upon any error: A brute-force approach that recovers all queues on any error. While simple, it is overly aggressive and disrupts unaffected queues unnecessarily. Also, because this is handled entirely within the driver, it leads to a driver-specific implementation rather than a generic one. 2. Per-queue reporter: This design would isolate recovery handling per SQ or RQ, effectively removing interdependencies between queues. While conceptually clean, it introduces significant scalability challenges as the number of queues grows, as well as synchronization challenges across multiple reporters. 3. Error aggregation with delayed handling: Errors arriving during the grace period are saved and processed after it ends. While addressing the issue of related errors whose recovery is aborted as grace period started, this adds complexity due to synchronization needs and contradicts the assumption that no errors should occur during a healthy system’s grace period. Also, this breaks the important role of grace period in preventing an infinite loop of immediate error detection following recovery. In such cases we want to stop. 4. Allowing a fixed burst of errors before starting grace period: Allows a set number of recoveries before the grace period begins. However, it also requires limiting the error reporting window. To keep the design simple, the burst threshold becomes redundant. The burst period design was chosen for its simplicity and precision in addressing the problem at hand. It effectively captures the temporal correlation of related errors and aligns with the original intent of the grace period as a stabilization window where further errors are unexpected, and if they do occur, they indicate an abnormal system state. v3: https://lore.kernel.org/1755111349-416632-1-git-send-email-tariqt@nvidia.com v2: https://lore.kernel.org/1753390134-345154-1-git-send-email-tariqt@nvidia.com v1: https://lore.kernel.org/1752768442-264413-1-git-send-email-tariqt@nvidia.com ==================== Link: https://patch.msgid.link/20250824084354.533182-1-mbloch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Diffstat (limited to 'drivers/net/ipvlan/ipvlan_core.c')
0 files changed, 0 insertions, 0 deletions