diff options
author | Petr Machata <petrm@nvidia.com> | 2021-03-29 18:57:31 +0300 |
---|---|---|
committer | David S. Miller <davem@davemloft.net> | 2021-03-29 23:51:38 +0300 |
commit | 87f2c6716f6408b9992d7f2247d1fcc190de2c92 (patch) | |
tree | bb25414f52eef78488fcaeec5602fa7e11264772 /Documentation | |
parent | 177cb7876dced42e7ab8736e108afd1fe8dc03ea (diff) | |
download | linux-87f2c6716f6408b9992d7f2247d1fcc190de2c92.tar.xz |
Documentation: net: Document resilient next-hop groups
Add a document describing the principles behind resilient next-hop groups,
and some notes about how to configure and offload them.
Suggested-by: David Ahern <dsahern@gmail.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Diffstat (limited to 'Documentation')
-rw-r--r-- | Documentation/networking/index.rst | 1 | ||||
-rw-r--r-- | Documentation/networking/nexthop-group-resilient.rst | 293 |
2 files changed, 294 insertions, 0 deletions
diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst index b8a29997d433..e9ce55992aa9 100644 --- a/Documentation/networking/index.rst +++ b/Documentation/networking/index.rst @@ -76,6 +76,7 @@ Contents: netdevices netfilter-sysctl netif-msg + nexthop-group-resilient nf_conntrack-sysctl nf_flowtable openvswitch diff --git a/Documentation/networking/nexthop-group-resilient.rst b/Documentation/networking/nexthop-group-resilient.rst new file mode 100644 index 000000000000..fabecee24d85 --- /dev/null +++ b/Documentation/networking/nexthop-group-resilient.rst @@ -0,0 +1,293 @@ +.. SPDX-License-Identifier: GPL-2.0 + +========================= +Resilient Next-hop Groups +========================= + +Resilient groups are a type of next-hop group that is aimed at minimizing +disruption in flow routing across changes to the group composition and +weights of constituent next hops. + +The idea behind resilient hashing groups is best explained in contrast to +the legacy multipath next-hop group, which uses the hash-threshold +algorithm, described in RFC 2992. + +To select a next hop, hash-threshold algorithm first assigns a range of +hashes to each next hop in the group, and then selects the next hop by +comparing the SKB hash with the individual ranges. When a next hop is +removed from the group, the ranges are recomputed, which leads to +reassignment of parts of hash space from one next hop to another. RFC 2992 +illustrates it thus:: + + +-------+-------+-------+-------+-------+ + | 1 | 2 | 3 | 4 | 5 | + +-------+-+-----+---+---+-----+-+-------+ + | 1 | 2 | 4 | 5 | + +---------+---------+---------+---------+ + + Before and after deletion of next hop 3 + under the hash-threshold algorithm. + +Note how next hop 2 gave up part of the hash space in favor of next hop 1, +and 4 in favor of 5. While there will usually be some overlap between the +previous and the new distribution, some traffic flows change the next hop +that they resolve to. + +If a multipath group is used for load-balancing between multiple servers, +this hash space reassignment causes an issue that packets from a single +flow suddenly end up arriving at a server that does not expect them. This +can result in TCP connections being reset. + +If a multipath group is used for load-balancing among available paths to +the same server, the issue is that different latencies and reordering along +the way causes the packets to arrive in the wrong order, resulting in +degraded application performance. + +To mitigate the above-mentioned flow redirection, resilient next-hop groups +insert another layer of indirection between the hash space and its +constituent next hops: a hash table. The selection algorithm uses SKB hash +to choose a hash table bucket, then reads the next hop that this bucket +contains, and forwards traffic there. + +This indirection brings an important feature. In the hash-threshold +algorithm, the range of hashes associated with a next hop must be +continuous. With a hash table, mapping between the hash table buckets and +the individual next hops is arbitrary. Therefore when a next hop is deleted +the buckets that held it are simply reassigned to other next hops:: + + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ + |1|1|1|1|2|2|2|2|3|3|3|3|4|4|4|4|5|5|5|5| + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ + v v v v + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ + |1|1|1|1|2|2|2|2|1|2|4|5|4|4|4|4|5|5|5|5| + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ + + Before and after deletion of next hop 3 + under the resilient hashing algorithm. + +When weights of next hops in a group are altered, it may be possible to +choose a subset of buckets that are currently not used for forwarding +traffic, and use those to satisfy the new next-hop distribution demands, +keeping the "busy" buckets intact. This way, established flows are ideally +kept being forwarded to the same endpoints through the same paths as before +the next-hop group change. + +Algorithm +--------- + +In a nutshell, the algorithm works as follows. Each next hop deserves a +certain number of buckets, according to its weight and the number of +buckets in the hash table. In accordance with the source code, we will call +this number a "wants count" of a next hop. In case of an event that might +cause bucket allocation change, the wants counts for individual next hops +are updated. + +Next hops that have fewer buckets than their wants count, are called +"underweight". Those that have more are "overweight". If there are no +overweight (and therefore no underweight) next hops in the group, it is +said to be "balanced". + +Each bucket maintains a last-used timer. Every time a packet is forwarded +through a bucket, this timer is updated to current jiffies value. One +attribute of a resilient group is then the "idle timer", which is the +amount of time that a bucket must not be hit by traffic in order for it to +be considered "idle". Buckets that are not idle are busy. + +After assigning wants counts to next hops, an "upkeep" algorithm runs. For +buckets: + +1) that have no assigned next hop, or +2) whose next hop has been removed, or +3) that are idle and their next hop is overweight, + +upkeep changes the next hop that the bucket references to one of the +underweight next hops. If, after considering all buckets in this manner, +there are still underweight next hops, another upkeep run is scheduled to a +future time. + +There may not be enough "idle" buckets to satisfy the updated wants counts +of all next hops. Another attribute of a resilient group is the "unbalanced +timer". This timer can be set to 0, in which case the table will stay out +of balance until idle buckets do appear, possibly never. If set to a +non-zero value, the value represents the period of time that the table is +permitted to stay out of balance. + +With this in mind, we update the above list of conditions with one more +item. Thus buckets: + +4) whose next hop is overweight, and the amount of time that the table has + been out of balance exceeds the unbalanced timer, if that is non-zero, + +\... are migrated as well. + +Offloading & Driver Feedback +---------------------------- + +When offloading resilient groups, the algorithm that distributes buckets +among next hops is still the one in SW. Drivers are notified of updates to +next hop groups in the following three ways: + +- Full group notification with the type + ``NH_NOTIFIER_INFO_TYPE_RES_TABLE``. This is used just after the group is + created and buckets populated for the first time. + +- Single-bucket notifications of the type + ``NH_NOTIFIER_INFO_TYPE_RES_BUCKET``, which is used for notifications of + individual migrations within an already-established group. + +- Pre-replace notification, ``NEXTHOP_EVENT_RES_TABLE_PRE_REPLACE``. This + is sent before the group is replaced, and is a way for the driver to veto + the group before committing anything to the HW. + +Some single-bucket notifications are forced, as indicated by the "force" +flag in the notification. Those are used for the cases where e.g. the next +hop associated with the bucket was removed, and the bucket really must be +migrated. + +Non-forced notifications can be overridden by the driver by returning an +error code. The use case for this is that the driver notifies the HW that a +bucket should be migrated, but the HW discovers that the bucket has in fact +been hit by traffic. + +A second way for the HW to report that a bucket is busy is through the +``nexthop_res_grp_activity_update()`` API. The buckets identified this way +as busy are treated as if traffic hit them. + +Offloaded buckets should be flagged as either "offload" or "trap". This is +done through the ``nexthop_bucket_set_hw_flags()`` API. + +Netlink UAPI +------------ + +Resilient Group Replacement +^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Resilient groups are configured using the ``RTM_NEWNEXTHOP`` message in the +same manner as other multipath groups. The following changes apply to the +attributes passed in the netlink message: + + =================== ========================================================= + ``NHA_GROUP_TYPE`` Should be ``NEXTHOP_GRP_TYPE_RES`` for resilient group. + ``NHA_RES_GROUP`` A nest that contains attributes specific to resilient + groups. + =================== ========================================================= + +``NHA_RES_GROUP`` payload: + + =================================== ========================================= + ``NHA_RES_GROUP_BUCKETS`` Number of buckets in the hash table. + ``NHA_RES_GROUP_IDLE_TIMER`` Idle timer in units of clock_t. + ``NHA_RES_GROUP_UNBALANCED_TIMER`` Unbalanced timer in units of clock_t. + =================================== ========================================= + +Next Hop Get +^^^^^^^^^^^^ + +Requests to get resilient next-hop groups use the ``RTM_GETNEXTHOP`` +message in exactly the same way as other next hop get requests. The +response attributes match the replacement attributes cited above, except +``NHA_RES_GROUP`` payload will include the following attribute: + + =================================== ========================================= + ``NHA_RES_GROUP_UNBALANCED_TIME`` How long has the resilient group been out + of balance, in units of clock_t. + =================================== ========================================= + +Bucket Get +^^^^^^^^^^ + +The message ``RTM_GETNEXTHOPBUCKET`` without the ``NLM_F_DUMP`` flag is +used to request a single bucket. The attributes recognized at get requests +are: + + =================== ========================================================= + ``NHA_ID`` ID of the next-hop group that the bucket belongs to. + ``NHA_RES_BUCKET`` A nest that contains attributes specific to bucket. + =================== ========================================================= + +``NHA_RES_BUCKET`` payload: + + ======================== ==================================================== + ``NHA_RES_BUCKET_INDEX`` Index of bucket in the resilient table. + ======================== ==================================================== + +Bucket Dumps +^^^^^^^^^^^^ + +The message ``RTM_GETNEXTHOPBUCKET`` with the ``NLM_F_DUMP`` flag is used +to request a dump of matching buckets. The attributes recognized at dump +requests are: + + =================== ========================================================= + ``NHA_ID`` If specified, limits the dump to just the next-hop group + with this ID. + ``NHA_OIF`` If specified, limits the dump to buckets that contain + next hops that use the device with this ifindex. + ``NHA_MASTER`` If specified, limits the dump to buckets that contain + next hops that use a device in the VRF with this ifindex. + ``NHA_RES_BUCKET`` A nest that contains attributes specific to bucket. + =================== ========================================================= + +``NHA_RES_BUCKET`` payload: + + ======================== ==================================================== + ``NHA_RES_BUCKET_NH_ID`` If specified, limits the dump to just the buckets + that contain the next hop with this ID. + ======================== ==================================================== + +Usage +----- + +To illustrate the usage, consider the following commands:: + + # ip nexthop add id 1 via 192.0.2.2 dev eth0 + # ip nexthop add id 2 via 192.0.2.3 dev eth0 + # ip nexthop add id 10 group 1/2 type resilient \ + buckets 8 idle_timer 60 unbalanced_timer 300 + +The last command creates a resilient next-hop group. It will have 8 buckets +(which is unusually low number, and used here for demonstration purposes +only), each bucket will be considered idle when no traffic hits it for at +least 60 seconds, and if the table remains out of balance for 300 seconds, +it will be forcefully brought into balance. + +Changing next-hop weights leads to change in bucket allocation:: + + # ip nexthop replace id 10 group 1,3/2 type resilient + +This can be confirmed by looking at individual buckets:: + + # ip nexthop bucket show id 10 + id 10 index 0 idle_time 5.59 nhid 1 + id 10 index 1 idle_time 5.59 nhid 1 + id 10 index 2 idle_time 8.74 nhid 2 + id 10 index 3 idle_time 8.74 nhid 2 + id 10 index 4 idle_time 8.74 nhid 1 + id 10 index 5 idle_time 8.74 nhid 1 + id 10 index 6 idle_time 8.74 nhid 1 + id 10 index 7 idle_time 8.74 nhid 1 + +Note the two buckets that have a shorter idle time. Those are the ones that +were migrated after the next-hop replace command to satisfy the new demand +that next hop 1 be given 6 buckets instead of 4. + +Netdevsim +--------- + +The netdevsim driver implements a mock offload of resilient groups, and +exposes debugfs interface that allows marking individual buckets as busy. +For example, the following will mark bucket 23 in next-hop group 10 as +active:: + + # echo 10 23 > /sys/kernel/debug/netdevsim/netdevsim10/fib/nexthop_bucket_activity + +In addition, another debugfs interface can be used to configure that the +next attempt to migrate a bucket should fail:: + + # echo 1 > /sys/kernel/debug/netdevsim/netdevsim10/fib/fail_nexthop_bucket_replace + +Besides serving as an example, the interfaces that netdevsim exposes are +useful in automated testing, and +``tools/testing/selftests/drivers/net/netdevsim/nexthop.sh`` makes use of +them to test the algorithm. |