Age | Commit message | Author | Files | Lines
2026-01-26 | ice: pass pointer to ice_fetch_u64_stats_per_ring | Jacob Keller | 2 | -18/+9
The ice_fetch_u64_stats_per_ring function takes a pointer to the syncp from the ring stats to synchronize reading of the packet stats. It also takes a *copy* of the ice_q_stats fields instead of a pointer to the stats. This completely defeats the point of using the u64_stats API. We pass the stats by value, so they are static at the point of reading within the u64_stats_fetch_retry loop. Simplify the function to take a pointer to the ice_ring_stats instead of two separate parameters. Additionally, since we never call this outside of ice_main.c, make it a static function. Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Tested-by: Rinitha S <sx.rinitha@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
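The by-pointer reader described above can be modelled in userspace roughly as follows. This is only a sketch: the `sketch_*` names are hypothetical, a C11 atomic sequence counter stands in for the kernel's `u64_stats_sync`, and a single writer is assumed. The point it illustrates is that the retry loop re-reads the live counters through the pointer on every iteration, which a by-value copy of the stats cannot do.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical userspace model of per-ring stats; not the ice driver's types. */
struct sketch_ring_stats {
	atomic_uint seq;          /* stand-in for struct u64_stats_sync */
	_Atomic uint64_t pkts;
	_Atomic uint64_t bytes;
};

/* Writer side (single writer assumed): an odd sequence marks an update in progress. */
static void sketch_stats_add(struct sketch_ring_stats *rs,
			     uint64_t pkts, uint64_t bytes)
{
	atomic_fetch_add(&rs->seq, 1);
	atomic_fetch_add(&rs->pkts, pkts);
	atomic_fetch_add(&rs->bytes, bytes);
	atomic_fetch_add(&rs->seq, 1);
}

/*
 * Reader side: takes a pointer to the whole stats object, so each pass of
 * the retry loop observes the current counters rather than a stale copy.
 */
static void sketch_fetch_stats_per_ring(struct sketch_ring_stats *rs,
					uint64_t *pkts, uint64_t *bytes)
{
	unsigned int start;

	assert(rs != NULL && pkts != NULL && bytes != NULL);
	do {
		start = atomic_load(&rs->seq);
		*pkts = atomic_load(&rs->pkts);
		*bytes = atomic_load(&rs->bytes);
	} while ((start & 1) || atomic_load(&rs->seq) != start);
}
```

Had the function received the counters by value, the loop body would read the same snapshot on every retry, so retrying could never repair a torn read.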
2026-01-26 | dt-bindings: net: dsa: fix typos in bindings docs | Akiyoshi Kurita | 1 | -1/+1
Fix "alway" -> "always" in lan9303.txt and marvell,mv88e6xxx.yaml. Signed-off-by: Akiyoshi Kurita <weibu@redadmin.org> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Acked-by: Rob Herring (Arm) <robh@kernel.org> Link: https://patch.msgid.link/20260123150211.2646235-1-weibu@redadmin.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-26 | Merge branch '200GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue | Jakub Kicinski | 13 | -1146/+1417
Tony Nguyen says:

====================
refactor IDPF resource access

Pavan Kumar Linga says:

Queue and vector resources for a given vport are stored in the idpf_vport structure. At the time of configuration, these resources are accessed using the vport pointer, meaning that all the config path functions are tied to the default queue and vector resources of the vport. There are use cases which can make use of config path functions to configure queue and vector resources that are not tied to any vport. One such use case is PTP secondary mailbox creation (it would be in a followup series). To configure queue and interrupt resources for such cases, we can make use of the existing config infrastructure by passing the necessary queue and vector resources info.

To achieve this, group the existing queue and vector resources into a default resource group and refactor the code to pass the resource pointer to the config path functions.

This series also includes patches which generalize the send virtchnl message APIs and mailbox API that are necessary for the implementation of PTP secondary mailbox.

* '200GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue:
  idpf: generalize mailbox API
  idpf: avoid calling get_rx_ptypes for each vport
  idpf: generalize send virtchnl message API
  idpf: remove vport pointer from queue sets
  idpf: add rss_data field to RSS function parameters
  idpf: reshuffle idpf_vport struct members to avoid holes
  idpf: move some iterator declarations inside for loops
  idpf: move queue resources to idpf_q_vec_rsrc structure
  idpf: introduce idpf_q_vec_rsrc struct and move vector resources to it
  idpf: introduce local idpf structure to store virtchnl queue chunks
====================

Link: https://patch.msgid.link/20260122223601.2208759-1-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-26 | net: usb: sr9700: rename register write commands for clarity | Ethan Nelson-Moore | 2 | -6/+6
SR_WR_REG and SR_WR_REGS may be confused at a cursory glance. Rename them to be more easily differentiated to prevent this. Suggested-by: Andrew Lunn <andrew+netdev@lunn.ch> Signed-off-by: Ethan Nelson-Moore <enelsonmoore@gmail.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Peter Korsgaard <peter@korsgaard.com> Link: https://patch.msgid.link/20260123080409.64165-1-enelsonmoore@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-26 | net: usb: sr9700: use ETH_ALEN instead of magic number | Ethan Nelson-Moore | 1 | -1/+1
The driver hardcodes the number 6 as the number of bytes to write to the SR_PAR register, which stores the MAC address. Use ETH_ALEN instead to make the code clearer. Signed-off-by: Ethan Nelson-Moore <enelsonmoore@gmail.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Peter Korsgaard <peter@korsgaard.com> Link: https://patch.msgid.link/20260123070645.56434-1-enelsonmoore@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
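As an illustration of the kind of change described, here is a small userspace sketch. The register file, the `sketch_*` helpers and the `SKETCH_PAR` offset are all hypothetical and are not taken from the sr9700 driver; only `ETH_ALEN` (6, the length of an Ethernet MAC address) is the real constant.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#ifdef __linux__
#include <linux/if_ether.h>	/* provides ETH_ALEN on Linux */
#endif
#ifndef ETH_ALEN
#define ETH_ALEN 6		/* fallback: bytes in an Ethernet MAC address */
#endif

#define SKETCH_REG_SPACE 256	/* hypothetical size of the register file */
#define SKETCH_PAR 0x10		/* hypothetical offset of the MAC address register */

static unsigned char sketch_regs[SKETCH_REG_SPACE];

/* Hypothetical multi-byte register write with a bounds check. */
static int sketch_write_regs(unsigned int reg, const void *data, size_t len)
{
	if (reg >= SKETCH_REG_SPACE || len > SKETCH_REG_SPACE - reg)
		return -1;
	memcpy(&sketch_regs[reg], data, len);
	return 0;
}

/* The length is spelled ETH_ALEN, so the reader sees what is being written. */
static int sketch_set_mac(const unsigned char mac[ETH_ALEN])
{
	assert(mac != NULL);
	return sketch_write_regs(SKETCH_PAR, mac, ETH_ALEN);
}
```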
2026-01-26 | Merge branch 'net-neighbour-notify-changes-atomically' | Jakub Kicinski | 1 | -57/+93
Petr Machata says:

====================
net: neighbour: Notify changes atomically

Andy Roulin and Francesco Ruggeri have apparently independently both hit an issue with the current neighbor notification scheme. Francesco reported the issue in [1]. In a response[2] to that report, Andy said:

    neigh_update sends a rtnl notification if an update, e.g., nud_state change, was done but there is no guarantee of ordering of the rtnl notifications. Consider the following scenario:

    userspace thread                   kernel thread
    ================                   =============
    neigh_update
       write_lock_bh(n->lock)
       n->nud_state = STALE
       write_unlock_bh(n->lock)
       neigh_notify
          neigh_fill_info
             read_lock_bh(n->lock)
             ndm->nud_state = STALE
             read_unlock_bh(n->lock)
                                -------------------------->
                                       neigh_update
                                          write_lock_bh(n->lock)
                                          n->nud_state = REACHABLE
                                          write_unlock_bh(n->lock)
                                          neigh_notify
                                             neigh_fill_info
                                                read_lock_bh(n->lock)
                                                ndm->nud_state = REACHABLE
                                                read_unlock_bh(n->lock)
                                             rtnl_notify
                                          RTNL REACHABLE sent
                             <--------
          rtnl_notify
          RTNL STALE sent

    In this scenario, the kernel neigh is updated first to STALE and then REACHABLE but the netlink notifications are sent out of order, first REACHABLE and then STALE.

The solution presented in [2] was to extend the critical region to include both the call to neigh_fill_info(), as well as rtnl_notify(). Then we have a guarantee that whatever state was captured by neigh_fill_info(), will be sent right away. The above scenario can thus not happen.

This is how this patchset begins: patches #1 and #2 add helper duals to neigh_fill_info() and __neigh_notify() such that the __-prefixed function assumes the neighbor lock is held, and the unprefixed one is a thin wrapper that manages locking. This extends locking further than Andy's patch, but makes for clearer code and supports the following part. At that point, the original race is gone.
But what can happen is the following race, where the notification does not reflect the change that was made:

    userspace thread                   kernel thread
    ================                   =============
    neigh_update
       write_lock_bh(n->lock)
       n->nud_state = STALE
       write_unlock_bh(n->lock)
                                -------------------------->
                                       neigh_update
                                          write_lock_bh(n->lock)
                                          n->nud_state = REACHABLE
                                          write_unlock_bh(n->lock)
                                          neigh_notify
                                             read_lock_bh(n->lock)
                                             __neigh_fill_info
                                                ndm->nud_state = REACHABLE
                                             rtnl_notify
                                             read_unlock_bh(n->lock)
                                          RTNL REACHABLE sent
                             <--------
       neigh_notify
          read_lock_bh(n->lock)
          __neigh_fill_info
             ndm->nud_state = REACHABLE
          rtnl_notify
          read_unlock_bh(n->lock)
       RTNL REACHABLE sent again

Here, even though neigh_update() made a change to STALE, it later sends a notification with a NUD of REACHABLE.

The obvious solution to fix this race is to move the notifier to the same critical section that actually makes the change.

Sending a notification in fact involves two things: invoking the internal notifier chain, and sending the netlink notification. The overall approach in this patchset is to move the netlink notification to the critical section of the change, while keeping the internal notifier intact. Since the motion is not obviously correct, the patchset presents the change in a series of incremental steps with discussion in commit messages. Please see details in the patches themselves.

Reproducer
==========

To consistently reproduce, I injected an mdelay before the rtnl_notify() call. Since only one thread should delay, a bit of instrumentation was needed to see where the call originates. The mdelay was then only issued on the call stack rooted in the RTNL request.

Then the general idea is to issue an "ip neigh replace" to mark a neighbor entry as failed. In parallel to that, inject an ARP burst that validates the entry.
This is all observed with an "ip monitor neigh", where one can see either a REACHABLE->FAILED transition, or FAILED->REACHABLE, while the actual state at the end of the sequence is always REACHABLE. With the patchset, only FAILED->REACHABLE is ever observed in the monitor.

Alternatives
============

Another approach to solving the issue would be to have a per-neighbor queue of notification digests, each with a set of fields necessary for formatting a notification. In pseudocode, a neighbor update would look something like this:

    neighbor_update:
      - lock
      - do update
      - allocate notification digest, fill partially, mark not-committed
      - unlock
      - critical-section-breaking stuff (probes, ARP Q, etc.)
      - lock
      - fill in missing details to the digest (notably neigh->probes)
      - mark the digest as committed
      - while (front of the digest queue is committed)
        - pop it, convert to notifier, send the notification
      - unlock

This adds more complexity and would imply more changes to the code, which is why I think the approach presented in this patchset is better. But it would allow us to retain the overall structure of the code while giving us accurate notifications.

A third approach would be to consider the second race not very serious and be OK with seeing a notification that does not reflect the change that prompted it. Then a two-patch prefix of this patchset would be all that is needed.

[1]: https://lore.kernel.org/20220606230107.D70B55EC0B30@us226.sjc.aristanetworks.com
[2]: https://lore.kernel.org/ed6768c1-80b8-aee2-e545-b51661d49336@nvidia.com
====================

Link: https://patch.msgid.link/cover.1769012464.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-26 | net: core: neighbour: Make another netlink notification atomically | Petr Machata | 1 | -4/+9
Similarly to the issue from the previous patch, neigh_timer_handler() also updates the neighbor separately from formatting and sending the netlink notification message. We have not seen reports to the effect of this causing trouble, but in theory, the same sort of issues could have come up: neigh_timer_handler() would make changes as necessary, but before formatting and sending a notification, it is interrupted by another thread, which makes a parallel change and sends its own message. The message that was prompted by the earlier change is then sent later and contains information that does not reflect that change.

To solve this, the netlink notification needs to be in the same critical section that updates the neighbor. The critical section is ended by the neigh_probe() call which drops the lock before calling solicit. Stretching the critical section over the solicit call is problematic, because that can then involve all sorts of forwarding callbacks. Therefore, like in the previous patch, split the netlink notification away from the internal one and move it ahead of the probe call.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/e440118511cbdbe1d88eb0d71c9047116feb96e0.1769012464.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-26 | net: core: neighbour: Make one netlink notification atomically | Petr Machata | 1 | -3/+5
As noted in a previous patch, one race remains in the current code. A kernel thread might interrupt a userspace thread after the change is done, but before formatting and sending the message. Then what we would see is two messages with the same contents:

    userspace thread                   kernel thread
    ================                   =============
    neigh_update
       write_lock_bh(n->lock)
       n->nud_state = STALE
       write_unlock_bh(n->lock)
                                -------------------------->
                                       neigh_update
                                          write_lock_bh(n->lock)
                                          n->nud_state = REACHABLE
                                          write_unlock_bh(n->lock)
                                          neigh_notify
                                             read_lock_bh(n->lock)
                                             __neigh_fill_info
                                                ndm->nud_state = REACHABLE
                                             rtnl_notify
                                             read_unlock_bh(n->lock)
                                          RTNL REACHABLE sent
                             <--------
       neigh_notify
          read_lock_bh(n->lock)
          __neigh_fill_info
             ndm->nud_state = REACHABLE
          rtnl_notify
          read_unlock_bh(n->lock)
       RTNL REACHABLE sent again

The solution is to send the netlink message inside the critical section where the neighbor is changed, so that it reflects the notified-upon neighbor state.

To that end, in __neigh_update(), move the current neigh_notify() call up to said critical section, and convert it to __neigh_notify(), because the lock is held. This motion crosses calls to neigh_update_managed_list(), neigh_update_gc_list() and neigh_update_process_arp_queue(), all of which potentially unlock and give an opportunity for the above race.

This also crosses a call to neigh_update_process_arp_queue(), which calls neigh->output(), which might be neigh_resolve_output(), which calls neigh_event_send(), which calls neigh_event_send_probe(), which calls __neigh_event_send(), which calls neigh_probe(), which touches neigh->probes, an update which will now not be visible in the notification.

However, there are indications that these changes were never promised to be accurately reflected in notifications: fib6_table_lookup() indirectly calls route.c's find_match(), which calls rt6_probe(), which looks up a neighbor and calls __neigh_set_probe_once(), which sets neigh->probes to 0, but neither this nor the caller seems to send a notification.
Additionally, the neighbor object that the neigh_probe() mentioned above is called on, might be the alternative neighbor looked up for the ARP queue packet destination. If that is the case, the changed value of n1->probes is not notified anywhere. So at least in some circumstances, the reported number of probes needs to be assumed to change without notification. Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/ceb44995498eb52375cb2d46c3245bdb9e74b355.1769012464.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-26 | net: core: neighbour: Reorder netlink & internal notification | Petr Machata | 1 | -2/+2
The netlink message needs to be sent inside the critical section where the neighbor is changed, so that it reflects the notified-upon neighbor state. On the other hand, there is no such need in the case of the notifier chain: the listeners do not assume that the lock is held, and often in fact just schedule a delayed work to act on the neighbor later. At least one in fact also takes the neighbor lock.

This requires that the netlink notification be done before the internal notifier chain message is sent. That is safe to do, because the current listeners, as well as __neigh_notify(), only read the updated neighbor fields, and never modify them. (Apart from locking.)

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/f3ef74d5460f14c4d102b8a5857d4a6624da9a5a.1769012464.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-26 | net: core: neighbour: Inline neigh_update_notify() calls | Petr Machata | 1 | -11/+10
The obvious idea behind the helper is to keep together the two bits that should be done either both or neither: the internal notifier chain message, and the netlink notification.

To make sure that the notification sent reflects the change being made, the netlink message needs to be sent inside the critical section where the neighbor is changed. But for the notifier chain, there is no such need: the listeners do not assume that the lock is held, and often in fact just schedule a delayed work to act on the neighbor later. At least one in fact also takes the neighbor lock. Therefore these two items each have different locking needs.

Now we could unlock inside the helper, but I find that error prone, and the fact that the notification is conditional in the first place does not help to make the call site obvious. So in this patch, the helper is instead removed and the body, which is just these two calls, inlined. That way we can use each notifier independently.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/e65dce5882bc6f4aa2530b8a4877d0e003071a1a.1769012464.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-26 | net: core: neighbour: Process ARP queue later | Petr Machata | 1 | -1/+7
ARP queue processing unlocks the neighbor lock, which can allow another thread to asynchronously perform a neighbor update and send an out of order notification. Therefore this needs to be done after the notification is sent. Move it just before the end of the critical section. Since neigh_update_process_arp_queue() unlocks, it does not form a part of the critical section anymore but it can benefit from the lock being taken. The intention is to eventually do the RTNL notification before this call. This motion crosses a call to neigh_update_is_router(), which should not influence processing of the ARP queue. Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/9ea7159e71430ebdc837ebcc880a76b7e82e52a4.1769012464.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-26 | net: core: neighbour: Extract ARP queue processing to a helper function | Petr Machata | 1 | -35/+43
In order to make manipulation of this bit of code clearer, extract it to a helper function, neigh_update_process_arp_queue().

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/8b0fa0abe2cf0e24484903f5436fe0ac64163057.1769012464.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-26 | net: core: neighbour: Call __neigh_notify() under a lock | Petr Machata | 1 | -6/+12
Andy Roulin has described an issue with the current neighbor notification scheme as follows. This was also presented publicly at the link below.

    neigh_update sends a rtnl notification if an update, e.g., nud_state change, was done but there is no guarantee of ordering of the rtnl notifications. Consider the following scenario:

    userspace thread                   kernel thread
    ================                   =============
    neigh_update
       write_lock_bh(n->lock)
       n->nud_state = STALE
       write_unlock_bh(n->lock)
       neigh_notify
          neigh_fill_info
             read_lock_bh(n->lock)
             ndm->nud_state = STALE
             read_unlock_bh(n->lock)
                                -------------------------->
                                       neigh_update
                                          write_lock_bh(n->lock)
                                          n->nud_state = REACHABLE
                                          write_unlock_bh(n->lock)
                                          neigh_notify
                                             neigh_fill_info
                                                read_lock_bh(n->lock)
                                                ndm->nud_state = REACHABLE
                                                read_unlock_bh(n->lock)
                                             rtnl_notify
                                          RTNL REACHABLE sent
                             <--------
          rtnl_notify
          RTNL STALE sent

    In this scenario, the kernel neigh is updated first to STALE and then REACHABLE but the netlink notifications are sent out of order, first REACHABLE and then STALE.

The solution is to send the netlink message inside the same critical section that formats the message. That way both the contents and ordering of the message reflect the same state, and we cannot see the abovementioned out-of-order delivery.

Even with this patch, a remaining issue is that the contents of the message may not reflect the changes made to the neighbor. A kernel thread might still interrupt a userspace thread after the change is done, but before formatting and sending the message. Then what we would see is two messages with the same contents. The following patches will attempt to address that issue.

To support those future patches, convert __neigh_notify() to a helper that assumes that the neighbor lock is already taken by having it call __neigh_fill_info() instead of neigh_fill_info(). Add a new helper, neigh_notify(), which takes the lock before calling __neigh_notify(). Migrate all callers to use the latter.
Link: https://lore.kernel.org/netdev/ed6768c1-80b8-aee2-e545-b51661d49336@nvidia.com/ Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/4b4368dcc5f5a7e407009cb6c36b69cfb5282864.1769012464.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-26 | net: core: neighbour: Add a neigh_fill_info() helper for when lock not held | Petr Machata | 1 | -7/+17
The netlink message needs to be formatted and sent inside the critical section where the neighbor is changed, so that it reflects the notified-upon neighbor state. Because it will happen inside an already existing critical section, it has to assume that the neighbor lock is held. Add a helper __neigh_fill_info(), which is like neigh_fill_info(), but makes this assumption. Convert neigh_fill_info() to a wrapper around this new helper. Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/7ec20113d5d809200e3534d3ed8f0004514914b8.1769012464.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
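The locked-helper-plus-wrapper split described above can be sketched in a single-threaded userspace model. Everything here is hypothetical (`sketch_*` names, a flag instead of a real rwlock, a `_locked` suffix where the kernel uses a `__` prefix); the aim is only to show a helper that asserts the lock is held, a thin wrapper that takes it, and an updater that formats the message in the same critical section that makes the change.

```c
#include <assert.h>

enum { SKETCH_NUD_STALE = 1, SKETCH_NUD_REACHABLE = 2 };

struct sketch_neigh {
	int locked;		/* models the neighbor lock as a simple flag */
	int nud_state;
};

struct sketch_msg {
	int nud_state;		/* state captured for the notification */
};

static void sketch_lock(struct sketch_neigh *n)
{
	assert(!n->locked);
	n->locked = 1;
}

static void sketch_unlock(struct sketch_neigh *n)
{
	assert(n->locked);
	n->locked = 0;
}

/* Counterpart of the __-prefixed helper: the caller must hold the lock. */
static void sketch_fill_info_locked(const struct sketch_neigh *n,
				    struct sketch_msg *msg)
{
	assert(n->locked);
	msg->nud_state = n->nud_state;
}

/* Counterpart of the unprefixed helper: a thin wrapper that manages locking. */
static void sketch_fill_info(struct sketch_neigh *n, struct sketch_msg *msg)
{
	sketch_lock(n);
	sketch_fill_info_locked(n, msg);
	sketch_unlock(n);
}

/* The change and the message formatting share one critical section. */
static void sketch_update(struct sketch_neigh *n, int new_state,
			  struct sketch_msg *msg)
{
	sketch_lock(n);
	n->nud_state = new_state;
	sketch_fill_info_locked(n, msg);
	sketch_unlock(n);
}
```

Because `sketch_update()` never drops the lock between the assignment and the fill, the message it produces always carries the state it just set.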
2026-01-26 | ipv4: igmp: annotate data-races around idev->mr_maxdelay | Eric Dumazet | 2 | -3/+3
idev->mr_maxdelay is read and written locklessly, add READ_ONCE()/WRITE_ONCE() annotations. While we are at it, make this field a u32.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20260122172247.2429403-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
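For readers unfamiliar with the annotations, a rough userspace approximation follows. The `SKETCH_*` macros are simplified stand-ins built on a volatile cast and the GNU `__typeof__` extension; they are not the kernel's definitions, and the struct is a hypothetical one-field model rather than the real `in_device`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-ins: force exactly one load or store of the variable. */
#define SKETCH_READ_ONCE(x)       (*(const volatile __typeof__(x) *)&(x))
#define SKETCH_WRITE_ONCE(x, val) (*(volatile __typeof__(x) *)&(x) = (val))

/* Hypothetical model holding only the field of interest, as a 32-bit value. */
struct sketch_in_device {
	uint32_t mr_maxdelay;
};

/* Writer: may run concurrently with readers, with no lock held. */
static void sketch_set_maxdelay(struct sketch_in_device *idev, uint32_t delay)
{
	assert(idev != NULL);
	SKETCH_WRITE_ONCE(idev->mr_maxdelay, delay);
}

/* Reader: takes one snapshot and works only from that local copy. */
static uint32_t sketch_get_maxdelay(struct sketch_in_device *idev)
{
	assert(idev != NULL);
	return SKETCH_READ_ONCE(idev->mr_maxdelay);
}
```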
2026-01-26 | ipvlan: remove ipvlan_ht_addr_lookup() | Eric Dumazet | 1 | -21/+34
ipvlan_ht_addr_lookup() is called four times and not inlined.

Split it into ipvlan_ht_addr_lookup6() and ipvlan_ht_addr_lookup4() and rework ipvlan_addr_lookup() to call these helpers once, so that they are (auto)inlined.

After this change, ipvlan_addr_lookup() is faster, and we save 350 bytes of text on x86_64.

$ scripts/bloat-o-meter -t vmlinux.old vmlinux.new
add/remove: 0/2 grow/shrink: 1/0 up/down: 123/-473 (-350)
Function                                     old     new   delta
ipvlan_addr_lookup                           467     590    +123
__pfx_ipvlan_ht_addr_lookup                   16       -     -16
ipvlan_ht_addr_lookup                        457       -    -457
Total: Before=22571833, After=22571483, chg -0.00%

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Mahesh Bandewar <maheshb@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20260122165049.2366985-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-26 | net: expand NETDEV_RSS_KEY_LEN to 256 bytes | Eric Dumazet | 3 | -10/+16
NETDEV_RSS_KEY_LEN has been 52 bytes since 2014.

Jakub suggested we bump the size to 128 bytes or more. Some drivers (like idpf) were already working around the core limit.

Since this change might cause some issues in admin scripts, bump it directly to 256 in one go.

tjbp26:~# cat /proc/sys/net/core/netdev_rss_key | wc -c
768

tjbp26:~# ethtool -x eth1
RX flow hash indirection table for eth1 with 32 RX ring(s):
...
RSS hash key:
fe:16:5b:2f:93:85:c2:c9:c1:ef:bd:60:c6:e0:2b:99:4d:bf:b7:14:c8:1e:8d:cb:31:17:51:da:55:eb:91:d9:9e:f9:89:9b:44:a1:dc:08:72:3a:b3:d6:31:86:9a:fe:02:3a:0d:eb:a1:7c:f5:a3:51:3b:08:56:c9:3f:71:69:01:ba:70:38
RSS hash function:
    toeplitz: on
    xor: off
    crc32: off

Suggested-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/netdev/20260122075206.504ec591@kernel.org/
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20260122190349.2771064-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
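The `wc -c` figure of 768 is consistent with a 256-byte key printed as two hex digits per byte, colon separators between bytes, and a trailing newline: 256 * 2 + 255 + 1 = 768. The sketch below reproduces that arithmetic. The output layout is an assumption inferred from the count, and `sketch_format_key()` is a hypothetical helper, not the kernel's sysctl handler.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Format a key as colon-separated hex terminated by a newline.
 * Returns the number of characters written (excluding the NUL),
 * or 0 if the output buffer is too small.
 */
static size_t sketch_format_key(const unsigned char *key, size_t len,
				char *out, size_t out_size)
{
	size_t pos = 0;

	assert(key != NULL && out != NULL);
	for (size_t i = 0; i < len; i++) {
		/* Each byte needs 3 characters plus room for the NUL. */
		if (pos + 4 > out_size)
			return 0;
		pos += (size_t)snprintf(out + pos, out_size - pos, "%02x%c",
					key[i], i + 1 < len ? ':' : '\n');
	}
	return pos;
}
```

Under the same assumption, the previous 52-byte key would have produced 156 characters, which is the kind of difference an admin script that parses this file by length could trip over.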
2026-01-26 | Merge branch 'net-few-critical-helpers-are-inlined-again' | Jakub Kicinski | 6 | -40/+47
Eric Dumazet says: ==================== net: few critical helpers are inlined again Recent devmem additions increased stack depth. Some helpers that were inlined in the past are now out-of-line. ==================== Link: https://patch.msgid.link/20260122045720.1221017-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-26 | net: inline get_netmem() and put_netmem() | Eric Dumazet | 2 | -23/+28
These helpers are used in network fast paths.

Only call out-of-line helpers for netmem case.

We might consider inlining __get_netmem() and __put_netmem() in the future.

$ scripts/bloat-o-meter -t vmlinux.3 vmlinux.4
add/remove: 6/6 grow/shrink: 22/1 up/down: 2614/-646 (1968)
Function                                     old     new   delta
pskb_carve                                  1669    1894    +225
gro_pull_from_frag0                            -     206    +206
get_page                                     190     380    +190
skb_segment                                 3561    3747    +186
put_page                                     595     765    +170
skb_copy_ubufs                              1683    1822    +139
__pskb_trim_head                             276     401    +125
__pskb_copy_fclone                           734     858    +124
skb_zerocopy                                1092    1215    +123
pskb_expand_head                             892    1008    +116
skb_split                                    828     940    +112
skb_release_data                             297     409    +112
___pskb_trim                                 829     941    +112
__skb_zcopy_downgrade_managed                120     226    +106
tcp_clone_payload                            530     634    +104
esp_ssg_unref                                191     294    +103
dev_gro_receive                             1464    1514     +50
__put_netmem                                   -      41     +41
__get_netmem                                   -      41     +41
skb_shift                                   1139    1175     +36
skb_try_coalesce                             681     714     +33
__pfx_put_page                               112     144     +32
__pfx_get_page                                32      64     +32
__pskb_pull_tail                            1137    1168     +31
veth_xdp_get                                 250     267     +17
__pfx_gro_pull_from_frag0                      -      16     +16
__pfx___put_netmem                             -      16     +16
__pfx___get_netmem                             -      16     +16
__pfx_put_netmem                              16       -     -16
__pfx_gro_try_pull_from_frag0                 16       -     -16
__pfx_get_netmem                              16       -     -16
put_netmem                                   114       -    -114
get_netmem                                   130       -    -130
napi_gro_frags                               929     771    -158
gro_try_pull_from_frag0                      196       -    -196
Total: Before=22565857, After=22567825, chg +0.01%

Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260122045720.1221017-5-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-26 | net: inline net_is_devmem_iov() | Eric Dumazet | 3 | -11/+13
1) Inline this small helper to reduce code size and decrease cpu costs.
2) Constify its argument.
3) Move it to include/net/netmem.h, as a prereq for the following patch.

$ scripts/bloat-o-meter -t vmlinux.2 vmlinux.3
add/remove: 0/2 grow/shrink: 0/4 up/down: 0/-158 (-158)
Function                                     old     new   delta
validate_xmit_skb                            866     857      -9
__pfx_net_is_devmem_iov                       16       -     -16
net_is_devmem_iov                             22       -     -22
get_netmem                                   152     130     -22
put_netmem                                   140     114     -26
tcp_recvmsg_locked                          3860    3797     -63
Total: Before=22566015, After=22565857, chg -0.00%

Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260122045720.1221017-4-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-26 | gro: change the BUG_ON() in gro_pull_from_frag0() | Eric Dumazet | 1 | -1/+1
Replace the BUG_ON() which never fired with a DEBUG_NET_WARN_ON_ONCE().

$ scripts/bloat-o-meter -t vmlinux.1 vmlinux.2
add/remove: 2/2 grow/shrink: 1/1 up/down: 370/-254 (116)
Function                                     old     new   delta
gro_try_pull_from_frag0                        -     196    +196
napi_gro_frags                               771     929    +158
__pfx_gro_try_pull_from_frag0                  -      16     +16
__pfx_gro_pull_from_frag0                     16       -     -16
dev_gro_receive                             1514    1464     -50
gro_pull_from_frag0                          188       -    -188
Total: Before=22565899, After=22566015, chg +0.00%

Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260122045720.1221017-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-26 | net: always inline skb_frag_unref() and __skb_frag_unref() | Eric Dumazet | 1 | -5/+5
clang is not inlining skb_frag_unref() and __skb_frag_unref() in the gro fast path.

It also does not inline gro_try_pull_from_frag0().

Using __always_inline fixes this issue and makes the kernel faster _and_ smaller.

Also change __skb_frag_ref(), skb_frag_ref() and skb_page_unref() so that they are inlined for the last patch in this series.

$ scripts/bloat-o-meter -t vmlinux.0 vmlinux.1
add/remove: 2/6 grow/shrink: 1/2 up/down: 218/-511 (-293)
Function                                     old     new   delta
gro_pull_from_frag0                            -     188    +188
__pfx_gro_pull_from_frag0                      -      16     +16
skb_shift                                   1125    1139     +14
__pfx_skb_frag_unref                          16       -     -16
__pfx_gro_try_pull_from_frag0                 16       -     -16
__pfx___skb_frag_unref                        16       -     -16
__skb_frag_unref                              36       -     -36
skb_frag_unref                                59       -     -59
dev_gro_receive                             1608    1514     -94
napi_gro_frags                               892     771    -121
gro_try_pull_from_frag0                      153       -    -153
Total: Before=22566192, After=22565899, chg -0.00%

Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260122045720.1221017-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
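A minimal userspace sketch of forcing small helpers inline, assuming a GCC or clang toolchain. `SKETCH_ALWAYS_INLINE` mirrors what the kernel spells `__always_inline`; the fragment type and the `sketch_*` helpers are hypothetical and unrelated to the real skb code.

```c
#include <assert.h>

/* GCC/clang attribute; with it the compiler's inlining heuristics are bypassed. */
#define SKETCH_ALWAYS_INLINE inline __attribute__((always_inline))

/* Hypothetical reference-counted fragment. */
struct sketch_frag {
	int refcount;
};

/* Inner helper: drops one reference, returns 1 when the last one is gone. */
static SKETCH_ALWAYS_INLINE int sketch_frag_unref_inner(struct sketch_frag *frag)
{
	assert(frag->refcount > 0);
	return --frag->refcount == 0;
}

/* Thin wrapper; without the attribute a compiler may leave it out of line. */
static SKETCH_ALWAYS_INLINE int sketch_frag_unref(struct sketch_frag *frag)
{
	return sketch_frag_unref_inner(frag);
}

/* Fast-path caller: both helpers are expanded into this loop body. */
int sketch_release_frags(struct sketch_frag *frags, int n)
{
	int freed = 0;

	for (int i = 0; i < n; i++)
		freed += sketch_frag_unref(&frags[i]);
	return freed;
}
```

Whether forced inlining shrinks or grows the binary depends on the call sites, which is why the commit backs its claim with bloat-o-meter output.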
2026-01-26 | Merge branch 'u64_stats-introduce-u64_stats_copy' | Jakub Kicinski | 4 | -5/+20
David Yang says: ==================== u64_stats: Introduce u64_stats_copy() On 64bit arches, struct u64_stats_sync is empty and provides no help against load/store tearing. memcpy() should not be considered atomic against u64 values. Use u64_stats_copy() instead. ==================== Link: https://patch.msgid.link/20260120092137.2161162-1-mmyangfl@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-26 | vxlan: vnifilter: fix memcpy with u64_stats | David Yang | 1 | -1/+1
On 64bit arches, struct u64_stats_sync is empty and provides no help against load/store tearing. memcpy() should not be considered atomic against u64 values. Use u64_stats_copy() instead. Signed-off-by: David Yang <mmyangfl@gmail.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/20260120092137.2161162-5-mmyangfl@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-26 | macsec: fix memcpy with u64_stats | David Yang | 1 | -3/+3
On 64bit arches, struct u64_stats_sync is empty and provides no help against load/store tearing. memcpy() should not be considered atomic against u64 values. Use u64_stats_copy() instead. Signed-off-by: David Yang <mmyangfl@gmail.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/20260120092137.2161162-4-mmyangfl@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-26 | net: bridge: mcast: fix memcpy with u64_stats | David Yang | 1 | -1/+1
On 64bit arches, struct u64_stats_sync is empty and provides no help against load/store tearing. memcpy() should not be considered atomic against u64 values. Use u64_stats_copy() instead. Signed-off-by: David Yang <mmyangfl@gmail.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/20260120092137.2161162-3-mmyangfl@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-26 | u64_stats: Introduce u64_stats_copy() | David Yang | 1 | -0/+15
The following (anti-)pattern was observed in the code tree:

    do {
            start = u64_stats_fetch_begin(&pstats->syncp);
            memcpy(&temp, &pstats->stats, sizeof(temp));
    } while (u64_stats_fetch_retry(&pstats->syncp, start));

On 64bit arches, struct u64_stats_sync is empty and provides no help against load/store tearing, especially for memcpy(), for which arches may provide their own highly-optimized implementations.

In theory the affected code should convert to u64_stats_t, or use READ_ONCE()/WRITE_ONCE() properly.

However, since there is a need to copy chunks of statistics, instead of writing loops at random places, we provide a safe memcpy() variant for u64_stats.

Signed-off-by: David Yang <mmyangfl@gmail.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/20260120092137.2161162-2-mmyangfl@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
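One way to picture a memcpy() variant that is safe for 64-bit counters is a loop that copies one whole u64 at a time through a volatile access, so no element is read in smaller pieces. The sketch below is a userspace illustration under that assumption; `sketch_u64_copy()` is a hypothetical name and this is not presented as the kernel's implementation of u64_stats_copy().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Copy len bytes of 64-bit counters from src to dst, one counter per load.
 * The volatile qualifier keeps the compiler from merging or splitting the
 * individual 64-bit loads. len must be a multiple of sizeof(uint64_t).
 */
static void *sketch_u64_copy(void *dst, const void *src, size_t len)
{
	const volatile uint64_t *s = src;
	uint64_t *d = dst;

	assert(dst != NULL && src != NULL);
	assert(len % sizeof(uint64_t) == 0);
	for (size_t i = 0; i < len / sizeof(uint64_t); i++)
		d[i] = s[i];
	return dst;
}

/* Hypothetical statistics block made only of 64-bit counters. */
struct sketch_stats {
	uint64_t rx_packets;
	uint64_t rx_bytes;
	uint64_t rx_errors;
};
```

Such a helper would take the place of the memcpy() call inside the fetch_begin/fetch_retry loop quoted above.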
2026-01-26 | Documentation: net: Fix typos in netdevices.rst | Dimitri Daskalakis | 1 | -2/+2
Fixes two minor typos. Specifically, on -> or and Devices -> Device. Signed-off-by: Dimitri Daskalakis <dimitri.daskalakis1@gmail.com> Link: https://patch.msgid.link/20260122225723.2368698-1-dimitri.daskalakis1@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-23 | Merge branch 'net-rds-rds-tcp-state-machine-and-message-loss-improvements' | Jakub Kicinski | 6 | -81/+171
Allison Henderson says: ==================== net/rds: RDS-TCP state machine and message loss improvements This is subset 2 of the larger RDS-TCP patch series I posted last Oct. The greater series aims to correct multiple rds-tcp issues that can cause dropped or out of sequence messages. I've broken it down into smaller sets to make reviews more manageable. In this set, we correct a few RDS/TCP connection handling issues, and message loss issues. ==================== Link: https://patch.msgid.link/20260122055213.83608-1-achender@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-23net/rds: rds_tcp_accept_one ought to not discard messagesGerd Rausch6-76/+169
RDS/TCP differs from RDS/RDMA in that message acknowledgment is done based on TCP sequence numbers: As soon as the last byte of a message has been acknowledged by the TCP stack of a peer, "rds_tcp_write_space()" goes on to discard prior messages from the send queue. Which is fine, for as long as the receiver never throws any messages away. Unfortunately, that is *not* the case since the introduction of MPRDS: commit 1a0e100fb2c96 "RDS: TCP: Enable multipath RDS for TCP" A new function "rds_tcp_accept_one_path" was introduced, which is entitled to return "NULL", if no connection path is currently available. Unfortunately, this happens after the "->accept()" call, and the new socket often already contains messages, since the peer already transitioned to "RDS_CONN_UP" on behalf of "TCP_ESTABLISHED". That's also the case after this [1]: commit 1a0e100fb2c96 "RDS: TCP: Force every connection to be initiated by numerically smaller IP address" which tried to address the situation of pending data by only transitioning connections from a smaller IP address to "RDS_CONN_UP". But even in those cases, and in particular if the "RDS_EXTHDR_NPATHS" handshake has not occurred yet, and therefore we're working with "c_npaths <= 1", "c_conn[0]" may be in a state distinct from "RDS_CONN_DOWN", and therefore all messages on the just accepted socket will be tossed away. This fix changes "rds_tcp_accept_one": * If connected from a peer with a larger IP address, the new socket will continue to get closed right away. With commit [1] above, there should not be any messages in the socket receive buffer, since the peer never transitioned to "RDS_CONN_UP". Therefore it should be okay to not make any efforts to dispatch the socket receive buffer. * If connected from a peer with a smaller IP address, we call "rds_tcp_accept_one_path" to find a free slot/"path". If found, business goes on as usual. 
If none was found, we save/stash the newly accepted socket into "rds_tcp_accepted_sock", in order to not lose any messages that may have arrived already. We then return from "rds_tcp_accept_one" with "-ENOBUFS". Later on, when a slot/"path" does become available again (e.g. state transitioned to "RDS_CONN_DOWN", or HS extension header was received with "c_npaths > 1") we call "rds_tcp_conn_slots_available" that simply re-issues a "rds_tcp_accept_one_path" worker-callback and picks up the new socket from "rds_tcp_accepted_sock", and thereby continuing where it left with "-ENOBUFS" last time. Since a new slot has become available, those messages won't be lost, since processing proceeds as if that slot had been available the first time around. Signed-off-by: Gerd Rausch <gerd.rausch@oracle.com> Signed-off-by: Jack Vogel <jack.vogel@oracle.com> Signed-off-by: Allison Henderson <allison.henderson@oracle.com> Link: https://patch.msgid.link/20260122055213.83608-3-achender@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
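The stash-and-retry flow described above can be sketched in plain C. This is a simplified model under stated assumptions, not the RDS code: struct sock_sketch, struct listener_sketch and accept_one_sketch() are hypothetical, the "stashed" field plays the role of "rds_tcp_accepted_sock", and a free-slot counter replaces the real path lookup.

```c
#include <errno.h>
#include <stddef.h>

struct sock_sketch { int id; };          /* hypothetical accepted socket */

struct listener_sketch {
	struct sock_sketch *stashed;     /* accepted socket waiting for a path */
	int free_paths;                  /* number of currently free slots */
};

/* Try to attach a socket to a free path.
 * A stashed socket takes priority over new_sock; this sketch assumes the
 * caller does not accept a further socket while one is stashed.
 * Returns 0 and sets *attached on success, -ENOBUFS if the socket had to
 * be stashed, -EAGAIN if there was nothing to process. */
static int accept_one_sketch(struct listener_sketch *l,
			     struct sock_sketch *new_sock,
			     struct sock_sketch **attached)
{
	struct sock_sketch *s = l->stashed ? l->stashed : new_sock;

	if (!s)
		return -EAGAIN;

	if (l->free_paths == 0) {
		/* Keep the socket (and any data already queued on it). */
		l->stashed = s;
		return -ENOBUFS;
	}

	l->stashed = NULL;
	l->free_paths--;
	*attached = s;
	return 0;
}
```

When a slot frees up, the caller would simply run accept_one_sketch() again with new_sock set to NULL, which picks up the stashed socket and continues as if the slot had been free the first time.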
2026-01-23net/rds: No shortcut out of RDS_CONN_ERRORGerd Rausch2-5/+2
RDS connections carry a state "rds_conn_path::cp_state", and transitions from one state to another are conditional upon an expected state: "rds_conn_path_transition". There is one exception to this conditionality, which is "RDS_CONN_ERROR", which can be enforced by "rds_conn_path_drop" regardless of what state the connection is currently in. But as soon as a connection enters state "RDS_CONN_ERROR", the connection handling code expects it to go through the shutdown-path. The RDS/TCP multipath changes added a shortcut out of "RDS_CONN_ERROR" straight back to "RDS_CONN_CONNECTING" via "rds_tcp_accept_one_path" (e.g. after "rds_tcp_state_change"). A subsequent "rds_tcp_reset_callbacks" can then transition the state to "RDS_CONN_RESETTING" with a shutdown-worker queued. That'll trip up "rds_conn_init_shutdown", which was never adjusted to handle "RDS_CONN_RESETTING" and subsequently drops the connection with the dreaded "DR_INV_CONN_STATE", which leaves "RDS_SHUTDOWN_WORK_QUEUED" on forever. So we do two things here: a) Don't shortcut "RDS_CONN_ERROR", but take the longer path through the shutdown code. b) Add "RDS_CONN_RESETTING" to the expected states in "rds_conn_init_shutdown" so that we won't error out and get stuck, if we ever hit weird state transitions like this again. Signed-off-by: Gerd Rausch <gerd.rausch@oracle.com> Signed-off-by: Allison Henderson <allison.henderson@oracle.com> Link: https://patch.msgid.link/20260122055213.83608-2-achender@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
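A small userspace model may help picture the conditional transitions and the unconditional drop to the error state. All names below (conn_path, path_transition(), path_drop(), path_begin_shutdown()) are hypothetical, and C11 atomics stand in for the kernel's atomic helpers; the sketch only mirrors the shape of the logic described above.

```c
#include <stdatomic.h>
#include <stdbool.h>

enum conn_state {
	CONN_DOWN,
	CONN_CONNECTING,
	CONN_UP,
	CONN_ERROR,
	CONN_RESETTING,
	CONN_DISCONNECTING,
};

struct conn_path { _Atomic int state; };

/* Conditional transition: succeeds only if the path is in state 'old'. */
static bool path_transition(struct conn_path *cp, int old, int new_state)
{
	return atomic_compare_exchange_strong(&cp->state, &old, new_state);
}

/* The one unconditional transition: force the error state. */
static void path_drop(struct conn_path *cp)
{
	atomic_store(&cp->state, CONN_ERROR);
}

/* Shutdown entry: accept every state shutdown may legitimately start
 * from, including RESETTING, so the path cannot get stuck. */
static bool path_begin_shutdown(struct conn_path *cp)
{
	return path_transition(cp, CONN_UP, CONN_DISCONNECTING) ||
	       path_transition(cp, CONN_ERROR, CONN_DISCONNECTING) ||
	       path_transition(cp, CONN_RESETTING, CONN_DISCONNECTING);
}
```

In this model there is deliberately no transition from CONN_ERROR to CONN_CONNECTING: a path in the error state can only leave it through path_begin_shutdown(), which corresponds to point a) of the commit message.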
2026-01-23Merge branch 'net-restore-the-structure-of-driver-facing-qcfg-api'Jakub Kicinski7-49/+152
Jakub Kicinski says: ==================== net: restore the structure of driver-facing qcfg API The goal of qcfg objects is to let us seamlessly support new use cases without modifying all the drivers. We want to pull all the logic of combining configuration supplied via different interfaces into the core and present the drivers with a flat queue-by-queue configuration. Additionally we want to separate the current effective configuration from the user intent (default vs user setting vs memory provider setting). Restructure the recently added code to re-introduce the pieces that are missing compared to the old RFC: https://lore.kernel.org/20250421222827.283737-1-kuba@kernel.org Namely: - the netdev_queue_config() helper - queue config validation callback I hopefully removed all the more "out there" parts of the RFC. ==================== Link: https://patch.msgid.link/20260122005113.2476634-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-23eth: bnxt: plug bnxt_validate_qcfg() into qopsJakub Kicinski1-5/+6
Plug bnxt_validate_qcfg() back into qops, where it was in my old RFC. Link: https://patch.msgid.link/20260122005113.2476634-7-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-23net: add queue config validation callbackJakub Kicinski4-14/+76
I imagine (tm) that as the number of per-queue configuration options grows some of them may conflict for certain drivers. While the drivers can obviously do all the validation locally, doing so is fairly inconvenient as the config is fed to drivers piecemeal via different ops (for different params and NIC-wide vs per-queue). Add a centralized callback for validating the queue config in queue ops. The callback gets invoked before memory provider is installed, and in the future should also be called when ring params are modified. The validation is done after each layer of configuration. Since we can't fail MP un-binding we must make sure that the config is valid both before and after MP overrides are applied. This is moot for now since the set of MP and device configs are disjoint. It will matter significantly in the future, so adding it now so that we don't forget. Link: https://patch.msgid.link/20260122005113.2476634-6-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
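The "validate after each layer" rule can be sketched as follows. This is an illustrative model only: struct qcfg_sketch, build_qcfg_sketch() and example_validate() are hypothetical, and the single rx_page_size field and its limits are assumptions chosen to keep the example short, not the driver-facing API.

```c
#include <errno.h>
#include <stddef.h>

struct qcfg_sketch { unsigned int rx_page_size; };

typedef int (*validate_fn)(const struct qcfg_sketch *cfg);

/* Build the effective config in layers and validate after each layer.
 * The device-level config must be valid on its own, because removing the
 * memory provider (MP) override later is not allowed to fail. */
static int build_qcfg_sketch(struct qcfg_sketch *out,
			     const struct qcfg_sketch *dev_cfg,
			     const struct qcfg_sketch *mp_override,
			     validate_fn validate)
{
	struct qcfg_sketch cfg = *dev_cfg;
	int err;

	err = validate(&cfg);                 /* layer 1: device config */
	if (err)
		return err;

	if (mp_override && mp_override->rx_page_size) {
		cfg.rx_page_size = mp_override->rx_page_size;
		err = validate(&cfg);         /* layer 2: with MP override */
		if (err)
			return err;
	}

	*out = cfg;                           /* commit only if all layers pass */
	return 0;
}

/* Hypothetical driver rule: power of two between 4 KiB and 64 KiB. */
static int example_validate(const struct qcfg_sketch *cfg)
{
	unsigned int sz = cfg->rx_page_size;

	if (sz < 4096 || sz > 65536 || (sz & (sz - 1)))
		return -EINVAL;
	return 0;
}
```

Because the output is written only after every layer has passed, a rejected override leaves the previously effective configuration untouched.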
2026-01-23net: use netdev_queue_config() for mp restartJakub Kicinski4-39/+37
We should follow the prepare/commit approach for queue configuration. The qcfg struct should be added to dev->cfg rather than directly to queue objects so that we can clone and discard the pending config easily. Remove the qcfg in struct netdev_rx_queue, and switch remaining callers to netdev_queue_config(). netdev_queue_config() will construct the qcfg on the fly based on device defaults and state of the queue. ndo_default_qcfg becomes optional because having the callback itself does not have any meaningful semantics to us. Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com> Link: https://patch.msgid.link/20260122005113.2476634-5-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-23net: move mp->rx_page_size validation to __net_mp_open_rxq()Jakub Kicinski1-4/+6
Move mp->rx_page_size validation where the rest of MP input validation lives. No other caller is modifying mp params so validation logic in queue restarts is out of place. Link: https://patch.msgid.link/20260122005113.2476634-4-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-23net: introduce a trivial netdev_queue_config()Jakub Kicinski4-3/+39
We may choose to extend or reimplement the logic which renders the per-queue config. The drivers should not poke directly into the queue state. Add a helper for drivers to use when they want to query the config for a specific queue. Link: https://patch.msgid.link/20260122005113.2476634-3-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-23eth: bnxt: always set the queue mgmt opsJakub Kicinski1-1/+5
Core provides a centralized callback for validating per-queue settings but the callback is part of the queue management ops. Having the ops conditionally set complicates the parts of the driver which could otherwise lean on the core to feed it the correct settings. Always set the queue ops, but provide no restart-related callbacks if queue ops are not supported by the device. This should maintain current behavior, the check in netdev_rx_queue_restart() looks both at op struct and individual ops. Reviewed-by: Subbaraya Sundeep <sbhatta@marvell.com> Link: https://patch.msgid.link/20260122005113.2476634-2-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-23Merge branch 'selftest-extend-tun-virtio-coverage-for-gso-over-udp-tunnel'Jakub Kicinski3-39/+1265
Xu Du says: ==================== selftest: Extend tun/virtio coverage for GSO over UDP tunnel The design strategy is to extend the existing tun testing infrastructure to support this new use-case, rather than introducing a new or parallel framework. This allows for better integration and re-use of existing test logic. ==================== Link: https://patch.msgid.link/cover.1768979440.git.xudu@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-23selftest: tun: Add test data for success and failure pathsXu Du1-2/+113
To improve the robustness and coverage of the TUN selftests, this patch expands the set of test data. Signed-off-by: Xu Du <xudu@redhat.com> Link: https://patch.msgid.link/5054f3ad9f3dbfe33b827183fccc5efeb8fd0da7.1768979440.git.xudu@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-23selftest: tun: Add test for receiving gso packet from tunXu Du1-0/+194
The test validates that GSO information is correctly exposed when reading packets from a TUN device. Signed-off-by: Xu Du <xudu@redhat.com> Link: https://patch.msgid.link/fe75ac66466380490eba858eef50596a1bfbd071.1768979440.git.xudu@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-23selftest: tun: Add test for sending gso packet into tunXu Du1-9/+135
The test constructs a raw packet, prepends a virtio_net_hdr, and writes the result to the TUN device. This mimics the behavior of a VM forwarding a guest's packet to the host networking stack. Signed-off-by: Xu Du <xudu@redhat.com> Link: https://patch.msgid.link/a988dbc9ca109e4f1f0b33858c5035bce8ebede3.1768979440.git.xudu@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
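As a rough picture of the "prepend a virtio_net_hdr" step, the sketch below builds such a frame in a buffer. It is not the selftest's code: build_tun_frame() is a hypothetical helper, plain TCPv4 GSO is used as a simpler stand-in for the UDP tunnel GSO types the series exercises, and the header fields are written in host byte order, which assumes a little-endian host. The resulting buffer is what would be passed to write() on a TUN file descriptor opened with IFF_VNET_HDR.

```c
#include <linux/virtio_net.h>
#include <stddef.h>
#include <string.h>

/* Build "virtio_net_hdr + packet" in buf.
 * hdr_len is the length of the packet headers, gso_size the payload size
 * of each resulting segment. Returns the total frame length, or 0 if buf
 * is too small. */
static size_t build_tun_frame(unsigned char *buf, size_t buflen,
			      const void *pkt, size_t pktlen,
			      unsigned short hdr_len, unsigned short gso_size)
{
	struct virtio_net_hdr vh;

	if (buflen < sizeof(vh) + pktlen)
		return 0;

	memset(&vh, 0, sizeof(vh));
	vh.gso_type = VIRTIO_NET_HDR_GSO_TCPV4;   /* stand-in GSO type */
	vh.hdr_len = hdr_len;
	vh.gso_size = gso_size;

	memcpy(buf, &vh, sizeof(vh));             /* header first ... */
	memcpy(buf + sizeof(vh), pkt, pktlen);    /* ... then the raw packet */
	return sizeof(vh) + pktlen;
}
```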
2026-01-23selftest: tun: Add helpers for GSO over UDP tunnelXu Du1-0/+425
In preparation for testing GSO over UDP tunnels, enhance the test infrastructure to support a more complex data path involving a TUN device and a GENEVE udp tunnel. This patch introduces a dedicated setup/teardown topology that creates both a GENEVE tunnel interface and a TUN interface. The TUN device acts as the VTEP (Virtual Tunnel Endpoint), allowing it to send and receive virtio-net packets. This setup effectively tests the kernel's data path for encapsulated traffic. Note that after adding a new address to the UDP tunnel, we need to wait a bit until the associated route is available. Additionally, a new data structure is defined to manage test parameters. This structure is designed to be extensible, allowing different test data and configurations to be easily added in subsequent patches. Signed-off-by: Xu Du <xudu@redhat.com> Link: https://patch.msgid.link/b5787b8c269f43ce11e1756f1691cc7fd9a1e901.1768979440.git.xudu@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-23selftest: tun: Refactor tun_delete to use tuntap_helpersXu Du2-40/+15
The previous patch introduced common tuntap helpers to simplify tun test code. This patch refactors the tun_delete function to use these new helpers. Signed-off-by: Xu Du <xudu@redhat.com> Link: https://patch.msgid.link/ecc7c0c2d75d87cb814e97579e731650339703ab.1768979440.git.xudu@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-23selftest: tun: Introduce tuntap_helpers.h header for TUN/TAP testingXu Du1-0/+390
Introduce rtnetlink manipulation and packet construction helpers that will simplify the later creation of more related test cases. This avoids duplicating logic across different test cases. This new header will contain: - YNL-based netlink management utilities. - Helpers for ip link, ip address, ip neighbor and ip route operations. - Packet construction and manipulation helpers. Signed-off-by: Xu Du <xudu@redhat.com> Link: https://patch.msgid.link/91f905715c69c75f7bf72d43388921fde6c34989.1768979440.git.xudu@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-23selftest: tun: Format tun.c existing codeXu Du1-10/+15
In preparation for adding new tests for GSO over UDP tunnels, apply consistently the kernel style to the existing code. Signed-off-by: Xu Du <xudu@redhat.com> Link: https://patch.msgid.link/d797de1e5a3d215dd78cb46775772ef682bab60e.1768979440.git.xudu@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-23Merge branch 'geneve-introduce-double-tunnel-gso-gro-support'Jakub Kicinski10-38/+974
Paolo Abeni says: ==================== geneve: introduce double tunnel GSO/GRO support This is the [belated] incarnation of the topic discussed at the last Netconf [1]. In container orchestration in virtual environments there is a consistent usage of double UDP tunneling - specifically geneve. Such a setup lacks support for GRO and GSO for inter-VM traffic. After commit b430f6c38da6 ("Merge branch 'virtio_udp_tunnel_08_07_2025' of https://github.com/pabeni/linux-devel") and the qemu counterpart, VMs are able to send/receive GSO over UDP aggregated packets. This series introduces the missing bit for full end-to-end aggregation in the above mentioned scenario. Specifically: - introduces a new netdev feature set to generalize existing per device driver GSO admission check. - adds GSO partial support for the geneve and vxlan drivers - introduces and uses a geneve option to assist double tunnel GRO - adds some simple functional tests for the above. The new device features set is not strictly needed for the following work, but avoids the introduction of trivial `ndo_features_check` to support GSO partial and thus possible performance regression due to the additional indirect call. Such a feature set could be leveraged by a number of existing drivers (intel, meta and possibly wangxun) to avoid duplicate code/tests. That part has been omitted here to keep the series small. Both GSO partial support and double GRO support have some downsides. With the first in place, GSO partial packets will traverse the network stack 'downstream' the outer geneve UDP tunnel and will be visible to the UDP/IP/IPv6 layers and to netfilter. Currently only H/W NICs implement GSO partial support and such packets are visible only via software taps. Double UDP tunnel GRO will cook 'GSO partial' like aggregate packets, i.e.
the inner UDP encapsulation headers set will still carry the wire-level lengths and csum, so that segmentation considering such headers parts of a giant, constant encapsulation header will yield the correct result. The correct GSO packet layout is applied when the packet traverse the outermost geneve encapsulation. Both GSO partial and double UDP encap are disabled by default and must be explicitly enabled via, respectively ethtool and geneve device configuration. Finally note that the GSO partial feature could potentially be applied to all the other UDP tunnels, but this series limits its usage to geneve and vxlan devices. Link: https://netdev.bots.linux.dev/netconf/2024/paolo.pdf [1] ==================== Link: https://patch.msgid.link/cover.1769011015.git.pabeni@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-23selftests: net: tests for add double tunneling GRO/GSOPaolo Abeni3-0/+395
Create a simple, netns-based topology with double, nested UDP tunnels and perform TSO transfers on top. Explicitly enable GSO and/or GRO and check the skb layout consistency with different configurations allowing (or not) GSO frames to be delivered on the other end. The trickiest part is accounting in a robust way for the aggregated/unaggregated packets with double encapsulation: use a classic bpf filter for it. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Link: https://patch.msgid.link/61f2c98ba0f73057c2d6f6cb62eb807abd90bf6b.1769011015.git.pabeni@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-23geneve: use GRO hint option in the RX pathPaolo Abeni1-7/+176
At the GRO stage, when a valid hint option is found, try to match the whole nested headers and try to aggregate on the inner protocol; in case of hdr mismatch extract the nested address and port to properly flush on a per-inner flow basis. On GRO completion, the (unmodified) nested headers will be considered part of the (constant) outer geneve encap header so that plain UDP tunnel segmentation will yield valid wire packets. In the geneve RX path, when processing a GSO packet carrying a GRO hint option, update the nested header length fields from the wire packet size to the GSO-packet one. If the nested header additionally carries a checksum, convert it to CSUM-partial. Finally, when the RX path leverages the GRO hints, skip the additional GRO stage done by GRO cells: otherwise the already set skb->encapsulation flag will fool the GRO cells complete step into touching the innermost IP header when it should update the nested csum, corrupting the packet. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Link: https://patch.msgid.link/4a9a390588a429191e0ffe48ccdd288bb69e567e.1769011015.git.pabeni@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-23geneve: extract hint option at GRO stagePaolo Abeni1-5/+183
Add helpers for finding a GRO hint option in the geneve header, performing basic sanitization of the option offsets vs the actual packet layout, validating the option for GRO aggregation and checking the nested header checksum. The validation helper closely mirrors similar checks performed by the ipv4 and ipv6 gro callbacks, with the additional twist of accessing the relevant network header via the GRO hint offset. To validate the nested UDP checksum, leverage the csum completed of the outer header, similarly to LCO, with the main difference that in this case we have the outer checksum available. Use the helpers to extract the hint info at the GRO stage. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Link: https://patch.msgid.link/cd0e9dc42ba83f388b604097cffe268ffcb53351.1769011015.git.pabeni@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>