summaryrefslogtreecommitdiff
path: root/net/core/dev.c
AgeCommit message (Collapse)AuthorFilesLines
2026-05-09net: napi: Avoid gro timer misfiring at end of busypollDragos Tatulea1-9/+12
When in irq deferral mode (defer-hard-irqs > 0), a short enough gro-flush timeout can trigger before NAPI_STATE_SCHED is cleared if the last poll in busy_poll_stop() takes too long. This can have the effect of leaving the queue stuck with interrupts disabled and no timer armed which results in a tx timeout if there is no subsequent busypoll cycle. To prevent this, defer the gro-flush timer arm after the last poll. Fixes: 7fd3253a7de6 ("net: Introduce preferred busy-polling") Co-developed-by: Martin Karsten <mkarsten@uwaterloo.ca> Signed-off-by: Martin Karsten <mkarsten@uwaterloo.ca> Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com> Reviewed-by: Joe Damato <joe@dama.to> Link: https://patch.msgid.link/20260506090808.820559-2-dtatulea@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-05-05net: prevent possible UAF in rtnl_prop_list_size()Eric Dumazet1-1/+1
I was mistaken by synchronize_rcu() [1] call in netdev_name_node_alt_destroy(), giving a false sense of RCU safety at delete times. We have to use list_del_rcu() to not confuse potential readers in rtnl_prop_list_size(). [1] This synchronize_rcu() call was later removed in commit 723de3ebef03 ("net: free altname using an RCU callback"). Fixes: 9f30831390ed ("net: add rcu safety to rtnl_prop_list_size()") Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260502124102.499204-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-04-24Merge tag 'net-deletions' of ↵Linus Torvalds1-7/+0
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next Pull networking deletions from Jakub Kicinski: "Delete some obsolete networking code Old code like amateur radio and NFC have long been a burden to core networking developers. syzbot loves to find bugs in BKL-era code, and noobs try to fix them. If we want to have a fighting chance of surviving the LLM-pocalypse this code needs to find a dedicated owner or get deleted. We've talked about these deletions multiple times in the past and every time someone wanted the code to stay. It is never very clear to me how many of those people actually use the code vs are just nostalgic to see it go. Amateur radio did have occasional users (or so I think) but most users switched to user space implementations since its all super slow stuff. Nobody stepped up to maintain the kernel code. We were lucky enough to find someone who wants to help with NFC so we're giving that a chance. Let's try to put the rest of this code behind us" * tag 'net-deletions' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: drivers: net: 8390: wd80x3: Remove this driver drivers: net: 8390: ultra: Remove this driver drivers: net: 8390: AX88190: Remove this driver drivers: net: fujitsu: fmvj18x: Remove this driver drivers: net: smsc: smc91c92: Remove this driver drivers: net: smsc: smc9194: Remove this driver drivers: net: amd: nmclan: Remove this driver drivers: net: amd: lance: Remove this driver drivers: net: 3com: 3c589: Remove this driver drivers: net: 3com: 3c574: Remove this driver drivers: net: 3com: 3c515: Remove this driver drivers: net: 3com: 3c509: Remove this driver net: packetengines: remove obsolete yellowfin driver and vendor dir net: packetengines: remove obsolete hamachi driver net: remove unused ATM protocols and legacy ATM device drivers net: remove ax25 and amateur radio (hamradio) subsystem net: remove ISDN subsystem and Bluetooth CMTP caif: remove CAIF NETWORK LAYER
2026-04-23net: remove unused ATM protocols and legacy ATM device driversJakub Kicinski1-7/+0
Remove the ATM protocol modules and PCI/SBUS ATM device drivers that are no longer in active use. The ATM core protocol stack, PPPoATM, BR2684, and USB DSL modem drivers (drivers/usb/atm/) are retained in-tree to maintain PPP over ATM (PPPoA) and PPPoE-over-BR2684 support for DSL connections. The Solos ADSL2+ PCI driver is also retained. Removed ATM protocol modules: - net/atm/clip.c - Classical IP over ATM (RFC 2225) - net/atm/lec.c - LAN Emulation Client (LANE) - net/atm/mpc.c, mpoa_caches.c, mpoa_proc.c - Multi-Protocol Over ATM Removed PCI/SBUS ATM device drivers (drivers/atm/): - adummy, atmtcp - software/testing ATM devices - eni - Efficient Networks ENI155P (OC-3, ~1995) - fore200e - FORE Systems 200E PCI/SBUS (OC-3, ~1999) - he - ForeRunner HE (OC-3/OC-12, ~2000) - idt77105 - IDT 77105 25 Mbps ATM PHY - idt77252 - IDT 77252 NICStAR II (OC-3, ~2000) - iphase - Interphase ATM PCI (OC-3/DS3/E3) - lanai - Efficient Networks Speedstream 3010 - nicstar - IDT 77201 NICStAR (155/25 Mbps, ~1999) - suni - PMC S/UNI SONET PHY library Also clean up references in: - net/bridge/ - remove ATM LANE hook (br_fdb_test_addr_hook, br_fdb_test_addr) - net/core/dev.c - remove br_fdb_test_addr_hook export - defconfig files - remove ATM driver config options The removed code is moved to an out-of-tree module package (mod-orphan). Acked-by: Andy Shevchenko <andriy.shevchenko@intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://patch.msgid.link/20260422041846.2035118-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-04-21net: warn ops-locked drivers still using ndo_set_rx_modeStanislav Fomichev1-0/+5
Now that all in-tree ops-locked drivers have been converted to ndo_set_rx_mode_async, add a warning in register_netdevice to catch any remaining or newly added drivers that use ndo_set_rx_mode with ops locking. This ensures future driver authors are guided toward the async path. Also route ops-locked devices through netdev_rx_mode_work even if they lack rx_mode NDOs, to ensure netdev_ops_assert_locked() does not fire on the legacy path where only RTNL is held. Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Signed-off-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20260416185712.2155425-14-sdf@fomichev.me Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-04-21net: move promiscuity handling into netdev_rx_mode_workStanislav Fomichev1-12/+4
Move unicast promiscuity tracking into netdev_rx_mode_work so it runs under netdev_ops_lock instead of under the addr_lock spinlock. This is required because __dev_set_promiscuity calls dev_change_rx_flags and __dev_notify_flags, both of which may need to sleep. Change ASSERT_RTNL() to netdev_ops_assert_locked() in __dev_set_promiscuity, netif_set_allmulti and __dev_change_flags since these are now called from the work queue under the ops lock. Link: https://lore.kernel.org/netdev/20260214033859.43857-1-jiayuan.chen@linux.dev/ Fixes: 78cd408356fe ("net: add missing instance lock to dev_set_promiscuity") Reported-by: syzbot+2b3391f44313b3983e91@syzkaller.appspotmail.com Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Signed-off-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20260416185712.2155425-5-sdf@fomichev.me Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-04-21net: cache snapshot entries for ndo_set_rx_mode_asyncStanislav Fomichev1-0/+3
Add a per-device netdev_hw_addr_list cache (rx_mode_addr_cache) that allows __hw_addr_list_snapshot() and __hw_addr_list_reconcile() to reuse previously allocated entries instead of hitting GFP_ATOMIC on every snapshot cycle. snapshot pops entries from the cache when available, falling back to __hw_addr_create(). reconcile splices both snapshot lists back into the cache via __hw_addr_splice(). The cache is flushed in free_netdev(). Signed-off-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20260416185712.2155425-4-sdf@fomichev.me Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-04-21net: introduce ndo_set_rx_mode_async and netdev_rx_mode_workStanislav Fomichev1-41/+2
Add ndo_set_rx_mode_async callback that drivers can implement instead of the legacy ndo_set_rx_mode. The legacy callback runs under the netif_addr_lock spinlock with BHs disabled, preventing drivers from sleeping. The async variant runs from a work queue with rtnl_lock and netdev_lock_ops held, in fully sleepable context. When __dev_set_rx_mode() sees ndo_set_rx_mode_async, it schedules netdev_rx_mode_work instead of calling the driver inline. The work function takes two snapshots of each address list (uc/mc) under the addr_lock, then drops the lock and calls the driver with the work copies. After the driver returns, it reconciles the snapshots back to the real lists under the lock. Add netif_rx_mode_sync() to opportunistically execute the pending workqueue update inline, so that rx mode changes are committed before returning to userspace: - dev_change_flags (SIOCSIFFLAGS / RTM_NEWLINK) - dev_set_promiscuity - dev_set_allmulti - dev_ifsioc SIOCADDMULTI / SIOCDELMULTI - do_setlink (RTM_SETLINK) Note that some deep hierarchies still do skip the lower updates via: - dev_uc_sync - dev_mc_sync If we do end up hitting user-visible issues, we can add more calls to netif_rx_mode_sync in specific places. But hopefully we should not, the actual user-visible lists are still synced, it's that just HW state that might be lagging. Signed-off-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20260416185712.2155425-3-sdf@fomichev.me Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-04-12net: fix reference tracker mismanagement in netdev_put_lock()Jakub Kicinski1-11/+5
dev_put() releases a reference which didn't have a tracker. References without a tracker are accounted in the tracking code as "no_tracker". We can't free the tracker and then call dev_put(). The references themselves will be fine but the tracking code will think it's a double-release: refcount_t: decrement hit 0; leaking memory. IOW commit under fixes confused dev_put() (release never tracked reference) with __dev_put() (just release the reference, skipping the reference tracking infra). Since __netdev_put_lock() uses dev_put() we can't feed a previously tracked netdev ref into it. Let's flip things around. netdev_put(dev, NULL) is the same as dev_put(dev) so make netdev_put_lock() the real function and have __netdev_put_lock() feed it a NULL tracker for all the cases that were untracked. Fixes: d04686d9bc86 ("net: Implement netdev_nl_queue_create_doit") Acked-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://patch.msgid.link/20260410153600.1984522-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-04-12net: Add net_cookie to Dead loop messagesChris J Arges1-2/+3
Network devices can have the same name within different network namespaces. To help distinguish these devices, add the net_cookie value which can be used to identify the netns. Signed-off-by: Chris J Arges <carges@cloudflare.com> Link: https://patch.msgid.link/20260408191056.1036330-1-carges@cloudflare.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-04-10netkit: Add netkit notifier to check for unregistering devicesDaniel Borkmann1-0/+6
Add a netdevice notifier in netkit to watch for NETDEV_UNREGISTER events. If the target device is indeed NETREG_UNREGISTERING and previously leased a queue to a netkit device, then collect the related netkit devices and batch-unregister_netdevice_many() them. If this were not done, then the netkit device would hold a reference on the physical device preventing it from going away. However, in case of both io_uring zero-copy as well as AF_XDP this situation is handled gracefully and the allocated resources are torn down. In the case where mentioned infra is used through netkit, the applications have a reference on netkit, and netkit in turn holds a reference on the physical device. In order to have netkit release the reference on the physical device, we need such watcher to then unregister the netkit ones. This is generally quite similar to the dependency handling in case of tunnels (e.g. vxlan bound to a underlying netdev) where the tunnel device gets removed along with the physical device. # ip a [...] 4: enp10s0f0np0: <BROADCAST,MULTICAST> mtu 1500 qdisc mq state DOWN group default qlen 1000 link/ether e8:eb:d3:a3:43:f6 brd ff:ff:ff:ff:ff:ff inet 10.0.0.2/24 scope global enp10s0f0np0 valid_lft forever preferred_lft forever [...] 8: nk@NONE: <BROADCAST,MULTICAST,NOARP> mtu 1500 qdisc noop state DOWN group default qlen 1000 link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff [...] # rmmod mlx5_ib # rmmod mlx5_core [...] [ 309.261822] mlx5_core 0000:0a:00.0 mlx5_0: Port: 1 Link DOWN [ 344.235236] mlx5_core 0000:0a:00.1: E-Switch: Unload vfs: mode(LEGACY), nvfs(0), necvfs(0), active vports(0) [ 344.246948] mlx5_core 0000:0a:00.1: E-Switch: Disable: mode(LEGACY), nvfs(0), necvfs(0), active vports(0) [ 344.463754] mlx5_core 0000:0a:00.1: E-Switch: Disable: mode(LEGACY), nvfs(0), necvfs(0), active vports(0) [ 344.770155] mlx5_core 0000:0a:00.1: E-Switch: cleanup [...] # ip a [...] [ both enp10s0f0np0 and nk gone ] [...] Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Co-developed-by: David Wei <dw@davidwei.uk> Signed-off-by: David Wei <dw@davidwei.uk> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://patch.msgid.link/20260402231031.447597-13-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-04-10net: Proxy netif_mp_{open,close}_rxq for leased queuesDavid Wei1-3/+1
When a process in a container wants to setup a memory provider, it will use the virtual netdev and a leased rxq, and call netif_mp_{open,close}_rxq to try and restart the queue. At this point, proxy the queue restart on the real rxq in the physical netdev. For memory providers (io_uring zero-copy rx and devmem), it causes the real rxq in the physical netdev to be filled from a memory provider that has DMA mapped memory from a process within a container. Signed-off-by: David Wei <dw@davidwei.uk> Co-developed-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://patch.msgid.link/20260402231031.447597-7-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-04-10net: Implement netdev_nl_queue_create_doitDaniel Borkmann1-0/+8
Implement netdev_nl_queue_create_doit which creates a new rx queue in a virtual netdev and then leases it to a rx queue in a physical netdev. Example with ynl client: # ynl --family netdev --output-json --do queue-create \ --json '{"ifindex": 8, "type": "rx", "lease": {"ifindex": 4, "queue": {"type": "rx", "id": 15}}}' {'id': 1} Note that the netdevice locking order is always from the virtual to the physical device. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Co-developed-by: David Wei <dw@davidwei.uk> Signed-off-by: David Wei <dw@davidwei.uk> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://patch.msgid.link/20260402231031.447597-3-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-04-08net: pull headers in qdisc_pkt_len_segs_init()Eric Dumazet1-20/+31
Most ndo_start_xmit() methods expects headers of gso packets to be already in skb->head. net/core/tso.c users are particularly at risk, because tso_build_hdr() does a memcpy(hdr, skb->data, hdr_len); qdisc_pkt_len_segs_init() already does a dissection of gso packets. Use pskb_may_pull() instead of skb_header_pointer() to make sure drivers do not have to reimplement this. Some malicious packets could be fed, detect them so that we can drop them sooner with a new SKB_DROP_REASON_SKB_BAD_GSO drop_reason. Fixes: e876f208af18 ("net: Add a software TSO helper API") Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Joe Damato <joe@dama.to> Link: https://patch.msgid.link/20260403221540.3297753-3-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-04-08net: qdisc_pkt_len_segs_init() cleanupEric Dumazet1-31/+31
Reduce indentation level by returning early if the transport header was not set. Add an unlikely() clause as this is not the common case. No functional change. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Joe Damato <joe@dama.to> Link: https://patch.msgid.link/20260403221540.3297753-2-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-04-02Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski1-3/+8
Cross-merge networking fixes after downstream PR (net-7.0-rc7). Conflicts: net/vmw_vsock/af_vsock.c b18c83388874 ("vsock: initialize child_ns_mode_locked in vsock_net_init()") 0de607dc4fd8 ("vsock: add G2H fallback for CIDs not owned by H2G transport") Adjacent changes: drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c ceee35e5674a ("bnxt_en: Refactor some basic ring setup and adjustment logic") 57cdfe0dc70b ("bnxt_en: Resize RSS contexts on channel count change") drivers/net/wireless/intel/iwlwifi/mld/mac80211.c 4d56037a02bd ("wifi: iwlwifi: mld: block EMLSR during TDLS connections") 687a95d204e7 ("wifi: iwlwifi: mld: correctly set wifi generation data") drivers/net/wireless/intel/iwlwifi/mld/scan.h b6045c899e37 ("wifi: iwlwifi: mld: Refactor scan command handling") ec66ec6a5a8f ("wifi: iwlwifi: mld: Fix MLO scan timing") drivers/net/wireless/intel/iwlwifi/mvm/fw.c 078df640ef05 ("wifi: iwlwifi: mld: add support for iwl_mcc_allowed_ap_type_cmd v 2") 323156c3541e ("wifi: iwlwifi: mvm: don't send a 6E related command when not supported") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-03-31net: use skb_header_pointer() for TCPv4 GSO frag_off checkGuoyu Su1-3/+8
Syzbot reported a KMSAN uninit-value warning in gso_features_check() called from netif_skb_features() [1]. gso_features_check() reads iph->frag_off to decide whether to clear mangleid_features. Accessing the IPv4 header via ip_hdr()/inner_ip_hdr() can rely on skb header offsets that are not always safe for direct dereference on packets injected from PF_PACKET paths. Use skb_header_pointer() for the TCPv4 frag_off check so the header read is robust whether data is already linear or needs copying. [1] https://syzkaller.appspot.com/bug?extid=1543a7d954d9c6d00407 Link: https://lore.kernel.org/netdev/willemdebruijn.kernel.1a9f35039caab@gmail.com/ Fixes: cbc53e08a793 ("GSO: Add GSO type for fixed IPv4 ID") Reported-by: syzbot+1543a7d954d9c6d00407@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=1543a7d954d9c6d00407 Tested-by: syzbot+1543a7d954d9c6d00407@syzkaller.appspotmail.com Signed-off-by: Guoyu Su <yss2813483011xxl@gmail.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20260327153507.39742-1-yss2813483011xxl@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-03-29net: remove EXPORT_IPV6_MOD() and EXPORT_IPV6_MOD_GPL() macrosFernando Fernandez Mancera1-3/+0
As IPv6 is built-in only, the macro is always evaluating to an empty one. Remove it completely from the code. Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de> Link: https://patch.msgid.link/20260325120928.15848-3-fmancera@suse.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-03-26Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski1-5/+17
Cross-merge networking fixes after downstream PR (net-7.0-rc6). No conflicts, or adjacent changes. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-03-24net: correctly handle tunneled traffic on IPV6_CSUM GSO fallbackWillem de Bruijn1-5/+17
NETIF_F_IPV6_CSUM only advertises support for checksum offload of packets without IPv6 extension headers. Packets with extension headers must fall back onto software checksumming. Since TSO depends on checksum offload, those must revert to GSO. The below commit introduces that fallback. It always checks network header length. For tunneled packets, the inner header length must be checked instead. Extend the check accordingly. A special case is tunneled packets without inner IP protocol. Such as RFC 6951 SCTP in UDP. Those are not standard IPv6 followed by transport header either, so also must revert to the software GSO path. Cc: stable@vger.kernel.org Fixes: 864e3396976e ("net: gso: Forbid IPv6 TSO with extensions on devices with only IPV6_CSUM") Reported-by: Tangxin Xie <xietangxin@yeah.net> Closes: https://lore.kernel.org/netdev/0414e7e2-9a1c-4d7c-a99d-b9039cf68f40@yeah.net/ Suggested-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20260320190148.2409107-1-willemdebruijn.kernel@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-03-14net: plumb drop reasons to __dev_queue_xmit()Eric Dumazet1-40/+43
Add drop reasons to __dev_queue_xmit(): - SKB_DROP_REASON_DEV_READY : device is not UP. - SKB_DROP_REASON_RECURSION_LIMIT : recursion limit on virtual device is hit. Also add an unlikely() for the SKB_DROP_REASON_DEV_READY case, and reduce indentation level. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Joe Damato <joe@dama.to> Link: https://patch.msgid.link/20260312201824.203093-3-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-03-10net: export netif_open for self_test usageMike Marciniszyn (Meta)1-0/+1
dev_open() already is exported, but drivers which use the netdev instance lock need to use netif_open() instead. netif_close() is also already exported [1] so this completes the pairing. This export is required for the following fbnic self tests to avoid calling ndo_stop() and ndo_open() in favor of the more appropriate netif_open() and netif_close() that notifies any listeners that the interface went down to test and is now coming back up. Link: https://patch.msgid.link/20250309215851.2003708-1-sdf@fomichev.me [1] Signed-off-by: Mike Marciniszyn (Meta) <mike.marciniszyn@gmail.com> Link: https://patch.msgid.link/20260307105847.1438-2-mike.marciniszyn@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-03-10net/sched: do not reset queues in graft operationsEric Dumazet1-1/+1
Following typical script is extremely disruptive, because each graft operation calls dev_deactivate() which resets all the queues of the device. QPARAM="limit 100000 flow_limit 1000 buckets 4096" TXQS=64 for ETH in eth1 do tc qd del dev $ETH root 2>/dev/null tc qd add dev $ETH root handle 1: mq for i in `seq 1 $TXQS` do slot=$( printf %x $(( i )) ) tc qd add dev $ETH parent 1:$slot fq $QPARAM done done One can add "ip link set dev $ETH down/up" to reduce the disruption time: QPARAM="limit 100000 flow_limit 1000 buckets 4096" TXQS=64 for ETH in eth1 do ip link set dev $ETH down tc qd del dev $ETH root 2>/dev/null tc qd add dev $ETH root handle 1: mq for i in `seq 1 $TXQS` do slot=$( printf %x $(( i )) ) tc qd add dev $ETH parent 1:$slot fq $QPARAM done ip link set dev $ETH up done Or we can add a @reset_needed flag to dev_deactivate() and dev_deactivate_many(). This flag is set to true at device dismantle or linkwatch_do_dev(), and to false for graft operations. In the future, we might only stop one queue instead of the whole device, ie call dev_deactivate_queue() instead of dev_deactivate(). I think the problem (quadratic behavior) was added in commit 2fb541c862c9 ("net: sch_generic: aviod concurrent reset and enqueue op for lockless qdisc") but this does not look serious enough to deserve risky backports. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Yunsheng Lin <linyunsheng@huawei.com> Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com> Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Reviewed-by: Victor Nogueira <victor@mojatatu.com> Link: https://patch.msgid.link/20260307163430.470644-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-03-05Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski1-11/+13
Cross-merge networking fixes after downstream PR (net-7.0-rc3). No conflicts. Adjacent changes: net/netfilter/nft_set_rbtree.c fb7fb4016300 ("netfilter: nf_tables: clone set on flush only") 3aea466a4399 ("netfilter: nft_set_rbtree: don't disable bh when acquiring tree lock") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-03-05net: Provide a PREEMPT_RT specific check for netdev_queue::_xmit_lockSebastian Andrzej Siewior1-4/+1
After acquiring netdev_queue::_xmit_lock the number of the CPU owning the lock is recorded in netdev_queue::xmit_lock_owner. This works as long as the BH context is not preemptible. On PREEMPT_RT the softirq context is preemptible and without the softirq-lock it is possible to have multiple user in __dev_queue_xmit() submitting a skb on the same CPU. This is fine in general but this means also that the current CPU is recorded as netdev_queue::xmit_lock_owner. This in turn leads to the recursion alert and the skb is dropped. Instead checking the for CPU number, that owns the lock, PREEMPT_RT can check if the lockowner matches the current task. Add netif_tx_owned() which returns true if the current context owns the lock by comparing the provided CPU number with the recorded number. This resembles the current check by negating the condition (the current check returns true if the lock is not owned). On PREEMPT_RT use rt_mutex_owner() to return the lock owner and compare the current task against it. Use the new helper in __dev_queue_xmit() and netif_local_xmit_active() which provides a similar check. Update comments regarding pairing READ_ONCE(). Reported-by: Bert Karwatzki <spasswolf@web.de> Closes: https://lore.kernel.org/all/20260216134333.412332-1-spasswolf@web.de Fixes: 3253cb49cbad4 ("softirq: Allow to drop the softirq-BKL lock on PREEMPT_RT") Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Reported-by: Bert Karwatzki <spasswolf@web.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Link: https://patch.msgid.link/20260302162631.uGUyIqDT@linutronix.de Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-03-05net: devmem: use READ_ONCE/WRITE_ONCE on binding->devBobby Eshleman1-1/+1
binding->dev is protected on the write-side in mp_dmabuf_devmem_uninstall() against concurrent writes, but due to the concurrent bare reads in net_devmem_get_binding() and validate_xmit_unreadable_skb() it should be wrapped in a READ_ONCE/WRITE_ONCE pair to make sure no compiler optimizations play with the underlying register in unforeseen ways. Doesn't present a critical bug because the known compiler optimizations don't result in bad behavior. There is no tearing on u64, and load omissions/invented loads would only break if additional binding->dev references were inlined together (they aren't right now). This just more strictly follows the linux memory model (i.e., "Lock-Protected Writes With Lockless Reads" in tools/memory-model/Documentation/access-marking.txt). Fixes: bd61848900bf ("net: devmem: Implement TX path") Signed-off-by: Bobby Eshleman <bobbyeshleman@meta.com> Link: https://patch.msgid.link/20260302-devmem-membar-fix-v2-1-5b33c9cbc28b@meta.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-03-05net-sysfs: use rps_tag_ptr and remove metadata from rps_dev_flow_tableEric Dumazet1-22/+31
Instead of storing the @log at the beginning of rps_dev_flow_table use 5 low order bits of the rps_tag_ptr to store the log of the size. This removes a potential cache line miss (for light traffic). This allows us to switch to one high-order allocation instead of vmalloc() when CONFIG_RFS_ACCEL is not set. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20260302181432.1836150-8-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-03-05net-sysfs: use rps_tag_ptr and remove metadata from rps_sock_flow_tableEric Dumazet1-5/+7
Instead of storing the @mask at the beginning of rps_sock_flow_table, use 5 low order bits of the rps_tag_ptr to store the log of the size. This removes a potential cache line miss to fetch @mask. More importantly, we can switch to vmalloc_huge() without wasting memory. Tested with: numactl --interleave=all bash -c "echo 4194304 >/proc/sys/net/core/rps_sock_flow_entries" Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20260302181432.1836150-5-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-03-05net-sysfs: add rps_sock_flow_table_mask() helperEric Dumazet1-1/+3
In preparation of the following patch, abstract access to the @mask field in 'struct rps_sock_flow_table'. Also cleanup rps_sock_flow_sysctl() a bit : - Rename orig_sock_table to o_sock_table. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20260302181432.1836150-4-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-03-04net: core: allow netdev_upper_get_next_dev_rcu from bh contextKohei Enju1-1/+2
Since XDP programs are called from a NAPI poll context, the RCU reference liveness is ensured by local_bh_disable(). Commit aeea1b86f936 ("bpf, devmap: Exclude XDP broadcast to master device") started to call netdev_upper_get_next_dev_rcu() from this context, but missed adding rcu_read_lock_bh_held() as a condition to the RCU checks. While both bh_disabled and rcu_read_lock() provide RCU protection, lockdep complains since the check condition is insufficient [1]. Add rcu_read_lock_bh_held() as condition to help lockdep to understand the dereference is safe, in the same way as commit 694cea395fde ("bpf: Allow RCU-protected lookups to happen from bh context"). [1] WARNING: net/core/dev.c:8099 at netdev_upper_get_next_dev_rcu+0x96/0xd0, CPU#0: swapper/0/0 ... RIP: 0010:netdev_upper_get_next_dev_rcu+0x96/0xd0 ... <IRQ> dev_map_enqueue_multi+0x411/0x970 xdp_do_redirect+0xdf2/0x1030 __igc_xdp_run_prog+0x6a0/0xc80 igc_poll+0x34b0/0x70b0 __napi_poll.constprop.0+0x98/0x490 net_rx_action+0x8f2/0xfa0 handle_softirqs+0x1c7/0x710 __irq_exit_rcu+0xb1/0xf0 irq_exit_rcu+0x9/0x20 common_interrupt+0x7f/0x90 </IRQ> Signed-off-by: Kohei Enju <kohei@enjuk.jp> Acked-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://patch.msgid.link/20260220110922.94781-1-kohei@enjuk.jp Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-03-03net: Fix rcu_tasks stall in threaded busypollYiFei Zhu1-6/+11
I was debugging a NIC driver when I noticed that when I enable threaded busypoll, bpftrace hangs when starting up. dmesg showed: rcu_tasks_wait_gp: rcu_tasks grace period number 85 (since boot) is 10658 jiffies old. rcu_tasks_wait_gp: rcu_tasks grace period number 85 (since boot) is 40793 jiffies old. rcu_tasks_wait_gp: rcu_tasks grace period number 85 (since boot) is 131273 jiffies old. rcu_tasks_wait_gp: rcu_tasks grace period number 85 (since boot) is 402058 jiffies old. INFO: rcu_tasks detected stalls on tasks: 00000000769f52cd: .N nvcsw: 2/2 holdout: 1 idle_cpu: -1/64 task:napi/eth2-8265 state:R running task stack:0 pid:48300 tgid:48300 ppid:2 task_flags:0x208040 flags:0x00004000 Call Trace: <TASK> ? napi_threaded_poll_loop+0x27c/0x2c0 ? __pfx_napi_threaded_poll+0x10/0x10 ? napi_threaded_poll+0x26/0x80 ? kthread+0xfa/0x240 ? __pfx_kthread+0x10/0x10 ? ret_from_fork+0x31/0x50 ? __pfx_kthread+0x10/0x10 ? ret_from_fork_asm+0x1a/0x30 </TASK> The cause is that in threaded busypoll, the main loop is in napi_threaded_poll rather than napi_threaded_poll_loop, where the latter rarely iterates more than once within its loop. For rcu_softirq_qs_periodic inside napi_threaded_poll_loop to report its qs state, the last_qs must be 100ms behind, and this can't happen because napi_threaded_poll_loop rarely iterates in threaded busypoll, and each time napi_threaded_poll_loop is called last_qs is reset to latest jiffies. This patch changes so that in threaded busypoll, last_qs is saved in the outer napi_threaded_poll, and whether busy_poll_last_qs is NULL indicates whether napi_threaded_poll_loop is called for busypoll. This way last_qs would not reset to latest jiffies on each invocation of napi_threaded_poll_loop. Fixes: c18d4b190a46 ("net: Extend NAPI threaded polling to allow kthread based busy polling") Cc: stable@vger.kernel.org Signed-off-by: YiFei Zhu <zhuyifei@google.com> Reviewed-by: Samiullah Khawaja <skhawaja@google.com> Link: https://patch.msgid.link/20260227221937.1060857-1-zhuyifei@google.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-03-01net: sched: introduce qdisc-specific drop reason tracingJesper Dangaard Brouer1-4/+4
Create new enum qdisc_drop_reason and trace_qdisc_drop tracepoint for qdisc layer drop diagnostics with direct qdisc context visibility. The new tracepoint includes qdisc handle, parent, kind (name), and device information. Existing SKB_DROP_REASON_QDISC_DROP is retained for backwards compatibility via kfree_skb_reason(). Convert qdiscs with drop reasons to use the new infrastructure. Change CAKE's cobalt_should_drop() return type from enum skb_drop_reason to enum qdisc_drop_reason to fix implicit enum conversion warnings. Use QDISC_DROP_UNSPEC as the 'not dropped' sentinel instead of SKB_NOT_DROPPED_YET. Both have the same compiled value (0), so the comparison logic remains semantically equivalent. Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org> Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Link: https://patch.msgid.link/177211345275.3011628.1974310302645218067.stgit@firesoul Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-02-26Merge tag 'net-7.0-rc2' of ↵Linus Torvalds1-12/+23
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Paolo Abeni: "Including fixes from IPsec, Bluetooth and netfilter Current release - regressions: - wifi: fix dev_alloc_name() return value check - rds: fix recursive lock in rds_tcp_conn_slots_available Current release - new code bugs: - vsock: lock down child_ns_mode as write-once Previous releases - regressions: - core: - do not pass flow_id to set_rps_cpu() - consume xmit errors of GSO frames - netconsole: avoid OOB reads, msg is not nul-terminated - netfilter: h323: fix OOB read in decode_choice() - tcp: re-enable acceptance of FIN packets when RWIN is 0 - udplite: fix null-ptr-deref in __udp_enqueue_schedule_skb(). - wifi: brcmfmac: fix potential kernel oops when probe fails - phy: register phy led_triggers during probe to avoid AB-BA deadlock - eth: - bnxt_en: fix deleting of Ntuple filters - wan: farsync: fix use-after-free bugs caused by unfinished tasklets - xscale: check for PTP support properly Previous releases - always broken: - tcp: fix potential race in tcp_v6_syn_recv_sock() - kcm: fix zero-frag skb in frag_list on partial sendmsg error - xfrm: - fix race condition in espintcp_close() - always flush state and policy upon NETDEV_UNREGISTER event - bluetooth: - purge error queues in socket destructors - fix response to L2CAP_ECRED_CONN_REQ - eth: - mlx5: - fix circular locking dependency in dump - fix "scheduling while atomic" in IPsec MAC address query - gve: fix incorrect buffer cleanup for QPL - team: avoid NETDEV_CHANGEMTU event when unregistering slave - usb: validate USB endpoints" * tag 'net-7.0-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (72 commits) netfilter: nf_conntrack_h323: fix OOB read in decode_choice() dpaa2-switch: validate num_ifs to prevent out-of-bounds write net: consume xmit errors of GSO frames vsock: document write-once behavior of the child_ns_mode sysctl vsock: lock down child_ns_mode as write-once selftests/vsock: change tests to respect write-once child ns mode net/mlx5e: Fix "scheduling while atomic" in IPsec MAC address query net/mlx5: Fix missing devlink lock in SRIOV enable error path net/mlx5: E-switch, Clear legacy flag when moving to switchdev net/mlx5: LAG, disable MPESW in lag_disable_change() net/mlx5: DR, Fix circular locking dependency in dump selftests: team: Add a reference count leak test team: avoid NETDEV_CHANGEMTU event when unregistering slave net: mana: Fix double destroy_workqueue on service rescan PCI path MAINTAINERS: Update maintainer entry for QUALCOMM ETHQOS ETHERNET DRIVER dpll: zl3073x: Remove redundant cleanup in devm_dpll_init() selftests/net: packetdrill: Verify acceptance of FIN packets when RWIN is 0 tcp: re-enable acceptance of FIN packets when RWIN is 0 vsock: Use container_of() to get net namespace in sysctl handlers net: usb: kaweth: validate USB endpoints ...
2026-02-26net: consume xmit errors of GSO framesJakub Kicinski1-5/+18
udpgro_frglist.sh and udpgro_bench.sh are the flakiest tests currently in NIPA. They fail in the same exact way, TCP GRO test stalls occasionally and the test gets killed after 10min. These tests use veth to simulate GRO. They attach a trivial ("return XDP_PASS;") XDP program to the veth to force TSO off and NAPI on. Digging into the failure mode we can see that the connection is completely stuck after a burst of drops. The sender's snd_nxt is at sequence number N [1], but the receiver claims to have received (rcv_nxt) up to N + 3 * MSS [2]. Last piece of the puzzle is that senders rtx queue is not empty (let's say the block in the rtx queue is at sequence number N - 4 * MSS [3]). In this state, sender sends a retransmission from the rtx queue with a single segment, and sequence numbers N-4*MSS:N-3*MSS [3]. Receiver sees it and responds with an ACK all the way up to N + 3 * MSS [2]. But sender will reject this ack as TCP_ACK_UNSENT_DATA because it has no recollection of ever sending data that far out [1]. And we are stuck. The root cause is the mess of the xmit return codes. veth returns an error when it can't xmit a frame. We end up with a loss event like this: ------------------------------------------------- | GSO super frame 1 | GSO super frame 2 | |-----------------------------------------------| | seg | seg | seg | seg | seg | seg | seg | seg | | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | ------------------------------------------------- x ok ok <ok>| ok ok ok <x> \\ snd_nxt "x" means packet lost by veth, and "ok" means it went thru. Since veth has TSO disabled in this test it sees individual segments. Segment 1 is on the retransmit queue and will be resent. So why did the sender not advance snd_nxt even tho it clearly did send up to seg 8? tcp_write_xmit() interprets the return code from the core to mean that data has not been sent at all. Since TCP deals with GSO super frames, not individual segment the crux of the problem is that loss of a single segment can be interpreted as loss of all. TCP only sees the last return code for the last segment of the GSO frame (in <> brackets in the diagram above). Of course for the problem to occur we need a setup or a device without a Qdisc. Otherwise Qdisc layer disconnects the protocol layer from the device errors completely. We have multiple ways to fix this. 1) make veth not return an error when it lost a packet. While this is what I think we did in the past, the issue keeps reappearing and it's annoying to debug. The game of whack a mole is not great. 2) fix the damn return codes We only talk about NETDEV_TX_OK and NETDEV_TX_BUSY in the documentation, so maybe we should make the return code from ndo_start_xmit() a boolean. I like that the most, but perhaps some ancient, not-really-networking protocol would suffer. 3) make TCP ignore the errors It is not entirely clear to me what benefit TCP gets from interpreting the result of ip_queue_xmit()? Specifically once the connection is established and we're pushing data - packet loss is just packet loss? 4) this fix Ignore the rc in the Qdisc-less+GSO case, since it's unreliable. We already always return OK in the TCQ_F_CAN_BYPASS case. In the Qdisc-less case let's be a bit more conservative and only mask the GSO errors. This path is taken by non-IP-"networks" like CAN, MCTP etc, so we could regress some ancient thing. This is the simplest, but also maybe the hackiest fix? Similar fix has been proposed by Eric in the past but never committed because original reporter was working with an OOT driver and wasn't providing feedback (see Link). Link: https://lore.kernel.org/CANn89iJcLepEin7EtBETrZ36bjoD9LrR=k4cfwWh046GB+4f9A@mail.gmail.com Fixes: 1f59533f9ca5 ("qdisc: validate frames going through the direct_xmit path") Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260223235100.108939-1-kuba@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-02-24net: do not pass flow_id to set_rps_cpu()Eric Dumazet1-7/+5
Blamed commit made the assumption that the RPS table for each receive queue would have the same size, and that it would not change. Compute flow_id in set_rps_cpu(), do not assume we can use the value computed by get_rps_cpu(). Otherwise we risk out-of-bound access and/or crashes. Fixes: 48aa30443e52 ("net: Cache hash and flow_id to avoid recalculation") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Krishna Kumar <krikku@gmail.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20260220222605.3468081-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-02-22Convert more 'alloc_obj' cases to default GFP_KERNEL argumentsLinus Torvalds1-2/+1
This converts some of the visually simpler cases that have been split over multiple lines. I only did the ones that are easy to verify the resulting diff by having just that final GFP_KERNEL argument on the next line. Somebody should probably do a proper coccinelle script for this, but for me the trivial script actually resulted in an assertion failure in the middle of the script. I probably had made it a bit _too_ trivial. So after fighting that far a while I decided to just do some of the syntactically simpler cases with variations of the previous 'sed' scripts. The more syntactically complex multi-line cases would mostly really want whitespace cleanup anyway. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2026-02-22Convert 'alloc_flex' family to use the new default GFP_KERNEL argumentLinus Torvalds1-1/+1
This is the exact same thing as the 'alloc_obj()' version, only much smaller because there are a lot fewer users of the *alloc_flex() interface. As with alloc_obj() version, this was done entirely with mindless brute force, using the same script, except using 'flex' in the pattern rather than 'objs*'. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2026-02-22Convert 'alloc_obj' family to use the new default GFP_KERNEL argumentLinus Torvalds1-4/+4
This was done entirely with mindless brute force, using git grep -l '\<k[vmz]*alloc_objs*(.*, GFP_KERNEL)' | xargs sed -i 's/\(alloc_objs*(.*\), GFP_KERNEL)/\1)/' to convert the new alloc_obj() users that had a simple GFP_KERNEL argument to just drop that argument. Note that due to the extreme simplicity of the scripting, any slightly more complex cases spread over multiple lines would not be triggered: they definitely exist, but this covers the vast bulk of the cases, and the resulting diff is also then easier to check automatically. For the same reason the 'flex' versions will be done as a separate conversion. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2026-02-21treewide: Replace kmalloc with kmalloc_obj for non-scalar typesKees Cook1-13/+12
This is the result of running the Coccinelle script from scripts/coccinelle/api/kmalloc_objs.cocci. The script is designed to avoid scalar types (which need careful case-by-case checking), and instead replace kmalloc-family calls that allocate struct or union object instances: Single allocations: kmalloc(sizeof(TYPE), ...) are replaced with: kmalloc_obj(TYPE, ...) Array allocations: kmalloc_array(COUNT, sizeof(TYPE), ...) are replaced with: kmalloc_objs(TYPE, COUNT, ...) Flex array allocations: kmalloc(struct_size(PTR, FAM, COUNT), ...) are replaced with: kmalloc_flex(*PTR, FAM, COUNT, ...) (where TYPE may also be *VAR) The resulting allocations no longer return "void *", instead returning "TYPE *". Signed-off-by: Kees Cook <kees@kernel.org>
2026-02-17net: remove WARN_ON_ONCE when accessing forward path arrayPablo Neira Ayuso1-1/+1
Although unlikely, recent support for IPIP tunnels increases chances of reaching this WARN_ON_ONCE if userspace manages to build a sufficiently long forward path. Remove it. Fixes: ddb94eafab8b ("net: resolve forwarding path from virtual netdevice and HW destination address") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Florian Westphal <fw@strlen.de>
2026-02-17net: fix backlog_unlock_irq_restore() vs CONFIG_PREEMPT_RTEric Dumazet1-5/+12
CONFIG_PREEMPT_RT is special, make this clear in backlog_lock_irq_save() and backlog_unlock_irq_restore(). The issue shows up with CONFIG_DEBUG_IRQFLAGS=y raw_local_irq_restore() called with IRQs enabled WARNING: kernel/locking/irqflag-debug.c:10 at warn_bogus_irq_restore+0xc/0x20 kernel/locking/irqflag-debug.c:10, CPU#1: aoe_tx0/1321 Modules linked in: CPU: 1 UID: 0 PID: 1321 Comm: aoe_tx0 Not tainted syzkaller #0 PREEMPT_{RT,(full)} Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/24/2026 RIP: 0010:warn_bogus_irq_restore+0xc/0x20 kernel/locking/irqflag-debug.c:10 Call Trace: <TASK> backlog_unlock_irq_restore net/core/dev.c:253 [inline] enqueue_to_backlog+0x525/0xcf0 net/core/dev.c:5347 netif_rx_internal+0x120/0x550 net/core/dev.c:5659 __netif_rx+0xa9/0x110 net/core/dev.c:5679 loopback_xmit+0x43a/0x660 drivers/net/loopback.c:90 __netdev_start_xmit include/linux/netdevice.h:5275 [inline] netdev_start_xmit include/linux/netdevice.h:5284 [inline] xmit_one net/core/dev.c:3864 [inline] dev_hard_start_xmit+0x2df/0x830 net/core/dev.c:3880 __dev_queue_xmit+0x16f4/0x3990 net/core/dev.c:4829 dev_queue_xmit include/linux/netdevice.h:3384 [inline] Fixes: 27a01c1969a5 ("net: fully inline backlog_unlock_irq_restore()") Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260213120427.2914544-1-edumazet@google.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-02-07net/ipv6: Remove jumbo_remove step from TX pathAlice Mikityanska1-4/+2
Now that the kernel doesn't insert HBH for BIG TCP IPv6 packets, remove unnecessary steps from the GSO TX path, that used to check and remove HBH. Signed-off-by: Alice Mikityanska <alice@isovalent.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260205133925.526371-5-alice.kernel@fastmail.im Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-23net: use netdev_queue_config() for mp restartJakub Kicinski1-17/+0
We should follow the prepare/commit approach for queue configuration. The qcfg struct should be added to dev->cfg rather than directly to queue objects so that we can clone and discard the pending config easily. Remove the qcfg in struct netdev_rx_queue, and switch remaining callers to netdev_queue_config(). netdev_queue_config() will construct the qcfg on the fly based on device defaults and state of the queue. ndo_default_qcfg becomes optional because having the callback itself does not have any meaningful semantics to us. Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com> Link: https://patch.msgid.link/20260122005113.2476634-5-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-23net: introduce mangleid_featuresPaolo Abeni1-1/+4
Some/most devices implementing gso_partial need to disable the GSO partial features when the IP ID can't be mangled; to that extend each of them implements something alike the following[1]: if (skb->encapsulation && !(features & NETIF_F_TSO_MANGLEID)) features &= ~NETIF_F_TSO; in the ndo_features_check() op, which leads to a bit of duplicate code. Later patch in the series will implement GSO partial support for virtual devices, and the current status quo will require more duplicate code and a new indirect call in the TX path for them. Introduce the mangleid_features mask, allowing the core to disable NIC features based on/requiring MANGLEID, without any further intervention from the driver. The same functionality could be alternatively implemented adding a single boolean flag to the struct net_device, but would require an additional checks in ndo_features_check(). Also note that [1] is incorrect if the NIC additionally implements NETIF_F_GSO_UDP_L4, mangleid_features transparently handle even such a case. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/5a7cdaeea40b0a29b88e525b6c942d73ed3b8ce7.1769011015.git.pabeni@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-21Merge tag 'net-queue-rx-buf-len-v9' of https://github.com/isilence/linuxJakub Kicinski1-0/+17
Pavel Begunkov says: ==================== Add support for providers with large rx buffer Many modern NICs support configurable receive buffer lengths, and zcrx and memory providers can use buffers larger than 4K to improve performance. When paired with hw-gro larger rx buffer sizes can drastically reduce the number of buffers traversing the stack and save a lot of processing time. It also allows to give to users larger contiguous chunks of data. Single stream benchmarks showed up to ~30% CPU util improvement. E.g. comparison for 4K vs 32K buffers using a 200Gbit NIC: packets=23987040 (MB=2745098), rps=199559 (MB/s=22837) CPU %usr %nice %sys %iowait %irq %soft %idle 0 1.53 0.00 27.78 2.72 1.31 66.45 0.22 packets=24078368 (MB=2755550), rps=200319 (MB/s=22924) CPU %usr %nice %sys %iowait %irq %soft %idle 0 0.69 0.00 8.26 31.65 1.83 57.00 0.57 This series adds net infrastructure for memory providers configuring the size and implements it for bnxt. It's an opt-in feature for drivers, they should advertise support for the parameter in the qops and must check if the hardware supports the given size. It's limited to memory providers as it drastically simplifies implementation. It doesn't affect the fast path zcrx uAPI, and the user exposed parameter is defined in zcrx terms, which allows it to be flexible and adjusted in the future. A liburing example can be found at [2] full branch: [1] https://github.com/isilence/linux.git zcrx/large-buffers-v8 Liburing example: [2] https://github.com/isilence/liburing.git zcrx/rx-buf-len * tag 'net-queue-rx-buf-len-v9' of https://github.com/isilence/linux: io_uring/zcrx: document area chunking parameter selftests: iou-zcrx: test large chunk sizes eth: bnxt: support qcfg provided rx page size eth: bnxt: adjust the fill level of agg queues with larger buffers eth: bnxt: store rx buffer size per queue net: pass queue rx page size from memory provider net: add bare bone queue configs net: reduce indent of struct netdev_queue_mgmt_ops members net: memzero mp params when closing a queue ==================== Link: https://patch.msgid.link/ Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-21Revert "Merge branch 'netkit-support-for-io_uring-zero-copy-and-af_xdp'"Jakub Kicinski1-7/+0
This reverts commit 77b9c4a438fc66e2ab004c411056b3fb71a54f2c, reversing changes made to 4515ec4ad58a37e70a9e1256c0b993958c9b7497: 931420a2fc36 ("selftests/net: Add netkit container tests") ab771c938d9a ("selftests/net: Make NetDrvContEnv support queue leasing") 6be87fbb2776 ("selftests/net: Add env for container based tests") 61d99ce3dfc2 ("selftests/net: Add bpf skb forwarding program") 920da3634194 ("netkit: Add xsk support for af_xdp applications") eef51113f8af ("netkit: Add netkit notifier to check for unregistering devices") b5ef109d22d4 ("netkit: Implement rtnl_link_ops->alloc and ndo_queue_create") b5c3fa4a0b16 ("netkit: Add single device mode for netkit") 0073d2fd679d ("xsk: Proxy pool management for leased queues") 1ecea95dd3b5 ("xsk: Extend xsk_rcv_check validation") 804bf334d08a ("net: Proxy netdev_queue_get_dma_dev for leased queues") 0caa9a8ddec3 ("net: Proxy net_mp_{open,close}_rxq for leased queues") ff8889ff9107 ("net, ethtool: Disallow leased real rxqs to be resized") 9e2103f36110 ("net: Add lease info to queue-get response") 31127deddef4 ("net: Implement netdev_nl_queue_create_doit") a5546e18f77c ("net: Add queue-create operation") The series will conflict with io_uring work, and the code needs more polish. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-20net: Implement netdev_nl_queue_create_doitDaniel Borkmann1-0/+7
Implement netdev_nl_queue_create_doit which creates a new rx queue in a virtual netdev and then leases it to a rx queue in a physical netdev. Example with ynl client: # ./pyynl/cli.py \ --spec ~/netlink/specs/netdev.yaml \ --do queue-create \ --json '{"ifindex": 8, "type": "rx", "lease": {"ifindex": 4, "queue": {"type": "rx", "id": 15}}}' {'id': 1} Note that the netdevice locking order is always from the virtual to the physical device. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Co-developed-by: David Wei <dw@davidwei.uk> Signed-off-by: David Wei <dw@davidwei.uk> Acked-by: Stanislav Fomichev <sdf@fomichev.me> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://patch.msgid.link/20260115082603.219152-3-daniel@iogearbox.net Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-01-16Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski1-9/+22
Cross-merge networking fixes after downstream PR (net-6.19-rc6). No conflicts, or adjacent changes. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-14net: add bare bone queue configsPavel Begunkov1-0/+17
We'll need to pass extra parameters when allocating a queue for memory providers. Define a new structure for queue configurations, and pass it to qapi callbacks. It's empty for now, actual parameters will be added in following patches. Configurations should persist across resets, and for that they're default-initialised on device registration and stored in struct netdev_rx_queue. We also add a new qapi callback for defaulting a given config. It must be implemented if a driver wants to use queue configs and is optional otherwise. Suggested-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
2026-01-13net: add net.core.qdisc_max_burstEric Dumazet1-3/+3
In blamed commit, I added a check against the temporary queue built in __dev_xmit_skb(). Idea was to drop packets early, before any spinlock was acquired. if (unlikely(defer_count > READ_ONCE(q->limit))) { kfree_skb_reason(skb, SKB_DROP_REASON_QDISC_DROP); return NET_XMIT_DROP; } It turned out that HTB Qdisc has a zero q->limit. HTB limits packets on a per-class basis. Some of our tests became flaky. Add a new sysctl : net.core.qdisc_max_burst to control how many packets can be stored in the temporary lockless queue. Also add a new QDISC_BURST_DROP drop reason to better diagnose future issues. Thanks Neal ! Fixes: 100dfa74cad9 ("net: dev_queue_xmit() llist adoption") Reported-and-bisected-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Neal Cardwell <ncardwell@google.com> Link: https://patch.msgid.link/20260107104159.3669285-1-edumazet@google.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>