Age | Commit message | Author | Files | Lines
2026-01-20 | netfilter: xt_tcpmss: check remaining length before reading optlen | Florian Westphal | 1 | -1/+1

Quoting reporter:

  In net/netfilter/xt_tcpmss.c (lines 53-68), the TCP option parser reads op[i+1] directly without validating the remaining option length. If the last byte of the option field is not EOL/NOP (0/1), the code attempts to index op[i+1]. In the case where i + 1 == optlen, this causes an out-of-bounds read, accessing memory past the optlen boundary (either reading beyond the stack buffer _opt or the following payload).

Reported-by: sungzii <sungzii@pm.me>
Signed-off-by: Florian Westphal <fw@strlen.de>
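The length check in this report can be illustrated outside the kernel. The following is a minimal Python sketch, not the kernel code: `find_mss` is a hypothetical helper that walks a raw TCP options field and refuses to read a length byte that would lie past the end of the buffer. Stopping at EOL is a simplification chosen for this sketch and is not taken from xt_tcpmss.c.

```python
TCPOPT_EOL = 0   # end of option list
TCPOPT_NOP = 1   # one-byte padding
TCPOPT_MSS = 2   # maximum segment size
TCPOLEN_MSS = 4  # kind + length + 16-bit value


def find_mss(options: bytes):
    """Return the MSS from a raw TCP options field, or None if absent or malformed."""
    optlen = len(options)
    i = 0
    while i < optlen:
        kind = options[i]
        if kind == TCPOPT_EOL:
            break
        if kind == TCPOPT_NOP:
            i += 1
            continue
        # Every other option carries a length byte at i + 1. Make sure that
        # byte is still inside the buffer before reading it.
        if i + 1 >= optlen:
            return None
        length = options[i + 1]
        # A length below 2, or one that runs past the buffer, is malformed.
        if length < 2 or i + length > optlen:
            return None
        if kind == TCPOPT_MSS and length == TCPOLEN_MSS:
            return int.from_bytes(options[i + 2:i + 4], "big")
        i += length
    return None


print(find_mss(bytes([2, 4, 0x05, 0xB4])))  # well-formed MSS option -> 1460
print(find_mss(bytes([1, 1, 2])))           # kind byte is the last byte -> None
```

The second call is the case from the report: the final byte is neither EOL nor NOP, so there is no length byte left to read.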
2026-01-20 | netfilter: nf_conncount: fix tracking of connections from localhost | Fernando Fernandez Mancera | 1 | -2/+13

Since commit be102eb6a0e7 ("netfilter: nf_conncount: rework API to use sk_buff directly"), we skip adding the connection and only trigger a GC when the ct is confirmed. For local-to-local connections this does not work, because the connection is confirmed on POSTROUTING and therefore tracking on the INPUT hook is always skipped.

To fix this, check whether the skb input ifindex is set to the loopback ifindex. If it is, fall back to a GC plus track operation, skipping the optimization. This fallback is necessary to avoid duplicated tracking of a packet train, e.g. 10 UDP datagrams sent in a burst when initiating the connection.

Tested with xt_connlimit/nft_connlimit and OVS limit, with an HTTP server and iperf3 in UDP mode.

Fixes: be102eb6a0e7 ("netfilter: nf_conncount: rework API to use sk_buff directly")
Reported-by: Michal Slabihoudek <michal.slabihoudek@gooddata.com>
Closes: https://lore.kernel.org/netfilter/6989BD9F-8C24-4397-9AD7-4613B28BF0DB@gooddata.com/
Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de>
Signed-off-by: Florian Westphal <fw@strlen.de>
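The decision described in this commit message can be written as a small truth table. This is a Python sketch under stated assumptions, not the kernel implementation: `conncount_decision` and its string labels are hypothetical names made up for illustration, and the only inputs modeled are the ones the message mentions.

```python
def conncount_decision(ct_confirmed: bool, skb_iif: int, loopback_ifindex: int) -> str:
    """Pick the action for one packet, following the commit message.

    "track"        -> unconfirmed conntrack entry: add it to the list
    "gc_only"      -> confirmed entry: skip the add, only trigger a GC
    "gc_and_track" -> confirmed entry that arrived via loopback: the
                      optimization is skipped so that local-to-local
                      connections are still counted
    """
    if not ct_confirmed:
        return "track"
    if skb_iif == loopback_ifindex:
        return "gc_and_track"
    return "gc_only"


LOOPBACK = 1  # assumed ifindex of "lo" for this example
print(conncount_decision(True, LOOPBACK, LOOPBACK))
print(conncount_decision(True, 4, LOOPBACK))
```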
2026-01-20 | netfilter: nft_compat: add more restrictions on netlink attributes | Florian Westphal | 1 | -3/+10

As far as I can see nothing bad can happen when NFTA_TARGET/MATCH_NAME are too large because this calls x_tables helpers which check for the length, but it seems better to already reject it during netlink parsing.

The rest of the changes avoid silent u8/u16 truncations.

For _TYPE, it's expected to be only 1 or 0. In the x_tables world, this variable is set by the kernel: for IPT_SO_GET_REVISION_TARGET it's 1, for all others it's set to 0. As older versions of nf_tables permitted any value except 1 to mean 'match', keep this as-is but sanitize the value for consistency.

Fixes: 0ca743a55991 ("netfilter: nf_tables: add compatibility layer for x_tables")
Reviewed-by: Fernando Fernandez Mancera <fmancera@suse.de>
Signed-off-by: Florian Westphal <fw@strlen.de>
2026-01-20 | netfilter: nfnetlink_queue: nfqnl_instance GFP_ATOMIC -> GFP_KERNEL_ACCOUNT allocation | Scott Mitchell | 1 | -41/+34

Currently, instance_create() uses GFP_ATOMIC because it's called while holding instances_lock spinlock. This makes allocation more likely to fail under memory pressure.

Refactor nfqnl_recv_config() to drop RCU lock after instance_lookup() and peer_portid verification. A socket cannot simultaneously send a message and close, so the queue owned by the sending socket cannot be destroyed while processing its CONFIG message. This allows instance_create() to allocate with GFP_KERNEL_ACCOUNT before taking the spinlock.

Suggested-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Scott Mitchell <scott.k.mitch1@gmail.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
2026-01-20 | netfilter: nf_conntrack: don't rely on implicit includes | Florian Westphal | 13 | -3/+17

Several netfilter compilation units rely on implicit includes coming from nf_conntrack_proto_gre.h. Clean this up and add the required dependencies where needed.

nf_conntrack.h requires net_generic() helper. Place various gre/ppp/vlan includes to where they are needed.

Signed-off-by: Florian Westphal <fw@strlen.de>
2026-01-20 | netfilter: don't include xt and nftables.h in unrelated subsystems | Florian Westphal | 8 | -5/+6

conntrack, xtables and nftables are distinct subsystems, don't use them in other subsystems.

Signed-off-by: Florian Westphal <fw@strlen.de>
2026-01-20 | netfilter: nf_conntrack: enable icmp clash support | Florian Westphal | 2 | -0/+2

Not strictly required, but should not be harmful either: This isn't a stateful protocol, hence clash resolution should work fine.

Signed-off-by: Florian Westphal <fw@strlen.de>
2026-01-20 | netfilter: nf_conncount: increase the connection clean up limit to 64 | Fernando Fernandez Mancera | 2 | -5/+11

After the optimization to only perform one GC per jiffy, a new problem was introduced. If more than 8 new connections are tracked per jiffy, the list won't be cleaned up fast enough, possibly reaching the limit wrongly.

In order to prevent this issue, only skip the GC if it was already triggered during the same jiffy and the increment is lower than the clean up limit. In addition, increase the clean up limit to 64 connections to avoid triggering GC too often and do more effective GCs.

This has been tested using an HTTP server and several performance tools while having nft_connlimit/xt_connlimit or OVS limit configured.

Output of slowhttptest + OVS limit at 52000 connections:

  slow HTTP test status on 340th second:
  initializing:        0
  pending:             432
  connected:           51998
  error:               0
  closed:              0
  service available:   YES

Fixes: d265929930e2 ("netfilter: nf_conncount: reduce unnecessary GC")
Reported-by: Aleksandra Rukomoinikova <ARukomoinikova@k2.cloud>
Closes: https://lore.kernel.org/netfilter/b2064e7b-0776-4e14-adb6-c68080987471@k2.cloud/
Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de>
Signed-off-by: Florian Westphal <fw@strlen.de>
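The skip condition in this commit message can be sketched as follows. This is a Python model under loud assumptions, not the kernel code: `ConnList`, `on_new_connection` and `CLEANUP_LIMIT` are hypothetical names, and "the increment" is interpreted here as the number of connections added since the last GC.

```python
CLEANUP_LIMIT = 64  # clean up limit named in the commit message


class ConnList:
    """Toy model of one connection list (hypothetical, for illustration)."""

    def __init__(self):
        self.last_gc = None      # jiffy of the last GC run, None if never run
        self.added_since_gc = 0  # assumed meaning of "the increment"

    def on_new_connection(self, now: int) -> bool:
        """Account one new connection; return True if a GC ran."""
        self.added_since_gc += 1
        # Skip the GC only if one already ran in this jiffy AND fewer than
        # CLEANUP_LIMIT connections were added since then.
        if self.last_gc == now and self.added_since_gc < CLEANUP_LIMIT:
            return False
        self.last_gc = now
        self.added_since_gc = 0
        return True


conns = ConnList()
ran = [conns.on_new_connection(100) for _ in range(66)]
print(ran.count(True))  # number of GC runs for 66 connections in one jiffy
```

With this interpretation a burst inside one jiffy still triggers a GC every 64 additions instead of never, which is the behavior the message asks for.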
2026-01-20 | netfilter: nf_conntrack: Add allow_clash to generic protocol handler | Yuto Hamaguchi | 1 | -0/+1

The upstream commit, 71d8c47fc653711c41bc3282e5b0e605b3727956 ("netfilter: conntrack: introduce clash resolution on insertion race"), sets allow_clash=true in the UDP/UDPLITE protocol handler but does not set it in the generic protocol handler.

As a result, packets composed of connectionless protocols at each layer, such as UDP over IP-in-IP, are still dropped due to conflicts during conntrack insertion.

To resolve this, this patch sets allow_clash in nf_conntrack_l4proto_generic.

Signed-off-by: Yuto Hamaguchi <Hamaguchi.Yuto@da.MitsubishiElectric.co.jp>
Signed-off-by: Florian Westphal <fw@strlen.de>
2026-01-20 | netfilter: nf_tables: reset table validation state on abort | Florian Westphal | 1 | -0/+7

If a transaction fails the final validation in the commit hook, the table validation state is changed to NFT_VALIDATE_DO and a replay of the batch is performed. Every rule insert will then do a graph validation.

This is much slower, but provides better error reporting to the user because we can point at the rule that introduces the validation issue.

Without this reset the affected table(s) remain in full validation mode, i.e. on next transaction we start with slow-mode.

This makes the next transaction after a failed incremental update very slow:

  # time iptables-restore < /tmp/ruleset
  real    0m0.496s [..]
  # time iptables -A CALLEE -j CALLER
  iptables v1.8.11 (nf_tables):  RULE_APPEND failed (Too many links): rule in chain CALLEE
  real    0m0.022s [..]
  # time iptables-restore < /tmp/ruleset
  real    1m22.355s [..]

After this patch, 2nd iptables-restore is back to ~0.5s.

Fixes: 9a32e9850686 ("netfilter: nf_tables: don't write table validation state without mutex")
Signed-off-by: Florian Westphal <fw@strlen.de>
2026-01-20 | Merge branch 'netkit-support-for-io_uring-zero-copy-and-af_xdp' | Paolo Abeni | 28 | -117/+1233

Daniel Borkmann says:

====================
netkit: Support for io_uring zero-copy and AF_XDP

Containers use virtual netdevs to route traffic from a physical netdev in the host namespace. They do not have access to the physical netdev in the host and thus can't use memory providers or AF_XDP that require reconfiguring/restarting queues in the physical netdev.

This patchset adds the concept of queue leasing to virtual netdevs that allow containers to use memory providers and AF_XDP at native speed. Leased queues are bound to a real queue in a physical netdev and act as a proxy.

Memory providers and AF_XDP operations take an ifindex and queue id, so containers would pass in an ifindex for a virtual netdev and a queue id of a leased queue, which then gets proxied to the underlying real queue.

We have implemented support for this concept in netkit and tested the latter against Nvidia ConnectX-6 (mlx5) as well as Broadcom BCM957504 (bnxt_en) 100G NICs.

For more details see the individual patches.
====================

Link: https://patch.msgid.link/20260115082603.219152-1-daniel@iogearbox.net
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-01-20 | selftests/net: Add netkit container tests | David Wei | 3 | -0/+80

Add two tests using NetDrvContEnv. One basic test that sets up a netkit pair, with one end in a netns. Use LOCAL_PREFIX_V6 and nk_forward BPF program to ping from a remote host to the netkit in netns.

Second is a selftest for netkit queue leasing, using io_uring zero copy test binary inside of a netns with netkit. This checks that memory providers can be bound against virtual queues in a netkit within a netns that are leasing from a physical netdev in the default netns.

Signed-off-by: David Wei <dw@davidwei.uk>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/20260115082603.219152-17-daniel@iogearbox.net
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-01-20 | selftests/net: Make NetDrvContEnv support queue leasing | David Wei | 1 | -1/+46

Add a new parameter `lease` to NetDrvContEnv that sets up queue leasing in the env.

The NETIF also has some ethtool parameters changed to support memory provider tests. This is needed in NetDrvContEnv rather than individual test cases since the cleanup to restore NETIF can't be done until the netns in the env is gone.

Signed-off-by: David Wei <dw@davidwei.uk>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/20260115082603.219152-16-daniel@iogearbox.net
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-01-20 | selftests/net: Add env for container based tests | David Wei | 4 | -6/+127

Add an env NetDrvContEnv for container based selftests. This automates the setup of a netns, netkit pair with one inside the netns, and a BPF program that forwards skbs from the NETIF host inside the container.

Currently only netkit is used, but other virtual netdevs e.g. veth can be used too.

Expect netkit container datapath selftests to have a publicly routable IP prefix to assign to netkit in a container, such that packets will land on eth0. The BPF skb forward program will then forward such packets from the host netns to the container netns.

Signed-off-by: David Wei <dw@davidwei.uk>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/20260115082603.219152-15-daniel@iogearbox.net
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-01-20 | selftests/net: Add bpf skb forwarding program | David Wei | 1 | -0/+49

Add nk_forward.bpf.c, a BPF program that forwards skbs matching some IPv6 prefix received on eth0 ifindex to a specified netkit ifindex. This will be needed by netkit container tests.

Signed-off-by: David Wei <dw@davidwei.uk>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/20260115082603.219152-14-daniel@iogearbox.net
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
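The forwarding rule of such a program can be modeled in user space. The sketch below is Python, not BPF, and is not the contents of nk_forward.bpf.c: `forward_ifindex` is a hypothetical function, matching on the destination address is an assumption (the message only says "matching some IPv6 prefix"), and the ifindex values and the 2001:db8::/64 documentation prefix are placeholders.

```python
import ipaddress


def forward_ifindex(ingress_ifindex: int, dst_addr: str, eth0_ifindex: int,
                    prefix: str, netkit_ifindex: int):
    """Return the ifindex to redirect to, or None to leave the packet alone."""
    # Only packets received on the physical device are candidates.
    if ingress_ifindex != eth0_ifindex:
        return None
    # Assumed rule: redirect when the destination falls inside the prefix.
    if ipaddress.IPv6Address(dst_addr) in ipaddress.IPv6Network(prefix):
        return netkit_ifindex
    return None


ETH0, NK = 4, 8  # placeholder ifindex values
print(forward_ifindex(ETH0, "2001:db8::10", ETH0, "2001:db8::/64", NK))
print(forward_ifindex(ETH0, "2001:db9::10", ETH0, "2001:db8::/64", NK))
```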
2026-01-20 | netkit: Add xsk support for af_xdp applications | Daniel Borkmann | 1 | -1/+75

Enable support for AF_XDP applications to operate on a netkit device. The goal is that AF_XDP applications can natively consume AF_XDP from network namespaces. The use-case from Cilium side is to support Kubernetes KubeVirt VMs through QEMU's AF_XDP backend.

KubeVirt is a virtual machine management add-on for Kubernetes which aims to provide a common ground for virtualization. KubeVirt spawns the VMs inside Kubernetes Pods which reside in their own network namespace just like regular Pods.

Raw QEMU AF_XDP backend example with eth0 being a physical device with 16 queues where netkit is bound to the last queue (for multi-queue RSS context can be used if supported by the driver):

  # ethtool -X eth0 start 0 equal 15
  # ethtool -X eth0 start 15 equal 1 context new
  # ethtool --config-ntuple eth0 flow-type ether \
            src 00:00:00:00:00:00 \
            src-mask ff:ff:ff:ff:ff:ff \
            dst $mac dst-mask 00:00:00:00:00:00 \
            proto 0 proto-mask 0xffff action 15
  [ ... setup BPF/XDP prog on eth0 to steer into shared xsk map ... ]
  # ip netns add foo
  # ip link add numrxqueues 2 nk type netkit single
  # ./pyynl/cli.py --spec ~/netlink/specs/netdev.yaml \
                   --do queue-create \
                   --json "{\"ifindex\": $(ifindex nk), \"type\": \"rx\", \
                            \"lease\": { \"ifindex\": $(ifindex eth0), \
                                         \"queue\": { \"type\": \"rx\", \"id\": 15 } } }"
  {'id': 1}
  # ip link set nk netns foo
  # ip netns exec foo ip link set lo up
  # ip netns exec foo ip link set nk up
  # ip netns exec foo qemu-system-x86_64 \
          -kernel $kernel \
          -drive file=${image_name},index=0,media=disk,format=raw \
          -append "root=/dev/sda rw console=ttyS0" \
          -cpu host \
          -m $memory \
          -enable-kvm \
          -device virtio-net-pci,netdev=net0,mac=$mac \
          -netdev af-xdp,ifname=nk,id=net0,mode=native,queues=1,start-queue=1,inhibit=on,map-path=$dir/xsks_map \
          -nographic

We have tested the above against a dual-port Nvidia ConnectX-6 (mlx5) 100G NIC with successful network connectivity out of QEMU. An earlier iteration of this work was presented at LSF/MM/BPF [0] and more recently at LPC [1].
As a first starting point for connecting all things with KubeVirt, bind mounting the xsk map from Cilium into the VM launcher Pod, which acts as a regular Kubernetes Pod, is not perfect but also not a big problem, given it's out of reach from the application sitting inside the VM (and some of the control plane aspects are baked into the launcher Pod already), so the isolation barrier is still the VM. Eventually the goal is to have an XDP/XSK redirect extension where there is no need to have the xsk map, and the BPF program can just derive the target xsk through the queue on which traffic was received.

The exposure through netkit is because Cilium should not act as a proxy handing out xsk sockets. Existing applications expect a netdev from kernel side and should not need to be rewritten just to implement against a CNI's protocol. Also, all the memory should not be accounted against Cilium but rather against the application Pod itself which is consuming AF_XDP. Further, on up/downgrades we expect the data plane to be completely decoupled from the control plane; if Cilium owned the sockets, that would be disruptive. Another use-case which opens up and is regularly requested by users is to have DPDK applications on top of AF_XDP in regular Kubernetes Pods.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Co-developed-by: David Wei <dw@davidwei.uk>
Signed-off-by: David Wei <dw@davidwei.uk>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://bpfconf.ebpf.io/bpfconf2025/bpfconf2025_material/lsfmmbpf_2025_netkit_borkmann.pdf [0]
Link: https://lpc.events/event/19/contributions/2275/ [1]
Link: https://patch.msgid.link/20260115082603.219152-13-daniel@iogearbox.net
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-01-20 | netkit: Add netkit notifier to check for unregistering devices | Daniel Borkmann | 2 | -1/+62

Add a netdevice notifier in netkit to watch for NETDEV_UNREGISTER events. If the target device is indeed NETREG_UNREGISTERING and previously leased a queue to a netkit device, then collect the related netkit devices and batch-unregister_netdevice_many() them.

If this were not done, then the netkit device would hold a reference on the physical device preventing it from going away. However, in case of both io_uring zero-copy as well as AF_XDP this situation is handled gracefully and the allocated resources are torn down.

In the case where mentioned infra is used through netkit, the applications have a reference on netkit, and netkit in turn holds a reference on the physical device. In order to have netkit release the reference on the physical device, we need such watcher to then unregister the netkit ones.

This is generally quite similar to the dependency handling in case of tunnels (e.g. vxlan bound to an underlying netdev) where the tunnel device gets removed along with the physical device.

  # ip a
  [...]
  4: enp10s0f0np0: <BROADCAST,MULTICAST> mtu 1500 qdisc mq state DOWN group default qlen 1000
      link/ether e8:eb:d3:a3:43:f6 brd ff:ff:ff:ff:ff:ff
      inet 10.0.0.2/24 scope global enp10s0f0np0
         valid_lft forever preferred_lft forever
  [...]
  8: nk@NONE: <BROADCAST,MULTICAST,NOARP> mtu 1500 qdisc noop state DOWN group default qlen 1000
      link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff
  [...]

  # rmmod mlx5_ib
  # rmmod mlx5_core
  [  309.261822] mlx5_core 0000:0a:00.0 mlx5_0: Port: 1 Link DOWN
  [  344.235236] mlx5_core 0000:0a:00.1: E-Switch: Unload vfs: mode(LEGACY), nvfs(0), necvfs(0), active vports(0)
  [  344.246948] mlx5_core 0000:0a:00.1: E-Switch: Disable: mode(LEGACY), nvfs(0), necvfs(0), active vports(0)
  [  344.463754] mlx5_core 0000:0a:00.1: E-Switch: Disable: mode(LEGACY), nvfs(0), necvfs(0), active vports(0)
  [  344.770155] mlx5_core 0000:0a:00.1: E-Switch: cleanup
  [  345.345709] mlx5_core 0000:0a:00.0: E-Switch: Unload vfs: mode(LEGACY), nvfs(0), necvfs(0), active vports(0)
  [  345.357524] mlx5_core 0000:0a:00.0: E-Switch: Disable: mode(LEGACY), nvfs(0), necvfs(0), active vports(0)
  [  350.995989] mlx5_core 0000:0a:00.0: E-Switch: Disable: mode(LEGACY), nvfs(0), necvfs(0), active vports(0)
  [  351.574396] mlx5_core 0000:0a:00.0: E-Switch: cleanup

  # ip a
  [...]
  [ both enp10s0f0np0 and nk gone ]
  [...]

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Co-developed-by: David Wei <dw@davidwei.uk>
Signed-off-by: David Wei <dw@davidwei.uk>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/20260115082603.219152-12-daniel@iogearbox.net
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-01-20 | netkit: Implement rtnl_link_ops->alloc and ndo_queue_create | David Wei | 1 | -12/+105

Implement rtnl_link_ops->alloc that allows the number of rx queues to be set when netkit is created. By default, netkit has only a single rxq (and single txq). The number of queues is deliberately not allowed to be changed via ethtool -L and is fixed for the lifetime of a netkit instance.

For netkit device creation, numrxqueues larger than one can be specified. These rxqs are leasable to real rxqs in physical netdevs:

  ip link add type netkit peer numrxqueues 64      # for device pair
  ip link add numrxqueues 64 type netkit single    # for single device

The limit of numrxqueues for netkit is currently set to 1024, which allows leasing multiple real rxqs from physical netdevs.

The implementation of ndo_queue_create() adds a new rxq during the queue lease operation. We allow creating queues either in single device mode or, in dual device mode, for the netkit peer device which gets placed into the target network namespace. For dual device mode the lease against the primary device does not make sense for the targeted use cases, and therefore gets rejected.

We also need to add a lockdep class for netkit, such that lockdep does not trip over us, similarly done as in commit 0bef512012b1 ("net: add netdev_lockdep_set_classes() to virtual drivers").

This is also the last missing bit to netkit for supporting io_uring with zero-copy mode [0]. Up until this point it was not possible to consume the latter out of containers or Kubernetes Pods where applications are in their own network namespace.

io_uring example with eth0 being a physical device with 16 queues where netkit is bound to the last queue, iou-zcrx.c is binary from selftests. Flow steering to that queue is based on the service VIP:port of the server utilizing io_uring:

  # ethtool -X eth0 start 0 equal 15
  # ethtool -X eth0 start 15 equal 1 context new
  # ethtool --config-ntuple eth0 flow-type tcp4 dst-ip 1.2.3.4 dst-port 5000 action 15
  # ip netns add foo
  # ip link add type netkit peer numrxqueues 2
  # ./pyynl/cli.py --spec ~/netlink/specs/netdev.yaml \
                   --do queue-create \
                   --json "{\"ifindex\": $(ifindex nk0), \"type\": \"rx\", \
                            \"lease\": { \"ifindex\": $(ifindex eth0), \
                                         \"queue\": { \"type\": \"rx\", \"id\": 15 } } }"
  {'id': 1}
  # ip link set nk0 netns foo
  # ip link set nk1 up
  # ip netns exec foo ip link set lo up
  # ip netns exec foo ip link set nk0 up
  # ip netns exec foo ip addr add 1.2.3.4/32 dev nk0
  [ ... setup routing etc to get external traffic into the netns ... ]
  # ip netns exec foo ./iou-zcrx -s -p 5000 -i nk0 -q 1

Remote io_uring client:

  # ./iou-zcrx -c -h 1.2.3.4 -p 5000 -l 12840 -z 65536

We have tested the above against a Broadcom BCM957504 (bnxt_en) 100G NIC, supporting TCP header/data split.

Similarly, this also works for devmem which we tested using ncdevmem:

  # ip netns exec foo ./ncdevmem -s 1.2.3.4 -l -p 5000 -f nk0 -t 1 -q 1

And on the remote client:

  # ./ncdevmem -s 1.2.3.4 -p 5000 -f eth0

For Cilium, the plan is to open up support for the various memory providers for regular Kubernetes Pods when Cilium is configured with netkit datapath mode.

Signed-off-by: David Wei <dw@davidwei.uk>
Co-developed-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://kernel-recipes.org/en/2024/schedule/efficient-zero-copy-networking-using-io_uring [0]
Link: https://patch.msgid.link/20260115082603.219152-11-daniel@iogearbox.net
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-01-20 | netkit: Add single device mode for netkit | Daniel Borkmann | 2 | -40/+76

Add a single device mode for netkit instead of netkit pairs. The primary target for the paired devices is to connect network namespaces, of course, and support has been implemented in projects like Cilium [0]. For the rxq leasing the plan is to support two main scenarios related to single device mode:

* For the use-case of io_uring zero-copy, the control plane can either set up a netkit pair where the peer device can perform rxq leasing which is then tied to the lifetime of the peer device, or the control plane can use a regular netkit pair to connect the hostns to a Pod/container and dynamically add/remove rxq leasing through a single device without having to interrupt the device pair. In the case of io_uring, the memory pool is used as skb non-linear pages, and thus the skb will go its way through the regular stack into netkit. Things like the netkit policy when no BPF is attached or skb scrubbing etc apply as-is in case the paired devices are used, or if the backend memory is tied to the single device and traffic goes through a paired device.

* For the use-case of AF_XDP, the control plane needs to use netkit in the single device mode. The single device mode currently enforces only a pass policy when no BPF is attached, and does not yet support BPF link attachments for AF_XDP. skbs sent to that device get dropped at the moment. Given AF_XDP operates at a lower layer of the stack tying this to the netkit pair did not make sense. In future, the plan is to allow BPF at the XDP layer which can: i) process traffic coming from the AF_XDP application (e.g. QEMU with AF_XDP backend) to filter egress traffic or to push selected egress traffic up to the single netkit device to the local stack (e.g. DHCP requests), and ii) vice-versa skbs sent to the single netkit into the AF_XDP application (e.g. DHCP replies). Also, the control-plane can dynamically manage rxq leasing for the single netkit device without having to interrupt (e.g. down/up cycle) the main netkit pair for the Pod which has traffic going in and out.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Co-developed-by: David Wei <dw@davidwei.uk>
Signed-off-by: David Wei <dw@davidwei.uk>
Reviewed-by: Jordan Rife <jordan@jrife.io>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://docs.cilium.io/en/stable/operations/performance/tuning/#netkit-device-mode [0]
Link: https://patch.msgid.link/20260115082603.219152-10-daniel@iogearbox.net
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-01-20 | xsk: Proxy pool management for leased queues | Daniel Borkmann | 1 | -13/+35

Similarly to the net_mp_{open,close}_rxq handling for leased queues, proxy the xsk_{reg,clear}_pool_at_qid via netif_get_rx_queue_lease_locked such that in case a virtual netdev picked a leased rxq, the request gets through to the real rxq in the physical netdev. The proxying is only relevant for queue_id < dev->real_num_rx_queues since right now it's only supported for rxqs.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Co-developed-by: David Wei <dw@davidwei.uk>
Signed-off-by: David Wei <dw@davidwei.uk>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/20260115082603.219152-9-daniel@iogearbox.net
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-01-20 | xsk: Extend xsk_rcv_check validation | Daniel Borkmann | 1 | -3/+26

xsk_rcv_check tests for inbound packets to see whether they match the bound AF_XDP socket. Refactor the test into a small helper xsk_dev_queue_valid and move the validation against xs->dev and xs->queue_id there.

The fast-path case stays in place and allows for quick return in xsk_dev_queue_valid. If it fails, the validation is extended to check whether the AF_XDP socket is bound against a leased queue, and if that is the case, the test is redone.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Co-developed-by: David Wei <dw@davidwei.uk>
Signed-off-by: David Wei <dw@davidwei.uk>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/20260115082603.219152-8-daniel@iogearbox.net
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
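The two-step check described here (fast path first, lease lookup second) might look like the following in a user-space model. This is a Python sketch under assumptions, not the kernel helper: the function reuses the name xsk_dev_queue_valid for readability only, queues are modeled as `(device name, queue id)` tuples, and the `leases` dictionary is a hypothetical stand-in for the kernel's lease bookkeeping.

```python
from typing import Dict, Tuple

Queue = Tuple[str, int]  # (device name, queue id)


def xsk_dev_queue_valid(bound: Queue, rx: Queue, leases: Dict[Queue, Queue]) -> bool:
    """Check whether a packet received on `rx` belongs to the socket bound to `bound`."""
    # Fast path: the socket is bound to exactly the receiving device and queue.
    if bound == rx:
        return True
    # Slow path: the socket may be bound to a virtual queue that leases a
    # real one. Resolve the lease and redo the comparison.
    leased = leases.get(bound)
    return leased is not None and leased == rx


# Values mirror the examples in this log: nk rxq 1 leases eth0 rxq 15.
leases = {("nk", 1): ("eth0", 15)}
print(xsk_dev_queue_valid(("nk", 1), ("eth0", 15), leases))
print(xsk_dev_queue_valid(("nk", 1), ("eth0", 14), leases))
```

Keeping the direct comparison first means sockets bound to physical queues never pay for the lease lookup.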
2026-01-20 | net: Proxy netdev_queue_get_dma_dev for leased queues | David Wei | 1 | -2/+15

Extend netdev_queue_get_dma_dev to return the physical device of the real rxq for DMA in case the queue was leased. This allows memory providers like io_uring zero-copy or devmem to bind to the physically leased rxq via virtual devices such as netkit.

Signed-off-by: David Wei <dw@davidwei.uk>
Co-developed-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/20260115082603.219152-7-daniel@iogearbox.net
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-01-20 | net: Proxy net_mp_{open,close}_rxq for leased queues | David Wei | 3 | -20/+66

When a process in a container wants to set up a memory provider, it will use the virtual netdev and a leased rxq, and call net_mp_{open,close}_rxq to try and restart the queue. At this point, proxy the queue restart on the real rxq in the physical netdev.

For memory providers (io_uring zero-copy rx and devmem), it causes the real rxq in the physical netdev to be filled from a memory provider that has DMA mapped memory from a process within a container.

Signed-off-by: David Wei <dw@davidwei.uk>
Co-developed-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20260115082603.219152-6-daniel@iogearbox.net
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-01-20 | net, ethtool: Disallow leased real rxqs to be resized | Daniel Borkmann | 2 | -9/+12

Similar to AF_XDP, do not allow queues in a physical netdev to be resized by ethtool -L when they are leased.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Co-developed-by: David Wei <dw@davidwei.uk>
Signed-off-by: David Wei <dw@davidwei.uk>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20260115082603.219152-5-daniel@iogearbox.net
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-01-20 | net: Add lease info to queue-get response | Daniel Borkmann | 3 | -0/+91

Populate nested lease info to the queue-get response that returns the ifindex, queue id with type and optionally netns id if the device resides in a different netns.

Example with ynl client:

  # ip a
  [...]
  4: enp10s0f0np0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 xdp/id:24 qdisc mq state UP group default qlen 1000
      link/ether e8:eb:d3:a3:43:f6 brd ff:ff:ff:ff:ff:ff
      inet 10.0.0.2/24 scope global enp10s0f0np0
         valid_lft forever preferred_lft forever
      inet6 fe80::eaeb:d3ff:fea3:43f6/64 scope link proto kernel_ll
         valid_lft forever preferred_lft forever
  [...]

  # ethtool -i enp10s0f0np0
  driver: mlx5_core
  [...]

  # ./pyynl/cli.py \
      --spec ~/netlink/specs/netdev.yaml \
      --do queue-get \
      --json '{"ifindex": 4, "id": 15, "type": "rx"}'
  {'id': 15,
   'ifindex': 4,
   'lease': {'ifindex': 8, 'netns-id': 0, 'queue': {'id': 1, 'type': 'rx'}},
   'napi-id': 8227,
   'type': 'rx',
   'xsk': {}}

  # ip netns list
  foo (id: 0)

  # ip netns exec foo ip a
  [...]
  8: nk@NONE: <BROADCAST,MULTICAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
      link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff
      inet6 fe80::200:ff:fe00:0/64 scope link proto kernel_ll
         valid_lft forever preferred_lft forever
  [...]

  # ip netns exec foo ethtool -i nk
  driver: netkit
  [...]

  # ip netns exec foo ls /sys/class/net/nk/queues/
  rx-0  rx-1  tx-0

  # ip netns exec foo ./pyynl/cli.py \
      --spec ~/netlink/specs/netdev.yaml \
      --do queue-get \
      --json '{"ifindex": 8, "id": 1, "type": "rx"}'
  {'id': 1, 'ifindex': 8, 'type': 'rx'}

Note that the caller of netdev_nl_queue_fill_one() holds the netdevice lock. For the queue-get we do not lock both devices. When queues get {un,}leased, both devices are locked, thus if __netif_get_rx_queue_peer() returns true, the peer pointer points to a valid device. The netns-id is fetched via peernet2id_alloc() similarly as done in OVS.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Co-developed-by: David Wei <dw@davidwei.uk>
Signed-off-by: David Wei <dw@davidwei.uk>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20260115082603.219152-4-daniel@iogearbox.net
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
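A consumer of the queue-get reply could pick out the nested lease as sketched below. This is a Python illustration, not part of pyynl: `lease_peer` is a hypothetical helper, and the two dictionaries are the replies quoted in the commit message for the physical and the netkit queue.

```python
def lease_peer(queue: dict):
    """Return (peer ifindex, peer queue id, netns-id or None), or None if not leased."""
    lease = queue.get("lease")
    if lease is None:
        return None
    # netns-id is optional: per the commit message it is only reported when
    # the peer device resides in a different netns.
    return (lease["ifindex"], lease["queue"]["id"], lease.get("netns-id"))


# Reply for the physical device queue (ifindex 4, rx queue 15).
physical = {'id': 15, 'ifindex': 4,
            'lease': {'ifindex': 8, 'netns-id': 0,
                      'queue': {'id': 1, 'type': 'rx'}},
            'napi-id': 8227, 'type': 'rx', 'xsk': {}}
# Reply for the netkit queue as seen from inside the netns.
virtual = {'id': 1, 'ifindex': 8, 'type': 'rx'}

print(lease_peer(physical))  # (8, 1, 0)
print(lease_peer(virtual))   # None
```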
2026-01-20 | net: Implement netdev_nl_queue_create_doit | Daniel Borkmann | 9 | -12/+278

Implement netdev_nl_queue_create_doit which creates a new rx queue in a virtual netdev and then leases it to a rx queue in a physical netdev.

Example with ynl client:

  # ./pyynl/cli.py \
      --spec ~/netlink/specs/netdev.yaml \
      --do queue-create \
      --json '{"ifindex": 8, "type": "rx", "lease": {"ifindex": 4, "queue": {"type": "rx", "id": 15}}}'
  {'id': 1}

Note that the netdevice locking order is always from the virtual to the physical device.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Co-developed-by: David Wei <dw@davidwei.uk>
Signed-off-by: David Wei <dw@davidwei.uk>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/20260115082603.219152-3-daniel@iogearbox.net
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
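When the ifindex values come from variables, building the `--json` argument programmatically avoids shell quoting problems. The following Python sketch only constructs the payload shown in the example above; `queue_create_request` is a hypothetical helper and nothing here talks to netlink.

```python
import json


def queue_create_request(virt_ifindex: int, phys_ifindex: int, phys_queue_id: int) -> dict:
    """Build a queue-create request leasing a physical rx queue to a virtual netdev."""
    return {
        "ifindex": virt_ifindex,  # virtual device that receives the new queue
        "type": "rx",             # the only queue type mentioned in this log
        "lease": {
            "ifindex": phys_ifindex,  # physical device owning the real queue
            "queue": {"type": "rx", "id": phys_queue_id},
        },
    }


# Same values as the ynl example: netkit ifindex 8, physical ifindex 4, rxq 15.
payload = json.dumps(queue_create_request(8, 4, 15))
print(payload)
```

The printed string can be passed as the single argument of `--json`.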
2026-01-20net: Add queue-create operationDaniel Borkmann6-0/+93
Add a ynl netdev family operation called queue-create that creates a new queue on a netdevice:

      name: queue-create
      attribute-set: queue
      flags: [admin-perm]
      do:
        request:
          attributes:
            - ifindex
            - type
            - lease
        reply: &queue-create-op
          attributes:
            - id

This is a generic operation such that it can be extended for various use cases in future. Right now it is mandatory to specify ifindex, the queue type (which is enforced to be rx), and a lease. The newly created queue id is returned to the caller.

A queue from a virtual device can have a lease which refers to another queue from a physical device. This is useful for memory providers and AF_XDP operations which take an ifindex and queue id to allow applications to bind against virtual devices in containers. The lease couples both queues together and allows proxying operations from a virtual device in a container to the physical device.

In future, the nested lease attribute can be lifted and made optional for other use-cases such as dynamic queue creation for physical netdevs. The lack of lease and the specification of the physical device as an ifindex will imply that we need a real queue to be allocated. Similarly, the queue type enforcement to rx can then be lifted as well to support tx.

An early implementation had only driver-specific integration [0], but for other virtual devices to reuse it, it makes sense to have this as a generic API in core net.

For leasing queues, the virtual netdev must have real_num_rx_queues less than num_rx_queues at the time of calling queue-create. The queue-type must be rx as only rx queues are supported for leasing for now. We also enforce that the queue-create ifindex must point to a virtual device, and that the nested lease attribute's ifindex must point to a physical device.

The nested lease attribute set contains a netns-id attribute which is currently only intended for dumping as part of the queue-get operation. Also, it is modeled as an s32 type, as is done elsewhere in the stack.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Co-developed-by: David Wei <dw@davidwei.uk>
Signed-off-by: David Wei <dw@davidwei.uk>
Link: https://bpfconf.ebpf.io/bpfconf2025/bpfconf2025_material/lsfmmbpf_2025_netkit_borkmann.pdf [0]
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/20260115082603.219152-2-daniel@iogearbox.net
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
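The constraints listed in this commit message can be summarized as a small validation routine. The following is only an illustrative model, not the kernel code: the `NetDev` class, the `check_queue_create` function, the error strings and the order of the checks are all assumptions made for the sketch.

```python
from dataclasses import dataclass


@dataclass
class NetDev:
    """Hypothetical stand-in for the few netdev fields the checks need."""
    ifindex: int
    is_virtual: bool
    num_rx_queues: int
    real_num_rx_queues: int


def check_queue_create(devices, request):
    """Return None if the request passes the rules from the commit
    message, otherwise a short reason string (wording is made up)."""
    lease = request.get("lease")
    if "ifindex" not in request or "type" not in request or lease is None:
        return "ifindex, type and lease are mandatory"
    if request["type"] != "rx":
        return "queue type must be rx"
    virt = devices.get(request["ifindex"])
    if virt is None or not virt.is_virtual:
        return "queue-create ifindex must point to a virtual device"
    phys = devices.get(lease.get("ifindex"))
    if phys is None or phys.is_virtual:
        return "lease ifindex must point to a physical device"
    if lease.get("queue", {}).get("type") != "rx":
        return "only rx queues can be leased"
    # A spare slot is needed: real_num_rx_queues < num_rx_queues.
    if virt.real_num_rx_queues >= virt.num_rx_queues:
        return "virtual device has no spare rx queue"
    return None


devices = {
    8: NetDev(ifindex=8, is_virtual=True, num_rx_queues=4, real_num_rx_queues=1),
    4: NetDev(ifindex=4, is_virtual=False, num_rx_queues=16, real_num_rx_queues=16),
}
ok_request = {"ifindex": 8, "type": "rx",
              "lease": {"ifindex": 4, "queue": {"type": "rx", "id": 15}}}
print(check_queue_create(devices, ok_request))
```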
2026-01-20Merge branch 'net-hinic3-pf-initialization'Paolo Abeni26-37/+2444
Fan Gong says: ==================== net: hinic3: PF initialization This is [1/3] part of hinic3 Ethernet driver second submission. With this patch hinic3 becomes a complete Ethernet driver with pf and vf. The driver parts contained in this patch: Add support for PF framework based on the VF code. Add PF management interfaces to communicate with HW. Add 8 netdev ops to configure NIC features. Support mac filter to unicast and multicast. Add HW event handler to manage port and link status. V01: https://lore.kernel.org/netdev/cover.1760502478.git.zhuyikai1@h-partners.com/ V02: https://lore.kernel.org/netdev/cover.1760685059.git.zhuyikai1@h-partners.com/ V03: https://lore.kernel.org/netdev/cover.1761362580.git.zhuyikai1@h-partners.com/ V04: https://lore.kernel.org/netdev/cover.1761711549.git.zhuyikai1@h-partners.com/ V05: https://lore.kernel.org/netdev/cover.1762414088.git.zhuyikai1@h-partners.com/ V06: https://lore.kernel.org/netdev/cover.1762581665.git.zhuyikai1@h-partners.com/ V07: https://lore.kernel.org/netdev/cover.1763555878.git.zhuyikai1@h-partners.com/ V08: https://lore.kernel.org/netdev/cover.1767495881.git.zhuyikai1@h-partners.com/ V09: https://lore.kernel.org/netdev/cover.1767707500.git.zhuyikai1@h-partners.com/ V10: https://lore.kernel.org/netdev/cover.1767861236.git.zhuyikai1@h-partners.com/ ==================== Link: https://patch.msgid.link/cover.1768375903.git.zhuyikai1@h-partners.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-01-20hinic3: Add HW event handlerFan Gong6-1/+92
Add HINIC3_INIT_UP flags to trace netdev open status. Add port module event handler. Add link status event types (FAULT, PCIE link down, heart lost, mgmt watchdog). Co-developed-by: Zhu Yikai <zhuyikai1@h-partners.com> Signed-off-by: Zhu Yikai <zhuyikai1@h-partners.com> Signed-off-by: Fan Gong <gongfan1@huawei.com> Link: https://patch.msgid.link/53a2b928136998f740d597bbd45ca1740b95538f.1768375903.git.zhuyikai1@h-partners.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-01-20hinic3: Add mac filter opsFan Gong8-0/+514
Add ops to support unicast and multicast MAC address filtering for sent and received packets. Co-developed-by: Zhu Yikai <zhuyikai1@h-partners.com> Signed-off-by: Zhu Yikai <zhuyikai1@h-partners.com> Signed-off-by: Fan Gong <gongfan1@huawei.com> Link: https://patch.msgid.link/9ea618ad5d6c404e222e9114e08e913a73177705.1768375903.git.zhuyikai1@h-partners.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-01-20hinic3: Add adaptive IRQ coalescing with DIMFan Gong7-5/+174
DIM offers a way to adjust the coalescing settings based on load. As hinic3 rx and tx share interrupts, we only need to base dim on rx stats. Co-developed-by: Zhu Yikai <zhuyikai1@h-partners.com> Signed-off-by: Zhu Yikai <zhuyikai1@h-partners.com> Signed-off-by: Fan Gong <gongfan1@huawei.com> Link: https://patch.msgid.link/af96c20a836800a5972a09cdaf520029d976ad48.1768375903.git.zhuyikai1@h-partners.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-01-20hinic3: Add .ndo_vlan_rx_add/kill_vid and .ndo_validate_addrFan Gong6-0/+128
Implement the following callback functions: .ndo_vlan_rx_add_vid, .ndo_vlan_rx_kill_vid and .ndo_validate_addr. Co-developed-by: Zhu Yikai <zhuyikai1@h-partners.com> Signed-off-by: Zhu Yikai <zhuyikai1@h-partners.com> Signed-off-by: Fan Gong <gongfan1@huawei.com> Link: https://patch.msgid.link/70be187afcaa7b38981d114c088ffdc2cba0b4f1.1768375903.git.zhuyikai1@h-partners.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-01-20hinic3: Add .ndo_features_checkFan Gong1-0/+13
As hinic3 cannot handle packets with multiple stacked VLAN tags, use .ndo_features_check to detect these packets and return a reduced feature set without offload features. Co-developed-by: Zhu Yikai <zhuyikai1@h-partners.com> Signed-off-by: Zhu Yikai <zhuyikai1@h-partners.com> Signed-off-by: Fan Gong <gongfan1@huawei.com> Link: https://patch.msgid.link/3879b20b7ffa20106a3f8f56dbf2d5eb389f260a.1768375903.git.zhuyikai1@h-partners.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-01-20hinic3: Add .ndo_set_features and .ndo_fix_featuresFan Gong6-1/+370
Implement the following callback functions: .ndo_set_features and .ndo_fix_features. The .ndo_set_features function handles five features: rx_csum, tso, lro, rx_cvlan and vlan_filter. Add these new features in netdev_feature_init. Co-developed-by: Zhu Yikai <zhuyikai1@h-partners.com> Signed-off-by: Zhu Yikai <zhuyikai1@h-partners.com> Signed-off-by: Fan Gong <gongfan1@huawei.com> Link: https://patch.msgid.link/682734a08fde421413048bf70057dafe3cbe8497.1768375903.git.zhuyikai1@h-partners.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-01-20hinic3: Add .ndo_tx_timeout and .ndo_get_stats64Fan Gong8-2/+217
Implement the following callback functions: .ndo_tx_timeout and .ndo_get_stats64. Use a work queue to trace the tx_timeout callback and dump the necessary debug information. Co-developed-by: Zhu Yikai <zhuyikai1@h-partners.com> Signed-off-by: Zhu Yikai <zhuyikai1@h-partners.com> Signed-off-by: Fan Gong <gongfan1@huawei.com> Link: https://patch.msgid.link/ec34d2ff9b142e1e142e47700714533baf7e659c.1768375903.git.zhuyikai1@h-partners.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-01-20hinic3: Add PF management interfacesFan Gong13-3/+521
Add management and communication pathways between PF and HW. Co-developed-by: Zhu Yikai <zhuyikai1@h-partners.com> Signed-off-by: Zhu Yikai <zhuyikai1@h-partners.com> Signed-off-by: Fan Gong <gongfan1@huawei.com> Link: https://patch.msgid.link/447e72b1c5f255ea5d79ecf96c8dac58011e604d.1768375903.git.zhuyikai1@h-partners.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-01-20hinic3: Add PF frameworkFan Gong14-25/+415
Add support for PF framework based on the VF code. Co-developed-by: Zhu Yikai <zhuyikai1@h-partners.com> Signed-off-by: Zhu Yikai <zhuyikai1@h-partners.com> Signed-off-by: Fan Gong <gongfan1@huawei.com> Link: https://patch.msgid.link/4796d6076bbad0f096eb10261ab3c7d989c5f7b7.1768375903.git.zhuyikai1@h-partners.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-01-19Merge branch 'net-mlx5e-save-per-channel-async-icosq-in-default'Jakub Kicinski10-62/+154
Tariq Toukan says:

====================
net/mlx5e: Save per-channel async ICOSQ in default

This series by William reduces the default number of SQs in a channel from 3 down to 2, by not creating the async ICOSQ (asynchronous internal-communication-operations send-queue).

This significantly improves the latency of channel configuration operations, like interface up (create channels), interface down (destroy channels), and channels reconfiguration (create new set, destroy old one).

In addition to the improved latency, this reduces per-channel memory usage and saves hardware resources.

This significantly speeds up the setup/config stage on systems with a high number of channels or many netdevs, in particular systems with hundreds or thousands of SFs.

The two remaining default SQs per channel after this series: 1 TXQ SQ (for traffic), and 1 ICOSQ (for internal communication operations with the device).

Perf numbers:
NIC: ConnectX-7.
Test: Latency of interface up + down operations.
Measured 20% speedup. Saving ~0.36 sec for 248 channels (~1.45 msec per channel).
====================

Link: https://patch.msgid.link/1768376800-1607672-1-git-send-email-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-19net/mlx5e: Conditionally create async ICOSQWilliam Tu4-29/+50
The async ICOSQ is only required by TLS RX (for re-sync flow) and XSK TX. Create it only when these features are enabled instead of always allocating it. This reduces per-channel memory usage, saves hardware resources, improves latency, and decreases the default number of SQs (from 3 to 2) and CQs (from 4 to 3). It also speeds up channel open/close operations for a netdev when async ICOSQ is not needed.

Currently when TLS RX is enabled, there is no channel reset triggered. As a result, async ICOSQ allocation is not triggered, causing a NULL pointer crash. One solution is to do a channel reset every time TLS RX is toggled. However, it's not straightforward as the offload state matters only on connection creation, and can go on beyond the channels reset.

Instead, introduce a new field 'ktls_rx_was_enabled': if TLS RX is enabled for the first time: reset channels, create async ICOSQ, set the field. From that point on, there is no need to reset channels for any TLS RX enable/disable. Async ICOSQ will always be needed.

For XSK TX, async ICOSQ is used in wakeup control, so it is guaranteed to be allocated.

This improves the latency of interface up/down operations when it applies.

Perf numbers:
NIC: ConnectX-7.
Test: Latency of interface up + down operations.
Measured 20% speedup. Saving ~0.36 sec for 248 channels (~1.45 msec per channel).

Signed-off-by: William Tu <witu@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/1768376800-1607672-5-git-send-email-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
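The one-way behavior of 'ktls_rx_was_enabled' can be modeled in a few lines. This is a sketch of the decision logic as described in the commit message, not the driver code: the `ChannelSet` class, its attributes other than `ktls_rx_was_enabled`, and the reset counter are hypothetical.

```python
class ChannelSet:
    """Toy model of a set of channels and its optional async ICOSQ."""

    def __init__(self, xsk_tx=False):
        self.xsk_tx = xsk_tx
        self.tls_rx = False
        self.ktls_rx_was_enabled = False  # sticky: never cleared once set
        self.resets = 0
        self.async_icosq = False
        self._open_channels()

    def _open_channels(self):
        # Allocate the async ICOSQ only if a feature needs it.
        self.async_icosq = self.ktls_rx_was_enabled or self.xsk_tx

    def set_tls_rx(self, enable):
        self.tls_rx = enable
        if enable and not self.ktls_rx_was_enabled:
            # First enable: set the field and reset channels so the
            # async ICOSQ gets created.
            self.ktls_rx_was_enabled = True
            self.resets += 1
            self._open_channels()
        # Later enable/disable toggles do not reset the channels.


ch = ChannelSet()
ch.set_tls_rx(True)
ch.set_tls_rx(False)
ch.set_tls_rx(True)
print(ch.async_icosq, ch.resets)
```

The sticky field trades a little memory after the first TLS RX enable for never having to reset channels again on later toggles.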
2026-01-19net/mlx5e: Move async ICOSQ to dynamic allocationWilliam Tu6-28/+65
Dynamically allocate async ICOSQ. ICO (Internal Communication Operations) is for the driver to communicate with the HW, and it's not used for traffic. Currently the mlx5 driver has sync and async ICO send queues. The async ICOSQ means that it's not necessarily under NAPI context protection. The patch is in preparation for the later patch to detect its usage and enable it when necessary. Signed-off-by: William Tu <witu@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/1768376800-1607672-4-git-send-email-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-19net/mlx5e: Use regular ICOSQ for triggering NAPIWilliam Tu5-5/+37
Before the cited commit, ICOSQ is used to post NOP WQE to trigger hardware interrupt and start NAPI, but this mechanism suffers from a race condition: mlx5e_alloc_rx_mpwqe may post UMR WQEs to ICOSQ _before_ NOP WQE is posted. The cited commit fixes the issue by replacing ICOSQ with async ICOSQ, as a new way to post the NOP WQE to trigger the hardware interrupt and NAPI.

The patch changes it back by replacing async ICOSQ with regular ICOSQ, for the purpose of saving memory in later patches, and solves the issue by adding a new SQ state, MLX5E_SQ_STATE_LOCK_NEEDED for syncing the start of NAPI.

What it does:
- Switch trigger path from async ICOSQ to regular ICOSQ to reduce need for async SQ.
- Introduce MLX5E_SQ_STATE_LOCK_NEEDED and mlx5e_icosq_sync_lock(), unlock() to prevent the race where UMR WQEs could be posted before the NOP WQE used to trigger NAPI.
- Use synchronize_net() once per trigger cycle to quiesce in-flight softirqs before serializing the NOP WQE and any UMR postings via the ICOSQ lock.
- Wrap ICOSQ UMR posting in en_rx.c and xsk/rx.c with the new conditional lock.

The conditional locking approach is critical for performance: always locking would impose unnecessary overhead. Synchronization is not needed between regular NAPI cycles once the channel is activated and running. The lock is only required to protect against the race during channel activation: specifically, when the very first NOP WQE is posted to trigger NAPI. After that initial trigger, normal NAPI polling handles subsequent work without contention. The MLX5E_SQ_STATE_LOCK_NEEDED flag ensures we pay the synchronization cost only when necessary.

Signed-off-by: William Tu <witu@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/1768376800-1607672-3-git-send-email-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
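The conditional locking idea can be illustrated with a userspace analogy. This sketch is not the driver implementation: the `Icosq` class, `post_umr`, `trigger_napi`, the `quiesce` callback standing in for synchronize_net(), and in particular the point at which the flag is cleared are assumptions made for illustration.

```python
import threading
from contextlib import contextmanager


class Icosq:
    """Toy send queue with a lock that is taken only when flagged."""

    def __init__(self):
        self.lock = threading.Lock()
        self.lock_needed = False  # models MLX5E_SQ_STATE_LOCK_NEEDED
        self.wqes = []
        self.locked_posts = 0     # how often the lock was really taken

    @contextmanager
    def sync_lock(self):
        if not self.lock_needed:
            yield                 # fast path: no locking cost
            return
        with self.lock:
            self.locked_posts += 1
            yield


def post_umr(sq):
    # The regular RX path posts UMR WQEs under the conditional lock.
    with sq.sync_lock():
        sq.wqes.append("UMR")


def trigger_napi(sq, quiesce=lambda: None):
    sq.lock_needed = True         # from now on posters must take the lock
    quiesce()                     # stand-in for synchronize_net()
    with sq.sync_lock():
        sq.wqes.append("NOP")     # the WQE that kicks off NAPI
    sq.lock_needed = False        # assumed: cleared once the NOP is posted


sq = Icosq()
post_umr(sq)                      # runs lock-free
trigger_napi(sq)
print(sq.wqes, sq.locked_posts)
```

In this model only postings that overlap a trigger cycle pay for the lock, which mirrors the performance argument made in the commit message.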
2026-01-19net/mlx5e: Move async ICOSQ lock into ICOSQ structWilliam Tu3-16/+18
Move the async_icosq spinlock from the mlx5e_channel structure into the mlx5e_icosq structure itself for better encapsulation and for later patch to also use it for other icosq use cases.

Changes:
- Add spinlock_t lock field to struct mlx5e_icosq
- Remove async_icosq_lock field from struct mlx5e_channel
- Initialize the new lock in mlx5e_open_icosq()
- Update all lock usage in ktls_rx.c and en_main.c to use sq->lock instead of c->async_icosq_lock

Signed-off-by: William Tu <witu@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/1768376800-1607672-2-git-send-email-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-19octeon_ep: reset firmware ready statusVimlesh Kumar4-1/+58
Add support to reset firmware ready status when the driver is removed (either on unload or unbind). Signed-off-by: Sathesh Edara <sedara@marvell.com> Signed-off-by: Shinas Rasheed <srasheed@marvell.com> Signed-off-by: Vimlesh Kumar <vimleshk@marvell.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20260115092048.870237-1-vimleshk@marvell.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-19Merge branch 'net-thunderbolt-various-improvements'Jakub Kicinski8-3/+73
Mika Westerberg says: ==================== net: thunderbolt: Various improvements This series improves the Thunderbolt networking driver so that it should work with the bonding driver. The discussion that started this patch series can be read below: https://lore.kernel.org/netdev/CAFJzfF9N4Hak23sc-zh0jMobbkjK7rg4odhic1DQ1cC+=MoQoA@mail.gmail.com/ v2: https://lore.kernel.org/20260109122606.3586895-1-mika.westerberg@linux.intel.com v1: https://lore.kernel.org/20251127131521.2580237-1-mika.westerberg@linux.intel.com ==================== Link: https://patch.msgid.link/20260115115646.328898-1-mika.westerberg@linux.intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-19net: thunderbolt: Allow reading link settingsIan MacDonald1-0/+49
In order to use Thunderbolt networking as part of a bonding device, it needs to support the ->get_link_ksettings() ethtool operation, so that the bonding driver can read the link speed and the related attributes. Add support for this to the driver. Signed-off-by: Ian MacDonald <ian@netstatz.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: Mika Westerberg <mika.westerberg@linux.intel.com> Link: https://patch.msgid.link/20260115115646.328898-5-mika.westerberg@linux.intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-19bonding: 3ad: Add support for SPEED_80000Mika Westerberg1-0/+9
Add support for ethtool SPEED_80000. This is needed to allow Thunderbolt/USB4 networking driver to be used with the bonding driver. Signed-off-by: Mika Westerberg <mika.westerberg@linux.intel.com> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://patch.msgid.link/20260115115646.328898-4-mika.westerberg@linux.intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-19net: ethtool: Add support for 80Gbps speedMika Westerberg6-3/+11
USB4 v2 link used in peer-to-peer networking is symmetric 80Gbps so in order to support reading this link speed, add support for it to ethtool. Signed-off-by: Mika Westerberg <mika.westerberg@linux.intel.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Link: https://patch.msgid.link/20260115115646.328898-3-mika.westerberg@linux.intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-19net: thunderbolt: Allow changing MAC address of the deviceMika Westerberg1-0/+4
The MAC address we use is based on a suggestion in the USB4 Inter-domain spec but it is not really used in the USB4NET protocol. It is more targeted for the upper layers of the network stack. There is no reason why it should not be changed by the userspace for example if needed for bonding. Reported-by: Ian MacDonald <ian@netstatz.com> Closes: https://lore.kernel.org/netdev/CAFJzfF9N4Hak23sc-zh0jMobbkjK7rg4odhic1DQ1cC+=MoQoA@mail.gmail.com/ Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: Mika Westerberg <mika.westerberg@linux.intel.com> Link: https://patch.msgid.link/20260115115646.328898-2-mika.westerberg@linux.intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-19Merge branch 'dpll-support-mode-switching'Jakub Kicinski5-8/+182
Ivan Vecera says:

====================
dpll: support mode switching

This series adds support for switching the working mode (automatic vs manual) of a DPLL device via netlink.

Currently, the DPLL subsystem allows userspace to retrieve the current working mode but lacks the mechanism to configure it. Userspace is also unaware of which modes a specific device actually supports, as it currently assumes only the active mode is supported.

The series addresses these limitations by:
1. Introducing .supported_modes_get() callback to allow drivers to report all modes capable of running on the device.
2. Introducing .mode_set() callback and updating the netlink policy to allow userspace to request a mode change.
3. Implementing these callbacks in the zl3073x driver, enabling dynamic switching between automatic and manual modes.
====================

Link: https://patch.msgid.link/20260114122726.120303-1-ivecera@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-19dpll: zl3073x: Implement device mode setting supportIvan Vecera1-0/+112
Add support for .supported_modes_get() and .mode_set() callbacks to enable switching between manual and automatic modes via netlink.

Implement .supported_modes_get() to report available modes based on the current hardware configuration:
* manual mode is always supported
* automatic mode is supported unless the dpll channel is configured in NCO (Numerically Controlled Oscillator) mode

Implement .mode_set() to handle the specific logic required when transitioning between modes:

1) Transition to manual:
* If a valid reference is currently active, switch the hardware to ref-lock mode (force lock to that reference).
* If no reference is valid and the DPLL is unlocked, switch to freerun.
* Otherwise, switch to Holdover.

2) Transition to automatic:
* If the currently selected reference pin was previously marked as non-selectable (likely during a previous manual forcing operation), restore its priority and selectability in the hardware.
* Switch the hardware to Automatic selection mode.

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
Reviewed-by: Prathosh Satish <Prathosh.Satish@microchip.com>
Link: https://patch.msgid.link/20260114122726.120303-4-ivecera@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
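The mode rules above read naturally as a small decision table. The following sketch restates them in Python under stated assumptions: the function names, the string mode names and the dict-based pin representation (`selectable`, `prio`, `saved_prio`) are hypothetical and do not reflect the zl3073x driver's data structures.

```python
def supported_modes(nco_mode):
    """Manual is always available; automatic unless the channel is in NCO mode."""
    modes = ["manual"]
    if not nco_mode:
        modes.append("automatic")
    return modes


def hw_mode_for_manual(ref_valid, locked):
    """Pick the hardware mode when userspace requests manual mode."""
    if ref_valid:
        return "reflock"   # force lock to the currently active reference
    if not locked:
        return "freerun"   # no valid reference and DPLL unlocked
    return "holdover"      # no valid reference but DPLL still locked


def switch_to_automatic(selected_pin):
    """Restore a pin made non-selectable earlier, then go automatic."""
    if selected_pin is not None and not selected_pin["selectable"]:
        selected_pin["prio"] = selected_pin["saved_prio"]
        selected_pin["selectable"] = True
    return "automatic"


pin = {"selectable": False, "prio": None, "saved_prio": 3}
print(supported_modes(nco_mode=False))
print(hw_mode_for_manual(ref_valid=False, locked=True))
print(switch_to_automatic(pin), pin)
```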