kernel/linux.git/drivers/net/vrf.c, branch v6.18.21

ipv4: Convert ->flowi4_tos to dscp_t.

2025-08-27T00:34:31+00:00

Convert the ->flowic_tos field of struct flowi_common from __u8 to dscp_t, rename it ->flowic_dscp and propagate these changes to struct flowi and struct flowi4. We've had several bugs in the past where ECN bits could interfere with IPv4 routing, because these bits were not properly cleared when setting ->flowi4_tos. These bugs should be fixed now and the dscp_t type has been introduced to ensure that variables carrying DSCP values don't accidentally have any ECN bits set. Several variables and structure fields have been converted to dscp_t already, but the main IPv4 routing structure, struct flowi4, is still using a __u8. To avoid any future regression, this patch converts it to dscp_t. There are many users to convert at once. Fortunately, around half of ->flowi4_tos users already have a dscp_t value at hand, which they currently convert to __u8 using inet_dscp_to_dsfield(). For all of these users, we just need to drop that conversion. But, although we try to do the __u8 <-> dscp_t conversions at the boundaries of the network or of user space, some places still store TOS/DSCP variables as __u8 in core networking code. Those can hardly be converted either because the data structure is part of UAPI or because the same variable or field is also used for handling ECN in other parts of the code. In all of these cases where we don't have a dscp_t variable at hand, we need to use inet_dsfield_to_dscp() when interacting with ->flowi4_dscp. Changes since v1: * Fix space alignment in __bpf_redirect_neigh_v4() (Ido). Signed-off-by: Guillaume Nault Reviewed-by: Ido Schimmel Link: https://patch.msgid.link/29acecb45e911d17446b9a3dbdb1ab7b821ea371.1756128932.git.gnault@redhat.com Signed-off-by: Jakub Kicinski

vrf: Drop existing dst reference in vrf_ip6_input_dst

2025-07-26T18:28:45+00:00

Commit ff3fbcdd4724 ("selftests: tc: Add generic erspan_opts matching support for tc-flower") started triggering the following kmemleak warning: unreferenced object 0xffff888015fb0e00 (size 512): comm "softirq", pid 0, jiffies 4294679065 hex dump (first 32 bytes): 00 00 00 00 00 00 00 00 40 d2 85 9e ff ff ff ff ........@....... 41 69 59 9d ff ff ff ff 00 00 00 00 00 00 00 00 AiY............. backtrace (crc 30b71e8b): __kmalloc_noprof+0x359/0x460 metadata_dst_alloc+0x28/0x490 erspan_rcv+0x4f1/0x1160 [ip_gre] gre_rcv+0x217/0x240 [ip_gre] gre_rcv+0x1b8/0x400 [gre] ip_protocol_deliver_rcu+0x31d/0x3a0 ip_local_deliver_finish+0x37d/0x620 ip_local_deliver+0x174/0x460 ip_rcv+0x52b/0x6b0 __netif_receive_skb_one_core+0x149/0x1a0 process_backlog+0x3c8/0x1390 __napi_poll.constprop.0+0xa1/0x390 net_rx_action+0x59b/0xe00 handle_softirqs+0x22b/0x630 do_softirq+0xb1/0xf0 __local_bh_enable_ip+0x115/0x150 vrf_ip6_input_dst unconditionally sets skb dst entry, add a call to skb_dst_drop to drop any existing entry. Cc: David Ahern Reviewed-by: Ido Schimmel Fixes: 9ff74384600a ("net: vrf: Handle ipv6 multicast and link-local addresses") Signed-off-by: Stanislav Fomichev Link: https://patch.msgid.link/20250725160043.350725-1-sdf@fomichev.me Signed-off-by: Jakub Kicinski

net: sched: generalize check for no-queue qdisc on TX queue

2025-04-28T21:06:58+00:00

The "noqueue" qdisc can either be directly attached, or get default attached if net_device priv_flags has IFF_NO_QUEUE. In both cases, the allocated Qdisc structure gets it's enqueue function pointer reset to NULL by noqueue_init() via noqueue_qdisc_ops. This is a common case for software virtual net_devices. For these devices with no-queue, the transmission path in __dev_queue_xmit() will bypass the qdisc layer. Directly invoking device drivers ndo_start_xmit (via dev_hard_start_xmit). In this mode the device driver is not allowed to ask for packets to be queued (either via returning NETDEV_TX_BUSY or stopping the TXQ). The simplest and most reliable way to identify this no-queue case is by checking if enqueue == NULL. The vrf driver currently open-codes this check (!qdisc->enqueue). While functionally correct, this low-level detail is better encapsulated in a dedicated helper for clarity and long-term maintainability. To make this behavior more explicit and reusable, this patch introduce a new helper: qdisc_txq_has_no_queue(). Helper will also be used by the veth driver in the next patch, which introduces optional qdisc-based backpressure. This is a non-functional change. Reviewed-by: David Ahern Reviewed-by: Toke Høiland-Jørgensen Signed-off-by: Jesper Dangaard Brouer Link: https://patch.msgid.link/174559293172.827981.7583862632045264175.stgit@firesoul Signed-off-by: Jakub Kicinski

net: move misc netdev_lock flavors to a separate header

2025-03-08T17:06:50+00:00

Move the more esoteric helpers for netdev instance lock to a dedicated header. This avoids growing netdevice.h to infinity and makes rebuilding the kernel much faster (after touching the header with the helpers). The main netdev_lock() / netdev_unlock() functions are used in static inlines in netdevice.h and will probably be used most commonly, so keep them in netdevice.h. Acked-by: Stanislav Fomichev Link: https://patch.msgid.link/20250307183006.2312761-1-kuba@kernel.org Signed-off-by: Jakub Kicinski

net: rename netns_local to netns_immutable

2025-03-04T11:44:48+00:00

The name 'netns_local' is confusing. A following commit will export it via netlink, so let's use a more explicit name. Reported-by: Eric Dumazet Suggested-by: Kuniyuki Iwashima Signed-off-by: Nicolas Dichtel Reviewed-by: Kuniyuki Iwashima Signed-off-by: Paolo Abeni

rtnetlink: Pack newlink() params into struct

2025-02-21T23:28:02+00:00

There are 4 net namespaces involved when creating links: - source netns - where the netlink socket resides, - target netns - where to put the device being created, - link netns - netns associated with the device (backend), - peer netns - netns of peer device. Currently, two nets are passed to newlink() callback - "src_net" parameter and "dev_net" (implicitly in net_device). They are set as follows, depending on netlink attributes in the request. +------------+-------------------+---------+---------+ | peer netns | IFLA_LINK_NETNSID | src_net | dev_net | +------------+-------------------+---------+---------+ | | absent | source | target | | absent +-------------------+---------+---------+ | | present | link | link | +------------+-------------------+---------+---------+ | | absent | peer | target | | present +-------------------+---------+---------+ | | present | peer | link | +------------+-------------------+---------+---------+ When IFLA_LINK_NETNSID is present, the device is created in link netns first and then moved to target netns. This has some side effects, including extra ifindex allocation, ifname validation and link events. These could be avoided if we create it in target netns from the beginning. On the other hand, the meaning of src_net parameter is ambiguous. It varies depending on how parameters are passed. It is the effective link (or peer netns) by design, but some drivers ignore it and use dev_net instead. To provide more netns context for drivers, this patch packs existing newlink() parameters, along with the source netns, link netns and peer netns, into a struct. The old "src_net" is renamed to "net" to avoid confusion with real source netns, and will be deprecated later. The use of src_net are converted to params->net trivially. Signed-off-by: Xiao Liang Reviewed-by: Kuniyuki Iwashima Link: https://patch.msgid.link/20250219125039.18024-3-shaw.leon@gmail.com Signed-off-by: Jakub Kicinski

net: fib_rules: Factorise fib_newrule() and fib_delrule().

2025-02-11T03:08:52+00:00

fib_nl_newrule() / fib_nl_delrule() is the doit() handler for RTM_NEWRULE / RTM_DELRULE but also called from vrf_newlink(). Currently, we hold RTNL on both paths but will not on the former. Also, we set dev_net(dev)->rtnl to skb->sk in vrf_fib_rule() because fib_nl_newrule() / fib_nl_delrule() fetch net as sock_net(skb->sk). Let's Factorise the two functions and pass net and rtnl_held flag. Signed-off-by: Kuniyuki Iwashima Reviewed-by: Eric Dumazet Reviewed-by: Ido Schimmel Tested-by: Ido Schimmel Link: https://patch.msgid.link/20250207072502.87775-6-kuniyu@amazon.com Signed-off-by: Jakub Kicinski

vrf: Make pcpu_dstats update functions available to other modules.

2024-12-07T01:56:52+00:00

Currently vrf is the only module that uses NETDEV_PCPU_STAT_DSTATS. In order to make this kind of statistics available to other modules, we need to define the update functions in netdevice.h. Therefore, let's define dev_dstats_*() functions for RX and TX packet updates (packets, bytes and drops). Use these new functions in vrf.c instead of vrf_rx_stats() and the other manual counter updates. While there, update the type of the "len" variables to "unsigned int", so that there're aligned with both skb->len and the new dstats update functions. Signed-off-by: Guillaume Nault Link: https://patch.msgid.link/d7a552ee382c79f4854e7fcc224cf176cd21150d.1733313925.git.gnault@redhat.com Signed-off-by: Jakub Kicinski

vrf: Prepare vrf_process_v4_outbound() to future .flowi4_tos conversion.

2024-11-03T22:29:05+00:00

Use ip4h_dscp() to get the DSCP from the IPv4 header, then convert the dscp_t value to __u8 with inet_dscp_to_dsfield(). Then, when we'll convert .flowi4_tos to dscp_t, we'll just have to drop the inet_dscp_to_dsfield() call. Signed-off-by: Guillaume Nault Reviewed-by: David Ahern Link: https://patch.msgid.link/6be084229008dcfa7a4e2758befccfd2217a331e.1730294788.git.gnault@redhat.com Signed-off-by: Jakub Kicinski

vrf: revert "vrf: Remove unnecessary RCU-bh critical section"

2024-10-03T00:26:11+00:00

This reverts commit 504fc6f4f7f681d2a03aa5f68aad549d90eab853. dev_queue_xmit_nit is expected to be called with BH disabled. __dev_queue_xmit has the following: /* Disable soft irqs for various locks below. Also * stops preemption for RCU. */ rcu_read_lock_bh(); VRF must follow this invariant. The referenced commit removed this protection. Which triggered a lockdep warning: ================================ WARNING: inconsistent lock state 6.11.0 #1 Tainted: G W -------------------------------- inconsistent {IN-SOFTIRQ-W} -> {SOFTIRQ-ON-W} usage. btserver/134819 [HC0[0]:SC0[0]:HE1:SE1] takes: ffff8882da30c118 (rlock-AF_PACKET){+.?.}-{2:2}, at: tpacket_rcv+0x863/0x3b30 {IN-SOFTIRQ-W} state was registered at: lock_acquire+0x19a/0x4f0 _raw_spin_lock+0x27/0x40 packet_rcv+0xa33/0x1320 __netif_receive_skb_core.constprop.0+0xcb0/0x3a90 __netif_receive_skb_list_core+0x2c9/0x890 netif_receive_skb_list_internal+0x610/0xcc0 [...] other info that might help us debug this: Possible unsafe locking scenario: CPU0 ---- lock(rlock-AF_PACKET); lock(rlock-AF_PACKET); *** DEADLOCK *** Call Trace: dump_stack_lvl+0x73/0xa0 mark_lock+0x102e/0x16b0 __lock_acquire+0x9ae/0x6170 lock_acquire+0x19a/0x4f0 _raw_spin_lock+0x27/0x40 tpacket_rcv+0x863/0x3b30 dev_queue_xmit_nit+0x709/0xa40 vrf_finish_direct+0x26e/0x340 [vrf] vrf_l3_out+0x5f4/0xe80 [vrf] __ip_local_out+0x51e/0x7a0 [...] Fixes: 504fc6f4f7f6 ("vrf: Remove unnecessary RCU-bh critical section") Link: https://lore.kernel.org/netdev/20240925185216.1990381-1-greearb@candelatech.com/ Reported-by: Ben Greear Signed-off-by: Willem de Bruijn Cc: stable@vger.kernel.org Reviewed-by: Ido Schimmel Tested-by: Ido Schimmel Reviewed-by: David Ahern Link: https://patch.msgid.link/20240929061839.1175300-1-willemdebruijn.kernel@gmail.com Signed-off-by: Jakub Kicinski