kernel/linux.git/drivers/net/vrf.c, branch v7.1

vrf: Fix a potential NPD when removing a port from a VRF

2026-04-28T00:43:22+00:00

RCU readers that identified a net device as a VRF port using netif_is_l3_slave() assume that a subsequent call to netdev_master_upper_dev_get_rcu() will return a VRF device. They then continue to dereference its l3mdev operations. This assumption is not always correct and can result in a NPD [1]. There is no RCU synchronization when removing a port from a VRF, so it is possible for an RCU reader to see a new master device (e.g., a bridge) that does not have l3mdev operations. Fix by adding RCU synchronization after clearing the IFF_L3MDEV_SLAVE flag. Skip this synchronization when a net device is removed from a VRF as part of its deletion and when the VRF device itself is deleted. In the latter case an RCU grace period will pass by the time RTNL is released. [1] BUG: kernel NULL pointer dereference, address: 0000000000000000 [...] RIP: 0010:l3mdev_fib_table_rcu (net/l3mdev/l3mdev.c:181) [...] Call Trace: l3mdev_fib_table_by_index (net/l3mdev/l3mdev.c:201 net/l3mdev/l3mdev.c:189) __inet_bind (net/ipv4/af_inet.c:499 (discriminator 3)) inet_bind_sk (net/ipv4/af_inet.c:469) __sys_bind (./include/linux/file.h:62 (discriminator 1) ./include/linux/file.h:83 (discriminator 1) net/socket.c:1951 (discriminator 1)) __x64_sys_bind (net/socket.c:1969 (discriminator 1) net/socket.c:1967 (discriminator 1) net/socket.c:1967 (discriminator 1)) do_syscall_64 (arch/x86/entry/syscall_64.c:63 (discriminator 1) arch/x86/entry/syscall_64.c:94 (discriminator 1)) entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130) Fixes: fdeea7be88b1 ("net: vrf: Set slave's private flag before linking") Reported-by: Haoze Xie Reported-by: Yifan Wu Reported-by: Juefei Pu Reported-by: Yuan Tan Closes: https://lore.kernel.org/netdev/20260419145332.3988923-1-n05ec@lzu.edu.cn/ Signed-off-by: Ido Schimmel Reviewed-by: David Ahern Link: https://patch.msgid.link/20260423063607.1208202-1-idosch@nvidia.com Signed-off-by: Jakub Kicinski

vrf: Remove unnecessary RCU protection around dst entries

2026-03-28T04:07:24+00:00

During initialization of a VRF device, the VRF driver creates two dst entries (for IPv4 and IPv6). They are attached to locally generated packets that are transmitted out of the VRF ports (via the l3mdev_l3_out() hook). Their purpose is to redirect packets towards the VRF device instead of having the packets egress directly out of the VRF ports. This is useful, for example, when a queuing discipline is configured on the VRF device. In order to avoid a NULL pointer dereference, commit b0e95ccdd775 ("net: vrf: protect changes to private data with rcu") made the pointers to the dst entries RCU protected. As far as I can tell, this was needed because back then the dst entries were released (and the pointers reset to NULL) before removing the VRF ports. Later on, commit f630c38ef0d7 ("vrf: fix bug_on triggered by rx when destroying a vrf") moved the removal of the VRF ports to the VRF device's dellink() callback. As such, the tear down sequence of a VRF device looks as follows: 1. VRF ports are removed. 2. VRF device is unregistered. a. Device is closed. b. An RCU grace period passes. c. ndo_uninit() is called. i. dst entries are released. Given the above, the Tx path will always see the same fully initialized dst entries and will never race with the ndo_uninit() callback. Therefore, there is no need to make the pointers to the dst entries RCU protected. Remove it as well as the unnecessary NULL checks in the Tx path. Signed-off-by: Ido Schimmel Reviewed-by: David Ahern Link: https://patch.msgid.link/20260326203233.1128554-4-idosch@nvidia.com Signed-off-by: Jakub Kicinski

vrf: Use dst_dev_put() instead of using loopback device

2026-03-28T04:07:24+00:00

Use dst_dev_put() to clean up the device referenced by the dst entry instead of partially open coding it. Internally, the helper uses the blackhole device instead of the loopback device. Reviewed-by: Petr Machata Reviewed-by: David Ahern Signed-off-by: Ido Schimmel Link: https://patch.msgid.link/20260326203233.1128554-3-idosch@nvidia.com Signed-off-by: Jakub Kicinski

vrf: Remove unnecessary NULL check

2026-03-28T04:07:24+00:00

The VRF driver always allocates an IPv4 dst entry for a VRF device and prevents the device from being registered if the allocation fails. Therefore, there is no need to check if the entry exists when tearing down a VRF device. Remove the check. Note that the same is not true for the IPv6 dst entry. Its creation can be skipped if IPv6 is administratively disabled (i.e., 'ipv6.disable=1'). Reviewed-by: Petr Machata Reviewed-by: David Ahern Signed-off-by: Ido Schimmel Link: https://patch.msgid.link/20260326203233.1128554-2-idosch@nvidia.com Signed-off-by: Jakub Kicinski

treewide: Replace kmalloc with kmalloc_obj for non-scalar types

2026-02-21T09:02:28+00:00

This is the result of running the Coccinelle script from scripts/coccinelle/api/kmalloc_objs.cocci. The script is designed to avoid scalar types (which need careful case-by-case checking), and instead replace kmalloc-family calls that allocate struct or union object instances: Single allocations: kmalloc(sizeof(TYPE), ...) are replaced with: kmalloc_obj(TYPE, ...) Array allocations: kmalloc_array(COUNT, sizeof(TYPE), ...) are replaced with: kmalloc_objs(TYPE, COUNT, ...) Flex array allocations: kmalloc(struct_size(PTR, FAM, COUNT), ...) are replaced with: kmalloc_flex(*PTR, FAM, COUNT, ...) (where TYPE may also be *VAR) The resulting allocations no longer return "void *", instead returning "TYPE *". Signed-off-by: Kees Cook

ipv4: Convert ->flowi4_tos to dscp_t.

2025-08-27T00:34:31+00:00

Convert the ->flowic_tos field of struct flowi_common from __u8 to dscp_t, rename it ->flowic_dscp and propagate these changes to struct flowi and struct flowi4. We've had several bugs in the past where ECN bits could interfere with IPv4 routing, because these bits were not properly cleared when setting ->flowi4_tos. These bugs should be fixed now and the dscp_t type has been introduced to ensure that variables carrying DSCP values don't accidentally have any ECN bits set. Several variables and structure fields have been converted to dscp_t already, but the main IPv4 routing structure, struct flowi4, is still using a __u8. To avoid any future regression, this patch converts it to dscp_t. There are many users to convert at once. Fortunately, around half of ->flowi4_tos users already have a dscp_t value at hand, which they currently convert to __u8 using inet_dscp_to_dsfield(). For all of these users, we just need to drop that conversion. But, although we try to do the __u8 <-> dscp_t conversions at the boundaries of the network or of user space, some places still store TOS/DSCP variables as __u8 in core networking code. Those can hardly be converted either because the data structure is part of UAPI or because the same variable or field is also used for handling ECN in other parts of the code. In all of these cases where we don't have a dscp_t variable at hand, we need to use inet_dsfield_to_dscp() when interacting with ->flowi4_dscp. Changes since v1: * Fix space alignment in __bpf_redirect_neigh_v4() (Ido). Signed-off-by: Guillaume Nault Reviewed-by: Ido Schimmel Link: https://patch.msgid.link/29acecb45e911d17446b9a3dbdb1ab7b821ea371.1756128932.git.gnault@redhat.com Signed-off-by: Jakub Kicinski

vrf: Drop existing dst reference in vrf_ip6_input_dst

2025-07-26T18:28:45+00:00

Commit ff3fbcdd4724 ("selftests: tc: Add generic erspan_opts matching support for tc-flower") started triggering the following kmemleak warning: unreferenced object 0xffff888015fb0e00 (size 512): comm "softirq", pid 0, jiffies 4294679065 hex dump (first 32 bytes): 00 00 00 00 00 00 00 00 40 d2 85 9e ff ff ff ff ........@....... 41 69 59 9d ff ff ff ff 00 00 00 00 00 00 00 00 AiY............. backtrace (crc 30b71e8b): __kmalloc_noprof+0x359/0x460 metadata_dst_alloc+0x28/0x490 erspan_rcv+0x4f1/0x1160 [ip_gre] gre_rcv+0x217/0x240 [ip_gre] gre_rcv+0x1b8/0x400 [gre] ip_protocol_deliver_rcu+0x31d/0x3a0 ip_local_deliver_finish+0x37d/0x620 ip_local_deliver+0x174/0x460 ip_rcv+0x52b/0x6b0 __netif_receive_skb_one_core+0x149/0x1a0 process_backlog+0x3c8/0x1390 __napi_poll.constprop.0+0xa1/0x390 net_rx_action+0x59b/0xe00 handle_softirqs+0x22b/0x630 do_softirq+0xb1/0xf0 __local_bh_enable_ip+0x115/0x150 vrf_ip6_input_dst unconditionally sets skb dst entry, add a call to skb_dst_drop to drop any existing entry. Cc: David Ahern Reviewed-by: Ido Schimmel Fixes: 9ff74384600a ("net: vrf: Handle ipv6 multicast and link-local addresses") Signed-off-by: Stanislav Fomichev Link: https://patch.msgid.link/20250725160043.350725-1-sdf@fomichev.me Signed-off-by: Jakub Kicinski

net: sched: generalize check for no-queue qdisc on TX queue

2025-04-28T21:06:58+00:00

The "noqueue" qdisc can either be directly attached, or get default attached if net_device priv_flags has IFF_NO_QUEUE. In both cases, the allocated Qdisc structure gets it's enqueue function pointer reset to NULL by noqueue_init() via noqueue_qdisc_ops. This is a common case for software virtual net_devices. For these devices with no-queue, the transmission path in __dev_queue_xmit() will bypass the qdisc layer. Directly invoking device drivers ndo_start_xmit (via dev_hard_start_xmit). In this mode the device driver is not allowed to ask for packets to be queued (either via returning NETDEV_TX_BUSY or stopping the TXQ). The simplest and most reliable way to identify this no-queue case is by checking if enqueue == NULL. The vrf driver currently open-codes this check (!qdisc->enqueue). While functionally correct, this low-level detail is better encapsulated in a dedicated helper for clarity and long-term maintainability. To make this behavior more explicit and reusable, this patch introduce a new helper: qdisc_txq_has_no_queue(). Helper will also be used by the veth driver in the next patch, which introduces optional qdisc-based backpressure. This is a non-functional change. Reviewed-by: David Ahern Reviewed-by: Toke Høiland-Jørgensen Signed-off-by: Jesper Dangaard Brouer Link: https://patch.msgid.link/174559293172.827981.7583862632045264175.stgit@firesoul Signed-off-by: Jakub Kicinski

net: move misc netdev_lock flavors to a separate header

2025-03-08T17:06:50+00:00

Move the more esoteric helpers for netdev instance lock to a dedicated header. This avoids growing netdevice.h to infinity and makes rebuilding the kernel much faster (after touching the header with the helpers). The main netdev_lock() / netdev_unlock() functions are used in static inlines in netdevice.h and will probably be used most commonly, so keep them in netdevice.h. Acked-by: Stanislav Fomichev Link: https://patch.msgid.link/20250307183006.2312761-1-kuba@kernel.org Signed-off-by: Jakub Kicinski

net: rename netns_local to netns_immutable

2025-03-04T11:44:48+00:00

The name 'netns_local' is confusing. A following commit will export it via netlink, so let's use a more explicit name. Reported-by: Eric Dumazet Suggested-by: Kuniyuki Iwashima Signed-off-by: Nicolas Dichtel Reviewed-by: Kuniyuki Iwashima Signed-off-by: Paolo Abeni