summaryrefslogtreecommitdiff
path: root/net/core
AgeCommit message (Collapse)AuthorFilesLines
2024-11-19Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netPaolo Abeni3-3/+5
Merge in late fixes to prepare for the 6.13 net-next PR. Conflicts: include/linux/phy.h 41ffcd95015f net: phy: fix phylib's dual eee_enabled 721aa69e708b net: phy: convert eee_broken_modes to a linkmode bitmap https://lore.kernel.org/all/20241118135512.1039208b@canb.auug.org.au/ drivers/net/ethernet/wangxun/txgbe/txgbe_phy.c 2160428bcb20 net: txgbe: fix null pointer to pcs 2160428bcb20 net: txgbe: remove GPIO interrupt controller Adjacent commits: include/linux/phy.h 41ffcd95015f net: phy: fix phylib's dual eee_enabled 516a5f11eb97 net: phy: respect cached advertising when re-enabling EEE Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-11-19bpf: fix recursive lock when verdict program return SK_PASSJiayuan Chen1-2/+2
When the stream_verdict program returns SK_PASS, it places the received skb into its own receive queue, but a recursive lock eventually occurs, leading to an operating system deadlock. This issue has been present since v6.9. ''' sk_psock_strp_data_ready write_lock_bh(&sk->sk_callback_lock) strp_data_ready strp_read_sock read_sock -> tcp_read_sock strp_recv cb.rcv_msg -> sk_psock_strp_read # now stream_verdict return SK_PASS without peer sock assign __SK_PASS = sk_psock_map_verd(SK_PASS, NULL) sk_psock_verdict_apply sk_psock_skb_ingress_self sk_psock_skb_ingress_enqueue sk_psock_data_ready read_lock_bh(&sk->sk_callback_lock) <= dead lock ''' This topic has been discussed before, but it has not been fixed. Previous discussion: https://lore.kernel.org/all/6684a5864ec86_403d20898@john.notmuch Fixes: 6648e613226e ("bpf, skmsg: Fix NULL pointer dereference in sk_psock_skb_ingress_enqueue") Reported-by: Vincent Whitchurch <vincent.whitchurch@datadoghq.com> Signed-off-by: Jiayuan Chen <mrpre@163.com> Signed-off-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://patch.msgid.link/20241118030910.36230-2-mrpre@163.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-19netpoll: Use rcu_access_pointer() in __netpoll_setupBreno Leitao1-1/+1
The ndev->npinfo pointer in __netpoll_setup() is RCU-protected but is being accessed directly for a NULL check. While no RCU read lock is held in this context, we should still use proper RCU primitives for consistency and correctness. Replace the direct NULL check with rcu_access_pointer(), which is the appropriate primitive when only checking for NULL without dereferencing the pointer. This function provides the necessary ordering guarantees without requiring RCU read-side protection. Reviewed-by: Michal Kubiak <michal.kubiak@intel.com> Signed-off-by: Breno Leitao <leitao@debian.org> Link: https://patch.msgid.link/20241118-netpoll_rcu-v1-1-a1888dcb4a02@debian.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-19net/neighbor: clear error in case strict check is not setJakub Kicinski1-0/+1
Commit 51183d233b5a ("net/neighbor: Update neigh_dump_info for strict data checking") added strict checking. The err variable is not cleared, so if we find no table to dump we will return the validation error even if user did not want strict checking. I think the only way to hit this is to send an buggy request, and ask for a table which doesn't exist, so there's no point treating this as a real fix. I only noticed it because a syzbot repro depended on it to trigger another bug. Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20241115003221.733593-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-18Merge tag 'pull-fd' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfsLinus Torvalds1-6/+4
Pull 'struct fd' class updates from Al Viro: "The bulk of struct fd memory safety stuff Making sure that struct fd instances are destroyed in the same scope where they'd been created, getting rid of reassignments and passing them by reference, converting to CLASS(fd{,_pos,_raw}). We are getting very close to having the memory safety of that stuff trivial to verify" * tag 'pull-fd' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (28 commits) deal with the last remaing boolean uses of fd_file() css_set_fork(): switch to CLASS(fd_raw, ...) memcg_write_event_control(): switch to CLASS(fd) assorted variants of irqfd setup: convert to CLASS(fd) do_pollfd(): convert to CLASS(fd) convert do_select() convert vfs_dedupe_file_range(). convert cifs_ioctl_copychunk() convert media_request_get_by_fd() convert spu_run(2) switch spufs_calls_{get,put}() to CLASS() use convert cachestat(2) convert do_preadv()/do_pwritev() fdget(), more trivial conversions fdget(), trivial conversions privcmd_ioeventfd_assign(): don't open-code eventfd_ctx_fdget() o2hb_region_dev_store(): avoid goto around fdget()/fdput() introduce "fd_pos" class, convert fdget_pos() users to it. fdget_raw() users: switch to CLASS(fd_raw) convert vmsplice() to CLASS(fd) ...
2024-11-16ndo_fdb_del: Add a parameter to report whether notification was sentPetr Machata1-3/+8
In a similar fashion to ndo_fdb_add, which was covered in the previous patch, add the bool *notified argument to ndo_fdb_del. Callees that send a notification on their own set the flag to true. Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Amit Cohen <amcohen@nvidia.com> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://patch.msgid.link/06b1acf4953ef0a5ed153ef1f32d7292044f2be6.1731589511.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-16ndo_fdb_add: Add a parameter to report whether notification was sentPetr Machata1-3/+6
Currently when FDB entries are added to or deleted from a VXLAN netdevice, the VXLAN driver emits one notification, including the VXLAN-specific attributes. The core however always sends a notification as well, a generic one. Thus two notifications are unnecessarily sent for these operations. A similar situation comes up with bridge driver, which also emits notifications on its own: # ip link add name vx type vxlan id 1000 dstport 4789 # bridge monitor fdb & [1] 1981693 # bridge fdb add de:ad:be:ef:13:37 dev vx self dst 192.0.2.1 de:ad:be:ef:13:37 dev vx dst 192.0.2.1 self permanent de:ad:be:ef:13:37 dev vx self permanent In order to prevent this duplicity, add a paremeter to ndo_fdb_add, bool *notified. The flag is primed to false, and if the callee sends a notification on its own, it sets it to true, thus informing the core that it should not generate another notification. Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Amit Cohen <amcohen@nvidia.com> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://patch.msgid.link/cbf6ae8195e85cbf922f8058ce4eba770f3b71ed.1731589511.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-16net: netpoll: flush skb pool during cleanupBreno Leitao1-1/+13
The netpoll subsystem maintains a pool of 32 pre-allocated SKBs per instance, but these SKBs are not freed when the netpoll user is brought down. This leads to memory waste as these buffers remain allocated but unused. Add skb_pool_flush() to properly clean up these SKBs when netconsole is terminated, improving memory efficiency. Signed-off-by: Breno Leitao <leitao@debian.org> Link: https://patch.msgid.link/20241114-skb_buffers_v2-v3-2-9be9f52a8b69@debian.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-16net: netpoll: Individualize the skb poolBreno Leitao1-18/+13
The current implementation of the netpoll system uses a global skb pool, which can lead to inefficient memory usage and waste when targets are disabled or no longer in use. This can result in a significant amount of memory being unnecessarily allocated and retained, potentially causing performance issues and limiting the availability of resources for other system components. Modify the netpoll system to assign a skb pool to each target instead of using a global one. This approach allows for more fine-grained control over memory allocation and deallocation, ensuring that resources are only allocated and retained as needed. Signed-off-by: Breno Leitao <leitao@debian.org> Link: https://patch.msgid.link/20241114-skb_buffers_v2-v3-1-9be9f52a8b69@debian.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-16netdev-genl: Hold rcu_read_lock in napi_setJoe Damato1-0/+2
Hold rcu_read_lock during netdev_nl_napi_set_doit, which calls napi_by_id and requires rcu_read_lock to be held. Closes: https://lore.kernel.org/netdev/719083c2-e277-447b-b6ea-ca3acb293a03@redhat.com/ Fixes: 1287c1ae0fc2 ("netdev-genl: Support setting per-NAPI config values") Signed-off-by: Joe Damato <jdamato@fastly.com> Link: https://patch.msgid.link/20241114175600.18882-1-jdamato@fastly.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-16netdev-genl: Hold rcu_read_lock in napi_getJoe Damato1-0/+2
Hold rcu_read_lock in netdev_nl_napi_get_doit, which calls napi_by_id and is required to be called under rcu_read_lock. Cc: stable@vger.kernel.org Fixes: 27f91aaf49b3 ("netdev-genl: Add netlink framework functions for napi") Signed-off-by: Joe Damato <jdamato@fastly.com> Link: https://patch.msgid.link/20241114175157.16604-1-jdamato@fastly.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-15Merge tag 'for-netdev' of ↵Jakub Kicinski1-37/+51
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Martin KaFai Lau says: ==================== pull-request: bpf-next 2024-11-14 We've added 9 non-merge commits during the last 4 day(s) which contain a total of 3 files changed, 226 insertions(+), 84 deletions(-). The main changes are: 1) Fixes to bpf_msg_push/pop_data and test_sockmap. The changes has dependency on the other changes in the bpf-next/net branch, from Zijian Zhang. 2) Drop netns codes from mptcp test. Reuse the common helpers in test_progs, from Geliang Tang. * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: bpf, sockmap: Fix sk_msg_reset_curr bpf, sockmap: Several fixes to bpf_msg_pop_data bpf, sockmap: Several fixes to bpf_msg_push_data selftests/bpf: Add more tests for test_txmsg_push_pop in test_sockmap selftests/bpf: Add push/pop checking for msg_verify_data in test_sockmap selftests/bpf: Fix total_bytes in msg_loop_rx in test_sockmap selftests/bpf: Fix SENDPAGE data logic in test_sockmap selftests/bpf: Add txmsg_pass to pull/push/pop in test_sockmap selftests/bpf: Drop netns helpers in mptcp ==================== Link: https://patch.msgid.link/20241114202832.3187927-1-martin.lau@linux.dev Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-15bpf: lwtunnel: Prepare bpf_lwt_xmit_reroute() to future .flowi4_tos conversion.Guillaume Nault1-1/+1
Use ip4h_dscp() to get the DSCP from the IPv4 header, then convert the dscp_t value to __u8 with inet_dscp_to_dsfield(). Then, when we'll convert .flowi4_tos to dscp_t, we'll just have to drop the inet_dscp_to_dsfield() call. Signed-off-by: Guillaume Nault <gnault@redhat.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/8338a12377c44f698a651d1ce357dd92bdf18120.1731064982.git.gnault@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-15bpf: ipv4: Prepare __bpf_redirect_neigh_v4() to future .flowi4_tos conversion.Guillaume Nault1-1/+1
Use ip4h_dscp() to get the DSCP from the IPv4 header, then convert the dscp_t value to __u8 with inet_dscp_to_dsfield(). Then, when we'll convert .flowi4_tos to dscp_t, we'll just have to drop the inet_dscp_to_dsfield() call. Signed-off-by: Guillaume Nault <gnault@redhat.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/35eacc8955003e434afb1365d404193cc98a9579.1731064982.git.gnault@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-14Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski2-19/+25
Cross-merge networking fixes after downstream PR (net-6.12-rc8). Conflicts: tools/testing/selftests/net/.gitignore 252e01e68241 ("selftests: net: add netlink-dumps to .gitignore") be43a6b23829 ("selftests: ncdevmem: Move ncdevmem under drivers/net/hw") https://lore.kernel.org/all/20241113122359.1b95180a@canb.auug.org.au/ drivers/net/phy/phylink.c 671154f174e0 ("net: phylink: ensure PHY momentary link-fails are handled") 7530ea26c810 ("net: phylink: remove "using_mac_select_pcs"") Adjacent changes: drivers/net/ethernet/stmicro/stmmac/dwmac-intel-plat.c 5b366eae7193 ("stmmac: dwmac-intel-plat: fix call balance of tx_clk handling routines") e96321fad3ad ("net: ethernet: Switch back to struct platform_driver::remove()") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-14Merge tag 'net-6.12-rc8' of ↵Linus Torvalds1-18/+24
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Paolo Abeni: "Including fixes from bluetooth. Quite calm week. No new regression under investigation. Current release - regressions: - eth: revert "igb: Disable threaded IRQ for igb_msix_other" Current release - new code bugs: - bluetooth: btintel: direct exception event to bluetooth stack Previous releases - regressions: - core: fix data-races around sk->sk_forward_alloc - netlink: terminate outstanding dump on socket close - mptcp: error out earlier on disconnect - vsock: fix accept_queue memory leak - phylink: ensure PHY momentary link-fails are handled - eth: mlx5: - fix null-ptr-deref in add rule err flow - lock FTE when checking if active - eth: dwmac-mediatek: fix inverted handling of mediatek,mac-wol Previous releases - always broken: - sched: fix u32's systematic failure to free IDR entries for hnodes. - sctp: fix possible UAF in sctp_v6_available() - eth: bonding: add ns target multicast address to slave device - eth: mlx5: fix msix vectors to respect platform limit - eth: icssg-prueth: fix 1 PPS sync" * tag 'net-6.12-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (38 commits) net: sched: u32: Add test case for systematic hnode IDR leaks selftests: bonding: add ns multicast group testing bonding: add ns target multicast address to slave device net: ti: icssg-prueth: Fix 1 PPS sync stmmac: dwmac-intel-plat: fix call balance of tx_clk handling routines net: Make copy_safe_from_sockptr() match documentation net: stmmac: dwmac-mediatek: Fix inverted handling of mediatek,mac-wol ipmr: Fix access to mfc_cache_list without lock held samples: pktgen: correct dev to DEV net: phylink: ensure PHY momentary link-fails are handled mptcp: pm: use _rcu variant under rcu_read_lock mptcp: hold pm lock when deleting entry mptcp: update local address flags when setting it net: sched: cls_u32: Fix u32's systematic failure to free IDR entries for hnodes. MAINTAINERS: Re-add cancelled Renesas driver sections Revert "igb: Disable threaded IRQ for igb_msix_other" Bluetooth: btintel: Direct exception event to bluetooth stack Bluetooth: hci_core: Fix calling mgmt_device_connected virtio/vsock: Improve MSG_ZEROCOPY error handling vsock: Fix sk_error_queue memory leak ...
2024-11-13Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfAlexei Starovoitov4-3/+11
Cross-merge bpf fixes after downstream PR. In particular to bring the fix in commit aa30eb3260b2 ("bpf: Force checkpoint when jmp history is too long"). The follow up verifier work depends on it. And the fix in commit 6801cf7890f2 ("selftests/bpf: Use -4095 as the bad address for bits iterator"). It's fixing instability of BPF CI on s390 arch. No conflicts. Adjacent changes in: Auto-merging arch/Kconfig Auto-merging kernel/bpf/helpers.c Auto-merging kernel/bpf/memalloc.c Auto-merging kernel/bpf/verifier.c Auto-merging mm/slab_common.c Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-11-13net: page_pool: do not count normal frag allocation in statsJakub Kicinski1-1/+1
Commit 0f6deac3a079 ("net: page_pool: add page allocation stats for two fast page allocate path") added increments for "fast path" allocation to page frag alloc. It mentions performance degradation analysis but the details are unclear. Could be that the author was simply surprised by the alloc stats not matching packet count. In my experience the key metric for page pool is the recycling rate. Page return stats, however, count returned _pages_ not frags. This makes it impossible to calculate recycling rate for drivers using the frag API. Here is example output of the page-pool YNL sample for a driver allocating 1200B frags (4k pages) with nearly perfect recycling: $ ./page-pool eth0[2] page pools: 32 (zombies: 0) refs: 291648 bytes: 1194590208 (refs: 0 bytes: 0) recycling: 33.3% (alloc: 4557:2256365862 recycle: 200476245:551541893) The recycling rate is reported as 33.3% because we give out 4096 // 1200 = 3 frags for every recycled page. Effectively revert the aforementioned commit. This also aligns with the stats we would see for drivers which do the fragmentation themselves, although that's not a strong reason in itself. On the (very unlikely) path where we can reuse the current page let's bump the "cached" stat. The fact that we don't put the page in the cache is just an optimization. Acked-by: Jesper Dangaard Brouer <hawk@kernel.org> Reviewed-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com> Acked-by: Ilias Apalodimas <ilias.apalodimas@linaro.org> Link: https://patch.msgid.link/20241109023303.3366500-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-12net: Implement fault injection forcing skb reallocationBreno Leitao2-0/+107
Introduce a fault injection mechanism to force skb reallocation. The primary goal is to catch bugs related to pointer invalidation after potential skb reallocation. The fault injection mechanism aims to identify scenarios where callers retain pointers to various headers in the skb but fail to reload these pointers after calling a function that may reallocate the data. This type of bug can lead to memory corruption or crashes if the old, now-invalid pointers are used. By forcing reallocation through fault injection, we can stress-test code paths and ensure proper pointer management after potential skb reallocations. Add a hook for fault injection in the following functions: * pskb_trim_rcsum() * pskb_may_pull_reason() * pskb_trim() As the other fault injection mechanism, protect it under a debug Kconfig called CONFIG_FAIL_SKB_REALLOC. This patch was *heavily* inspired by Jakub's proposal from: https://lore.kernel.org/all/20240719174140.47a868e6@kernel.org/ CC: Akinobu Mita <akinobu.mita@gmail.com> Suggested-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Breno Leitao <leitao@debian.org> Reviewed-by: Akinobu Mita <akinobu.mita@gmail.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Acked-by: Guillaume Nault <gnault@redhat.com> Link: https://patch.msgid.link/20241107-fault_v6-v6-1-1b82cb6ecacd@debian.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-11-12net: ip: make ip_route_input_noref() return drop reasonsMenglong Dong1-2/+4
In this commit, we make ip_route_input_noref() return drop reasons, which come from ip_route_input_rcu(). We need adjust the callers of ip_route_input_noref() to make sure the return value of ip_route_input_noref() is used properly. The errno that ip_route_input_noref() returns comes from ip_route_input and bpf_lwt_input_reroute in the origin logic, and we make them return -EINVAL on error instead. In the following patch, we will make ip_route_input() returns drop reasons too. Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-11-12net: Add control functions for irq suspensionMartin Karsten1-0/+37
The napi_suspend_irqs routine bootstraps irq suspension by elongating the defer timeout to irq_suspend_timeout. The napi_resume_irqs routine effectively cancels irq suspension by forcing the napi to be scheduled immediately. Signed-off-by: Martin Karsten <mkarsten@uwaterloo.ca> Co-developed-by: Joe Damato <jdamato@fastly.com> Signed-off-by: Joe Damato <jdamato@fastly.com> Tested-by: Joe Damato <jdamato@fastly.com> Tested-by: Martin Karsten <mkarsten@uwaterloo.ca> Acked-by: Stanislav Fomichev <sdf@fomichev.me> Reviewed-by: Sridhar Samudrala <sridhar.samudrala@intel.com> Link: https://patch.msgid.link/20241109050245.191288-3-jdamato@fastly.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-12net: Add napi_struct parameter irq_suspend_timeoutMartin Karsten4-2/+42
Add a per-NAPI IRQ suspension parameter, which can be get/set with netdev-genl. This patch doesn't change any behavior but prepares the code for other changes in the following commits which use irq_suspend_timeout as a timeout for IRQ suspension. Signed-off-by: Martin Karsten <mkarsten@uwaterloo.ca> Co-developed-by: Joe Damato <jdamato@fastly.com> Signed-off-by: Joe Damato <jdamato@fastly.com> Tested-by: Joe Damato <jdamato@fastly.com> Tested-by: Martin Karsten <mkarsten@uwaterloo.ca> Acked-by: Stanislav Fomichev <sdf@fomichev.me> Reviewed-by: Sridhar Samudrala <sridhar.samudrala@intel.com> Link: https://patch.msgid.link/20241109050245.191288-2-jdamato@fastly.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-12net: fix SO_DEVMEM_DONTNEED looping too longMina Almasry1-18/+24
Exit early if we're freeing more than 1024 frags, to prevent looping too long. Also minor code cleanups: - Flip checks to reduce indentation. - Use sizeof(*tokens) everywhere for consistentcy. Cc: Yi Lai <yi1.lai@linux.intel.com> Signed-off-by: Mina Almasry <almasrymina@google.com> Acked-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20241107210331.3044434-1-almasrymina@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-12rtnetlink: Register rtnl_dellink() and rtnl_setlink() with ↵Kuniyuki Iwashima1-3/+16
RTNL_FLAG_DOIT_PERNET_WIP. Currently, rtnl_setlink() and rtnl_dellink() cannot be fully converted to per-netns RTNL due to a lack of handling peer/lower/upper devices in different netns. For example, when we change a device in rtnl_setlink() and need to propagate that to its upper devices, we want to avoid acquiring all netns locks, for which we do not know the upper limit. The same situation happens when we remove a device. rtnl_dellink() could be transformed to remove a single device in the requested netns and delegate other devices to per-netns work, and rtnl_setlink() might be ? Until we come up with a better idea, let's use a new flag RTNL_FLAG_DOIT_PERNET_WIP for rtnl_dellink() and rtnl_setlink(). This will unblock converting RTNL users where such devices are not related. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20241108004823.29419-11-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-12rtnetlink: Convert RTM_NEWLINK to per-netns RTNL.Kuniyuki Iwashima1-3/+24
Now, we are ready to convert rtnl_newlink() to per-netns RTNL; rtnl_link_ops is protected by SRCU and netns is prefetched in rtnl_newlink(). Let's register rtnl_newlink() with RTNL_FLAG_DOIT_PERNET and push RTNL down as rtnl_nets_lock(). Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://patch.msgid.link/20241108004823.29419-10-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-12rtnetlink: Add peer_type in struct rtnl_link_ops.Kuniyuki Iwashima1-4/+51
In ops->newlink(), veth, vxcan, and netkit call rtnl_link_get_net() with a net pointer, which is the first argument of ->newlink(). rtnl_link_get_net() could return another netns based on IFLA_NET_NS_PID and IFLA_NET_NS_FD in the peer device's attributes. We want to get it and fill rtnl_nets->nets[] in advance in rtnl_newlink() for per-netns RTNL. All of the three get the peer netns in the same way: 1. Call rtnl_nla_parse_ifinfomsg() 2. Call ops->validate() (vxcan doesn't have) 3. Call rtnl_link_get_net_tb() Let's add a new field peer_type to struct rtnl_link_ops and prefetch netns in the peer ifla to add it to rtnl_nets in rtnl_newlink(). Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://patch.msgid.link/20241108004823.29419-6-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-12rtnetlink: Introduce struct rtnl_nets and helpers.Kuniyuki Iwashima1-3/+67
rtnl_newlink() needs to hold 3 per-netns RTNL: 2 for a new device and 1 for its peer. We will add rtnl_nets_lock() later, which performs the nested locking based on struct rtnl_nets, which has an array of struct net pointers. rtnl_nets_add() adds a net pointer to the array and sorts it so that rtnl_nets_lock() can simply acquire per-netns RTNL from array[0] to [2]. Before calling rtnl_nets_add(), get_net() must be called for the net, and rtnl_nets_destroy() will call put_net() for each. Let's apply the helpers to rtnl_newlink(). When CONFIG_DEBUG_NET_SMALL_RTNL is disabled, we do not call rtnl_net_lock() thus do not care about the array order, so rtnl_net_cmp_locks() returns -1 so that the loop in rtnl_nets_add() can be optimised to NOP. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://patch.msgid.link/20241108004823.29419-5-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-12rtnetlink: Remove __rtnl_link_register()Kuniyuki Iwashima2-29/+7
link_ops is protected by link_ops_mutex and no longer needs RTNL, so we have no reason to have __rtnl_link_register() separately. Let's remove it and call rtnl_link_register() from ifb.ko and dummy.ko. Note that both modules' init() work on init_net only, so we need not export pernet_ops_rwsem and can use rtnl_net_lock() there. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20241108004823.29419-4-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-12rtnetlink: Protect link_ops by mutex.Kuniyuki Iwashima1-13/+20
rtnl_link_unregister() holds RTNL and calls synchronize_srcu(), but rtnl_newlink() will acquire SRCU frist and then RTNL. Then, we need to unlink ops and call synchronize_srcu() outside of RTNL to avoid the deadlock. rtnl_link_unregister() rtnl_newlink() ---- ---- lock(rtnl_mutex); lock(&ops->srcu); lock(rtnl_mutex); sync(&ops->srcu); Let's move as such and add a mutex to protect link_ops. Now, link_ops is protected by its dedicated mutex and rtnl_link_register() no longer needs to hold RTNL. While at it, we move the initialisation of ops->dellink and ops->srcu out of the mutex scope. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20241108004823.29419-3-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-12rtnetlink: Remove __rtnl_link_unregister().Kuniyuki Iwashima1-22/+10
rtnl_link_unregister() holds RTNL and calls __rtnl_link_unregister(), where we call synchronize_srcu() to wait inflight RTM_NEWLINK requests for per-netns RTNL. We put synchronize_srcu() in __rtnl_link_unregister() due to ifb.ko and dummy.ko. However, rtnl_newlink() will acquire SRCU before RTNL later in this series. Then, lockdep will detect the deadlock: rtnl_link_unregister() rtnl_newlink() ---- ---- lock(rtnl_mutex); lock(&ops->srcu); lock(rtnl_mutex); sync(&ops->srcu); To avoid the problem, we must call synchronize_srcu() before RTNL in rtnl_link_unregister(). As a preparation, let's remove __rtnl_link_unregister(). Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20241108004823.29419-2-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-11mm: page_frag: avoid caller accessing 'page_frag_cache' directlyYunsheng Lin1-3/+3
Use appropriate frag_page API instead of caller accessing 'page_frag_cache' directly. CC: Andrew Morton <akpm@linux-foundation.org> CC: Linux-MM <linux-mm@kvack.org> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> Reviewed-by: Alexander Duyck <alexanderduyck@fb.com> Acked-by: Chuck Lever <chuck.lever@oracle.com> Link: https://patch.msgid.link/20241028115343.3405838-5-linyunsheng@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-11net: convert to nla_get_*_default()Johannes Berg2-6/+2
Most of the original conversion is from the spatch below, but I edited some and left out other instances that were either buggy after conversion (where default values don't fit into the type) or just looked strange. @@ expression attr, def; expression val; identifier fn =~ "^nla_get_.*"; fresh identifier dfn = fn ## "_default"; @@ ( -if (attr) - val = fn(attr); -else - val = def; +val = dfn(attr, def); | -if (!attr) - val = def; -else - val = fn(attr); +val = dfn(attr, def); | -if (!attr) - return def; -return fn(attr); +return dfn(attr, def); | -attr ? fn(attr) : def +dfn(attr, def) | -!attr ? def : fn(attr) +dfn(attr, def) ) Signed-off-by: Johannes Berg <johannes.berg@intel.com> Reviewed-by: Toke Høiland-Jørgensen <toke@kernel.org> Link: https://patch.msgid.link/20241108114145.0580b8684e7f.I740beeaa2f70ebfc19bfca1045a24d6151992790@changeid Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-10neighbour: Create netdev->neighbour associationGilad Naaman1-38/+58
Create a mapping between a netdev and its neighoburs, allowing for much cheaper flushes. Signed-off-by: Gilad Naaman <gnaaman@drivenets.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20241107160444.2913124-7-gnaaman@drivenets.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-10neighbour: Remove bare neighbour::next pointerGilad Naaman1-80/+10
Remove the now-unused neighbour::next pointer, leaving struct neighbour solely with the hlist_node implementation. Signed-off-by: Gilad Naaman <gnaaman@drivenets.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20241107160444.2913124-6-gnaaman@drivenets.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-10neighbour: Convert iteration to use hlist+macroGilad Naaman1-29/+18
Remove all usage of the bare neighbour::next pointer, replacing them with neighbour::hash and its for_each macro. Signed-off-by: Gilad Naaman <gnaaman@drivenets.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20241107160444.2913124-5-gnaaman@drivenets.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-10neighbour: Convert seq_file functions to use hlistGilad Naaman1-56/+48
Convert seq_file-related neighbour functionality to use neighbour::hash and the related for_each macro. Signed-off-by: Gilad Naaman <gnaaman@drivenets.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20241107160444.2913124-4-gnaaman@drivenets.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-10neighbour: Add hlist_node to struct neighbourGilad Naaman1-1/+19
Add a doubly-linked node to neighbours, so that they can be deleted without iterating the entire bucket they're in. Signed-off-by: Gilad Naaman <gnaaman@drivenets.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20241107160444.2913124-2-gnaaman@drivenets.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-08bpf: Fix mismatched RCU unlock flavour in bpf_out_neigh_v6Jiawei Ye1-1/+1
In the bpf_out_neigh_v6 function, rcu_read_lock() is used to begin an RCU read-side critical section. However, when unlocking, one branch incorrectly uses a different RCU unlock flavour rcu_read_unlock_bh() instead of rcu_read_unlock(). This mismatch in RCU locking flavours can lead to unexpected behavior and potential concurrency issues. This possible bug was identified using a static analysis tool developed by myself, specifically designed to detect RCU-related issues. This patch corrects the mismatched unlock flavour by replacing the incorrect rcu_read_unlock_bh() with the appropriate rcu_read_unlock(), ensuring that the RCU critical section is properly exited. This change prevents potential synchronization issues and aligns with proper RCU usage patterns. Fixes: 09eed1192cec ("neighbour: switch to standard rcu, instead of rcu_bh") Signed-off-by: Jiawei Ye <jiawei.ye@foxmail.com> Acked-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/tencent_CFD3D1C3D68B45EA9F52D8EC76D2C4134306@qq.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2024-11-07net: pktgen: Switch to use hrtimer_setup_sleeper_on_stack()Nam Cao1-1/+1
hrtimer_setup_sleeper_on_stack() replaces hrtimer_init_sleeper_on_stack() to keep the naming convention consistent. Convert the usage site over to it. The conversion was done with Coccinelle. Signed-off-by: Nam Cao <namcao@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/c4b40b8fef250b6a325e1b8bd6057005fb3cb660.1730386209.git.namcao@linutronix.de
2024-11-07bpf, sockmap: Fix sk_msg_reset_currZijian Zhang1-11/+9
Found in the test_txmsg_pull in test_sockmap, ``` txmsg_cork = 512; // corking is importrant here opt->iov_length = 3; opt->iov_count = 1; opt->rate = 512; // sendmsg will be invoked 512 times ``` The first sendmsg will send an sk_msg with size 3, and bpf_msg_pull_data will be invoked the first time. sk_msg_reset_curr will reset the copybreak from 3 to 0. In the second sendmsg, since we are in the stage of corking, psock->cork will be reused in func sk_msg_alloc. msg->sg.copybreak is 0 now, the second msg will overwrite the first msg. As a result, we could not pass the data integrity test. The same problem happens in push and pop test. Thus, fix sk_msg_reset_curr to restore the correct copybreak. Fixes: bb9aefde5bba ("bpf: sockmap, updating the sg structure should also update curr") Signed-off-by: Zijian Zhang <zijianzhang@bytedance.com> Link: https://lore.kernel.org/r/20241106222520.527076-9-zijianzhang@bytedance.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2024-11-07bpf, sockmap: Several fixes to bpf_msg_pop_dataZijian Zhang1-6/+9
Several fixes to bpf_msg_pop_data, 1. In sk_msg_shift_left, we should put_page 2. if (len == 0), return early is better 3. pop the entire sk_msg (last == msg->sg.size) should be supported 4. Fix for the value of variable "a" 5. In sk_msg_shift_left, after shifting, i has already pointed to the next element. Addtional sk_msg_iter_var_next may result in BUG. Fixes: 7246d8ed4dcc ("bpf: helper to pop data from messages") Signed-off-by: Zijian Zhang <zijianzhang@bytedance.com> Reviewed-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/r/20241106222520.527076-8-zijianzhang@bytedance.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2024-11-07bpf, sockmap: Several fixes to bpf_msg_push_dataZijian Zhang1-20/+33
Several fixes to bpf_msg_push_data, 1. test_sockmap has tests where bpf_msg_push_data is invoked to push some data at the end of a message, but -EINVAL is returned. In this case, in bpf_msg_push_data, after the first loop, i will be set to msg->sg.end, add the logic to handle it. 2. In the code block of "if (start - offset)", it's possible that "i" points to the last of sk_msg_elem. In this case, "sk_msg_iter_next(msg, end)" might still be called twice, another invoking is in "if (!copy)" code block, but actually only one is needed. Add the logic to handle it, and reconstruct the code to make the logic more clear. Fixes: 6fff607e2f14 ("bpf: sk_msg program helper bpf_msg_push_data") Signed-off-by: Zijian Zhang <zijianzhang@bytedance.com> Link: https://lore.kernel.org/r/20241106222520.527076-7-zijianzhang@bytedance.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2024-11-04Merge tag 'for-netdev' of ↵Jakub Kicinski1-28/+11
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Daniel Borkmann says: ==================== pull-request: bpf-next 2024-10-31 We've added 13 non-merge commits during the last 16 day(s) which contain a total of 16 files changed, 710 insertions(+), 668 deletions(-). The main changes are: 1) Optimize and homogenize bpf_csum_diff helper for all archs and also add a batch of new BPF selftests for it, from Puranjay Mohan. 2) Rewrite and migrate the test_tcp_check_syncookie.sh BPF selftest into test_progs so that it can be run in BPF CI, from Alexis Lothoré. 3) Two BPF sockmap selftest fixes, from Zijian Zhang. 4) Small XDP synproxy BPF selftest cleanup to remove IP_DF check, from Vincent Li. * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: selftests/bpf: Add a selftest for bpf_csum_diff() selftests/bpf: Don't mask result of bpf_csum_diff() in test_verifier bpf: bpf_csum_diff: Optimize and homogenize for all archs net: checksum: Move from32to16() to generic header selftests/bpf: remove xdp_synproxy IP_DF check selftests/bpf: remove test_tcp_check_syncookie selftests/bpf: test MSS value returned with bpf_tcp_gen_syncookie selftests/bpf: add ipv4 and dual ipv4/ipv6 support in btf_skc_cls_ingress selftests/bpf: get rid of global vars in btf_skc_cls_ingress selftests/bpf: add missing ns cleanups in btf_skc_cls_ingress selftests/bpf: factorize conn and syncookies tests in a single runner selftests/bpf: Fix txmsg_redir of test_txmsg_pull in test_sockmap selftests/bpf: Fix msg_verify_data in test_sockmap ==================== Link: https://patch.msgid.link/20241031221543.108853-1-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-03fdget(), trivial conversionsAl Viro1-6/+4
fdget() is the first thing done in scope, all matching fdput() are immediately followed by leaving the scope. Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-11-01Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski4-29/+25
Cross-merge networking fixes after downstream PR (net-6.12-rc6). Conflicts: drivers/net/wireless/intel/iwlwifi/mvm/mld-mac80211.c cbe84e9ad5e2 ("wifi: iwlwifi: mvm: really send iwl_txpower_constraints_cmd") 188a1bf89432 ("wifi: mac80211: re-order assigning channel in activate links") https://lore.kernel.org/all/20241028123621.7bbb131b@canb.auug.org.au/ net/mac80211/cfg.c c4382d5ca1af ("wifi: mac80211: update the right link for tx power") 8dd0498983ee ("wifi: mac80211: Fix setting txpower with emulate_chanctx") drivers/net/ethernet/intel/ice/ice_ptp_hw.h 6e58c3310622 ("ice: fix crash on probe for DPLL enabled E810 LOM") e4291b64e118 ("ice: Align E810T GPIO to other products") ebb2693f8fbd ("ice: Read SDP section from NVM for pin definitions") ac532f4f4251 ("ice: Cleanup unused declarations") https://lore.kernel.org/all/20241030120524.1ee1af18@canb.auug.org.au/ No adjacent changes. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-01Merge tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfLinus Torvalds1-0/+4
Pull bpf fixes from Daniel Borkmann: - Fix BPF verifier to force a checkpoint when the program's jump history becomes too long (Eduard Zingerman) - Add several fixes to the BPF bits iterator addressing issues like memory leaks and overflow problems (Hou Tao) - Fix an out-of-bounds write in trie_get_next_key (Byeonguk Jeong) - Fix BPF test infra's LIVE_FRAME frame update after a page has been recycled (Toke Høiland-Jørgensen) - Fix BPF verifier and undo the 40-bytes extra stack space for bpf_fastcall patterns due to various bugs (Eduard Zingerman) - Fix a BPF sockmap race condition which could trigger a NULL pointer dereference in sock_map_link_update_prog (Cong Wang) - Fix tcp_bpf_recvmsg_parser to retrieve seq_copied from tcp_sk under the socket lock (Jiayuan Chen) * tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: bpf, test_run: Fix LIVE_FRAME frame update after a page has been recycled selftests/bpf: Add three test cases for bits_iter bpf: Use __u64 to save the bits in bits iterator bpf: Check the validity of nr_words in bpf_iter_bits_new() bpf: Add bpf_mem_alloc_check_size() helper bpf: Free dynamically allocated bits in bpf_iter_bits_destroy() bpf: disallow 40-bytes extra stack for bpf_fastcall patterns selftests/bpf: Add test for trie_get_next_key() bpf: Fix out-of-bounds write in trie_get_next_key() selftests/bpf: Test with a very short loop bpf: Force checkpoint when jmp history is too long bpf: fix filed access without lock sock_map: fix a NULL pointer dereference in sock_map_link_update_prog()
2024-11-01Merge tag 'net-6.12-rc6' of ↵Linus Torvalds2-2/+6
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Paolo Abeni: "Including fixes from WiFi, bluetooth and netfilter. No known new regressions outstanding. Current release - regressions: - wifi: mt76: do not increase mcu skb refcount if retry is not supported Current release - new code bugs: - wifi: - rtw88: fix the RX aggregation in USB 3 mode - mac80211: fix memory corruption bug in struct ieee80211_chanctx Previous releases - regressions: - sched: - stop qdisc_tree_reduce_backlog on TC_H_ROOT - sch_api: fix xa_insert() error path in tcf_block_get_ext() - wifi: - revert "wifi: iwlwifi: remove retry loops in start" - cfg80211: clear wdev->cqm_config pointer on free - netfilter: fix potential crash in nf_send_reset6() - ip_tunnel: fix suspicious RCU usage warning in ip_tunnel_find() - bluetooth: fix null-ptr-deref in hci_read_supported_codecs - eth: mlxsw: add missing verification before pushing Tx header - eth: hns3: fixed hclge_fetch_pf_reg accesses bar space out of bounds issue Previous releases - always broken: - wifi: mac80211: do not pass a stopped vif to the driver in .get_txpower - netfilter: sanitize offset and length before calling skb_checksum() - core: - fix crash when config small gso_max_size/gso_ipv4_max_size - skip offload for NETIF_F_IPV6_CSUM if ipv6 header contains extension - mptcp: protect sched with rcu_read_lock - eth: ice: fix crash on probe for DPLL enabled E810 LOM - eth: macsec: fix use-after-free while sending the offloading packet - eth: stmmac: fix unbalanced DMA map/unmap for non-paged SKB data - eth: hns3: fix kernel crash when 1588 is sent on HIP08 devices - eth: mtk_wed: fix path of MT7988 WO firmware" * tag 'net-6.12-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (70 commits) net: hns3: fix kernel crash when 1588 is sent on HIP08 devices net: hns3: fixed hclge_fetch_pf_reg accesses bar space out of bounds issue net: hns3: initialize reset_timer before hclgevf_misc_irq_init() net: hns3: don't auto enable misc vector net: hns3: Resolved the issue that the debugfs query result is inconsistent. net: hns3: fix missing features due to dev->features configuration too early net: hns3: fixed reset failure issues caused by the incorrect reset type net: hns3: add sync command to sync io-pgtable net: hns3: default enable tx bounce buffer when smmu enabled netfilter: nft_payload: sanitize offset and length before calling skb_checksum() net: ethernet: mtk_wed: fix path of MT7988 WO firmware selftests: forwarding: Add IPv6 GRE remote change tests mlxsw: spectrum_ipip: Fix memory leak when changing remote IPv6 address mlxsw: pci: Sync Rx buffers for device mlxsw: pci: Sync Rx buffers for CPU mlxsw: spectrum_ptp: Add missing verification before pushing Tx header net: skip offload for NETIF_F_IPV6_CSUM if ipv6 header contains extension Bluetooth: hci: fix null-ptr-deref in hci_read_supported_codecs netfilter: nf_reject_ipv6: fix potential crash in nf_send_reset6() netfilter: Fix use-after-free in get_info() ...
2024-10-31rtnetlink: Fix an error handling path in rtnl_newlink()Christophe JAILLET1-2/+4
When some code has been moved in the commit in Fixes, some "return err;" have correctly been changed in goto <some_where_in_the_error_handling_path> but this one was missed. Should "ops->maxtype > RTNL_MAX_TYPE" happen, then some resources would leak. Go through the error handling path to fix these leaks. Fixes: 0d3008d1a9ae ("rtnetlink: Move ops->validate to rtnl_newlink().") Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/eca90eeb4d9e9a0545772b68aeaab883d9fe2279.1729952228.git.christophe.jaillet@wanadoo.fr Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-10-31net: skip offload for NETIF_F_IPV6_CSUM if ipv6 header contains extensionBenoît Monin1-0/+4
As documented in skbuff.h, devices with NETIF_F_IPV6_CSUM capability can only checksum TCP and UDP over IPv6 if the IP header does not contains extension. This is enforced for UDP packets emitted from user-space to an IPv6 address as they go through ip6_make_skb(), which calls __ip6_append_data() where a check is done on the header size before setting CHECKSUM_PARTIAL. But the introduction of UDP encapsulation with fou6 added a code-path where it is possible to get an skb with a partial UDP checksum and an IPv6 header with extension: * fou6 adds a UDP header with a partial checksum if the inner packet does not contains a valid checksum. * ip6_tunnel adds an IPv6 header with a destination option extension header if encap_limit is non-zero (the default value is 4). The thread linked below describes in more details how to reproduce the problem with GRE-in-UDP tunnel. Add a check on the network header size in skb_csum_hwoffload_help() to make sure no IPv6 packet with extension header is handed to a network device with NETIF_F_IPV6_CSUM capability. Link: https://lore.kernel.org/netdev/26548921.1r3eYUQgxm@benoit.monin/T/#u Fixes: aa3463d65e7b ("fou: Add encap ops for IPv6 tunnels") Signed-off-by: Benoît Monin <benoit.monin@gmx.fr> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/5fbeecfc311ea182aa1d1c771725ab8b4cac515e.1729778144.git.benoit.monin@gmx.fr Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-10-30bpf: bpf_csum_diff: Optimize and homogenize for all archsPuranjay Mohan1-28/+11
1. Optimization ------------ The current implementation copies the 'from' and 'to' buffers to a scratchpad and it takes the bitwise NOT of 'from' buffer while copying. In the next step csum_partial() is called with this scratchpad. so, mathematically, the current implementation is doing: result = csum(to - from) Here, 'to' and '~ from' are copied in to the scratchpad buffer, we need it in the scratchpad buffer because csum_partial() takes a single contiguous buffer and not two disjoint buffers like 'to' and 'from'. We can re write this equation to: result = csum(to) - csum(from) using the distributive property of csum(). this allows 'to' and 'from' to be at different locations and therefore this scratchpad and copying is not needed. This in C code will look like: result = csum_sub(csum_partial(to, to_size, seed), csum_partial(from, from_size, 0)); 2. Homogenization -------------- The bpf_csum_diff() helper calls csum_partial() which is implemented by some architectures like arm and x86 but other architectures rely on the generic implementation in lib/checksum.c The generic implementation in lib/checksum.c returns a 16 bit value but the arch specific implementations can return more than 16 bits, this works out in most places because before the result is used, it is passed through csum_fold() that turns it into a 16-bit value. bpf_csum_diff() directly returns the value from csum_partial() and therefore the returned values could be different on different architectures. see discussion in [1]: for the int value 28 the calculated checksums are: x86 : -29 : 0xffffffe3 generic (arm64, riscv) : 65507 : 0x0000ffe3 arm : 131042 : 0x0001ffe2 Pass the result of bpf_csum_diff() through from32to16() before returning to homogenize this result for all architectures. NOTE: from32to16() is used instead of csum_fold() because csum_fold() does from32to16() + bitwise NOT of the result, which is not what we want to do here. [1] https://lore.kernel.org/bpf/CAJ+HfNiQbOcqCLxFUP2FMm5QrLXUUaj852Fxe3hn_2JNiucn6g@mail.gmail.com/ Signed-off-by: Puranjay Mohan <puranjay@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20241026125339.26459-3-puranjay@kernel.org