summaryrefslogtreecommitdiff
AgeCommit message (Collapse)AuthorFilesLines
2025-03-25sfc: update MCDI protocol headersEdward Cree1-8814/+5028
Signed-off-by: Edward Cree <ecree.xilinx@gmail.com> Link: https://patch.msgid.link/bcb7597460a5a99d1dca4ef282f4aa2dd46ae545.1742493017.git.ecree.xilinx@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-25sfc: rip out MDIO supportEdward Cree5-99/+2
Unlike Siena, no EF10 board ever had an external PHY, and consequently MDIO handling isn't even built into the firmware. Since Siena has been split out into its own driver, the MDIO code can be deleted from the sfc driver. Signed-off-by: Edward Cree <ecree.xilinx@gmail.com> Link: https://patch.msgid.link/aa689d192ddaef7abe82709316c2be648a7bd66e.1742493017.git.ecree.xilinx@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-25net: reorganize IP MIB values (II)Eric Dumazet1-6/+6
Commit 14a196807482 ("net: reorganize IP MIB values") changed MIB values to group hot fields together. Since then 5 new fields have been added without caring about data locality. This patch moves IPSTATS_MIB_OUTPKTS, IPSTATS_MIB_NOECTPKTS, IPSTATS_MIB_ECT1PKTS, IPSTATS_MIB_ECT0PKTS, IPSTATS_MIB_CEPKTS to the hot portion of per-cpu data. Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250320101434.3174412-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-25tcp: avoid atomic operations on sk->sk_rmem_allocEric Dumazet4-6/+35
TCP uses generic skb_set_owner_r() and sock_rfree() for received packets, with socket lock being owned. Switch to private versions, avoiding two atomic operations per packet. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Neal Cardwell <ncardwell@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250320121604.3342831-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-25Merge branch 'nexthop-convert-rtm_-new-del-nexthop-to-per-netns-rtnl'Jakub Kicinski1-71/+112
Kuniyuki Iwashima says: ==================== nexthop: Convert RTM_{NEW,DEL}NEXTHOP to per-netns RTNL. Patch 1 - 5 move some validation for RTM_NEWNEXTHOP so that it can be called without RTNL. Patch 6 & 7 converts RTM_NEWNEXTHOP and RTM_DELNEXTHOP to per-netns RTNL. Note that RTM_GETNEXTHOP and RTM_GETNEXTHOPBUCKET are not touched in this series. rtm_get_nexthop() can be easily converted to RCU, but rtm_dump_nexthop() needs more work due to the left-to-right rbtree walk, which looks prone to node deletion and tree rotation without a retry mechanism. v1: https://lore.kernel.org/netdev/20250318233240.53946-1-kuniyu@amazon.com/ ==================== Link: https://patch.msgid.link/20250319230743.65267-1-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-25nexthop: Convert RTM_DELNEXTHOP to per-netns RTNL.Kuniyuki Iwashima1-5/+10
In rtm_del_nexthop(), only nexthop_find_by_id() and remove_nexthop() require RTNL as they touch net->nexthop.rb_root. Let's move RTNL down as rtnl_net_lock() before nexthop_find_by_id(). Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://patch.msgid.link/20250319230743.65267-8-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-25nexthop: Convert RTM_NEWNEXTHOP to per-netns RTNL.Kuniyuki Iwashima1-5/+11
If we pass false to the rtnl_held param of lwtunnel_valid_encap_type(), we can move RTNL down before rtm_to_nh_config_rtnl(). Let's use rtnl_net_lock() in rtm_new_nexthop(). Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://patch.msgid.link/20250319230743.65267-7-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-25nexthop: Remove redundant group len check in nexthop_create_group().Kuniyuki Iwashima1-3/+0
The number of NHA_GROUP entries is guaranteed to be non-zero in nh_check_attr_group(). Let's remove the redundant check in nexthop_create_group(). Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://patch.msgid.link/20250319230743.65267-6-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-25nexthop: Check NLM_F_REPLACE and NHA_ID in rtm_new_nexthop().Kuniyuki Iwashima1-5/+6
nexthop_add() checks if NLM_F_REPLACE is specified without non-zero NHA_ID, which does not require RTNL. Let's move the check to rtm_new_nexthop(). Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://patch.msgid.link/20250319230743.65267-5-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-25nexthop: Move NHA_OIF validation to rtm_to_nh_config_rtnl().Kuniyuki Iwashima1-20/+23
NHA_OIF needs to look up a device by __dev_get_by_index(), which requires RTNL. Let's move NHA_OIF validation to rtm_to_nh_config_rtnl(). Note that the proceeding checks made the original !cfg->nh_fdb check redundant. NHA_FDB is set -> NHA_OIF cannot be set NHA_FDB is set but false -> NHA_OIF must be set NHA_FDB is not set -> NHA_OIF must be set Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://patch.msgid.link/20250319230743.65267-4-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-25nexthop: Split nh_check_attr_group().Kuniyuki Iwashima1-21/+47
We will push RTNL down to rtm_new_nexthop(), and then we want to move non-RTNL operations out of the scope. nh_check_attr_group() validates NHA_GROUP attributes, and nexthop_find_by_id() and some validation requires RTNL. Let's factorise such parts as nh_check_attr_group_rtnl() and call it from rtm_to_nh_config_rtnl(). Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://patch.msgid.link/20250319230743.65267-3-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-25nexthop: Move nlmsg_parse() in rtm_to_nh_config() to rtm_new_nexthop().Kuniyuki Iwashima1-15/+18
We will split rtm_to_nh_config() into non-RTNL and RTNL parts, and then the latter also needs tb. As a prep, let's move nlmsg_parse() to rtm_new_nexthop(). Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: David Ahern <dsahern@kernel.org> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250319230743.65267-2-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-25ipv6: fix _DEVADD() and _DEVUPD() macrosEric Dumazet1-4/+7
ip6_rcv_core() is using: __IP6_ADD_STATS(net, idev, IPSTATS_MIB_NOECTPKTS + (ipv6_get_dsfield(hdr) & INET_ECN_MASK), max_t(unsigned short, 1, skb_shinfo(skb)->gso_segs)); This is currently evaluating both expressions twice. Fix _DEVADD() and _DEVUPD() macros to evaluate their arguments once. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250319212516.2385451-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-25Merge branch 'mlx5-misc-enhancements-2025-03-19'Jakub Kicinski6-38/+61
Tariq Toukan says: ==================== mlx5 misc enhancements 2025-03-19 This series introduces multiple small misc enhancements from the team to the mlx5 core and Eth drivers. ==================== Link: https://patch.msgid.link/1742392983-153050-1-git-send-email-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-25net/mlx5e: TC, Don't offload CT commit if it's the last actionJianbo Liu1-0/+11
For CT action with commit argument, it's usually followed by the forward action, either to the output netdev or next chain. The default behavior for software is to drop by setting action attribute to TC_ACT_SHOT instead of TC_ACT_PIPE if it's the last action. But driver can't handle it, so block the offload for such case. Signed-off-by: Jianbo Liu <jianbol@nvidia.com> Reviewed-by: Roi Dayan <roid@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/1742392983-153050-6-git-send-email-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-25net/mlx5e: CT: Filter legacy rules that are unrelated to nicPaul Blakey1-0/+29
In nic mode CT setup where we do hairpin between the two nics, both nics register to the same flow table (per zone), and try to offload all rules on it. Instead, filter the rules that originated from the relevant nic (so only one side is offloaded for each nic). Signed-off-by: Paul Blakey <paulb@nvidia.com> Reviewed-by: Jianbo Liu <jianbol@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/1742392983-153050-5-git-send-email-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-25net/mlx5: Update pfnum retrieval for devlink port attributesShay Drory2-3/+3
Align mlx5 driver usage of 'pfnum' with the documentation clarification introduced in commit bb70b0d48d8e ("devlink: Improve the port attributes description"). Signed-off-by: Shay Drory <shayd@nvidia.com> Reviewed-by: Mark Bloch <mbloch@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/1742392983-153050-4-git-send-email-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-25net/mlx5: fw reset, check bridge accessibility at earlier stageAmir Tzin1-6/+9
Currently, mlx5_is_reset_now_capable() checks whether the pci bridge is accessible only on bridge hot plug capability check. If the pci bridge is not accessible, reset now will fail regardless of bridge hotplug capability. Move this check to function mlx5_is_reset_now_capable() which, in such case, aborts the reset and does so in the request phase instead of the reset now phase. Signed-off-by: Aya Levin <ayal@nvidia.com> Signed-off-by: Moshe Shemesh <moshe@nvidia.com> Signed-off-by: Amir Tzin <amirtz@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/1742392983-153050-3-git-send-email-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-25net/mlx5: Lag, use port selection tables when availableMark Bloch1-29/+9
As queue affinity is being deprecated and will no longer be supported in the future, Always check for the presence of the port selection namespace. When available, leverage it to distribute traffic across the physical ports via steering, ensuring compatibility with future NICs. Signed-off-by: Mark Bloch <mbloch@nvidia.com> Reviewed-by: Maor Gottlieb <maorg@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/1742392983-153050-2-git-send-email-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-25net/mlx5e: TX, Utilize WQ fragments edge for multi-packet WQEsTariq Toukan5-9/+25
For simplicity reasons, the driver avoids crossing work queue fragment boundaries within the same TX WQE (Work-Queue Element). Until today, as the number of packets in a TX MPWQE (Multi-Packet WQE) descriptor is not known in advance, the driver pre-prepared contiguous memory for the largest possible WQE. For this, when getting too close to the fragment edge, having no room for the largest WQE possible, the driver was filling the fragment remainder with NOP descriptors, aligning the next descriptor to the beginning of the next fragment. Generating and handling these NOPs wastes resources, like: CPU cycles, work-queue entries fetched to the device, and PCI bandwidth. In this patch, we replace this NOPs filling mechanism in the TX MPWQE flow. Instead, we utilize the remaining entries of the fragment with a TX MPWQE. If this room turns out to be too small, we simply open an additional descriptor starting at the beginning of the next fragment. Performance benchmark: uperf test, single server against 3 clients. TCP multi-stream, bidir, traffic profile "2x350B read, 1400B write". Bottleneck is in inbound PCI bandwidth (device POV). +---------------+------------+------------+--------+ | | Before | After | | +---------------+------------+------------+--------+ | BW | 117.4 Gbps | 121.1 Gbps | +3.1% | +---------------+------------+------------+--------+ | tx_packets | 15 M/sec | 15.5 M/sec | +3.3% | +---------------+------------+------------+--------+ | tx_nops | 3 M/sec | 0 | -100% | +---------------+------------+------------+--------+ Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/1742391746-118647-1-git-send-email-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-25dql: Fix dql->limit value when reset.Jing Su1-1/+1
Executing dql_reset after setting a non-zero value for limit_min can lead to an unreasonable situation where dql->limit is less than dql->limit_min. For instance, after setting /sys/class/net/eth*/queues/tx-0/byte_queue_limits/limit_min, an ifconfig down/up operation might cause the ethernet driver to call netdev_tx_reset_queue, which in turn invokes dql_reset. In this case, dql->limit is reset to 0 while dql->limit_min remains non-zero value, which is unexpected. The limit should always be greater than or equal to limit_min. Signed-off-by: Jing Su <jingsusu@didiglobal.com> Link: https://patch.msgid.link/Z9qHD1s/NEuQBdgH@pilot-ThinkCentre-M930t-N000 Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-25Merge branch 'selftests-net-mixed-select-polling-mode-for-tcp-ao-tests'Jakub Kicinski12-345/+552
Dmitry Safonov via says: ==================== selftests/net: Mixed select()+polling mode for TCP-AO tests Should fix flaky tcp-ao/connect-deny-ipv6 test. v1: https://lore.kernel.org/20250312-tcp-ao-selftests-polling-v1-0-72a642b855d5@gmail.com ==================== Link: https://patch.msgid.link/20250319-tcp-ao-selftests-polling-v2-0-da48040153d1@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-25selftests/net: Drop timeout argument from test_client_verify()Dmitry Safonov9-25/+22
It's always TEST_TIMEOUT_SEC, with an unjustified exception in rst test, that is more paranoia-long timeout rather than based on requirements. Signed-off-by: Dmitry Safonov <0x7f454c46@gmail.com> Link: https://patch.msgid.link/20250319-tcp-ao-selftests-polling-v2-7-da48040153d1@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-25selftests/net: Delete timeout from test_connect_socket()Dmitry Safonov4-21/+13
Unused: it's always either the default timeout or asynchronous connect(). Signed-off-by: Dmitry Safonov <0x7f454c46@gmail.com> Link: https://patch.msgid.link/20250319-tcp-ao-selftests-polling-v2-6-da48040153d1@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-25selftests/net: Print the testing side in unsigned-md5Dmitry Safonov1-22/+22
As both client and server print the same test name on failure or pass, add "[server]" so that it's more obvious from a log which side printed "ok" or "not ok". Signed-off-by: Dmitry Safonov <0x7f454c46@gmail.com> Link: https://patch.msgid.link/20250319-tcp-ao-selftests-polling-v2-5-da48040153d1@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-25selftests/net: Add mixed select()+polling mode to TCP-AO testsDmitry Safonov5-67/+232
Currently, tcp_ao tests have two timeouts: TEST_RETRANSMIT_SEC and TEST_TIMEOUT_SEC [by default 1 and 5 seconds]. The first one, TEST_RETRANSMIT_SEC is used for operations that are expected to succeed in order for a test to pass. It is usually not consumed and exists only to avoid indefinite test run if the operation didn't complete. The second one, TEST_RETRANSMIT_SEC exists for the tests that checking operations, that are expected to fail/timeout. It is shorter as it is fully consumed, with an expectation that if operation didn't succeed during that period, it will timeout. And the related test that expects the timeout is passing. The actual operation failure is then cross-verified by other means like counters checks. The issue with TEST_RETRANSMIT_SEC timeout is that 1 second is the exact initial TCP timeout. So, in case the initial segment gets lost (quite unlikely on local veth interface between two net namespaces, yet happens in slow VMs), the retransmission never happens and as a result, the test is not actually testing the functionality. Which in the end fails counters checks. As I want tcp_ao selftests to be fast and finishing in a reasonable amount of time on manual run, I didn't consider increasing TEST_RETRANSMIT_SEC. Rather, initially, BPF_SOCK_OPS_TIMEOUT_INIT looked promising as a lever to make the initial TCP timeout shorter. But as it's not a socket bpf attached thing, but sock_ops (attaches to cgroups), the selftests would have to use libbpf, which I wanted to avoid if not absolutely required. Instead, use a mixed select() and counters polling mode with the longer TEST_TIMEOUT_SEC timeout to detect running-away failed tests. It actually not only allows losing segments and succeeding after the previous TEST_RETRANSMIT_SEC timeout was consumed, but makes the tests expecting timeout/failure pass faster. The only test case taking longer (TEST_TIMEOUT_SEC) now is connect-deny "wrong snd id", which checks for no key on SYN-ACK for which there is no counter in the kernel (see tcp_make_synack()). Yet it can be speed up by poking skpair from the trace event (see trace_tcp_ao_synack_no_key). Fixes: ed9d09b309b1 ("selftests/net: Add a test for TCP-AO keys matching") Reported-by: Jakub Kicinski <kuba@kernel.org> Closes: https://lore.kernel.org/netdev/20241205070656.6ef344d7@kernel.org/ Signed-off-by: Dmitry Safonov <0x7f454c46@gmail.com> Link: https://patch.msgid.link/20250319-tcp-ao-selftests-polling-v2-4-da48040153d1@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-25selftests/net: Fetch and check TCP-MD5 countersDmitry Safonov11-189/+219
There are related TCP-MD5 <=> TCP and TCP-MD5 <=> TCP-AO tests that can benefit from checking the related counters, not only from validating operations timeouts. It also prepares the code for introduction of mixed select()+poll mode, see the follow-up patches. Signed-off-by: Dmitry Safonov <0x7f454c46@gmail.com> Link: https://patch.msgid.link/20250319-tcp-ao-selftests-polling-v2-3-da48040153d1@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-25selftests/net: Provide tcp-ao counters comparison helperDmitry Safonov11-63/+89
Rename __test_tcp_ao_counters_cmp() into test_assert_counters_ao() and test_tcp_ao_key_counters_cmp() into test_assert_counters_key() as they are asserts, rather than just compare functions. Provide test_cmp_counters() helper, that's going to be used to compare ao_info and netns counters as a stop condition for polling the sockets. Signed-off-by: Dmitry Safonov <0x7f454c46@gmail.com> Link: https://patch.msgid.link/20250319-tcp-ao-selftests-polling-v2-2-da48040153d1@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-25selftests/net: Print TCP flags in more common formatDmitry Safonov1-5/+2
Before: ># 13145[lib/ftrace-tcp.c:427] trace event filter tcp_ao_key_not_found [2001:db8:1::1:-1 => 2001:db8:254::1:7010, L3index 0, flags: !FS!R!P!., keyid: 100, rnext: 100, maclen: -1, sne: -1] = 1 After: ># 13487[lib/ftrace-tcp.c:427] trace event filter tcp_ao_key_not_found [2001:db8:1::1:-1 => 2001:db8:254::1:7010, L3index 0, flags: S, keyid: 100, rnext: 100, maclen: -1, sne: -1] = 1 For the history, I think the initial format was to emphasize the absence of flags as well as their presence (!R meant no RST flag). But looking again, it's just unreadable and hard to understand. Make it the standard/expected one. Signed-off-by: Dmitry Safonov <0x7f454c46@gmail.com> Link: https://patch.msgid.link/20250319-tcp-ao-selftests-polling-v2-1-da48040153d1@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-25ynl: devlink: add missing board-serial-numberJiri Pirko1-0/+1
Add a missing attribute of board serial number. Signed-off-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com> Link: https://patch.msgid.link/20250320085947.103419-2-jiri@resnulli.us Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-25Merge branch 'net-xdp-add-missing-metadata-support-for-some-xdp-drvs'Jakub Kicinski9-14/+44
Lorenzo Bianconi says: ==================== net: xdp: Add missing metadata support for some xdp drvs Introduce missing metadata support for some xdp drivers setting metadata size building the skb from xdp_buff. Please note most of the drivers are just compile tested. v1: https://lore.kernel.org/20250311-mvneta-xdp-meta-v1-0-36cf1c99790e@kernel.org ==================== Link: https://patch.msgid.link/20250318-mvneta-xdp-meta-v2-0-b6075778f61f@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-25net: ti: cpsw: Add metadata support for xdp modeLorenzo Bianconi2-2/+10
Set metadata size building the skb from xdp_buff in cpsw/cpsw_new drivers. ti cpsw and cpsw_new drivers set xdp headroom at least to CPSW_HEADROOM_NA: CPSW_HEADROOM_NA max(XDP_PACKET_HEADROOM, NET_SKB_PAD) + NET_IP_ALIGN so the headroom is large enough to contain xdp_frame and xdp metadata. Please note this patch is just compiled tested. Reviewed-by: Michal Kubiak <michal.kubiak@intel.com> Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org> Link: https://patch.msgid.link/20250318-mvneta-xdp-meta-v2-7-b6075778f61f@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-25net: mana: Add metadata support for xdp modeLorenzo Bianconi2-1/+5
Set metadata size building the skb from xdp_buff in mana driver. mana driver sets xdp headroom to XDP_PACKET_HEADROOM so the headroom is large enough to contain xdp_frame and xdp metadata. Please note this patch is just compiled tested. Reviewed-by: Michal Kubiak <michal.kubiak@intel.com> Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org> Link: https://patch.msgid.link/20250318-mvneta-xdp-meta-v2-6-b6075778f61f@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-25net: ethernet: mediatek: Add metadata support for xdp modeLorenzo Bianconi1-2/+5
Set metadata size building the skb from xdp_buff in mediatek driver. mtk_eth_soc driver sets xdp headroom to XDP_PACKET_HEADROOM so the headroom is large enough to contain xdp_frame and xdp metadata. Please note this patch is just compiled tested. Reviewed-by: Michal Kubiak <michal.kubiak@intel.com> Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org> Link: https://patch.msgid.link/20250318-mvneta-xdp-meta-v2-5-b6075778f61f@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-25net: octeontx2: Add metadata support for xdp modeLorenzo Bianconi1-4/+9
Set metadata size building the skb from xdp_buff in octeontx2 driver. octeontx2 driver sets xdp headroom to OTX2_HEAD_ROOM OTX2_HEAD_ROOM OTX2_ALIGN OTX2_ALIGN 128 so the headroom is large enough to contain xdp_frame and xdp metadata. Please note this patch is just compiled tested. Reviewed-by: Michal Kubiak <michal.kubiak@intel.com> Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org> Link: https://patch.msgid.link/20250318-mvneta-xdp-meta-v2-4-b6075778f61f@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-25net: netsec: Add metadata support for xdp modeLorenzo Bianconi1-2/+5
Set metadata size building the skb from xdp_buff in netsec driver. netsec driver sets xdp headroom to NETSEC_RXBUF_HEADROOM: NETSEC_RXBUF_HEADROOM max(XDP_PACKET_HEADROOM, NET_SKB_PAD) + NET_IP_ALIGN so the headroom is large enough to contain xdp_frame and xdp metadata. Please note this patch is just compiled tested. Acked-by: Ilias Apalodimas <ilias.apalodimas@linaro.org> Reviewed-by: Michal Kubiak <michal.kubiak@intel.com> Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org> Link: https://patch.msgid.link/20250318-mvneta-xdp-meta-v2-3-b6075778f61f@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-25net: mvpp2: Add metadata support for xdp modeLorenzo Bianconi1-2/+6
Set metadata size building the skb from xdp_buff in mvpp2 driver mvpp2 driver sets xdp headroom to: MVPP2_MH_SIZE + MVPP2_SKB_HEADROOM where MVPP2_MH_SIZE 2 MVPP2_SKB_HEADROOM min(max(XDP_PACKET_HEADROOM, NET_SKB_PAD), 224) so the headroom is large enough to contain xdp_frame and xdp metadata. Please note this patch is just compiled tested. Reviewed-by: Michal Kubiak <michal.kubiak@intel.com> Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org> Link: https://patch.msgid.link/20250318-mvneta-xdp-meta-v2-2-b6075778f61f@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-25net: mvneta: Add metadata support for xdp modeLorenzo Bianconi1-1/+4
Set metadata size building the skb from xdp_buff in mvneta driver mvneta sets xdp headroom to: MVNETA_MH_SIZE + MVNETA_SKB_HEADROOM where MVNETA_MH_SIZE 2 MVNETA_SKB_HEADROOM max(NET_SKB_PAD, XDP_PACKET_HEADROOM) so the headroom is large enough to contain xdp_frame and xdp metadata. Reviewed-by: Michal Kubiak <michal.kubiak@intel.com> Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org> Link: https://patch.msgid.link/20250318-mvneta-xdp-meta-v2-1-b6075778f61f@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-25net: tulip: avoid unused variable warningSimon Horman1-5/+2
There is an effort to achieve W=1 kernel builds without warnings. As part of that effort Helge Deller highlighted the following warnings in the tulip driver when compiling with W=1 and CONFIG_TULIP_MWI=n: .../tulip_core.c: In function ‘tulip_init_one’: .../tulip_core.c:1309:22: warning: variable ‘force_csr0’ set but not used This patch addresses that problem using IS_ENABLED(). This approach has the added benefit of reducing conditionally compiled code. And thus increasing compile coverage. E.g. for allmodconfig builds which enable CONFIG_TULIP_MWI. Compile tested only. No run-time effect intended. Acked-by: Helge Deller <deller@gmx.de> Signed-off-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250318-tulip-w1-v3-1-a813fadd164d@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-25Merge branch 'af_unix-clean-up-headers'Jakub Kicinski7-133/+143
Kuniyuki Iwashima says: ==================== af_unix: Clean up headers. AF_UNIX files include many unnecessary headers (netdevice.h and rtnetlink.h, etc), and this series cleans them up. Note that there are still some headers included indirectly and modifying them triggers rebuild, which seems mostly inevitable. [0] $ python3 include_graph.py net/unix/garbage.c linux/rtnetlink.h linux/netdevice.h ... include/net/af_unix.h | include/linux/net.h | | include/linux/once.h | | include/linux/sockptr.h | | include/uapi/linux/net.h | include/net/sock.h | | include/linux/netdevice.h <--- ... | | include/net/dst.h | | | include/linux/rtnetlink.h <--- [0]: https://gist.github.com/q2ven/9c5897f11a493145829029c0bfb364d0 ==================== Link: https://patch.msgid.link/20250318034934.86708-1-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-25af_unix: Clean up #include under net/unix/.Kuniyuki Iwashima4-17/+9
net/unix/*.c include many unnecessary header files (rtnetlink.h, netdevice.h, etc). Let's clean them up. af_unix.c: +uapi/linux/sockios.h : Only exist under include/uapi +uapi/linux/termios.h : Only exist under include/uapi -linux/freezer.h : No longer use freezable_schedule_timeout() -linux/in.h : No ipv4_is_XXX() etc -linux/module.h : No longer support CONFIG_UNIX=m -linux/netdevice.h : No dev used -linux/rtnetlink.h : Not part of rtnetlink API -linux/signal.h : signal_pending() is defined in sched/signal.h -linux/stat.h : No struct stat used -net/checksum.h : CHECKSUM_UNNECESSARY is defined in skbuff.h diag.c: +linux/dcache.h : struct dentry in sk_diag_dump_vfs() +linux/user_namespace.h : struct user_namespace in sk_diag_dump_uid() +uapi/linux/unix_diag.h : Only exist under include/uapi/ garbage.c: +linux/list.h : struct unix_{vertex,edge}, etc +linux/workqueue.h : DECLARE_WORK(unix_gc_work, ...) -linux/file.h : No fget() etc -linux/kernel.h : No cond_resched() etc -linux/netdevice.h : No dev used -linux/proc_fs.h : No procfs provided -linux/string.h : No memcpy(), kmemdup(), etc sysctl_net_unix.c: +linux/string.h : kmemdup() +net/net_namespace.h : struct net, net_eq() -linux/mm.h : slab.h is enough Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250318034934.86708-5-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-25af_unix: Explicitly include headers for non-pointer struct fields.Kuniyuki Iwashima5-14/+6
include/net/af_unix.h indirectly includes some definitions for structs. Let's include such headers explicitly. linux/atomic.h : scm_stat.nr_fds linux/net.h : unix_sock.peer_wq linux/path.h : unix_sock.path linux/spinlock.h : unix_sock.lock linux/wait.h : unix_sock.peer_wake uapi/linux/un.h : unix_address.name[] linux/socket.h is removed as the structs there are not used directly, and linux/un.h is clarified with uapi as un.h only exists under include/uapi. While at it, duplicate headers are removed from .c files. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250318034934.86708-4-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-25af_unix: Move internal definitions to net/unix/.Kuniyuki Iwashima7-75/+102
net/af_unix.h is included by core and some LSMs, but most definitions need not be. Let's move struct unix_{vertex,edge} to net/unix/garbage.c and other definitions to net/unix/af_unix.h. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Joe Damato <jdamato@fastly.com> Link: https://patch.msgid.link/20250318034934.86708-3-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-25af_unix: Sort headers.Kuniyuki Iwashima6-52/+51
This is a prep patch to make the following changes cleaner. No functional change intended. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Joe Damato <jdamato@fastly.com> Link: https://patch.msgid.link/20250318034934.86708-2-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-25Merge branch 'support-tcp_rto_min_us-and-tcp_delack_max_us-for-set-getsockopt'Jakub Kicinski5-6/+30
Jason Xing says: ==================== support TCP_RTO_MIN_US and TCP_DELACK_MAX_US for set/getsockopt Add set/getsockopt supports for TCP_RTO_MIN_US and TCP_DELACK_MAX_US. ==================== Link: https://patch.msgid.link/20250317120314.41404-1-kerneljasonxing@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-25tcp: support TCP_DELACK_MAX_US for set/getsockopt useJason Xing3-2/+14
Support adjusting/reading delayed ack max for socket level by using set/getsockopt(). This option aligns with TCP_BPF_DELACK_MAX usage. Considering that bpf option was implemented before this patch, so we need to use a standalone new option for pure tcp set/getsockopt() use. Add WRITE_ONCE/READ_ONCE() to prevent data-race if setsockopt() happens to write one value to icsk_delack_max while icsk_delack_max is being read. Signed-off-by: Jason Xing <kerneljasonxing@gmail.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250317120314.41404-3-kerneljasonxing@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-25tcp: support TCP_RTO_MIN_US for set/getsockopt useJason Xing4-4/+16
Support adjusting/reading RTO MIN for socket level by using set/getsockopt(). This new option has the same effect as TCP_BPF_RTO_MIN, which means it doesn't affect RTAX_RTO_MIN usage (by using ip route...). Considering that bpf option was implemented before this patch, so we need to use a standalone new option for pure tcp set/getsockopt() use. When the socket is created, its icsk_rto_min is set to the default value that is controlled by sysctl_tcp_rto_min_us. Then if application calls setsockopt() with TCP_RTO_MIN_US flag to pass a valid value, then icsk_rto_min will be overridden in jiffies unit. This patch adds WRITE_ONCE/READ_ONCE to avoid data-race around icsk_rto_min. Signed-off-by: Jason Xing <kerneljasonxing@gmail.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250317120314.41404-2-kerneljasonxing@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-25Merge branch ↵Jakub Kicinski7-49/+83
'mlxsw-add-vxlan-to-the-same-hardware-domain-as-physical-bridge-ports' Petr Machata says: ==================== mlxsw: Add VXLAN to the same hardware domain as physical bridge ports Amit Cohen writes: Packets which are trapped to CPU for forwarding in software data path are handled according to driver marking of skb->offload_{,l3}_fwd_mark. Packets which are marked as L2-forwarded in hardware, will not be flooded by the bridge to bridge ports which are in the same hardware domain as the ingress port. Currently, mlxsw does not add VXLAN bridge ports to the same hardware domain as physical bridge ports despite the fact that the device is able to forward packets to and from VXLAN tunnels in hardware. In some scenarios this can result in remote VTEPs receiving duplicate packets. To solve such packets duplication, add VXLAN bridge ports to the same hardware domain as other bridge ports. One complication is ARP suppression which requires the local VTEP to avoid flooding ARP packets to remote VTEPs if the local VTEP is able to reply on behalf of remote hosts. This is currently implemented by having the device flood ARP packets in hardware and trapping them during VXLAN encapsulation, but marking them with skb->offload_fwd_mark=1 so that the bridge will not re-flood them to physical bridge ports. The above scheme will break when VXLAN bridge ports are added to the same hardware domain as physical bridge ports as ARP packets that cannot be suppressed by the bridge will not be able to egress the VXLAN bridge ports due to hardware domain filtering. This is solved by trapping ARP packets when they enter the device and not marking them as being forwarded in hardware. Patch set overview: Patch #1 sets hardware to trap ARP packets at layer 2 Patches #2-#4 are preparations for setting hardwarwe domain of VXLAN Patch #5 sets hardware domain of VXLAN Patch #6 extends VXLAN flood test to verify that this set solves the packets duplication ==================== Link: https://patch.msgid.link/cover.1742224300.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-25selftests: vxlan_bridge: Test flood with unresolved FDB entryAmit Cohen2-0/+23
Extend flood test to configure FDB entry with unresolved destination IP, check that packets are not sent twice. Without the previous patch which handles such scenario in mlxsw, the tests fail: $ TESTS='test_flood' ./vxlan_bridge_1d.sh Running tests with UDP port 4789 TEST: VXLAN: flood [ OK ] TEST: VXLAN: flood, unresolved FDB entry [FAIL] vx2 ns2: Expected to capture 10 packets, got 20. $ TESTS='test_flood' ./vxlan_bridge_1q.sh INFO: Running tests with UDP port 4789 TEST: VXLAN: flood vlan 10 [ OK ] TEST: VXLAN: flood vlan 20 [ OK ] TEST: VXLAN: flood vlan 10, unresolved FDB entry [FAIL] vx10 ns2: Expected to capture 10 packets, got 20. TEST: VXLAN: flood vlan 20, unresolved FDB entry [FAIL] vx20 ns2: Expected to capture 10 packets, got 20. With the previous patch, the tests pass. Signed-off-by: Amit Cohen <amcohen@nvidia.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/7bc96e317531f3bf06319fb2ea447bd8666f29fa.1742224300.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-25mlxsw: Add VXLAN bridge ports to same hardware domain as physical bridge portsAmit Cohen2-6/+26
When hardware floods packets to bridge ports, but flooding to VXLAN bridge port fails during encapsulation to one of the remote VTEPs, the packets are trapped to CPU. In such case, the packets are marked with skb->offload_fwd_mark, which means that packet was L2-forwarded in hardware. Software data path repeats flooding, but packets which are marked with skb->offload_fwd_mark will not be flooded by the bridge to bridge ports which are in the same hardware domain as the ingress port. Currently, mlxsw does not add VXLAN bridge ports to the same hardware domain as physical bridge ports despite the fact that the device is able to forward packets to and from VXLAN tunnels in hardware. In some scenarios (as mentioned above) this can result in remote VTEPs receiving duplicate packets. The packets are first flooded by hardware and after an encapsulation failure, they are flooded again to all remote VTEPs by software. Solve this by adding VXLAN bridge ports to the same hardware domain as physical bridge ports, so then nbp_switchdev_allowed_egress() will return false also for VXLAN, and packets will not be sent twice from VXLAN device. switchdev_bridge_port_offload() should get vxlan_dev not as const, so some changes are required. Call switchdev API from mlxsw_sp_bridge_vxlan_{join,leave}() which handle offload configurations. Reported-by: Vladimir Oltean <olteanv@gmail.com> Closes: https://lore.kernel.org/all/20250210152246.4ajumdchwhvbarik@skbuf/ Reported-by: Vladyslav Mykhaliuk <vmykhaliuk@nvidia.com> Signed-off-by: Amit Cohen <amcohen@nvidia.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/7279056843140fae3a72c2d204c7886b79d03899.1742224300.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>