summaryrefslogtreecommitdiff
AgeCommit message (Collapse)AuthorFilesLines
2024-04-23netfilter: nfnetlink: Handle ACK flags for batch messagesDonald Hunter1-0/+5
The NLM_F_ACK flag is ignored for nfnetlink batch begin and end messages. This is a problem for ynl which wants to receive an ack for every message it sends, not just the commands in between the begin/end messages. Add processing for ACKs for begin/end messages and provide responses when requested. I have checked that iproute2, pyroute2 and systemd are unaffected by this change since none of them use NLM_F_ACK for batch begin/end. Signed-off-by: Donald Hunter <donald.hunter@gmail.com> Link: https://lore.kernel.org/r/20240418104737.77914-5-donald.hunter@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-23tools/net/ynl: Add multi message support to ynlDonald Hunter2-22/+71
Add a "--multi <do-op> <json>" command line to ynl that makes it possible to add several operations to a single netlink request payload. The --multi command line option is repeated for each operation. This is used by the nftables family for transaction batches. For example: ./tools/net/ynl/cli.py \ --spec Documentation/netlink/specs/nftables.yaml \ --multi batch-begin '{"res-id": 10}' \ --multi newtable '{"name": "test", "nfgen-family": 1}' \ --multi newchain '{"name": "chain", "table": "test", "nfgen-family": 1}' \ --multi batch-end '{"res-id": 10}' [None, None, None, None] It can also be used for bundling get requests: ./tools/net/ynl/cli.py \ --spec Documentation/netlink/specs/nftables.yaml \ --multi gettable '{"name": "test", "nfgen-family": 1}' \ --multi getchain '{"name": "chain", "table": "test", "nfgen-family": 1}' \ --output-json [{"name": "test", "use": 1, "handle": 1, "flags": [], "nfgen-family": 1, "version": 0, "res-id": 2}, {"table": "test", "name": "chain", "handle": 1, "use": 0, "nfgen-family": 1, "version": 0, "res-id": 2}] Signed-off-by: Donald Hunter <donald.hunter@gmail.com> Link: https://lore.kernel.org/r/20240418104737.77914-4-donald.hunter@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-23tools/net/ynl: Fix extack decoding for directional opsDonald Hunter1-8/+6
NetlinkProtocol.decode() was looking up ops by response value which breaks when it is used for extack decoding of directional ops. Instead, pass the op to decode(). Signed-off-by: Donald Hunter <donald.hunter@gmail.com> Link: https://lore.kernel.org/r/20240418104737.77914-3-donald.hunter@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-23doc/netlink/specs: Add draft nftables specDonald Hunter1-0/+1264
Add a spec for nftables that has nearly complete coverage of the ops, but limited coverage of rule types and subexpressions. Signed-off-by: Donald Hunter <donald.hunter@gmail.com> Link: https://lore.kernel.org/r/20240418104737.77914-2-donald.hunter@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-23Merge branch 'for-uring-ubufops' into HEADJakub Kicinski9-36/+69
Pavel Begunkov says: ==================== implement io_uring notification (ubuf_info) stacking (net part) To have per request buffer notifications each zerocopy io_uring send request allocates a new ubuf_info. However, as an skb can carry only one uarg, it may force the stack to create many small skbs hurting performance in many ways. The patchset implements notification, i.e. an io_uring's ubuf_info extension, stacking. It attempts to link ubuf_info's into a list, allowing to have multiple of them per skb. liburing/examples/send-zerocopy shows up 6 times performance improvement for TCP with 4KB bytes per send, and levels it with MSG_ZEROCOPY. Without the patchset it requires much larger sends to utilise all potential. bytes | before | after (Kqps) 1200 | 195 | 1023 4000 | 193 | 1386 8000 | 154 | 1058 ==================== Link: https://lore.kernel.org/all/cover.1713369317.git.asml.silence@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-23net: add callback for setting a ubuf_info to skbPavel Begunkov2-6/+16
At the moment an skb can only have one ubuf_info associated with it, which might be a performance problem for zerocopy sends in cases like TCP via io_uring. Add a callback for assigning ubuf_info to skb, this way we will implement smarter assignment later like linking ubuf_info together. Note, it's an optional callback, which should be compatible with skb_zcopy_set(), that's because the net stack might potentially decide to clone an skb and take another reference to ubuf_info whenever it wishes. Also, a correct implementation should always be able to bind to an skb without prior ubuf_info, otherwise we could end up in a situation when the send would not be able to progress. Reviewed-by: Jens Axboe <axboe@kernel.dk> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://lore.kernel.org/all/b7918aadffeb787c84c9e72e34c729dc04f3a45d.1713369317.git.asml.silence@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-23net: extend ubuf_info callback to ops structurePavel Begunkov9-30/+53
We'll need to associate additional callbacks with ubuf_info, introduce a structure holding ubuf_info callbacks. Apart from a more smarter io_uring notification management introduced in next patches, it can be used to generalise msg_zerocopy_put_abort() and also store ->sg_from_iter, which is currently passed in struct msghdr. Reviewed-by: Jens Axboe <axboe@kernel.dk> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://lore.kernel.org/all/a62015541de49c0e2a8a0377a1d5d0a5aeb07016.1713369317.git.asml.silence@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-23Merge branch 'tcp-avoid-sending-too-small-packets'Jakub Kicinski1-25/+53
Eric Dumazet says: ==================== tcp: avoid sending too small packets tcp_sendmsg() cooks 'large' skbs, that are later split if needed from tcp_write_xmit(). After a split, the leftover skb size is smaller than the optimal size, and this causes a performance drop. In this series, tcp_grow_skb() helper is added to shift payload from the second skb in the write queue to the first skb to always send optimal sized skbs. This increases TSO efficiency, and decreases number of ACK packets. ==================== Link: https://lore.kernel.org/r/20240418214600.1291486-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-23tcp: try to send bigger TSO packetsEric Dumazet1-2/+36
While investigating TCP performance, I found that TCP would sometimes send big skbs followed by a single MSS skb, in a 'locked' pattern. For instance, BIG TCP is enabled, MSS is set to have 4096 bytes of payload per segment. gso_max_size is set to 181000. This means that an optimal TCP packet size should contain 44 * 4096 = 180224 bytes of payload, However, I was seeing packets sizes interleaved in this pattern: 172032, 8192, 172032, 8192, 172032, 8192, <repeat> tcp_tso_should_defer() heuristic is defeated, because after a split of a packet in write queue for whatever reason (this might be a too small CWND or a small enough pacing_rate), the leftover packet in the queue is smaller than the optimal size. It is time to try to make 'leftover packets' bigger so that tcp_tso_should_defer() can give its full potential. After this patch, we can see the following output: 14:13:34.009273 IP6 sender > receiver: Flags [P.], seq 4048380:4098360, ack 1, win 256, options [nop,nop,TS val 3425678144 ecr 1561784500], length 49980 14:13:34.010272 IP6 sender > receiver: Flags [P.], seq 4098360:4148340, ack 1, win 256, options [nop,nop,TS val 3425678145 ecr 1561784501], length 49980 14:13:34.011271 IP6 sender > receiver: Flags [P.], seq 4148340:4198320, ack 1, win 256, options [nop,nop,TS val 3425678146 ecr 1561784502], length 49980 14:13:34.012271 IP6 sender > receiver: Flags [P.], seq 4198320:4248300, ack 1, win 256, options [nop,nop,TS val 3425678147 ecr 1561784503], length 49980 14:13:34.013272 IP6 sender > receiver: Flags [P.], seq 4248300:4298280, ack 1, win 256, options [nop,nop,TS val 3425678148 ecr 1561784504], length 49980 14:13:34.014271 IP6 sender > receiver: Flags [P.], seq 4298280:4348260, ack 1, win 256, options [nop,nop,TS val 3425678149 ecr 1561784505], length 49980 14:13:34.015272 IP6 sender > receiver: Flags [P.], seq 4348260:4398240, ack 1, win 256, options [nop,nop,TS val 3425678150 ecr 1561784506], length 49980 14:13:34.016270 IP6 sender > receiver: Flags [P.], seq 4398240:4448220, ack 1, win 256, options [nop,nop,TS val 3425678151 ecr 1561784507], length 49980 14:13:34.017269 IP6 sender > receiver: Flags [P.], seq 4448220:4498200, ack 1, win 256, options [nop,nop,TS val 3425678152 ecr 1561784508], length 49980 14:13:34.018276 IP6 sender > receiver: Flags [P.], seq 4498200:4548180, ack 1, win 256, options [nop,nop,TS val 3425678153 ecr 1561784509], length 49980 14:13:34.019259 IP6 sender > receiver: Flags [P.], seq 4548180:4598160, ack 1, win 256, options [nop,nop,TS val 3425678154 ecr 1561784510], length 49980 With 200 concurrent flows on a 100Gbit NIC, we can see a reduction of TSO packets (and ACK packets) of about 30 %. Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20240418214600.1291486-4-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-23tcp: call tcp_set_skb_tso_segs() from tcp_write_xmit()Eric Dumazet1-12/+14
tcp_write_xmit() calls tcp_init_tso_segs() to set gso_size and gso_segs on the packet. tcp_init_tso_segs() requires the stack to maintain an up to date tcp_skb_pcount(), and this makes sense for packets in rtx queue. Not so much for packets still in the write queue. In the following patch, we don't want to deal with tcp_skb_pcount() when moving payload from 2nd skb to 1st skb in the write queue. Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20240418214600.1291486-3-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-23tcp: remove dubious FIN exception from tcp_cwnd_test()Eric Dumazet1-13/+5
tcp_cwnd_test() has a special handing for the last packet in the write queue if it is smaller than one MSS and has the FIN flag. This is in violation of TCP RFC, and seems quite dubious. This packet can be sent only if the current CWND is bigger than the number of packets in flight. Making tcp_cwnd_test() result independent of the first skb in the write queue is needed for the last patch of the series. Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20240418214600.1291486-2-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-23Merge branch 'mlx5e-per-queue-coalescing'Jakub Kicinski13-196/+672
Tariq Toukan says: ==================== mlx5e per-queue coalescing This patchset adds ethtool per-queue coalescing support for the mlx5e driver. The series introduce some changes needed as preparations for the final patch which adds the support and implements the callbacks. Main changes: - DIM code movements into its own header file. - Switch to dynamic allocation of the DIM struct in the RQs/SQs. - Allow coalescing config change without channels reset when possible. ==================== Link: https://lore.kernel.org/r/20240419080445.417574-1-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-23net/mlx5e: Implement ethtool callbacks for supporting per-queue coalescingRahul Rameshbabu3-0/+152
Use mlx5 on-the-fly coalescing configuration support to enable individual channel configuration. Co-developed-by: Nabil S. Alramli <dev@nalramli.com> Signed-off-by: Nabil S. Alramli <dev@nalramli.com> Co-developed-by: Joe Damato <jdamato@fastly.com> Signed-off-by: Joe Damato <jdamato@fastly.com> Signed-off-by: Rahul Rameshbabu <rrameshbabu@nvidia.com> Reviewed-by: Saeed Mahameed <saeedm@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://lore.kernel.org/r/20240419080445.417574-6-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-23net/mlx5e: Support updating coalescing configuration without resetting channelsRahul Rameshbabu12-169/+460
When CQE mode or DIM state is changed, gracefully reconfigure channels to handle new configuration. Previously, would create new channels that would reflect the changes rather than update the original channels. Co-developed-by: Nabil S. Alramli <dev@nalramli.com> Signed-off-by: Nabil S. Alramli <dev@nalramli.com> Co-developed-by: Joe Damato <jdamato@fastly.com> Signed-off-by: Joe Damato <jdamato@fastly.com> Signed-off-by: Rahul Rameshbabu <rrameshbabu@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://lore.kernel.org/r/20240419080445.417574-5-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-23net/mlx5e: Dynamically allocate DIM structure for SQs/RQsRahul Rameshbabu4-12/+31
Make it possible for the DIM structure to be torn down while an SQ or RQ is still active. Changing the CQ period mode is an example where the previous sampling done with the DIM structure would need to be invalidated. Co-developed-by: Nabil S. Alramli <dev@nalramli.com> Signed-off-by: Nabil S. Alramli <dev@nalramli.com> Co-developed-by: Joe Damato <jdamato@fastly.com> Signed-off-by: Joe Damato <jdamato@fastly.com> Signed-off-by: Rahul Rameshbabu <rrameshbabu@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://lore.kernel.org/r/20240419080445.417574-4-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-23net/mlx5e: Use DIM constants for CQ period mode parameterRahul Rameshbabu6-54/+53
Use core DIM CQ period mode enum values for the CQ parameter for the period mode. Translate the value to the specific mlx5 device constant for the selected period mode when creating a CQ. Avoid needing to translate mlx5 device constants to DIM constants for core DIM functionality. Co-developed-by: Nabil S. Alramli <dev@nalramli.com> Signed-off-by: Nabil S. Alramli <dev@nalramli.com> Co-developed-by: Joe Damato <jdamato@fastly.com> Signed-off-by: Joe Damato <jdamato@fastly.com> Signed-off-by: Rahul Rameshbabu <rrameshbabu@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://lore.kernel.org/r/20240419080445.417574-3-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-23net/mlx5e: Move DIM function declarations to en/dim.hRahul Rameshbabu5-3/+18
Create a header specifically for DIM-related declarations. Move existing DIM-specific functionality from en.h. Future DIM-related functionality will be declared in en/dim.h in subsequent patches. Co-developed-by: Nabil S. Alramli <dev@nalramli.com> Signed-off-by: Nabil S. Alramli <dev@nalramli.com> Co-developed-by: Joe Damato <jdamato@fastly.com> Signed-off-by: Joe Damato <jdamato@fastly.com> Signed-off-by: Rahul Rameshbabu <rrameshbabu@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://lore.kernel.org/r/20240419080445.417574-2-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-23Merge branch 'net-dsa-vsc73xx-convert-to-phylink-and-do-some-cleanup'Jakub Kicinski2-138/+144
Pawel Dembicki says: ==================== net: dsa: vsc73xx: convert to PHYLINK and do some cleanup This patch series is a result of splitting a larger patch series [0], where some parts needed to be refactored. The first patch switches from a poll loop to read_poll_timeout. The second patch is a simple conversion to phylink because adjust_link won't work anymore. The third patch is preparation for future use. Using the "phy_interface_mode_is_rgmii" macro allows for the proper recognition of all RGMII modes. Patches 4-5 involve some cleanup: The fourth patch introduces a definition with the maximum number of ports to avoid using magic numbers. The next one fills in documentation. [0] https://patchwork.kernel.org/project/netdevbpf/list/?series=841034&state=%2A&archive=both ==================== Link: https://lore.kernel.org/r/20240417205048.3542839-1-paweldembicki@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-23net: dsa: vsc73xx: add structure descriptionsPawel Dembicki1-1/+15
This commit adds updates to the documentation describing the structures used in vsc73xx. This will help prevent kdoc-related issues in the future. Reviewed-by: Linus Walleij <linus.walleij@linaro.org> Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com> Reviewed-by: Vladimir Oltean <olteanv@gmail.com> Signed-off-by: Pawel Dembicki <paweldembicki@gmail.com> Link: https://lore.kernel.org/r/20240417205048.3542839-6-paweldembicki@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-23net: dsa: vsc73xx: Add define for max num of portsPawel Dembicki2-12/+12
This patch introduces a new define: VSC73XX_MAX_NUM_PORTS, which can be used in the future instead of a hardcoded value. Currently, the only hardcoded value is vsc->ds->num_ports. It is being replaced with the new define. Suggested-by: Vladimir Oltean <olteanv@gmail.com> Signed-off-by: Pawel Dembicki <paweldembicki@gmail.com> Reviewed-by: Vladimir Oltean <olteanv@gmail.com> Reviewed-by: Linus Walleij <linus.walleij@linaro.org> Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com> Link: https://lore.kernel.org/r/20240417205048.3542839-5-paweldembicki@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-23net: dsa: vsc73xx: use macros for rgmii recognitionPawel Dembicki1-1/+1
It's preparation for future use. At this moment, the RGMII port is used only for a connection to the MAC interface, but in the future, someone could connect a PHY to it. Using the "phy_interface_mode_is_rgmii" macro allows for the proper recognition of all RGMII modes. Suggested-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Signed-off-by: Pawel Dembicki <paweldembicki@gmail.com> Reviewed-by: Linus Walleij <linus.walleij@linaro.org> Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com> Reviewed-by: Vladimir Oltean <olteanv@gmail.com> Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Link: https://lore.kernel.org/r/20240417205048.3542839-4-paweldembicki@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-23net: dsa: vsc73xx: convert to PHYLINKPawel Dembicki1-127/+117
This patch replaces the adjust_link api with the phylink apis that provide equivalent functionality. The remaining functionality from the adjust_link is now covered in the mac_link_* and mac_config from phylink_mac_ops structure. Removes: .adjust_link Adds phylink_mac_ops structure: .mac_config .mac_link_up .mac_link_down Signed-off-by: Pawel Dembicki <paweldembicki@gmail.com> Link: https://lore.kernel.org/r/20240417205048.3542839-3-paweldembicki@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-23net: dsa: vsc73xx: use read_poll_timeout instead delay loopPawel Dembicki1-14/+16
Switch the delay loop during the Arbiter empty check from vsc73xx_adjust_link() to use read_poll_timeout(). Functionally, one msleep() call is eliminated at the end of the loop in the timeout case. As Russell King suggested: "This [change] avoids the issue that on the last iteration, the code reads the register, tests it, finds the condition that's being waiting for is false, _then_ waits and end up printing the error message - that last wait is rather useless, and as the arbiter state isn't checked after waiting, it could be that we had success during the last wait." Suggested-by: Russell King <linux@armlinux.org.uk> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Linus Walleij <linus.walleij@linaro.org> Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com> Signed-off-by: Pawel Dembicki <paweldembicki@gmail.com> Link: https://lore.kernel.org/r/20240417205048.3542839-2-paweldembicki@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-22tcp: do not export tcp_twsk_purge()Eric Dumazet1-1/+0
After commit 1eeb50435739 ("tcp/dccp: do not care about families in inet_twsk_purge()") tcp_twsk_purge() is no longer potentially called from a module. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-04-22octeontx2-pf: Add support for offload tc with skbedit mark actionGeetha sowjanya6-0/+24
Support offloading of skbedit mark action. For example, to mark with 0x0008, with dest ip 60.60.60.2 on eth2 interface: # tc qdisc add dev eth2 ingress # tc filter add dev eth2 ingress protocol ip flower \ dst_ip 60.60.60.2 action skbedit mark 0x0008 skip_sw Signed-off-by: Geetha sowjanya <gakula@marvell.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-04-22net: ethernet: ti: am65-cpsw: Fix xdp_rxq error for disabled portJulien Panis1-0/+6
When an ethX port is disabled in the device tree, an error is returned by xdp_rxq_info_reg() function while transitioning the CPSW device to the up state. The message 'Missing net_device from driver' is output. This patch fixes the issue by registering xdp_rxq info only if ethX port is enabled (i.e. ndev pointer is not NULL). Fixes: 8acacc40f733 ("net: ethernet: ti: am65-cpsw: Add minimal XDP support") Link: https://lore.kernel.org/all/260d258f-87a1-4aac-8883-aab4746b32d8@ti.com/ Reported-by: Siddharth Vadapalli <s-vadapalli@ti.com> Closes: https://gist.github.com/Siddharth-Vadapalli-at-TI/5ed0e436606001c247a7da664f75edee Signed-off-by: Julien Panis <jpanis@baylibre.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-04-22sysctl: treewide: constify ctl_table_header::ctl_table_argThomas Weißschuh27-30/+30
To be able to constify instances of struct ctl_tables it is necessary to remove ways through which non-const versions are exposed from the sysctl core. One of these is the ctl_table_arg member of struct ctl_table_header. Constify this reference as a prerequisite for the full constification of struct ctl_table instances. No functional change. Signed-off-by: Thomas Weißschuh <linux@weissschuh.net> Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-04-20Merge branch 'testing-make-netfilter-selftests-functional-in-vng-environment'Jakub Kicinski12-520/+498
Florian Westphal says: ==================== testing: make netfilter selftests functional in vng environment This is the second batch of the netfilter selftest move. Changes since v1: - makefile and kernel config are updated to have all required features - fix makefile with missing bits to make kselftest-install work - test it via vng as per https://github.com/linux-netdev/nipa/wiki/How-to-run-netdev-selftests-CI-style (Thanks Jakub!) - squash a few fixes, e.g. nft_queue.sh v1 had a race w. NFNETLINK_QUEUE=m - add a settings file with 8m timeout, for nft_concat_range.sh sake. That script can be sped up a bit, I think, but its not contained in this batch yet. - toss the first two bogus rebase artifacts (Matthieu Baerts) scripts are moved to lib.sh infra. This allows to use busywait helper and ditch various 'sleep 2' all over the place. Tested on Fedora 39: vng --build --config tools/testing/selftests/net/netfilter/config make -C tools/testing/selftests/ TARGETS=net/netfilter vng -v --run . --user root --cpus 2 -- \ make -C tools/testing/selftests TARGETS=net/netfilter run_tests ... all tests pass except nft_audit.sh which SKIPs due to nft version mismatch (Fedora is on nft 1.0.7 which lacks reset keyword support). Missing/WIP bits: - speed up nf_concat_range.sh test - extend flowtable selftest - shellcheck fixups for remaining scripts ==================== Link: https://lore.kernel.org/r/20240418152744.15105-1-fw@strlen.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-20selftests: netfilter: update makefiles and kernel configFlorian Westphal3-1/+57
Jakub reports the Makefile missed a few updates to make kselftest-install work for the netfilter tests and points out that config file lacks many dependencies such as VETH support. The settings file (timeout 8m) is added for nft_concat_range.sh script which can take several minutes to complete. Fixes: 3f189349e52a ("selftests: netfilter: move to net subdir") Reported-by: Jakub Kicinski <kuba@kernel.org> Closes: https://lore.kernel.org/all/20240412175413.04e5e616@kernel.org/ Signed-off-by: Florian Westphal <fw@strlen.de> Link: https://lore.kernel.org/r/20240418152744.15105-13-fw@strlen.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-20selftests: netfilter: nft_audit.sh: add more skip checksFlorian Westphal1-4/+26
This testcase doesn't work if auditd is running, audit_logread will not receive any data in that case. Add a nftables feature test for the reset keyword and skip this test if that fails. While at it, do a few minor shellcheck cleanups. Signed-off-by: Florian Westphal <fw@strlen.de> Link: https://lore.kernel.org/r/20240418152744.15105-12-fw@strlen.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-20selftests: netfilter: nft_meta.sh: small shellcheck cleanupFlorian Westphal1-2/+2
shellcheck complains about missing "", so add those. Signed-off-by: Florian Westphal <fw@strlen.de> Link: https://lore.kernel.org/r/20240418152744.15105-11-fw@strlen.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-20selftests: netfilter: nft_fib.sh: shellcheck cleanupsFlorian Westphal1-67/+61
no functional change intended. Signed-off-by: Florian Westphal <fw@strlen.de> Link: https://lore.kernel.org/r/20240418152744.15105-10-fw@strlen.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-20selftests: netfilter: conntrack_ipip_mtu.sh: shellcheck cleanupsFlorian Westphal1-37/+37
No functional change intended. Signed-off-by: Florian Westphal <fw@strlen.de> Link: https://lore.kernel.org/r/20240418152744.15105-9-fw@strlen.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-20selftests: netfilter: nft_nat_zones.sh: shellcheck cleanupsFlorian Westphal1-118/+75
While at it: No need for iperf here, use socat. This also reduces the script runtime. Signed-off-by: Florian Westphal <fw@strlen.de> Link: https://lore.kernel.org/r/20240418152744.15105-8-fw@strlen.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-20selftests: netfilter: xt_string.sh: shellcheck cleanupsFlorian Westphal1-17/+17
no functional change intended. Signed-off-by: Florian Westphal <fw@strlen.de> Link: https://lore.kernel.org/r/20240418152744.15105-7-fw@strlen.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-20selftests: netfilter: xt_string.sh: move to lib.sh infraFlorian Westphal1-25/+30
Intentional changes: - Use socat instead of netcat - Use a temporary file instead of pipe, else packets do not match "-m string" rules, multiple writes to the pipe cause multiple packets, but this needs only one to work. Signed-off-by: Florian Westphal <fw@strlen.de> Link: https://lore.kernel.org/r/20240418152744.15105-6-fw@strlen.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-20selftests: netfilter: nft_zones_many.sh: move to lib.sh infraFlorian Westphal1-48/+45
Also do shellcheck cleanups here, no functional changes intended. When running tests via vng tool, the packetpath insertion test fails: dd: failed to open '/dev/stdout': Device or resource busy Just omit 'of=' and this will work as intended. Signed-off-by: Florian Westphal <fw@strlen.de> Link: https://lore.kernel.org/r/20240418152744.15105-5-fw@strlen.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-20selftests: netfilter: nft_synproxy.sh: move to lib.sh infraFlorian Westphal1-49/+28
use checktool helper where applicable. Signed-off-by: Florian Westphal <fw@strlen.de> Link: https://lore.kernel.org/r/20240418152744.15105-4-fw@strlen.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-20selftests: netfilter: nft_queue.sh: shellcheck cleanupsFlorian Westphal1-108/+103
No functional change intended. Disable frequent shellcheck warnings wrt. "unreachable" code, those helpers get called indirectly from busywait helper. Signed-off-by: Florian Westphal <fw@strlen.de> Link: https://lore.kernel.org/r/20240418152744.15105-3-fw@strlen.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-20selftests: netfilter: nft_queue.sh: move to lib.sh infraFlorian Westphal1-61/+34
- switch to socat, like other tests - use buswait helper to test once listener netns is ready - do not generate multiple input test files, only generate one and use cleanup hook to remove it, like other temporary files. Signed-off-by: Florian Westphal <fw@strlen.de> Link: https://lore.kernel.org/r/20240418152744.15105-2-fw@strlen.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-19Merge branch 'net-neigh-rcu'David S. Miller1-32/+36
Eric Dumazet says: ==================== neighbour: convert neigh_dump_info() to RCU Remove RTNL requirement for "ip neighbour show" command. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2024-04-19neighbour: no longer hold RTNL in neigh_dump_info()Eric Dumazet1-4/+5
neigh_dump_table() is already relying on RCU protection. pneigh_dump_table() is using its own protection (tbl->lock) Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-04-19neighbour: fix neigh_dump_info() return valueEric Dumazet1-18/+13
Change neigh_dump_table() and pneigh_dump_table() to either return 0 or -EMSGSIZE if not enough space was available in the skb. Then neigh_dump_info() can do the same. This allows NLMSG_DONE to be appended to the current skb at the end of a dump, saving a couple of recvmsg() system calls. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-04-19neighbour: add RCU protection to neigh_tables[]Eric Dumazet1-11/+19
In order to remove RTNL protection from neightbl_dump_info() and neigh_dump_info() later, we need to add RCU protection to neigh_tables[]. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-04-19net: dsa: xrs700x: fix missing initialisation of ds->phylink_mac_opsRussell King (Oracle)1-0/+1
The kernel build bot identified the following mistake in the recently merged 860a9bed2651 ("net: dsa: xrs700x: provide own phylink MAC operations") patch: drivers/net/dsa/xrs700x/xrs700x.c:714:37: warning: 'xrs700x_phylink_mac_ops' defined but not used [-Wunused-const-variable=] 714 | static const struct phylink_mac_ops xrs700x_phylink_mac_ops = { | ^~~~~~~~~~~~~~~~~~~~~~~ Fix the omitted assignment of ds->phylink_mac_ops. Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-04-19Merge branch 'net-rps-lockless'David S. Miller1-9/+9
Jason Xing says: ==================== locklessly protect left members in struct rps_dev_flow From: Jason Xing <kernelxing@tencent.com> Since Eric did a more complicated locklessly change to last_qtail member[1] in struct rps_dev_flow, the left members are easier to change as the same. One thing important I would like to share by qooting Eric: "rflow is located in rxqueue->rps_flow_table, it is thus private to current thread. Only one cpu can service an RX queue at a time." So we only pay attention to the reader in the rps_may_expire_flow() and writer in the set_rps_cpu(). They are in the two different contexts. [1]: https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git/commit/?id=3b4cf29bdab v3 Link: https://lore.kernel.org/all/20240417062721.45652-1-kerneljasonxing@gmail.com/ 1. adjust the protection in a right way (Eric) v2 1. fix passing wrong type qtail. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2024-04-19net: rps: locklessly access rflow->cpuJason Xing1-1/+1
This is the last member in struct rps_dev_flow which should be protected locklessly. So finish it. Signed-off-by: Jason Xing <kernelxing@tencent.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-04-19net: rps: protect filter locklesslyJason Xing1-4/+4
As we can see, rflow->filter can be written/read concurrently, so lockless access is needed. Signed-off-by: Jason Xing <kernelxing@tencent.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-04-19net: rps: protect last_qtail with rps_input_queue_tail_save() helperJason Xing1-4/+4
Removing one unnecessary reader protection and add another writer protection to finish the locklessly proctection job. Note: the removed READ_ONCE() is not needed because we only have to protect the locklessly reader in the different context (rps_may_expire_flow()). Signed-off-by: Jason Xing <kernelxing@tencent.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-04-19Merge branch 'net_sched-dump-no-rtnl'David S. Miller15-234/+323
Eric Dumazet says: ==================== net_sched: first series for RTNL-less qdisc dumps Medium term goal is to implement "tc qdisc show" without needing to acquire RTNL. This first series makes the requested changes in 14 qdisc. Notes : - RTNL is still held in "tc qdisc show", more changes are needed. - Qdisc returning many attributes might want/need to provide a consistent set of attributes. If that is the case, their dump() method could acquire the qdisc spinlock, to pair the spinlock acquision in their change() method. V2: Addressed Simon feedback (Thanks a lot Simon) ==================== Signed-off-by: David S. Miller <davem@davemloft.net>