Age  Commit message  Author  Files  Lines
2024-10-08net: dsa: b53: fix max MTU for BCM5325/BCM5365Jonas Gorski1-0/+6
BCM5325/BCM5365 do not support jumbo frames, so we should not report a jumbo frame mtu for them. But they do support so called "oversized" frames up to 1536 bytes long by default, so report an appropriate MTU. Fixes: 6ae5834b983a ("net: dsa: b53: add MTU configuration support") Signed-off-by: Jonas Gorski <jonas.gorski@gmail.com> Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-10-08net: dsa: b53: fix max MTU for 1g switchesJonas Gorski1-1/+4
JMS_MAX_SIZE is the ethernet frame length, not the MTU, which is the payload without ethernet headers. According to the datasheets, the maximum supported frame length for most gigabit switches is 9720 bytes, so convert that to the expected MTU when using VLAN tagged frames. Fixes: 6ae5834b983a ("net: dsa: b53: add MTU configuration support") Signed-off-by: Jonas Gorski <jonas.gorski@gmail.com> Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-10-08net: dsa: b53: fix jumbo frame mtu checkJonas Gorski1-1/+1
JMS_MIN_SIZE is the full ethernet frame length, while mtu is just the data payload size. Comparing these two meant that mtus between 1500 and 1518 did not trigger enabling jumbo frames. So instead compare the set mtu to ETH_DATA_LEN, which is equal to JMS_MIN_SIZE - ETH_HLEN - ETH_FCS_LEN. Also check that the requested mtu is actually greater than the minimum length, else we do not need to enable jumbo frames. In practice this only introduced a very small range of mtus that did not work properly. Newer chips allow 2000 byte large frames by default, and older chips allow frames up to 1536 bytes long, which is equivalent to an mtu of 1514. So effectively only mtus of 1515~1517 were broken. Fixes: 6ae5834b983a ("net: dsa: b53: add MTU configuration support") Signed-off-by: Jonas Gorski <jonas.gorski@gmail.com> Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
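The frame-length versus MTU arithmetic behind the three b53 fixes above can be sketched as follows. This is a minimal user-space illustration, not the driver code: the helper names `frame_len_to_mtu()` and `mtu_needs_jumbo()` are hypothetical, and the constants mirror the usual Ethernet header, VLAN tag and FCS sizes.

```c
#include <stdbool.h>

/* Standard Ethernet sizes in bytes. */
#define ETH_HLEN       14  /* destination MAC + source MAC + EtherType */
#define VLAN_HLEN       4  /* one 802.1Q tag */
#define ETH_FCS_LEN     4  /* trailing frame check sequence */
#define ETH_DATA_LEN 1500  /* largest payload of a standard frame */

/* Hypothetical helper: largest MTU that fits into a VLAN-tagged
 * frame of frame_len bytes. */
static int frame_len_to_mtu(int frame_len)
{
	return frame_len - ETH_HLEN - VLAN_HLEN - ETH_FCS_LEN;
}

/* Hypothetical helper: jumbo mode is only needed once the requested
 * MTU exceeds the standard payload size, so compare against the
 * payload length rather than a full frame length. */
static bool mtu_needs_jumbo(int mtu)
{
	return mtu > ETH_DATA_LEN;
}
```

With these definitions a 1536 byte frame corresponds to an MTU of 1514 and a 9720 byte frame to an MTU of 9698, which matches the numbers quoted in the commit messages.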
2024-10-08Merge branch 'fix-ti-am65-cpsw-nuss-module-removal'Paolo Abeni1-9/+13
Nicolas Pitre says: ==================== fix ti-am65-cpsw-nuss module removal Fix issues preventing rmmod of ti-am65-cpsw-nuss from working properly. v3: - more patch submission minutiae v2: https://lore.kernel.org/netdev/20241003172105.2712027-2-nico@fluxnic.net/T/ - conform to netdev patch submission customs - address patch review trivias v1: https://lore.kernel.org/netdev/20240927025301.1312590-2-nico@fluxnic.net/T/ ==================== Link: https://patch.msgid.link/20241004041218.2809774-1-nico@fluxnic.net Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-10-08net: ethernet: ti: am65-cpsw: avoid devm_alloc_etherdev, fix module removalNicolas Pitre1-8/+12
Usage of devm_alloc_etherdev_mqs() conflicts with am65_cpsw_nuss_cleanup_ndev() as the same struct net_device instances get unregistered twice. Switch to alloc_etherdev_mqs() and make sure am65_cpsw_nuss_cleanup_ndev() unregisters and frees those net_device instances properly. With this, it is finally possible to rmmod the driver without oopsing the kernel. Fixes: 93a76530316a ("net: ethernet: ti: introduce am65x/j721e gigabit eth subsystem driver") Signed-off-by: Nicolas Pitre <npitre@baylibre.com> Reviewed-by: Roger Quadros <roger@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-10-08net: ethernet: ti: am65-cpsw: prevent WARN_ON upon module removalNicolas Pitre1-1/+1
In am65_cpsw_nuss_remove(), move the call to am65_cpsw_unregister_devlink() after am65_cpsw_nuss_cleanup_ndev() to avoid triggering the WARN_ON(devlink_port->type != DEVLINK_PORT_TYPE_NOTSET) in devl_port_unregister(). This makes it consistent with the usage in am65_cpsw_nuss_register_ndevs()'s cleanup path. Fixes: 58356eb31d60 ("net: ti: am65-cpsw-nuss: Add devlink support") Signed-off-by: Nicolas Pitre <npitre@baylibre.com> Reviewed-by: Roger Quadros <rogerq@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-10-08net: qcom/emac: Find sgmii_ops by device_for_each_child()Zijun Hu1-5/+17
To prepare for constifying the following old driver core API: struct device *device_find_child(struct device *dev, void *data, int (*match)(struct device *dev, void *data)); to new: struct device *device_find_child(struct device *dev, const void *data, int (*match)(struct device *dev, const void *data)); The new API does not allow its match function (*match)() to modify the caller's match data @*data, but emac_sgmii_acpi_match(), as the old API's match function, does modify the relevant match data, so it is no longer suitable for the new API. Solve this by adapting the function and passing it to device_for_each_child() instead of device_find_child() to find sgmii_ops. This commit does not change any existing logic. Signed-off-by: Zijun Hu <quic_zijuhu@quicinc.com> Link: https://patch.msgid.link/20241003-qcom_emac_fix-v6-1-0658e3792ca4@quicinc.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-10-08Merge branch 'netkit: Add option for scrubbing skb meta data'Martin KaFai Lau6-44/+736
Daniel Borkmann says: ===================== This series is to add a NETKIT_SCRUB_NONE mode such that the netkit device will not scrub the skb->{mark, priority} before running the netkit bpf prog. This will allow the netkit bpf prog to implement different policies based on the skb->{mark, priority}. The default mode NETKIT_SCRUB_DEFAULT will always scrub the skb->{mark, priority} before calling the netkit bpf prog. This is the existing behavior of the netkit device and this change will not affect the existing netkit users. ===================== Link: https://lore.kernel.org/r/20241004101335.117711-1-daniel@iogearbox.net Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2024-10-08selftests/bpf: Extend netkit tests to validate skb meta dataDaniel Borkmann2-9/+97
Add a small netkit test to validate skb mark and priority under the default scrubbing as well as with mark and priority scrubbing off. # ./vmtest.sh -- ./test_progs -t netkit [...] ./test_progs -t netkit [ 1.419662] tsc: Refined TSC clocksource calibration: 3407.993 MHz [ 1.420151] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x311fcd52370, max_idle_ns: 440795242006 ns [ 1.420897] clocksource: Switched to clocksource tsc [ 1.447996] bpf_testmod: loading out-of-tree module taints kernel. [ 1.448447] bpf_testmod: module verification failed: signature and/or required key missing - tainting kernel #357 tc_netkit_basic:OK #358 tc_netkit_device:OK #359 tc_netkit_multi_links:OK #360 tc_netkit_multi_opts:OK #361 tc_netkit_neigh_links:OK #362 tc_netkit_pkt_type:OK #363 tc_netkit_scrub:OK Summary: 7/0 PASSED, 0 SKIPPED, 0 FAILED Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Cc: Nikolay Aleksandrov <razor@blackwall.org> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://lore.kernel.org/r/20241004101335.117711-5-daniel@iogearbox.net Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2024-10-08net: airoha: Update tx cpu dma ring idx at the end of xmit loopLorenzo Bianconi1-4/+5
Move the tx cpu dma ring index update out of transmit loop of airoha_dev_xmit routine in order to not start transmitting the packet before it is fully DMA mapped (e.g. fragmented skbs). Fixes: 23020f049327 ("net: airoha: Introduce ethernet support for EN7581 SoC") Reported-by: Felix Fietkau <nbd@nbd.name> Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20241004-airoha-eth-7581-mapping-fix-v1-1-8e4279ab1812@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-10-08net: phy: Remove LED entry from LEDs list on unregisterChristian Marangi1-2/+3
Commit c938ab4da0eb ("net: phy: Manual remove LEDs to ensure correct ordering") correctly fixed a problem with using devm_ but missed removing the LED entry from the LEDs list. This causes a kernel panic in a specific scenario where the port for the PHY is torn down and brought up again and the kmod for the PHY is removed. On setting the port down the first time, the associated LEDs are correctly unregistered. The associated kmod for the PHY is then removed. When the kmod is added again and the port is brought up, the associated LEDs are registered again. On putting the port down for the second time after these steps, the LED list has 4 elements: the first 2 already unregistered previously and the 2 new ones registered again. This causes a kernel panic, as the first 2 elements should have been removed. Fix this by correctly removing the element when the LED is unregistered. Reported-by: Daniel Golle <daniel@makrotopia.org> Tested-by: Daniel Golle <daniel@makrotopia.org> Cc: stable@vger.kernel.org Fixes: c938ab4da0eb ("net: phy: Manual remove LEDs to ensure correct ordering") Signed-off-by: Christian Marangi <ansuelsmth@gmail.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Link: https://patch.msgid.link/20241004182759.14032-1-ansuelsmth@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
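The list handling described in the LED fix above can be illustrated with a small user-space sketch. It assumes a plain singly linked list; the types `led_entry` and `led_list` and all function names are hypothetical stand-ins, not the phylib structures. The point is only that unregistering must also unlink the entry, otherwise a later teardown walks stale elements.

```c
#include <stddef.h>

/* Hypothetical stand-in for one registered LED. */
struct led_entry {
	struct led_entry *next;
	int registered;
};

/* Hypothetical stand-in for the per-PHY list of LEDs. */
struct led_list {
	struct led_entry *head;
};

/* Register an LED and link it into the list. */
static void led_add(struct led_list *list, struct led_entry *led)
{
	led->registered = 1;
	led->next = list->head;
	list->head = led;
}

/* Unregister every LED and unlink it, so that no stale entry
 * survives until the next teardown. */
static void led_unregister_all(struct led_list *list)
{
	while (list->head) {
		struct led_entry *led = list->head;

		list->head = led->next;  /* the unlink step the fix adds */
		led->next = NULL;
		led->registered = 0;
	}
}

/* Count the entries currently linked into the list. */
static int led_count(const struct led_list *list)
{
	int n = 0;

	for (const struct led_entry *led = list->head; led; led = led->next)
		n++;
	return n;
}
```

After a down/up cycle the list then holds only the 2 freshly registered LEDs instead of 4.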
2024-10-08tools: Sync if_link.h uapi tooling headerDaniel Borkmann1-1/+552
Sync if_link uapi header to the latest version as we need the refresher in tooling for netkit device. Given it's been a while since the last sync and the diff is fairly big, it has been done as its own commit. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://lore.kernel.org/r/20241004101335.117711-4-daniel@iogearbox.net Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2024-10-08netkit: Add add netkit scrub support to rt_link.yamlDaniel Borkmann1-0/+15
Add netkit scrub attribute support to the rt_link.yaml spec file. Example: # ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/rt_link.yaml \ --do getlink --json '{"ifname": "nk0"}' --output-json | jq [...] "linkinfo": { "kind": "netkit", "data": { "primary": 0, "policy": "forward", "mode": "l3", "scrub": "default", "peer-policy": "forward", "peer-scrub": "default" } }, [...] Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Cc: Nikolay Aleksandrov <razor@blackwall.org> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://lore.kernel.org/r/20241004101335.117711-3-daniel@iogearbox.net Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2024-10-08netkit: Simplify netkit mode over to use NLA_POLICY_MAXDaniel Borkmann1-22/+3
Jakub suggested to rely on netlink policy validation via NLA_POLICY_MAX() instead of open-coding it. netkit_check_mode() is a candidate which can be simplified through this as well aside from the netkit scrubbing one. Suggested-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Cc: Nikolay Aleksandrov <razor@blackwall.org> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://lore.kernel.org/r/20241004101335.117711-2-daniel@iogearbox.net Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2024-10-08netkit: Add option for scrubbing skb meta dataDaniel Borkmann2-13/+70
Jordan reported that when running Cilium with netkit in per-endpoint-routes mode, network policy misclassifies traffic. In this direct routing mode of Cilium which is used in case of GKE/EKS/AKS, the Pod's BPF program to enforce policy sits on the netkit primary device's egress side. The issue here is that in case of netkit's netkit_prep_forward(), it will clear meta data such as skb->mark and skb->priority before executing the BPF program. Thus, identity data stored in there from earlier BPF programs (e.g. from tcx ingress on the physical device) gets cleared instead of being made available for the primary's program to process. While for traffic egressing the Pod via the peer device this might be desired, this is different for the primary one where compared to tcx egress on the host veth this information would be available. To address this, add a new parameter for the device orchestration to allow control of skb->mark and skb->priority scrubbing, to make the two accessible from BPF (and eventually leave it up to the program to scrub). By default, the current behavior is retained. For netkit peer this also enables the use case where applications could cooperate/signal intent to the BPF program. Note that struct netkit has a 4 byte hole between policy and bundle which is used here, in other words, struct netkit's first cacheline content used in fast-path does not get moved around. Fixes: 35dfaad7188c ("netkit, bpf: Add bpf programmable net device") Reported-by: Jordan Rife <jrife@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Cc: Nikolay Aleksandrov <razor@blackwall.org> Link: https://github.com/cilium/cilium/issues/34042 Acked-by: Jakub Kicinski <kuba@kernel.org> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://lore.kernel.org/r/20241004101335.117711-1-daniel@iogearbox.net Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
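The effect of the new scrub option can be sketched in a few lines of C. This is an illustration under stated assumptions: `fake_skb` and `prep_forward()` are hypothetical stand-ins for the real skb and netkit_prep_forward(), and the enum values are chosen for the example only. Only the two mode names come from the series.

```c
#include <stdint.h>

/* Mode names as used in the series; the numeric values here are
 * illustrative and not taken from the uapi header. */
enum netkit_scrub {
	NETKIT_SCRUB_NONE,
	NETKIT_SCRUB_DEFAULT,
};

/* Hypothetical stand-in for the two skb fields discussed above. */
struct fake_skb {
	uint32_t mark;
	uint32_t priority;
};

/* Clear mark and priority before the BPF program runs, unless
 * scrubbing was turned off for this device. */
static void prep_forward(struct fake_skb *skb, enum netkit_scrub scrub)
{
	if (scrub == NETKIT_SCRUB_DEFAULT) {
		skb->mark = 0;
		skb->priority = 0;
	}
}
```

With NETKIT_SCRUB_NONE, identity data stored in the mark by an earlier program stays visible to the netkit program, which can then decide itself whether to clear it.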
2024-10-08net: phy: mxl-gpy: add missing support for TRIGGER_NETDEV_LINK_10Daniel Golle1-0/+1
The PHY also supports 10MBit/s links and can offload the corresponding link indication trigger. Add TRIGGER_NETDEV_LINK_10 to the supported triggers. Signed-off-by: Daniel Golle <daniel@makrotopia.org> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Link: https://patch.msgid.link/cc5da0a989af8b0d49d823656d88053c4de2ab98.1728057367.git.daniel@makrotopia.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-10-08Merge tag 'for-net-2024-10-04' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetoothJakub Kicinski3-4/+21
Luiz Augusto von Dentz says: ==================== bluetooth pull request for net: - RFCOMM: FIX possible deadlock in rfcomm_sk_state_change - hci_conn: Fix UAF in hci_enhanced_setup_sync - btusb: Don't fail external suspend requests * tag 'for-net-2024-10-04' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth: Bluetooth: btusb: Don't fail external suspend requests Bluetooth: hci_conn: Fix UAF in hci_enhanced_setup_sync Bluetooth: RFCOMM: FIX possible deadlock in rfcomm_sk_state_change ==================== Link: https://patch.msgid.link/20241004210124.4010321-1-luiz.dentz@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-10-08vmxnet3: support higher link speeds from vmxnet3 v9Ronak Doshi1-0/+8
Until now, vmxnet3 reported 10Gbps as link speed by default. Vmxnet3 v9 adds support for the user to configure higher link speeds. The user can configure the link speed via the VM's advanced parameters options in vCenter. This speed is reported in gbps by the hypervisor. This patch adds support for vmxnet3 to report higher link speeds and converts the value to mbps as expected by the Linux stack. Signed-off-by: Ronak Doshi <ronak.doshi@broadcom.com> Acked-by: Guolin Yang <guolin.yang@broadcom.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20241004174303.5370-1-ronak.doshi@broadcom.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
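The unit conversion mentioned above is a single multiplication. The helper name `gbps_to_mbps()` is hypothetical and the sketch leaves out how the driver obtains the value from the hypervisor.

```c
#include <stdint.h>

/* Hypothetical helper: convert a link speed reported in Gbps to the
 * Mbps unit that the network stack expects. */
static uint32_t gbps_to_mbps(uint32_t speed_gbps)
{
	return speed_gbps * 1000u;
}
```

For example, a configured speed of 25 Gbps would be reported as 25000 Mbps.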
2024-10-08dt-bindings: net: realtek: Use proper node namesLinus Walleij1-23/+23
We eventually want to get to a place where we fix all DTS files so that we can simply disallow switch/port/ports without the ethernet-* prefix so the DTS files are more readable. Replace: - switch with ethernet-switch - ports with ethernet-ports - port with ethernet-port Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> Signed-off-by: Linus Walleij <linus.walleij@linaro.org> Link: https://patch.msgid.link/20241004-realtek-bindings-fixup-v2-1-667afa08d184@linaro.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-10-08net: ethernet: adi: adin1110: Fix some error handling path in adin1110_read_fifo()Christophe JAILLET1-2/+2
If 'frame_size' is too small or if 'round_len' is an error code, it is likely that an error code should be returned to the caller. Actually, 'ret' is likely to be 0, so if one of these sanity checks fails, 'success' is returned. Return -EINVAL instead. Fixes: bc93e19d088b ("net: ethernet: adi: Add ADIN1110 support") Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Link: https://patch.msgid.link/8ff73b40f50d8fa994a454911b66adebce8da266.1727981562.git.christophe.jaillet@wanadoo.fr Signed-off-by: Jakub Kicinski <kuba@kernel.org>
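The error-path pattern that this fix addresses can be shown in isolation. The function below is a hypothetical reduction, not the driver's adin1110_read_fifo(): `MIN_FRAME_LEN` is an assumed placeholder threshold and the actual FIFO read is elided.

```c
#include <errno.h>
#include <stddef.h>

/* Assumed placeholder for the smallest acceptable frame size. */
#define MIN_FRAME_LEN 2

/* Hypothetical reduction of a read routine with two sanity checks. */
static int read_frame_checked(size_t frame_size, int round_len)
{
	int ret = 0;  /* still 0 when the sanity checks run */

	/* Returning 'ret' here would report success, so return an
	 * explicit error code instead. */
	if (frame_size < MIN_FRAME_LEN)
		return -EINVAL;
	if (round_len < 0)
		return -EINVAL;

	/* ... the actual read would happen here and update 'ret' ... */
	return ret;
}
```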
2024-10-08Revert "net: stmmac: set PP_FLAG_DMA_SYNC_DEV only if XDP is enabled"Jakub Kicinski1-1/+1
This reverts commit b514c47ebf41a6536551ed28a05758036e6eca7c. The commit describes that we don't have to sync the page when recycling, and it tries to optimize that case. But we do need to sync after allocation. Recycling side should be changed to pass the right sync size instead. Fixes: b514c47ebf41 ("net: stmmac: set PP_FLAG_DMA_SYNC_DEV only if XDP is enabled") Reported-by: Jon Hunter <jonathanh@nvidia.com> Link: https://lore.kernel.org/20241004070846.2502e9ea@kernel.org Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Reviewed-by: Furong Xu <0x1207@gmail.com> Link: https://patch.msgid.link/20241004142115.910876-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-10-08Merge branch 'ipv4-preliminary-work-for-per-netns-rtnl'Jakub Kicinski3-49/+32
Eric Dumazet says: ==================== ipv4: preliminary work for per-netns RTNL Inspired by 9b8ca04854fd ("ipv4: avoid quadratic behavior in FIB insertion of common address") and per-netns RTNL conversion started by Kuniyuki this week. ip_fib_check_default() can use RCU instead of a shared spinlock. fib_info_lock can be removed, RTNL is already used. fib_info_devhash[] can be removed in favor of a single pointer in net_device. ==================== Link: https://patch.msgid.link/20241004134720.579244-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-10-08ipv4: remove fib_info_devhash[]Eric Dumazet3-20/+19
Upcoming per-netns RTNL conversion needs to get rid of shared hash tables. fib_info_devhash[] is one of them. It is unclear why we used a hash table, because a single hlist_head per net device was cheaper and scalable. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://patch.msgid.link/20241004134720.579244-5-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-10-08ipv4: remove fib_info_lockEric Dumazet1-12/+6
After the prior patch, fib_info_lock became redundant because all of its users are holding RTNL. BH protection is not needed. Remove the READ_ONCE()/WRITE_ONCE() annotations around fib_info_cnt, since it is protected by RTNL. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://patch.msgid.link/20241004134720.579244-4-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-10-08ipv4: use rcu in ip_fib_check_default()Eric Dumazet1-9/+4
fib_info_devhash[] is not resized in fib_info_hash_move(). fib_nh structs are already freed after an rcu grace period. This allows removing fib_info_lock in the following patch. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://patch.msgid.link/20241004134720.579244-3-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-10-08ipv4: remove fib_devindex_hashfn()Eric Dumazet1-9/+4
fib_devindex_hashfn() converts a 32bit ifindex value to a 8bit hash. It makes no sense doing this from fib_info_hashfn() and fib_find_info_nh(). It is better to keep as many bits as possible to let fib_info_hashfn_result() have better spread. Only fib_info_devhash_bucket() needs to make this operation, we can 'inline' trivial fib_devindex_hashfn() in it. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://patch.msgid.link/20241004134720.579244-2-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
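Why folding the ifindex down to 8 bits early loses information can be seen from a multiplicative hash of the kind the kernel uses. The sketch below is a user-space model that assumes a golden-ratio multiply followed by a shift; the constant and function are modeled on the kernel's hash_32() helper but are re-declared here for illustration and should not be read as the exact code touched by the patch.

```c
#include <stdint.h>

/* Multiplier modeled on the kernel's 32-bit golden ratio constant. */
#define GOLDEN_RATIO_32 0x61C88647u

/* Reduce a 32-bit value to 'bits' bits (1 <= bits <= 32) by keeping
 * the top bits of the product. */
static uint32_t hash_32(uint32_t val, unsigned int bits)
{
	return (val * GOLDEN_RATIO_32) >> (32 - bits);
}
```

With bits = 8 every ifindex lands in one of only 256 values, so feeding that result into a second hash function throws away most of the input's entropy. Passing the full ifindex lets the final hash spread entries better.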
2024-10-08net: dsa: lan9303: ensure chip reset and wait for READY statusAnatolij Gustschin1-0/+29
Accessing device registers seems to be not reliable, the chip revision is sometimes detected wrongly (0 instead of expected 1). Ensure that the chip reset is performed via reset GPIO and then wait for 'Device Ready' status in HW_CFG register before doing any register initializations. Cc: stable@vger.kernel.org Fixes: a1292595e006 ("net: dsa: add new DSA switch driver for the SMSC-LAN9303") Signed-off-by: Anatolij Gustschin <agust@denx.de> [alex: reworked using read_poll_timeout()] Signed-off-by: Alexander Sverdlin <alexander.sverdlin@siemens.com> Reviewed-by: Vladimir Oltean <olteanv@gmail.com> Link: https://patch.msgid.link/20241004113655.3436296-1-alexander.sverdlin@siemens.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
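The "wait for Device Ready" step can be modeled as a bounded polling loop. This is a user-space sketch under stated assumptions: the bit position of `HW_CFG_READY` is assumed for the example, `wait_for_ready()` and `fake_read_hw_cfg()` are hypothetical, and a real driver would sleep between polls (the commit notes that the kernel version uses read_poll_timeout()).

```c
#include <errno.h>
#include <stdint.h>

/* Assumed position of the ready bit; check the datasheet. */
#define HW_CFG_READY (1u << 27)

typedef uint32_t (*reg_read_fn)(void *ctx);

/* Poll the register until the ready bit is set or the poll budget
 * is used up. Returns 0 on success, -ETIMEDOUT otherwise. */
static int wait_for_ready(reg_read_fn read_hw_cfg, void *ctx,
			  unsigned int max_polls)
{
	for (unsigned int i = 0; i < max_polls; i++) {
		if (read_hw_cfg(ctx) & HW_CFG_READY)
			return 0;
		/* a real driver would sleep here between polls */
	}
	return -ETIMEDOUT;
}

/* Hypothetical fake register: reports "not ready" for the number of
 * reads stored in *ctx, then reports ready. */
static uint32_t fake_read_hw_cfg(void *ctx)
{
	unsigned int *not_ready_reads = ctx;

	if (*not_ready_reads > 0) {
		(*not_ready_reads)--;
		return 0;
	}
	return HW_CFG_READY;
}
```

Only after `wait_for_ready()` returns 0 would register initialization proceed.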
2024-10-08lib: packing: catch kunit_kzalloc() failure in the pack() testVladimir Oltean1-0/+1
kunit_kzalloc() may fail. Other call sites verify that this is the case, either using a direct comparison with the NULL pointer, or the KUNIT_ASSERT_NOT_NULL() or KUNIT_ASSERT_NOT_ERR_OR_NULL(). Pick KUNIT_ASSERT_NOT_NULL() as the error handling method that made most sense to me. It's an unlikely thing to happen, but at least we call __kunit_abort() instead of dereferencing this NULL pointer. Fixes: e9502ea6db8a ("lib: packing: add KUnit tests adapted from selftests") Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Link: https://patch.msgid.link/20241004110012.1323427-1-vladimir.oltean@nxp.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-10-08mlxsw: spectrum_acl_flex_keys: Constify struct mlxsw_afk_element_instChristophe JAILLET3-37/+37
'struct mlxsw_afk_element_inst' are not modified in these drivers. Constifying these structures moves some data to a read-only section, so increases overall security. Update a few functions and struct mlxsw_afk_block accordingly. On a x86_64, with allmodconfig, as an example: Before: ====== text data bss dec hex filename 4278 4032 0 8310 2076 drivers/net/ethernet/mellanox/mlxsw/spectrum_acl_flex_keys.o After: ===== text data bss dec hex filename 7934 352 0 8286 205e drivers/net/ethernet/mellanox/mlxsw/spectrum_acl_flex_keys.o Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/8ccfc7bfb2365dcee5b03c81ebe061a927d6da2e.1727541677.git.christophe.jaillet@wanadoo.fr Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-10-08net: dsa: remove obsolete phylink dsa_switch operationsRussell King (Oracle)3-56/+1
No driver now uses the DSA switch phylink members, so we can now remove the method pointers, but we need to leave empty shim functions to allow those drivers that do not provide phylink MAC operations structure to continue functioning. Signed-off-by: Russell King (oracle) <rmk+kernel@armlinux.org.uk> Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com> Tested-by: Vladimir Oltean <vladimir.oltean@nxp.com> # sja1105, felix, dsa_loop Link: https://patch.msgid.link/E1swKNV-0060oN-1b@rmk-PC.armlinux.org.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-10-08net: explicitly clear the sk pointer, when pf->create failsIgnat Korchagin1-1/+6
We have recently noticed the exact same KASAN splat as in commit 6cd4a78d962b ("net: do not leave a dangling sk pointer, when socket creation fails"). The problem is that commit did not fully address the problem, as some pf->create implementations do not use sk_common_release in their error paths. For example, we can use the same reproducer as in the above commit, but changing ping to arping. arping uses AF_PACKET socket and if packet_create fails, it will just sk_free the allocated sk object. While we could chase all the pf->create implementations and make sure they NULL the freed sk object on error from the socket, we can't guarantee future protocols will not make the same mistake. So it is easier to just explicitly NULL the sk pointer upon return from pf->create in __sock_create. We do know that pf->create always releases the allocated sk object on error, so if the pointer is not NULL, it is definitely dangling. Fixes: 6cd4a78d962b ("net: do not leave a dangling sk pointer, when socket creation fails") Signed-off-by: Ignat Korchagin <ignat@cloudflare.com> Cc: stable@vger.kernel.org Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20241003170151.69445-1-ignat@cloudflare.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
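The approach of clearing the pointer in the caller, rather than auditing every callee, can be sketched in user space. All names below are hypothetical: `failing_create()` plays the role of a pf->create implementation whose error path frees the object without resetting the pointer, and `sock_create_checked()` plays the role of the caller.

```c
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

struct fake_sk { int unused; };
struct fake_socket { struct fake_sk *sk; };

/* Hypothetical create callback: allocates sk, then fails and frees
 * it, leaving sock->sk pointing at freed memory. */
static int failing_create(struct fake_socket *sock)
{
	sock->sk = malloc(sizeof(*sock->sk));
	if (!sock->sk)
		return -ENOMEM;
	free(sock->sk);  /* error path: frees but does not clear */
	return -ENOBUFS;
}

/* Caller-side fix: on error the callee has already released sk, so
 * any non-NULL pointer is dangling and is cleared here. */
static int sock_create_checked(struct fake_socket *sock,
			       int (*create)(struct fake_socket *))
{
	int err = create(sock);

	if (err < 0)
		sock->sk = NULL;
	return err;
}
```

Because the reset happens in one central place, a future protocol that forgets to clear the pointer in its own error path is still covered.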
2024-10-08net: tcp: refresh tcp_mstamp for compressed ack in timerMenglong Dong1-0/+1
For now, we refresh the tcp_mstamp for delayed acks and keepalives, but not for the compressed ack in tcp_compressed_ack_kick(). I have not found out the effect of the tcp_mstamp when sending an ack, but we can still refresh it for the compressed ack to stay consistent. Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20241003082231.759759-1-dongml2@chinatelecom.cn Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-10-08btrfs: fix missing error handling when adding delayed ref with qgroups enabledFilipe Manana1-9/+33
When adding a delayed ref head, at delayed-ref.c:add_delayed_ref_head(), if we fail to insert the qgroup record we don't error out, we ignore it. In fact we treat it as if there was no error and there was already an existing record - we don't distinguish between the cases where btrfs_qgroup_trace_extent_nolock() returns 1, meaning a record already existed and we can free the given record, and the case where it returns a negative error value, meaning the insertion into the xarray that is used to track records failed. Effectively we end up ignoring that we are lacking qgroup record in the dirty extents xarray, resulting in incorrect qgroup accounting. Fix this by checking for errors and return them to the callers. Fixes: 3cce39a8ca4e ("btrfs: qgroup: use xarray to track dirty extents in transaction") Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-10-08btrfs: add cancellation points to trim loopsLuca Stefani3-3/+14
There are reports that the system cannot suspend while a trim is running because the task responsible for trimming the device isn't able to finish in time, especially since we have a free extent discarding phase, which can trim a lot of unallocated space. There are no limits on the trim size (unlike the block group part). Since trim isn't a critical call it can be interrupted at any time; in such cases we stop the trim, report the amount of discarded bytes and return an error. Link: https://bugzilla.kernel.org/show_bug.cgi?id=219180 Link: https://bugzilla.suse.com/show_bug.cgi?id=1229737 CC: stable@vger.kernel.org # 5.15+ Signed-off-by: Luca Stefani <luca.stefani.ge1@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-10-08btrfs: split remaining space to discard in chunksLuca Stefani2-4/+21
Per Qu Wenruo, in case we have a very large and mostly empty disk, e.g. an 8TiB device, although we split the discard according to our super block locations, the last super block ends at 256G, so we can submit a huge discard for the range [256G, 8T), causing a large delay. Split the space left to discard based on BTRFS_MAX_DISCARD_CHUNK_SIZE in preparation for the introduction of cancellation points to trim. The value of the chunk size is arbitrary; it can be higher or derived from actual device capabilities, but we can't easily read that using bio_discard_limit(). Link: https://bugzilla.kernel.org/show_bug.cgi?id=219180 Link: https://bugzilla.suse.com/show_bug.cgi?id=1229737 CC: stable@vger.kernel.org # 5.15+ Signed-off-by: Luca Stefani <luca.stefani.ge1@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
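The two btrfs trim changes above combine into a loop that discards in bounded chunks and checks for cancellation before each chunk. The sketch below is a user-space model: `MAX_DISCARD_CHUNK` is an arbitrary illustrative value rather than the kernel's BTRFS_MAX_DISCARD_CHUNK_SIZE, -EINTR stands in for whatever error the kernel returns, and the callback-based cancellation is a hypothetical substitute for the kernel's signal and freezer checks.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative chunk size (1 GiB), an assumption for this sketch. */
#define MAX_DISCARD_CHUNK (1ULL << 30)

typedef bool (*cancel_fn)(void *ctx);

/* Discard [start, start + len) in chunks. Before each chunk, ask
 * whether to stop; on cancellation return an error while still
 * reporting how many bytes were discarded so far. */
static int discard_range(uint64_t start, uint64_t len,
			 cancel_fn should_cancel, void *ctx,
			 uint64_t *discarded)
{
	*discarded = 0;
	while (len > 0) {
		uint64_t chunk = len < MAX_DISCARD_CHUNK ? len : MAX_DISCARD_CHUNK;

		if (should_cancel && should_cancel(ctx))
			return -EINTR;

		/* real code would issue the discard for
		 * [start, start + chunk) here */
		start += chunk;
		len -= chunk;
		*discarded += chunk;
	}
	return 0;
}

/* Hypothetical cancellation source: allows the number of chunks
 * stored in *ctx, then requests a stop. */
static bool cancel_after_n_chunks(void *ctx)
{
	unsigned int *remaining = ctx;

	if (*remaining == 0)
		return true;
	(*remaining)--;
	return false;
}
```

Without the chunking, a single request covering [256G, 8T) would leave no point at which the loop could notice a pending suspend.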
2024-10-07sched_ext, scx_qmap: Add and use SCX_ENQ_CPU_SELECTEDTejun Heo2-2/+3
scx_qmap and other schedulers in the SCX repo are using SCX_ENQ_WAKEUP to tell whether ops.select_cpu() was called. This is incorrect as ops.select_cpu() can be skipped in the wakeup path and leads to e.g. incorrectly skipping direct dispatch for tasks that are bound to a single CPU. sched core has been updated to specify ENQUEUE_RQ_SELECTED if ->select_task_rq() was called. Map it to SCX_ENQ_CPU_SELECTED and update scx_qmap to test it instead of SCX_ENQ_WAKEUP. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: David Vernet <void@manifault.com> Cc: Daniel Hodges <hodges.daniel.scott@gmail.com> Cc: Changwoo Min <multics69@gmail.com> Cc: Andrea Righi <andrea.righi@linux.dev> Cc: Dan Schatzberg <schatzberg.dan@gmail.com>
2024-10-07sched/core: Add ENQUEUE_RQ_SELECTED to indicate whether ->select_task_rq() was calledTejun Heo2-2/+9
During ttwu, ->select_task_rq() can be skipped if only one CPU is allowed or migration is disabled. sched_ext schedulers may perform operations such as direct dispatch from ->select_task_rq() path and it is useful for them to know whether ->select_task_rq() was skipped in the ->enqueue_task() path. Currently, sched_ext schedulers are using ENQUEUE_WAKEUP for this purpose and end up assuming incorrectly that ->select_task_rq() was called for tasks that are bound to a single CPU or migration disabled. Make select_task_rq() indicate whether ->select_task_rq() was called by setting WF_RQ_SELECTED in *wake_flags and make ttwu_do_activate() map that to ENQUEUE_RQ_SELECTED for ->enqueue_task(). This will be used by sched_ext to fix ->select_task_rq() skip detection. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: David Vernet <void@manifault.com>
2024-10-07sched/core: Make select_task_rq() take the pointer to wake_flags instead of valueTejun Heo1-5/+8
This will be used to allow select_task_rq() to indicate whether ->select_task_rq() was called by modifying *wake_flags. This makes try_to_wake_up() call all functions that take wake_flags with WF_TTWU set. Previously, only select_task_rq() was. Using the same flags is more consistent, and, as the flag is only tested by ->select_task_rq() implementations, it doesn't cause any behavior differences. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: David Vernet <void@manifault.com>
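The flag plumbing introduced by the three scheduler commits above can be modeled in a few lines. This is a simplified user-space sketch: the flag names come from the commit messages, but their numeric values, the `fake_task` struct and the helper `ttwu_enqueue_flags()` are assumptions made for illustration.

```c
#include <stdbool.h>

/* Flag names from the commits; the values are illustrative only. */
#define WF_TTWU             0x08
#define WF_RQ_SELECTED      0x80
#define ENQUEUE_WAKEUP      0x01
#define ENQUEUE_RQ_SELECTED 0x02

/* Hypothetical reduction of the task fields that matter here. */
struct fake_task {
	int nr_cpus_allowed;
	bool migration_disabled;
};

/* Takes a pointer to the wake flags so that it can record whether
 * the class callback was actually invoked. */
static void select_task_rq(const struct fake_task *p, int *wake_flags)
{
	if (p->nr_cpus_allowed > 1 && !p->migration_disabled) {
		/* the class ->select_task_rq() callback would run here */
		*wake_flags |= WF_RQ_SELECTED;
	}
}

/* Hypothetical helper: map the wake flag to the enqueue flag that
 * ->enqueue_task() implementations can test. */
static int ttwu_enqueue_flags(int wake_flags)
{
	int en_flags = ENQUEUE_WAKEUP;

	if (wake_flags & WF_RQ_SELECTED)
		en_flags |= ENQUEUE_RQ_SELECTED;
	return en_flags;
}
```

A task pinned to a single CPU still sees ENQUEUE_WAKEUP, which is why testing that flag alone misreports whether the CPU selection callback ran.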
2024-10-07Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhostLinus Torvalds5-31/+36
Pull virtio fixes from Michael Tsirkin: "Several small bugfixes all over the place. Most notably, fixes the vsock allocation with GFP_KERNEL in atomic context, which has been triggering warnings for lots of testers" * tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost: vhost/scsi: null-ptr-dereference in vhost_scsi_get_req() vsock/virtio: use GFP_ATOMIC under RCU read lock virtio_console: fix misc probe bugs virtio_ring: tag event_triggered as racy for KCSAN vdpa/octeon_ep: Fix format specifier for pointers in debug messages
2024-10-07vhost/scsi: null-ptr-dereference in vhost_scsi_get_req()Haoran Zhang1-12/+15
Since commit 3f8ca2e115e5 ("vhost/scsi: Extract common handling code from control queue handler") a null pointer dereference bug can be triggered when guest sends an SCSI AN request. In vhost_scsi_ctl_handle_vq(), `vc.target` is assigned with `&v_req.tmf.lun[1]` within a switch-case block and is then passed to vhost_scsi_get_req() which extracts `vc->req` and `tpg`. However, for a `VIRTIO_SCSI_T_AN_*` request, tpg is not required, so `vc.target` is set to NULL in this branch. Later, in vhost_scsi_get_req(), `vc->target` is dereferenced without being checked, leading to a null pointer dereference bug. This bug can be triggered from guest. When this bug occurs, the vhost_worker process is killed while holding `vq->mutex` and the corresponding tpg will remain occupied indefinitely. Below is the KASAN report: Oops: general protection fault, probably for non-canonical address 0xdffffc0000000000: 0000 [#1] PREEMPT SMP KASAN NOPTI KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007] CPU: 1 PID: 840 Comm: poc Not tainted 6.10.0+ #1 Hardware name: QEMU Ubuntu 24.04 PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014 RIP: 0010:vhost_scsi_get_req+0x165/0x3a0 Code: 00 fc ff df 48 89 fa 48 c1 ea 03 80 3c 02 00 0f 85 2b 02 00 00 48 b8 00 00 00 00 00 fc ff df 4d 8b 65 30 4c 89 e2 48 c1 ea 03 <0f> b6 04 02 4c 89 e2 83 e2 07 38 d0 7f 08 84 c0 0f 85 be 01 00 00 RSP: 0018:ffff888017affb50 EFLAGS: 00010246 RAX: dffffc0000000000 RBX: ffff88801b000000 RCX: 0000000000000000 RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff888017affcb8 RBP: ffff888017affb80 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000 R13: ffff888017affc88 R14: ffff888017affd1c R15: ffff888017993000 FS: 000055556e076500(0000) GS:ffff88806b100000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00000000200027c0 CR3: 0000000010ed0004 CR4: 0000000000370ef0 Call Trace: <TASK> ? 
show_regs+0x86/0xa0 ? die_addr+0x4b/0xd0 ? exc_general_protection+0x163/0x260 ? asm_exc_general_protection+0x27/0x30 ? vhost_scsi_get_req+0x165/0x3a0 vhost_scsi_ctl_handle_vq+0x2a4/0xca0 ? __pfx_vhost_scsi_ctl_handle_vq+0x10/0x10 ? __switch_to+0x721/0xeb0 ? __schedule+0xda5/0x5710 ? __kasan_check_write+0x14/0x30 ? _raw_spin_lock+0x82/0xf0 vhost_scsi_ctl_handle_kick+0x52/0x90 vhost_run_work_list+0x134/0x1b0 vhost_task_fn+0x121/0x350 ... </TASK> ---[ end trace 0000000000000000 ]--- Let's add a check in vhost_scsi_get_req. Fixes: 3f8ca2e115e5 ("vhost/scsi: Extract common handling code from control queue handler") Signed-off-by: Haoran Zhang <wh1sper@zju.edu.cn> [whitespace fixes] Signed-off-by: Mike Christie <michael.christie@oracle.com> Message-Id: <b26d7ddd-b098-4361-88f8-17ca7f90adf7@oracle.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
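A minimal sketch of the kind of guard the message above describes, written as standalone C rather than kernel code. The type `ctx_sketch`, the function `get_req_sketch`, and the single-byte target field are hypothetical stand-ins for illustration; the real vhost_scsi_get_req() and its structures differ, and the actual patch may be organized differently.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the vhost-scsi request context. */
struct ctx_sketch {
    const uint8_t *target;  /* NULL for requests that need no tpg */
};

/*
 * Returns 0 on success. The target id is stored only when the request
 * actually carries a target; *target_id is left untouched otherwise.
 * Returns -EINVAL for invalid arguments.
 */
static int get_req_sketch(const struct ctx_sketch *vc, int *target_id)
{
    if (!vc || !target_id)
        return -EINVAL;

    /* Check before dereferencing: some request types leave target NULL. */
    if (vc->target)
        *target_id = *vc->target;

    return 0;
}
```

The point of the sketch is only the ordering: the pointer is tested before it is read, so a request without a target skips the lookup instead of faulting.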
2024-10-07vsock/virtio: use GFP_ATOMIC under RCU read lockMichael S. Tsirkin1-4/+4
virtio_transport_send_pkt is now called on transport fast path, under RCU read lock. In that case, we have a bug: virtio_add_sgs is called with GFP_KERNEL, and might sleep. Pass the gfp flags as an argument, and use GFP_ATOMIC on the fast path. Link: https://lore.kernel.org/all/hfcr2aget2zojmqpr4uhlzvnep4vgskblx5b6xf2ddosbsrke7@nt34bxgp7j2x Fixes: efcd71af38be ("vsock/virtio: avoid queuing packets when intermediate queue is empty") Reported-by: Christian Brauner <brauner@kernel.org> Cc: Stefano Garzarella <sgarzare@redhat.com> Cc: Luigi Leonardi <luigi.leonardi@outlook.com> Message-ID: <3fbfb6e871f625f89eb578c7228e127437b1975a.1727876449.git.mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Pankaj Gupta <pankaj.gupta@amd.com> Reviewed-by: Christian Brauner <brauner@kernel.org> Reviewed-by: Luigi Leonardi <luigi.leonardi@outlook.com> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
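The idea of passing the allocation mode down from the caller can be sketched in plain C. Everything here is hypothetical: `gfp_sketch`, `may_sleep`, and `pick_gfp` only model the distinction between a sleeping and a non-sleeping allocation and are not the kernel's gfp_t API.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the kernel's allocation modes. */
enum gfp_sketch {
    GFP_ATOMIC_SKETCH,  /* must not sleep */
    GFP_KERNEL_SKETCH   /* may sleep */
};

/* Whether an allocation in this mode is allowed to block. */
static bool may_sleep(enum gfp_sketch gfp)
{
    return gfp == GFP_KERNEL_SKETCH;
}

/*
 * The caller knows its own context, so it chooses the mode and passes
 * it down instead of the callee hard-coding a sleeping allocation.
 */
static enum gfp_sketch pick_gfp(bool in_rcu_read_section)
{
    return in_rcu_read_section ? GFP_ATOMIC_SKETCH : GFP_KERNEL_SKETCH;
}
```

Under this model, a fast path running inside an RCU read-side section always ends up with the non-sleeping mode, while other callers keep the sleeping one.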
2024-10-07xfs: skip background cowblock trims on inodes open for writeBrian Foster1-8/+23
The background blockgc scanner runs on a 5m interval by default and trims preallocation (post-eof and cow fork) from inodes that are otherwise idle. Idle effectively means that iolock can be acquired without blocking and that the inode has no dirty pagecache or I/O in flight. This simple mechanism and heuristic has worked fairly well for post-eof speculative preallocations. Support for reflink and COW fork preallocations came sometime later and plugged into the same mechanism, with similar heuristics. Some recent testing has shown that COW fork preallocation may be notably more sensitive to blockgc processing than post-eof preallocation, however. For example, consider an 8GB reflinked file with a COW extent size hint of 1MB. A worst case fully randomized overwrite of this file results in ~8k extents of an average size of ~1MB. If the same workload is interrupted a couple times for blockgc processing (assuming the file goes idle), the resulting extent count explodes to over 100k extents with an average size <100kB. This is significantly worse than ideal and essentially defeats the COW extent size hint mechanism. While this particular test is instrumented, it reflects a fairly reasonable pattern in practice where random I/Os might spread out over a large period of time with varying periods of (in)activity. For example, consider a cloned disk image file for a VM or container with long uptime and variable and bursty usage. A background blockgc scan that races and processes the image file when it happens to be clean and idle can have a significant effect on the future fragmentation level of the file, even when still in use. To help combat this, update the heuristic to skip cowblocks inodes that are currently opened for write access during non-sync blockgc scans. This allows COW fork preallocations to persist for as long as possible unless otherwise needed for functional purposes (i.e. a sync scan), the file is idle and closed, or the inode is being evicted from cache. 
While here, update the comments to help distinguish performance oriented heuristics from the logic that exists to maintain functional correctness. Suggested-by: Darrick Wong <djwong@kernel.org> Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
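The updated heuristic can be read as a small decision function. The sketch below is an assumption-laden model, not the XFS code: `inode_sketch`, its fields, and `should_trim_cowblocks` are hypothetical names, and the idle checks (iolock, dirty pagecache, in-flight I/O) and the eviction path mentioned in the message are deliberately left out.

```c
#include <stdbool.h>

/* Hypothetical, simplified view of an inode for this heuristic. */
struct inode_sketch {
    int  write_opens;    /* open file references with write access */
    bool has_cowblocks;  /* COW fork preallocation is present */
};

/*
 * Decide whether a blockgc scan should trim COW fork preallocation.
 * A sync scan trims for functional reasons; a background scan skips
 * inodes that are still open for write.
 */
static bool should_trim_cowblocks(const struct inode_sketch *ip,
                                  bool sync_scan)
{
    if (!ip->has_cowblocks)
        return false;
    if (sync_scan)
        return true;
    return ip->write_opens == 0;
}
```

In this model a cloned disk image held open by a VM keeps its preallocation across background scans and only loses it once the file is closed or a sync scan runs.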
2024-10-07xfs: support lowmode allocations in xfs_bmap_exact_minlen_extent_allocChristoph Hellwig1-1/+7
Currently the debug-only xfs_bmap_exact_minlen_extent_alloc allocation variant fails to drop into the lowmode last resort allocator, and thus can sometimes fail allocations for which the caller has a transaction block reservation. Fix this by using xfs_bmap_btalloc_low_space to do the actual allocation. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2024-10-07xfs: call xfs_bmap_exact_minlen_extent_alloc from xfs_bmap_btallocChristoph Hellwig1-48/+13
xfs_bmap_exact_minlen_extent_alloc duplicates the args setup in xfs_bmap_btalloc. Switch to call it from xfs_bmap_btalloc after doing the basic setup. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2024-10-07xfs: don't ifdef around the exact minlen allocationsChristoph Hellwig3-14/+3
Exact minlen allocations only exist as an error injection tool for debug builds. Currently this is implemented using ifdefs, which means the code isn't even compiled for non-XFS_DEBUG builds. Enhance the compile test coverage by always building the code and use the compiler's dead code elimination to remove it from the generated binary instead. The only downside is that the alloc_minlen_only field is unconditionally added to struct xfs_alloc_args now, but by moving it around and packing it tightly this doesn't actually increase the size of the structure. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2024-10-07xfs: fold xfs_bmap_alloc_userdata into xfs_bmapi_allocateChristoph Hellwig1-45/+28
Userdata and metadata allocations end up in the same allocation helpers. Remove the separate xfs_bmap_alloc_userdata function to make this more clear. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2024-10-07xfs: distinguish extra split from real ENOSPC from xfs_attr_node_try_addnameChristoph Hellwig1-5/+8
Just like xfs_attr3_leaf_split, xfs_attr_node_try_addname can return -ENOSPC both for an actual failure to allocate a disk block and to signal the caller to convert the format of the attr fork. Use magic 1 to ask for the conversion here as well. Note that unlike the similar issue in xfs_attr3_leaf_split, this one was only found by code review. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2024-10-07xfs: distinguish extra split from real ENOSPC from xfs_attr3_leaf_splitChristoph Hellwig2-3/+7
xfs_attr3_leaf_split propagates the need for an extra btree split as -ENOSPC to its only caller, but the same return value can also be returned from xfs_da_grow_inode when it fails to find free space. Distinguish the two cases by returning 1 for the extra split case instead of overloading -ENOSPC. This can be triggered relatively easily with the pending realtime group support and a file system with a lot of small zones that use metadata space on the main device. In this case roughly every 5th to 10th run of xfs/538 runs into the following assert: ASSERT(oldblk->magic == XFS_ATTR_LEAF_MAGIC); in xfs_attr3_leaf_split caused by an allocation failure. Note that the allocation failure is caused by another bug that will be fixed subsequently, but this commit at least sorts out the error handling. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
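The return-value convention described above can be illustrated outside the kernel. The names `leaf_split_sketch`, `handle_split`, and `split_action` are hypothetical and only model the convention (negative errno for a real failure, 1 for "extra split needed", 0 for done); they do not reproduce the XFS functions.

```c
#include <errno.h>
#include <stdbool.h>

/*
 * Hypothetical split helper: a failed allocation is a real error,
 * while the need for another split is reported as a positive 1.
 */
static int leaf_split_sketch(bool alloc_ok, bool needs_extra_split)
{
    if (!alloc_ok)
        return -ENOSPC;  /* genuine out-of-space failure */
    if (needs_extra_split)
        return 1;        /* not an error: caller must split again */
    return 0;
}

enum split_action { SPLIT_DONE, SPLIT_AGAIN, SPLIT_FAILED };

/* The caller can now tell the two cases apart by sign alone. */
static enum split_action handle_split(int ret)
{
    if (ret < 0)
        return SPLIT_FAILED;
    return ret == 1 ? SPLIT_AGAIN : SPLIT_DONE;
}
```

With the overloaded -ENOSPC, the caller in this model could not distinguish the first two branches and would treat an allocation failure as a request for another split.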
2024-10-07xfs: return bool from xfs_attr3_leaf_addChristoph Hellwig3-27/+25
xfs_attr3_leaf_add only has two potential return values, indicating if the entry could be added or not. Replace the errno return with a bool so that ENOSPC from it can't easily be confused with a real ENOSPC. Remove the return value from the xfs_attr3_leaf_add_work helper entirely, as it always returns 0. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2024-10-07xfs: merge xfs_attr_leaf_try_add into xfs_attr_leaf_addnameChristoph Hellwig1-102/+74
xfs_attr_leaf_try_add is only called by xfs_attr_leaf_addname, and merging the two will simplify a following error handling fix. To facilitate this, move the remote block state save/restore helpers up in the file so that they don't need forward declarations now. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>