summaryrefslogtreecommitdiff
path: root/drivers/net/ethernet
AgeCommit message (Collapse)AuthorFilesLines
2025-02-17net: stmmac: use GFP_DMA32Matteo Croce1-2/+2
Signed-off-by: Matteo Croce <mcroce@microsoft.com>
2025-02-13mlxsw: Add return value check for mlxsw_sp_port_get_stats_raw()Wentao Liang1-1/+3
Add a check for the return value of mlxsw_sp_port_get_stats_raw() in __mlxsw_sp_port_get_stats(). If mlxsw_sp_port_get_stats_raw() returns an error, exit the function to prevent further processing with potentially invalid data. Fixes: 614d509aa1e7 ("mlxsw: Move ethtool_ops to spectrum_ethtool.c") Cc: stable@vger.kernel.org # 5.9+ Signed-off-by: Wentao Liang <vulab@iscas.ac.cn> Reviewed-by: Petr Machata <petrm@nvidia.com> Link: https://patch.msgid.link/20250212152311.1332-1-vulab@iscas.ac.cn Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-13net: ethernet: ti: am65_cpsw: fix tx_cleanup for XDP caseRoger Quadros1-3/+11
For XDP transmit case, swdata doesn't contain SKB but the XDP Frame. Infer the correct swdata based on buffer type and return the XDP Frame for XDP transmit case. Signed-off-by: Roger Quadros <rogerq@kernel.org> Fixes: 8acacc40f733 ("net: ethernet: ti: am65-cpsw: Add minimal XDP support") Link: https://patch.msgid.link/20250210-am65-cpsw-xdp-fixes-v1-3-ec6b1f7f1aca@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-13net: ethernet: ti: am65-cpsw: fix RX & TX statistics for XDP_TX caseRoger Quadros1-3/+7
For successful XDP_TX and XDP_REDIRECT cases, the packet was received successfully so update RX statistics. Use original received packet length for that. TX packets statistics are incremented on TX completion so don't update it while TX queueing. If xdp_convert_buff_to_frame() fails, increment tx_dropped. Signed-off-by: Roger Quadros <rogerq@kernel.org> Fixes: 8acacc40f733 ("net: ethernet: ti: am65-cpsw: Add minimal XDP support") Link: https://patch.msgid.link/20250210-am65-cpsw-xdp-fixes-v1-2-ec6b1f7f1aca@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-13net: ethernet: ti: am65-cpsw: fix memleak in certain XDP casesRoger Quadros1-13/+13
If the XDP program doesn't result in XDP_PASS then we leak the memory allocated by am65_cpsw_build_skb(). It is pointless to allocate SKB memory before running the XDP program as we would be wasting CPU cycles for cases other than XDP_PASS. Move the SKB allocation after evaluating the XDP program result. This fixes the memleak. A performance boost is seen for XDP_DROP test. XDP_DROP test: Before: 460256 rx/s 0 err/s After: 784130 rx/s 0 err/s Fixes: 8acacc40f733 ("net: ethernet: ti: am65-cpsw: Add minimal XDP support") Signed-off-by: Roger Quadros <rogerq@kernel.org> Link: https://patch.msgid.link/20250210-am65-cpsw-xdp-fixes-v1-1-ec6b1f7f1aca@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-13Merge branch '200GbE' of ↵Jakub Kicinski4-14/+20
git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue Tony Nguyen says: ==================== Intel Wired LAN Driver Updates 2025-02-11 (idpf, ixgbe, igc) For idpf: Sridhar fixes a couple issues in handling of RSC packets. Josh adds a call to set_real_num_queues() to keep queue count in sync. For ixgbe: Piotr removes missed IS_ERR() removal when ERR_PTR usage was removed. For igc: Zdenek Bouska fixes reporting of Rx timestamp with AF_XDP. Siang sets buffer type on empty frame to ensure proper handling. * '200GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue: igc: Set buffer type for empty frames in igc_init_empty_frame igc: Fix HW RX timestamp when passed by ZC XDP ixgbe: Fix possible skb NULL pointer dereference idpf: call set_real_num_queues in idpf_open idpf: record rx queue in skb for RSC packets idpf: fix handling rsc packet with a single segment ==================== Link: https://patch.msgid.link/20250211214343.4092496-1-anthony.l.nguyen@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-12iavf: Fix a locking bug in an error pathBart Van Assche1-1/+1
If the netdev lock has been obtained, unlock it before returning. This bug has been detected by the Clang thread-safety analyzer. Fixes: afc664987ab3 ("eth: iavf: extend the netdev_lock usage") Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://patch.msgid.link/20250206175114.1974171-28-bvanassche@acm.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-11igc: Set buffer type for empty frames in igc_init_empty_frameSong Yoong Siang1-0/+1
Set the buffer type to IGC_TX_BUFFER_TYPE_SKB for empty frame in the igc_init_empty_frame function. This ensures that the buffer type is correctly identified and handled during Tx ring cleanup. Fixes: db0b124f02ba ("igc: Enhance Qbv scheduling by using first flag bit") Cc: stable@vger.kernel.org # 6.2+ Signed-off-by: Song Yoong Siang <yoong.siang.song@intel.com> Acked-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Tested-by: Mor Bar-Gabay <morx.bar.gabay@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-02-11igc: Fix HW RX timestamp when passed by ZC XDPZdenek Bouska1-9/+12
Fixes HW RX timestamp in the following scenario: - AF_PACKET socket with enabled HW RX timestamps is created - AF_XDP socket with enabled zero copy is created - frame is forwarded to the BPF program, where the timestamp should still be readable (extracted by igc_xdp_rx_timestamp(), kfunc behind bpf_xdp_metadata_rx_timestamp()) - the frame got XDP_PASS from BPF program, redirecting to the stack - AF_PACKET socket receives the frame with HW RX timestamp Moves the skb timestamp setting from igc_dispatch_skb_zc() to igc_construct_skb_zc() so that igc_construct_skb_zc() is similar to igc_construct_skb(). This issue can also be reproduced by running: # tools/testing/selftests/bpf/xdp_hw_metadata enp1s0 When a frame with the wrong port 9092 (instead of 9091) is used: # echo -n xdp | nc -u -q1 192.168.10.9 9092 then the RX timestamp is missing and xdp_hw_metadata prints: skb hwtstamp is not found! With this fix or when copy mode is used: # tools/testing/selftests/bpf/xdp_hw_metadata -c enp1s0 then RX timestamp is found and xdp_hw_metadata prints: found skb hwtstamp = 1736509937.852786132 Fixes: 069b142f5819 ("igc: Add support for PTP .getcyclesx64()") Signed-off-by: Zdenek Bouska <zdenek.bouska@siemens.com> Acked-by: Vinicius Costa Gomes <vinicius.gomes@intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Florian Bezdeka <florian.bezdeka@siemens.com> Reviewed-by: Song Yoong Siang <yoong.siang.song@intel.com> Tested-by: Mor Bar-Gabay <morx.bar.gabay@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-02-11ixgbe: Fix possible skb NULL pointer dereferencePiotr Kwapulinski1-1/+1
The commit c824125cbb18 ("ixgbe: Fix passing 0 to ERR_PTR in ixgbe_run_xdp()") stopped utilizing the ERR-like macros for xdp status encoding. Propagate this logic to the ixgbe_put_rx_buffer(). The commit also relaxed the skb NULL pointer check - caught by Smatch. Restore this check. Fixes: c824125cbb18 ("ixgbe: Fix passing 0 to ERR_PTR in ixgbe_run_xdp()") Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Closes: https://lore.kernel.org/intel-wired-lan/2c7d6c31-192a-4047-bd90-9566d0e14cc0@stanley.mountain/ Acked-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: Piotr Kwapulinski <piotr.kwapulinski@intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Tested-by: Saritha Sanigani <sarithax.sanigani@intel.com> (A Contingent Worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-02-11idpf: call set_real_num_queues in idpf_openJoshua Hay1-0/+5
On initial driver load, alloc_etherdev_mqs is called with whatever max queue values are provided by the control plane. However, if the driver is loaded on a system where num_online_cpus() returns less than the max queues, the netdev will think there are more queues than are actually available. Only num_online_cpus() will be allocated, but skb_get_queue_mapping(skb) could possibly return an index beyond the range of allocated queues. Consequently, the packet is silently dropped and it appears as if TX is broken. Set the real number of queues during open so the netdev knows how many queues will be allocated. Fixes: 1c325aac10a8 ("idpf: configure resources for TX queues") Signed-off-by: Joshua Hay <joshua.a.hay@intel.com> Reviewed-by: Madhu Chittim <madhu.chittim@intel.com> Tested-by: Samuel Salin <Samuel.salin@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-02-11idpf: record rx queue in skb for RSC packetsSridhar Samudrala1-2/+1
Move the call to skb_record_rx_queue in idpf_rx_process_skb_fields() so that RX queue is recorded for RSC packets too. Fixes: 90912f9f4f2d ("idpf: convert header split mode to libeth + napi_build_skb()") Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com> Reviewed-by: Madhu Chittim <madhu.chittim@intel.com> Tested-by: Samuel Salin <Samuel.salin@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-02-11idpf: fix handling rsc packet with a single segmentSridhar Samudrala1-2/+0
Handle rsc packet with a single segment same as a multi segment rsc packet so that CHECKSUM_PARTIAL is set in the skb->ip_summed field. The current code is passing CHECKSUM_NONE resulting in TCP GRO layer doing checksum in SW and hiding the issue. This will fail when using dmabufs as payload buffers as skb frag would be unreadable. Fixes: 3a8845af66ed ("idpf: add RX splitq napi poll support") Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Tested-by: Samuel Salin <Samuel.salin@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-02-11net: stmmac: Apply new page pool parameters when SPH is enabledFurong Xu1-0/+5
Commit df542f669307 ("net: stmmac: Switch to zero-copy in non-XDP RX path") makes DMA write received frame into buffer at offset of NET_SKB_PAD and sets page pool parameters to sync from offset of NET_SKB_PAD. But when Header Payload Split is enabled, the header is written at offset of NET_SKB_PAD, while the payload is written at offset of zero. Uncorrect offset parameter for the payload breaks dma coherence [1] since both CPU and DMA touch the page buffer from offset of zero which is not handled by the page pool sync parameter. And in case the DMA cannot split the received frame, for example, a large L2 frame, pp_params.max_len should grow to match the tail of entire frame. [1] https://lore.kernel.org/netdev/d465f277-bac7-439f-be1d-9a47dfe2d951@nvidia.com/ Reported-by: Jon Hunter <jonathanh@nvidia.com> Reported-by: Brad Griffis <bgriffis@nvidia.com> Suggested-by: Ido Schimmel <idosch@idosch.org> Fixes: df542f669307 ("net: stmmac: Switch to zero-copy in non-XDP RX path") Signed-off-by: Furong Xu <0x1207@gmail.com> Tested-by: Jon Hunter <jonathanh@nvidia.com> Tested-by: Thierry Reding <treding@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/20250207085639.13580-1-0x1207@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-06Revert "net: stmmac: Specify hardware capability value when FIFO size isn't ↵Russell King (Oracle)1-18/+17
specified" This reverts commit 8865d22656b4, which caused breakage for platforms which are not using xgmac2 or gmac4. Only these two cores have the capability of providing the FIFO sizes from hardware capability fields (which are provided in priv->dma_cap.[tr]x_fifo_size.) All other cores can not, which results in these two fields containing zero. We also have platforms that do not provide a value in priv->plat->[tr]x_fifo_size, resulting in these also being zero. This causes the new tests introduced by the reverted commit to fail, and produce e.g.: stmmaceth f0804000.eth: Can't specify Rx FIFO size An example of such a platform which fails is QEMU's npcm750-evb. This uses dwmac1000 which, as noted above, does not have the capability to provide the FIFO sizes from hardware. Therefore, revert the commit to maintain compatibility with the way the driver used to work. Reported-by: Guenter Roeck <linux@roeck-us.net> Link: https://lore.kernel.org/r/4e98f967-f636-46fb-9eca-d383b9495b86@roeck-us.net Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Tested-by: Steven Price <steven.price@arm.com> Fixes: 8865d22656b4 ("net: stmmac: Specify hardware capability value when FIFO size isn't specified") Link: https://patch.msgid.link/E1tfeyR-003YGJ-Gb@rmk-PC.armlinux.org.uk Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-02-05net: atlantic: fix warning during hot unplugJacob Moroni1-1/+3
Firmware deinitialization performs MMIO accesses which are not necessary if the device has already been removed. In some cases, these accesses happen via readx_poll_timeout_atomic which ends up timing out, resulting in a warning at hw_atl2_utils_fw.c:112: [ 104.595913] Call Trace: [ 104.595915] <TASK> [ 104.595918] ? show_regs+0x6c/0x80 [ 104.595923] ? __warn+0x8d/0x150 [ 104.595925] ? aq_a2_fw_deinit+0xcf/0xe0 [atlantic] [ 104.595934] ? report_bug+0x182/0x1b0 [ 104.595938] ? handle_bug+0x6e/0xb0 [ 104.595940] ? exc_invalid_op+0x18/0x80 [ 104.595942] ? asm_exc_invalid_op+0x1b/0x20 [ 104.595944] ? aq_a2_fw_deinit+0xcf/0xe0 [atlantic] [ 104.595952] ? aq_a2_fw_deinit+0xcf/0xe0 [atlantic] [ 104.595959] aq_nic_deinit.part.0+0xbd/0xf0 [atlantic] [ 104.595964] aq_nic_deinit+0x17/0x30 [atlantic] [ 104.595970] aq_ndev_close+0x2b/0x40 [atlantic] [ 104.595975] __dev_close_many+0xad/0x160 [ 104.595978] dev_close_many+0x99/0x170 [ 104.595979] unregister_netdevice_many_notify+0x18b/0xb20 [ 104.595981] ? __call_rcu_common+0xcd/0x700 [ 104.595984] unregister_netdevice_queue+0xc6/0x110 [ 104.595986] unregister_netdev+0x1c/0x30 [ 104.595988] aq_pci_remove+0xb1/0xc0 [atlantic] Fix this by skipping firmware deinitialization altogether if the PCI device is no longer present. Tested with an AQC113 attached via Thunderbolt by performing repeated unplug cycles while traffic was running via iperf. Fixes: 97bde5c4f909 ("net: ethernet: aquantia: Support for NIC-specific code") Signed-off-by: Jacob Moroni <mail@jakemoroni.com> Reviewed-by: Igor Russkikh <irusskikh@marvell.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250203143604.24930-3-mail@jakemoroni.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-04Merge branch '100GbE' of ↵Jakub Kicinski3-91/+103
git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue Tony Nguyen says: ==================== ice: fix Rx data path for heavy 9k MTU traffic Maciej Fijalkowski says: This patchset fixes a pretty nasty issue that was reported by RedHat folks which occurred after ~30 minutes (this value varied, just trying here to state that it was not observed immediately but rather after a considerable longer amount of time) when ice driver was tortured with jumbo frames via mix of iperf traffic executed simultaneously with wrk/nginx on client/server sides (HTTP and TCP workloads basically). The reported splats were spanning across all the bad things that can happen to the state of page - refcount underflow, use-after-free, etc. One of these looked as follows: [ 2084.019891] BUG: Bad page state in process swapper/34 pfn:97fcd0 [ 2084.025990] page:00000000a60ee772 refcount:-1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x97fcd0 [ 2084.035462] flags: 0x17ffffc0000000(node=0|zone=2|lastcpupid=0x1fffff) [ 2084.041990] raw: 0017ffffc0000000 dead000000000100 dead000000000122 0000000000000000 [ 2084.049730] raw: 0000000000000000 0000000000000000 ffffffffffffffff 0000000000000000 [ 2084.057468] page dumped because: nonzero _refcount [ 2084.062260] Modules linked in: bonding tls sunrpc intel_rapl_msr intel_rapl_common intel_uncore_frequency intel_uncore_frequency_common i10nm_edac nfit libnvdimm x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm mgag200 irqd [ 2084.137829] CPU: 34 PID: 0 Comm: swapper/34 Kdump: loaded Not tainted 5.14.0-427.37.1.el9_4.x86_64 #1 [ 2084.147039] Hardware name: Dell Inc. PowerEdge R750/0216NK, BIOS 1.13.2 12/19/2023 [ 2084.154604] Call Trace: [ 2084.157058] <IRQ> [ 2084.159080] dump_stack_lvl+0x34/0x48 [ 2084.162752] bad_page.cold+0x63/0x94 [ 2084.166333] check_new_pages+0xb3/0xe0 [ 2084.170083] rmqueue_bulk+0x2d2/0x9e0 [ 2084.173749] ? ktime_get+0x35/0xa0 [ 2084.177159] rmqueue_pcplist+0x13b/0x210 [ 2084.181081] rmqueue+0x7d3/0xd40 [ 2084.184316] ? xas_load+0x9/0xa0 [ 2084.187547] ? xas_find+0x183/0x1d0 [ 2084.191041] ? xa_find_after+0xd0/0x130 [ 2084.194879] ? intel_iommu_iotlb_sync_map+0x89/0xe0 [ 2084.199759] get_page_from_freelist+0x11f/0x530 [ 2084.204291] __alloc_pages+0xf2/0x250 [ 2084.207958] ice_alloc_rx_bufs+0xcc/0x1c0 [ice] [ 2084.212543] ice_clean_rx_irq+0x631/0xa20 [ice] [ 2084.217111] ice_napi_poll+0xdf/0x2a0 [ice] [ 2084.221330] __napi_poll+0x27/0x170 [ 2084.224824] net_rx_action+0x233/0x2f0 [ 2084.228575] __do_softirq+0xc7/0x2ac [ 2084.232155] __irq_exit_rcu+0xa1/0xc0 [ 2084.235821] common_interrupt+0x80/0xa0 [ 2084.239662] </IRQ> [ 2084.241768] <TASK> The fix is mostly about reverting what was done in commit 1dc1a7e7f410 ("ice: Centrallize Rx buffer recycling") followed by proper timing on page_count() storage and then removing the ice_rx_buf::act related logic (which was mostly introduced for purposes from cited commit). Special thanks to Xu Du for providing reproducer and Jacob Keller for initial extensive analysis. * '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue: ice: stop storing XDP verdict within ice_rx_buf ice: gather page_count()'s of each frag right before XDP prog call ice: put Rx buffers after being done with current frame ==================== Link: https://patch.msgid.link/20250131185415.3741532-1-anthony.l.nguyen@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-03tg3: Disable tg3 PCIe AER on system rebootLenny Szubowicz1-0/+58
Disable PCIe AER on the tg3 device on system reboot on a limited list of Dell PowerEdge systems. This prevents a fatal PCIe AER event on the tg3 device during the ACPI _PTS (prepare to sleep) method for S5 on those systems. The _PTS is invoked by acpi_enter_sleep_state_prep() as part of the kernel's reboot sequence as a result of commit 38f34dba806a ("PM: ACPI: reboot: Reinstate S5 for reboot"). There was an earlier fix for this problem by commit 2ca1c94ce0b6 ("tg3: Disable tg3 device on system reboot to avoid triggering AER"). But it was discovered that this earlier fix caused a reboot hang when some Dell PowerEdge servers were booted via ipxe. To address this reboot hang, the earlier fix was essentially reverted by commit 9fc3bc764334 ("tg3: power down device only on SYSTEM_POWER_OFF"). This re-exposed the tg3 PCIe AER on reboot problem. This fix is not an ideal solution because the root cause of the AER is in system firmware. Instead, it's a targeted work-around in the tg3 driver. Note also that the PCIe AER must be disabled on the tg3 device even if the system is configured to use "firmware first" error handling. V3: - Fix sparse warning on improper comparison of pdev->current_state - Adhere to netdev comment style Fixes: 9fc3bc764334 ("tg3: power down device only on SYSTEM_POWER_OFF") Signed-off-by: Lenny Szubowicz <lszubowi@redhat.com> Reviewed-by: Pavan Chebbi <pavan.chebbi@broadcom.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2025-02-02ice: Add check for devm_kzalloc()Jiasheng Jiang1-0/+3
Add check for the return value of devm_kzalloc() to guarantee the success of allocation. Fixes: 42c2eb6b1f43 ("ice: Implement devlink-rate API") Signed-off-by: Jiasheng Jiang <jiashengjiangcool@gmail.com> Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Link: https://patch.msgid.link/20250131013832.24805-1-jiashengjiangcool@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-02net: bcmgenet: Correct overlaying of PHY and MAC Wake-on-LANFlorian Fainelli1-4/+12
Some Wake-on-LAN modes such as WAKE_FILTER may only be supported by the MAC, while others might be only supported by the PHY. Make sure that the .get_wol() returns the union of both rather than only that of the PHY if the PHY supports Wake-on-LAN. When disabling Wake-on-LAN, make sure that this is done at both the PHY and MAC level, rather than doing an early return from the PHY driver. Fixes: 7e400ff35cbe ("net: bcmgenet: Add support for PHY-based Wake-on-LAN") Fixes: 9ee09edc05f2 ("net: bcmgenet: Properly overlay PHY and MAC Wake-on-LAN capabilities") Signed-off-by: Florian Fainelli <florian.fainelli@broadcom.com> Link: https://patch.msgid.link/20250129231342.35013-1-florian.fainelli@broadcom.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-31ice: stop storing XDP verdict within ice_rx_bufMaciej Fijalkowski3-70/+36
Idea behind having ice_rx_buf::act was to simplify and speed up the Rx data path by walking through buffers that were representing cleaned HW Rx descriptors. Since it caused us a major headache recently and we rolled back to old approach that 'puts' Rx buffers right after running XDP prog/creating skb, this is useless now and should be removed. Get rid of ice_rx_buf::act and related logic. We still need to take care of a corner case where XDP program releases a particular fragment. Make ice_run_xdp() to return its result and use it within ice_put_rx_mbuf(). Fixes: 2fba7dc5157b ("ice: Add support for XDP multi-buffer on Rx side") Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Tested-by: Chandan Kumar Rout <chandanx.rout@intel.com> (A Contingent Worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-01-31ice: gather page_count()'s of each frag right before XDP prog callMaciej Fijalkowski1-1/+26
If we store the pgcnt on few fragments while being in the middle of gathering the whole frame and we stumbled upon DD bit not being set, we terminate the NAPI Rx processing loop and come back later on. Then on next NAPI execution we work on previously stored pgcnt. Imagine that second half of page was used actively by networking stack and by the time we came back, stack is not busy with this page anymore and decremented the refcnt. The page reuse algorithm in this case should be good to reuse the page but given the old refcnt it will not do so and attempt to release the page via page_frag_cache_drain() with pagecnt_bias used as an arg. This in turn will result in negative refcnt on struct page, which was initially observed by Xu Du. Therefore, move the page count storage from ice_get_rx_buf() to a place where we are sure that whole frame has been collected, but before calling XDP program as it internally can also change the page count of fragments belonging to xdp_buff. Fixes: ac0753391195 ("ice: Store page count inside ice_rx_buf") Reported-and-tested-by: Xu Du <xudu@redhat.com> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Co-developed-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Tested-by: Chandan Kumar Rout <chandanx.rout@intel.com> (A Contingent Worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-01-31ice: put Rx buffers after being done with current frameMaciej Fijalkowski1-29/+50
Introduce a new helper ice_put_rx_mbuf() that will go through gathered frags from current frame and will call ice_put_rx_buf() on them. Current logic that was supposed to simplify and optimize the driver where we go through a batch of all buffers processed in current NAPI instance turned out to be broken for jumbo frames and very heavy load that was coming from both multi-thread iperf and nginx/wrk pair between server and client. The delay introduced by approach that we are dropping is simply too big and we need to take the decision regarding page recycling/releasing as quick as we can. While at it, address an error path of ice_add_xdp_frag() - we were missing buffer putting from day 1 there. As a nice side effect we get rid of annoying and repetitive three-liner: xdp->data = NULL; rx_ring->first_desc = ntc; rx_ring->nr_frags = 0; by embedding it within introduced routine. Fixes: 1dc1a7e7f410 ("ice: Centrallize Rx buffer recycling") Reported-and-tested-by: Xu Du <xudu@redhat.com> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Co-developed-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Tested-by: Chandan Kumar Rout <chandanx.rout@intel.com> (A Contingent Worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-01-30Merge tag 'net-6.14-rc1' of ↵Linus Torvalds28-84/+244
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Jakub Kicinski: "Including fixes from IPSec, netfilter and Bluetooth. Nothing really stands out, but as usual there's a slight concentration of fixes for issues added in the last two weeks before the merge window, and driver bugs from 6.13 which tend to get discovered upon wider distribution. Current release - regressions: - net: revert RTNL changes in unregister_netdevice_many_notify() - Bluetooth: fix possible infinite recursion of btusb_reset - eth: adjust locking in some old drivers which protect their state with spinlocks to avoid sleeping in atomic; core protects netdev state with a mutex now Previous releases - regressions: - eth: - mlx5e: make sure we pass node ID, not CPU ID to kvzalloc_node() - bgmac: reduce max frame size to support just 1500 bytes; the jumbo frame support would previously cause OOB writes, but now fails outright - mptcp: blackhole only if 1st SYN retrans w/o MPC is accepted, avoid false detection of MPTCP blackholing Previous releases - always broken: - mptcp: handle fastopen disconnect correctly - xfrm: - make sure skb->sk is a full sock before accessing its fields - fix taking a lock with preempt disabled for RT kernels - usb: ipheth: improve safety of packet metadata parsing; prevent potential OOB accesses - eth: renesas: fix missing rtnl lock in suspend/resume path" * tag 'net-6.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (88 commits) MAINTAINERS: add Neal to TCP maintainers net: revert RTNL changes in unregister_netdevice_many_notify() net: hsr: fix fill_frame_info() regression vs VLAN packets doc: mptcp: sysctl: blackhole_timeout is per-netns mptcp: blackhole only if 1st SYN retrans w/o MPC is accepted netfilter: nf_tables: reject mismatching sum of field_len with set key length net: sh_eth: Fix missing rtnl lock in suspend/resume path net: ravb: Fix missing rtnl lock in suspend/resume path selftests/net: Add test for loading devbound XDP program in generic mode net: xdp: Disallow attaching device-bound programs in generic mode tcp: correct handling of extreme memory squeeze bgmac: reduce max frame size to support just MTU 1500 vsock/test: Add test for connect() retries vsock/test: Add test for UAF due to socket unbinding vsock/test: Introduce vsock_connect_fd() vsock/test: Introduce vsock_bind() vsock: Allow retrying on connect() failure vsock: Keep the binding until socket destruction Bluetooth: L2CAP: accept zero as a special value for MTU auto-selection Bluetooth: btnxpuart: Fix glitches seen in dual A2DP streaming ...
2025-01-30net: sh_eth: Fix missing rtnl lock in suspend/resume pathKory Maincent1-0/+4
Fix the suspend/resume path by ensuring the rtnl lock is held where required. Calls to sh_eth_close, sh_eth_open and wol operations must be performed under the rtnl lock to prevent conflicts with ongoing ndo operations. Fixes: b71af04676e9 ("sh_eth: add more PM methods") Tested-by: Niklas Söderlund <niklas.soderlund+renesas@ragnatech.se> Reviewed-by: Sergey Shtylyov <s.shtylyov@omp.ru> Signed-off-by: Kory Maincent <kory.maincent@bootlin.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-01-30net: ravb: Fix missing rtnl lock in suspend/resume pathKory Maincent1-8/+14
Fix the suspend/resume path by ensuring the rtnl lock is held where required. Calls to ravb_open, ravb_close and wol operations must be performed under the rtnl lock to prevent conflicts with ongoing ndo operations. Without this fix, the following warning is triggered: [ 39.032969] ============================= [ 39.032983] WARNING: suspicious RCU usage [ 39.033019] ----------------------------- [ 39.033033] drivers/net/phy/phy_device.c:2004 suspicious rcu_dereference_protected() usage! ... [ 39.033597] stack backtrace: [ 39.033613] CPU: 0 UID: 0 PID: 174 Comm: python3 Not tainted 6.13.0-rc7-next-20250116-arm64-renesas-00002-g35245dfdc62c #7 [ 39.033623] Hardware name: Renesas SMARC EVK version 2 based on r9a08g045s33 (DT) [ 39.033628] Call trace: [ 39.033633] show_stack+0x14/0x1c (C) [ 39.033652] dump_stack_lvl+0xb4/0xc4 [ 39.033664] dump_stack+0x14/0x1c [ 39.033671] lockdep_rcu_suspicious+0x16c/0x22c [ 39.033682] phy_detach+0x160/0x190 [ 39.033694] phy_disconnect+0x40/0x54 [ 39.033703] ravb_close+0x6c/0x1cc [ 39.033714] ravb_suspend+0x48/0x120 [ 39.033721] dpm_run_callback+0x4c/0x14c [ 39.033731] device_suspend+0x11c/0x4dc [ 39.033740] dpm_suspend+0xdc/0x214 [ 39.033748] dpm_suspend_start+0x48/0x60 [ 39.033758] suspend_devices_and_enter+0x124/0x574 [ 39.033769] pm_suspend+0x1ac/0x274 [ 39.033778] state_store+0x88/0x124 [ 39.033788] kobj_attr_store+0x14/0x24 [ 39.033798] sysfs_kf_write+0x48/0x6c [ 39.033808] kernfs_fop_write_iter+0x118/0x1a8 [ 39.033817] vfs_write+0x27c/0x378 [ 39.033825] ksys_write+0x64/0xf4 [ 39.033833] __arm64_sys_write+0x18/0x20 [ 39.033841] invoke_syscall+0x44/0x104 [ 39.033852] el0_svc_common.constprop.0+0xb4/0xd4 [ 39.033862] do_el0_svc+0x18/0x20 [ 39.033870] el0_svc+0x3c/0xf0 [ 39.033880] el0t_64_sync_handler+0xc0/0xc4 [ 39.033888] el0t_64_sync+0x154/0x158 [ 39.041274] ravb 11c30000.ethernet eth0: Link is Down Reported-by: Claudiu Beznea <claudiu.beznea.uj@bp.renesas.com> Closes: https://lore.kernel.org/netdev/4c6419d8-c06b-495c-b987-d66c2e1ff848@tuxon.dev/ Fixes: 0184165b2f42 ("ravb: add sleep PM suspend/resume support") Signed-off-by: Kory Maincent <kory.maincent@bootlin.com> Tested-by: Niklas Söderlund <niklas.soderlund+renesas@ragnatech.se> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-01-30bgmac: reduce max frame size to support just MTU 1500Rafał Miłecki1-2/+1
bgmac allocates new replacement buffer before handling each received frame. Allocating & DMA-preparing 9724 B each time consumes a lot of CPU time. Ideally bgmac should just respect currently set MTU but it isn't the case right now. For now just revert back to the old limited frame size. This change bumps NAT masquerade speed by ~95%. Since commit 8218f62c9c9b ("mm: page_frag: use initial zero offset for page_frag_alloc_align()"), the bgmac driver fails to open its network interface successfully and runs out of memory in the following call stack: bgmac_open -> bgmac_dma_init -> bgmac_dma_rx_skb_for_slot -> netdev_alloc_frag BGMAC_RX_ALLOC_SIZE = 10048 and PAGE_FRAG_CACHE_MAX_SIZE = 32768. Eventually we land into __page_frag_alloc_align() with the following parameters across multiple successive calls: __page_frag_alloc_align: fragsz=10048, align_mask=-1, size=32768, offset=0 __page_frag_alloc_align: fragsz=10048, align_mask=-1, size=32768, offset=10048 __page_frag_alloc_align: fragsz=10048, align_mask=-1, size=32768, offset=20096 __page_frag_alloc_align: fragsz=10048, align_mask=-1, size=32768, offset=30144 So in that case we do indeed have offset + fragsz (40192) > size (32768) and so we would eventually return NULL. Reverting to the older 1500 bytes MTU allows the network driver to be usable again. Fixes: 8c7da63978f1 ("bgmac: configure MTU and add support for frames beyond 8192 byte size") Signed-off-by: Rafał Miłecki <rafal@milecki.pl> [florian: expand commit message about recent commits] Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Florian Fainelli <florian.fainelli@broadcom.com> Link: https://patch.msgid.link/20250127175159.1788246-1-florian.fainelli@broadcom.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-28Merge tag 'driver-core-6.14-rc1' of ↵Linus Torvalds5-82/+29
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core Pull driver core and debugfs updates from Greg KH: "Here is the big set of driver core and debugfs updates for 6.14-rc1. Included in here is a bunch of driver core, PCI, OF, and platform rust bindings (all acked by the different subsystem maintainers), hence the merge conflict with the rust tree, and some driver core api updates to mark things as const, which will also require some fixups due to new stuff coming in through other trees in this merge window. There are also a bunch of debugfs updates from Al, and there is at least one user that does have a regression with these, but Al is working on tracking down the fix for it. In my use (and everyone else's linux-next use), it does not seem like a big issue at the moment. Here's a short list of the things in here: - driver core rust bindings for PCI, platform, OF, and some i/o functions. We are almost at the "write a real driver in rust" stage now, depending on what you want to do. - misc device rust bindings and a sample driver to show how to use them - debugfs cleanups in the fs as well as the users of the fs api for places where drivers got it wrong or were unnecessarily doing things in complex ways. - driver core const work, making more of the api take const * for different parameters to make the rust bindings easier overall. - other small fixes and updates All of these have been in linux-next with all of the aforementioned merge conflicts, and the one debugfs issue, which looks to be resolved "soon"" * tag 'driver-core-6.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (95 commits) rust: device: Use as_char_ptr() to avoid explicit cast rust: device: Replace CString with CStr in property_present() devcoredump: Constify 'struct bin_attribute' devcoredump: Define 'struct bin_attribute' through macro rust: device: Add property_present() saner replacement for debugfs_rename() orangefs-debugfs: don't mess with ->d_name octeontx2: don't mess with ->d_parent or ->d_parent->d_name arm_scmi: don't mess with ->d_parent->d_name slub: don't mess with ->d_name sof-client-ipc-flood-test: don't mess with ->d_name qat: don't mess with ->d_name xhci: don't mess with ->d_iname mtu3: don't mess wiht ->d_iname greybus/camera - stop messing with ->d_iname mediatek: stop messing with ->d_iname netdevsim: don't embed file_operations into your structs b43legacy: make use of debugfs_get_aux() b43: stop embedding struct file_operations into their objects carl9170: stop embedding file_operations into their objects ...
2025-01-28net: stmmac: Specify hardware capability value when FIFO size isn't specifiedKunihiko Hayashi1-17/+18
When Tx/Rx FIFO size is not specified in advance, the driver checks if the value is zero and sets the hardware capability value in functions where that value is used. Consolidate the check and settings into function stmmac_hw_init() and remove redundant other statements. If FIFO size is zero and the hardware capability also doesn't have upper limit values, return with an error message. Signed-off-by: Kunihiko Hayashi <hayashi.kunihiko@socionext.com> Reviewed-by: Yanteng Si <si.yanteng@linux.dev> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-01-28net: stmmac: Limit FIFO size by hardware capabilityKunihiko Hayashi1-0/+15
Tx/Rx FIFO size is specified by the parameter "{tx,rx}-fifo-depth" from stmmac_platform layer. However, these values are constrained by upper limits determined by the capabilities of each hardware feature. There is a risk that the upper bits will be truncated due to the calculation, so it's appropriate to limit them to the upper limit values and display a warning message. This only works if the hardware capability has the upper limit values. Fixes: e7877f52fd4a ("stmmac: Read tx-fifo-depth and rx-fifo-depth from the devicetree") Signed-off-by: Kunihiko Hayashi <hayashi.kunihiko@socionext.com> Reviewed-by: Yanteng Si <si.yanteng@linux.dev> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-01-28net: stmmac: Limit the number of MTL queues to hardware capabilityKunihiko Hayashi1-0/+15
The number of MTL queues to use is specified by the parameter "snps,{tx,rx}-queues-to-use" from stmmac_platform layer. However, the maximum numbers of queues are constrained by upper limits determined by the capability of each hardware feature. It's appropriate to limit the values not to exceed the upper limit values and display a warning message. This only works if the hardware capability has the upper limit values. Fixes: d976a525c371 ("net: stmmac: multiple queues dt configuration") Signed-off-by: Kunihiko Hayashi <hayashi.kunihiko@socionext.com> Reviewed-by: Yanteng Si <si.yanteng@linux.dev> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-01-28Merge branch '200GbE' of ↵Jakub Kicinski9-27/+59
git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue Tony Nguyen says: ==================== Intel Wired LAN Driver Updates 2025-01-24 (idpf, ice, iavf) For idpf: Emil adds memory barrier when accessing control queue descriptors and restores call to idpf_vc_xn_shutdown() when resetting. Manoj Vishwanathan expands transaction lock to properly protect xn->salt value and adds additional debugging information. Marco Leogrande converts workqueues to be unbound. For ice: Przemek fixes incorrect size use for array. Mateusz removes reporting of invalid parameter and value. For iavf: Michal adjusts some VLAN changes to occur without a PF call to avoid timing issues with the calls. * '200GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue: iavf: allow changing VLAN state without calling PF ice: remove invalid parameter of equalizer ice: fix ice_parser_rt::bst_key array size idpf: add more info during virtchnl transaction timeout/salt mismatch idpf: convert workqueues to unbound idpf: Acquire the lock before accessing the xn->salt idpf: fix transaction timeouts on reset idpf: add read memory barrier when checking descriptor done bit ==================== Link: https://patch.msgid.link/20250124213213.1328775-1-anthony.l.nguyen@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-28net: davicom: fix UAF in dm9000_drv_removeChenyuan Yang1-1/+2
dm is netdev private data and it cannot be used after free_netdev() call. Using dm after free_netdev() can cause UAF bug. Fix it by moving free_netdev() at the end of the function. This is similar to the issue fixed in commit ad297cd2db89 ("net: qcom/emac: fix UAF in emac_remove"). This bug is detected by our static analysis tool. Fixes: cf9e60aa69ae ("net: davicom: Fix regulator not turned off on driver removal") Signed-off-by: Chenyuan Yang <chenyuan0y@gmail.com> CC: Uwe Kleine-König <u.kleine-koenig@baylibre.com> Link: https://patch.msgid.link/20250123214213.623518-1-chenyuan0y@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-28eth: via-rhine: fix calling napi_enable() in atomic contextJakub Kicinski1-1/+10
napi_enable() may sleep now, take netdev_lock() before rp->lock. napi_enable() is hidden inside init_registers(). Note that this patch orders netdev_lock after rp->task_lock, to avoid having to take the netdev_lock() around disable path. Fixes: 413f0271f396 ("net: protect NAPI enablement with netdev_lock()") Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Link: https://lore.kernel.org/dcfd56bc-de32-4b11-9e19-d8bd1543745d@stanley.mountain Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250124031841.1179756-7-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-28eth: niu: fix calling napi_enable() in atomic contextJakub Kicinski1-1/+9
napi_enable() may sleep now, take netdev_lock() before np->lock. Fixes: 413f0271f396 ("net: protect NAPI enablement with netdev_lock()") Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Link: https://lore.kernel.org/dcfd56bc-de32-4b11-9e19-d8bd1543745d@stanley.mountain Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250124031841.1179756-6-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-28eth: 8139too: fix calling napi_enable() in atomic contextJakub Kicinski1-1/+3
napi_enable() may sleep now, take netdev_lock() before tp->lock and tp->rx_lock. Fixes: 413f0271f396 ("net: protect NAPI enablement with netdev_lock()") Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Link: https://lore.kernel.org/dcfd56bc-de32-4b11-9e19-d8bd1543745d@stanley.mountain Reviewed-by: Eric Dumazet <edumazet@google.com> Acked-by: Francois Romieu <romieu@fr.zoreil.com> Link: https://patch.msgid.link/20250124031841.1179756-5-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-28eth: forcedeth: fix calling napi_enable() in atomic contextJakub Kicinski1-1/+3
napi_enable() may sleep now, take netdev_lock() before np->lock. Fixes: 413f0271f396 ("net: protect NAPI enablement with netdev_lock()") Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Link: https://lore.kernel.org/dcfd56bc-de32-4b11-9e19-d8bd1543745d@stanley.mountain Acked-by: Zhu Yanjun <zyjzyj2000@gmail.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250124031841.1179756-4-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-28eth: forcedeth: remove local wrappers for napi enable/disableJakub Kicinski1-22/+8
The local helpers for calling napi_enable() and napi_disable() don't serve much purpose and they will complicate the fix in the subsequent patch. Remove them, call the core functions directly. Acked-by: Zhu Yanjun <zyjzyj2000@gmail.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250124031841.1179756-3-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-28eth: tg3: fix calling napi_enable() in atomic contextJakub Kicinski1-4/+31
tg3 has a spin lock protecting most of the config, switch to taking netdev_lock() explicitly on enable/start paths. Disable/stop paths seem to not be under the spin lock (since napi_disable() already needs to sleep), so leave that side as is. tg3_restart_hw() releases and re-takes the spin lock, we need to do the same because dev_close() needs to take netdev_lock(). Fixes: 413f0271f396 ("net: protect NAPI enablement with netdev_lock()") Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Link: https://lore.kernel.org/dcfd56bc-de32-4b11-9e19-d8bd1543745d@stanley.mountain Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Michael Chan <michael.chan@broadcom.com> Link: https://patch.msgid.link/20250124031841.1179756-2-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-28net/mlx5e: add missing cpu_to_node to kvzalloc_node in mlx5e_open_xdpredirect_sqStanislav Fomichev1-1/+1
kvzalloc_node is not doing a runtime check on the node argument (__alloc_pages_node_noprof does have a VM_BUG_ON, but it expands to nothing on !CONFIG_DEBUG_VM builds), so doing any ethtool/netlink operation that calls mlx5e_open on a CPU that's larger that MAX_NUMNODES triggers OOB access and panic (see the trace below). Add missing cpu_to_node call to convert cpu id to node id. [ 165.427394] mlx5_core 0000:5c:00.0 beth1: Link up [ 166.479327] BUG: unable to handle page fault for address: 0000000800000010 [ 166.494592] #PF: supervisor read access in kernel mode [ 166.505995] #PF: error_code(0x0000) - not-present page ... [ 166.816958] Call Trace: [ 166.822380] <TASK> [ 166.827034] ? __die_body+0x64/0xb0 [ 166.834774] ? page_fault_oops+0x2cd/0x3f0 [ 166.843862] ? exc_page_fault+0x63/0x130 [ 166.852564] ? asm_exc_page_fault+0x22/0x30 [ 166.861843] ? __kvmalloc_node_noprof+0x43/0xd0 [ 166.871897] ? get_partial_node+0x1c/0x320 [ 166.880983] ? deactivate_slab+0x269/0x2b0 [ 166.890069] ___slab_alloc+0x521/0xa90 [ 166.898389] ? __kvmalloc_node_noprof+0x43/0xd0 [ 166.908442] __kmalloc_node_noprof+0x216/0x3f0 [ 166.918302] ? __kvmalloc_node_noprof+0x43/0xd0 [ 166.928354] __kvmalloc_node_noprof+0x43/0xd0 [ 166.938021] mlx5e_open_channels+0x5e2/0xc00 [ 166.947496] mlx5e_open_locked+0x3e/0xf0 [ 166.956201] mlx5e_open+0x23/0x50 [ 166.963551] __dev_open+0x114/0x1c0 [ 166.971292] __dev_change_flags+0xa2/0x1b0 [ 166.980378] dev_change_flags+0x21/0x60 [ 166.988887] do_setlink+0x38d/0xf20 [ 166.996628] ? ep_poll_callback+0x1b9/0x240 [ 167.005910] ? __nla_validate_parse.llvm.10713395753544950386+0x80/0xd70 [ 167.020782] ? __wake_up_sync_key+0x52/0x80 [ 167.030066] ? __mutex_lock+0xff/0x550 [ 167.038382] ? security_capable+0x50/0x90 [ 167.047279] rtnl_setlink+0x1c9/0x210 [ 167.055403] ? ep_poll_callback+0x1b9/0x240 [ 167.064684] ? security_capable+0x50/0x90 [ 167.073579] rtnetlink_rcv_msg+0x2f9/0x310 [ 167.082667] ? rtnetlink_bind+0x30/0x30 [ 167.091173] netlink_rcv_skb+0xb1/0xe0 [ 167.099492] netlink_unicast+0x20f/0x2e0 [ 167.108191] netlink_sendmsg+0x389/0x420 [ 167.116896] __sys_sendto+0x158/0x1c0 [ 167.125024] __x64_sys_sendto+0x22/0x30 [ 167.133534] do_syscall_64+0x63/0x130 [ 167.141657] ? __irq_exit_rcu.llvm.17843942359718260576+0x52/0xd0 [ 167.155181] entry_SYSCALL_64_after_hwframe+0x4b/0x53 Fixes: bb135e40129d ("net/mlx5e: move XDP_REDIRECT sq to dynamic allocation") Signed-off-by: Stanislav Fomichev <sdf@fomichev.me> Reviewed-by: Joe Damato <jdamato@fastly.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20250123000407.3464715-1-sdf@fomichev.me Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-24Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdmaLinus Torvalds3-8/+40
Pull rdma updates from Jason Gunthorpe: "Lighter that normal, but the now usual collection of driver fixes and small improvements: - Small fixes and minor improvements to cxgb4, bnxt_re, rxe, srp, efa, cxgb4 - Update mlx4 to use the new umem APIs, avoiding direct use of scatterlist - Support ROCEv2 in erdma - Remove various uncalled functions, constify bin_attribute - Provide core infrastructure to catch netdev events and route them to drivers, consolidating duplicated driver code - Fix rare race condition crashes in mlx5 ODP flows" * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (63 commits) RDMA/mlx5: Fix implicit ODP use after free RDMA/mlx5: Fix a race for an ODP MR which leads to CQE with error RDMA/qib: Constify 'struct bin_attribute' RDMA/hfi1: Constify 'struct bin_attribute' RDMA/rxe: Fix the warning "__rxe_cleanup+0x12c/0x170 [rdma_rxe]" RDMA/cxgb4: Notify rdma stack for IB_EVENT_QP_LAST_WQE_REACHED event RDMA/bnxt_re: Allocate dev_attr information dynamically RDMA/bnxt_re: Pass the context for ulp_irq_stop RDMA/bnxt_re: Add support to handle DCB_CONFIG_CHANGE event RDMA/bnxt_re: Query firmware defaults of CC params during probe RDMA/bnxt_re: Add Async event handling support bnxt_en: Add ULP call to notify async events RDMA/mlx5: Fix indirect mkey ODP page count MAINTAINERS: Update the bnxt_re maintainers RDMA/hns: Clean up the legacy CONFIG_INFINIBAND_HNS RDMA/rtrs: Add missing deinit() call RDMA/efa: Align interrupt related fields to same type RDMA/bnxt_re: Fix to drop reference to the mmap entry in case of error RDMA/mlx5: Fix link status down event for MPV RDMA/erdma: Support create_ah/destroy_ah in non-sleepable contexts ...
2025-01-24iavf: allow changing VLAN state without calling PFMichal Swiatkowski1-2/+17
First case: > ip l a l $VF name vlanx type vlan id 100 > ip l d vlanx > ip l a l $VF name vlanx type vlan id 100 As workqueue can be execute after sometime, there is a window to have call trace like that: - iavf_del_vlan - iavf_add_vlan - iavf_del_vlans (wq) It means that our VLAN 100 will change the state from IAVF_VLAN_ACTIVE to IAVF_VLAN_REMOVE (iavf_del_vlan). After that in iavf_add_vlan state won't be changed because VLAN 100 is on the filter list. The final result is that the VLAN 100 filter isn't added in hardware (no iavf_add_vlans call). To fix that change the state if the filter wasn't removed yet directly to active. It is save as IAVF_VLAN_REMOVE means that virtchnl message wasn't sent yet. Second case: > ip l a l $VF name vlanx type vlan id 100 Any type of VF reset ex. change trust > ip l s $PF vf $VF_NUM trust on > ip l d vlanx > ip l a l $VF name vlanx type vlan id 100 In case of reset iavf driver is responsible for readding all filters that are being used. To do that all VLAN filters state are changed to IAVF_VLAN_ADD. Here is even longer window for changing VLAN state from kernel side, as workqueue isn't called immediately. We can have call trace like that: - changing to IAVF_VLAN_ADD (after reset) - iavf_del_vlan (called from kernel ops) - iavf_del_vlans (wq) Not exsisitng VLAN filters will be removed from hardware. It isn't a bug, ice driver will handle it fine. However, we can have call trace like that: - changing to IAVF_VLAN_ADD (after reset) - iavf_del_vlan (called from kernel ops) - iavf_add_vlan (called from kernel ops) - iavf_del_vlans (wq) With fix for previous case we end up with no VLAN filters in hardware. We have to remove VLAN filters if the state is IAVF_VLAN_ADD and delete VLAN was called. It is save as IAVF_VLAN_ADD means that virtchnl message wasn't sent yet. Fixes: 0c0da0e95105 ("iavf: refactor VLAN filter states") Signed-off-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Tested-by: Rafal Romanowski <rafal.romanowski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-01-24ice: remove invalid parameter of equalizerMateusz Polchlopek3-3/+0
It occurred that in the commit 70838938e89c ("ice: Implement driver functionality to dump serdes equalizer values") the invalid DRATE parameter for reading has been added. The output of the command: $ ethtool -d <ethX> returns the garbage value in the place where DRATE value should be stored. Remove mentioned parameter to prevent return of corrupted data to userspace. Fixes: 70838938e89c ("ice: Implement driver functionality to dump serdes equalizer values") Signed-off-by: Mateusz Polchlopek <mateusz.polchlopek@intel.com> Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Tested-by: Rinitha S <sx.rinitha@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-01-24ice: fix ice_parser_rt::bst_key array sizePrzemek Kitszel2-11/+7
Fix &ice_parser_rt::bst_key size. It was wrongly set to 10 instead of 20 in the initial impl commit (see Fixes tag). All usage code assumed it was of size 20. That was also the initial size present up to v2 of the intro series [2], but halved by v3 [3] refactor described as "Replace magic hardcoded values with macros." The introducing series was so big that some ugliness was unnoticed, same for bugs :/ ICE_BST_KEY_TCAM_SIZE and ICE_BST_TCAM_KEY_SIZE were differing by one. There was tmp variable @j in the scope of edited function, but was not used in all places. This ugliness is now gone. I'm moving ice_parser_rt::pg_prio a few positions up, to fill up one of the holes in order to compensate for the added 10 bytes to the ::bst_key, resulting in the same size of the whole as prior to the fix, and minimal changes in the offsets of the fields. Extend also the debug dump print of the key to cover all bytes. To not have string with 20 "%02x" and 20 params, switch to ice_debug_array_w_prefix(). This fix obsoletes Ahmed's attempt at [1]. [1] https://lore.kernel.org/intel-wired-lan/20240823230847.172295-1-ahmed.zaki@intel.com [2] https://lore.kernel.org/intel-wired-lan/20230605054641.2865142-13-junfeng.guo@intel.com [3] https://lore.kernel.org/intel-wired-lan/20230817093442.2576997-13-junfeng.guo@intel.com Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Closes: https://lore.kernel.org/intel-wired-lan/b1fb6ff9-b69e-4026-9988-3c783d86c2e0@stanley.mountain Fixes: 9a4c07aaa0f5 ("ice: add parser execution main loop") CC: Ahmed Zaki <ahmed.zaki@intel.com> Reviewed-by: Larysa Zaremba <larysa.zaremba@intel.com> Signed-off-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Tested-by: Rafal Romanowski <rafal.romanowski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-01-24idpf: add more info during virtchnl transaction timeout/salt mismatchManoj Vishwanathan1-4/+7
Add more information related to the transaction like cookie, vc_op, salt when transaction times out and include similar information when transaction salt does not match. Info output for transaction timeout: ------------------- (op:5015 cookie:45fe vc_op:5015 salt:45 timeout:60000ms) ------------------- before it was: ------------------- (op 5015, 60000ms) ------------------- Signed-off-by: Manoj Vishwanathan <manojvishy@google.com> Signed-off-by: Brian Vazquez <brianvv@google.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Reviewed-by: Pavan Kumar Linga <pavan.kumar.linga@intel.com> Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de> Tested-by: Krishneil Singh <krishneil.k.singh@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-01-24idpf: convert workqueues to unboundMarco Leogrande1-5/+10
When a workqueue is created with `WQ_UNBOUND`, its work items are served by special worker-pools, whose host workers are not bound to any specific CPU. In the default configuration (i.e. when `queue_delayed_work` and friends do not specify which CPU to run the work item on), `WQ_UNBOUND` allows the work item to be executed on any CPU in the same node of the CPU it was enqueued on. While this solution potentially sacrifices locality, it avoids contention with other processes that might dominate the CPU time of the processor the work item was scheduled on. This is not just a theoretical problem: in a particular scenario misconfigured process was hogging most of the time from CPU0, leaving less than 0.5% of its CPU time to the kworker. The IDPF workqueues that were using the kworker on CPU0 suffered large completion delays as a result, causing performance degradation, timeouts and eventual system crash. Tested: * I have also run a manual test to gauge the performance improvement. The test consists of an antagonist process (`./stress --cpu 2`) consuming as much of CPU 0 as possible. This process is run under `taskset 01` to bind it to CPU0, and its priority is changed with `chrt -pQ 9900 10000 ${pid}` and `renice -n -20 ${pid}` after start. Then, the IDPF driver is forced to prefer CPU0 by editing all calls to `queue_delayed_work`, `mod_delayed_work`, etc... to use CPU 0. Finally, `ktraces` for the workqueue events are collected. Without the current patch, the antagonist process can force arbitrary delays between `workqueue_queue_work` and `workqueue_execute_start`, that in my tests were as high as `30ms`. With the current patch applied, the workqueue can be migrated to another unloaded CPU in the same node, and, keeping everything else equal, the maximum delay I could see was `6us`. Fixes: 0fe45467a104 ("idpf: add create vport and netdev configuration") Signed-off-by: Marco Leogrande <leogrande@google.com> Signed-off-by: Manoj Vishwanathan <manojvishy@google.com> Signed-off-by: Brian Vazquez <brianvv@google.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Reviewed-by: Pavan Kumar Linga <pavan.kumar.linga@intel.com> Tested-by: Krishneil Singh <krishneil.k.singh@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-01-24idpf: Acquire the lock before accessing the xn->saltManoj Vishwanathan1-1/+2
The transaction salt was being accessed before acquiring the idpf_vc_xn_lock when idpf has to forward the virtchnl reply. Fixes: 34c21fa894a1 ("idpf: implement virtchnl transaction manager") Signed-off-by: Manoj Vishwanathan <manojvishy@google.com> Signed-off-by: David Decotigny <decot@google.com> Signed-off-by: Brian Vazquez <brianvv@google.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Reviewed-by: Pavan Kumar Linga <pavan.kumar.linga@intel.com> Tested-by: Krishneil Singh <krishneil.k.singh@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-01-24idpf: fix transaction timeouts on resetEmil Tantilov1-1/+10
Restore the call to idpf_vc_xn_shutdown() at the beginning of idpf_vc_core_deinit() provided the function is not called on remove. In the reset path the mailbox is destroyed, leading to all transactions timing out. Fixes: 09d0fb5cb30e ("idpf: deinit virtchnl transaction manager after vport and vectors") Reviewed-by: Larysa Zaremba <larysa.zaremba@intel.com> Signed-off-by: Emil Tantilov <emil.s.tantilov@intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Tested-by: Krishneil Singh <krishneil.k.singh@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-01-24idpf: add read memory barrier when checking descriptor done bitEmil Tantilov1-0/+6
Add read memory barrier to ensure the order of operations when accessing control queue descriptors. Specifically, we want to avoid cases where loads can be reordered: 1. Load #1 is dispatched to read descriptor flags. 2. Load #2 is dispatched to read some other field from the descriptor. 3. Load #2 completes, accessing memory/cache at a point in time when the DD flag is zero. 4. NIC DMA overwrites the descriptor, now the DD flag is one. 5. Any fields loaded before step 4 are now inconsistent with the actual descriptor state. Add read memory barrier between steps 1 and 2, so that load #2 is not executed until load #1 has completed. Fixes: 8077c727561a ("idpf: add controlq init and reset checks") Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Reviewed-by: Sridhar Samudrala <sridhar.samudrala@intel.com> Suggested-by: Lance Richardson <rlance@google.com> Signed-off-by: Emil Tantilov <emil.s.tantilov@intel.com> Tested-by: Krishneil Singh <krishneil.k.singh@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-01-23net: mvneta: fix locking in mvneta_cpu_online()Harshit Mogalapalli1-0/+1
When port is stopped, unlock before returning Fixes: 413f0271f396 ("net: protect NAPI enablement with netdev_lock()") Signed-off-by: Harshit Mogalapalli <harshit.m.mogalapalli@oracle.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250121005002.3938236-1-harshit.m.mogalapalli@oracle.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>