summaryrefslogtreecommitdiff
AgeCommit message (Collapse)AuthorFilesLines
2025-07-17Merge branch 's390-bpf-fix-bpf_arch_text_poke-with-new_addr-null-again'Alexei Starovoitov2-1/+76
Ilya Leoshkevich says: ==================== s390/bpf: Fix bpf_arch_text_poke() with new_addr == NULL again This series fixes a regression causing perf on s390 to trigger a kernel panic. Patch 1 fixes the issue, patch 2 adds a test to make sure this doesn't happen again. ==================== Link: https://patch.msgid.link/20250716194524.48109-1-iii@linux.ibm.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-07-17selftests/bpf: Stress test attaching a BPF prog to another BPF progIlya Leoshkevich1-0/+67
Add a test that invokes a BPF prog in a loop, while concurrently attaching and detaching another BPF prog to and from it. This helps identifying race conditions in bpf_arch_text_poke(). Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com> Link: https://lore.kernel.org/r/20250716194524.48109-3-iii@linux.ibm.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-07-17s390/bpf: Fix bpf_arch_text_poke() with new_addr == NULL againIlya Leoshkevich1-1/+9
Commit 7ded842b356d ("s390/bpf: Fix bpf_plt pointer arithmetic") has accidentally removed the critical piece of commit c730fce7c70c ("s390/bpf: Fix bpf_arch_text_poke() with new_addr == NULL"), causing intermittent kernel panics in e.g. perf's on_switch() prog to reappear. Restore the fix and add a comment. Fixes: 7ded842b356d ("s390/bpf: Fix bpf_plt pointer arithmetic") Cc: stable@vger.kernel.org Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com> Link: https://lore.kernel.org/r/20250716194524.48109-2-iii@linux.ibm.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-07-17bpf: Update iterators.lskel-big-endian.hIlya Leoshkevich1-237/+255
The last iterators update (commit 515ee52b2224 ("bpf: make preloaded map iterators to display map elements count")) missed the big-endian skeleton. Update it by running "make big" with Debian clang version 21.0.0 (++20250706105601+01c97b4953e8-1~exp1~20250706225612.1558). Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com> Link: https://lore.kernel.org/r/20250710100907.45880-1-iii@linux.ibm.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-07-17Merge branch 'bpf-arm64-relax-constraint-in-bpf-jit-compiler'Alexei Starovoitov2-6/+0
Alexis Lothoré (eBPF Foundation) says: ==================== this series follows up on the one introducing 9+ args for tracing programs [1]. It has been observed with this series that there are cases for which we can not identify accurately the location of the target function arguments to prepare correctly the corresponding BPF trampoline. This is the case for example if: - the function consumes a struct variable _by value_ - it is passed on the stack (no more register available for it) - it has some __packed__ or __aligned(X)__ attribute As a consequence, a small restrictive check has been added to the ARM64 side, highlighting that other arch supporting 9+ args in BPF trampolines are already suffering from the same issue. After a bit of discussions and attempts, the chosen solution is, rather than applying the same constraint to all JIT compilers, to prevent such function from being encoded at all in BTF info([2]). As the pahole side is closed to be integrated, we can now remove the restrictive check from kernel side. [1] https://lore.kernel.org/bpf/20250527-many_args_arm64-v3-0-3faf7bb8e4a2@bootlin.com/ [2] https://lore.kernel.org/bpf/20250707-btf_skip_structs_on_stack-v3-0-29569e086c12@bootlin.com/ Signed-off-by: Alexis Lothoré (eBPF Foundation) <alexis.lothore@bootlin.com> --- Alexis Lothoré (eBPF Foundation) (2): bpf, arm64: remove structs on stack constraint selftests/bpf: enable tracing_struct tests for arm64 arch/arm64/net/bpf_jit_comp.c | 5 ----- tools/testing/selftests/bpf/DENYLIST.aarch64 | 1 - 2 files changed, 6 deletions(-) --- base-commit: 8da1e37fc84868b50ba6a7cdf082aa3b0d11e006 change-id: 20250708-arm64_relax_jit_comp-e8889647d8d2 Best regards, ==================== Link: https://patch.msgid.link/20250709-arm64_relax_jit_comp-v1-0-3850fe189092@bootlin.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-07-17selftests/bpf: enable tracing_struct tests for arm64Alexis Lothoré (eBPF Foundation)1-1/+0
Now that the constraint preventing attachment to functions consuming struct on stack has been removed from the kernel (and moved to pahole, with a slightly smarter detection, to prevent only those that are packed), re-enable the tracing_struct tests for arm64. Signed-off-by: Alexis Lothoré (eBPF Foundation) <alexis.lothore@bootlin.com> Link: https://lore.kernel.org/r/20250709-arm64_relax_jit_comp-v1-2-3850fe189092@bootlin.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-07-17bpf, arm64: remove structs on stack constraintAlexis Lothoré (eBPF Foundation)1-5/+0
While introducing support for 9+ arguments for tracing programs on ARM64, commit 9014cf56f13d ("bpf, arm64: Support up to 12 function arguments") has also introduced a constraint preventing BPF trampolines from being generated if the target function consumes a struct argument passed on stack, because of uncertainties around the exact struct location: if the struct has been marked as packed or with a custom alignment, this info is not reflected in BTF data, and so generated tracing trampolines could read the target function arguments at wrong offsets. This issue is not specific to ARM64: there has been an attempt (see [1]) to bring the same constraint to other architectures JIT compilers. But discussions following this attempt led to the move of this constraint out of the kernel (see [2]): instead of preventing the kernel from generating trampolines for those functions consuming structs on stack, it is simpler to just make sure that those functions with uncertain struct arguments location are not encoded in BTF information, and so that one can not even attempt to attach a tracing program to such function. The task is then deferred to pahole (see [3]). Now that the constraint is handled by pahole, remove it from the arm64 JIT compiler to keep it simple. [1] https://lore.kernel.org/bpf/20250613-deny_trampoline_structs_on_stack-v1-0-5be9211768c3@bootlin.com/ [2] https://lore.kernel.org/bpf/CAADnVQ+sj9XhscN9PdmTzjVa7Eif21noAUH3y1K6x5bWcL-5pg@mail.gmail.com/ [3] https://lore.kernel.org/bpf/20250707-btf_skip_structs_on_stack-v3-0-29569e086c12@bootlin.com/ Signed-off-by: Alexis Lothoré (eBPF Foundation) <alexis.lothore@bootlin.com> Link: https://lore.kernel.org/r/20250709-arm64_relax_jit_comp-v1-1-3850fe189092@bootlin.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-07-17sched/ext: Prevent update_locked_rq() calls with NULL rqBreno Leitao1-4/+8
Avoid invoking update_locked_rq() when the runqueue (rq) pointer is NULL in the SCX_CALL_OP and SCX_CALL_OP_RET macros. Previously, calling update_locked_rq(NULL) with preemption enabled could trigger the following warning: BUG: using __this_cpu_write() in preemptible [00000000] This happens because __this_cpu_write() is unsafe to use in preemptible context. rq is NULL when an ops invoked from an unlocked context. In such cases, we don't need to store any rq, since the value should already be NULL (unlocked). Ensure that update_locked_rq() is only called when rq is non-NULL, preventing calling __this_cpu_write() on preemptible context. Suggested-by: Peter Zijlstra <peterz@infradead.org> Fixes: 18853ba782bef ("sched_ext: Track currently locked rq") Signed-off-by: Breno Leitao <leitao@debian.org> Acked-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org> Cc: stable@vger.kernel.org # v6.15
2025-07-17Merge branch 'net-mlx5e-add-support-for-pcie-congestion-events'Jakub Kicinski10-1/+381
Tariq Toukan says: ==================== net/mlx5e: Add support for PCIe congestion events Dragos says: PCIe congestion events are events generated by the firmware when the device side has sustained PCIe inbound or outbound traffic above certain thresholds. The high and low threshold are hysteresis thresholds to prevent flapping: once the high threshold has been reached, a low threshold event will be triggered only after the bandwidth usage went below the low threshold. This series adds support for receiving and exposing such events as ethtool counters. 2 new pairs of counters are exposed: pci_bw_in/outbound_high/low. These should help the user understand if the device PCI is under pressure. Planned followup patches: - Allow configuration of thresholds through devlink. - Add ethtool counter for wakeups which did not result in any state change. ==================== Link: https://patch.msgid.link/1752589821-145787-1-git-send-email-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-17net/mlx5e: Add device PCIe congestion ethtool statsDragos Tatulea5-0/+212
Implement the PCIe Congestion Event notifier which triggers a work item to query the PCIe Congestion Event object. The result of the congestion state is reflected in the new ethtool stats: * pci_bw_inbound_high: the device has crossed the high threshold for inbound PCIe traffic. * pci_bw_inbound_low: the device has crossed the low threshold for inbound PCIe traffic * pci_bw_outbound_high: the device has crossed the high threshold for outbound PCIe traffic. * pci_bw_outbound_low: the device has crossed the low threshold for outbound PCIe traffic The high and low thresholds are currently configured at 90% and 75%. These are hysteresis thresholds which help to check if the PCI bus on the device side is in a congested state. If low + 1 = high then the device is in a congested state. If low == high then the device is not in a congested state. The counters are also documented. A follow-up patch will make the thresholds configurable. Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/1752589821-145787-3-git-send-email-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-17net/mlx5e: Create/destroy PCIe Congestion Event objectDragos Tatulea6-1/+169
Add initial infrastructure to create and destroy the PCIe Congestion Event object if the object is supported. The verb for the object creation function is "set" instead of "create" because the function will accommodate the modify operation as well in a subsequent patch. The next patches will hook it up to the event handler and will add actual functionality. Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/1752589821-145787-2-git-send-email-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-17s390/net: Remove NETIUCV device driverNagamani PV4-2117/+0
The netiucv driver creates TCP/IP interfaces over IUCV between Linux guests on z/VM and other z/VM entities. Rationale for removal: - NETIUCV connections are only supported for compatibility with earlier versions and not to be used for new network setups, since at least Linux kernel 4.0. - No known active users, use cases, or product dependencies - The driver is no longer relevant for z/VM networking; preferred methods include: * Device pass-through (e.g., OSA, RoCE) * z/VM Virtual Switch (VSWITCH) The IUCV mechanism itself remains supported and is actively used via AF_IUCV, hvc_iucv, and smsg_iucv. Signed-off-by: Nagamani PV <nagamani@linux.ibm.com> Reviewed-by: Alexandra Winter <wintera@linux.ibm.com> Signed-off-by: Alexandra Winter <wintera@linux.ibm.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250715074210.3999296-1-wintera@linux.ibm.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-17Merge branch 'expose-refclk-for-rmii-and-enable-rmii'Jakub Kicinski2-2/+16
Ryan Wanner says: ==================== Expose REFCLK for RMII and enable RMII This set allows the REFCLK property to be exposed as a dt-property to properly reflect the correct RMII layout. RMII can take an external or internal provided REFCLK, since this is not SoC dependent but board dependent this must be exposed as a DT property for the macb driver. This set also enables RMII mode for the SAMA7 SoCs gigabit mac. v1: https://lore.kernel.org/cover.1750346271.git.Ryan.Wanner@microchip.com ==================== Link: https://patch.msgid.link/cover.1752510727.git.Ryan.Wanner@microchip.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-17net: cadence: macb: sama7g5_emac: Remove USARIO CLKEN flagRyan Wanner1-2/+1
Remove USARIO_CLKEN flag since this is now a device tree argument and not fixed to the SoC. This will instead be selected by the "cdns,refclk-ext" device tree property. Signed-off-by: Ryan Wanner <Ryan.Wanner@microchip.com> Link: https://patch.msgid.link/1e7a8c324526f631f279925aa8a6aa937d55c796.1752510727.git.Ryan.Wanner@microchip.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-17net: cadence: macb: Enable RMII for SAMA7 gemRyan Wanner1-0/+1
This macro enables the RMII mode bit in the USRIO register when RMII mode is requested. Signed-off-by: Ryan Wanner <Ryan.Wanner@microchip.com> Link: https://patch.msgid.link/6698836e4ee7df5f6bee181f0d2e38d4b8e4cec2.1752510727.git.Ryan.Wanner@microchip.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-17net: cadence: macb: Expose REFCLK as a device tree propertyRyan Wanner1-0/+7
The RMII and RGMII can both support internal or external provided REFCLKs 50MHz and 125MHz respectively. Since this is dependent on the board that the SoC is on this needs to be set via the device tree. This property flag is checked in the MACB DT node so the REFCLK cap is configured the correct way for the RMII or RGMII is configured on the board. Signed-off-by: Ryan Wanner <Ryan.Wanner@microchip.com> Link: https://patch.msgid.link/7f9b65896d6b7b48275bc527b72a16347f8ce10a.1752510727.git.Ryan.Wanner@microchip.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-17dt-bindings: net: cdns,macb: Add external REFCLK propertyRyan Wanner1-0/+7
REFCLK can be provided by an external source so this should be exposed by a DT property. The REFCLK is used for RMII and in some SoCs that use this driver the RGMII 125MHz clk can also be provided by an external source. Signed-off-by: Ryan Wanner <Ryan.Wanner@microchip.com> Acked-by: Conor Dooley <conor.dooley@microchip.com> Link: https://patch.msgid.link/d558467c4d5b27fb3135ffdead800b14cd9c6c0a.1752510727.git.Ryan.Wanner@microchip.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-17Merge branch 'selftest-net-add-selftest-for-netpoll'Jakub Kicinski4-2/+434
Breno Leitao says: ==================== selftest: net: Add selftest for netpoll I am submitting a new selftest for the netpoll subsystem specifically targeting the case where the RX is polling in the TX path, which is a case that we don't have any test in the tree today. This is done when netpoll_poll_dev() called, and this test creates a scenario when that is probably. The test does the following: 1) Configuring a single RX/TX queue to increase contention on the interface. 2) Generating background traffic to saturate the network, mimicking real-world congestion. 3) Sending netconsole messages to trigger netpoll polling and monitor its behavior. 4) Using dynamic netconsole targets via configfs, with the ability to delete and recreate targets during the test. 5) Running bpftrace in parallel to verify that netpoll_poll_dev() is called when expected. If it is called, then the test passes, otherwise the test is marked as skipped. In order to achieve it, I stole Jakub's bpftrace helper from [1], and did some small changes that I found useful to use the helper. So, this patchset basically contains: 1) The code stolen from Jakub 2) Improvements on bpftrace() helper 3) The selftest itself Link: https://lore.kernel.org/all/20250421222827.283737-22-kuba@kernel.org/ [1] ==================== Link: https://patch.msgid.link/20250714-netpoll_test-v7-0-c0220cfaa63e@debian.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-17selftests: net: add netpoll basic functionality testBreno Leitao2-0/+397
Add a basic selftest for the netpoll polling mechanism, specifically targeting the netpoll poll() side. The test creates a scenario where network transmission is running at maximum speed, and netpoll needs to poll the NIC. This is achieved by: 1. Configuring a single RX/TX queue to create contention 2. Generating background traffic to saturate the interface 3. Sending netconsole messages to trigger netpoll polling 4. Using dynamic netconsole targets via configfs 5. Delete and create new netconsole targets after some messages 6. Start a bpftrace in parallel to make sure netpoll_poll_dev() is called 7. If bpftrace exists and netpoll_poll_dev() was called, stop. The test validates a critical netpoll code path by monitoring traffic flow and ensuring netpoll_poll_dev() is called when the normal TX path is blocked. This addresses a gap in netpoll test coverage for a path that is tricky for the network stack. Signed-off-by: Breno Leitao <leitao@debian.org> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20250714-netpoll_test-v7-3-c0220cfaa63e@debian.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-17selftests: drv-net: Strip '@' prefix from bpftrace map keysBreno Leitao1-0/+2
The '@' prefix in bpftrace map keys is specific to bpftrace and can be safely removed when processing results. This patch modifies the bpftrace utility to strip the '@' from map keys before storing them in the result dictionary, making the keys more consistent with Python conventions. Signed-off-by: Breno Leitao <leitao@debian.org> Link: https://patch.msgid.link/20250714-netpoll_test-v7-2-c0220cfaa63e@debian.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-17selftests: drv-net: add helper/wrapper for bpftraceJakub Kicinski2-2/+35
bpftrace is very useful for low level driver testing. perf or trace-cmd would also do for collecting data from tracepoints, but they require much more post-processing. Add a wrapper for running bpftrace and sanitizing its output. bpftrace has JSON output, which is great, but it prints loose objects and in a slightly inconvenient format. We have to read the objects line by line, and while at it return them indexed by the map name. Reviewed-by: Breno Leitao <leitao@debian.org> Signed-off-by: Breno Leitao <leitao@debian.org> Link: https://patch.msgid.link/20250714-netpoll_test-v7-1-c0220cfaa63e@debian.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-17doc: xdp: Clarify driver implementation for XDP Rx metadataSong Yoong Siang1-0/+33
Clarify that drivers must remove device-reserved metadata from the data_meta area before passing frames to XDP programs. Additionally, expand the explanation of how userspace and BPF programs should coordinate the use of METADATA_SIZE, and add a detailed diagram to illustrate pointer adjustments and metadata layout. Also describe the requirements and constraints enforced by bpf_xdp_adjust_meta(). Signed-off-by: Song Yoong Siang <yoong.siang.song@intel.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Acked-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://lore.kernel.org/r/20250716154846.3513575-1-yoong.siang.song@intel.com
2025-07-17Merge branch '100GbE' of ↵Jakub Kicinski5-5/+8
git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue Tony Nguyen says: ==================== Intel Wired LAN Driver Updates 2025-07-15 (ixgbe, fm10k, i40e, ice) Arnd Bergmann resolves compile issues with large NR_CPUS for ixgbe, fm10k, and i40e. For ice: Dave adds a NULL check for LAG netdev. Michal corrects a pointer check in debugfs initialization. * '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue: ice: check correct pointer in fwlog debugfs ice: add NULL check in eswitch lag check ethernet: intel: fix building with large NR_CPUS ==================== Link: https://patch.msgid.link/20250715202948.3841437-1-anthony.l.nguyen@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-17net: airoha: fix potential use-after-free in airoha_npu_get()Alok Tiwari1-1/+2
np->name was being used after calling of_node_put(np), which releases the node and can lead to a use-after-free bug. Previously, of_node_put(np) was called unconditionally after of_find_device_by_node(np), which could result in a use-after-free if pdev is NULL. This patch moves of_node_put(np) after the error check to ensure the node is only released after both the error and success cases are handled appropriately, preventing potential resource issues. Fixes: 23290c7bc190 ("net: airoha: Introduce Airoha NPU support") Signed-off-by: Alok Tiwari <alok.a.tiwari@oracle.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Link: https://patch.msgid.link/20250715143102.3458286-1-alok.a.tiwari@oracle.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-17ipv6: mcast: Simplify mld_clear_{report|query}()Yue Haibing1-8/+2
Use __skb_queue_purge() instead of re-implementing it. Note that it uses kfree_skb_reason() instead of kfree_skb() internally, and pass SKB_DROP_REASON_QUEUE_PURGE drop reason to the kfree_skb tracepoint. Signed-off-by: Yue Haibing <yuehaibing@huawei.com> Reviewed-by: Hangbin Liu <liuhangbin@gmail.com> Link: https://patch.msgid.link/20250715120709.3941510-1-yuehaibing@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-17vsock/test: fix vsock_ioctl_int() check for unsupported ioctlStefano Garzarella1-1/+1
`vsock_do_ioctl` returns -ENOIOCTLCMD if an ioctl support is not implemented, like for SIOCINQ before commit f7c722659275 ("vsock: Add support for SIOCINQ ioctl"). In net/socket.c, -ENOIOCTLCMD is re-mapped to -ENOTTY for the user space. So, our test suite, without that commit applied, is failing in this way: 34 - SOCK_STREAM ioctl(SIOCINQ) functionality...ioctl(21531): Inappropriate ioctl for device Return false in vsock_ioctl_int() to skip the test in this case as well, instead of failing. Fixes: 53548d6bffac ("test/vsock: Add retry mechanism to ioctl wrapper") Cc: niuxuewei.nxw@antgroup.com Signed-off-by: Stefano Garzarella <sgarzare@redhat.com> Reviewed-by: Xuewei Niu <niuxuewei.nxw@antgroup.com> Link: https://patch.msgid.link/20250715093233.94108-1-sgarzare@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-17tcp: fix UaF in tcp_prune_ofo_queue()Paolo Abeni1-1/+1
The CI reported a UaF in tcp_prune_ofo_queue(): BUG: KASAN: slab-use-after-free in tcp_prune_ofo_queue+0x55d/0x660 Read of size 4 at addr ffff8880134729d8 by task socat/20348 CPU: 0 UID: 0 PID: 20348 Comm: socat Not tainted 6.16.0-rc5-virtme #1 PREEMPT(full) Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 Call Trace: <TASK> dump_stack_lvl+0x82/0xd0 print_address_description.constprop.0+0x2c/0x400 print_report+0xb4/0x270 kasan_report+0xca/0x100 tcp_prune_ofo_queue+0x55d/0x660 tcp_try_rmem_schedule+0x855/0x12e0 tcp_data_queue+0x4dd/0x2260 tcp_rcv_established+0x5e8/0x2370 tcp_v4_do_rcv+0x4ba/0x8c0 __release_sock+0x27a/0x390 release_sock+0x53/0x1d0 tcp_sendmsg+0x37/0x50 sock_write_iter+0x3c1/0x520 vfs_write+0xc09/0x1210 ksys_write+0x183/0x1d0 do_syscall_64+0xc1/0x380 entry_SYSCALL_64_after_hwframe+0x77/0x7f RIP: 0033:0x7fcf73ef2337 Code: 0f 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24 RSP: 002b:00007ffd4f924708 EFLAGS: 00000246 ORIG_RAX: 0000000000000001 RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fcf73ef2337 RDX: 0000000000002000 RSI: 0000555f11d1a000 RDI: 0000000000000008 RBP: 0000555f11d1a000 R08: 0000000000002000 R09: 0000000000000000 R10: 0000000000000040 R11: 0000000000000246 R12: 0000000000000008 R13: 0000000000002000 R14: 0000555ee1a44570 R15: 0000000000002000 </TASK> Allocated by task 20348: kasan_save_stack+0x24/0x50 kasan_save_track+0x14/0x30 __kasan_slab_alloc+0x59/0x70 kmem_cache_alloc_node_noprof+0x110/0x340 __alloc_skb+0x213/0x2e0 tcp_collapse+0x43f/0xff0 tcp_try_rmem_schedule+0x6b9/0x12e0 tcp_data_queue+0x4dd/0x2260 tcp_rcv_established+0x5e8/0x2370 tcp_v4_do_rcv+0x4ba/0x8c0 __release_sock+0x27a/0x390 release_sock+0x53/0x1d0 tcp_sendmsg+0x37/0x50 sock_write_iter+0x3c1/0x520 vfs_write+0xc09/0x1210 ksys_write+0x183/0x1d0 do_syscall_64+0xc1/0x380 entry_SYSCALL_64_after_hwframe+0x77/0x7f Freed by task 20348: kasan_save_stack+0x24/0x50 kasan_save_track+0x14/0x30 kasan_save_free_info+0x3b/0x60 __kasan_slab_free+0x38/0x50 kmem_cache_free+0x149/0x330 tcp_prune_ofo_queue+0x211/0x660 tcp_try_rmem_schedule+0x855/0x12e0 tcp_data_queue+0x4dd/0x2260 tcp_rcv_established+0x5e8/0x2370 tcp_v4_do_rcv+0x4ba/0x8c0 __release_sock+0x27a/0x390 release_sock+0x53/0x1d0 tcp_sendmsg+0x37/0x50 sock_write_iter+0x3c1/0x520 vfs_write+0xc09/0x1210 ksys_write+0x183/0x1d0 do_syscall_64+0xc1/0x380 entry_SYSCALL_64_after_hwframe+0x77/0x7f The buggy address belongs to the object at ffff888013472900 which belongs to the cache skbuff_head_cache of size 232 The buggy address is located 216 bytes inside of freed 232-byte region [ffff888013472900, ffff8880134729e8) The buggy address belongs to the physical page: page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x13472 head: order:1 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 pincount:0 flags: 0x80000000000040(head|node=0|zone=1) page_type: f5(slab) raw: 0080000000000040 ffff88800198fb40 ffffea0000347b10 ffffea00004f5290 raw: 0000000000000000 0000000000120012 00000000f5000000 0000000000000000 head: 0080000000000040 ffff88800198fb40 ffffea0000347b10 ffffea00004f5290 head: 0000000000000000 0000000000120012 00000000f5000000 0000000000000000 head: 0080000000000001 ffffea00004d1c81 00000000ffffffff 00000000ffffffff head: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000 page dumped because: kasan: bad access detected Memory state around the buggy address: ffff888013472880: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc ffff888013472900: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb >ffff888013472980: fb fb fb fb fb fb fb fb fb fb fb fb fb fc fc fc ^ ffff888013472a00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc ffff888013472a80: fc fc fc fc fc fc fc fc fa fb fb fb fb fb fb fb Indeed tcp_prune_ofo_queue() is reusing the skb dropped a few lines above. The caller wants to enqueue 'in_skb', lets check space vs the latter. Fixes: 1d2fbaad7cd8 ("tcp: stronger sk_rcvbuf checks") Signed-off-by: Paolo Abeni <pabeni@redhat.com> Tested-by: syzbot+865aca08c0533171bf6a@syzkaller.appspotmail.com Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/b78d2d9bdccca29021eed9a0e7097dd8dc00f485.1752567053.git.pabeni@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-17rust: time: Pass correct timer mode ID to hrtimer_start_range_nsLyude Paul1-1/+1
While rebasing rvkms I noticed that timers I was setting seemed to have pretty random timer values that amounted slightly over 2x the time value I set each time. After a lot of debugging, I finally managed to figure out why: it seems that since we moved to Instant and Delta, we mistakenly began passing the clocksource ID to hrtimer_start_range_ns, when we should be passing the timer mode instead. Presumably, this works fine for simple relative timers - but immediately breaks on other types of timers. So, fix this by passing the ID for the timer mode instead. Signed-off-by: Lyude Paul <lyude@redhat.com> Acked-by: Andreas Hindborg <a.hindborg@kernel.org> Reviewed-by: FUJITA Tomonori <fujita.tomonori@gmail.com> Fixes: e0c0ab04f678 ("rust: time: Make HasHrTimer generic over HrTimerMode") Link: https://lore.kernel.org/r/20250710225129.670051-1-lyude@redhat.com [ Removed cast, applied `rustfmt`, fixed `Fixes:` tag. - Miguel ] Signed-off-by: Miguel Ojeda <ojeda@kernel.org>
2025-07-17io_uring/zcrx: account area memoryPavel Begunkov2-0/+28
zcrx areas can be quite large and need to be accounted and checked against RLIMIT_MEMLOCK. In practise it shouldn't be a big issue as the inteface already requires cap_net_admin. Cc: stable@vger.kernel.org Fixes: cf96310c5f9a0 ("io_uring/zcrx: add io_zcrx_area") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/4b53f0c575bd062f63d12bec6cac98037fc66aeb.1752699568.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-07-17io_uring: export io_[un]account_memPavel Begunkov2-2/+4
Export pinned memory accounting helpers, they'll be used by zcrx shortly. Cc: stable@vger.kernel.org Fixes: cf96310c5f9a0 ("io_uring/zcrx: add io_zcrx_area") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/9a61e54bd89289b39570ae02fe620e12487439e4.1752699568.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-07-17net/mlx5: Correctly set gso_size when LRO is usedChristoph Paasch1-4/+8
gso_size is expected by the networking stack to be the size of the payload (thus, not including ethernet/IP/TCP-headers). However, cqe_bcnt is the full sized frame (including the headers). Dividing cqe_bcnt by lro_num_seg will then give incorrect results. For example, running a bpftrace higher up in the TCP-stack (tcp_event_data_recv), we commonly have gso_size set to 1450 or 1451 even though in reality the payload was only 1448 bytes. This can have unintended consequences: - In tcp_measure_rcv_mss() len will be for example 1450, but. rcv_mss will be 1448 (because tp->advmss is 1448). Thus, we will always recompute scaling_ratio each time an LRO-packet is received. - In tcp_gro_receive(), it will interfere with the decision whether or not to flush and thus potentially result in less gro'ed packets. So, we need to discount the protocol headers from cqe_bcnt so we can actually divide the payload by lro_num_seg to get the real gso_size. v2: - Use "(unsigned char *)tcp + tcp->doff * 4 - skb->data)" to compute header-len (Tariq Toukan <tariqt@nvidia.com>) - Improve commit-message (Gal Pressman <gal@nvidia.com>) Fixes: e586b3b0baee ("net/mlx5: Ethernet Datapath files") Signed-off-by: Christoph Paasch <cpaasch@openai.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Gal Pressman <gal@nvidia.com> Link: https://patch.msgid.link/20250715-cpaasch-pf-925-investigate-incorrect-gso_size-on-cx-7-nic-v2-1-e06c3475f3ac@openai.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-17selftests: packetdrill: correct the expected timing in tcp_rcv_big_endseqJakub Kicinski1-1/+1
Commit f5fda1a86884 ("selftests/net: packetdrill: add tcp_rcv_big_endseq.pkt") added this test recently, but it's failing with: # tcp_rcv_big_endseq.pkt:41: error handling packet: timing error: expected outbound packet at 1.230105 sec but happened at 1.190101 sec; tolerance 0.005046 sec # script packet: 1.230105 . 1:1(0) ack 54001 win 0 # actual packet: 1.190101 . 1:1(0) ack 54001 win 0 It's unclear why the test expects the ack to be delayed. Correct it. Link: https://patch.msgid.link/20250715142849.959444-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-17ethtool: Don't check for RXFH fields conflict when no input_xfrm is requestedGal Pressman2-1/+6
The requirement of ->get_rxfh_fields() in ethtool_set_rxfh() is there to verify that we have no conflict of input_xfrm with the RSS fields options, there is no point in doing it if input_xfrm is not supported/requested. This is under the assumption that a driver that supports input_xfrm will also support ->get_rxfh_fields(), so add a WARN_ON() to ethtool_check_ops() to verify it, and remove the op NULL check. This fixes the following error in mlx4_en, which doesn't support getting/setting RXFH fields. $ ethtool --set-rxfh-indir eth2 hfunc xor Cannot set RX flow hash configuration: Operation not supported Fixes: 72792461c8e8 ("net: ethtool: don't mux RXFH via rxnfc callbacks") Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Gal Pressman <gal@nvidia.com> Link: https://patch.msgid.link/20250715140754.489677-1-gal@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-17selftests: rtnetlink: fix addrlft test flakiness on power-saving systemsHangbin Liu1-3/+13
Jakub reported that the rtnetlink test for the preferred lifetime of an address has become quite flaky. The issue started appearing around the 6.16 merge window in May, and the test fails with: FAIL: preferred_lft addresses remaining The flakiness might be related to power-saving behavior, as address expiration is handled by a "power-efficient" workqueue. To address this, use slowwait to check more frequently whether the address still exists. This reduces the likelihood of the system entering a low-power state during the test, improving reliability. Reported-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> Link: https://patch.msgid.link/20250715043459.110523-1-liuhangbin@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-17Merge tag 'rust-timekeeping-for-v6.17' of ↵Miguel Ojeda9-186/+466
https://github.com/Rust-for-Linux/linux into rust-next Pull timekeeping updates from Andreas Hindborg: - Make 'Instant' generic over clock source. This allows the compiler to assert that arithmetic expressions involving the 'Instant' use 'Instants' based on the same clock source. - Make 'HrTimer' generic over the timer mode. 'HrTimer' timers take a 'Duration' or an 'Instant' when setting the expiry time, depending on the timer mode. With this change, the compiler can check the type matches the timer mode. - Add an abstraction for 'fsleep'. 'fsleep' is a flexible sleep function that will select an appropriate sleep method depending on the requested sleep time. - Avoid 64-bit divisions on 32-bit hardware when calculating timestamps. - Seal the 'HrTimerMode' trait. This prevents users of the 'HrTimerMode' from implementing the trait on their own types. * tag 'rust-timekeeping-for-v6.17' of https://github.com/Rust-for-Linux/linux: rust: time: Add wrapper for fsleep() function rust: time: Seal the HrTimerMode trait rust: time: Remove Ktime in hrtimer rust: time: Make HasHrTimer generic over HrTimerMode rust: time: Add HrTimerExpires trait rust: time: Replace HrTimerMode enum with trait-based mode types rust: time: Add ktime_get() to ClockSource trait rust: time: Make Instant generic over ClockSource rust: time: Replace ClockId enum with ClockSource trait rust: time: Avoid 64-bit integer division on 32-bit architectures
2025-07-17rust: net::phy Change module_phy_driver macro to use module_device_table macroFUJITA Tomonori1-27/+24
Change module_phy_driver macro to build device tables which are exported to userspace by using module_device_table macro. Acked-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Trevor Gross <tmgross@umich.edu> Signed-off-by: FUJITA Tomonori <fujita.tomonori@gmail.com> Link: https://lore.kernel.org/r/20250711040947.1252162-4-fujita.tomonori@gmail.com Signed-off-by: Danilo Krummrich <dakr@kernel.org>
2025-07-17rust: net::phy represent DeviceId as transparent wrapper over mdio_device_idFUJITA Tomonori1-27/+28
Refactor the DeviceId struct to be a #[repr(transparent)] wrapper around the C struct bindings::mdio_device_id. This refactoring is a preparation for enabling the PHY abstractions to use the RawDeviceId trait. Acked-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Trevor Gross <tmgross@umich.edu> Signed-off-by: FUJITA Tomonori <fujita.tomonori@gmail.com> Link: https://lore.kernel.org/r/20250711040947.1252162-3-fujita.tomonori@gmail.com Signed-off-by: Danilo Krummrich <dakr@kernel.org>
2025-07-17rust: device_id: split out index support into a separate traitFUJITA Tomonori6-47/+104
Introduce a new trait `RawDeviceIdIndex`, which extends `RawDeviceId` to provide support for device ID types that include an index or context field (e.g., `driver_data`). This separates the concerns of layout compatibility and index-based data embedding, and allows `RawDeviceId` to be implemented for types that do not contain a `driver_data` field. Several such structures are defined in include/linux/mod_devicetable.h. Refactor `IdArray::new()` into a generic `build()` function, which takes an optional offset. Based on the presence of `RawDeviceIdIndex`, index writing is conditionally enabled. A new `new_without_index()` constructor is also provided for use cases where no index should be written. This refactoring is a preparation for enabling the PHY abstractions to use the RawDeviceId trait. The changes to acpi.rs and driver.rs were made by Danilo. Reviewed-by: Trevor Gross <tmgross@umich.edu> Signed-off-by: FUJITA Tomonori <fujita.tomonori@gmail.com> Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Link: https://lore.kernel.org/r/20250711040947.1252162-2-fujita.tomonori@gmail.com Signed-off-by: Danilo Krummrich <dakr@kernel.org>
2025-07-17device: rust: rename Device::as_ref() to Device::from_raw()Alice Ryhl9-11/+11
The prefix as_* should not be used for a constructor. Constructors usually use the prefix from_* instead. Some prior art in the stdlib: Box::from_raw, CString::from_raw, Rc::from_raw, Arc::from_raw, Waker::from_raw, File::from_raw_fd. There is also prior art in the kernel crate: cpufreq::Policy::from_raw, fs::File::from_raw_file, Kuid::from_raw, ARef::from_raw, SeqFile::from_raw, VmaNew::from_raw, Io::from_raw. Link: https://lore.kernel.org/r/aCd8D5IA0RXZvtcv@pollux Signed-off-by: Alice Ryhl <aliceryhl@google.com> Reviewed-by: Benno Lossin <lossin@kernel.org> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Link: https://lore.kernel.org/r/20250711-device-as-ref-v2-1-1b16ab6402d7@google.com Signed-off-by: Danilo Krummrich <dakr@kernel.org>
2025-07-17bcachefs: Fix bch2_maybe_casefold() when CONFIG_UTF8=nKent Overstreet1-8/+7
maybe_casefold() shouldn't have been nooped, just bch2_casefold(). Fixes: 94426e4201fb ("bcachefs: opts.casefold_disabled") Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-07-17bcachefs: Fix build when CONFIG_UNICODE=nKent Overstreet2-0/+13
94426e4201fb, which added the killswitch for casefolding, accidentally removed some of the ifdefs we need to avoid build errors. It appears we need better build testing for different configurations, it took two weeks for the robots to catch this one. Fixes: 94426e4201fb ("bcachefs: opts.casefold_disabled") Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-07-17bcachefs: Fix reference to invalid bucket in copygcKent Overstreet1-1/+1
Use bch2_dev_bucket_tryget() instead of bch2_dev_tryget() before checking the bucket bitmap. Reported-by: syzbot+3168625f36f4a539237e@syzkaller.appspotmail.com Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-07-17bcachefs: Don't build aux search tree when still repairing nodeKent Overstreet1-3/+3
bch2_btree_node_drop_keys_outside_node() will (re)build aux search trees, because it's also called by topology repair. bch2_btree_node_read_done() was calling it before validating individual keys; invalid ones have to be dropped. If we call drop_keys_outside_node() first, then bch2_bset_build_aux_tree() doesn't run because the node already has an aux search tree - which was invalidated by the repair. Reported-by: syzbot+c5e7a66b3b23ae65d44f@syzkaller.appspotmail.com Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-07-17bcachefs: Tweak threshold for allocator triggering discardsKent Overstreet1-1/+2
The allocator path has a "if we're really low on free buckets, check if we should issue discards" - tweak this to also trigger discards if more than 1/128th of the device is in need_discard state. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-07-17bcachefs: Fix triggering of discard by the journal pathKent Overstreet1-0/+1
It becomes possible to do discards after a journal flush, which naturally the journal code is reponsible for. A prior refactoring seems to have broken this - which went unnoticed because the foreground allocator path can also trigger discards. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-07-17gfs2: No more self recoveryAndreas Gruenbacher1-20/+11
When a node withdraws and it turns out that it is the only node that has the filesystem mounted, gfs2 currently tries to replay the local journal to bring the filesystem back into a consistent state. Not only is that a very bad idea, it has also never worked because gfs2_recover_func() will refuse to do anything during a withdraw. However, before even getting to this point, gfs2_recover_func() dereferences sdp->sd_jdesc->jd_inode. This was a use-after-free before commit 04133b607a78 ("gfs2: Prevent double iput for journal on error") and is a NULL pointer dereference since then. Simply get rid of self recovery to fix that. Fixes: 601ef0d52e96 ("gfs2: Force withdraw to replay journals and wait for it to finish") Reported-by: Chunjie Zhu <chunjie.zhu@cloud.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2025-07-17thermal/drivers/mediatek/lvts_thermal: Add mt7988 lvts commandsMason Chang1-4/+12
These commands are necessary to avoid severely abnormal and inaccurate temperature readings that are caused by using the default commands. Signed-off-by: Mason Chang <mason-cw.chang@mediatek.com> Link: https://lore.kernel.org/r/20250526102659.30225-4-mason-cw.chang@mediatek.com Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
2025-07-17thermal/drivers/mediatek/lvts_thermal: Add lvts commands and their sizes to ↵Mason Chang1-13/+52
driver data Add LVTS commands and their sizes to driver data in preparation for adding different commands. Signed-off-by: Mason Chang <mason-cw.chang@mediatek.com> Link: https://lore.kernel.org/r/20250526102659.30225-3-mason-cw.chang@mediatek.com Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
2025-07-17thermal/drivers/mediatek/lvts_thermal: Change lvts commands array to static ↵Mason Chang1-14/+15
const Change the LVTS commands array to static const in preparation for adding different commands. Signed-off-by: Mason Chang <mason-cw.chang@mediatek.com> Link: https://lore.kernel.org/r/20250526102659.30225-2-mason-cw.chang@mediatek.com Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
2025-07-16drm/amdgpu/gfx8: reset compute ring wptr on the GPU on resumeEeli Haapalainen1-0/+1
Commit 42cdf6f687da ("drm/amdgpu/gfx8: always restore kcq MQDs") made the ring pointer always to be reset on resume from suspend. This caused compute rings to fail since the reset was done without also resetting it for the firmware. Reset wptr on the GPU to avoid a disconnect between the driver and firmware wptr. Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3911 Fixes: 42cdf6f687da ("drm/amdgpu/gfx8: always restore kcq MQDs") Signed-off-by: Eeli Haapalainen <eeli.haapalainen@protonmail.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit 2becafc319db3d96205320f31cc0de4ee5a93747) Cc: stable@vger.kernel.org