summaryrefslogtreecommitdiff
path: root/net
AgeCommit message (Collapse)AuthorFilesLines
4 daysMerge tag 'nf-26-06-10' of ↵Paolo Abeni18-51/+150
git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf Pablo Neira Ayuso says: ==================== Netfilter fixes for net The following patchset contains Netfilter fixes for net: 1) Revalidate bridge ports, add missing NULL checks to fetch the bridge device by the port. From Florian Westphal. 2) Fix netdevice refcount leak in the error path of nft_fwd hardware offload function, also from Florian. 3) Unregister helper expectfn callback on conntrack helper module removal, otherwise dangling pointer remains in place, from Weiming Shi. 4) Fix possible pointer infoleak in getsockopt() IPT_SO_GET_ENTRIES, From Kyle Zeng. 5) Validate that device MAC header is present before nf_syslog accesses it. From Xiang Mei. 6-8) Three patches to address a possible infoleak of stale stack data in three nf_tables expressions, due to mismatch in the _init() and _eval() function which is possible since 14fb07130c7d. From Davide Ornaghi and Florian Westphal. netfilter pull request 26-06-10 * tag 'nf-26-06-10' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf: netfilter: nft_meta_bridge: fix stale stack leak via IIFHWADDR register netfilter: nft_fib: fix stale stack leak via the OIFNAME register netfilter: nft_exthdr: fix register tracking for F_PRESENT flag netfilter: nf_log: validate MAC header was set before dumping it netfilter: x_tables: avoid leaking percpu counter pointers netfilter: nf_conntrack: destroy stale expectfn expectations on unregister netfilter: nf_tables_offload: drop device refcount on error netfilter: revalidate bridge ports ==================== Link: https://patch.msgid.link/20260610161629.214092-1-pablo@netfilter.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
4 daysMerge tag 'ipsec-2026-06-10' of ↵Paolo Abeni5-27/+35
git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec Steffen Klassert says: ==================== pull request (net): ipsec 2026-06-10 1) xfrm: iptfs: preserve shared-frag marker in iptfs_consume_frags() Propagate SKBFL_SHARED_FRAG when paged fragments are moved between skbs so ESP can decide whether in-place crypto is safe. 2) xfrm: iptfs: fix use-after-free on first_skb in __input_process_payload Replace the unlocked read of xtfs->ra_newskb with a local flag so a concurrent reassembly can no longer free first_skb between spin_unlock and the post-loop check. 3) xfrm: policy: fix use-after-free on inexact bin in xfrm_policy_bysel_ctx() Prune the inexact bin under xfrm_policy_lock so a concurrent xfrm_hash_rebuild() can no longer free it before xfrm_policy_kill() dereferences it. 4) xfrm: iptfs: fix ABBA deadlock in iptfs_destroy_state() Move hrtimer_cancel() for the output and drop timers ahead of their spinlocks, breaking the softirq/lock cycle that could deadlock against the timer callbacks on SMP. 5) xfrm: espintcp: do not reuse an in-progress partial send Fail a new send when espintcp_push_msgs() returns with emsg->len still set, so a blocking caller can no longer overwrite ctx->partial while a previous transfer still owns it. 6) esp: fix page frag reference leak on skb_to_sgvec failure Add a flag to esp_ssg_unref() to unconditionally unref the source scatterlist, releasing the old page references that are otherwise leaked when the second skb_to_sgvec() in esp_output_tail() fails. Please pull or let me know if there are problems. ipsec-2026-06-10 * tag 'ipsec-2026-06-10' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec: esp: fix page frag reference leak on skb_to_sgvec failure xfrm: espintcp: do not reuse an in-progress partial send xfrm: iptfs: fix ABBA deadlock in iptfs_destroy_state() xfrm: policy: fix use-after-free on inexact bin in xfrm_policy_bysel_ctx() xfrm: iptfs: fix use-after-free on first_skb in __input_process_payload xfrm: iptfs: preserve shared-frag marker in iptfs_consume_frags() ==================== Link: https://patch.msgid.link/20260610140800.2562818-1-steffen.klassert@secunet.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
4 daysipv6: Fix a potential NPD in cleanup_prefix_route()Ido Schimmel1-2/+4
addrconf_get_prefix_route() can return the fib6_null_entry sentinel entry which has a NULL fib6_table pointer. Therefore, before setting the route's expiration time, check that we are not working with this entry, as otherwise a NPD will be triggered [1]. Note that the other callers of addrconf_get_prefix_route() are not susceptible to this bug: 1. addrconf_prefix_rcv(): Requests a route with the 'RTF_ADDRCONF | RTF_PREFIX_RT' flags which are not set on fib6_null_entry. 2. modify_prefix_route(): Fixed by commit a747e02430df ("ipv6: avoid possible NULL deref in modify_prefix_route()"). 3. __ipv6_ifa_notify(): Calls ip6_del_rt() which specifically checks for fib6_null_entry and returns an error. [1] Oops: general protection fault, probably for non-canonical address 0xdffffc0000000006: 0000 [#1] SMP KASAN KASAN: null-ptr-deref in range [0x0000000000000030-0x0000000000000037] [...] Call Trace: <TASK> __kasan_check_byte (mm/kasan/common.c:573) lock_acquire.part.0 (kernel/locking/lockdep.c:5842 (discriminator 1)) _raw_spin_lock_bh (kernel/locking/spinlock.c:182 (discriminator 1)) cleanup_prefix_route (net/ipv6/addrconf.c:1280) ipv6_del_addr (net/ipv6/addrconf.c:1342) inet6_addr_del.isra.0 (net/ipv6/addrconf.c:3119) inet6_rtm_deladdr (net/ipv6/addrconf.c:4812) rtnetlink_rcv_msg (net/core/rtnetlink.c:6997) netlink_rcv_skb (net/netlink/af_netlink.c:2555) netlink_unicast (net/netlink/af_netlink.c:1344) netlink_sendmsg (net/netlink/af_netlink.c:1899) __sock_sendmsg (net/socket.c:802 (discriminator 4)) ____sys_sendmsg (net/socket.c:2698) ___sys_sendmsg (net/socket.c:2752) __sys_sendmsg (net/socket.c:2784) do_syscall_64 (arch/x86/entry/syscall_64.c:63 arch/x86/entry/syscall_64.c:94) entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:121) Fixes: 5eb902b8e719 ("net/ipv6: Remove expired routes with a separated list of routes.") Reported-by: Ji'an Zhou <eilaimemedsnaimel@gmail.com> Reviewed-by: David Ahern <dahern@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/20260609145448.768318-1-idosch@nvidia.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
5 daysnetfilter: nft_meta_bridge: fix stale stack leak via IIFHWADDR registerDavide Ornaghi1-0/+2
NFT_META_BRI_IIFHWADDR declares its destination register with len = ETH_ALEN (6 bytes), which the register-init tracking rounds up to two 32-bit registers (8 bytes). nft_meta_bridge_get_eval() then does memcpy(dest, br_dev->dev_addr, ETH_ALEN), writing only 6 bytes and leaving the upper 2 bytes of the second register as uninitialised nft_do_chain() stack. A downstream load of that register span leaks those stale bytes to userspace. Zero the second register before the memcpy so the full declared span is written. Fixes: cbd2257dc96e ("netfilter: nft_meta_bridge: introduce NFT_META_BRI_IIFHWADDR support") Cc: stable@vger.kernel.org Signed-off-by: Davide Ornaghi <d.ornaghi97@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
5 daysnetfilter: nft_fib: fix stale stack leak via the OIFNAME registerDavide Ornaghi3-2/+8
For NFT_FIB_RESULT_OIFNAME the destination register is declared with len = IFNAMSIZ (four 32-bit registers), but on the lookup-fail, RTN_LOCAL and oif-mismatch paths nft_fib{4,6}_eval() only writes one register via "*dest = 0". The remaining three registers are left as whatever was on the stack in nft_do_chain()'s struct nft_regs, and a downstream expression that loads the register span can leak that uninitialised kernel stack to userspace. The NFTA_FIB_F_PRESENT existence check has the same shape: it is only meaningful for NFT_FIB_RESULT_OIF, yet it was accepted for any result type while the eval stores a single byte via nft_reg_store8(), leaving the rest of the declared span stale. Fix both: - replace the bare "*dest = 0" in the eval with nft_fib_store_result(), which strscpy_pad()s the whole IFNAMSIZ for OIFNAME (and is already used on the other early-return path), and - restrict NFTA_FIB_F_PRESENT to NFT_FIB_RESULT_OIF and declare its destination as a single u8, so the marked span matches the one byte the eval writes. Fixes: f6d0cbcf09c5 ("netfilter: nf_tables: add fib expression") Suggested-by: Florian Westphal <fw@strlen.de> Cc: stable@vger.kernel.org Signed-off-by: Davide Ornaghi <d.ornaghi97@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
5 daysnetfilter: nft_exthdr: fix register tracking for F_PRESENT flagFlorian Westphal1-0/+3
nft_exthdr_init() passes user-controlled priv->len to nft_parse_register_store(), which marks that many bytes in the register bitmap as initialized. However, when NFT_EXTHDR_F_PRESENT is set, the eval paths write only 1 byte (nft_reg_store8) or 4 bytes (*dest = 0 on TCP/DCCP error path). When len > 4, registers beyond the first are never written, retaining uninitialized stack data from nft_regs. Bail out if userspace requests too much data when F_PRESENT is set. Reported-by: Ji'an Zhou <eilaimemedsnaimel@gmail.com> Fixes: c078ca3b0c5b ("netfilter: nft_exthdr: Add support for existence check") Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
5 daysnetfilter: nf_log: validate MAC header was set before dumping itXiang Mei1-2/+2
The fallback path of dump_mac_header() guards the MAC header access only with "skb->mac_header != skb->network_header", without checking skb_mac_header_was_set(). When the MAC header is unset, mac_header is 0xffff, so the test passes and skb_mac_header(skb) returns skb->head + 0xffff, ~64 KiB past the buffer; the loop then reads dev->hard_header_len bytes out of bounds into the kernel log. This is reachable via the netdev logger: nf_log_unknown_packet() calls dump_mac_header() unconditionally, and an skb sent through AF_PACKET with PACKET_QDISC_BYPASS reaches the egress hook with mac_header still unset (__dev_queue_xmit(), which would reset it, is bypassed). Add the skb_mac_header_was_set() check the ARPHRD_ETHER path already uses, and replace the open-coded MAC header length test with skb_mac_header_len(). Only skbs with an unset MAC header are affected; valid ones are dumped as before. BUG: KASAN: slab-out-of-bounds in dump_mac_header (net/netfilter/nf_log_syslog.c:831) Read of size 1 at addr ffff88800ea49d3f by task exploit/148 Call Trace: kasan_report (mm/kasan/report.c:595) dump_mac_header (net/netfilter/nf_log_syslog.c:831) nf_log_netdev_packet (net/netfilter/nf_log_syslog.c:938 net/netfilter/nf_log_syslog.c:963) nf_log_packet (net/netfilter/nf_log.c:260) nft_log_eval (net/netfilter/nft_log.c:60) nft_do_chain (net/netfilter/nf_tables_core.c:285) nft_do_chain_netdev (net/netfilter/nft_chain_filter.c:307) nf_hook_slow (net/netfilter/core.c:619) nf_hook_direct_egress (net/packet/af_packet.c:257) packet_xmit (net/packet/af_packet.c:280) packet_sendmsg (net/packet/af_packet.c:3114) __sys_sendto (net/socket.c:2265) Fixes: 7eb9282cd0ef ("netfilter: ipt_LOG/ip6t_LOG: add option to print decoded MAC header") Reported-by: Weiming Shi <bestswngs@gmail.com> Assisted-by: Claude:claude-opus-4-8 Signed-off-by: Xiang Mei <xmei5@asu.edu> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
5 daysnetfilter: x_tables: avoid leaking percpu counter pointersKyle Zeng3-27/+18
The native and compat get-entries paths copy the fixed rule entry header from the kernelized rule blob to userspace before overwriting the entry's counter fields with a sanitized counter snapshot. On SMP kernels, entry->counters.pcnt contains the percpu allocation address used by x_tables rule counters. A caller can provide a userspace buffer that faults during the initial fixed-header copy after pcnt has been copied but before the later sanitized counter copy runs. The syscall then returns -EFAULT while leaving the raw percpu pointer in userspace. Copy only the fixed entry prefix before counters from the kernelized rule blob, then copy the sanitized counter snapshot into the counter field. Apply this ordering to the IPv4, IPv6, and ARP native and compat get-entries implementations so a fault cannot expose the internal percpu counter pointer. Fixes: 71ae0dff02d7 ("netfilter: xtables: use percpu rule counters") Signed-off-by: Kyle Zeng <kylebot@openai.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
5 daysnetfilter: nf_conntrack: destroy stale expectfn expectations on unregisterWeiming Shi4-0/+24
NAT helpers such as nf_nat_h323 store a raw pointer to module text in exp->expectfn (e.g. ip_nat_q931_expect). nf_ct_helper_expectfn_unregister() only unlinks the callback descriptor and never walks the expectation table, so an expectation pending at module removal survives with a dangling exp->expectfn into freed module text. When the expected connection arrives, init_conntrack() invokes exp->expectfn(), now a stale pointer into the unloaded module. Reproduced on a KASAN build by loading the H.323 helpers, creating a Q.931 expectation, unloading nf_nat_h323, then connecting to the expected port: Oops: int3: 0000 [#1] SMP KASAN NOPTI RIP: 0010:0xffffffffa06102d1 init_conntrack.isra.0 (net/netfilter/nf_conntrack_core.c:1862) nf_conntrack_in (net/netfilter/nf_conntrack_core.c:2049) ipv4_conntrack_local (net/netfilter/nf_conntrack_proto.c:223) nf_hook_slow (net/netfilter/core.c:619) __ip_local_out (net/ipv4/ip_output.c:120) __tcp_transmit_skb (net/ipv4/tcp_output.c:1715) tcp_connect (net/ipv4/tcp_output.c:4374) tcp_v4_connect (net/ipv4/tcp_ipv4.c:345) __sys_connect (net/socket.c:2167) Modules linked in: nf_conntrack_h323 [last unloaded: nf_nat_h323] Reaching the dangling state requires CAP_SYS_MODULE in the initial user namespace to remove a NAT helper that still has live expectations, so this is a robustness fix; leaving an expectation pointing at freed text is wrong regardless. Add nf_ct_helper_expectfn_destroy(), which walks the expectation table and drops every expectation whose ->expectfn matches the descriptor being torn down. Call it from each NAT helper's exit path after the existing RCU grace period, so no expectation outlives the code it points at and no extra synchronize_rcu() is introduced. With the fix, the same reproducer runs to completion without the Oops. Fixes: f587de0e2feb ("[NETFILTER]: nf_conntrack/nf_nat: add H.323 helper port") Reported-by: Xiang Mei <xmei5@asu.edu> Assisted-by: Claude:claude-opus-4-8 Signed-off-by: Weiming Shi <bestswngs@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
5 daysnetfilter: nf_tables_offload: drop device refcount on errorFlorian Westphal1-2/+4
Reported by sashiko: If nft_flow_action_entry_next() returns NULL, dev reference leaks. Fixes: c6f85577584b ("netfilter: nf_tables_offload: add nft_flow_action_entry_next() and use it") Reported-by: Juri Lelli <juri.lelli@redhat.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
5 daysnetfilter: revalidate bridge portsFlorian Westphal4-18/+89
ebt_redirect_tg() dereferences br_port_get_rcu() return without a NULL check, causing a kernel panic when the bridge port has been removed between the original hook invocation and an NFQUEUE reinject. A mere NULL check isn't sufficient, however. As sashiko review points out userspace can not only remove the port from the bridge, it could also place the device in a different virtual device, e.g. macvlan. If this happens, we must drop the packet, there is no way for us to reinject it into the bridge path. Switch to _upper API, we don't need the bridge port structure. Also, this fix keeps another bug intact: Both nfnetlink_log and nfnetlink_queue use CONFIG_BRIDGE_NETFILTER too aggressive, which prevents certain logging features when queueing in bridge family: NETFILTER_FAMILY_BRIDGE can be enabled while the old CONFIG_BRIDGE_NETFILTER cruft is off. Fixes tag is a common ancestor, this was always broken. Fixes: f350a0a87374 ("bridge: use rx_handler_data pointer to store net_bridge_port pointer") Reported-by: Ji'an Zhou <eilaimemedsnaimel@gmail.com> Assisted-by: Claude:claude-sonnet-4-6 Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
5 daysrds: mark snapshot pages dirty in rds_info_getsockopt()Breno Leitao1-1/+1
rds_info_getsockopt() pins the destination user pages with FOLL_WRITE and the RDS_INFO_* producers memcpy the snapshot into them through kmap_atomic(). Because that copy goes through the kernel direct map, the dirty bit on the user PTE is never set, so unpin_user_pages() releases the pages without marking them dirty. A file-backed destination page can then be reclaimed without writeback, silently discarding the copied data. Use unpin_user_pages_dirty_lock() with make_dirty=true so the modified pages are marked dirty before they are unpinned. Fixes: a8c879a7ee98 ("RDS: Info and stats") Signed-off-by: Breno Leitao <leitao@debian.org> Reviewed-by: Allison Henderson <achender@kernel.org> Link: https://patch.msgid.link/20260608-rds_fix-v1-1-006c88543408@debian.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 daysip6_vti: fix incorrect tunnel matching in vti6_tnl_lookup()Eric Dumazet1-0/+2
In vti6_tnl_lookup(), when an exact match for a tunnel fails, the code falls back to searching for wildcard tunnels: - Tunnels matching the packet's local address, with any remote address wildcard remote). - Tunnels matching the packet's remote address, with any local address (wildcard local). However, vti6 stores all these different types of tunnels in the same hash table (ip6n->tnls_r_l) prone to hash collisions. The bug is that the fallback search loops in vti6_tnl_lookup() were missing checks to ensure that the candidate tunnel actually has a wildcard address. Fixes: fbe68ee87522 ("vti6: Add a lookup method for tunnels with wildcard endpoints.") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Steffen Klassert <steffen.klassert@secunet.com> Reviewed-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Link: https://patch.msgid.link/20260608164613.933023-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 daysnet/rds: fix NULL deref in rds_ib_send_cqe_handler() on masked atomic completionWeiming Shi1-0/+2
rds_ib_xmit_atomic() always programs a masked atomic opcode (IB_WR_MASKED_ATOMIC_CMP_AND_SWP or IB_WR_MASKED_ATOMIC_FETCH_AND_ADD) for every RDS atomic cmsg. But the completion-side switch in rds_ib_send_unmap_op() only handles the non-masked opcodes, so a masked atomic completion falls through to default and returns rm == NULL while send->s_op is left set. rds_ib_send_cqe_handler() then dereferences the NULL rm via rm->m_final_op, oopsing in softirq context. An unprivileged AF_RDS sendmsg() of an atomic cmsg over an active RDS/IB connection triggers it; on hardware that natively accepts masked atomics (mlx4, mlx5) no extra setup is needed. RDS/IB: rds_ib_send_unmap_op: unexpected opcode 0xd in WR! Oops: general protection fault [#1] SMP KASAN KASAN: null-ptr-deref in range [0x0000000000000190-0x0000000000000197] RIP: rds_ib_send_cqe_handler+0x25c/0xb10 (net/rds/ib_send.c:282) Call Trace: <IRQ> rds_ib_send_cqe_handler (net/rds/ib_send.c:282) poll_scq (net/rds/ib_cm.c:274) rds_ib_tasklet_fn_send (net/rds/ib_cm.c:294) tasklet_action_common (kernel/softirq.c:943) handle_softirqs (kernel/softirq.c:573) run_ksoftirqd (kernel/softirq.c:479) </IRQ> Kernel panic - not syncing: Fatal exception in interrupt Handle the masked atomic opcodes in the same case as the non-masked ones: they map to the same struct rds_message.atomic union member, so the existing container_of()/rds_ib_send_unmap_atomic() body is correct for them. Fixes: 20c72bd5f5f9 ("RDS: Implement masked atomic operations") Reported-by: Xiang Mei <xmei5@asu.edu> Signed-off-by: Weiming Shi <bestswngs@gmail.com> Reviewed-by: Allison Henderson <achender@kernel.org> Link: https://patch.msgid.link/20260606192447.1179255-2-bestswngs@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 daysnet: guard timestamp cmsgs to real error queue skbsKyle Zeng2-8/+9
skb_is_err_queue() treats PACKET_OUTGOING as the sole marker for an skb from sk_error_queue. That assumption is not true for AF_PACKET sockets: outgoing packet taps are also delivered to packet sockets with skb->pkt_type == PACKET_OUTGOING, but their skb->cb is owned by AF_PACKET instead of struct sock_exterr_skb. If such an skb is received with timestamping enabled, the generic timestamp cmsg path can read AF_PACKET control-buffer state as sock_exterr_skb::opt_stats. With SO_RXQ_OVFL enabled, the packet drop counter overlaps opt_stats. An odd drop count makes the path emit SCM_TIMESTAMPING_OPT_STATS with skb->len and skb->data. For non-linear skbs this copies past the linear head and can trigger hardened usercopy or disclose adjacent heap contents. Keep skb_is_err_queue() local to net/socket.c, but make it verify that the PACKET_OUTGOING marker is paired with the sock_rmem_free destructor installed by sock_queue_err_skb(). AF_PACKET receive skbs use normal receive ownership and no longer pass as error-queue skbs, while legitimate sk_error_queue entries keep the PACKET_OUTGOING marker and sock_rmem_free ownership. Fixes: 8605330aac5a ("tcp: fix SCM_TIMESTAMPING_OPT_STATS for normal skbs") Signed-off-by: Kyle Zeng <kylebot@openai.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20260607021819.49698-1-kylebot@openai.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 dayssctp: validate embedded INIT chunk and address list lengths in cookieXin Long2-3/+17
sctp_unpack_cookie() only checked that the embedded INIT chunk length did not exceed the remaining cookie payload, but did not ensure that the INIT chunk is large enough to contain a complete INIT header. A malformed COOKIE_ECHO can therefore carry a truncated INIT chunk whose length field is smaller than sizeof(struct sctp_init_chunk). Later, sctp_process_init() accesses INIT parameters unconditionally, which may lead to out-of-bounds reads. In addition, raw_addr_list_len is not fully validated against the remaining cookie payload. When cookie authentication is disabled, an attacker can supply an oversized raw_addr_list_len and cause sctp_raw_to_bind_addrs() to read beyond the end of the cookie. The address parser also lacks sufficient bounds checks for parameter headers and lengths, allowing malformed address parameters to trigger out-of-bounds reads. Fix this by: - requiring the embedded INIT chunk length to be at least sizeof(struct sctp_init_chunk); - validating that the INIT chunk and raw address list together fit within the cookie payload; - verifying sufficient data exists for each address parameter header and payload before parsing it. Note that sctp_verify_init() must be called after sctp_unpack_cookie() and before sctp_process_init() when cookie authentication is disabled. This will be addressed in a separate patch. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Reported-by: Sashiko <sashiko-bot@kernel.org> Signed-off-by: Xin Long <lucien.xin@gmail.com> Link: https://patch.msgid.link/75af23a89adf881a0895d511775e4770da367cbf.1780873427.git.lucien.xin@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 daysip6_vti: set netns_immutable on the fallback device.Eric Dumazet1-0/+1
john1988 and Noam Rathaus reported that vti6_init_net() does not set the netns_immutable flag on the per-netns fallback tunnel device (ip6_vti0). Other similar tunnel drivers (like ip6_tunnel, sit, ip6_gre, and ip_tunnel) correctly set this flag during their fallback device initialization to prevent them from being moved to another network namespace. Fixes: 61220ab34948 ("vti6: Enable namespace changing") Reported-by: Noam Rathaus <noamr@ssd-disclosure.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Steffen Klassert <steffen.klassert@secunet.com> Reviewed-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Link: https://patch.msgid.link/20260608155918.787644-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 dayssctp: fix uninit-value in __sctp_rcv_asconf_lookup()Michael Bommarito1-0/+8
__sctp_rcv_asconf_lookup() in net/sctp/input.c only checks that the ASCONF chunk can hold the ADDIP header and a parameter header, then calls af->from_addr_param(), which reads the full address (16 bytes for IPv6) trusting the parameter's declared length. An unauthenticated peer can send a truncated trailing ASCONF chunk that declares an IPv6 address parameter but stops after the 4-byte parameter header; reached from the no-association lookup path, from_addr_param() then reads uninitialized bytes past the parameter. Impact: an unauthenticated SCTP peer makes the receive path read up to 16 bytes of uninitialized memory past a truncated ASCONF address parameter. The sibling __sctp_rcv_init_lookup() bounds parameters with sctp_walk_params(); this path open-codes the fetch and omits the bound. Verify the whole address parameter lies within the chunk before from_addr_param() reads it, the same class of fix as commit 51e5ad549c43 ("net: sctp: fix KMSAN uninit-value in sctp_inq_pop"). Fixes: df2185771439 ("[SCTP]: Update association lookup to look at ASCONF chunks as well") Signed-off-by: Michael Bommarito <michael.bommarito@gmail.com> Acked-by: Xin Long <lucien.xin@gmail.com> Link: https://patch.msgid.link/20260608122234.459098-1-michael.bommarito@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 dayssctp: stream: fully roll back denied add-stream stateWyatt Feng1-1/+5
When ADD_OUT_STREAMS is denied, SCTP only shrinks the queued chunks and then lowers outcnt. That leaves removed stream metadata behind, so a later re-add can reuse a stale ext and hit a null-pointer dereference in the scheduler get path. Fix the rollback by tearing down the removed stream state the same way other stream resizes do. Unschedule the current scheduler state, drop the removed stream ext state with sctp_stream_outq_migrate(), and then reschedule the remaining streams. This keeps scheduler-private RR/FC/PRIO lists consistent while fully rolling back denied outgoing stream additions. Fixes: 637784ade221 ("sctp: introduce priority based stream scheduler") Cc: stable@kernel.org Reported-by: Yuan Tan <yuantan098@gmail.com> Reported-by: Yifan Wu <yifanwucs@gmail.com> Reported-by: Juefei Pu <tomapufckgml@gmail.com> Reported-by: Zhengchuan Liang <zcliangcn@gmail.com> Reported-by: Xin Liu <bird@lzu.edu.cn> Signed-off-by: Wyatt Feng <bronzed_45_vested@icloud.com> Signed-off-by: Ren Wei <n05ec@lzu.edu.cn> Acked-by: Xin Long <lucien.xin@gmail.com> Link: https://patch.msgid.link/d78954ecd94954653ee299400e98d74a03a6f7d3.1780603399.git.bronzed_45_vested@icloud.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 daysesp: fix page frag reference leak on skb_to_sgvec failureAlessandro Schino2-12/+22
In esp_output_tail(), when esp->inplace is false, the old skb page frags are replaced with a new page from the xfrm page_frag cache The source scatterlist (sg) is built from the old frags before the replacement, and esp_ssg_unref() is responsible for releasing the old page references after the crypto operation completes However, if the second skb_to_sgvec() call (which builds the destination scatterlist from the new page) fails, the code jumps to error_free which only calls kfree(tmp). The old page frag references captured in the source scatterlist are never released: 1 sg[] is built from old frags via skb_to_sgvec() (no extra get_page) 2 nr_frags is set to 1 and frag[0] is replaced with the new page 3 Second skb_to_sgvec() fails -> goto error_free Fix this by adding a bool parameter to esp_ssg_unref() that, when true, unconditionally unrefs the source scatterlist frags. Since req->src is not yet initialized by aead_request_set_crypt() at the point of the error, the source scatterlist is obtained directly via esp_req_sg() Existing callers pass false to preserve the original behavior The same issue exists in both esp4 and esp6 as the code is identical Fixes: cac2661c53f3 ("esp4: Avoid skb_cow_data whenever possible") Fixes: 03e2a30f6a27 ("esp6: Avoid skb_cow_data whenever possible") Signed-off-by: Alessandro Schino <7991aleschino@gmail.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
6 daysrxrpc: Fix the ACK parser to extract the SACK table for parsingDavid Howells1-9/+17
Fix modification of the received skbuff in rxrpc_input_soft_acks() and a potential incorrect access of the buffer in a fragmented UDP packet (the packet would probably have to be deliberately pre-generated as fragmented) when AF_RXRPC tries to extract the contents of the SACK table by copying out the contents of the SACK table into a buffer before attempting to parse AF_RXRPC assumes that it can just call skb_condense() and then validly access the SACK table from skb->data and that it will be a flat buffer - but skb_condense() can silently fail to do anything under some circumstances. Note that whilst rxrpc_input_soft_acks() should be able to parse extended ACKs, the rest of AF_RXRPC doesn't currently support that. Further, there's then no need to call skb_condense() in rxrpc_input_ack(), so don't. Fixes: d57a3a151660 ("rxrpc: Save last ACK's SACK table rather than marking txbufs") Reported-by: Michael Bommarito <michael.bommarito@gmail.com> Link: https://lore.kernel.org/r/20260513180907.2061972-1-michael.bommarito@gmail.com Signed-off-by: David Howells <dhowells@redhat.com> cc: Marc Dionne <marc.dionne@auristor.com> cc: Jeffrey Altman <jaltman@auristor.com> cc: Eric Dumazet <edumazet@google.com> cc: "David S. Miller" <davem@davemloft.net> cc: Jakub Kicinski <kuba@kernel.org> cc: Paolo Abeni <pabeni@redhat.com> cc: Simon Horman <horms@kernel.org> cc: linux-afs@lists.infradead.org cc: netdev@vger.kernel.org cc: stable@kernel.org Link: https://patch.msgid.link/105362.1780573560@warthog.procyon.org.uk Signed-off-by: Paolo Abeni <pabeni@redhat.com>
7 daysnet: openvswitch: fix possible kfree_skb of ERR_PTRAdrian Moreno1-0/+1
After the patch in the "Fixes" tag, the allocation of the "reply" skb can happen either before or after locking the ovs_mutex. However, error cleanups still follow the classical reversed order, assuming "reply" is allocated before locking: it is freed after unlocking. If "reply" allocation happens after locking the mutex and it fails, "reply" is left with an ERR_PTR, and execution jumps to the correspondent cleanup stage which will try to free an invalid pointer. Fix this by setting the pointer to NULL after having saved its error value. Fixes: 893f139b9a6c ("openvswitch: Minimize ovs_flow_cmd_new|set critical sections.") Signed-off-by: Adrian Moreno <amorenoz@redhat.com> Reviewed-by: Aaron Conole <aconole@redhat.com> Acked-by: Eelco Chaudron <echaudro@redhat.com> Link: https://patch.msgid.link/20260604121946.942164-1-amorenoz@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
7 daysipv6: sit: reload inner IPv6 header after GSO offloadsKyle Zeng1-0/+1
ipip6_tunnel_xmit() caches the inner IPv6 header pointer at function entry and continues using it after iptunnel_handle_offloads(). For GSO skbs, iptunnel_handle_offloads() calls skb_header_unclone(). When the skb header is cloned, skb_header_unclone() can call pskb_expand_head(), which may move the skb head. The pskb_expand_head() contract requires pointers into the skb header to be reloaded after the call. If the later skb_realloc_headroom() branch is not taken, SIT uses the stale iph6 pointer to read the inner hop limit and DS field. That can read from a freed skb head after the old head's remaining clone is released. Reload iph6 after the offload helper succeeds and before subsequent reads from the inner IPv6 header. Keep the existing reload after skb_realloc_headroom(), since that branch can also replace the skb. Fixes: 14909664e4e1 ("sit: Setup and TX path for sit/UDP foo-over-udp encapsulation") Signed-off-by: Kyle Zeng <kylebot@openai.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot+6eb9ca986d80f6f88cf9@syzkaller.appspotmail.com Link: https://patch.msgid.link/20260605073448.6524-1-kylebot@openai.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
7 daysnet: qrtr: fix refcount saturation and potential UAF in qrtr_port_removeMingyu Wang1-2/+2
In qrtr_port_remove(), the socket reference count is decremented via __sock_put() before the port is removed from the qrtr_ports XArray and before the RCU grace period elapses. This breaks the fundamental RCU update paradigm. It exposes a race window where a concurrent RCU reader (such as qrtr_reset_ports() or qrtr_port_lookup()) can obtain a pointer to the socket from the XArray, and attempt to call sock_hold() on a socket whose reference count has already dropped to zero. This exact race condition was hit during syzkaller fuzzing, leading to the following refcount saturation warning and a potential Use-After-Free: refcount_t: saturated; leaking memory. WARNING: CPU: 3 PID: 1273 at lib/refcount.c:22 refcount_warn_saturate+0xae/0x1d0 Modules linked in: qrtr(+) bochs drm_shmem_helper ... Call Trace: <TASK> qrtr_reset_ports net/qrtr/af_qrtr.c:768 [inline] [qrtr] __qrtr_bind.isra.0+0x48b/0x570 net/qrtr/af_qrtr.c:805 [qrtr] qrtr_bind+0x17d/0x210 net/qrtr/af_qrtr.c:901 [qrtr] kernel_bind+0xe4/0x120 net/socket.c:3592 qrtr_ns_init+0x1a6/0x380 net/qrtr/ns.c:715 [qrtr] qrtr_proto_init+0x3b/0xff0 net/qrtr/af_qrtr.c:169 [qrtr] do_one_initcall+0xf5/0x5e0 init/main.c:1283 ... </TASK> Fix this by deferring the reference count decrement until after the xa_erase() and the synchronize_rcu() complete. (Note: The v1 of this patch incorrectly replaced __sock_put() with sock_put(). As Simon Horman pointed out, the callers of qrtr_port_remove() still hold a reference to the socket, so freeing the socket memory here would lead to a subsequent UAF in the caller. Thus, the __sock_put() is kept, but only repositioned to close the RCU race.) Fixes: bdabad3e363d ("net: Add Qualcomm IPC router") Signed-off-by: Mingyu Wang <25181214217@stu.xidian.edu.cn> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20260604064801.1180388-1-w15303746062@163.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
7 daysnetdev: fix double-free in netdev_nl_bind_rx_doit()Jakub Kicinski1-3/+1
Sashiko flags that genlmsg_reply() always consumes the skb. The error path calls nlmsg_free(rsp) so we can't jump directly to it. Let's not unbind, just propagate the error to the user. This is the typical way of handling genlmsg_reply() failures. They shouldn't happen unless user does something silly like calling the kernel with an already-full rcvbuf. Reported-by: Sashiko <sashiko-bot@kernel.org> Fixes: 170aafe35cb9 ("netdev: support binding dma-buf to netdevice") Reviewed-by: Bobby Eshleman <bobbyeshleman@meta.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
7 daysnet: phonet: free phonet_device after RCU grace periodSantosh Kalluri1-1/+1
phonet_device_destroy() removes a phonet_device from the per-net device list with list_del_rcu(), but frees it immediately. RCU readers walking the same list can still hold a pointer to the object after it has been removed, leading to a slab-use-after-free. Use kfree_rcu(), matching the lifetime rule already used by phonet_address_del() for the same object type. Fixes: eeb74a9d45f7 ("Phonet: convert devices list to RCU") Cc: stable@vger.kernel.org Signed-off-by: Santosh Kalluri <santosh.kalluri129@gmail.com> Acked-by: Rémi Denis-Courmont <remi@remlab.net> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
7 daysnet: add pskb_may_pull() to skb_gro_receive_list()HanQuan1-0/+5
skb_gro_receive_list() calls skb_pull(skb, skb_gro_offset(skb)) without first ensuring the data is in the linear area via pskb_may_pull(). When the skb arrives via napi_gro_frags(), skb_headlen can be 0 (all data in page fragments) while skb_gro_offset is non-zero (after IP+TCP header parsing). The skb_pull() then decrements skb->len by skb_gro_offset but skb->data_len stays unchanged, hitting BUG_ON(skb->len < skb->data_len) in __skb_pull(). The UDP fraglist GRO path already contains this guard at udp_offload.c:749. Adding it to skb_gro_receive_list() itself provides centralized protection for all callers (TCP, UDP, and any future protocols), and ensures the precondition of skb_pull() is satisfied before it is called. On pskb_may_pull() failure, set NAPI_GRO_CB(skb)->flush = 1 so the skb is not held as a new GRO head and is instead delivered through the normal receive path, matching the UDP handling. Fixes: 8d95dc474f85 ("net: add code for TCP fraglist GRO") Reported-by: HanQuan <eilaimemedsnaimel@gmail.com> Reported-by: MingXuan <bwnie0730@outlook.com> Signed-off-by: HanQuan <eilaimemedsnaimel@gmail.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
7 daystcp: restrict SO_ATTACH_FILTER to priv usersEric Dumazet1-0/+5
This patch restricts the use of SO_ATTACH_FILTER (cBPF) on TCP sockets to users with CAP_NET_ADMIN capability. This blocks potential side-channel attack where an unprivileged application attaches a filter to leak TCP sequence/acknowledgment numbers. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Tamir Shahar <tamirthesis@gmail.com> Reported-by: Amit Klein <aksecurity@gmail.com> Cc: Willem de Bruijn <willemb@google.com> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Andrii Nakryiko <andrii@kernel.org> Cc: Martin KaFai Lau <martin.lau@linux.dev> Cc: Eduard Zingerman <eddyz87@gmail.com> Cc: Kumar Kartikeya Dwivedi <memxor@gmail.com> Cc: Song Liu <song@kernel.org> Cc: Yonghong Song <yonghong.song@linux.dev> Cc: Jiri Olsa <jolsa@kernel.org> Cc: John Fastabend <john.fastabend@gmail.com> Cc: Stanislav Fomichev <sdf@fomichev.me> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
10 daysnetlabel: validate unlabeled address and mask attribute lengthsChenguang Zhao1-20/+10
netlbl_unlabel_addrinfo_get() used the address attribute length to determine whether the attribute data could be read as an IPv4 or IPv6 address, but did not independently validate the corresponding mask attribute length. A crafted Generic Netlink request could therefore provide a valid IPv4/IPv6 address attribute with a shorter mask attribute, which would later be read as a full struct in_addr or struct in6_addr. NLA_BINARY policy lengths are maximum lengths by default, so use NLA_POLICY_EXACT_LEN() for the unlabeled IPv4/IPv6 address and mask attributes. This rejects short attributes during policy validation and also exposes the exact length requirements through policy introspection. Fixes: 8cc44579d1bd ("NetLabel: Introduce static network labels for unlabeled connections") Signed-off-by: Chenguang Zhao <zhaochenguang@kylinos.cn> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
10 daysxfrm: espintcp: do not reuse an in-progress partial sendWyatt Feng1-0/+4
espintcp keeps a single in-flight transmit in ctx->partial. Before building a new sk_msg, espintcp_sendmsg() first tries to flush that state through espintcp_push_msgs(). For blocking callers, espintcp_push_msgs() may return success even when the previous partial send is still pending. espintcp_sendmsg() would then reinitialize emsg->skmsg and reuse ctx->partial while the old transfer still owns that state. Do not rebuild the send message when ctx->partial is still in progress. If espintcp_push_msgs() returns with emsg->len still set, fail the new send instead of overwriting the live partial state. This is a memory-safety fix: reusing the live partial-send state can leave a stale offset attached to a new sk_msg and lead to an out-of- bounds read in the send path. tcp_sendmsg_locked() already handles waiting for send buffer memory, so the fix here is just to preserve espintcp's one-message-at-a-time transmit state. Fixes: e27cca96cd68 ("xfrm: add espintcp (RFC 8229)") Cc: stable@kernel.org Reported-by: Yuan Tan <yuantan098@gmail.com> Reported-by: Yifan Wu <yifanwucs@gmail.com> Reported-by: Juefei Pu <tomapufckgml@gmail.com> Reported-by: Zhengchuan Liang <zcliangcn@gmail.com> Reported-by: Xin Liu <bird@lzu.edu.cn> Assisted-by: Codex:GPT-5.4 Signed-off-by: Wyatt Feng <bronzed_45_vested@icloud.com> Signed-off-by: Ren Wei <n05ec@lzu.edu.cn> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
10 daysxfrm: iptfs: fix ABBA deadlock in iptfs_destroy_state()Tristan Madani1-3/+2
iptfs_destroy_state() calls hrtimer_cancel() while holding a spinlock that the timer callback also acquires, leading to an ABBA deadlock on SMP systems. For the output timer (iptfs_timer): - iptfs_destroy_state() holds x->lock, calls hrtimer_cancel() - iptfs_delay_timer() callback takes x->lock For the drop timer (drop_timer): - iptfs_destroy_state() holds drop_lock, calls hrtimer_cancel() - iptfs_drop_timer() callback takes drop_lock Both timers use HRTIMER_MODE_REL_SOFT, so their callbacks run in softirq context. When hrtimer_cancel() is called for a soft timer that is currently executing on another CPU, hrtimer_cancel_wait_running() spins on softirq_expiry_lock -- the same lock held by the softirq running the callback. If the callback is blocked waiting for the spinlock held by the caller of hrtimer_cancel(), a circular dependency forms: CPU 0: holds lock_A -> waits for softirq_expiry_lock CPU 1: holds softirq_expiry_lock -> waits for lock_A Fix by calling hrtimer_cancel() before acquiring the respective locks. hrtimer_cancel() is safe to call without holding any lock and will wait for any in-progress callback to complete. For the output timer, the lock is still acquired afterwards to drain the packet queue. For the drop timer, the lock/unlock pair is removed entirely since it only existed to serialize with the timer callback, which hrtimer_cancel() already guarantees. Found by source code audit. Fixes: 4b3faf610cc6 ("xfrm: iptfs: add new iptfs xfrm mode impl") Cc: Christian Hopps <chopps@labn.net> Cc: Steffen Klassert <steffen.klassert@secunet.com> Cc: stable@vger.kernel.org Signed-off-by: Tristan Madani <tristan@talencesecurity.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
11 daysinet: frags: fix use-after-free caused by the fqdir_pre_exit() flushHyunwoo Kim2-3/+3
On netns teardown, fqdir_pre_exit() walks the fqdir rhashtable and flushes every fragment queue that is not yet complete using inet_frag_queue_flush(). That helper frees all the skbs queued on the fragment queue but does not set INET_FRAG_COMPLETE, and leaves q->fragments_tail and q->last_run_head pointing at the freed skbs. The queue itself stays in the rhashtable. fqdir_pre_exit() first lowers high_thresh to 0 to stop new queue lookups, but it cannot stop a fragment that already obtained the queue through inet_frag_find() earlier and stalled just before taking the queue lock. Once that fragment resumes after the flush and takes the queue lock, it passes the INET_FRAG_COMPLETE check and then dereferences the freed fragments_tail. inet_frag_queue_insert() reads FRAG_CB() and ->len of that pointer and, on the append path, writes ->next_frag, causing a slab use-after-free. IPv6, nf_conntrack_reasm6 and 6lowpan reassembly share the same flush path and are affected as well. Reset rb_fragments, fragments_tail and last_run_head in inet_frag_queue_flush() so a flushed queue no longer points at the freed skbs. A fragment that resumes after the flush and takes the queue lock then finds an empty queue and starts a new run instead of dereferencing the freed fragments_tail. ip_frag_reinit() already performed this reset after its own flush, so drop the now duplicate code there. Cc: stable@vger.kernel.org Fixes: 006a5035b495 ("inet: frags: flush pending skbs in fqdir_pre_exit()") Suggested-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Hyunwoo Kim <imv4bel@gmail.com> Link: https://patch.msgid.link/ah6ukYq5G98LshdA@v4bel Signed-off-by: Jakub Kicinski <kuba@kernel.org>
11 daysudp: clear skb->dev before running a sockmap verdictSechang Lim1-0/+8
On the UDP receive path skb->dev is repurposed as dev_scratch (the truesize/state cache set by udp_set_dev_scratch()), through the union { struct net_device *dev; unsigned long dev_scratch; } in sk_buff. When a UDP socket is in a sockmap, sk_data_ready is sk_psock_verdict_data_ready(), which calls udp_read_skb() -> recv_actor() (sk_psock_verdict_recv) to run the attached SK_SKB verdict program in softirq. If that program calls a socket-lookup helper (bpf_sk_lookup_tcp/udp, bpf_skc_lookup_tcp), bpf_skc_lookup() does: if (skb->dev) caller_net = dev_net(skb->dev); skb->dev still holds the dev_scratch value (a non-NULL integer), so dev_net() dereferences it as a struct net_device * and the kernel takes a general protection fault on a non-canonical address in softirq: Oops: general protection fault, probably for non-canonical address 0x1010000800004a0 CPU: 1 UID: 0 PID: 1406 Comm: syz.2.19 Not tainted 7.1.0-rc6 #1 PREEMPT(full) RIP: 0010:bpf_skc_lookup net/core/filter.c:7033 [inline] RIP: 0010:bpf_sk_lookup+0x45/0x160 net/core/filter.c:7047 Call Trace: <IRQ> bpf_prog_4675cb904b7071f8+0x12e/0x14e bpf_prog_run_pin_on_cpu+0xc6/0x1f0 sk_psock_verdict_recv+0x1ba/0x350 udp_read_skb+0x31a/0x370 sk_psock_verdict_data_ready+0x2e3/0x600 __udp_enqueue_schedule_skb+0x4c8/0x650 udpv6_queue_rcv_one_skb+0x3ec/0x740 udp6_unicast_rcv_skb+0x11d/0x140 ip6_protocol_deliver_rcu+0x61e/0x950 ip6_input_finish+0xa9/0x150 NF_HOOK+0x286/0x2f0 ip6_input+0x117/0x220 NF_HOOK+0x286/0x2f0 __netif_receive_skb+0x85/0x200 process_backlog+0x374/0x9a0 __napi_poll+0x4f/0x1c0 net_rx_action+0x3b0/0x770 handle_softirqs+0x15a/0x460 do_softirq+0x57/0x80 </IRQ> The rmem charge that dev_scratch accounted for is released by skb_recv_udp() on dequeue, just above, so the scratch is dead by the time recv_actor() runs. Clear skb->dev so bpf_skc_lookup() falls back to sock_net(skb->sk), which skb_set_owner_sk_safe() set just above. Fixes: 965b57b469a5 ("net: Introduce a new proto_ops ->read_skb()") Cc: stable@vger.kernel.org Signed-off-by: Sechang Lim <rhkrqnwk98@gmail.com> Reviewed-by: Jiayuan Chen <jiayuan.chen@linux.dev> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260603162737.697215-1-rhkrqnwk98@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
11 dayssctp: purge outqueue on stale COOKIE-ECHO handlingXin Long1-5/+1
sctp_stream_update() is only invoked when the association is moved into COOKIE_WAIT during association setup/reconfiguration. In this path, the outbound stream scheduler state (stream->out_curr) is expected to be clean, since no user data should have been transmitted yet unless the state machine has already partially progressed. However, a corner case exists in sctp_sf_do_5_2_6_stale(): when a Stale Cookie ERROR is received, the association is rolled back from COOKIE_ECHOED to COOKIE_WAIT. In this scenario, user data may already have been queued and even bundled with the COOKIE-ECHO chunk. During the rollback, sctp_stream_update() frees the old stream table and installs a new one, but it does not invalidate stream->out_curr. As a result, out_curr may still point to a freed sctp_stream_out entry from the previous stream state. Later, SCTP scheduler dequeue paths (FCFS, RR, PRIO, etc.) rely on stream->out_curr->ext, which can lead to use-after-free once the old stream state has been released via sctp_stream_free(). This results in crashes such as (reported by Yuqi): BUG: KASAN: slab-use-after-free in sctp_sched_fcfs_dequeue+0x13a/0x140 Read of size 8 at addr ff1100004d4d3208 by task mini_poc/9312 CPU: 1 UID: 1001 PID: 9312 Comm: mini_poc Not tainted 7.1.0-rc1-00305-gbd3a4795d574 #5 PREEMPT(full) sctp_sched_fcfs_dequeue+0x13a/0x140 sctp_outq_flush+0x1603/0x33e0 sctp_do_sm+0x31c9/0x5d30 sctp_assoc_bh_rcv+0x392/0x6f0 sctp_inq_push+0x1db/0x270 sctp_rcv+0x138d/0x3c10 Fix this by fully purging the association outqueue when handling the Stale Cookie case. This ensures all pending transmit and retransmit state is dropped, and any scheduler cached pointers are invalidated, making it safe to rebuild stream state during COOKIE_WAIT restart. Updating only stream->out_curr would be insufficient, since queued and retransmittable data would still reference the old stream state and trigger later use-after-free in dequeue paths. Fixes: 5bbbbe32a431 ("sctp: introduce stream scheduler foundations") Reported-by: Yuan Tan <yuantan098@gmail.com> Reported-by: Yifan Wu <yifanwucs@gmail.com> Reported-by: Juefei Pu <tomapufckgml@gmail.com> Reported-by: Zhengchuan Liang <zcliangcn@gmail.com> Reported-by: Xin Liu <bird@lzu.edu.cn> Reported-by: Yuqi Xu <xuyq21@lenovo.com> Reported-by: Ren Wei <n05ec@lzu.edu.cn> Signed-off-by: Xin Long <lucien.xin@gmail.com> Link: https://patch.msgid.link/94318159b9052907a6cbb7256aee8b5f8dfbfccb.1780510304.git.lucien.xin@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
11 daysnet/802/mrp: fix vector attribute parsing in mrp_pdu_parse_vecattrYizhou Zhao1-0/+9
In mrp_pdu_parse_vecattr(), vector attribute events are encoded three per byte and valen tracks the number of events left to process. The parser decrements valen after processing the first and second events from each event byte, but not after processing the third one. When valen is exactly a multiple of three, the loop continues after the last valid event and consumes the next byte as a new event byte, applying a spurious event to the MRP applicant state. Additionally, when valen is zero the parser unconditionally consumes attrlen bytes as FirstValue and advances the offset, even though per IEEE 802.1ak a VectorAttribute with only a LeaveAllEvent has valen of zero and no FirstValue or Vector fields. This corrupts the offset for subsequent PDU parsing. Also, when valen exceeds three the loop crosses byte boundaries but the attribute value is not incremented between the last event of one byte and the first event of the next. This causes the first event of the next byte to use the same attribute value as the third event rather than the next consecutive value. Decrement valen after processing the third event, skip FirstValue consumption when valen is zero, and increment the attribute value at the end of each loop iteration. Fixes: febf018d2234 ("net/802: Implement Multiple Registration Protocol (MRP)") Reported-by: Yizhou Zhao <zhaoyz24@mails.tsinghua.edu.cn> Reported-by: Yuxiang Yang <yangyx22@mails.tsinghua.edu.cn> Reported-by: Ao Wang <wangao@seu.edu.cn> Reported-by: Xuewei Feng <fengxw06@126.com> Reported-by: Qi Li <qli01@tsinghua.edu.cn> Reported-by: Ke Xu <xuke@tsinghua.edu.cn> Signed-off-by: Yizhou Zhao <zhaoyz24@mails.tsinghua.edu.cn> Link: https://patch.msgid.link/20260603060016.21522-1-zhaoyz24@mails.tsinghua.edu.cn Signed-off-by: Jakub Kicinski <kuba@kernel.org>
11 daysieee802154: 6lowpan: only accept IPv6 packets in lowpan_xmit()Eric Dumazet1-0/+5
The aoe driver (or similar) generates a non-IPv6 packet (e.g., ETH_P_AOE) and queues it for transmission via dev_queue_xmit() on a 6LoWPAN interface (configured by the user or test case). Since the packet is not IPv6, the 6LoWPAN header_ops->create function (lowpan_header_create or header_create) returns early without initializing the lowpan_addr_info structure in the skb headroom. In the transmit function (lowpan_xmit), the driver calls lowpan_header (or setup_header) which unconditionally copies and uses the lowpan_addr_info from the headroom, which contains uninitialized data. Fix this by dropping non IPv6 packets. A similar fix is needed in net/bluetooth/6lowpan.c bt_xmit(). Fixes: 4dc315e267fe ("ieee802154: 6lowpan: move transmit functionality") Reported-by: syzbot+f13c19f75e1097abd116@syzkaller.appspotmail.com Closes: https://lore.kernel.org/netdev/6a1fd763.278b5b03.2bcf39.0049.GAE@google.com/T/#u Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Miquel Raynal <miquel.raynal@bootlin.com> Link: https://patch.msgid.link/20260603072955.4032221-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
11 daysipv6: mcast: Fix use-after-free when processing MLD queriesIdo Schimmel1-4/+4
When processing an MLD query, a pointer to the multicast group address is retrieved when initially parsing the packet. This pointer is later dereferenced without being reloaded despite the fact that the skb header might have been reallocated following the pskb_may_pull() calls, leading to a use-after-free [1]. Fix by copying the multicast group address when the packet is initially parsed. [1] BUG: KASAN: slab-use-after-free in __mld_query_work (net/ipv6/mcast.c:1512) Read of size 8 at addr ffff8881154b8e90 by task kworker/4:1/118 Workqueue: mld mld_query_work Call Trace: <TASK> dump_stack_lvl (lib/dump_stack.c:94 lib/dump_stack.c:120) print_address_description.constprop.0 (mm/kasan/report.c:378) print_report (mm/kasan/report.c:482) kasan_report (mm/kasan/report.c:595) __mld_query_work (net/ipv6/mcast.c:1512) mld_query_work (net/ipv6/mcast.c:1563) process_one_work (kernel/workqueue.c:3314) worker_thread (kernel/workqueue.c:3397 kernel/workqueue.c:3478) kthread (kernel/kthread.c:436) ret_from_fork (arch/x86/kernel/process.c:158) ret_from_fork_asm (arch/x86/entry/entry_64.S:245) </TASK> [...] Freed by task 118: kasan_save_stack (mm/kasan/common.c:57) kasan_save_track (mm/kasan/common.c:78) kasan_save_free_info (mm/kasan/generic.c:584) __kasan_slab_free (mm/kasan/common.c:253 mm/kasan/common.c:285) kfree (./include/linux/kasan.h:235 mm/slub.c:2689 mm/slub.c:6251 mm/slub.c:6566) pskb_expand_head (net/core/skbuff.c:2335) __pskb_pull_tail (net/core/skbuff.c:2878 (discriminator 4)) __mld_query_work (net/ipv6/mcast.c:1495 (discriminator 1)) mld_query_work (net/ipv6/mcast.c:1563) process_one_work (kernel/workqueue.c:3314) worker_thread (kernel/workqueue.c:3397 kernel/workqueue.c:3478) kthread (kernel/kthread.c:436) ret_from_fork (arch/x86/kernel/process.c:158) ret_from_fork_asm (arch/x86/entry/entry_64.S:245) Fixes: 97300b5fdfe2 ("[MCAST] IPv6: Check packet size when process Multicast") Reported-by: Leo Lin <leo@depthfirst.com> Reviewed-by: David Ahern <dahern@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Jiayuan Chen <jiayuan.chen@linux.dev> Link: https://patch.msgid.link/20260603101811.612594-1-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
11 dayssctp: validate cached peer INIT chunk length in COOKIE_ECHO processingXin Long1-0/+5
When a listening SCTP server processes a COOKIE_ECHO chunk, the cached peer INIT chunk embedded after the cookie is parsed and its parameters are later walked by sctp_process_init() using sctp_walk_params(). However, the chunk header length of this cached INIT chunk was not validated against the remaining buffer in the COOKIE_ECHO payload. If the length field is inflated, the parameter walk can run beyond the actual received data, leading to out-of-bounds reads and potential memory corruption during later parameter handling (e.g. STATE_COOKIE processing and kmemdup() copies). Add a bounds check in sctp_unpack_cookie() to ensure the cached INIT chunk length does not exceed the available data in the COOKIE_ECHO buffer before it is used. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Reported-by: Brian Geffon <bgeffon@google.com> Signed-off-by: Xin Long <lucien.xin@gmail.com> Link: https://patch.msgid.link/eb60825fa22d6f9e663c7d4dbb69f397b5d34d42.1780362366.git.lucien.xin@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
11 daysnet/sched: fix pedit partial COW leading to page cache corruptionRajat Gupta1-36/+41
tcf_pedit_act() computes the COW range for skb_ensure_writable() once before the key loop using tcfp_off_max_hint, but the hint does not account for the runtime header offset added by typed keys. This can leave part of the write region un-COW'd. Fix by moving skb_ensure_writable() inside the per-key loop where the actual write offset is known, and add overflow checking on the offset arithmetic. For negative offsets (e.g. Ethernet header edits at ingress), use skb_cow() to COW the headroom instead. Guard offset_valid() against INT_MIN, where negation is undefined. Fixes: 8b796475fd78 ("net/sched: act_pedit: really ensure the skb is writable") Reported-by: Yiming Qian <yimingqian591@gmail.com> Reported-by: Keenan Dong <keenanat2000@gmail.com> Reported-by: Han Guidong <2045gemini@gmail.com> Reported-by: Zhang Cen <rollkingzzc@gmail.com> Reviewed-by: Han Guidong <2045gemini@gmail.com> Tested-by: Han Guidong <2045gemini@gmail.com> Reviewed-by: Davide Caratti <dcaratti@redhat.com> Tested-by: Davide Caratti <dcaratti@redhat.com> Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Tested-by: Toke Høiland-Jørgensen <toke@redhat.com> Reviewed-by: Victor Nogueira <victor@mojatatu.com> Tested-by: Victor Nogueira <victor@mojatatu.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Rajat Gupta <rajat.gupta@oss.qualcomm.com> Link: https://patch.msgid.link/20260531123221.48732-1-jhs@mojatatu.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
11 daysvsock/vmci: fix sk_ack_backlog leak on failed handshakeRaf Dickson1-1/+3
When vmci_transport_recv_connecting_server() returns an error, vmci_transport_recv_listen() calls vsock_remove_pending() but never calls sk_acceptq_removed(). This leaves sk_ack_backlog incremented permanently. Repeated handshake failures (malformed packets, queue pair alloc failure, event subscribe failure) cause sk_ack_backlog to climb toward sk_max_ack_backlog. Once it reaches the limit the listener permanently refuses all new connections with -ECONNREFUSED, a silent denial of service requiring a process restart to recover. The two existing sk_acceptq_removed() calls in af_vsock.c do not cover this path: line 764 checks vsock_is_pending() which returns false after vsock_remove_pending(), and line 1889 is only reached on successful accept(). Fix by balancing sk_acceptq_added() with sk_acceptq_removed() on the error path. Fixes: d021c344051a ("VSOCK: Introduce VM Sockets") Cc: stable@vger.kernel.org Signed-off-by: Raf Dickson <rafdog35@gmail.com> Acked-by: Stefano Garzarella <sgarzare@redhat.com> Link: https://patch.msgid.link/20260526104356.469928-1-rafdog35@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
11 daysxfrm: policy: fix use-after-free on inexact bin in xfrm_policy_bysel_ctx()Sanghyun Park1-11/+2
Fix the race by pruning the bin while still holding xfrm_policy_lock, before dropping it. Use __xfrm_policy_inexact_prune_bin() directly since the lock is already held. The wrapper xfrm_policy_inexact_prune_bin() becomes unused and is removed. Race: CPU0 (XFRM_MSG_DELPOLICY) CPU1 (XFRM_MSG_NEWSPDINFO) ========================== ========================== xfrm_policy_bysel_ctx(): spin_lock_bh(xfrm_policy_lock) bin = xfrm_policy_inexact_lookup() __xfrm_policy_unlink(pol) spin_unlock_bh(xfrm_policy_lock) xfrm_policy_kill(ret) // wide window, lock not held xfrm_hash_rebuild(): spin_lock_bh(xfrm_policy_lock) __xfrm_policy_inexact_flush(): kfree_rcu(bin) // bin freed spin_unlock_bh(xfrm_policy_lock) xfrm_policy_inexact_prune_bin(bin) // UAF: bin is freed Fixes: 6be3b0db6db8 ("xfrm: policy: add inexact policy search tree infrastructure") Signed-off-by: Sanghyun Park <sanghyun.park.cnu@gmail.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
12 daysMerge tag 'for-net-2026-06-03' of ↵Jakub Kicinski9-75/+232
git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth Luiz Augusto von Dentz says: ==================== bluetooth pull request for net: - hci_core: fix memory leak in error path of hci_alloc_dev() - hci_sync: reject oversized Broadcast Announcement prepend - MGMT: Fix backward compatibility with userspace - MGMT: validate advertising TLV before type checks - L2CAP: reject BR/EDR signaling packets over MTUsig - RFCOMM: validate skb length in MCC handlers - RFCOMM: hold listener socket in rfcomm_connect_ind() - ISO: Fix not releasing hdev reference on iso_conn_big_sync - ISO: Fix a use-after-free of the hci_conn pointer - ISO: Fix data-race on iso_pi fields in hci_get_route calls - SCO: Fix data-race on sco_pi fields in sco_connect - BNEP: reject short frames before parsing * tag 'for-net-2026-06-03' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth: Bluetooth: MGMT: Fix backward compatibility with userspace Bluetooth: SCO: Fix data-race on sco_pi fields in sco_connect Bluetooth: ISO: Fix data-race on iso_pi fields in hci_get_route calls Bluetooth: ISO: Fix a use-after-free of the hci_conn pointer Bluetooth: ISO: Fix not releasing hdev reference on iso_conn_big_sync Bluetooth: fix memory leak in error path of hci_alloc_dev() Bluetooth: bnep: reject short frames before parsing Bluetooth: hci_sync: reject oversized Broadcast Announcement prepend Bluetooth: L2CAP: reject BR/EDR signaling packets over MTUsig Bluetooth: RFCOMM: validate skb length in MCC handlers Bluetooth: MGMT: validate advertising TLV before type checks Bluetooth: RFCOMM: hold listener socket in rfcomm_connect_ind() ==================== Link: https://patch.msgid.link/20260603162714.342496-1-luiz.dentz@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
12 daysMerge tag 'wireless-2026-06-03' of ↵Jakub Kicinski3-2/+20
https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless Johannes Berg says: ==================== Things are finally quieting down: - iwlwifi: - FW reset handshake removal for older devices - NIC access fix in fast resume - avoid too large command for some BIOSes - fix TX power constraints in AP mode - cfg80211: - fix netlink parse overflow - fix potential 6 GHz scan memory leak - enforce HE/EHT consistency to avoid mac80211 crash - mac80211: guard radiotap antenna parsing * tag 'wireless-2026-06-03' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless: wifi: cfg80211: enforce HE/EHT cap/oper consistency wifi: fix leak if split 6 GHz scanning fails wifi: mac80211: limit injected antenna index in ieee80211_parse_tx_radiotap wifi: nl80211: reject oversized EMA RNR lists wifi: iwlwifi: pcie: simplify the resume flow if fast resume is not used wifi: iwlwifi: mvm: avoid oversized UATS command copy wifi: iwlwifi: mld: send tx power constraints before link activation wifi: iwlwifi: mvm: don't support the reset handshake for old firmwares ==================== Link: https://patch.msgid.link/20260603113208.171874-3-johannes@sipsolutions.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
12 daysmptcp: add-addr: always drop other suboptionsMatthieu Baerts (NGI0)3-38/+14
When an ADD_ADDR needs to be sent, it could be prepared if there is enough remaining space and even if the packet is not a pure ACK. But it would be dropped soon after. Indeed, in mptcp_pm_add_addr_signal(), there is enough space to fit a DSS of 20 octets and an ADD_ADDR echo containing an IPv4 address on 8 octets for example. In this case, the packet would be prepared, the MPTCP_ADD_ADDR_ECHO bit would be removed from pm->addr_signal, but the option would be silently dropped in mptcp_established_options_add_addr() not to override DSS info in the union from 'struct mptcp_out_options', and also because mptcp_write_options() will enforce mutually exclusion with DSS. Instead, don't even try to send an ADD_ADDR if it is not a pure ACK. Retry for each new packet until a pure-ACK is emitted. That's fine to do that, because each time an ADD_ADDR (echo) is scheduled, a pure ACK is queued. This also simplifies the code, and the skb checks can be done earlier, before the lock. Note: also, since commit 6d0060f600ad ("mptcp: Write MPTCP DSS headers to outgoing data packets"), opts->ahmac would not have been set to 0 when other suboptions were not dropped, and when sending an ADD_ADDR echo. That would have resulted in sending an ADD_ADDR using garbage info, where there was not enough space, instead of an echo one without the ADD_ADDR HMAC. Fixes: 1bff1e43a30e ("mptcp: optimize out option generation") Cc: stable@vger.kernel.org Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20260602-net-mptcp-misc-fixes-7-1-rc7-v2-11-856831229976@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
12 daysmptcp: fix uninit-value in mptcp_established_optionsPaolo Abeni1-1/+5
syzbot reported the following uninit splat: BUG: KMSAN: uninit-value in mptcp_write_data_fin net/mptcp/options.c:542 [inline] BUG: KMSAN: uninit-value in mptcp_established_options_dss net/mptcp/options.c:590 [inline] BUG: KMSAN: uninit-value in mptcp_established_options+0x112f/0x3530 net/mptcp/options.c:874 mptcp_write_data_fin net/mptcp/options.c:542 [inline] mptcp_established_options_dss net/mptcp/options.c:590 [inline] mptcp_established_options+0x112f/0x3530 net/mptcp/options.c:874 tcp_established_options+0x312/0xcc0 net/ipv4/tcp_output.c:1192 __tcp_transmit_skb+0x5dc/0x5fe0 net/ipv4/tcp_output.c:1575 __tcp_send_ack+0x967/0xad0 net/ipv4/tcp_output.c:4499 tcp_send_ack+0x3d/0x60 net/ipv4/tcp_output.c:4505 mptcp_subflow_shutdown+0x164/0x690 net/mptcp/protocol.c:3137 mptcp_check_send_data_fin+0x31b/0x3d0 net/mptcp/protocol.c:3218 __mptcp_wr_shutdown net/mptcp/protocol.c:3234 [inline] __mptcp_close+0x860/0x1360 net/mptcp/protocol.c:3313 mptcp_close+0x42/0x260 net/mptcp/protocol.c:3367 inet_release+0x1ee/0x2a0 net/ipv4/af_inet.c:442 __sock_release net/socket.c:722 [inline] sock_close+0xd6/0x2f0 net/socket.c:1514 __fput+0x60e/0x1010 fs/file_table.c:510 ____fput+0x25/0x30 fs/file_table.c:538 task_work_run+0x208/0x2b0 kernel/task_work.c:233 resume_user_mode_work include/linux/resume_user_mode.h:50 [inline] __exit_to_user_mode_loop kernel/entry/common.c:67 [inline] exit_to_user_mode_loop+0x306/0x1b60 kernel/entry/common.c:98 __exit_to_user_mode_prepare include/linux/irq-entry-common.h:207 [inline] syscall_exit_to_user_mode_prepare include/linux/irq-entry-common.h:238 [inline] syscall_exit_to_user_mode include/linux/entry-common.h:318 [inline] __do_fast_syscall_32+0x2c7/0x460 arch/x86/entry/syscall_32.c:310 do_fast_syscall_32+0x37/0x80 arch/x86/entry/syscall_32.c:332 do_SYSENTER_32+0x1f/0x30 arch/x86/entry/syscall_32.c:370 entry_SYSENTER_compat_after_hwframe+0x84/0x8e Local variable opts created at: __tcp_transmit_skb+0x4d/0x5fe0 net/ipv4/tcp_output.c:1536 __tcp_send_ack+0x967/0xad0 net/ipv4/tcp_output.c:4499 The output path currently omits initializing the mptcp extension `use_map` flag in a few corner cases. Address the issue always zeroing all the extensions flags before eventually initializing the individual bits. To that extent, introduce and use a struct_group to avoid multiple bitwise operations. Fixes: cfcceb7a39fc ("tcp: shrink per-packet memset in __tcp_transmit_skb()") Cc: stable@vger.kernel.org Reported-by: syzbot+ff020673c5e3d94d9478@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=ff020673c5e3d94d9478 Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20260602-net-mptcp-misc-fixes-7-1-rc7-v2-10-856831229976@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
12 daysmptcp: check desc->count in read_sockGang Yan1-0/+2
__tcp_read_sock() checks desc->count after each skb is consumed and breaks the loop when it reaches 0. The MPTCP variant lacks this check. This is a functional bug, other subsystems also rely on this check: TLS strparser sets desc->count to 0 once a full TLS record is assembled and depends on this break to stop reading. Add the same desc->count check to __mptcp_read_sock(), mirroring __tcp_read_sock(). Fixes: 250d9766a984 ("mptcp: implement .read_sock") Cc: stable@vger.kernel.org Co-developed-by: Geliang Tang <geliang@kernel.org> Signed-off-by: Geliang Tang <geliang@kernel.org> Signed-off-by: Gang Yan <yangang@kylinos.cn> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20260602-net-mptcp-misc-fixes-7-1-rc7-v2-9-856831229976@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
12 daysmptcp: sockopt: set sockopt on all subflowsMatthieu Baerts (NGI0)1-3/+4
The mptcp_setsockopt_all_sf(), currently used only with TCP_MAXSEG, stopped when one subflow returned an error. Even if it is not wrong, this is different from the other helpers trying to set the option on all subflows, and then returning an error if at least one of them had an issue. Follow this behaviour, for a question of uniformity. Fixes: 51c5fd09e1b4 ("mptcp: add TCP_MAXSEG sockopt support") Cc: stable@vger.kernel.org Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20260602-net-mptcp-misc-fixes-7-1-rc7-v2-8-856831229976@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
12 daysmptcp: sockopt: check timestamping ret valueMatthieu Baerts (NGI0)1-2/+6
sock_set_timestamping() can fail for different reasons. The returned value should then be checked. If sock_set_timestamping() fails for at least one subflow, the first error is now reported to the userspace, similar to what is done with other socket options. Fixes: 9061f24bf82e ("mptcp: sockopt: propagate timestamp request to subflows") Cc: stable@vger.kernel.org Reported-by: Willem de Bruijn <willemdebruijn.kernel@gmail.com> Closes: https://lore.kernel.org/willemdebruijn.kernel.178a41a53d041@gmail.com Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20260602-net-mptcp-misc-fixes-7-1-rc7-v2-7-856831229976@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
12 daysmptcp: pm: fix extra_subflows underflow on userspace PM subflow creationTao Cui1-6/+8
The userspace PM increments extra_subflows after __mptcp_subflow_connect() succeeds, but __mptcp_subflow_connect() calls mptcp_pm_close_subflow() on failure to roll back the pre-increment done by the kernel PM's fill_*() helpers. Because the userspace PM hasn't incremented yet at that point, this decrement is spurious and causes extra_subflows to underflow. Fix it by aligning the userspace PM with the kernel PM: increment extra_subflows before calling __mptcp_subflow_connect(), so the existing error path in subflow.c correctly rolls it back on failure. Also simplify the error handling by taking pm.lock only when needed for cleanup. Fixes: 77e4b94a3de6 ("mptcp: update userspace pm infos") Cc: stable@vger.kernel.org Signed-off-by: Tao Cui <cuitao@kylinos.cn> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20260602-net-mptcp-misc-fixes-7-1-rc7-v2-5-856831229976@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
12 daysmptcp: allow subflow rcv wnd to shrinkPaolo Abeni1-0/+7
In MPTCP connection, the `window` field in the TCP header refers to the MPTCP-level rcv_nxt and it's right edge should not move backward. Such constraint is enforced at DSS option generation time. At the same time, the TCP stack ensures independently that the TCP-level rcv wnd right's edge does not move backward. That in turn causes artificial inflating of the MPTCP rcv window when the incoming data is acked at the TCP level and is OoO in the MPTCP sequence space (or lands in the backlog). As a consequence, the incoming traffic can exceed the receiver rcvbuf size even when the sender is not misbehaving. Prevent such scenario forcibly allowing the TCP subflow to shrink the TCP-level rcv wnd regardless of the current netns setting. Fixes: f3589be0c420 ("mptcp: never shrink offered window") Cc: stable@vger.kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20260602-net-mptcp-misc-fixes-7-1-rc7-v2-4-856831229976@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>