kernel/linux.git - Linux kernel stable tree (mirror)

Age	Commit message	Author	Files	Lines
5 days	Merge tag 'nf-26-06-10' of ↵	Paolo Abeni	2	-10/+7
	git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf Pablo Neira Ayuso says: ==================== Netfilter fixes for net The following patchset contains Netfilter fixes for net: 1) Revalidate bridge ports, add missing NULL checks to fetch the bridge device by the port. From Florian Westphal. 2) Fix netdevice refcount leak in the error path of nft_fwd hardware offload function, also from Florian. 3) Unregister helper expectfn callback on conntrack helper module removal, otherwise dangling pointer remains in place, from Weiming Shi. 4) Fix possible pointer infoleak in getsockopt() IPT_SO_GET_ENTRIES, From Kyle Zeng. 5) Validate that device MAC header is present before nf_syslog accesses it. From Xiang Mei. 6-8) Three patches to address a possible infoleak of stale stack data in three nf_tables expressions, due to mismatch in the _init() and _eval() function which is possible since 14fb07130c7d. From Davide Ornaghi and Florian Westphal. netfilter pull request 26-06-10 * tag 'nf-26-06-10' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf: netfilter: nft_meta_bridge: fix stale stack leak via IIFHWADDR register netfilter: nft_fib: fix stale stack leak via the OIFNAME register netfilter: nft_exthdr: fix register tracking for F_PRESENT flag netfilter: nf_log: validate MAC header was set before dumping it netfilter: x_tables: avoid leaking percpu counter pointers netfilter: nf_conntrack: destroy stale expectfn expectations on unregister netfilter: nf_tables_offload: drop device refcount on error netfilter: revalidate bridge ports ==================== Link: https://patch.msgid.link/20260610161629.214092-1-pablo@netfilter.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
5 days	Merge tag 'ipsec-2026-06-10' of ↵	Paolo Abeni	1	-6/+11
	git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec Steffen Klassert says: ==================== pull request (net): ipsec 2026-06-10 1) xfrm: iptfs: preserve shared-frag marker in iptfs_consume_frags() Propagate SKBFL_SHARED_FRAG when paged fragments are moved between skbs so ESP can decide whether in-place crypto is safe. 2) xfrm: iptfs: fix use-after-free on first_skb in __input_process_payload Replace the unlocked read of xtfs->ra_newskb with a local flag so a concurrent reassembly can no longer free first_skb between spin_unlock and the post-loop check. 3) xfrm: policy: fix use-after-free on inexact bin in xfrm_policy_bysel_ctx() Prune the inexact bin under xfrm_policy_lock so a concurrent xfrm_hash_rebuild() can no longer free it before xfrm_policy_kill() dereferences it. 4) xfrm: iptfs: fix ABBA deadlock in iptfs_destroy_state() Move hrtimer_cancel() for the output and drop timers ahead of their spinlocks, breaking the softirq/lock cycle that could deadlock against the timer callbacks on SMP. 5) xfrm: espintcp: do not reuse an in-progress partial send Fail a new send when espintcp_push_msgs() returns with emsg->len still set, so a blocking caller can no longer overwrite ctx->partial while a previous transfer still owns it. 6) esp: fix page frag reference leak on skb_to_sgvec failure Add a flag to esp_ssg_unref() to unconditionally unref the source scatterlist, releasing the old page references that are otherwise leaked when the second skb_to_sgvec() in esp_output_tail() fails. Please pull or let me know if there are problems. 
ipsec-2026-06-10 * tag 'ipsec-2026-06-10' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec: esp: fix page frag reference leak on skb_to_sgvec failure xfrm: espintcp: do not reuse an in-progress partial send xfrm: iptfs: fix ABBA deadlock in iptfs_destroy_state() xfrm: policy: fix use-after-free on inexact bin in xfrm_policy_bysel_ctx() xfrm: iptfs: fix use-after-free on first_skb in __input_process_payload xfrm: iptfs: preserve shared-frag marker in iptfs_consume_frags() ==================== Link: https://patch.msgid.link/20260610140800.2562818-1-steffen.klassert@secunet.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
5 days	ipv6: Fix a potential NPD in cleanup_prefix_route()	Ido Schimmel	1	-2/+4
	addrconf_get_prefix_route() can return the fib6_null_entry sentinel entry which has a NULL fib6_table pointer. Therefore, before setting the route's expiration time, check that we are not working with this entry, as otherwise a NPD will be triggered [1]. Note that the other callers of addrconf_get_prefix_route() are not susceptible to this bug: 1. addrconf_prefix_rcv(): Requests a route with the 'RTF_ADDRCONF \| RTF_PREFIX_RT' flags which are not set on fib6_null_entry. 2. modify_prefix_route(): Fixed by commit a747e02430df ("ipv6: avoid possible NULL deref in modify_prefix_route()"). 3. __ipv6_ifa_notify(): Calls ip6_del_rt() which specifically checks for fib6_null_entry and returns an error. [1] Oops: general protection fault, probably for non-canonical address 0xdffffc0000000006: 0000 [#1] SMP KASAN KASAN: null-ptr-deref in range [0x0000000000000030-0x0000000000000037] [...] Call Trace: <TASK> __kasan_check_byte (mm/kasan/common.c:573) lock_acquire.part.0 (kernel/locking/lockdep.c:5842 (discriminator 1)) _raw_spin_lock_bh (kernel/locking/spinlock.c:182 (discriminator 1)) cleanup_prefix_route (net/ipv6/addrconf.c:1280) ipv6_del_addr (net/ipv6/addrconf.c:1342) inet6_addr_del.isra.0 (net/ipv6/addrconf.c:3119) inet6_rtm_deladdr (net/ipv6/addrconf.c:4812) rtnetlink_rcv_msg (net/core/rtnetlink.c:6997) netlink_rcv_skb (net/netlink/af_netlink.c:2555) netlink_unicast (net/netlink/af_netlink.c:1344) netlink_sendmsg (net/netlink/af_netlink.c:1899) __sock_sendmsg (net/socket.c:802 (discriminator 4)) ____sys_sendmsg (net/socket.c:2698) ___sys_sendmsg (net/socket.c:2752) __sys_sendmsg (net/socket.c:2784) do_syscall_64 (arch/x86/entry/syscall_64.c:63 arch/x86/entry/syscall_64.c:94) entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:121) Fixes: 5eb902b8e719 ("net/ipv6: Remove expired routes with a separated list of routes.") Reported-by: Ji'an Zhou <eilaimemedsnaimel@gmail.com> Reviewed-by: David Ahern <dahern@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> 
Link: https://patch.msgid.link/20260609145448.768318-1-idosch@nvidia.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
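The sentinel check described in this commit can be sketched in userspace C. This is a minimal illustration, not the kernel code: `demo_table`, `demo_route`, `demo_null_entry`, `demo_lookup()` and `demo_set_expiry()` are hypothetical stand-ins for `fib6_table`, `fib6_info`, `fib6_null_entry`, `addrconf_get_prefix_route()` and the expiry update in `cleanup_prefix_route()`.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel's fib6_table and fib6_info. */
struct demo_table { int lock_count; };
struct demo_route { struct demo_table *table; unsigned long expires; };

/* Sentinel entry: a valid object whose table pointer is NULL. */
struct demo_route demo_null_entry = { NULL, 0 };

/* A lookup that returns the sentinel instead of NULL on a miss. */
struct demo_route *demo_lookup(struct demo_route *hit)
{
    return hit ? hit : &demo_null_entry;
}

/* Returns 1 if the expiry was set, 0 if the sentinel was skipped. */
int demo_set_expiry(struct demo_route *rt, unsigned long when)
{
    if (rt == &demo_null_entry)   /* guard: never touch the sentinel's table */
        return 0;
    rt->table->lock_count++;      /* stands in for taking the table lock */
    rt->expires = when;
    return 1;
}
```

Without the guard, the `rt->table->lock_count++` line would dereference the sentinel's NULL table pointer, which is the shape of the fault in the trace above.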
6 days	netfilter: nft_fib: fix stale stack leak via the OIFNAME register	Davide Ornaghi	1	-1/+1
	For NFT_FIB_RESULT_OIFNAME the destination register is declared with len = IFNAMSIZ (four 32-bit registers), but on the lookup-fail, RTN_LOCAL and oif-mismatch paths nft_fib{4,6}_eval() only writes one register via "dest = 0". The remaining three registers are left as whatever was on the stack in nft_do_chain()'s struct nft_regs, and a downstream expression that loads the register span can leak that uninitialised kernel stack to userspace. The NFTA_FIB_F_PRESENT existence check has the same shape: it is only meaningful for NFT_FIB_RESULT_OIF, yet it was accepted for any result type while the eval stores a single byte via nft_reg_store8(), leaving the rest of the declared span stale. Fix both: - replace the bare "dest = 0" in the eval with nft_fib_store_result(), which strscpy_pad()s the whole IFNAMSIZ for OIFNAME (and is already used on the other early-return path), and - restrict NFTA_FIB_F_PRESENT to NFT_FIB_RESULT_OIF and declare its destination as a single u8, so the marked span matches the one byte the eval writes. Fixes: f6d0cbcf09c5 ("netfilter: nf_tables: add fib expression") Suggested-by: Florian Westphal <fw@strlen.de> Cc: stable@vger.kernel.org Signed-off-by: Davide Ornaghi <d.ornaghi97@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
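The difference between writing one register and writing the whole declared span can be shown with a small userspace sketch. It assumes a four-register span of 32-bit words, as the commit describes; `demo_store_first_only()` and `demo_store_padded()` are hypothetical names that only mimic the buggy "dest = 0" store and a zero-padding store, not the actual nft_fib helpers.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define DEMO_NAME_LEN 16   /* IFNAMSIZ on Linux: four 32-bit registers */

/* Buggy shape: only the first register of the declared span is written. */
void demo_store_first_only(uint32_t *dest)
{
    dest[0] = 0;
}

/* Fixed shape: the whole span is written, zero-padded after the name. */
void demo_store_padded(uint32_t *dest, const char *name)
{
    char buf[DEMO_NAME_LEN] = { 0 };

    strncpy(buf, name, DEMO_NAME_LEN - 1);   /* buf stays NUL-terminated */
    memcpy(dest, buf, DEMO_NAME_LEN);
}
```

If the registers are pre-filled with a marker pattern to stand in for stale stack contents, the first function leaves three of the four words untouched, while the second overwrites all of them.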
6 days	netfilter: x_tables: avoid leaking percpu counter pointers	Kyle Zeng	1	-9/+6
	The native and compat get-entries paths copy the fixed rule entry header from the kernelized rule blob to userspace before overwriting the entry's counter fields with a sanitized counter snapshot. On SMP kernels, entry->counters.pcnt contains the percpu allocation address used by x_tables rule counters. A caller can provide a userspace buffer that faults during the initial fixed-header copy after pcnt has been copied but before the later sanitized counter copy runs. The syscall then returns -EFAULT while leaving the raw percpu pointer in userspace. Copy only the fixed entry prefix before counters from the kernelized rule blob, then copy the sanitized counter snapshot into the counter field. Apply this ordering to the IPv4, IPv6, and ARP native and compat get-entries implementations so a fault cannot expose the internal percpu counter pointer. Fixes: 71ae0dff02d7 ("netfilter: xtables: use percpu rule counters") Signed-off-by: Kyle Zeng <kylebot@openai.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
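The copy ordering described above can be sketched with `offsetof()`. The struct layout and the names `demo_entry`, `demo_counters` and `demo_export_entry()` are assumptions made for illustration; the real entry structures and the `copy_to_user()` calls are not reproduced here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical entry layout: fixed fields first, counters last. */
struct demo_counters { uint64_t pcnt, bcnt; };
struct demo_entry {
    uint32_t match_id;
    uint32_t next_offset;
    struct demo_counters counters;   /* pcnt holds a pointer in-kernel */
};

/*
 * Copy only the prefix that precedes the counters, then the sanitized
 * snapshot. The raw counters field is never copied, so a failure between
 * the two steps cannot leave it in the destination.
 */
void demo_export_entry(unsigned char *dst, const struct demo_entry *e,
                       const struct demo_counters *snapshot)
{
    const size_t prefix = offsetof(struct demo_entry, counters);

    memcpy(dst, e, prefix);
    memcpy(dst + prefix, snapshot, sizeof(*snapshot));
}
```

In this sketch `memcpy()` cannot fault, so it only demonstrates which bytes are ever written to the destination buffer.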
6 days	ip6_vti: fix incorrect tunnel matching in vti6_tnl_lookup()	Eric Dumazet	1	-0/+2
	In vti6_tnl_lookup(), when an exact match for a tunnel fails, the code falls back to searching for wildcard tunnels: - Tunnels matching the packet's local address, with any remote address (wildcard remote). - Tunnels matching the packet's remote address, with any local address (wildcard local). However, vti6 stores all these different types of tunnels in the same hash table (ip6n->tnls_r_l), which is prone to hash collisions. The bug is that the fallback search loops in vti6_tnl_lookup() were missing checks to ensure that the candidate tunnel actually has a wildcard address. Fixes: fbe68ee87522 ("vti6: Add a lookup method for tunnels with wildcard endpoints.") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Steffen Klassert <steffen.klassert@secunet.com> Reviewed-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Link: https://patch.msgid.link/20260608164613.933023-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
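A simplified model of the fallback lookup, with the wildcard checks in place, might look like the following. Addresses are reduced to 32-bit integers and `DEMO_ANY`, `demo_tunnel` and `demo_tnl_lookup()` are hypothetical; the sketch only shows why each fallback pass must confirm that the other endpoint is a wildcard.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_ANY 0u   /* hypothetical wildcard address value */

struct demo_tunnel { uint32_t local, remote; };

/* All tunnel kinds share one bucket, as in a colliding hash chain. */
const struct demo_tunnel *demo_tnl_lookup(const struct demo_tunnel *bucket,
                                          size_t n, uint32_t local,
                                          uint32_t remote)
{
    size_t i;

    for (i = 0; i < n; i++)            /* pass 1: exact match */
        if (bucket[i].local == local && bucket[i].remote == remote)
            return &bucket[i];
    for (i = 0; i < n; i++)            /* pass 2: wildcard remote only */
        if (bucket[i].local == local && bucket[i].remote == DEMO_ANY)
            return &bucket[i];
    for (i = 0; i < n; i++)            /* pass 3: wildcard local only */
        if (bucket[i].remote == remote && bucket[i].local == DEMO_ANY)
            return &bucket[i];
    return NULL;
}
```

If the `== DEMO_ANY` conditions were dropped, a fully specified tunnel that happens to share the bucket would be returned for a packet from an unrelated peer.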
6 days	ip6_vti: set netns_immutable on the fallback device.	Eric Dumazet	1	-0/+1
	john1988 and Noam Rathaus reported that vti6_init_net() does not set the netns_immutable flag on the per-netns fallback tunnel device (ip6_vti0). Other similar tunnel drivers (like ip6_tunnel, sit, ip6_gre, and ip_tunnel) correctly set this flag during their fallback device initialization to prevent them from being moved to another network namespace. Fixes: 61220ab34948 ("vti6: Enable namespace changing") Reported-by: Noam Rathaus <noamr@ssd-disclosure.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Steffen Klassert <steffen.klassert@secunet.com> Reviewed-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Link: https://patch.msgid.link/20260608155918.787644-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
7 days	esp: fix page frag reference leak on skb_to_sgvec failure	Alessandro Schino	1	-6/+11
	In esp_output_tail(), when esp->inplace is false, the old skb page frags are replaced with a new page from the xfrm page_frag cache. The source scatterlist (sg) is built from the old frags before the replacement, and esp_ssg_unref() is responsible for releasing the old page references after the crypto operation completes. However, if the second skb_to_sgvec() call (which builds the destination scatterlist from the new page) fails, the code jumps to error_free which only calls kfree(tmp). The old page frag references captured in the source scatterlist are never released: 1) sg[] is built from old frags via skb_to_sgvec() (no extra get_page). 2) nr_frags is set to 1 and frag[0] is replaced with the new page. 3) Second skb_to_sgvec() fails -> goto error_free. Fix this by adding a bool parameter to esp_ssg_unref() that, when true, unconditionally unrefs the source scatterlist frags. Since req->src is not yet initialized by aead_request_set_crypt() at the point of the error, the source scatterlist is obtained directly via esp_req_sg(). Existing callers pass false to preserve the original behavior. The same issue exists in both esp4 and esp6 as the code is identical. Fixes: cac2661c53f3 ("esp4: Avoid skb_cow_data whenever possible") Fixes: 03e2a30f6a27 ("esp6: Avoid skb_cow_data whenever possible") Signed-off-by: Alessandro Schino <7991aleschino@gmail.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
7 days	ipv6: sit: reload inner IPv6 header after GSO offloads	Kyle Zeng	1	-0/+1
	ipip6_tunnel_xmit() caches the inner IPv6 header pointer at function entry and continues using it after iptunnel_handle_offloads(). For GSO skbs, iptunnel_handle_offloads() calls skb_header_unclone(). When the skb header is cloned, skb_header_unclone() can call pskb_expand_head(), which may move the skb head. The pskb_expand_head() contract requires pointers into the skb header to be reloaded after the call. If the later skb_realloc_headroom() branch is not taken, SIT uses the stale iph6 pointer to read the inner hop limit and DS field. That can read from a freed skb head after the old head's remaining clone is released. Reload iph6 after the offload helper succeeds and before subsequent reads from the inner IPv6 header. Keep the existing reload after skb_realloc_headroom(), since that branch can also replace the skb. Fixes: 14909664e4e1 ("sit: Setup and TX path for sit/UDP foo-over-udp encapsulation") Signed-off-by: Kyle Zeng <kylebot@openai.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot+6eb9ca986d80f6f88cf9@syzkaller.appspotmail.com Link: https://patch.msgid.link/20260605073448.6524-1-kylebot@openai.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
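The reload rule can be illustrated in userspace with `realloc()`, which, like pskb_expand_head(), may move the underlying storage. `demo_buf`, `demo_expand_head()`, `demo_hdr()` and `demo_hop_limit_after_expand()` are hypothetical names; the only fact borrowed from IPv6 is that the hop limit is byte 7 of the fixed header.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct demo_buf { unsigned char *head; size_t len; size_t hdr_off; };

/* Grow the buffer; the head may move, so old pointers become stale. */
int demo_expand_head(struct demo_buf *b, size_t extra)
{
    unsigned char *n = realloc(b->head, b->len + extra);

    if (!n)
        return -1;
    b->head = n;
    b->len += extra;
    return 0;
}

/* Always derive the header pointer from the current head. */
unsigned char *demo_hdr(const struct demo_buf *b)
{
    return b->head + b->hdr_off;
}

int demo_hop_limit_after_expand(struct demo_buf *b, size_t extra)
{
    const unsigned char *h = demo_hdr(b);
    int version = h[0] >> 4;      /* safe: read before the head can move */

    if (version != 6 || demo_expand_head(b, extra))
        return -1;
    h = demo_hdr(b);              /* reload after the possible move */
    return h[7];                  /* hop limit of the fixed IPv6 header */
}
```

Storing an offset (`hdr_off`) and recomputing the pointer on demand is one way to make the reload hard to forget; it is shown here as a design choice of the sketch, not as what the SIT code does.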
12 days	ipv6: mcast: Fix use-after-free when processing MLD queries	Ido Schimmel	1	-4/+4
	When processing an MLD query, a pointer to the multicast group address is retrieved when initially parsing the packet. This pointer is later dereferenced without being reloaded despite the fact that the skb header might have been reallocated following the pskb_may_pull() calls, leading to a use-after-free [1]. Fix by copying the multicast group address when the packet is initially parsed. [1] BUG: KASAN: slab-use-after-free in __mld_query_work (net/ipv6/mcast.c:1512) Read of size 8 at addr ffff8881154b8e90 by task kworker/4:1/118 Workqueue: mld mld_query_work Call Trace: <TASK> dump_stack_lvl (lib/dump_stack.c:94 lib/dump_stack.c:120) print_address_description.constprop.0 (mm/kasan/report.c:378) print_report (mm/kasan/report.c:482) kasan_report (mm/kasan/report.c:595) __mld_query_work (net/ipv6/mcast.c:1512) mld_query_work (net/ipv6/mcast.c:1563) process_one_work (kernel/workqueue.c:3314) worker_thread (kernel/workqueue.c:3397 kernel/workqueue.c:3478) kthread (kernel/kthread.c:436) ret_from_fork (arch/x86/kernel/process.c:158) ret_from_fork_asm (arch/x86/entry/entry_64.S:245) </TASK> [...] 
Freed by task 118: kasan_save_stack (mm/kasan/common.c:57) kasan_save_track (mm/kasan/common.c:78) kasan_save_free_info (mm/kasan/generic.c:584) __kasan_slab_free (mm/kasan/common.c:253 mm/kasan/common.c:285) kfree (./include/linux/kasan.h:235 mm/slub.c:2689 mm/slub.c:6251 mm/slub.c:6566) pskb_expand_head (net/core/skbuff.c:2335) __pskb_pull_tail (net/core/skbuff.c:2878 (discriminator 4)) __mld_query_work (net/ipv6/mcast.c:1495 (discriminator 1)) mld_query_work (net/ipv6/mcast.c:1563) process_one_work (kernel/workqueue.c:3314) worker_thread (kernel/workqueue.c:3397 kernel/workqueue.c:3478) kthread (kernel/kthread.c:436) ret_from_fork (arch/x86/kernel/process.c:158) ret_from_fork_asm (arch/x86/entry/entry_64.S:245) Fixes: 97300b5fdfe2 ("[MCAST] IPv6: Check packet size when process Multicast") Reported-by: Leo Lin <leo@depthfirst.com> Reviewed-by: David Ahern <dahern@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Jiayuan Chen <jiayuan.chen@linux.dev> Link: https://patch.msgid.link/20260603101811.612594-1-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
13 days	ipv6: anycast: insert aca into global hash under idev->lock	Jiayuan Chen	1	-8/+8
	syzbot reported a splat [1]: a slab-use-after-free in ipv6_chk_acast_addr(), which walks the global inet6_acaddr_lst[] hash under RCU and dereferences a struct ifacaddr6 that has already been freed while still linked in the hash, so a later reader walks into a dangling node. In __ipv6_dev_ac_inc() the aca is allocated with refcount 1, then aca_get() bumps it to 2 to keep it alive across the unlocked region. It is published to idev->ac_list under idev->lock, but ipv6_add_acaddr_hash() runs after write_unlock_bh(). A concurrent teardown (ipv6_ac_destroy_dev() from addrconf_ifdown(), under RTNL) can slip into that window:
	  CPU0 __ipv6_dev_ac_inc                 CPU1 ipv6_ac_destroy_dev (RTNL)
	  ------------------------------         ------------------------------------
	  aca_alloc() refcnt 1
	  aca_get() refcnt 2
	  write_lock_bh(idev->lock)
	  add aca to ac_list
	  write_unlock_bh(idev->lock)
	                                         write_lock_bh(idev->lock)
	                                         pull aca off ac_list
	                                         write_unlock_bh(idev->lock)
	                                         ipv6_del_acaddr_hash(aca)
	                                           hlist_del_init_rcu() is a no-op,
	                                           aca is not in the hash yet
	                                         aca_put() refcnt 2->1
	  ipv6_add_acaddr_hash(aca)
	    aca now inserted into the hash
	  aca_put() refcnt 1->0
	    call_rcu(aca_free_rcu) -> kfree(aca)
	The hash removal becomes a no-op because the insertion has not happened yet, so once CPU0 inserts and drops the last reference, the aca is freed while still linked in inet6_acaddr_lst[], and readers dereference freed memory after the slab slot is reused. This window opened once RTNL stopped serializing the join path against device teardown. Move ipv6_add_acaddr_hash() inside the idev->lock section so the ac_list and hash insertions are atomic with respect to teardown: a racing remover now either misses the aca entirely or finds it in both lists. acaddr_hash_lock is now nested under idev->lock, which is acquired in softirq context, so switch all acaddr_hash_lock sites to spin_lock_bh() to avoid the irq lock inversion reported in [2]. 
[1] https://syzkaller.appspot.com/bug?extid=a01df04303c131efbf3a [2] https://lore.kernel.org/netdev/6a194ef7.ba3b1513.1890b4.0000.GAE@google.com/ Reported-by: syzbot+819eb928d120d2bdad0e@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/6a191f87.ce022c6e.138e56.0003.GAE@google.com/T/ Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Fixes: eb1ac9ff6c4a ("ipv6: anycast: Don't hold RTNL for IPV6_JOIN_ANYCAST.") Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/20260529152219.235475-1-jiayuan.chen@linux.dev Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-06-01	netfilter: nft_fib_ipv6: bail out of sibling walk if rt got unlinked	Jiayuan Chen	1	-0/+3
	This was reported by Sashiko [1]. The RCU walk over rt->fib6_siblings can spin forever if rt is unlinked mid-iteration: rt->fib6_siblings.next still points into the old ring, so the loop never meets &rt->fib6_siblings as its terminator. fib6_purge_rt() always does WRITE_ONCE(rt->fib6_nsiblings, 0) before list_del_rcu(), so readers can use rt->fib6_nsiblings == 0 as the detach signal. The same pattern is used in fib6_info_uses_dev() and rt6_nlmsg_size(). [1]: https://sashiko.dev/#/patchset/20260520023411.391233-1-jiayuan.chen%40linux.dev Suggested-by: Florian Westphal <fw@strlen.de> Fixes: 1c32b24c234b ("netfilter: nft_fib_ipv6: switch to fib6_lookup") Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
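The detach check can be modelled on a plain circular list. This is a single-threaded sketch with hypothetical names (`demo_rt`, `demo_ring_uses_dev()`); the real code runs under RCU and reads the counter with READ_ONCE(), which is omitted here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical route node: ring link, sibling count, and a device id. */
struct demo_rt { struct demo_rt *next; int nsiblings; int dev; };

/*
 * Walk the sibling ring starting after rt. If rt has been unlinked, its
 * next pointer still leads into the old ring, which no longer contains
 * rt, so the "it != rt" terminator is unreachable. A sibling count of 0
 * is used as the signal to stop.
 */
int demo_ring_uses_dev(const struct demo_rt *rt, int dev)
{
    const struct demo_rt *it;

    if (rt->dev == dev)
        return 1;
    for (it = rt->next; it != rt; it = it->next) {
        if (rt->nsiblings == 0)   /* rt was detached mid-walk: bail out */
            return 0;
        if (it->dev == dev)
            return 1;
    }
    return 0;
}
```

The check sits inside the loop because the unlink can happen at any iteration, not only before the walk starts.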
2026-05-29	Revert "ipv6: preserve insertion order for same-scope addresses"	Fernando Fernandez Mancera	1	-1/+1
	Chris Adams reported that preserving insertion order for same-scope addresses is causing SSH connections to be dropped after stopping a VM while running NetworkManager. NetworkManager caches the IPv6 address configuration; when an RA arrives, it determines the list of addresses to configure and checks whether the addresses are already in the right order in the kernel. If they aren't, NetworkManager removes and re-adds them to achieve the desired order. As the order changes, NetworkManager is confused and reconfigures the addresses on every update. In addition, this would also affect cloud tooling that relies on IPv6 address order to identify primary and secondary addresses. This reverts commit cb3de96eea66f5e4a580086c6a1be46e765f97f4. Fixes: cb3de96eea66 ("ipv6: preserve insertion order for same-scope addresses") Reported-by: Chris Adams <linux@cmadams.net> Closes: https://lore.kernel.org/netdev/20260521135310.GC977@cmadams.net/ Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de> Link: https://patch.msgid.link/20260529112357.5079-1-fmancera@suse.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-05-29	Merge tag 'ipsec-2026-05-29' of ↵	Jakub Kicinski	2	-3/+3
	git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec Steffen Klassert says: ==================== pull request (net): ipsec 2026-05-29 1) xfrm: route MIGRATE notifications to caller's netns Thread the caller's netns through km_migrate() so that MIGRATE notifications go to the issuing netns, fixing both the init_net listener leak and MOBIKE notifications inside non-init netns. From Maoyi Xie. 2) xfrm: ipcomp: Free destination pages on acomp errors Move the out_free_req label up so that allocated destination pages are released on decompression errors, not only on success. From Herbert Xu. 3) xfrm: Check for underflow in xfrm_state_mtu Reject configurations that cause xfrm_state_mtu() to underflow, preventing a negative TFCPAD value from becoming a memset size that triggers an out-of-bounds write of several terabytes. From David Ahern. 4) xfrm: ah: use skb_to_full_sk in async output callbacks Convert the possibly-incomplete skb->sk to a full socket pointer in async AH callbacks so that a request_sock or timewait_sock never reaches xfrm_output_resume() downstream consumers. From Michael Bommarito. 5) Add and revert: esp: fix page frag reference leak on skb_to_sgvec failure The patch does not fix the issue completely. 6) xfrm: esp: restore combined single-frag length gate Check the aligned post-trailer combined length against a page limit in the fast path, preventing skb_page_frag_refill() from falling back to a page too small for the destination scatterlist. From Jingguo Tan. 7) xfrm: iptfs: reset runtime state when cloning SAs Reinitialise the clone's mode_data runtime objects before publishing it, preventing queued skbs from being freed with list state copied from the original SA when migration fails. From Shaomin Chen. 
8) xfrm: move policy_bydst RCU sync from per-netns .exit to .pre_exit Flush policy tables and drain the workqueue in a .pre_exit handler so that cleanup_net() pays one RCU grace period per batch instead of one per namespace, fixing stalls at high CLONE_NEWNET rates. From Usama Arif. 9) xfrm: input: hold netns during deferred transport reinjection Take a netns reference when queueing deferred transport reinjection work and drop it after the callback completes, keeping the skb->cb net pointer valid until the deferred work runs. From Zhengchuan Liang. * tag 'ipsec-2026-05-29' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec: Revert "esp: fix page frag reference leak on skb_to_sgvec failure" xfrm: input: hold netns during deferred transport reinjection xfrm: move policy_bydst RCU sync from per-netns .exit to .pre_exit xfrm: iptfs: reset runtime state when cloning SAs xfrm: esp: restore combined single-frag length gate esp: fix page frag reference leak on skb_to_sgvec failure xfrm: ah: use skb_to_full_sk in async output callbacks xfrm: Check for underflow in xfrm_state_mtu xfrm: ipcomp: Free destination pages on acomp errors xfrm: route MIGRATE notifications to caller's netns ==================== Link: https://patch.msgid.link/20260529092648.3878973-1-steffen.klassert@secunet.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-05-29	ipv6: fix possible infinite loop in fib6_select_path()	Jiayuan Chen	1	-0/+3
	Found while auditing the same pattern Sashiko reported in rt6_fill_node() [1]. Apply the same fix as commit f8d8ce1b515a ("ipv6: fix possible infinite loop in fib6_info_uses_dev()"). Writers holding tb6_lock can list_del_rcu(&first->fib6_siblings) without waiting for RCU readers; first->fib6_siblings.next then still points into the old ring and this softirq-side walker never reaches &first->fib6_siblings as its terminator. fib6_purge_rt() always WRITE_ONCE()s first->fib6_nsiblings to 0 before list_del_rcu(), so an inside-loop check is a reliable detach signal. [1] https://sashiko.dev/#/patchset/20260526020227.4857-1-jiayuan.chen%40linux.dev Fixes: d9ccb18f83ea ("ipv6: Fix soft lockups in fib6_select_path under high next hop churn") Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/20260527053133.180695-2-jiayuan.chen@linux.dev Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-05-29	ipv6: fix possible infinite loop in rt6_fill_node()	Jiayuan Chen	1	-0/+2
	Sashiko reported this issue [1]. Apply the same fix as commit f8d8ce1b515a ("ipv6: fix possible infinite loop in fib6_info_uses_dev()"). Writers holding tb6_lock can list_del_rcu(&rt->fib6_siblings) without waiting for RCU readers; rt->fib6_siblings.next then still points into the old ring and this softirq-side walker never reaches &rt->fib6_siblings, causing a CPU stall. fib6_del_route() always WRITE_ONCE()s rt->fib6_nsiblings to 0 before list_del_rcu(), so an inside-loop check is a reliable detach signal. [1] https://sashiko.dev/#/patchset/20260526020227.4857-1-jiayuan.chen%40linux.dev Fixes: d9ccb18f83ea ("ipv6: Fix soft lockups in fib6_select_path under high next hop churn") Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/20260527053133.180695-1-jiayuan.chen@linux.dev Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-05-29	Revert "esp: fix page frag reference leak on skb_to_sgvec failure"	Steffen Klassert	1	-7/+5
	This reverts commit 2982e599fff6faa21c8df147d96fc7af6c1a2f24. The patch does not fully fix the issue and the Author does not match the 'Signed-off-by:' tag, so revert it for now. Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2026-05-28	ipv6: rpl: fix hdrlen overflow in ipv6_rpl_srh_decompress()	Rahul Chandelkar	1	-1/+1
	ipv6_rpl_srh_decompress() computes: outhdr->hdrlen = (((n + 1) * sizeof(struct in6_addr)) >> 3); hdrlen is __u8. For n >= 127 the result exceeds 255 and silently truncates. With n=127 (cmpri=15, cmpre=15, pad=0, hdrlen=16): (128 * 16) >> 3 = 256, truncated to 0 as __u8 The caller in ipv6_rpl_srh_rcv() then places the compressed header at buf + ((ohdr->hdrlen + 1) << 3). With hdrlen=0 this is buf + 8, but the decompressed region occupies buf[0..2055] (8-byte header plus 128 full addresses). The compressed header overlaps the decompressed data, and ipv6_rpl_srh_compress() writes into this overlap, corrupting the routing header of the forwarded packet. The existing guard at exthdrs.c:546 checks (n + 1) > 255, which prevents n+1 from overflowing unsigned char (the segments_left field), but does not prevent the computed hdrlen from overflowing __u8. n=127 passes because 128 <= 255, yet hdrlen=256 does not fit. Tighten the bound to (n + 1) > 127. This caps n at 126, giving hdrlen = (127 * 16) >> 3 = 254, which fits in __u8. The compressed header then lands at buf + ((254 + 1) << 3) = buf + 2040, exactly past the decompressed region (buf[0..2039]). No overlap. 127 segments is well beyond any realistic RPL deployment. Fixes: 8610c7c6e3bd ("net: ipv6: add support for rpl sr exthdr") Signed-off-by: Rahul Chandelkar <rc@rexion.ai> Link: https://patch.msgid.link/20260525154031.2290876-1-rc@rexion.ai Signed-off-by: Jakub Kicinski <kuba@kernel.org>
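The arithmetic in this commit message can be checked directly. The helper names below are hypothetical; the formula, the 16-byte address size, and the old and new bounds are taken from the text above.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_ADDR_SIZE 16u   /* sizeof(struct in6_addr) */

/* hdrlen as computed in full integer width. */
unsigned int demo_hdrlen_full(unsigned int n)
{
    return ((n + 1) * DEMO_ADDR_SIZE) >> 3;
}

/* hdrlen as it ends up in a __u8 field: silently truncated. */
uint8_t demo_hdrlen_stored(unsigned int n)
{
    return (uint8_t)demo_hdrlen_full(n);
}

/* Old guard rejected (n + 1) > 255; the tightened one rejects > 127. */
int demo_old_guard_accepts(unsigned int n) { return !((n + 1) > 255); }
int demo_new_guard_accepts(unsigned int n) { return !((n + 1) > 127); }
```

For n = 127 the full-width result is 256 while the stored byte is 0, and the old guard lets that value through; for n = 126 the stored value is 254 and the compressed header offset `(254 + 1) << 3` is 2040.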
2026-05-27	ipv6: validate extension header length before copying to cmsg	Qi Tang	1	-8/+46
	ip6_datagram_recv_specific_ctl() builds IPV6_{HOPOPTS,DSTOPTS,RTHDR} cmsgs (and their IPV6_2292* legacy counterparts) by trusting the on-wire hdrlen byte (ptr[1]) when computing the put_cmsg() length. The length was validated only at parse time (ipv6_parse_hopopts(), etc.). An nftables payload-write expression can rewrite hdrlen after parsing and before the skb reaches recvmsg; the write itself is in-bounds but put_cmsg() then reads up to ((hdrlen+1) << 3) = 2040 bytes from an 8-byte header. nftables is reachable from an unprivileged user namespace, so this is an unprivileged slab-out-of-bounds read: BUG: KASAN: slab-out-of-bounds in put_cmsg+0x3ac/0x540 put_cmsg+0x3ac/0x540 udpv6_recvmsg+0xca0/0x1250 sock_recvmsg+0xdf/0x190 ____sys_recvmsg+0x1b1/0x620 Add ipv6_get_exthdr_len() which validates that at least two bytes are accessible before reading the hdrlen field, then checks the computed length against skb_tail_pointer(skb), returning 0 on failure. Extension headers are kept in the linear skb area by pskb_may_pull() during input, so skb_tail_pointer() is the correct bound. Use ipv6_get_exthdr_len() at all non-AH call sites: the five standalone cmsg blocks (HbH, 2292HbH, 2292DSTOPTS x2, 2292RTHDR) and the three standard cases in the extension-header walk loop (DSTOPTS, ROUTING, default). AH retains an inline bounds check because its length formula differs ((ptr[1]+2)<<2). The walk loop also gets a pre-read bounds check at the top to validate ptr before any case accesses ptr[0] or ptr[1]. When the walk loop detects a corrupted header, return from the function instead of continuing to process later socket options. Cc: stable@vger.kernel.org Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Qi Tang <tpluszz77@gmail.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20260523143245.2281415-1-tpluszz77@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
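A bounds-checked length helper in the spirit of the one described above could look like this in userspace. `demo_exthdr_len()` is a hypothetical analogue, not the kernel's ipv6_get_exthdr_len(): it takes an explicit `tail` pointer in place of skb_tail_pointer(skb) and assumes both pointers refer to the same buffer.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Return the extension header length in bytes, or 0 if fewer than two
 * bytes are readable or the claimed length runs past the end of the data.
 */
size_t demo_exthdr_len(const unsigned char *ptr, const unsigned char *tail)
{
    size_t avail, len;

    if (ptr > tail)
        return 0;
    avail = (size_t)(tail - ptr);
    if (avail < 2)                 /* need ptr[0] and ptr[1] */
        return 0;
    len = ((size_t)ptr[1] + 1) << 3;
    if (len > avail)               /* hdrlen byte claims more than exists */
        return 0;
    return len;
}
```

Returning 0 for every failure lets a caller treat "no usable header" uniformly, since a valid extension header is never shorter than 8 bytes.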
2026-05-26	ip6: vti: Use ip6_tnl.net in vti6_siocdevprivate().	Maoyi Xie	1	-2/+9
	After patch 1/2 in this series, vti6_update() unlinks and relinks the tunnel through t->net. vti6_siocdevprivate() still uses dev_net(dev) for the collision lookup. For a tunnel moved through IFLA_NET_NS_FD, dev_net(dev) is the new netns, not t->net. SIOCCHGTUNNEL on a migrated tunnel then runs:
	  net = dev_net(dev) /* migrated netns */
	  t = vti6_locate(net, &p1, false) /* misses target in t->net */
	  ...
	  t = netdev_priv(dev)
	  vti6_update(t, &p1, false) /* mutates t->net's hash */
	A caller in the migrated netns picks params that match a tunnel in the creation netns. The lookup in dev_net(dev) finds nothing. vti6_update() prepends the migrated tunnel at the head of the creation netns hash bucket for those params. Later lookups in the creation netns resolve to the migrated device. xfrm receive delivers the matched packets through a device the caller controls. Reachable from an unprivileged user namespace (unshare --user --map-root-user --net). Cross tenant scope on container hosts. Switch the SIOCCHGTUNNEL path on a non fallback device to use t->net for the lookup. The lookup now matches the netns vti6_update() operates on. Also add ns_capable(self->net->user_ns, CAP_NET_ADMIN) before the lookup. The check at the top of the case is against dev_net(dev)->user_ns, which after migration is the attacker's netns. A caller there can pick params absent from self->net, the lookup returns NULL, t becomes self, and vti6_update() inserts the device into the creation netns hash. The new check requires CAP_NET_ADMIN in the creation netns user_ns too. SIOCADDTUNNEL and SIOCCHGTUNNEL on the fallback device keep dev_net(dev), which equals init_net there. 
Fixes: 61220ab34948 ("vti6: Enable namespace changing") Suggested-by: Jakub Kicinski <kuba@kernel.org> Suggested-by: Xiao Liang <shaw.leon@gmail.com> Cc: stable@vger.kernel.org # v5.15+ Signed-off-by: Maoyi Xie <maoyixie.tju@gmail.com> Link: https://patch.msgid.link/20260521130555.3421684-3-maoyixie.tju@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-05-26	ip6: vti: Use ip6_tnl.net in vti6_changelink().	Kuniyuki Iwashima	1	-5/+7
	  ip netns add ns1
	  ip netns add ns2
	  ip -n ns1 link add vti6_test type vti6 remote ::1 local ::2 key 7
	  ip -n ns1 link set vti6_test netns ns2
	  ip -n ns2 link set vti6_test type vti6 remote ::3 local ::4 key 9
	  ip netns del ns2
	  ip netns del ns1
	  [ 132.495484] ------------[ cut here ]------------
	  [ 132.497609] kernel BUG at net/core/dev.c:12376!
	Commit 61220ab34948 ("vti6: Enable namespace changing") dropped NETIF_F_NETNS_LOCAL from vti6 devices. A vti6 tunnel can then move through IFLA_NET_NS_FD. After the move dev_net(dev) points at the new netns while t->net stays at the creation netns. vti6_changelink() and vti6_update() still use dev_net(dev) and dev_net(t->dev). They unlink from one per netns hash and relink into another. The creation netns is left with a stale entry. cleanup_net() of that netns later walks freed memory. Reachable from an unprivileged user namespace (unshare --user --map-root-user --net). Cross tenant scope on container hosts. Fixes: 61220ab34948 ("vti6: Enable namespace changing") Reported-by: Maoyi Xie <maoyi.xie@ntu.edu.sg> Reviewed-by: Eric Dumazet <edumazet@google.com> Cc: stable@vger.kernel.org # v5.15+ Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20260521130555.3421684-2-maoyixie.tju@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-05-25	ipv6: exthdrs: refresh nh pointer after ipv6_hop_jumbo()	Justin Iurman	1	-0/+2
	ipv6_hop_jumbo() calls pskb_trim_rcsum(), which can change skb pointers. Let's recompute nh pointer to make sure any change won't mess things up. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Cc: stable@vger.kernel.org Signed-off-by: Justin Iurman <justin.iurman@gmail.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/20260522112013.12342-1-justin.iurman@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-05-25	ipv6: exthdrs: refresh nh after handling HAO option	Zhengchuan Liang	1	-0/+2
	ip6_parse_tlv() caches skb_network_header(skb) in nh while walking IPv6 TLVs. ipv6_dest_hao() may call pskb_expand_head() for a cloned skb, which can move the skb head and invalidate the cached network header pointer. Refresh nh after ipv6_dest_hao() returns so any trailing padding or TLVs are parsed from the current skb head. This matches the existing pattern used in ip6_parse_tlv() after helpers that can modify skb header storage. Fixes: a831f5bbc89a ("[IPV6] MIP6: Add inbound interface of home address option.") Cc: stable@kernel.org Reported-by: Yuan Tan <yuantan098@gmail.com> Reported-by: Xin Liu <bird@lzu.edu.cn> Co-developed-by: Luxing Yin <tr0jan@lzu.edu.cn> Signed-off-by: Luxing Yin <tr0jan@lzu.edu.cn> Signed-off-by: Zhengchuan Liang <zcliangcn@gmail.com> Signed-off-by: Ren Wei <n05ec@lzu.edu.cn> Reviewed-by: Justin Iurman <justin.iurman@gmail.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/7aba1debc2196189172499e5769802b026f8caf8.1779247873.git.zcliangcn@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
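The stale-pointer pattern behind this fix and the ipv6_hop_jumbo() fix can be sketched in userspace C. This is a minimal illustration, not kernel code: `toy_buf`, `toy_network_header()`, `toy_expand_head()` and `toy_parse_after_helper()` are hypothetical stand-ins for an skb, skb_network_header(), a head-moving helper such as pskb_expand_head(), and the TLV walk.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for an skb: one heap buffer plus a header offset. */
struct toy_buf {
	unsigned char *head;	/* start of the allocation */
	size_t len;		/* bytes in use */
	size_t nh_off;		/* offset of the "network header" */
};

/* Always derived from the current head, so it is never stale by itself. */
static unsigned char *toy_network_header(const struct toy_buf *b)
{
	return b->head + b->nh_off;
}

/*
 * Helper that moves the head. It always copies into a fresh allocation
 * and frees the old one, so any pointer into the old buffer is invalid.
 * Returns 0 on success, -1 on allocation failure.
 */
static int toy_expand_head(struct toy_buf *b, size_t extra)
{
	unsigned char *n = malloc(b->len + extra);

	if (!n)
		return -1;
	memcpy(n, b->head, b->len);
	free(b->head);
	b->head = n;
	return 0;
}

/* Read the byte at header offset `off` after a head-moving helper ran. */
static int toy_parse_after_helper(struct toy_buf *b, size_t off)
{
	unsigned char *nh = toy_network_header(b);	/* cached pointer */

	assert(b->nh_off + off < b->len);
	if (toy_expand_head(b, 64))
		return -1;
	nh = toy_network_header(b);	/* refresh: the old nh is stale */
	return nh[off];
}
```

Without the refresh line, `nh[off]` would read freed memory, which is the class of bug these two commits close.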
2026-05-22	netfilter: nft_fib_ipv6: handle routes via external nexthop	Jiayuan Chen	1	-0/+16
	fib6_info has a union: union { struct list_head fib6_siblings; struct list_head nh_list; }; Old-style multipath (ip -6 route add ... nexthop ... nexthop ...) uses fib6_siblings. External nexthop (ip -6 route add ... nhid N) uses nh_list, linked into &nh->f6i_list. nft_fib6_info_nh_uses_dev() blindly walks &rt->fib6_siblings, causing an OOB read past the struct nexthop slab when rt->nh is set: ================================================================== BUG: KASAN: slab-out-of-bounds in nft_fib6_eval+0x1362/0x16c0 Read of size 8 at addr ffff888103a099d0 by task ping/386 CPU: 2 UID: 0 PID: 386 Comm: ping Not tainted 7.1.0-rc3+ #251 PREEMPT Call Trace: <IRQ> dump_stack_lvl+0x76/0xa0 print_report+0xd1/0x5f0 kasan_report+0xe7/0x130 __asan_report_load8_noabort+0x14/0x30 nft_fib6_eval+0x1362/0x16c0 nft_do_chain+0x279/0x18c0 nft_do_chain_ipv6+0x1a8/0x230 nf_hook_slow+0xad/0x200 ipv6_rcv+0x152/0x380 __netif_receive_skb_one_core+0x118/0x1c0 ================================================================== Branch by route shape: when rt->nh is set, walk via nexthop_for_each_fib6_nh() (also covers nh groups, which the original code missed); otherwise walk fib6_siblings, guarded by READ_ONCE() of rt->fib6_nsiblings as required by commit 31d7d67ba127 ("ipv6: annotate data-races around rt->fib6_nsiblings"). Fixes: 1c32b24c234b ("netfilter: nft_fib_ipv6: switch to fib6_lookup") Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev> Signed-off-by: Florian Westphal <fw@strlen.de>
2026-05-22	netfilter: nft_fib_ipv6: walk fib6_siblings under RCU	Jiayuan Chen	1	-1/+1
	nft_fib6_info_nh_uses_dev() runs from nft_fib6_eval() in softirq under rcu_read_lock(). fib6_siblings is modified by writers that hold tb6_lock but do not wait for RCU readers, so the sibling walk should use list_for_each_entry_rcu(): it adds READ_ONCE() on the ->next pointer and lets CONFIG_PROVE_RCU_LIST validate the locking. No functional change for non-debug builds. Fixes: 1c32b24c234b ("netfilter: nft_fib_ipv6: switch to fib6_lookup") Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev> Signed-off-by: Florian Westphal <fw@strlen.de>
2026-05-22	xfrm: esp: restore combined single-frag length gate	Jingguo Tan	1	-2/+2
	The ESP out-of-place fast path appends the trailer in esp_output_head() before esp_output_tail() allocates the destination page frag. The head-side gate currently checks skb->data_len and tailen separately, but the tail code allocates a single destination frag from the combined post-trailer skb->data_len. Reject the page-frag fast path when the combined aligned length exceeds a page. Otherwise skb_page_frag_refill() may fall back to a single page while the destination sg still spans the combined skb->data_len. Restore this combined-length page gate for both IPv4 and IPv6. Fixes: 5bd8baab087d ("esp: limit skb_page_frag_refill use to a single page") Cc: stable@vger.kernel.org Signed-off-by: Lin Ma <malin89@huawei.com> Signed-off-by: Chenyuan Mi <michenyuan@huawei.com> Signed-off-by: Jingguo Tan <tanjingguo@huawei.com> Reviewed-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
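The difference between the separate and combined length gates can be shown with plain arithmetic. The sketch below is an assumption-laden illustration: the 4096-byte page, the 64-byte alignment unit and the `toy_*` names are hypothetical and do not reproduce the esp4/esp6 code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_PAGE_SIZE	4096u	/* assumed page size for the sketch */
#define TOY_CACHE_LINE	64u	/* assumed alignment unit */

/* Round n up to a multiple of a, where a is a power of two. */
static size_t toy_align(size_t n, size_t a)
{
	assert(a != 0 && (a & (a - 1)) == 0);
	return (n + a - 1) & ~(a - 1);
}

/* Separate checks: each piece fits in a page on its own. */
static bool toy_gate_separate(size_t data_len, size_t tailen)
{
	return toy_align(data_len, TOY_CACHE_LINE) <= TOY_PAGE_SIZE &&
	       toy_align(tailen, TOY_CACHE_LINE) <= TOY_PAGE_SIZE;
}

/* Combined check: one destination frag must hold both pieces. */
static bool toy_gate_combined(size_t data_len, size_t tailen)
{
	return toy_align(data_len + tailen, TOY_CACHE_LINE) <= TOY_PAGE_SIZE;
}
```

With 4000 bytes of paged data and a 200-byte trailer, the separate gate passes while the combined gate rejects, which is the case the commit routes away from the page-frag fast path.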
2026-05-22	esp: fix page frag reference leak on skb_to_sgvec failure	e521588	1	-5/+7
	In esp_output_tail(), when esp->inplace is false, the old skb page frags are replaced with a new page from the xfrm page_frag cache. The source scatterlist (sg) is built from the old frags before the replacement, and esp_ssg_unref() is responsible for releasing the old page references after the crypto operation completes. However, if the second skb_to_sgvec() call (which builds the destination scatterlist from the new page) fails, the code jumps to error_free which only calls kfree(tmp). The old page frag references captured in the source scatterlist are never released: 1. sg[] is built from old frags via skb_to_sgvec() (no extra get_page) 2. nr_frags is set to 1 and frag[0] is replaced with the new page 3. Second skb_to_sgvec() fails -> goto error_free 4. kfree(tmp) frees the sg[] memory but old frags are not unref'd 5. kfree_skb() only releases frag[0] (the new page), not the old ones Fix this by adding a bool parameter to esp_ssg_unref() that, when true, unconditionally unrefs the source scatterlist frags without checking req->src and req->dst, since those fields are not yet initialized by aead_request_set_crypt() at the point of the error. Existing callers pass false to preserve the original behavior. The same issue exists in both esp4 and esp6 as the code is identical. Fixes: cac2661c53f3 ("esp4: Avoid skb_cow_data whenever possible") Fixes: 03e2a30f6a27 ("esp6: Avoid skb_cow_data whenever possible") Signed-off-by: Alessandro Schino <7991aleschino@gmail.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2026-05-21	ipv6: ioam: refresh hdr pointer before ioam6_event()	Justin Iurman	1	-3/+3
	Reported by Sashiko: In ipv6_hop_ioam(), the hdr pointer is initialized to point into the skb's linear data buffer. Later, the code calls skb_ensure_writable(), which might reallocate the buffer: if (skb_ensure_writable(skb, optoff + 2 + hdr->opt_len)) goto drop; /* Trace pointer may have changed */ trace = (struct ioam6_trace_hdr *)(skb_network_header(skb) + optoff + sizeof(*hdr)); ioam6_fill_trace_data(skb, ns, trace, true); ioam6_event(IOAM6_EVENT_TRACE, dev_net(skb->dev), GFP_ATOMIC, (void *)trace, hdr->opt_len - 2); If the skb is cloned or lacks sufficient linear headroom, skb_ensure_writable() will invoke pskb_expand_head(), which reallocates the skb's data buffer and frees the old one, invalidating pointers to it. While the code recalculates the trace pointer immediately after the call to skb_ensure_writable(), it fails to recalculate the hdr pointer. This patch fixes the above by recalculating the hdr pointer before passing hdr->opt_len to ioam6_event(), so that we avoid any UaF. Fixes: f655c78d6225 ("net: exthdrs: ioam6: send trace event") Cc: stable@vger.kernel.org Signed-off-by: Justin Iurman <justin.iurman@gmail.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/20260520124242.32320-1-justin.iurman@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-05-21	ipv6: route: Unregister netdevice notifier on BPF init failure	Yuho Choi	1	-1/+5
	ip6_route_init() registers ip6_route_dev_notifier before registering the IPv6 route BPF iterator target. If bpf_iter_register() fails after the notifier has been registered, the error path currently jumps to out_register_late_subsys and unwinds the RTNL handlers and pernet route state without removing the notifier from the netdevice notifier chain. This leaves ip6_route_dev_notify() callable after the IPv6 route state it uses has been torn down. Add a separate unwind label for the BPF iterator failure path and unregister the netdevice notifier before continuing with the existing cleanup. Fixes: 138d0be35b14 ("net: bpf: Add netlink and ipv6_route bpf_iter targets") Signed-off-by: Yuho Choi <dbgh9129@gmail.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/20260520030329.1061183-1-dbgh9129@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
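The unwind ordering described above follows the usual goto-label pattern: each failure jumps to the label that undoes everything registered so far, in reverse order. A minimal sketch, assuming three hypothetical registration steps (`toy_register()`, `toy_unregister()` and `struct toy_state` are illustrative names, not kernel APIs):

```c
#include <assert.h>

/* Hypothetical registration state: 1 = registered, 0 = not registered. */
struct toy_state {
	int handlers;
	int notifier;
	int iter;
};

static int toy_register(int *slot, int fail)
{
	if (fail)
		return -1;
	*slot = 1;
	return 0;
}

static void toy_unregister(int *slot)
{
	*slot = 0;
}

/* fail_iter simulates the last registration step failing. */
static int toy_init(struct toy_state *s, int fail_iter)
{
	int ret;

	assert(s);
	ret = toy_register(&s->handlers, 0);
	if (ret)
		goto out;
	ret = toy_register(&s->notifier, 0);
	if (ret)
		goto out_handlers;
	ret = toy_register(&s->iter, fail_iter);
	if (ret)
		goto out_notifier;	/* dedicated label: undo the notifier too */
	return 0;

out_notifier:
	toy_unregister(&s->notifier);
out_handlers:
	toy_unregister(&s->handlers);
out:
	return ret;
}
```

Jumping straight to `out_handlers` from the last step would leave the notifier registered, which mirrors the bug the commit describes.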
2026-05-21	tcp: fix stale per-CPU tcp_tw_isn leak enabling ISN prediction	Eric Dumazet	1	-1/+2
	Blamed commit moved the TIME_WAIT-derived ISN from the skb control block to a per-CPU variable, assuming the value would always be consumed by tcp_conn_request() for the same packet that wrote it. That assumption is violated by multiple drop paths between the producer (__this_cpu_write(tcp_tw_isn, isn) in tcp_v{4,6}_rcv()) and the consumer (tcp_conn_request()): - min_ttl / min_hopcount check - xfrm policy check - tcp_inbound_hash() MD5/AO mismatch - tcp_filter() eBPF/SO_ATTACH_FILTER drop - th->syn && th->fin discard in tcp_rcv_state_process() TCP_LISTEN - psp_sk_rx_policy_check() in tcp_v{4,6}_do_rcv() - tcp_checksum_complete() in tcp_v{4,6}_do_rcv() - tcp_v{4,6}_cookie_check() returning NULL When a packet is dropped on any of these paths, tcp_tw_isn is left set. The next SYN processed on the same CPU then consumes the non zero value in tcp_conn_request(), receiving a potentially predictable ISN. This patch moves back tcp_tw_isn to skb->cb[], getting rid of the per-cpu variable. Note that tcp_v{4,6}_fill_cb() do not set it. Very little impact on overall code size/complexity: $ scripts/bloat-o-meter -t vmlinux.old vmlinux.new add/remove: 0/0 grow/shrink: 2/1 up/down: 8/-15 (-7) Function old new delta tcp_v6_rcv 3038 3042 +4 tcp_v4_rcv 3035 3039 +4 tcp_conn_request 2938 2923 -15 Total: Before=24436060, After=24436053, chg -0.00% Fixes: 41eecbd712b7 ("tcp: replace TCP_SKB_CB(skb)->tcp_tw_isn with a per-cpu field") Reported-by: Chris Mason <clm@meta.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20260519084611.2485277-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
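Why a shared slot leaks across packets while a per-packet slot does not can be modelled in a few lines. This is a single-threaded toy, assuming one global variable in place of the per-CPU field and one struct member in place of skb->cb[]; all `toy_*` names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct toy_pkt {
	uint32_t tw_isn;	/* per-packet slot, like a field in skb->cb[] */
	bool drop;		/* dropped between producer and consumer */
};

static uint32_t toy_percpu_isn;	/* shared slot, like a per-CPU variable */

/* Shared-slot variant: returns the ISN hint the consumer sees, 0 if none. */
static uint32_t toy_rcv_shared(const struct toy_pkt *p, uint32_t isn)
{
	uint32_t seen;

	assert(p);
	if (isn)
		toy_percpu_isn = isn;	/* producer */
	if (p->drop)
		return 0;		/* drop path: the slot stays set */
	seen = toy_percpu_isn;		/* consumer */
	toy_percpu_isn = 0;
	return seen;
}

/* Per-packet variant: the value cannot outlive the packet that wrote it. */
static uint32_t toy_rcv_per_packet(struct toy_pkt *p, uint32_t isn)
{
	assert(p);
	p->tw_isn = isn;		/* producer */
	if (p->drop)
		return 0;
	return p->tw_isn;		/* consumer */
}
```

In the shared variant, a dropped packet that wrote 1234 causes the next, unrelated packet to consume 1234; in the per-packet variant the second packet sees 0.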
2026-05-20	ipv6: ioam: add NULL check for idev in ipv6_hop_ioam()	Justin Iurman	1	-2/+13
	Reported by Sashiko: The function ipv6_hop_ioam() accesses __in6_dev_get(skb->dev)->cnf.ioam6_enabled without validating the returned idev pointer. Because addrconf_ifdown() can concurrently clear dev->ip6_ptr via RCU, __in6_dev_get() can return NULL during interface teardown, which could cause a NULL pointer dereference when processing an IOAM Hop-by-Hop option. Let's add a check and use SKB_DROP_REASON_IPV6DISABLED accordingly. Fixes: 9ee11f0fff20 ("ipv6: ioam: Data plane support for Pre-allocated Trace") Cc: stable@vger.kernel.org Signed-off-by: Justin Iurman <justin.iurman@gmail.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/20260517183059.29140-1-justin.iurman@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-05-16	netfilter: ip6t_hbh: reject oversized option lists	Zhengchuan Liang	1	-0/+4
	struct ip6t_opts stores at most IP6T_OPTS_OPTSNR option descriptors, but hbh_mt6_check() does not reject larger optsnr values supplied from userspace. Validate optsnr in the rule setup path so only match data that fits the fixed-size opts array can be installed. This follows the existing xtables pattern of rejecting invalid user-provided counts in checkentry() and keeps the packet matching path unchanged. `struct ip6t_opts` has a fixed `opts[IP6T_OPTS_OPTSNR]` array, where `IP6T_OPTS_OPTSNR` is 16, so without this check an off-by-one array access is possible: [ 137.924693][ T8692] UBSAN: array-index-out-of-bounds in ../net/ipv6/netfilter/ip6t_hbh.c:110:29 [ 137.926167][ T8692] index 16 is out of range for type '__u16 [16]' Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Cc: stable@kernel.org Reported-by: Yuan Tan <yuantan098@gmail.com> Reported-by: Yifan Wu <yifanwucs@gmail.com> Reported-by: Juefei Pu <tomapufckgml@gmail.com> Reported-by: Xin Liu <bird@lzu.edu.cn> Signed-off-by: Zhengchuan Liang <zcliangcn@gmail.com> Signed-off-by: Ren Wei <n05ec@lzu.edu.cn> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
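Validating a user-supplied count once at rule setup, so the packet path can index the fixed array safely, can be sketched as follows. The struct is a simplified, hypothetical version (`toy_opts`, `toy_check()`); only the 16-entry array size is taken from the commit message, and the real field layout differs.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define TOY_OPTSNR 16	/* mirrors the 16-entry array named in the commit */

/* Simplified, hypothetical match data; not the real struct ip6t_opts. */
struct toy_opts {
	uint8_t optsnr;			/* count supplied by userspace */
	uint16_t opts[TOY_OPTSNR];	/* fixed-size descriptor array */
};

/* Setup-time check: reject counts that cannot fit the array. */
static int toy_check(const struct toy_opts *info)
{
	assert(info);
	if (info->optsnr > TOY_OPTSNR)
		return -EINVAL;
	return 0;
}
```

Once `toy_check()` has accepted a rule, a loop bounded by `optsnr` can index `opts[]` without a per-packet bounds test, which is why the commit leaves the matching path unchanged.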
2026-05-16	xfrm: ah: use skb_to_full_sk in async output callbacks	Michael Bommarito	1	-1/+1
	When AH output is offloaded to an asynchronous crypto provider (hardware accelerators such as AMD CCP, or a forced-async software shim used for testing), the digest completion fires ah_output_done() / ah6_output_done() on a workqueue. The egress skb at that point may have been originated by a TCP listener sending a SYN-ACK, which sets skb->sk to a request_sock via skb_set_owner_edemux(); it may also have been originated by an inet_timewait_sock retransmit. Neither is a full struct sock, and passing the raw skb->sk to xfrm_output_resume() then forwards a non-full socket through the rest of the xfrm output chain. xfrm_output_resume() and its downstream consumers expect a full sk where they dereference at all. The natural egress path through ah_output_done() does not crash today because the consumers that read past sock_common are either gated by sk_fullsock() or short-circuit on flags that are clear on a fresh request_sock; an exhaustive walk of the 50 most plausible consumers under sch_fq, dev_queue_xmit, netfilter, tc-egress and cgroup-egress BPF found no current unguarded deref. The bug is still a real type confusion that future consumer changes could turn into a memory-corruption primitive. This is the same bug class fixed for ESP in commit 1620c88887b1 ("xfrm: Fix the usage of skb->sk"). Apply the analogous fix to AH: convert skb->sk to a full socket pointer (or NULL) via skb_to_full_sk() before handing it to xfrm_output_resume(). The same async AH callbacks were touched recently for an independent ESN-related ICV layout bug in commit ec54093e6a8f ("xfrm: ah: account for ESN high bits in async callbacks"); the sk type-confusion addressed here is orthogonal. This patch is part of an ongoing audit of the AH callback paths; an ah_output ihl-validation hardening series is also currently under review on netdev. Reproduced under UML + KASAN + lockdep with a forced-async hmac(sha1) shim that registers at priority 9999 and wraps the sync in-tree hmac-sha1-lib. 
With the shim loaded, ah_output_done runs on every SYN-ACK egress through a transport-mode AH SA and skb->sk arrives as a request_sock (TCP_NEW_SYN_RECV); after this patch, xfrm_output_resume() receives the listener (the result of sk_to_full_sk()) and consumer derefs land on full-sock fields as intended. Fixes: 9ab1265d5231 ("xfrm: Use actual socket sk instead of skb socket for xfrm_output_resume") Cc: stable@vger.kernel.org Assisted-by: Claude:claude-opus-4-7 Signed-off-by: Michael Bommarito <michael.bommarito@gmail.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2026-05-09	Merge tag 'nf-26-05-08' of ↵	Jakub Kicinski	6	-100/+66
	git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf Pablo Neira Ayuso says: ==================== Netfilter fixes for net The following batch contains Netfilter fixes for net: 1) Allow initial x_tables table replacement without emitting an audit log message. Delay the register message until after hooks are wired up to avoid unnecessary unregister logs during error unwinding. 2) Fix a NULL dereference by allocating hook ops before adding the table to the per-netns list. Use `synchronize_rcu()` during error unwinding to ensure the table stops processing packets before teardown. Defer audit log register message until all operations succeed. 3) Refactor xtables to use a single `xt_unregister_table_pre_exit` function. Eliminate code duplication by centralizing table unregistration logic within the xtables core. ebtables cannot be changed due to incompatibility. 4) Unregister xtables templates before module removal. This prevents a race condition where userspace instantiates a new table after the pernet unreg removed the current table. 5) Add `xtables_unregister_table_exit` to fully unregister netfilter tables during module removal. Unlink the table from dying lists, then free hook operations. 6) Implement a two-stage removal scheme for ebtables following the x_tables pattern. Assign table->ops while holding the ebt mutex to prevent exposing partially-filled structures. 7) Fix ebtables module initialization race. Register the template last in table initialization functions. Prevent table instantiation before pernet operations are available. 8) Fix a race condition in x_tables module initialization. Ensure pernet ops are fully set up before exposing the table to userspace. 9) Fix a race condition in ebtables module initialization, similar to previous patch. 10) Restore propagation of helper to expected connection, this is a fix-for-recent-fix. 
11) Validate that the expectation tuple and mask netlink attributes are present when adding expectation via nfqueue, this fixes a possible null-ptr-deref. 12) Fix possible rare memleak in the SIP helper in case helper has been detached from conntrack entry, from Li Xiasong. 13) Fix refcount leak in nft_ct when creating custom expectation, also from Li Xiasong. Patches 1-9 from Florian Westphal. * tag 'nf-26-05-08' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf: netfilter: nft_ct: fix missing expect put in obj eval netfilter: nf_conntrack_sip: get helper before allocating expectation netfilter: ctnetlink: check tuple and mask in expectations created via nfqueue netfilter: nf_conntrack_expect: restore helper propagation via expectation netfilter: bridge: eb_tables: close module init race netfilter: x_tables: close dangling table module init race netfilter: ebtables: close dangling table module init race netfilter: ebtables: move to two-stage removal scheme netfilter: x_tables: add and use xtables_unregister_table_exit netfilter: x_tables: unregister the templates first netfilter: x_tables: add and use xt_unregister_table_pre_exit netfilter: x_tables: allocate hook ops while under mutex netfilter: x_tables: allow initial table replace without emitting audit log message ==================== Link: https://patch.msgid.link/20260507234509.603182-1-pablo@netfilter.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-05-09	ipv6: flowlabel: enforce per-netns limit for unprivileged callers	Maoyi Xie	1	-3/+11
	fl_size, fl_ht and ip6_fl_lock in net/ipv6/ip6_flowlabel.c are file scope and shared across netns. mem_check() reads fl_size to decide whether to deny non-CAP_NET_ADMIN callers. capable() runs against init_user_ns, so an unprivileged user in any non-init userns can push fl_size past FL_MAX_SIZE - FL_MAX_SIZE / 4 and starve every other unprivileged userns on the host. Add struct netns_ipv6::flowlabel_count, bumped and decremented next to fl_size in fl_intern, ip6_fl_gc and ip6_fl_purge. The new field fills the existing 4-byte hole after ipmr_seq, so struct netns_ipv6 stays the same size on 64-bit builds. Bump FL_MAX_SIZE from 4096 to 8192. It has been 4096 since the file was added. Machines and connection counts have grown. mem_check() folds an extra per-netns ceiling into the existing non-CAP_NET_ADMIN conditional. The ceiling is half of the total budget that unprivileged callers have ever been able to use, i.e. (FL_MAX_SIZE - FL_MAX_SIZE / 4) / 2 = 3072 entries. With FL_MAX_SIZE doubled, this preserves the original per-user reach of 3K (what an unprivileged caller could already obtain before this change), while forcing an attacker to spread allocations across at least two netns to exhaust the global non-CAP_NET_ADMIN budget. CAP_NET_ADMIN against init_user_ns still bypasses both caps. The previous patch took ip6_fl_lock across mem_check and fl_intern, so the new flowlabel_count read in mem_check and the new flowlabel_count++ in fl_intern run under the same critical section. flowlabel_count is therefore plain int, like fl_size. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Suggested-by: Willem de Bruijn <willemb@google.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Cc: stable@vger.kernel.org # v5.15+ Signed-off-by: Maoyi Xie <maoyi.xie@ntu.edu.sg> Link: https://patch.msgid.link/20260506082416.2259567-3-maoyixie.tju@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
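The budget arithmetic in this message can be checked with a short sketch. The constants follow the numbers quoted above; the function shape, the `toy_*` names and the exact comparison operators are assumptions, and the real mem_check() also weighs per-socket counts that are left out here.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_FL_MAX_SIZE		8192
/* Global budget for callers without CAP_NET_ADMIN: 8192 - 2048 = 6144. */
#define TOY_GLOBAL_UNPRIV	(TOY_FL_MAX_SIZE - TOY_FL_MAX_SIZE / 4)
/* Per-netns ceiling: half of the unprivileged budget = 3072. */
#define TOY_PER_NETNS		(TOY_GLOBAL_UNPRIV / 2)

/*
 * Returns true when the allocation should be denied.
 * fl_size is the global count, netns_count the per-netns count.
 */
static bool toy_mem_check(int fl_size, int netns_count, bool net_admin)
{
	assert(fl_size >= 0 && netns_count >= 0);
	if (net_admin)
		return false;	/* privileged callers bypass both caps */
	return fl_size > TOY_GLOBAL_UNPRIV || netns_count > TOY_PER_NETNS;
}
```

Because a single netns is capped at 3072 entries and the global unprivileged budget is 6144, at least two namespaces are needed to exhaust the budget, as the message states.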
2026-05-09	ipv6: flowlabel: take ip6_fl_lock across mem_check and fl_intern	Maoyi Xie	1	-13/+21
	mem_check() in net/ipv6/ip6_flowlabel.c reads fl_size without holding ip6_fl_lock. fl_intern() takes the lock immediately afterwards. The two checks therefore race against concurrent fl_intern, ip6_fl_gc and ip6_fl_purge writers, which makes the mem_check budget check approximate. Move spin_lock_bh(&ip6_fl_lock) and the matching unlock from fl_intern() into its only caller ipv6_flowlabel_get(). The mem_check() call now runs under the same critical section as the fl_intern() insert, so the budget check is exact. With all writers and the read of fl_size under ip6_fl_lock, convert fl_size from atomic_t to plain int. The four sites that update or read fl_size are fl_intern (insert path), ip6_fl_gc (garbage collector, the !sched check and the per-entry decrement), ip6_fl_purge (per-netns purge), and mem_check (budget check), and all four now run under ip6_fl_lock. This is a prerequisite for adding a per-netns budget alongside fl_size. The follow-up patch adds netns_ipv6::flowlabel_count and folds it into mem_check(). Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Suggested-by: Willem de Bruijn <willemb@google.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Maoyi Xie <maoyi.xie@ntu.edu.sg> Link: https://patch.msgid.link/20260506082416.2259567-2-maoyixie.tju@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-05-08	netfilter: x_tables: close dangling table module init race	Florian Westphal	4	-43/+45
	Similar to the previous ebtables patch: template add exposes the table to userspace, we must do this last to ensure the pernet ops are set up (contain the destructors). Fixes: fdacd57c79b7 ("netfilter: x_tables: never register tables by default") Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2026-05-08	netfilter: x_tables: add and use xtables_unregister_table_exit	Florian Westphal	2	-6/+8
	Previous change added xtables_unregister_table_pre_exit to detach the table from the packetpath and to unlink it from the active table list. In case of rmmod, userspace that is doing set/getsockopt for this table will not be able to re-instantiate the table: 1. The larval table has been removed already 2. existing instantiated table is no longer on the xt pernet table list. This adds the second stage helper: unlink the table from the dying list, free the hook ops (if any) and do the audit notification. It replaces xt_unregister_table(). Fixes: fdacd57c79b7 ("netfilter: x_tables: never register tables by default") Reported-by: Tristan Madani <tristan@talencesecurity.com> Reviewed-by: Tristan Madani <tristan@talencesecurity.com> Closes: https://lore.kernel.org/netfilter-devel/20260429175613.1459342-1-tristmd@gmail.com/ Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2026-05-08	netfilter: x_tables: unregister the templates first	Florian Westphal	4	-4/+4
	When the module is going away we need to zap the template first. Else there is a small race window where userspace could instantiate a new table after the pernet exit function has removed the current table. Fixes: fdacd57c79b7 ("netfilter: x_tables: never register tables by default") Reported-by: Tristan Madani <tristan@talencesecurity.com> Reviewed-by: Tristan Madani <tristan@talencesecurity.com> Closes: https://lore.kernel.org/netfilter-devel/20260429175613.1459342-1-tristmd@gmail.com/ Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2026-05-08	netfilter: x_tables: add and use xt_unregister_table_pre_exit	Florian Westphal	6	-13/+5
	Remove the copypasted variants of _pre_exit and add one single function in the xtables core. ebtables is not compatible with x_tables and therefore unchanged. This is a preparation patch to reduce noise in the followup bug fixes. Reviewed-by: Tristan Madani <tristan@talencesecurity.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2026-05-08	netfilter: x_tables: allocate hook ops while under mutex	Florian Westphal	1	-34/+4
	arp/ip(6)t_register_table() add the table to the per-netns list via xt_register_table() before allocating the per-netns hook ops copy via kmemdup_array(). This leaves a window where the table is visible in the list with ops=NULL. If the pernet exit runs concurrently, the pre_exit callback finds the table via xt_find_table() and passes the NULL ops pointer to nf_unregister_net_hooks(), causing a NULL dereference: general protection fault in nf_unregister_net_hooks+0xbc/0x150 RIP: nf_unregister_net_hooks (net/netfilter/core.c:613) Call Trace: ipt_unregister_table_pre_exit iptable_mangle_net_pre_exit ops_pre_exit_list cleanup_net Fix by moving the ops allocation into the xtables core so the table is never in the list without valid ops. Also ensure the table is no longer processing packets before it's torn down on error unwind. nf_register_net_hooks might have published at least one hook; call synchronize_rcu() if there was an error. The audit log register message gets deferred until all operations have passed; this avoids the need to emit another unreg message in case of error unwinding. Based on earlier patch by Tristan Madani. Fixes: f9006acc8dfe5 ("netfilter: arp_tables: pass table pointer via nf_hook_ops") Fixes: ee177a54413a ("netfilter: ip6_tables: pass table pointer via nf_hook_ops") Fixes: ae689334225f ("netfilter: ip_tables: pass table pointer via nf_hook_ops") Link: https://lore.kernel.org/netfilter-devel/20260429175613.1459342-1-tristmd@gmail.com/ Signed-off-by: Tristan Madani <tristan@talencesecurity.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2026-05-07	tcp: Fix dst leak in tcp_v6_connect().	Kuniyuki Iwashima	1	-1/+3
	If a socket is bound to a wildcard address, tcp_v[46]_connect() updates it with a non-wildcard address based on the route lookup. After bhash2 was introduced in the cited commit, we must call inet_bhash2_update_saddr() to update the bhash2 entry as well. If inet_bhash2_update_saddr() fails, we must release the refcount for dst by ip_route_connect() or ip6_dst_lookup_flow(). While tcp_v4_connect() calls ip_rt_put() in the error path, tcp_v6_connect() does not call dst_release(). Let's call dst_release() when inet_bhash2_update_saddr() fails in tcp_v6_connect(). Fixes: 28044fc1d495 ("net: Add a bhash2 table hashed by port and address") Reported-by: Damiano Melotti <melotti@google.com> Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260506070443.1699879-1-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
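The reference rule at stake here is that a lookup that takes a reference must be paired with a release on every error exit. A minimal sketch with a plain counter, assuming hypothetical names (`toy_dst`, `toy_route_lookup()`, `toy_dst_release()`, `toy_connect()`) rather than the real routing API:

```c
#include <assert.h>
#include <stddef.h>

struct toy_dst {
	int refcnt;
};

/* The lookup hands back the route with one extra reference held. */
static struct toy_dst *toy_route_lookup(struct toy_dst *d)
{
	d->refcnt++;
	return d;
}

static void toy_dst_release(struct toy_dst *d)
{
	assert(d->refcnt > 0);
	d->refcnt--;
}

/* update_fails simulates the source-address update step failing. */
static int toy_connect(struct toy_dst *route, int update_fails,
		       struct toy_dst **cached)
{
	struct toy_dst *dst = toy_route_lookup(route);

	if (update_fails) {
		toy_dst_release(dst);	/* error path must drop the reference */
		return -1;
	}
	*cached = dst;			/* success: the caller keeps it */
	return 0;
}
```

Dropping the `toy_dst_release()` call in the error branch would leave `refcnt` at 1 with no owner, which is the leak pattern the commit fixes for the IPv6 path.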
2026-05-07	tcp: tcp_child_process() related UAF	Eric Dumazet	1	-5/+8
	tcp_child_process( .. child ...) currently calls sock_put(child). Unfortunately @child (named @nsk in callers) can be used after this point to send a RST packet. To fix this UAF, I remove the sock_put() from tcp_child_process() and let the callers handle this after it is safe. Remove @rsk variable in tcp_v4_do_rcv() and change tcp_v6_do_rcv() so that both functions look the same. Fixes: cfb6eeb4c860 ("[TCP]: MD5 Signature Option (RFC2385) support.") Reported-by: Damiano Melotti <melotti@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20260505153927.3435532-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-05-07	ipv6: fix potential UAF caused by ip6_forward_proxy_check()	Eric Dumazet	1	-0/+3
	ip6_forward_proxy_check() calls pskb_may_pull() which might re-allocate skb->head. Reload ipv6_hdr() after the pskb_may_pull() call to avoid using the freed memory. Fixes: e21e0b5f19ac ("[IPV6] NDISC: Handle NDP messages to proxied addresses.") Reported-by: Damiano Melotti <melotti@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/20260505130056.2927197-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-05-07	Merge tag 'ipsec-2026-05-05' of ↵	Jakub Kicinski	4	-4/+19
	git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec Steffen Klassert says: ==================== pull request (net): ipsec 2026-05-05 1. Fix an IPv6 encapsulation error path that leaked route references when UDPv6 ESP decapsulation resolved to an error route. From Yilin Zhu. 2. Fix AH with ESN on async crypto paths by accounting for the extra high-order sequence number when reconstructing the temporary authentication layout in the completion callbacks. From Michael Bommarito. 3. Fix XFRM output so it does not overwrite already-correct inner header pointers when a tunnel layer such as VXLAN has already saved them. The fix comes with new selftests. From Cosmin Ratiu. 4. Add the missing native payload size entry for XFRM_MSG_MAPPING in the compat translation path. From Ruijie Li. 5. Harden __xfrm_state_delete() against repeated or inconsistent unhashing of state list nodes by keying the removal on actual list membership and using delete-and-init helpers. From Michal Kosiorek. 6. Prevent ESP from decrypting shared splice-backed skb fragments in place by marking UDP splice frags as shared and forcing copy-on-write in ESP input when needed. From Kuan-Ting Chen. * tag 'ipsec-2026-05-05' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec: xfrm: esp: avoid in-place decrypt on shared skb frags xfrm: defensively unhash xfrm_state lists in __xfrm_state_delete xfrm: provide message size for XFRM_MSG_MAPPING xfrm: Don't clobber inner headers when already set tools/selftests: Add a VXLAN+IPsec traffic test tools/selftests: Use a sensible timeout value for iperf3 client xfrm: ah: account for ESN high bits in async callbacks ipv6: xfrm6: release dst on error in xfrm6_rcv_encap() ==================== Link: https://patch.msgid.link/20260505132326.1362733-1-steffen.klassert@secunet.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-05-06	ipv6: Fix null-ptr-deref in fib6_mtu().	Kuniyuki Iwashima	1	-0/+4
	syzbot reported null-ptr-deref in fib6_mtu(). [0]
	When res->f6i->fib6_pmtu is 0 in fib6_mtu(), it fetches MTU from __in6_dev_get(nh->fib_nh_dev)->cnf.mtu6.
	However, __in6_dev_get() could return NULL when the device is being unregistered.
	Let's return 0 MTU if __in6_dev_get() returns NULL in fib6_mtu().
	[0]:
	Oops: general protection fault, probably for non-canonical address 0xdffffc00000000bc: 0000 [#1] SMP KASAN NOPTI
	KASAN: null-ptr-deref in range [0x00000000000005e0-0x00000000000005e7]
	CPU: 0 UID: 0 PID: 7890 Comm: syz.2.502 Tainted: G L syzkaller #0 PREEMPT(full)
	Tainted: [L]=SOFTLOCKUP
	Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
	RIP: 0010:fib6_mtu net/ipv6/route.c:1648 [inline]
	RIP: 0010:rt6_insert_exception+0x9eb/0x10a0 net/ipv6/route.c:1753
	Code: 3b 14 cf f7 45 85 f6 0f 85 1d 02 00 00 e8 7d 19 cf f7 48 8d bb e0 05 00 00 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 <0f> b6 14 02 48 89 f8 83 e0 07 83 c0 03 38 d0 7c 08 84 d2 0f 85 89
	RSP: 0000:ffffc9000610f120 EFLAGS: 00010202
	RAX: dffffc0000000000 RBX: 0000000000000000 RCX: ffffc9000c001000
	RDX: 00000000000000bc RSI: ffffffff8a38bc83 RDI: 00000000000005e0
	RBP: ffff888052f06000 R08: 0000000000000005 R09: 0000000000000000
	R10: 0000000000000001 R11: 0000000000000000 R12: ffff888042d16c00
	R13: ffff888042d16cc8 R14: 0000000000000001 R15: 0000000000000500
	FS: 0000000000000000(0000) GS:ffff88809717d000(0063) knlGS:00000000f540db40
	CS: 0010 DS: 002b ES: 002b CR0: 0000000080050033
	CR2: 00000000f73c6d50 CR3: 000000006eff0000 CR4: 0000000000352ef0
	Call Trace:
	<TASK>
	__ip6_rt_update_pmtu+0x555/0xd60 net/ipv6/route.c:2982
	ip6_update_pmtu+0x34f/0x3b0 net/ipv6/route.c:3014
	icmpv6_err+0x2a2/0x3f0 net/ipv6/icmp.c:82
	icmpv6_notify+0x35e/0x820 net/ipv6/icmp.c:1087
	icmpv6_rcv+0x10bf/0x1ae0 net/ipv6/icmp.c:1228
	ip6_protocol_deliver_rcu+0xf97/0x1500 net/ipv6/ip6_input.c:478
	ip6_input_finish+0x1e4/0x4a0 net/ipv6/ip6_input.c:529
	NF_HOOK include/linux/netfilter.h:318 [inline]
	NF_HOOK include/linux/netfilter.h:312 [inline]
	ip6_input+0x105/0x2f0 net/ipv6/ip6_input.c:540
	ip6_mc_input+0x513/0xf50 net/ipv6/ip6_input.c:630
	dst_input include/net/dst.h:480 [inline]
	ip6_rcv_finish net/ipv6/ip6_input.c:119 [inline]
	NF_HOOK include/linux/netfilter.h:318 [inline]
	NF_HOOK include/linux/netfilter.h:312 [inline]
	ipv6_rcv+0x34c/0x3d0 net/ipv6/ip6_input.c:351
	__netif_receive_skb_one_core+0x12d/0x1e0 net/core/dev.c:6202
	__netif_receive_skb+0x1f/0x120 net/core/dev.c:6315
	netif_receive_skb_internal net/core/dev.c:6401 [inline]
	netif_receive_skb+0x13b/0x7f0 net/core/dev.c:6460
	tun_rx_batched.isra.0+0x3f6/0x750 drivers/net/tun.c:1511
	tun_get_user+0x1e31/0x3c20 drivers/net/tun.c:1955
	tun_chr_write_iter+0xdc/0x200 drivers/net/tun.c:2001
	new_sync_write fs/read_write.c:595 [inline]
	vfs_write+0x6ac/0x1070 fs/read_write.c:688
	ksys_write+0x12a/0x250 fs/read_write.c:740
	do_syscall_32_irqs_on arch/x86/entry/syscall_32.c:83 [inline]
	do_int80_emulation+0x141/0x700 arch/x86/entry/syscall_32.c:172
	asm_int80_emulation+0x1a/0x20 arch/x86/include/asm/idtentry.h:621
	RIP: 0023:0xf715616b
	Code: 57 56 53 8b 44 24 14 f6 00 08 75 23 8b 44 24 18 8b 5c 24 1c 8b 4c 24 20 8b 54 24 24 8b 74 24 28 8b 7c 24 2c 8b 6c 24 30 cd 80 <5b> 5e 5f 5d c3 5b 5e 5f 5d e9 f7 a1 ff ff 66 90 66 90 66 90 90 53
	RSP: 002b:00000000f540d44c EFLAGS: 00000246 ORIG_RAX: 0000000000000004
	RAX: ffffffffffffffda RBX: 00000000000000c8 RCX: 0000000080000640
	RDX: 000000000000007a RSI: 0000000000000000 RDI: 0000000000000000
	RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
	R10: 0000000000000000 R11: 0000000000000292 R12: 0000000000000000
	R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
	</TASK>
	Fixes: dcd1f572954f ("net/ipv6: Remove fib6_idev")
	Reported-by: syzbot+01f005f9c6387ca6f6dd@syzkaller.appspotmail.com
	Closes: https://lore.kernel.org/netdev/69f83f22.170a0220.13cc2.0004.GAE@google.com/
	Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
	Reviewed-by: Ido Schimmel <idosch@nvidia.com>
	Link: https://patch.msgid.link/20260504064316.3820775-1-kuniyu@google.com
	Signed-off-by: Jakub Kicinski <kuba@kernel.org>
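The NULL check described in the fib6_mtu() fix can be sketched as a small user-space model. This is only an illustration of the control flow in the commit message, not the kernel code: `idev_model`, `netdev_model`, `in6_dev_get_model()` and `fib6_mtu_model()` are hypothetical names, and any further adjustment the real fib6_mtu() applies to the MTU is left out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for struct inet6_dev and struct net_device. */
struct idev_model {
	unsigned int mtu6;              /* models idev->cnf.mtu6 */
};

struct netdev_model {
	struct idev_model *ip6_ptr;     /* NULL once the IPv6 state is gone */
};

/* Models __in6_dev_get(): may return NULL while the device unregisters. */
static struct idev_model *in6_dev_get_model(const struct netdev_model *dev)
{
	return dev->ip6_ptr;
}

/* Models the fixed lookup: route PMTU first, then device MTU, else 0. */
static unsigned int fib6_mtu_model(unsigned int fib6_pmtu,
				   const struct netdev_model *dev)
{
	struct idev_model *idev;

	if (fib6_pmtu)
		return fib6_pmtu;

	idev = in6_dev_get_model(dev);
	if (!idev)
		return 0;               /* the added NULL check */

	return idev->mtu6;
}

/* Exercises the three cases described in the commit message. */
void fib6_mtu_model_demo(void)
{
	struct idev_model idev = { .mtu6 = 1500 };
	struct netdev_model up = { .ip6_ptr = &idev };
	struct netdev_model going_away = { .ip6_ptr = NULL };

	assert(fib6_mtu_model(1280, &up) == 1280);
	assert(fib6_mtu_model(0, &up) == 1500);
	assert(fib6_mtu_model(0, &going_away) == 0);
}
```

Without the `!idev` branch, the last case would dereference a NULL pointer, which is the situation syzbot hit.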
2026-05-06	ipv6: default IPV6_SIT to m	Alyssa Ross	1	-2/+2
	IPV6_SIT effectively defaulted to m until recently, because IPV6 itself defaulted to m. Since IPV6 was changed to a boolean with a default of y, IPV6_SIT has defaulted to built-in as well. As a result, defconfig (and defconfig-derived config) users get an unexpected sit0 device at boot. For me, this broke an (admittedly non-robust) script.
	Preserve the behaviour of most configs by not building this module into the kernel by default; it is probably seldom used compared to IPv6 as a whole.
	Fixes: 309b905deee59 ("ipv6: convert CONFIG_IPV6 to built-in only and clean up Kconfigs")
	Signed-off-by: Alyssa Ross <hi@alyssa.is>
	Reviewed-by: Fernando Fernandez Mancera <fmancera@suse.de>
	Link: https://patch.msgid.link/20260503192515.290900-2-hi@alyssa.is
	Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-05-05	xfrm: esp: avoid in-place decrypt on shared skb frags	Kuan-Ting Chen	2	-1/+4
	MSG_SPLICE_PAGES can attach pages from a pipe directly to an skb. TCP marks such skbs with SKBFL_SHARED_FRAG after skb_splice_from_iter(), so later paths that may modify packet data can first make a private copy.
	The IPv4/IPv6 datagram append paths did not set this flag when splicing pages into UDP skbs. That leaves an ESP-in-UDP packet made from shared pipe pages looking like an ordinary uncloned nonlinear skb. ESP input then takes the no-COW fast path for uncloned skbs without a frag_list and decrypts in place over data that is not owned privately by the skb.
	Mark IPv4/IPv6 datagram splice frags with SKBFL_SHARED_FRAG, matching TCP. Also make ESP input fall back to skb_cow_data() when the flag is present, so ESP does not decrypt externally backed frags in place. Private nonlinear skb frags still use the existing fast path.
	This intentionally does not change ESP output. In esp_output_head(), the path that appends the ESP trailer to existing skb tailroom without calling skb_cow_data() is not reachable for nonlinear skbs: skb_tailroom() returns zero when skb->data_len is nonzero, while ESP tailen is positive. Thus ESP output will either use the separate destination-frag path or fall back to skb_cow_data().
	Fixes: cac2661c53f3 ("esp4: Avoid skb_cow_data whenever possible")
	Fixes: 03e2a30f6a27 ("esp6: Avoid skb_cow_data whenever possible")
	Fixes: 7da0dde68486 ("ip, udp: Support MSG_SPLICE_PAGES")
	Fixes: 6d8192bd69bb ("ip6, udp6: Support MSG_SPLICE_PAGES")
	Reported-by: Hyunwoo Kim <imv4bel@gmail.com>
	Reported-by: Kuan-Ting Chen <h3xrabbit@gmail.com>
	Tested-by: Hyunwoo Kim <imv4bel@gmail.com>
	Cc: stable@vger.kernel.org
	Signed-off-by: Kuan-Ting Chen <h3xrabbit@gmail.com>
	Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
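The ESP input decision described above can be summarised in a small user-space model. It is a simplified sketch built only from the conditions named in the commit message: `skb_model` and both predicate functions are hypothetical, and the real code paths check more state than these three flags.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical summary of the skb properties the commit message mentions. */
struct skb_model {
	bool cloned;        /* models skb_cloned() */
	bool has_frag_list; /* models skb_has_frag_list() */
	bool shared_frag;   /* models SKBFL_SHARED_FRAG being set */
};

/* Before the fix: the shared-frag marker is not consulted. */
static bool esp_input_needs_cow_old(const struct skb_model *skb)
{
	return skb->cloned || skb->has_frag_list;
}

/* After the fix: shared frags also force the skb_cow_data() path. */
static bool esp_input_needs_cow_new(const struct skb_model *skb)
{
	return skb->cloned || skb->has_frag_list || skb->shared_frag;
}

/* Compares both predicates on a spliced skb and on a private one. */
void esp_cow_model_demo(void)
{
	struct skb_model spliced = { .shared_frag = true };
	struct skb_model private_frags = { 0 };

	assert(!esp_input_needs_cow_old(&spliced)); /* in-place decrypt: the bug */
	assert(esp_input_needs_cow_new(&spliced));  /* now copies first */
	assert(!esp_input_needs_cow_new(&private_frags)); /* fast path kept */
}
```

The model only helps on input if the datagram splice paths actually set the flag, which is the other half of the change.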
2026-05-02	ip6_gre: Use cached t->net in ip6erspan_changelink().	Maoyi Xie	1	-2/+3
	After commit 5e72ce3e3980 ("net: ipv6: Use link netns in newlink() of rtnl_link_ops"), ip6erspan_newlink() correctly resolves the per-netns ip6gre hash via link_net. ip6erspan_changelink() was not converted in that series and still uses dev_net(dev), which diverges from the device's creation netns after IFLA_NET_NS_FD migration. This re-inserts the tunnel into the wrong per-netns hash. The original netns keeps a stale entry. When that netns is later destroyed, ip6gre_exit_rtnl_net() walks the stale entry, producing a slab-use-after-free reported by KASAN, followed by a kernel BUG at net/core/dev.c (LIST_POISON1) in unregister_netdevice_many_notify().
	Reachable from an unprivileged user namespace (unshare --user --map-root-user --net).
	ip6gre_changelink() earlier in the same file already uses the cached t->net; only ip6erspan_changelink() has the wrong shape.
	Fixes: 2d665034f239 ("net: ip6_gre: Fix ip6erspan hlen calculation")
	Cc: stable@vger.kernel.org # v5.15+
	Signed-off-by: Maoyi Xie <maoyi.xie@ntu.edu.sg>
	Reviewed-by: Eric Dumazet <edumazet@google.com>
	Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
	Link: https://patch.msgid.link/20260430103318.3206018-1-maoyi.xie@ntu.edu.sg
	Signed-off-by: Jakub Kicinski <kuba@kernel.org>
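The effect of choosing dev_net(dev) instead of the cached t->net can be illustrated with a toy model. Everything here is an assumption made for illustration: the per-netns table is a flat array rather than the real hash chains, and `netns_model`, `tunnel_model` and `changelink_model()` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define HASH_SLOTS 4

struct tunnel_model;

/* Hypothetical per-netns tunnel table; the real one uses hash chains. */
struct netns_model {
	const struct tunnel_model *slot[HASH_SLOTS];
};

struct tunnel_model {
	struct netns_model *net;     /* models the cached t->net */
	struct netns_model *dev_net; /* models dev_net(dev) after migration */
};

static bool ns_contains(const struct netns_model *ns,
			const struct tunnel_model *t)
{
	for (size_t i = 0; i < HASH_SLOTS; i++)
		if (ns->slot[i] == t)
			return true;
	return false;
}

static void ns_unlink(struct netns_model *ns, const struct tunnel_model *t)
{
	for (size_t i = 0; i < HASH_SLOTS; i++)
		if (ns->slot[i] == t)
			ns->slot[i] = NULL;
}

static void ns_link(struct netns_model *ns, const struct tunnel_model *t)
{
	for (size_t i = 0; i < HASH_SLOTS; i++) {
		if (!ns->slot[i]) {
			ns->slot[i] = t;
			return;
		}
	}
}

/* Unlink and relink the tunnel in the table selected by the caller. */
static void changelink_model(struct tunnel_model *t, bool use_cached_net)
{
	struct netns_model *ns = use_cached_net ? t->net : t->dev_net;

	ns_unlink(ns, t);
	ns_link(ns, t);
}

/* Shows the tunnel ending up in two tables with the pre-fix choice. */
void changelink_model_demo(void)
{
	struct netns_model creation = { { NULL } }, other = { { NULL } };
	struct tunnel_model t = { .net = &creation, .dev_net = &creation };

	ns_link(&creation, &t);
	t.dev_net = &other;             /* device migrated to another netns */

	changelink_model(&t, true);     /* fixed behaviour */
	assert(ns_contains(&creation, &t) && !ns_contains(&other, &t));

	changelink_model(&t, false);    /* pre-fix behaviour */
	assert(ns_contains(&creation, &t) && ns_contains(&other, &t));
}
```

In the model, the pre-fix path leaves the tunnel reachable from both tables, which corresponds to the stale entry the commit message describes.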
2026-05-02	ipv6: update route serial number on NETDEV_CHANGE	Sagarika Sharma	1	-0/+1
	When using IPv6 ECMP routes, if a netdev listed as a nexthop experiences a carrier change event (e.g., a bond device generating a NETDEV_CHANGE event after its slaves go linkdown), established connections utilizing that nexthop fail to fail over to other available nexthops. Instead, these connections stall or drop.
	This happens because the IPv6 FIB code does not invalidate the socket's cached destination when a NETDEV_CHANGE event occurs. While fib6_ifdown() correctly marks the nexthop with RTNH_F_LINKDOWN, it leaves the route's serial number unchanged. As a result, sockets with a previously cached dst do not realize the route is no longer viable and continue to try using the non-functional nexthop.
	This behavior contrasts with IPv4, which actively flushes cached destinations on a NETDEV_CHANGE event (see fib_netdev_event() in net/ipv4/fib_frontend.c).
	Fix this by updating the route serial number in fib6_ifdown() when setting RTNH_F_LINKDOWN. This invalidates stale cached destinations, forcing sockets to perform a new route lookup and fail over to a functioning nexthop.
	Fixes: 51ebd3181572 ("ipv6: add support of equal cost multipath (ECMP)")
	Signed-off-by: Sagarika Sharma <sharmasagarika@google.com>
	Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
	Reviewed-by: Ido Schimmel <idosch@nvidia.com>
	Reviewed-by: Eric Dumazet <edumazet@google.com>
	Link: https://patch.msgid.link/20260430200909.527827-2-sharmasagarika@google.com
	Signed-off-by: Jakub Kicinski <kuba@kernel.org>
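The serial-number invalidation described above can be modelled in a few lines. This is a hedged sketch, not the kernel data structures: `route_model`, `cached_dst_model` and `ifdown_model()` are hypothetical, the serial number is stored directly on the route for simplicity, and the model just increments it rather than reproducing how the kernel picks a new value.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical route with a serial number and a link-down marker. */
struct route_model {
	unsigned int sernum;  /* models the route serial number */
	bool linkdown;        /* models RTNH_F_LINKDOWN */
};

/* Hypothetical socket-side cache entry. */
struct cached_dst_model {
	const struct route_model *rt;
	unsigned int cookie;  /* serial number seen when the dst was cached */
};

static struct cached_dst_model cache_dst(const struct route_model *rt)
{
	struct cached_dst_model dst = { .rt = rt, .cookie = rt->sernum };

	return dst;
}

/* A cached dst is reused only while its cookie matches the route. */
static bool dst_still_valid(const struct cached_dst_model *dst)
{
	return dst->cookie == dst->rt->sernum;
}

/* Marks the nexthop down; bump_sernum selects pre-fix or fixed behaviour. */
static void ifdown_model(struct route_model *rt, bool bump_sernum)
{
	rt->linkdown = true;
	if (bump_sernum)
		rt->sernum++;
}

/* Shows that only the serial number change invalidates the cached dst. */
void sernum_model_demo(void)
{
	struct route_model rt = { .sernum = 1, .linkdown = false };
	struct cached_dst_model dst = cache_dst(&rt);

	ifdown_model(&rt, false);       /* pre-fix: flag set, sernum kept */
	assert(rt.linkdown && dst_still_valid(&dst));

	ifdown_model(&rt, true);        /* fixed: sernum changes */
	assert(!dst_still_valid(&dst));
}
```

Once the cookie no longer matches, the socket has to look the route up again, and that lookup can pick a nexthop that is not marked link-down.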