kernel/linux.git/include/net/xfrm.h, branch v7.0.12

xfrm: route MIGRATE notifications to caller's netns

2026-06-09T10:32:39+00:00

commit 7e2a4f7ca0952820731ef7bdadfc9a9e9d3571b4 upstream. xfrm_send_migrate() in net/xfrm/xfrm_user.c and pfkey_send_migrate() in net/key/af_key.c both hardcode &init_net for the multicast that announces a successful XFRM_MSG_MIGRATE / SADB_X_MIGRATE. XFRM_MSG_MIGRATE arrives on a per-netns NETLINK_XFRM socket, and the rest of the xfrm/af_key netlink path was made netns-aware in 2008. The other 14 multicast paths in xfrm_user.c route their event using xs_net(x), xp_net(xp) or sock_net(skb->sk); only the migrate path was missed. Two consequences of the init_net hardcoding: 1. The notification (selector, old/new endpoint addresses, and the km_address) is delivered to listeners on init_net's XFRMNLGRP_MIGRATE / pfkey BROADCAST_ALL groups rather than on the issuing netns. An IKE daemon running in init_net therefore receives migration notifications originating from any other netns on the host. 2. An IKE daemon running inside a non-init netns and subscribed to its own XFRMNLGRP_MIGRATE / pfkey groups never receives the notification of its own migration. IKEv2 MOBIKE / address-update handling inside a netns is silently broken. Thread struct net through km_migrate() and the xfrm_mgr.migrate function pointer, drop the &init_net override in xfrm_send_migrate() and pfkey_send_migrate(), and pass the caller's net (already in scope in xfrm_migrate() via sock_net(skb->sk)) all the way down. struct xfrm_mgr is in-tree only and not exported as a stable API, so the function-pointer signature change is internal. pfkey_broadcast() is already netns-aware via net_generic(net, pfkey_net_id) since the pernet conversion. The five other pfkey_broadcast() callers in af_key.c already pass xs_net(x), sock_net(sk) or a per-netns net, so this only removes the &init_net outlier. Fixes: 5c79de6e79cd ("[XFRM]: User interface for handling XFRM_MSG_MIGRATE") Cc: stable@vger.kernel.org # v5.15+ Signed-off-by: Maoyi Xie Signed-off-by: Steffen Klassert Signed-off-by: Greg Kroah-Hartman

xfrm: reduce struct sec_path size

2026-02-11T04:21:48+00:00

The mentioned struct has an hole and uses unnecessary wide type to store MAC length and indexes of very small arrays. It's also embedded into the skb_extensions, and the latter, due to recent CAN changes, may exceeds the 192 bytes mark (3 cachelines on x86_64 arch) on some reasonable configurations. Reordering and the sec_path fields, shrinking xfrm_offload.orig_mac_len to 16 bits and xfrm_offload.{len,olen,verified_cnt} to u8, we can save 16 bytes and keep skb_extensions size under control. Before: struct sec_path { int len; int olen; int verified_cnt; /* XXX 4 bytes hole, try to pack */$ struct xfrm_state * xvec[6]; struct xfrm_offload ovec[1]; /* size: 88, cachelines: 2, members: 5 */ /* sum members: 84, holes: 1, sum holes: 4 */ /* last cacheline: 24 bytes */ }; After: struct sec_path { struct xfrm_state * xvec[6]; struct xfrm_offload ovec[1]; /* typedef u8 -> __u8 */ unsigned char len; /* typedef u8 -> __u8 */ unsigned char olen; /* typedef u8 -> __u8 */ unsigned char verified_cnt; /* size: 72, cachelines: 2, members: 5 */ /* padding: 1 */ /* last cacheline: 8 bytes */ }; Signed-off-by: Paolo Abeni Reviewed-by: Florian Westphal Reviewed-by: Steffen Klassert Link: https://patch.msgid.link/83846bd2e3fa08899bd0162e41bfadfec95e82ef.1770398071.git.pabeni@redhat.com Signed-off-by: Jakub Kicinski

xfrm: Determine inner GSO type from packet inner protocol

2025-10-30T10:52:31+00:00

The GSO segmentation functions for ESP tunnel mode (xfrm4_tunnel_gso_segment and xfrm6_tunnel_gso_segment) were determining the inner packet's L2 protocol type by checking the static x->inner_mode.family field from the xfrm state. This is unreliable. In tunnel mode, the state's actual inner family could be defined by x->inner_mode.family or by x->inner_mode_iaf.family. Checking only the former can lead to a mismatch with the actual packet being processed, causing GSO to create segments with the wrong L2 header type. This patch fixes the bug by deriving the inner mode directly from the packet's inner protocol stored in XFRM_MODE_SKB_CB(skb)->protocol. Instead of replicating the code, this patch modifies the xfrm_ip2inner_mode helper function. It now correctly returns &x->inner_mode if the selector family (x->sel.family) is already specified, thereby handling both specific and AF_UNSPEC cases appropriately. With this change, ESP GSO can use xfrm_ip2inner_mode to get the correct inner mode. It doesn't affect existing callers, as the updated logic now mirrors the checks they were already performing externally. Fixes: 26dbd66eab80 ("esp: choose the correct inner protocol for GSO on inter address family tunnels") Signed-off-by: Jianbo Liu Reviewed-by: Cosmin Ratiu Reviewed-by: Sabrina Dubroca Signed-off-by: Steffen Klassert

Revert "xfrm: destroy xfrm_state synchronously on net exit path"

2025-07-08T11:28:29+00:00

This reverts commit f75a2804da391571563c4b6b29e7797787332673. With all states (whether user or kern) removed from the hashtables during deletion, there's no need for synchronous destruction of states. xfrm6_tunnel states still need to have been destroyed (which will be the case when its last user is deleted (not destroyed)) so that xfrm6_tunnel_free_spi removes it from the per-netns hashtable before the netns is destroyed. This has the benefit of skipping one synchronize_rcu per state (in __xfrm_state_destroy(sync=true)) when we exit a netns. Signed-off-by: Sabrina Dubroca Signed-off-by: Steffen Klassert

xfrm: delete x->tunnel as we delete x

2025-07-08T11:28:27+00:00

The ipcomp fallback tunnels currently get deleted (from the various lists and hashtables) as the last user state that needed that fallback is destroyed (not deleted). If a reference to that user state still exists, the fallback state will remain on the hashtables/lists, triggering the WARN in xfrm_state_fini. Because of those remaining references, the fix in commit f75a2804da39 ("xfrm: destroy xfrm_state synchronously on net exit path") is not complete. We recently fixed one such situation in TCP due to defered freeing of skbs (commit 9b6412e6979f ("tcp: drop secpath at the same time as we currently drop dst")). This can also happen due to IP reassembly: skbs with a secpath remain on the reassembly queue until netns destruction. If we can't guarantee that the queues are flushed by the time xfrm_state_fini runs, there may still be references to a (user) xfrm_state, preventing the timely deletion of the corresponding fallback state. Instead of chasing each instance of skbs holding a secpath one by one, this patch fixes the issue directly within xfrm, by deleting the fallback state as soon as the last user state depending on it has been deleted. Destruction will still happen when the final reference is dropped. A separate lockdep class for the fallback state is required since we're going to lock x->tunnel while x is locked. Fixes: 9d4139c76905 ("netns xfrm: per-netns xfrm_state_all list") Signed-off-by: Sabrina Dubroca Signed-off-by: Steffen Klassert

xfrm: always initialize offload path

2025-06-12T05:07:14+00:00

Offload path is used for GRO with SW IPsec, and not just for HW offload. So initialize it anyway. Fixes: 585b64f5a620 ("xfrm: delay initialization of offload path till its actually requested") Reported-by: Sabrina Dubroca Closes: https://lore.kernel.org/all/aEGW_5HfPqU1rFjl@krikkit Signed-off-by: Leon Romanovsky Signed-off-by: Steffen Klassert

Merge tag 'ipsec-next-2025-05-23' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next

2025-05-26T16:32:48+00:00

Steffen Klassert says: ==================== 1) Remove some unnecessary strscpy_pad() size arguments. From Thorsten Blum. 2) Correct use of xso.real_dev on bonding offloads. Patchset from Cosmin Ratiu. 3) Add hardware offload configuration to XFRM_MSG_MIGRATE. From Chiachang Wang. 4) Refactor migration setup during cloning. This was done after the clone was created. Now it is done in the cloning function itself. From Chiachang Wang. 5) Validate assignment of maximal possible SEQ number. Prevent from setting to the maximum sequrnce number as this would cause for traffic drop. From Leon Romanovsky. 6) Prevent configuration of interface index when offload is used. Hardware can't handle this case.i From Leon Romanovsky. 7) Always use kfree_sensitive() for SA secret zeroization. From Zilin Guan. ipsec-next-2025-05-23 * tag 'ipsec-next-2025-05-23' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next: xfrm: use kfree_sensitive() for SA secret zeroization xfrm: prevent configuration of interface index when offload is used xfrm: validate assignment of maximal possible SEQ number xfrm: Refactor migration setup during the cloning process xfrm: Migrate offload configuration bonding: Fix multiple long standing offload races bonding: Mark active offloaded xfrm_states xfrm: Add explicit dev to .xdo_dev_state_{add,delete,free} xfrm: Remove unneeded device check from validate_xmit_xfrm xfrm: Use xdo.dev instead of xdo.real_dev net/mlx5: Avoid using xso.real_dev unnecessarily xfrm: Remove unnecessary strscpy_pad() size arguments ==================== Link: https://patch.msgid.link/20250523075611.3723340-1-steffen.klassert@secunet.com Signed-off-by: Paolo Abeni

xfrm: Migrate offload configuration

2025-04-17T09:00:03+00:00

Add hardware offload configuration to XFRM_MSG_MIGRATE using an option netlink attribute XFRMA_OFFLOAD_DEV. In the existing xfrm_state_migrate(), the xfrm_init_state() is called assuming no hardware offload by default. Even the original xfrm_state is configured with offload, the setting will be reset. If the device is configured with hardware offload, it's reasonable to allow the device to maintain its hardware offload mode. But the device will end up with offload disabled after receiving a migration event when the device migrates the connection from one netdev to another one. The devices that support migration may work with different underlying networks, such as mobile devices. The hardware setting should be forwarded to the different netdev based on the migration configuration. This change provides the capability for user space to migrate from one netdev to another. Test: Tested with kernel test in the Android tree located in https://android.googlesource.com/kernel/tests/ The xfrm_tunnel_test.py under the tests folder in particular. Signed-off-by: Chiachang Wang Reviewed-by: Leon Romanovsky Signed-off-by: Steffen Klassert

bonding: Fix multiple long standing offload races

2025-04-16T09:02:49+00:00

Refactor the bonding ipsec offload operations to fix a number of long-standing control plane races between state migration and user deletion and a few other issues. xfrm state deletion can happen concurrently with bond_change_active_slave() operation. This manifests itself as a bond_ipsec_del_sa() call with x->lock held, followed by a bond_ipsec_free_sa() a bit later from a wq. The alternate path of these calls coming from xfrm_dev_state_flush() can't happen, as that needs the RTNL lock and bond_change_active_slave() already holds it. 1. bond_ipsec_del_sa_all() might call xdo_dev_state_delete() a second time on an xfrm state that was concurrently killed. This is bad. 2. bond_ipsec_add_sa_all() can add a state on the new device, but pending bond_ipsec_free_sa() calls from the old device will then hit the WARN_ON() and then, worse, call xdo_dev_state_free() on the new device without a corresponding xdo_dev_state_delete(). 3. Resolve a sleeping in atomic context introduced by the mentioned "Fixes" commit. bond_ipsec_del_sa_all() and bond_ipsec_add_sa_all() now acquire x->lock and check for x->km.state to help with problems 1 and 2. And since xso.real_dev is now a private pointer managed by the bonding driver in xfrm state, make better use of it to fully fix problems 1 and 2. In bond_ipsec_del_sa_all(), set xso.real_dev to NULL while holding both the mutex and x->lock, which makes sure that neither bond_ipsec_del_sa() nor bond_ipsec_free_sa() could run concurrently. Fix problem 3 by moving the list cleanup (which requires the mutex) from bond_ipsec_del_sa() (called from atomic context) to bond_ipsec_free_sa() Finally, simplify bond_ipsec_del_sa() and bond_ipsec_free_sa() by using xso->real_dev directly, since it's now protected by locks and can be trusted to always reflect the offload device. Fixes: 2aeeef906d5a ("bonding: change ipsec_lock from spin lock to mutex") Signed-off-by: Cosmin Ratiu Reviewed-by: Leon Romanovsky Reviewed-by: Nikolay Aleksandrov Reviewed-by: Hangbin Liu Tested-by: Hangbin Liu Signed-off-by: Steffen Klassert

xfrm: Add explicit dev to .xdo_dev_state_{add,delete,free}

2025-04-16T09:01:41+00:00

Previously, device driver IPSec offload implementations would fall into two categories: 1. Those that used xso.dev to determine the offload device. 2. Those that used xso.real_dev to determine the offload device. The first category didn't work with bonding while the second did. In a non-bonding setup the two pointers are the same. This commit adds explicit pointers for the offload netdevice to .xdo_dev_state_add() / .xdo_dev_state_delete() / .xdo_dev_state_free() which eliminates the confusion and allows drivers from the first category to work with bonding. xso.real_dev now becomes a private pointer managed by the bonding driver. Signed-off-by: Cosmin Ratiu Reviewed-by: Leon Romanovsky Reviewed-by: Nikolay Aleksandrov Signed-off-by: Steffen Klassert