kernel/linux.git/net/mptcp/protocol.c, branch v7.0.10

mptcp: better mptcp-level RTT estimator

2026-05-23T11:08:37+00:00

[ Upstream commit d2000361e4ddf5047d660a902a3b0ed7520be1e5 ] The current MPTCP-level RTT estimator has several issues. On high speed links, the MPTCP-level receive buffer auto-tuning happens with a frequency well above the TCP-level's one. That in turn can cause excessive/unneeded receive buffer increase. On such links, the initial rtt_us value is considerably higher than the actual delay, and the current mptcp_rcv_space_adjust() updates msk->rcvq_space.rtt_us with a period equal to the such field previous value. If the initial rtt_us is 40ms, its first update will happen after 40ms, even if the subflows see actual RTT orders of magnitude lower. Additionally: - setting the msk RTT to the maximum among all the subflows RTTs makes DRS constantly overshooting the rcvbuf size when a subflow has considerable higher latency than the other(s). - during unidirectional bulk transfers with multiple active subflows, the TCP-level RTT estimator occasionally sees considerably higher value than the real link delay, i.e. when the packet scheduler reacts to an incoming ACK on given subflow pushing data on a different subflow. - currently inactive but still open subflows (i.e. switched to backup mode) are always considered when computing the msk-level RTT. Address the all the issues above with a more accurate RTT estimation strategy: the MPTCP-level RTT is set to the minimum of all the subflows actually feeding data into the MPTCP receive buffer, using a small sliding window. While at it, also use EWMA to compute the msk-level scaling_ratio, to that MPTCP can avoid traversing the subflow list is mptcp_rcv_space_adjust(). Use some care to avoid updating msk and ssk level fields too often. Fixes: a6b118febbab ("mptcp: add receive buffer auto-tuning") Signed-off-by: Paolo Abeni Reviewed-by: Mat Martineau Signed-off-by: Matthieu Baerts (NGI0) Link: https://patch.msgid.link/20260407-net-next-mptcp-reduce-rbuf-v2-1-0d1d135bf6f6@kernel.org Signed-off-by: Jakub Kicinski Signed-off-by: Sasha Levin

mptcp: fastclose msk when linger time is 0

2026-05-14T13:31:17+00:00

commit f14d6e9c3678a067f304abba561e0c5446c7e845 upstream. The SO_LINGER socket option has been supported for a while with MPTCP sockets [1], but it didn't cause the equivalent of a TCP reset as expected when enabled and its time was set to 0. This was causing some behavioural differences with TCP where some connections were not promptly stopped as expected. To fix that, an extra condition is checked at close() time before sending an MP_FASTCLOSE, the MPTCP equivalent of a TCP reset. Note that backporting up to [1] will be difficult as more changes are needed to be able to send MP_FASTCLOSE. It seems better to stop at [2], which was supposed to already imitate TCP. Validated with MPTCP packetdrill tests [3]. Fixes: 268b12387460 ("mptcp: setsockopt: support SO_LINGER") [1] Fixes: d21f83485518 ("mptcp: use fastclose on more edge scenarios") [2] Cc: stable@vger.kernel.org Reported-by: Lance Tuller Closes: https://github.com/lance0/xfr/pull/67 Link: https://github.com/multipath-tcp/packetdrill/pull/196 [3] Reviewed-by: Mat Martineau Signed-off-by: Matthieu Baerts (NGI0) Link: https://patch.msgid.link/20260427-net-mptcp-misc-fixes-7-1-rc2-v1-3-7432b7f279fa@kernel.org Signed-off-by: Jakub Kicinski Signed-off-by: Greg Kroah-Hartman

mptcp: sync the msk->sndbuf at accept() time

2026-05-07T04:14:10+00:00

commit fcf04b14334641f4b0b8647824480935e9416d52 upstream. On passive MPTCP connections, the msk sndbuf is not updated correctly. The root cause is an order issue in the accept path: - tcp_check_req() -> subflow_syn_recv_sock() -> mptcp_sk_clone_init() calls __mptcp_propagate_sndbuf() to copy the ssk sndbuf into msk - Later, tcp_child_process() -> tcp_init_transfer() -> tcp_sndbuf_expand() grows the ssk sndbuf. So __mptcp_propagate_sndbuf() runs before the ssk sndbuf has been expanded and the msk ends up with a much smaller sndbuf than the subflow: MPTCP: msk->sndbuf:20480, msk->first->sndbuf:2626560 Fix this by moving the __mptcp_propagate_sndbuf() call from mptcp_sk_clone_init() -- the ssk sndbuf is not yet finalized there -- to __mptcp_propagate_sndbuf() at accept() time, when the ssk sndbuf has been fully expanded by tcp_sndbuf_expand(). Fixes: 8005184fd1ca ("mptcp: refactor sndbuf auto-tuning") Cc: stable@vger.kernel.org Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/602 Signed-off-by: Gang Yan Acked-by: Paolo Abeni Reviewed-by: Matthieu Baerts (NGI0) Signed-off-by: Matthieu Baerts (NGI0) Link: https://patch.msgid.link/20260420-net-mptcp-sync-sndbuf-accept-v1-1-e3523e3aeb44@kernel.org Signed-off-by: Paolo Abeni Signed-off-by: Greg Kroah-Hartman

mptcp: fix slab-use-after-free in __inet_lookup_established

2026-04-09T02:30:08+00:00

The ehash table lookups are lockless and rely on SLAB_TYPESAFE_BY_RCU to guarantee socket memory stability during RCU read-side critical sections. Both tcp_prot and tcpv6_prot have their slab caches created with this flag via proto_register(). However, MPTCP's mptcp_subflow_init() copies tcpv6_prot into tcpv6_prot_override during inet_init() (fs_initcall, level 5), before inet6_init() (module_init/device_initcall, level 6) has called proto_register(&tcpv6_prot). At that point, tcpv6_prot.slab is still NULL, so tcpv6_prot_override.slab remains NULL permanently. This causes MPTCP v6 subflow child sockets to be allocated via kmalloc (falling into kmalloc-4k) instead of the TCPv6 slab cache. The kmalloc-4k cache lacks SLAB_TYPESAFE_BY_RCU, so when these sockets are freed without SOCK_RCU_FREE (which is cleared for child sockets by design), the memory can be immediately reused. Concurrent ehash lookups under rcu_read_lock can then access freed memory, triggering a slab-use-after-free in __inet_lookup_established. Fix this by splitting the IPv6-specific initialization out of mptcp_subflow_init() into a new mptcp_subflow_v6_init(), called from mptcp_proto_v6_init() before protocol registration. This ensures tcpv6_prot_override.slab correctly inherits the SLAB_TYPESAFE_BY_RCU slab cache. Fixes: b19bc2945b40 ("mptcp: implement delegated actions") Cc: stable@vger.kernel.org Signed-off-by: Jiayuan Chen Reviewed-by: Matthieu Baerts (NGI0) Link: https://patch.msgid.link/20260406031512.189159-1-jiayuan.chen@linux.dev Signed-off-by: Jakub Kicinski

mptcp: fix soft lockup in mptcp_recvmsg()

2026-04-01T01:58:37+00:00

syzbot reported a soft lockup in mptcp_recvmsg() [0]. When receiving data with MSG_PEEK | MSG_WAITALL flags, the skb is not removed from the sk_receive_queue. This causes sk_wait_data() to always find available data and never perform actual waiting, leading to a soft lockup. Fix this by adding a 'last' parameter to track the last peeked skb. This allows sk_wait_data() to make informed waiting decisions and prevent infinite loops when MSG_PEEK is used. [0]: watchdog: BUG: soft lockup - CPU#2 stuck for 156s! [server:1963] Modules linked in: CPU: 2 UID: 0 PID: 1963 Comm: server Not tainted 6.19.0-rc8 #61 PREEMPT(none) Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014 RIP: 0010:sk_wait_data+0x15/0x190 Code: 80 00 00 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa 41 56 41 55 41 54 49 89 f4 55 48 89 d5 53 48 89 fb <48> 83 ec 30 65 48 8b 05 17 a4 6b 01 48 89 44 24 28 31 c0 65 48 8b RSP: 0018:ffffc90000603ca0 EFLAGS: 00000246 RAX: 0000000000000000 RBX: ffff888102bf0800 RCX: 0000000000000001 RDX: 0000000000000000 RSI: ffffc90000603d18 RDI: ffff888102bf0800 RBP: 0000000000000000 R08: 0000000000000002 R09: 0000000000000101 R10: 0000000000000000 R11: 0000000000000075 R12: ffffc90000603d18 R13: ffff888102bf0800 R14: ffff888102bf0800 R15: 0000000000000000 FS: 00007f6e38b8c4c0(0000) GS:ffff8881b877e000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000055aa7bff1680 CR3: 0000000105cbe000 CR4: 00000000000006f0 Call Trace: mptcp_recvmsg+0x547/0x8c0 net/mptcp/protocol.c:2329 inet_recvmsg+0x11f/0x130 net/ipv4/af_inet.c:891 sock_recvmsg+0x94/0xc0 net/socket.c:1100 __sys_recvfrom+0xb2/0x130 net/socket.c:2256 __x64_sys_recvfrom+0x1f/0x30 net/socket.c:2267 do_syscall_64+0x59/0x2d0 arch/x86/entry/syscall_64.c:94 entry_SYSCALL_64_after_hwframe+0x76/0x7e arch/x86/entry/entry_64.S:131 RIP: 0033:0x7f6e386a4a1d Code: 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 8d 05 f1 de 2c 00 41 89 ca 8b 00 85 c0 75 20 45 31 c9 45 31 c0 b8 2d 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 6b f3 c3 66 0f 1f 84 00 00 00 00 00 41 56 41 RSP: 002b:00007ffc3c4bb078 EFLAGS: 00000246 ORIG_RAX: 000000000000002d RAX: ffffffffffffffda RBX: 000000000000861e RCX: 00007f6e386a4a1d RDX: 00000000000003ff RSI: 00007ffc3c4bb150 RDI: 0000000000000004 RBP: 00007ffc3c4bb570 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000103 R11: 0000000000000246 R12: 00005605dbc00be0 R13: 00007ffc3c4bb650 R14: 0000000000000000 R15: 0000000000000000 Fixes: 8e04ce45a8db ("mptcp: fix MSG_PEEK stream corruption") Signed-off-by: Li Xiasong Reviewed-by: Matthieu Baerts (NGI0) Link: https://patch.msgid.link/20260330120335.659027-1-lixiasong1@huawei.com Signed-off-by: Jakub Kicinski

mptcp: allow overridden write_space to be invoked

2026-02-11T03:54:21+00:00

Future extensions with psock will override their own sk->sk_write_space callback. This patch ensures that the overridden sk_write_space can be invoked by MPTCP. INDIRECT_CALL is used to keep the default path optimised. Note that sk->sk_write_space was never called directly with MPTCP sockets, so changing it to sk_stream_write_space in the init, and using it from mptcp_write_space() is not supposed to change the current behaviour. This patch is shared early to ease discussions around future RFC and avoid confusions with this "fix" that is needed for different future extensions. Suggested-by: Paolo Abeni Co-developed-by: Gang Yan Signed-off-by: Gang Yan Signed-off-by: Geliang Tang Reviewed-by: Matthieu Baerts (NGI0) Signed-off-by: Matthieu Baerts (NGI0) Link: https://patch.msgid.link/20260206-net-next-mptcp-write_space-override-v2-1-e0b12be818c6@kernel.org Signed-off-by: Jakub Kicinski

mptcp: Change some dubious min_t(int, ...) to min()

2026-02-05T02:45:09+00:00

There are two: min_t(int, xxx, mptcp_wnd_end(msk) - msk->snd_nxt); Both mptcp_wnd_end(msk) and msk->snd_nxt are u64, their difference (aka the window size) might be limited to 32 bits - but that isn't knowable from this code. So checks being added to min_t() detect the potential discard of significant bits. Provided the 'avail_size' and return of mptcp_check_allowed_size() are changed to an unsigned type (size_t matches the type the caller uses) both min_t() can be changed to min(). Signed-off-by: David Laight Reviewed-by: Matthieu Baerts (NGI0) [ wrapped too long lines when declaring mptcp_check_allowed_size() ] Signed-off-by: Matthieu Baerts (NGI0) Link: https://patch.msgid.link/20260203-net-next-mptcp-misc-feat-6-20-v1-6-31ec8bfc56d1@kernel.org Signed-off-by: Jakub Kicinski

trace: mptcp: add mptcp_rcvbuf_grow tracepoint

2026-02-05T02:45:09+00:00

Similar to tcp, provide a new tracepoint to better understand mptcp_rcv_space_adjust() behavior, which presents many artifacts. Note that the used format string is so long that I preferred wrap it, contrary to guidance for quoted strings. Reviewed-by: Mat Martineau Signed-off-by: Paolo Abeni Signed-off-by: Matthieu Baerts (NGI0) Link: https://patch.msgid.link/20260203-net-next-mptcp-misc-feat-6-20-v1-4-31ec8bfc56d1@kernel.org Signed-off-by: Jakub Kicinski

mptcp: consolidate rcv space init

2026-02-05T02:45:09+00:00

MPTCP uses several calls of the mptcp_rcv_space_init() helper to initialize the receive space, with a catch-up call in mptcp_rcv_space_adjust(). Drop all the other strictly not needed invocations and move constant fields initialization at socket init/reset time. This removes a bit of complexity from mptcp DRS code. No functional changes intended. Reviewed-by: Mat Martineau Signed-off-by: Paolo Abeni Signed-off-by: Matthieu Baerts (NGI0) Link: https://patch.msgid.link/20260203-net-next-mptcp-misc-feat-6-20-v1-3-31ec8bfc56d1@kernel.org Signed-off-by: Jakub Kicinski

mptcp: fix receive space timestamp initialization

2026-02-05T02:45:09+00:00

MPTCP initialize the receive buffer stamp in mptcp_rcv_space_init(), using the provided subflow stamp. Such helper is invoked in several places; for passive sockets, space init happened at clone time. In such scenario, MPTCP ends-up accesses the subflow stamp before its initialization, leading to quite randomic timing for the first receive buffer auto-tune event, as the timestamp for newly created subflow is not refreshed there. Fix the issue moving the stamp initialization out of the mentioned helper, at the data transfer start, and always using a fresh timestamp. Fixes: 013e3179dbd2 ("mptcp: fix rcv space initialization") Reviewed-by: Mat Martineau Signed-off-by: Paolo Abeni Signed-off-by: Matthieu Baerts (NGI0) Link: https://patch.msgid.link/20260203-net-next-mptcp-misc-feat-6-20-v1-2-31ec8bfc56d1@kernel.org Signed-off-by: Jakub Kicinski