summaryrefslogtreecommitdiff
path: root/net/mptcp
AgeCommit message (Collapse)AuthorFilesLines
2023-04-26inet6: Remove inet6_destroy_sock() in sk->sk_prot->destroy().Kuniyuki Iwashima1-7/+0
commit b5fc29233d28be7a3322848ebe73ac327559cdb9 upstream. After commit d38afeec26ed ("tcp/udp: Call inet6_destroy_sock() in IPv6 sk->sk_destruct()."), we call inet6_destroy_sock() in sk->sk_destruct() by setting inet6_sock_destruct() to it to make sure we do not leak inet6-specific resources. Now we can remove unnecessary inet6_destroy_sock() calls in sk->sk_prot->destroy(). DCCP and SCTP have their own sk->sk_destruct() function, so we change them separately in the following patches. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Ziyang Xuan <william.xuanziyang@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-04-20mptcp: stricter state check in mptcp_workerPaolo Abeni1-1/+1
commit d6a0443733434408f2cbd4c53fea6910599bab9e upstream. As reported by Christoph, the mptcp protocol can run the worker when the relevant msk socket is in an unexpected state: connect() // incoming reset + fastclose // the mptcp worker is scheduled mptcp_disconnect() // msk is now CLOSED listen() mptcp_worker() Leading to the following splat: divide error: 0000 [#1] PREEMPT SMP CPU: 1 PID: 21 Comm: kworker/1:0 Not tainted 6.3.0-rc1-gde5e8fd0123c #11 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.el7 04/01/2014 Workqueue: events mptcp_worker RIP: 0010:__tcp_select_window+0x22c/0x4b0 net/ipv4/tcp_output.c:3018 RSP: 0018:ffffc900000b3c98 EFLAGS: 00010293 RAX: 000000000000ffd7 RBX: 000000000000ffd7 RCX: 0000000000000000 RDX: 0000000000000000 RSI: ffffffff8214ce97 RDI: 0000000000000004 RBP: 000000000000ffd7 R08: 0000000000000004 R09: 0000000000010000 R10: 000000000000ffd7 R11: ffff888005afa148 R12: 000000000000ffd7 R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000 FS: 0000000000000000(0000) GS:ffff88803ed00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000405270 CR3: 000000003011e006 CR4: 0000000000370ee0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> tcp_select_window net/ipv4/tcp_output.c:262 [inline] __tcp_transmit_skb+0x356/0x1280 net/ipv4/tcp_output.c:1345 tcp_transmit_skb net/ipv4/tcp_output.c:1417 [inline] tcp_send_active_reset+0x13e/0x320 net/ipv4/tcp_output.c:3459 mptcp_check_fastclose net/mptcp/protocol.c:2530 [inline] mptcp_worker+0x6c7/0x800 net/mptcp/protocol.c:2705 process_one_work+0x3bd/0x950 kernel/workqueue.c:2390 worker_thread+0x5b/0x610 kernel/workqueue.c:2537 kthread+0x138/0x170 kernel/kthread.c:376 ret_from_fork+0x2c/0x50 arch/x86/entry/entry_64.S:308 </TASK> This change addresses the issue explicitly checking for bad states before running the mptcp worker. Fixes: e16163b6e2b7 ("mptcp: refactor shutdown and close") Cc: stable@vger.kernel.org Reported-by: Christoph Paasch <cpaasch@apple.com> Link: https://github.com/multipath-tcp/mptcp_net-next/issues/374 Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Tested-by: Christoph Paasch <cpaasch@apple.com> Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-04-20mptcp: use mptcp_schedule_work instead of open-coding itPaolo Abeni2-15/+8
commit a5cb752b125766524c921faab1a45cc96065b0a7 upstream. Beyond reducing code duplication this also avoids scheduling the mptcp_worker on a closed socket on some edge scenarios. The addressed issue is actually older than the blamed commit below, but this fix needs it as a pre-requisite. Fixes: ba8f48f7a4d7 ("mptcp: introduce mptcp_schedule_work") Cc: stable@vger.kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-03-30mptcp: fix UaF in listener shutdownPaolo Abeni3-78/+0
[ Upstream commit 0a3f4f1f9c27215e4ddcd312558342e57b93e518 ] Backports notes: one simple conflict in net/mptcp/protocol.c with: commit f8c9dfbd875b ("mptcp: add pm listener events") Where one commit removes code in __mptcp_close_ssk() while the other one adds one line at the same place. We can simply remove the whole condition because this extra instruction is not present in v6.1. As reported by Christoph after having refactored the passive socket initialization, the mptcp listener shutdown path is prone to an UaF issue. BUG: KASAN: use-after-free in _raw_spin_lock_bh+0x73/0xe0 Write of size 4 at addr ffff88810cb23098 by task syz-executor731/1266 CPU: 1 PID: 1266 Comm: syz-executor731 Not tainted 6.2.0-rc59af4eaa31c1f6c00c8f1e448ed99a45c66340dd5 #6 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014 Call Trace: <TASK> dump_stack_lvl+0x6e/0x91 print_report+0x16a/0x46f kasan_report+0xad/0x130 kasan_check_range+0x14a/0x1a0 _raw_spin_lock_bh+0x73/0xe0 subflow_error_report+0x6d/0x110 sk_error_report+0x3b/0x190 tcp_disconnect+0x138c/0x1aa0 inet_child_forget+0x6f/0x2e0 inet_csk_listen_stop+0x209/0x1060 __mptcp_close_ssk+0x52d/0x610 mptcp_destroy_common+0x165/0x640 mptcp_destroy+0x13/0x80 __mptcp_destroy_sock+0xe7/0x270 __mptcp_close+0x70e/0x9b0 mptcp_close+0x2b/0x150 inet_release+0xe9/0x1f0 __sock_release+0xd2/0x280 sock_close+0x15/0x20 __fput+0x252/0xa20 task_work_run+0x169/0x250 exit_to_user_mode_prepare+0x113/0x120 syscall_exit_to_user_mode+0x1d/0x40 do_syscall_64+0x48/0x90 entry_SYSCALL_64_after_hwframe+0x72/0xdc The msk grace period can legitly expire in between the last reference count dropped in mptcp_subflow_queue_clean() and the later eventual access in inet_csk_listen_stop() After the previous patch we don't need anymore special-casing msk listener socket cleanup: the mptcp worker will process each of the unaccepted msk sockets. Just drop the now unnecessary code. Please note this commit depends on the two parent ones: mptcp: refactor passive socket initialization mptcp: use the workqueue to destroy unaccepted sockets Fixes: 6aeed9045071 ("mptcp: fix race on unaccepted mptcp sockets") Cc: stable@vger.kernel.org Reported-and-tested-by: Christoph Paasch <cpaasch@apple.com> Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/346 Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-03-30mptcp: use the workqueue to destroy unaccepted socketsPaolo Abeni3-16/+47
[ Upstream commit b6985b9b82954caa53f862d6059d06c0526254f0 ] Backports notes: one simple conflict in net/mptcp/protocol.c with: commit a5ef058dc4d9 ("net: introduce and use custom sockopt socket flag") Where the two commits add a new line for different actions in the same context in mptcp_stream_accept(). Christoph reported a UaF at token lookup time after having refactored the passive socket initialization part: BUG: KASAN: use-after-free in __token_bucket_busy+0x253/0x260 Read of size 4 at addr ffff88810698d5b0 by task syz-executor653/3198 CPU: 1 PID: 3198 Comm: syz-executor653 Not tainted 6.2.0-rc59af4eaa31c1f6c00c8f1e448ed99a45c66340dd5 #6 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014 Call Trace: <TASK> dump_stack_lvl+0x6e/0x91 print_report+0x16a/0x46f kasan_report+0xad/0x130 __token_bucket_busy+0x253/0x260 mptcp_token_new_connect+0x13d/0x490 mptcp_connect+0x4ed/0x860 __inet_stream_connect+0x80e/0xd90 tcp_sendmsg_fastopen+0x3ce/0x710 mptcp_sendmsg+0xff1/0x1a20 inet_sendmsg+0x11d/0x140 __sys_sendto+0x405/0x490 __x64_sys_sendto+0xdc/0x1b0 do_syscall_64+0x3b/0x90 entry_SYSCALL_64_after_hwframe+0x72/0xdc We need to properly clean-up all the paired MPTCP-level resources and be sure to release the msk last, even when the unaccepted subflow is destroyed by the TCP internals via inet_child_forget(). We can re-use the existing MPTCP_WORK_CLOSE_SUBFLOW infra, explicitly checking that for the critical scenario: the closed subflow is the MPC one, the msk is not accepted and eventually going through full cleanup. With such change, __mptcp_destroy_sock() is always called on msk sockets, even on accepted ones. We don't need anymore to transiently drop one sk reference at msk clone time. Please note this commit depends on the parent one: mptcp: refactor passive socket initialization Fixes: 58b09919626b ("mptcp: create msk early") Cc: stable@vger.kernel.org Reported-and-tested-by: Christoph Paasch <cpaasch@apple.com> Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/347 Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-03-30mptcp: refactor passive socket initializationPaolo Abeni2-23/+21
[ Upstream commit 3a236aef280ed5122b2d47087eb514d0921ae033 ] After commit 30e51b923e43 ("mptcp: fix unreleased socket in accept queue") unaccepted msk sockets go throu complete shutdown, we don't need anymore to delay inserting the first subflow into the subflow lists. The reference counting deserve some extra care, as __mptcp_close() is unaware of the request socket linkage to the first subflow. Please note that this is more a refactoring than a fix but because this modification is needed to include other corrections, see the following commits. Then a Fixes tag has been added here to help the stable team. Fixes: 30e51b923e43 ("mptcp: fix unreleased socket in accept queue") Cc: stable@vger.kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Tested-by: Christoph Paasch <cpaasch@apple.com> Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-03-22mptcp: fix lockdep false positive in mptcp_pm_nl_create_listen_socket()Paolo Abeni1-0/+16
commit cee4034a3db1d30c3243dd51506a9d4ab1a849fa upstream. Christoph reports a lockdep splat in the mptcp_subflow_create_socket() error path, when such function is invoked by mptcp_pm_nl_create_listen_socket(). Such code path acquires two separates, nested socket lock, with the internal lock operation lacking the "nested" annotation. Adding that in sock_release() for mptcp's sake only could be confusing. Instead just add a new lockclass to the in-kernel msk socket, re-initializing the lockdep infra after the socket creation. Fixes: ad2171009d96 ("mptcp: fix locking for in-kernel listener creation") Cc: stable@vger.kernel.org Reported-by: Christoph Paasch <cpaasch@apple.com> Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/354 Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Tested-by: Christoph Paasch <cpaasch@apple.com> Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-03-22mptcp: avoid setting TCP_CLOSE state twiceMatthieu Baerts1-1/+0
commit 3ba14528684f528566fb7d956bfbfb958b591d86 upstream. tcp_set_state() is called from tcp_done() already. There is then no need to first set the state to TCP_CLOSE, then call tcp_done(). Fixes: d582484726c4 ("mptcp: fix fallback for MP_JOIN subflows") Cc: stable@vger.kernel.org Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/362 Acked-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-03-22mptcp: add ro_after_init for tcp{,v6}_prot_overrideGeliang Tang1-2/+2
commit 822467a48e938e661965d09df5fcac66f7291050 upstream. Add __ro_after_init labels for the variables tcp_prot_override and tcpv6_prot_override, just like other variables adjacent to them, to indicate that they are initialised from the init hooks and no writes occur afterwards. Fixes: b19bc2945b40 ("mptcp: implement delegated actions") Cc: stable@vger.kernel.org Fixes: 51fa7f8ebf0e ("mptcp: mark ops structures as ro_after_init") Signed-off-by: Geliang Tang <geliang.tang@suse.com> Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-03-22mptcp: fix possible deadlock in subflow_error_reportPaolo Abeni1-0/+7
commit b7a679ba7c652587b85294f4953f33ac0b756d40 upstream. Christoph reported a possible deadlock while the TCP stack destroys an unaccepted subflow due to an incoming reset: the MPTCP socket error path tries to acquire the msk-level socket lock while TCP still owns the listener socket accept queue spinlock, and the reverse dependency already exists in the TCP stack. Note that the above is actually a lockdep false positive, as the chain involves two separate sockets. A different per-socket lockdep key will address the issue, but such a change will be quite invasive. Instead, we can simply stop earlier the socket error handling for orphaned or unaccepted subflows, breaking the critical lockdep chain. Error handling in such a scenario is a no-op. Reported-and-tested-by: Christoph Paasch <cpaasch@apple.com> Fixes: 15cc10453398 ("mptcp: deliver ssk errors to msk") Cc: stable@vger.kernel.org Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/355 Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-02-22mptcp: fix locking for in-kernel listener creationPaolo Abeni2-5/+7
[ Upstream commit ad2171009d968104ccda9dc517f5a3ba891515db ] For consistency, in mptcp_pm_nl_create_listen_socket(), we need to call the __mptcp_nmpc_socket() under the msk socket lock. Note that as a side effect, mptcp_subflow_create_socket() needs a 'nested' lockdep annotation, as it will acquire the subflow (kernel) socket lock under the in-kernel listener msk socket lock. The current lack of locking is almost harmless, because the relevant socket is not exposed to the user space, but in future we will add more complexity to the mentioned helper, let's play safe. Fixes: 1729cf186d8a ("mptcp: create the listening socket for new port") Cc: stable@vger.kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-02-22mptcp: deduplicate error paths on endpoint creationPaolo Abeni1-22/+13
[ Upstream commit 976d302fb6165ad620778d7ba834cde6e3fe9f9f ] When endpoint creation fails, we need to free the newly allocated entry and eventually destroy the paired mptcp listener socket. Consolidate such action in a single point let all the errors path reach it. Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net> Stable-dep-of: ad2171009d96 ("mptcp: fix locking for in-kernel listener creation") Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-02-22mptcp: fix locking for setsockopt corner-casePaolo Abeni1-2/+9
[ Upstream commit 21e43569685de4ad773fb060c11a15f3fd5e7ac4 ] We need to call the __mptcp_nmpc_socket(), and later subflow socket access under the msk socket lock, or e.g. a racing connect() could change the socket status under the hood, with unexpected results. Fixes: 54635bd04701 ("mptcp: add TCP_FASTOPEN_CONNECT socket option") Cc: stable@vger.kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-02-22mptcp: sockopt: make 'tcp_fastopen_connect' genericMatthieu Baerts1-5/+6
[ Upstream commit d3d429047cc66ff49780c93e4fccd9527723d385 ] There are other socket options that need to act only on the first subflow, e.g. all TCP_FASTOPEN* socket options. This is similar to the getsockopt version. In the next commit, this new mptcp_setsockopt_first_sf_only() helper is used by other another option. Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Stable-dep-of: 21e43569685d ("mptcp: fix locking for setsockopt corner-case") Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-02-14mptcp: be careful on subflow status propagation on errorsPaolo Abeni1-1/+9
commit 1249db44a102d9d3541ed7798d4b01ffdcf03524 upstream. Currently the subflow error report callback unconditionally propagates the fallback subflow status to the owning msk. If the msk is already orphaned, the above prevents the code from correctly tracking the msk moving to the TCP_CLOSE state and doing the appropriate cleanup. All the above causes increasing memory usage over time and sporadic self-tests failures. There is a great deal of infrastructure trying to propagate correctly the fallback subflow status to the owning mptcp socket, e.g. via mptcp_subflow_eof() and subflow_sched_work_if_closed(): in the error propagation path we need only to cope with unorphaned sockets. Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/339 Fixes: 15cc10453398 ("mptcp: deliver ssk errors to msk") Cc: stable@vger.kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-02-14mptcp: do not wait for bare sockets' timeoutPaolo Abeni1-0/+9
commit d4e85922e3e7ef2071f91f65e61629b60f3a9cf4 upstream. If the peer closes all the existing subflows for a given mptcp socket and later the application closes it, the current implementation let it survive until the timewait timeout expires. While the above is allowed by the protocol specification it consumes resources for almost no reason and additionally causes sporadic self-tests failures. Let's move the mptcp socket to the TCP_CLOSE state when there are no alive subflows at close time, so that the allocated resources will be freed immediately. Fixes: e16163b6e2b7 ("mptcp: refactor shutdown and close") Cc: stable@vger.kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-01-24mptcp: netlink: respect v4/v6-only socketsMatthieu Baerts3-0/+35
commit fb00ee4f3343acb2b9222ca9b73b47dd1e1a8efc upstream. If an MPTCP socket has been created with AF_INET6 and the IPV6_V6ONLY option has been set, the userspace PM would allow creating subflows using IPv4 addresses, e.g. mapped in v6. The kernel side of userspace PM will also accept creating subflows with local and remote addresses having different families. Depending on the subflow socket's family, different behaviours are expected: - If AF_INET is forced with a v6 address, the kernel will take the last byte of the IP and try to connect to that: a new subflow is created but to a non expected address. - If AF_INET6 is forced with a v4 address, the kernel will try to connect to a v4 address (v4-mapped-v6). A -EBADF error from the connect() part is then expected. It is then required to check the given families can be accepted. This is done by using a new helper for addresses family matching, taking care of IPv4 vs IPv4-mapped-IPv6 addresses. This helper will be re-used later by the in-kernel path-manager to use mixed IPv4 and IPv6 addresses. While at it, a clear error message is now reported if there are some conflicts with the families that have been passed by the userspace. Fixes: 702c2f646d42 ("mptcp: netlink: allow userspace-driven subflow establishment") Cc: stable@vger.kernel.org Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-01-24mptcp: explicitly specify sock family at subflow creation timePaolo Abeni3-6/+8
commit 6bc1fe7dd748ba5e76e7917d110837cafe7b931c upstream. Let the caller specify the to-be-created subflow family. For a given MPTCP socket created with the AF_INET6 family, the current userspace PM can already ask the kernel to create subflows in v4 and v6. If "plain" IPv4 addresses are passed to the kernel, they are automatically mapped in v6 addresses "by accident". This can be problematic because the userspace will need to pass different addresses, now the v4-mapped-v6 addresses to destroy this new subflow. On the other hand, if the MPTCP socket has been created with the AF_INET family, the command to create a subflow in v6 will be accepted but the result will not be the one as expected as new subflow will be created in IPv4 using part of the v6 addresses passed to the kernel: not creating the expected subflow then. No functional change intended for the in-kernel PM where an explicit enforcement is currently in place. This arbitrary enforcement will be leveraged by other patches in a future version. Fixes: 702c2f646d42 ("mptcp: netlink: allow userspace-driven subflow establishment") Cc: stable@vger.kernel.org Co-developed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-01-12mptcp: fix lockdep false positivePaolo Abeni3-4/+19
[ Upstream commit fec3adfd754ccc99a7230e8ab9f105b65fb07bcc ] MattB reported a lockdep splat in the mptcp listener code cleanup: WARNING: possible circular locking dependency detected packetdrill/14278 is trying to acquire lock: ffff888017d868f0 ((work_completion)(&msk->work)){+.+.}-{0:0}, at: __flush_work (kernel/workqueue.c:3069) but task is already holding lock: ffff888017d84130 (sk_lock-AF_INET){+.+.}-{0:0}, at: mptcp_close (net/mptcp/protocol.c:2973) which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #1 (sk_lock-AF_INET){+.+.}-{0:0}: __lock_acquire (kernel/locking/lockdep.c:5055) lock_acquire (kernel/locking/lockdep.c:466) lock_sock_nested (net/core/sock.c:3463) mptcp_worker (net/mptcp/protocol.c:2614) process_one_work (kernel/workqueue.c:2294) worker_thread (include/linux/list.h:292) kthread (kernel/kthread.c:376) ret_from_fork (arch/x86/entry/entry_64.S:312) -> #0 ((work_completion)(&msk->work)){+.+.}-{0:0}: check_prev_add (kernel/locking/lockdep.c:3098) validate_chain (kernel/locking/lockdep.c:3217) __lock_acquire (kernel/locking/lockdep.c:5055) lock_acquire (kernel/locking/lockdep.c:466) __flush_work (kernel/workqueue.c:3070) __cancel_work_timer (kernel/workqueue.c:3160) mptcp_cancel_work (net/mptcp/protocol.c:2758) mptcp_subflow_queue_clean (net/mptcp/subflow.c:1817) __mptcp_close_ssk (net/mptcp/protocol.c:2363) mptcp_destroy_common (net/mptcp/protocol.c:3170) mptcp_destroy (include/net/sock.h:1495) __mptcp_destroy_sock (net/mptcp/protocol.c:2886) __mptcp_close (net/mptcp/protocol.c:2959) mptcp_close (net/mptcp/protocol.c:2974) inet_release (net/ipv4/af_inet.c:432) __sock_release (net/socket.c:651) sock_close (net/socket.c:1367) __fput (fs/file_table.c:320) task_work_run (kernel/task_work.c:181 (discriminator 1)) exit_to_user_mode_prepare (include/linux/resume_user_mode.h:49) syscall_exit_to_user_mode (kernel/entry/common.c:130) do_syscall_64 (arch/x86/entry/common.c:87) entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:120) other info that might help us debug this: Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(sk_lock-AF_INET); lock((work_completion)(&msk->work)); lock(sk_lock-AF_INET); lock((work_completion)(&msk->work)); *** DEADLOCK *** The report is actually a false positive, since the only existing lock nesting is the msk socket lock acquired by the mptcp work. cancel_work_sync() is invoked without the relevant socket lock being held, but under a different (the msk listener) socket lock. We could silence the splat adding a per workqueue dynamic lockdep key, but that looks overkill. Instead just tell lockdep the msk socket lock is not held around cancel_work_sync(). Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/322 Fixes: 30e51b923e43 ("mptcp: fix unreleased socket in accept queue") Reported-by: Matthieu Baerts <matthieu.baerts@tessares.net> Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-01-12mptcp: fix deadlock in fastopen error pathPaolo Abeni2-4/+16
[ Upstream commit 7d803344fdc3e38079fabcf38b1e4cb6f8faa655 ] MatM reported a deadlock at fastopening time: INFO: task syz-executor.0:11454 blocked for more than 143 seconds. Tainted: G S 6.1.0-rc5-03226-gdb0157db5153 #1 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. task:syz-executor.0 state:D stack:25104 pid:11454 ppid:424 flags:0x00004006 Call Trace: <TASK> context_switch kernel/sched/core.c:5191 [inline] __schedule+0x5c2/0x1550 kernel/sched/core.c:6503 schedule+0xe8/0x1c0 kernel/sched/core.c:6579 __lock_sock+0x142/0x260 net/core/sock.c:2896 lock_sock_nested+0xdb/0x100 net/core/sock.c:3466 __mptcp_close_ssk+0x1a3/0x790 net/mptcp/protocol.c:2328 mptcp_destroy_common+0x16a/0x650 net/mptcp/protocol.c:3171 mptcp_disconnect+0xb8/0x450 net/mptcp/protocol.c:3019 __inet_stream_connect+0x897/0xa40 net/ipv4/af_inet.c:720 tcp_sendmsg_fastopen+0x3dd/0x740 net/ipv4/tcp.c:1200 mptcp_sendmsg_fastopen net/mptcp/protocol.c:1682 [inline] mptcp_sendmsg+0x128a/0x1a50 net/mptcp/protocol.c:1721 inet6_sendmsg+0x11f/0x150 net/ipv6/af_inet6.c:663 sock_sendmsg_nosec net/socket.c:714 [inline] sock_sendmsg+0xf7/0x190 net/socket.c:734 ____sys_sendmsg+0x336/0x970 net/socket.c:2476 ___sys_sendmsg+0x122/0x1c0 net/socket.c:2530 __sys_sendmmsg+0x18d/0x460 net/socket.c:2616 __do_sys_sendmmsg net/socket.c:2645 [inline] __se_sys_sendmmsg net/socket.c:2642 [inline] __x64_sys_sendmmsg+0x9d/0x110 net/socket.c:2642 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x38/0x90 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd RIP: 0033:0x7f5920a75e7d RSP: 002b:00007f59201e8028 EFLAGS: 00000246 ORIG_RAX: 0000000000000133 RAX: ffffffffffffffda RBX: 00007f5920bb4f80 RCX: 00007f5920a75e7d RDX: 0000000000000001 RSI: 0000000020002940 RDI: 0000000000000005 RBP: 00007f5920ae7593 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000020004050 R11: 0000000000000246 R12: 0000000000000000 R13: 000000000000000b R14: 00007f5920bb4f80 R15: 00007f59201c8000 </TASK> In the error path, tcp_sendmsg_fastopen() ends-up calling mptcp_disconnect(), and the latter tries to close each subflow, acquiring the socket lock on each of them. At fastopen time, we have a single subflow, and such subflow socket lock is already held by the called, causing the deadlock. We already track the 'fastopen in progress' status inside the msk socket. Use it to address the issue, making mptcp_disconnect() a no op when invoked from the fastopen (error) path and doing the relevant cleanup after releasing the subflow socket lock. While at the above, rename the fastopen status bit to something more meaningful. Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/321 Fixes: fa9e57468aa1 ("mptcp: fix abba deadlock on fastopen") Reported-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-01-07mptcp: use proper req destructor for IPv6Matthieu Baerts1-3/+16
commit d3295fee3c756ece33ac0d935e172e68c0a4161b upstream. Before, only the destructor from TCP request sock in IPv4 was called even if the subflow was IPv6. It is important to use the right destructor to avoid memory leaks with some advanced IPv6 features, e.g. when the request socks contain specific IPv6 options. Fixes: 79c0949e9a09 ("mptcp: Add key generation and token tree") Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Cc: stable@vger.kernel.org Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-01-07mptcp: dedicated request sock for subflow in v6Matthieu Baerts1-8/+26
commit 34b21d1ddc8ace77a8fa35c1b1e06377209e0dae upstream. tcp_request_sock_ops structure is specific to IPv4. It should then not be used with MPTCP subflows on top of IPv6. For example, it contains the 'family' field, initialised to AF_INET. This 'family' field is used by TCP FastOpen code to generate the cookie but also by TCP Metrics, SELinux and SYN Cookies. Using the wrong family will not lead to crashes but displaying/using/checking wrong things. Note that 'send_reset' callback from request_sock_ops structure is used in some error paths. It is then also important to use the correct one for IPv4 or IPv6. The slab name can also be different in IPv4 and IPv6, it will be used when printing some log messages. The slab pointer will anyway be the same because the object size is the same for both v4 and v6. A BUILD_BUG_ON() has also been added to make sure this size is the same. Fixes: cec37a6e41aa ("mptcp: Handle MP_CAPABLE options for outgoing connections") Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Cc: stable@vger.kernel.org Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-01-07mptcp: remove MPTCP 'ifdef' in TCP SYN cookiesMatthieu Baerts1-1/+11
commit 3fff88186f047627bb128d65155f42517f8e448f upstream. To ease the maintenance, it is often recommended to avoid having #ifdef preprocessor conditions. Here the section related to CONFIG_MPTCP was quite short but the next commit needs to add more code around. It is then cleaner to move specific MPTCP code to functions located in net/mptcp directory. Now that mptcp_subflow_request_sock_ops structure can be static, it can also be marked as "read only after init". Suggested-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Cc: stable@vger.kernel.org Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-01-07mptcp: netlink: fix some error return codeWei Yongjun1-0/+4
commit e0fe1123ab2b07d2cd5475660bd0b4e6993ffaa7 upstream. Fix to return negative error code -EINVAL from some error handling case instead of 0, as done elsewhere in those functions. Fixes: 9ab4807c84a4 ("mptcp: netlink: Add MPTCP_PM_CMD_ANNOUNCE") Fixes: 702c2f646d42 ("mptcp: netlink: allow userspace-driven subflow establishment") Cc: stable@vger.kernel.org Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-11-29mptcp: fix sleep in atomic at close timePaolo Abeni1-3/+3
Matt reported a splat at msk close time: BUG: sleeping function called from invalid context at net/mptcp/protocol.c:2877 in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 155, name: packetdrill preempt_count: 201, expected: 0 RCU nest depth: 0, expected: 0 4 locks held by packetdrill/155: #0: ffff888001536990 (&sb->s_type->i_mutex_key#6){+.+.}-{3:3}, at: __sock_release (net/socket.c:650) #1: ffff88800b498130 (sk_lock-AF_INET){+.+.}-{0:0}, at: mptcp_close (net/mptcp/protocol.c:2973) #2: ffff88800b49a130 (sk_lock-AF_INET/1){+.+.}-{0:0}, at: __mptcp_close_ssk (net/mptcp/protocol.c:2363) #3: ffff88800b49a0b0 (slock-AF_INET){+...}-{2:2}, at: __lock_sock_fast (include/net/sock.h:1820) Preemption disabled at: 0x0 CPU: 1 PID: 155 Comm: packetdrill Not tainted 6.1.0-rc5 #365 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014 Call Trace: <TASK> dump_stack_lvl (lib/dump_stack.c:107 (discriminator 4)) __might_resched.cold (kernel/sched/core.c:9891) __mptcp_destroy_sock (include/linux/kernel.h:110) __mptcp_close (net/mptcp/protocol.c:2959) mptcp_subflow_queue_clean (include/net/sock.h:1777) __mptcp_close_ssk (net/mptcp/protocol.c:2363) mptcp_destroy_common (net/mptcp/protocol.c:3170) mptcp_destroy (include/net/sock.h:1495) __mptcp_destroy_sock (net/mptcp/protocol.c:2886) __mptcp_close (net/mptcp/protocol.c:2959) mptcp_close (net/mptcp/protocol.c:2974) inet_release (net/ipv4/af_inet.c:432) __sock_release (net/socket.c:651) sock_close (net/socket.c:1367) __fput (fs/file_table.c:320) task_work_run (kernel/task_work.c:181 (discriminator 1)) exit_to_user_mode_prepare (include/linux/resume_user_mode.h:49) syscall_exit_to_user_mode (kernel/entry/common.c:130) do_syscall_64 (arch/x86/entry/common.c:87) entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:120) We can't call mptcp_close under the 'fast' socket lock variant, replace it with a sock_lock_nested() as the relevant code is already under the listening msk socket lock protection. Reported-by: Matthieu Baerts <matthieu.baerts@tessares.net> Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/316 Fixes: 30e51b923e43 ("mptcp: fix unreleased socket in accept queue") Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-11-29mptcp: don't orphan ssk in mptcp_close()Menglong Dong1-7/+6
All of the subflows of a msk will be orphaned in mptcp_close(), which means the subflows are in DEAD state. After then, DATA_FIN will be sent, and the other side will response with a DATA_ACK for this DATA_FIN. However, if the other side still has pending data, the data that received on these subflows will not be passed to the msk, as they are DEAD and subflow_data_ready() will not be called in tcp_data_ready(). Therefore, these data can't be acked, and they will be retransmitted again and again, until timeout. Fix this by setting ssk->sk_socket and ssk->sk_wq to 'NULL', instead of orphaning the subflows in __mptcp_close(), as Paolo suggested. Fixes: e16163b6e2b7 ("mptcp: refactor shutdown and close") Reviewed-by: Biao Jiang <benbjiang@tencent.com> Reviewed-by: Mengen Sun <mengensun@tencent.com> Signed-off-by: Menglong Dong <imagedong@tencent.com> Reviewed-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-10-25mptcp: fix abba deadlock on fastopenPaolo Abeni1-15/+34
Our CI reported lockdep splat in the fastopen code: ====================================================== WARNING: possible circular locking dependency detected 6.0.0.mptcp_f5e8bfe9878d+ #1558 Not tainted ------------------------------------------------------ packetdrill/1071 is trying to acquire lock: ffff8881bd198140 (sk_lock-AF_INET){+.+.}-{0:0}, at: inet_wait_for_connect+0x19c/0x310 but task is already holding lock: ffff8881b8346540 (k-sk_lock-AF_INET){+.+.}-{0:0}, at: mptcp_sendmsg+0xfdf/0x1740 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #1 (k-sk_lock-AF_INET){+.+.}-{0:0}: __lock_acquire+0xb6d/0x1860 lock_acquire+0x1d8/0x620 lock_sock_nested+0x37/0xd0 inet_stream_connect+0x3f/0xa0 mptcp_connect+0x411/0x800 __inet_stream_connect+0x3ab/0x800 mptcp_stream_connect+0xac/0x110 __sys_connect+0x101/0x130 __x64_sys_connect+0x6e/0xb0 do_syscall_64+0x59/0x90 entry_SYSCALL_64_after_hwframe+0x63/0xcd -> #0 (sk_lock-AF_INET){+.+.}-{0:0}: check_prev_add+0x15e/0x2110 validate_chain+0xace/0xdf0 __lock_acquire+0xb6d/0x1860 lock_acquire+0x1d8/0x620 lock_sock_nested+0x37/0xd0 inet_wait_for_connect+0x19c/0x310 __inet_stream_connect+0x26c/0x800 tcp_sendmsg_fastopen+0x341/0x650 mptcp_sendmsg+0x109d/0x1740 sock_sendmsg+0xe1/0x120 __sys_sendto+0x1c7/0x2a0 __x64_sys_sendto+0xdc/0x1b0 do_syscall_64+0x59/0x90 entry_SYSCALL_64_after_hwframe+0x63/0xcd other info that might help us debug this: Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(k-sk_lock-AF_INET); lock(sk_lock-AF_INET); lock(k-sk_lock-AF_INET); lock(sk_lock-AF_INET); *** DEADLOCK *** 1 lock held by packetdrill/1071: #0: ffff8881b8346540 (k-sk_lock-AF_INET){+.+.}-{0:0}, at: mptcp_sendmsg+0xfdf/0x1740 ====================================================== The problem is caused by the blocking inet_wait_for_connect() releasing and re-acquiring the msk socket lock while the subflow socket lock is still held and the MPTCP socket requires that the msk socket lock must be acquired before the subflow socket lock. Address the issue always invoking tcp_sendmsg_fastopen() in an unblocking manner, and later eventually complete the blocking __inet_stream_connect() as needed. Fixes: d98a82a6afc7 ("mptcp: handle defer connect in mptcp_sendmsg") Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-10-25mptcp: factor out mptcp_connect()Paolo Abeni2-67/+73
The current MPTCP connect implementation duplicates a bit of inet code and does not use nor provide a struct proto->connect callback, which in turn will not fit the upcoming fastopen implementation. Refactor such implementation to use the common helper, moving the MPTCP-specific bits into mptcp_connect(). Additionally, avoid an indirect call to the subflow connect callback. Note that the fastopen call-path invokes mptcp_connect() while already holding the subflow socket lock. Explicitly keep track of such path via a new MPTCP-level flag and handle the locking accordingly. Additionally, track the connect flags in a new msk field to allow propagating them to the subflow inet_stream_connect call. Fixes: d98a82a6afc7 ("mptcp: handle defer connect in mptcp_sendmsg") Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-10-25mptcp: set msk local address earlierPaolo Abeni3-2/+9
The mptcp_pm_nl_get_local_id() code assumes that the msk local address is available at that point. For passive sockets, we initialize such address at accept() time. Depending on the running configuration and the user-space timing, a passive MPJ subflow can join the msk socket before accept() completes. In such case, the PM assigns a wrong local id to the MPJ subflow and later PM netlink operations will end-up touching the wrong/unexpected subflow. All the above causes sporadic self-tests failures, especially when the host is heavy loaded. Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/308 Fixes: 01cacb00b35c ("mptcp: add netlink-based PM") Fixes: d045b9eb95a9 ("mptcp: introduce implicit endpoints") Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-10-03mptcp: update misleading comments.Paolo Abeni1-7/+7
The MPTCP data path is quite complex and hard to understend even without some foggy comments referring to modified code and/or completely misleading from the beginning. Update a few of them to more accurately describing the current status. Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-10-03mptcp: use fastclose on more edge scenariosPaolo Abeni1-19/+44
Daire reported a user-space application hang-up when the peer is forcibly closed before the data transfer completion. The relevant application expects the peer to either do an application-level clean shutdown or a transport-level connection reset. We can accommodate a such user by extending the fastclose usage: at fd close time, if the msk socket has some unread data, and at FIN_WAIT timeout. Note that at MPTCP close time we must ensure that the TCP subflows will reset: set the linger socket option to a suitable value. Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-10-03mptcp: propagate fastclose errorPaolo Abeni1-11/+36
When an mptcp socket is closed due to an incoming FASTCLOSE option, so specific sk_err is set and later syscall will fail usually with EPIPE. Align the current fastclose error handling with TCP reset, properly setting the socket error according to the current msk state and propagating such error. Additionally sendmsg() is currently not handling properly the sk_err, always returning EPIPE. Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-09-30Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski3-29/+22
No conflicts. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-09-29mptcp: fix unreleased socket in accept queueMenglong Dong3-27/+9
The mptcp socket and its subflow sockets in accept queue can't be released after the process exit. While the release of a mptcp socket in listening state, the corresponding tcp socket will be released too. Meanwhile, the tcp socket in the unaccept queue will be released too. However, only init subflow is in the unaccept queue, and the joined subflow is not in the unaccept queue, which makes the joined subflow won't be released, and therefore the corresponding unaccepted mptcp socket will not be released to. This can be reproduced easily with following steps: 1. create 2 namespace and veth: $ ip netns add mptcp-client $ ip netns add mptcp-server $ sysctl -w net.ipv4.conf.all.rp_filter=0 $ ip netns exec mptcp-client sysctl -w net.mptcp.enabled=1 $ ip netns exec mptcp-server sysctl -w net.mptcp.enabled=1 $ ip link add red-client netns mptcp-client type veth peer red-server \ netns mptcp-server $ ip -n mptcp-server address add 10.0.0.1/24 dev red-server $ ip -n mptcp-server address add 192.168.0.1/24 dev red-server $ ip -n mptcp-client address add 10.0.0.2/24 dev red-client $ ip -n mptcp-client address add 192.168.0.2/24 dev red-client $ ip -n mptcp-server link set red-server up $ ip -n mptcp-client link set red-client up 2. configure the endpoint and limit for client and server: $ ip -n mptcp-server mptcp endpoint flush $ ip -n mptcp-server mptcp limits set subflow 2 add_addr_accepted 2 $ ip -n mptcp-client mptcp endpoint flush $ ip -n mptcp-client mptcp limits set subflow 2 add_addr_accepted 2 $ ip -n mptcp-client mptcp endpoint add 192.168.0.2 dev red-client id \ 1 subflow 3. listen and accept on a port, such as 9999. The nc command we used here is modified, which makes it use mptcp protocol by default. $ ip netns exec mptcp-server nc -l -k -p 9999 4. open another *two* terminal and use each of them to connect to the server with the following command: $ ip netns exec mptcp-client nc 10.0.0.1 9999 Input something after connect to trigger the connection of the second subflow. So that there are two established mptcp connections, with the second one still unaccepted. 5. exit all the nc command, and check the tcp socket in server namespace. And you will find that there is one tcp socket in CLOSE_WAIT state and can't release forever. Fix this by closing all of the unaccepted mptcp socket in mptcp_subflow_queue_clean() with __mptcp_close(). Now, we can ensure that all unaccepted mptcp sockets will be cleaned by __mptcp_close() before they are released, so mptcp_sock_destruct(), which is used to clean the unaccepted mptcp socket, is not needed anymore. The selftests for mptcp is ran for this commit, and no new failures. Fixes: f296234c98a8 ("mptcp: Add handling of incoming MP_JOIN requests") Fixes: 6aeed9045071 ("mptcp: fix race on unaccepted mptcp sockets") Cc: stable@vger.kernel.org Reviewed-by: Jiang Biao <benbjiang@tencent.com> Reviewed-by: Mengen Sun <mengensun@tencent.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Menglong Dong <imagedong@tencent.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-09-29mptcp: factor out __mptcp_close() without socket lockMenglong Dong2-2/+13
Factor out __mptcp_close() from mptcp_close(). The caller of __mptcp_close() should hold the socket lock, and cancel mptcp work when __mptcp_close() returns true. This function will be used in the next commit. Fixes: f296234c98a8 ("mptcp: Add handling of incoming MP_JOIN requests") Fixes: 6aeed9045071 ("mptcp: fix race on unaccepted mptcp sockets") Cc: stable@vger.kernel.org Reviewed-by: Jiang Biao <benbjiang@tencent.com> Reviewed-by: Mengen Sun <mengensun@tencent.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Menglong Dong <imagedong@tencent.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-09-29mptcp: poll allow write call before actual connectBenjamin Hesmans1-0/+4
If fastopen is used, poll must allow a first write that will trigger the SYN+data Similar to what is done in tcp_poll(). Acked-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Benjamin Hesmans <benjamin.hesmans@tessares.net> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-09-29mptcp: handle defer connect in mptcp_sendmsgDmytro Shytyi1-0/+22
When TCP_FASTOPEN_CONNECT has been set on the socket before a connect, the defer flag is set and must be handled when sendmsg is called. This is similar to what is done in tcp_sendmsg_locked(). Acked-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Co-developed-by: Benjamin Hesmans <benjamin.hesmans@tessares.net> Signed-off-by: Benjamin Hesmans <benjamin.hesmans@tessares.net> Signed-off-by: Dmytro Shytyi <dmytro@shytyi.net> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-09-29mptcp: add TCP_FASTOPEN_CONNECT socket optionBenjamin Hesmans1-1/+18
Set the option for the first subflow only. For the other subflows TFO can't be used because a mapping would be needed to cover the data in the SYN. Acked-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Benjamin Hesmans <benjamin.hesmans@tessares.net> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-09-22Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski1-1/+7
drivers/net/ethernet/freescale/fec.h 7b15515fc1ca ("Revert "fec: Restart PPS after link state change"") 40c79ce13b03 ("net: fec: add stop mode support for imx8 platform") https://lore.kernel.org/all/20220921105337.62b41047@canb.auug.org.au/ drivers/pinctrl/pinctrl-ocelot.c c297561bc98a ("pinctrl: ocelot: Fix interrupt controller") 181f604b33cd ("pinctrl: ocelot: add ability to be used in a non-mmio configuration") https://lore.kernel.org/all/20220921110032.7cd28114@canb.auug.org.au/ tools/testing/selftests/drivers/net/bonding/Makefile bbb774d921e2 ("net: Add tests for bonding and team address list management") 152e8ec77640 ("selftests/bonding: add a test for bonding lladdr target") https://lore.kernel.org/all/20220921110437.5b7dbd82@canb.auug.org.au/ drivers/net/can/usb/gs_usb.c 5440428b3da6 ("can: gs_usb: gs_can_open(): fix race dev->can.state condition") 45dfa45f52e6 ("can: gs_usb: add RX and TX hardware timestamp support") https://lore.kernel.org/all/84f45a7d-92b6-4dc5-d7a1-072152fab6ff@tessares.net/ Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-09-20tcp: Access &tcp_hashinfo via net.Kuniyuki Iwashima1-2/+5
We will soon introduce an optional per-netns ehash. This means we cannot use tcp_hashinfo directly in most places. Instead, access it via net->ipv4.tcp_death_row.hashinfo. The access will be valid only while initialising tcp_hashinfo itself and creating/destroying each netns. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-09-15mptcp: account memory allocation in mptcp_nl_cmd_add_addr() to userThomas Haller1-1/+1
Now that non-root users can configure MPTCP endpoints, account the memory allocation to the user. Signed-off-by: Thomas Haller <thaller@redhat.com> Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-09-15mptcp: allow privileged operations from user namespacesThomas Haller1-9/+9
GENL_ADMIN_PERM checks that the user has CAP_NET_ADMIN in the initial namespace by calling netlink_capable(). Instead, use GENL_UNS_ADMIN_PERM which uses netlink_ns_capable(). This checks that the caller has CAP_NET_ADMIN in the current user namespace. See also commit 4a92602aa1cd ("openvswitch: allow management from inside user namespaces") which introduced this mechanism. See also commit 5617c6cd6f84 ("nl80211: Allow privileged operations from user namespaces") which introduced this for nl80211. Signed-off-by: Thomas Haller <thaller@redhat.com> Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-09-15mptcp: add do_check_data_fin to replace copiedGeliang Tang1-3/+4
This patch adds a new bool variable 'do_check_data_fin' to replace the original int variable 'copied' in __mptcp_push_pending(), check it to determine whether to call __mptcp_check_send_data_fin(). Suggested-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Geliang Tang <geliang.tang@suse.com> Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-09-15mptcp: add mptcp_for_each_subflow_safe helperMatthieu Baerts3-4/+6
Similar to mptcp_for_each_subflow(): this is clearer now that the _safe version is used in multiple places. Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-09-13mptcp: fix fwd memory accounting on coalescePaolo Abeni1-1/+7
The intel bot reported a memory accounting related splat: [ 240.473094] ------------[ cut here ]------------ [ 240.478507] page_counter underflow: -4294828518 nr_pages=4294967290 [ 240.485500] WARNING: CPU: 2 PID: 14986 at mm/page_counter.c:56 page_counter_cancel+0x96/0xc0 [ 240.570849] CPU: 2 PID: 14986 Comm: mptcp_connect Tainted: G S 5.19.0-rc4-00739-gd24141fe7b48 #1 [ 240.581637] Hardware name: HP HP Z240 SFF Workstation/802E, BIOS N51 Ver. 01.63 10/05/2017 [ 240.590600] RIP: 0010:page_counter_cancel+0x96/0xc0 [ 240.596179] Code: 00 00 00 45 31 c0 48 89 ef 5d 4c 89 c6 41 5c e9 40 fd ff ff 4c 89 e2 48 c7 c7 20 73 39 84 c6 05 d5 b1 52 04 01 e8 e7 95 f3 01 <0f> 0b eb a9 48 89 ef e8 1e 25 fc ff eb c3 66 66 2e 0f 1f 84 00 00 [ 240.615639] RSP: 0018:ffffc9000496f7c8 EFLAGS: 00010082 [ 240.621569] RAX: 0000000000000000 RBX: ffff88819c9c0120 RCX: 0000000000000000 [ 240.629404] RDX: 0000000000000027 RSI: 0000000000000004 RDI: fffff5200092deeb [ 240.637239] RBP: ffff88819c9c0120 R08: 0000000000000001 R09: ffff888366527a2b [ 240.645069] R10: ffffed106cca4f45 R11: 0000000000000001 R12: 00000000fffffffa [ 240.652903] R13: ffff888366536118 R14: 00000000fffffffa R15: ffff88819c9c0000 [ 240.660738] FS: 00007f3786e72540(0000) GS:ffff888366500000(0000) knlGS:0000000000000000 [ 240.669529] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 240.675974] CR2: 00007f966b346000 CR3: 0000000168cea002 CR4: 00000000003706e0 [ 240.683807] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 240.691641] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 240.699468] Call Trace: [ 240.702613] <TASK> [ 240.705413] page_counter_uncharge+0x29/0x80 [ 240.710389] drain_stock+0xd0/0x180 [ 240.714585] refill_stock+0x278/0x580 [ 240.718951] __sk_mem_reduce_allocated+0x222/0x5c0 [ 240.729248] __mptcp_update_rmem+0x235/0x2c0 [ 240.734228] __mptcp_move_skbs+0x194/0x6c0 [ 240.749764] mptcp_recvmsg+0xdfa/0x1340 [ 240.763153] inet_recvmsg+0x37f/0x500 [ 240.782109] sock_read_iter+0x24a/0x380 [ 240.805353] new_sync_read+0x420/0x540 [ 240.838552] vfs_read+0x37f/0x4c0 [ 240.842582] ksys_read+0x170/0x200 [ 240.864039] do_syscall_64+0x5c/0x80 [ 240.872770] entry_SYSCALL_64_after_hwframe+0x46/0xb0 [ 240.878526] RIP: 0033:0x7f3786d9ae8e [ 240.882805] Code: c0 e9 b6 fe ff ff 50 48 8d 3d 6e 18 0a 00 e8 89 e8 01 00 66 0f 1f 84 00 00 00 00 00 64 8b 04 25 18 00 00 00 85 c0 75 14 0f 05 <48> 3d 00 f0 ff ff 77 5a c3 66 0f 1f 84 00 00 00 00 00 48 83 ec 28 [ 240.902259] RSP: 002b:00007fff7be81e08 EFLAGS: 00000246 ORIG_RAX: 0000000000000000 [ 240.910533] RAX: ffffffffffffffda RBX: 0000000000002000 RCX: 00007f3786d9ae8e [ 240.918368] RDX: 0000000000002000 RSI: 00007fff7be87ec0 RDI: 0000000000000005 [ 240.926206] RBP: 0000000000000005 R08: 00007f3786e6a230 R09: 00007f3786e6a240 [ 240.934046] R10: fffffffffffff288 R11: 0000000000000246 R12: 0000000000002000 [ 240.941884] R13: 00007fff7be87ec0 R14: 00007fff7be87ec0 R15: 0000000000002000 [ 240.949741] </TASK> [ 240.952632] irq event stamp: 27367 [ 240.956735] hardirqs last enabled at (27366): [<ffffffff81ba50ea>] mem_cgroup_uncharge_skmem+0x6a/0x80 [ 240.966848] hardirqs last disabled at (27367): [<ffffffff81b8fd42>] refill_stock+0x282/0x580 [ 240.976017] softirqs last enabled at (27360): [<ffffffff83a4d8ef>] mptcp_recvmsg+0xaf/0x1340 [ 240.985273] softirqs last disabled at (27364): [<ffffffff83a4d30c>] __mptcp_move_skbs+0x18c/0x6c0 [ 240.994872] ---[ end trace 0000000000000000 ]--- After commit d24141fe7b48 ("mptcp: drop SK_RECLAIM_* macros"), if rmem_fwd_alloc become negative, mptcp_rmem_uncharge() can try to reclaim a negative amount of pages, since the expression: reclaimable >= PAGE_SIZE will evaluate to true for any negative value of the int 'reclaimable': 'PAGE_SIZE' is an unsigned long and the negative integer will be promoted to a (very large) unsigned long value. Still after the mentioned commit, kfree_skb_partial() in mptcp_try_coalesce() will reclaim most of just released fwd memory, so that following charging of the skb delta size will lead to negative fwd memory values. At that point a racing recvmsg() can trigger the splat. Address the issue switching the order of the memory accounting operations. The fwd memory can still transiently reach negative values, but that will happen in an atomic scope and no code path could touch/use such value. Reported-by: kernel test robot <oliver.sang@intel.com> Fixes: d24141fe7b48 ("mptcp: drop SK_RECLAIM_* macros") Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Link: https://lore.kernel.org/r/20220906180404.1255873-1-matthieu.baerts@tessares.net Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-08-29genetlink: start to validate reserved header bytesJakub Kicinski1-0/+1
We had historically not checked that genlmsghdr.reserved is 0 on input which prevents us from using those precious bytes in the future. One use case would be to extend the cmd field, which is currently just 8 bits wide and 256 is not a lot of commands for some core families. To make sure that new families do the right thing by default put the onus of opting out of validation on existing families. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Acked-by: Paul Moore <paul@paul-moore.com> (NetLabel) Signed-off-by: David S. Miller <davem@davemloft.net>
2022-08-24net: Fix data-races around sysctl_max_skb_frags.Kuniyuki Iwashima1-1/+1
While reading sysctl_max_skb_frags, it can be changed concurrently. Thus, we need to add READ_ONCE() to its readers. Fixes: 5f74f82ea34c ("net:Add sysctl_max_skb_frags") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-08-05mptcp: do not queue data on closed subflowsPaolo Abeni2-5/+14
Dipanjan reported a syzbot splat at close time: WARNING: CPU: 1 PID: 10818 at net/ipv4/af_inet.c:153 inet_sock_destruct+0x6d0/0x8e0 net/ipv4/af_inet.c:153 Modules linked in: uio_ivshmem(OE) uio(E) CPU: 1 PID: 10818 Comm: kworker/1:16 Tainted: G OE 5.19.0-rc6-g2eae0556bb9d #2 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1.1 04/01/2014 Workqueue: events mptcp_worker RIP: 0010:inet_sock_destruct+0x6d0/0x8e0 net/ipv4/af_inet.c:153 Code: 21 02 00 00 41 8b 9c 24 28 02 00 00 e9 07 ff ff ff e8 34 4d 91 f9 89 ee 4c 89 e7 e8 4a 47 60 ff e9 a6 fc ff ff e8 20 4d 91 f9 <0f> 0b e9 84 fe ff ff e8 14 4d 91 f9 0f 0b e9 d4 fd ff ff e8 08 4d RSP: 0018:ffffc9001b35fa78 EFLAGS: 00010246 RAX: 0000000000000000 RBX: 00000000002879d0 RCX: ffff8881326f3b00 RDX: 0000000000000000 RSI: ffff8881326f3b00 RDI: 0000000000000002 RBP: ffff888179662674 R08: ffffffff87e983a0 R09: 0000000000000000 R10: 0000000000000005 R11: 00000000000004ea R12: ffff888179662400 R13: ffff888179662428 R14: 0000000000000001 R15: ffff88817e38e258 FS: 0000000000000000(0000) GS:ffff8881f5f00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000020007bc0 CR3: 0000000179592000 CR4: 0000000000150ee0 Call Trace: <TASK> __sk_destruct+0x4f/0x8e0 net/core/sock.c:2067 sk_destruct+0xbd/0xe0 net/core/sock.c:2112 __sk_free+0xef/0x3d0 net/core/sock.c:2123 sk_free+0x78/0xa0 net/core/sock.c:2134 sock_put include/net/sock.h:1927 [inline] __mptcp_close_ssk+0x50f/0x780 net/mptcp/protocol.c:2351 __mptcp_destroy_sock+0x332/0x760 net/mptcp/protocol.c:2828 mptcp_worker+0x5d2/0xc90 net/mptcp/protocol.c:2586 process_one_work+0x9cc/0x1650 kernel/workqueue.c:2289 worker_thread+0x623/0x1070 kernel/workqueue.c:2436 kthread+0x2e9/0x3a0 kernel/kthread.c:376 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:302 </TASK> The root cause of the problem is that an mptcp-level (re)transmit can race with mptcp_close() and the packet scheduler checks the subflow state before acquiring the socket lock: we can try to (re)transmit on an already closed ssk. Fix the issue checking again the subflow socket status under the subflow socket lock protection. Additionally add the missing check for the fallback-to-tcp case. Fixes: d5f49190def6 ("mptcp: allow picking different xmit subflows") Reported-by: Dipanjan Das <mail.dipanjan.das@gmail.com> Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-08-05mptcp: move subflow cleanup in mptcp_destroy_common()Paolo Abeni3-26/+18
If the mptcp socket creation fails due to a CGROUP_INET_SOCK_CREATE eBPF program, the MPTCP protocol ends-up leaking all the subflows: the related cleanup happens in __mptcp_destroy_sock() that is not invoked in such code path. Address the issue moving the subflow sockets cleanup in the mptcp_destroy_common() helper, which is invoked in every msk cleanup path. Additionally get rid of the intermediate list_splice_init step, which is an unneeded relic from the past. The issue is present since before the reported root cause commit, but any attempt to backport the fix before that hash will require a complete rewrite. Fixes: e16163b6e2 ("mptcp: refactor shutdown and close") Reported-by: Nguyen Dinh Phi <phind.uet@gmail.com> Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Co-developed-by: Nguyen Dinh Phi <phind.uet@gmail.com> Signed-off-by: Nguyen Dinh Phi <phind.uet@gmail.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-07-29Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski3-6/+6
No conflicts. Signed-off-by: Jakub Kicinski <kuba@kernel.org>