path: root/net
Age | Commit message | Author | Files | Lines
4 days | mptcp: pm: do not remove closing subflows | Matthieu Baerts (NGI0) | 1 | -1/+2
commit db0a37b7ac27d8ca27d3dc676a16d081c16ec7b9 upstream. In a previous fix, the in-kernel path-manager has been modified not to retrigger the removal of a subflow if it was already closed, e.g. when the initial subflow is removed, but kept in the subflows list. To be complete, this fix should also skip the subflows that are in any closing state: mptcp_close_ssk() will initiate the closure, but the switch to the TCP_CLOSE state depends on the other peer. Fixes: 58e1b66b4e4b ("mptcp: pm: do not remove already closed subflows") Cc: stable@vger.kernel.org Suggested-by: Paolo Abeni <pabeni@redhat.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20241008-net-mptcp-fallback-fixes-v1-4-c6fb8e93e551@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
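The skip rule described above can be pictured with a small userspace sketch. It is not the kernel code: the enum, the mask and the helper names are hypothetical, and the exact set of states counted as "closed or closing" is an assumption made for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical subset of TCP states, for illustration only. */
enum tcp_state_sketch {
    ST_ESTABLISHED,
    ST_FIN_WAIT1,
    ST_FIN_WAIT2,
    ST_CLOSING,
    ST_CLOSE,
};

/* One bit per state so several states can be tested with a single mask. */
#define STATE_BIT(s) (1u << (s))

/* Assumption: these are the states treated as "closed or closing". */
#define CLOSING_MASK (STATE_BIT(ST_FIN_WAIT1) | STATE_BIT(ST_FIN_WAIT2) | \
                      STATE_BIT(ST_CLOSING) | STATE_BIT(ST_CLOSE))

/* True when a removal request must not be triggered again. */
static bool should_skip_removal(enum tcp_state_sketch state)
{
    return (STATE_BIT(state) & CLOSING_MASK) != 0;
}

/* Count the subflows that would still receive a removal request. */
static int count_removable(const enum tcp_state_sketch *states, int n)
{
    int i, count = 0;

    for (i = 0; i < n; i++) {
        if (should_skip_removal(states[i]))
            continue; /* closure already initiated, waiting on the peer */
        count++;
    }
    return count;
}
```

Testing a bitmask rather than comparing against a single closed state is what lets the in-between states, where closure has started but the peer has not finished it, be skipped as well.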
4 days | mptcp: handle consistently DSS corruption | Paolo Abeni | 4 | -4/+28
commit e32d262c89e2b22cb0640223f953b548617ed8a6 upstream. A buggy peer implementation can send corrupted DSS options, consistently hitting a few warnings in the data path. Use DEBUG_NET assertions to avoid the splat on some builds, and handle the error consistently, dumping related MIBs and performing fallback and/or reset according to the subflow type. Fixes: 6771bfd9ee24 ("mptcp: update mptcp ack sequence from work queue") Cc: stable@vger.kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20241008-net-mptcp-fallback-fixes-v1-1-c6fb8e93e551@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
4 days | net: Fix an unsafe loop on the list | Anastasia Kovaleva | 1 | -1/+2
commit 1dae9f1187189bc09ff6d25ca97ead711f7e26f9 upstream. The kernel may crash when deleting a genetlink family if there are still listeners for that family: Oops: Kernel access of bad area, sig: 11 [#1] ... NIP [c000000000c080bc] netlink_update_socket_mc+0x3c/0xc0 LR [c000000000c0f764] __netlink_clear_multicast_users+0x74/0xc0 Call Trace: __netlink_clear_multicast_users+0x74/0xc0 genl_unregister_family+0xd4/0x2d0 Change the unsafe loop on the list to a safe one, because inside the loop there is an element removal from this list. Fixes: b8273570f802 ("genetlink: fix netns vs. netlink table locking (2)") Cc: stable@vger.kernel.org Signed-off-by: Anastasia Kovaleva <a.kovaleva@yadro.com> Reviewed-by: Dmitry Bogdanov <d.bogdanov@yadro.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20241003104431.12391-1-a.kovaleva@yadro.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
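Why a "safe" loop is needed when the body removes the current element can be shown with a small userspace sketch. The list type, the macro and the helper names below are hypothetical stand-ins and are not the kernel's list API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical userspace stand-in for a socket on a bound list. */
struct node {
    int groups;
    struct node *next;
};

/* Safe traversal: 'tmp' caches the successor before the body runs,
 * so the body may unlink and free 'pos' without breaking the walk. */
#define list_for_each_safe_sketch(pos, tmp, head) \
    for ((pos) = (head); (pos) && ((tmp) = (pos)->next, 1); (pos) = (tmp))

/* Push a new node on the front of the list; returns the new head. */
static struct node *push(struct node *head, int groups)
{
    struct node *n = malloc(sizeof(*n));

    if (!n)
        abort();
    n->groups = groups;
    n->next = head;
    return n;
}

/* Remove every node whose 'groups' matches; returns the new head. */
static struct node *remove_matching(struct node *head, int groups, int *removed)
{
    struct node *pos, *tmp, **link = &head;

    list_for_each_safe_sketch(pos, tmp, head) {
        if (pos->groups == groups) {
            *link = tmp;   /* unlink pos from its predecessor */
            free(pos);     /* pos must not be dereferenced after this */
            (*removed)++;
        } else {
            link = &pos->next;
        }
    }
    return head;
}
```

An unsafe variant would advance with `pos = pos->next` after the body, which reads freed memory whenever the body removed `pos`.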
4 days | net: explicitly clear the sk pointer, when pf->create fails | Ignat Korchagin | 1 | -1/+6
commit 631083143315d1b192bd7d915b967b37819e88ea upstream. We have recently noticed the exact same KASAN splat as in commit 6cd4a78d962b ("net: do not leave a dangling sk pointer, when socket creation fails"). The problem is that commit did not fully address the problem, as some pf->create implementations do not use sk_common_release in their error paths. For example, we can use the same reproducer as in the above commit, but changing ping to arping. arping uses AF_PACKET socket and if packet_create fails, it will just sk_free the allocated sk object. While we could chase all the pf->create implementations and make sure they NULL the freed sk object on error from the socket, we can't guarantee future protocols will not make the same mistake. So it is easier to just explicitly NULL the sk pointer upon return from pf->create in __sock_create. We do know that pf->create always releases the allocated sk object on error, so if the pointer is not NULL, it is definitely dangling. Fixes: 6cd4a78d962b ("net: do not leave a dangling sk pointer, when socket creation fails") Signed-off-by: Ignat Korchagin <ignat@cloudflare.com> Cc: stable@vger.kernel.org Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20241003170151.69445-1-ignat@cloudflare.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
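The idea of clearing the pointer in the caller, instead of auditing every create hook, can be sketched in userspace as follows. All types and function names here are hypothetical; only the pattern (NULL the pointer when the hook returns an error) comes from the description above.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the socket and its protocol-private object. */
struct sk_sketch { int dummy; };
struct socket_sketch { struct sk_sketch *sk; };

typedef int (*create_fn)(struct socket_sketch *sock);

/* A create hook whose error path frees the object but forgets to
 * clear the pointer, leaving it dangling. */
static int failing_create(struct socket_sketch *sock)
{
    sock->sk = malloc(sizeof(*sock->sk));
    if (!sock->sk)
        return -ENOMEM;
    free(sock->sk);
    return -ENOBUFS;
}

/* A create hook that succeeds and leaves a valid object attached. */
static int ok_create(struct socket_sketch *sock)
{
    sock->sk = malloc(sizeof(*sock->sk));
    return sock->sk ? 0 : -ENOMEM;
}

/* Caller-side fix: hooks release the object on error, so any pointer
 * left behind after a failure is dangling and is cleared here. */
static int sock_create_sketch(struct socket_sketch *sock, create_fn create)
{
    int err = create(sock);

    if (err < 0) {
        sock->sk = NULL;
        return err;
    }
    return 0;
}
```

Placing the reset in the single caller covers present and future hooks at once, which is the trade-off the commit message argues for.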
4 days | mptcp: fallback when MPTCP opts are dropped after 1st data | Matthieu Baerts (NGI0) | 1 | -1/+1
commit 119d51e225febc8152476340a880f5415a01e99e upstream.

As reported by Christoph [1], before this patch, an MPTCP connection was wrongly reset when a host received a first data packet with MPTCP options after the 3wHS, but got the next ones without.

According to the MPTCP v1 specs [2], a fallback should happen in this case, because the host didn't receive a DATA_ACK from the other peer, nor receive data for more than the initial window which implies a DATA_ACK being received by the other peer.

The patch here re-uses the same logic as the one used in other places: by looking at allow_infinite_fallback, which is disabled at the creation of an additional subflow. It's not looking at the first DATA_ACK (or implying one received from the other side) as suggested by the RFC, but it is in continuation with what was already done, which is safer, and it fixes the reported issue. The next step, looking at this first DATA_ACK, is tracked in [4].

This patch has been validated using the following Packetdrill script:

     0 socket(..., SOCK_STREAM, IPPROTO_MPTCP) = 3
    +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
    +0 bind(3, ..., ...) = 0
    +0 listen(3, 1) = 0

    // 3WHS is OK
    +0.0 < S 0:0(0) win 65535 <mss 1460, sackOK, nop, nop, nop, wscale 6, mpcapable v1 flags[flag_h] nokey>
    +0.0 > S. 0:0(0) ack 1 <mss 1460, nop, nop, sackOK, nop, wscale 8, mpcapable v1 flags[flag_h] key[skey]>
    +0.1 < . 1:1(0) ack 1 win 2048 <mpcapable v1 flags[flag_h] key[ckey=2, skey]>
    +0 accept(3, ..., ...) = 4

    // Data from the client with valid MPTCP options (no DATA_ACK: normal)
    +0.1 < P. 1:501(500) ack 1 win 2048 <mpcapable v1 flags[flag_h] key[skey, ckey] mpcdatalen 500, nop, nop>
    // From here, the MPTCP options will be dropped by a middlebox
    +0.0 > . 1:1(0) ack 501 <dss dack8=501 dll=0 nocs>

    +0.1 read(4, ..., 500) = 500
    +0 write(4, ..., 100) = 100

    // The server replies with data, still thinking MPTCP is being used
    +0.0 > P. 1:101(100) ack 501 <dss dack8=501 dsn8=1 ssn=1 dll=100 nocs, nop, nop>
    // But the client already did a fallback to TCP, because the two previous packets have been received without MPTCP options
    +0.1 < . 501:501(0) ack 101 win 2048

    +0.0 < P. 501:601(100) ack 101 win 2048
    // The server should fallback to TCP, not reset: it didn't get a DATA_ACK, nor data for more than the initial window
    +0.0 > . 101:101(0) ack 601

Note that this script requires Packetdrill with MPTCP support, see [3].

Fixes: dea2b1ea9c70 ("mptcp: do not reset MP_CAPABLE subflow on mapping errors")
Cc: stable@vger.kernel.org
Reported-by: Christoph Paasch <cpaasch@apple.com>
Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/518 [1]
Link: https://datatracker.ietf.org/doc/html/rfc8684#name-fallback [2]
Link: https://github.com/multipath-tcp/packetdrill [3]
Link: https://github.com/multipath-tcp/mptcp_net-next/issues/519 [4]
Reviewed-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20241008-net-mptcp-fallback-fixes-v1-3-c6fb8e93e551@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
4 days | Bluetooth: hci_conn: Fix UAF in hci_enhanced_setup_sync | Luiz Augusto von Dentz | 1 | -0/+3
commit 18fd04ad856df07733f5bb07e7f7168e7443d393 upstream. This checks if the ACL connection remains valid as it could be destroyed while hci_enhanced_setup_sync is pending on cmd_sync leading to the following trace: BUG: KASAN: slab-use-after-free in hci_enhanced_setup_sync+0x91b/0xa60 Read of size 1 at addr ffff888002328ffd by task kworker/u5:2/37 CPU: 0 UID: 0 PID: 37 Comm: kworker/u5:2 Not tainted 6.11.0-rc6-01300-g810be445d8d6 #7099 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-2.fc40 04/01/2014 Workqueue: hci0 hci_cmd_sync_work Call Trace: <TASK> dump_stack_lvl+0x5d/0x80 ? hci_enhanced_setup_sync+0x91b/0xa60 print_report+0x152/0x4c0 ? hci_enhanced_setup_sync+0x91b/0xa60 ? __virt_addr_valid+0x1fa/0x420 ? hci_enhanced_setup_sync+0x91b/0xa60 kasan_report+0xda/0x1b0 ? hci_enhanced_setup_sync+0x91b/0xa60 hci_enhanced_setup_sync+0x91b/0xa60 ? __pfx_hci_enhanced_setup_sync+0x10/0x10 ? __pfx___mutex_lock+0x10/0x10 hci_cmd_sync_work+0x1c2/0x330 process_one_work+0x7d9/0x1360 ? __pfx_lock_acquire+0x10/0x10 ? __pfx_process_one_work+0x10/0x10 ? assign_work+0x167/0x240 worker_thread+0x5b7/0xf60 ? __kthread_parkme+0xac/0x1c0 ? __pfx_worker_thread+0x10/0x10 ? __pfx_worker_thread+0x10/0x10 kthread+0x293/0x360 ? __pfx_kthread+0x10/0x10 ret_from_fork+0x2f/0x70 ? 
__pfx_kthread+0x10/0x10 ret_from_fork_asm+0x1a/0x30 </TASK> Allocated by task 34: kasan_save_stack+0x30/0x50 kasan_save_track+0x14/0x30 __kasan_kmalloc+0x8f/0xa0 __hci_conn_add+0x187/0x17d0 hci_connect_sco+0x2e1/0xb90 sco_sock_connect+0x2a2/0xb80 __sys_connect+0x227/0x2a0 __x64_sys_connect+0x6d/0xb0 do_syscall_64+0x71/0x140 entry_SYSCALL_64_after_hwframe+0x76/0x7e Freed by task 37: kasan_save_stack+0x30/0x50 kasan_save_track+0x14/0x30 kasan_save_free_info+0x3b/0x60 __kasan_slab_free+0x101/0x160 kfree+0xd0/0x250 device_release+0x9a/0x210 kobject_put+0x151/0x280 hci_conn_del+0x448/0xbf0 hci_abort_conn_sync+0x46f/0x980 hci_cmd_sync_work+0x1c2/0x330 process_one_work+0x7d9/0x1360 worker_thread+0x5b7/0xf60 kthread+0x293/0x360 ret_from_fork+0x2f/0x70 ret_from_fork_asm+0x1a/0x30 Cc: stable@vger.kernel.org Fixes: e07a06b4eb41 ("Bluetooth: Convert SCO configure_datapath to hci_sync") Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
4 days | wifi: mac80211: Avoid address calculations via out of bounds array indexing | Kenton Groombridge | 1 | -8/+9
[ Upstream commit 2663d0462eb32ae7c9b035300ab6b1523886c718 ] req->n_channels must be set before req->channels[] can be used. This patch fixes one of the issues encountered in [1]. [ 83.964255] UBSAN: array-index-out-of-bounds in net/mac80211/scan.c:364:4 [ 83.964258] index 0 is out of range for type 'struct ieee80211_channel *[]' [...] [ 83.964264] Call Trace: [ 83.964267] <TASK> [ 83.964269] dump_stack_lvl+0x3f/0xc0 [ 83.964274] __ubsan_handle_out_of_bounds+0xec/0x110 [ 83.964278] ieee80211_prep_hw_scan+0x2db/0x4b0 [ 83.964281] __ieee80211_start_scan+0x601/0x990 [ 83.964291] nl80211_trigger_scan+0x874/0x980 [ 83.964295] genl_family_rcv_msg_doit+0xe8/0x160 [ 83.964298] genl_rcv_msg+0x240/0x270 [...] [1] https://bugzilla.kernel.org/show_bug.cgi?id=218810 Co-authored-by: Kees Cook <keescook@chromium.org> Signed-off-by: Kees Cook <kees@kernel.org> Signed-off-by: Kenton Groombridge <concord@gentoo.org> Link: https://msgid.link/20240605152218.236061-1-concord@gentoo.org Signed-off-by: Johannes Berg <johannes.berg@intel.com> [Xiangyu: Modified to apply on 6.1.y and 6.6.y] Signed-off-by: Xiangyu Chen <xiangyu.chen@windriver.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
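The ordering requirement (set the count before touching the array) can be illustrated with a userspace sketch. The struct and helpers are hypothetical, and the checked accessor merely stands in for a bounds check driven by the count field, which is an assumption about how the reported check works.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical scan request: a flexible array whose valid length is
 * described by n_channels. */
struct scan_req_sketch {
    unsigned int n_channels;
    int channels[];
};

/* Checked accessor standing in for a count-driven bounds check:
 * any index at or beyond n_channels is rejected. */
static int *channel_slot(struct scan_req_sketch *req, unsigned int idx)
{
    return idx < req->n_channels ? &req->channels[idx] : NULL;
}

/* Build a request; the count is written before any element is
 * addressed, so every access passes the bounds check. */
static struct scan_req_sketch *build_req(const int *src, unsigned int n)
{
    struct scan_req_sketch *req;
    unsigned int i;

    req = calloc(1, sizeof(*req) + n * sizeof(req->channels[0]));
    if (!req)
        return NULL;

    req->n_channels = n; /* must come first */
    for (i = 0; i < n; i++)
        *channel_slot(req, i) = src[i];
    return req;
}
```

If the count were assigned after the loop, `channel_slot(req, 0)` would return NULL on the first iteration, which mirrors the "index 0 is out of range" report quoted above.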
4 days | phonet: Handle error of rtnl_register_module(). | Kuniyuki Iwashima | 1 | -17/+11
[ Upstream commit b5e837c86041bef60f36cf9f20a641a30764379a ] Before commit addf9b90de22 ("net: rtnetlink: use rcu to free rtnl message handlers"), once the first rtnl_register_module() allocated rtnl_msg_handlers[PF_PHONET], the following calls never failed. However, after the commit, rtnl_register_module() could fail silently to allocate rtnl_msg_handlers[PF_PHONET][msgtype] and requires error handling for each call. Handling the error allows users to view a module as an all-or-nothing thing in terms of the rtnetlink functionality. This prevents syzkaller from reporting spurious errors from its tests, where OOM often occurs and module is automatically loaded. Let's use rtnl_register_many() to handle the errors easily. Fixes: addf9b90de22 ("net: rtnetlink: use rcu to free rtnl message handlers") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Acked-by: Rémi Denis-Courmont <courmisch@gmail.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
4 days | phonet: no longer hold RTNL in route_dumpit() | Eric Dumazet | 1 | -8/+9
[ Upstream commit 58a4ff5d77b187086eb12d41d613749420947f19 ] route_dumpit() already relies on RCU, RTNL is not needed. Also change return value at the end of a dump. This allows NLMSG_DONE to be appended to the current skb at the end of a dump, saving a couple of recvmsg() system calls. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Remi Denis-Courmont <courmisch@gmail.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://lore.kernel.org/r/20240507121748.416287-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Stable-dep-of: b5e837c86041 ("phonet: Handle error of rtnl_register_module().") Signed-off-by: Sasha Levin <sashal@kernel.org>
4 days | mpls: Handle error of rtnl_register_module(). | Kuniyuki Iwashima | 1 | -11/+21
[ Upstream commit 5be2062e3080e3ff6707816caa445ec0c6eaacf7 ] Since introduced, mpls_init() has been ignoring the returned value of rtnl_register_module(), which could fail silently. Handling the error allows users to view a module as an all-or-nothing thing in terms of the rtnetlink functionality. This prevents syzkaller from reporting spurious errors from its tests, where OOM often occurs and module is automatically loaded. Let's handle the errors by rtnl_register_many(). Fixes: 03c0566542f4 ("mpls: Netlink commands to add, remove, and dump routes") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
4 days | mpls: no longer hold RTNL in mpls_netconf_dump_devconf() | Eric Dumazet | 1 | -37/+22
[ Upstream commit e0f89d2864b062b027196925ea19f94b2ce50d6a ]
- Use for_each_netdev_dump() to no longer rely on net->dev_index_head hash table.
- No longer care about net->dev_base_seq.
- Fix return value at the end of a dump, so that NLMSG_DONE can be appended to current skb, saving one recvmsg() system call.
- No longer grab RTNL, RCU protection is enough, after adding one READ_ONCE(mdev->input_enabled) in mpls_netconf_fill_devconf().
Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20240410111951.2673193-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Stable-dep-of: 5be2062e3080 ("mpls: Handle error of rtnl_register_module().") Signed-off-by: Sasha Levin <sashal@kernel.org>
4 days | rtnetlink: add RTNL_FLAG_DUMP_UNLOCKED flag | Eric Dumazet | 2 | -0/+5
[ Upstream commit 386520e0ecc01004d3a29c70c5a77d4bbf8a8420 ] Similarly to RTNL_FLAG_DOIT_UNLOCKED, this new flag allows dump operations registered via rtnl_register() or rtnl_register_module() to opt-out from RTNL protection. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Donald Hunter <donald.hunter@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Stable-dep-of: 5be2062e3080 ("mpls: Handle error of rtnl_register_module().") Signed-off-by: Sasha Levin <sashal@kernel.org>
4 days | rtnetlink: change nlk->cb_mutex role | Eric Dumazet | 2 | -16/+21
[ Upstream commit e39951d965bf58b5aba7f61dc1140dcb8271af22 ] In commit af65bdfce98d ("[NETLINK]: Switch cb_lock spinlock to mutex and allow to override it"), Patrick McHardy used a common mutex to protect both nlk->cb and the dump() operations. The override is used for rtnl dumps, registered with rtnl_register() and rtnl_register_module(). We want to be able to opt-out some dump() operations to not acquire RTNL, so we need to protect nlk->cb with a per socket mutex. This patch renames nlk->cb_def_mutex to nlk->nl_cb_mutex. The optional pointer to the mutex used to protect dump() call is stored in nlk->dump_cb_mutex. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Stable-dep-of: 5be2062e3080 ("mpls: Handle error of rtnl_register_module().") Signed-off-by: Sasha Levin <sashal@kernel.org>
4 days | mctp: Handle error of rtnl_register_module(). | Kuniyuki Iwashima | 4 | -35/+65
[ Upstream commit d51705614f668254cc5def7490df76f9680b4659 ] Since introduced, mctp has been ignoring the returned value of rtnl_register_module(), which could fail silently. Handling the error allows users to view a module as an all-or-nothing thing in terms of the rtnetlink functionality. This prevents syzkaller from reporting spurious errors from its tests, where OOM often occurs and module is automatically loaded. Let's handle the errors by rtnl_register_many(). Fixes: 583be982d934 ("mctp: Add device handling and netlink interface") Fixes: 831119f88781 ("mctp: Add neighbour netlink interface") Fixes: 06d2f4c583a7 ("mctp: Add netlink route management") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Jeremy Kerr <jk@codeconstruct.com.au> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
4 days | bridge: Handle error of rtnl_register_module(). | Kuniyuki Iwashima | 3 | -13/+17
[ Upstream commit cba5e43b0b757734b1e79f624d93a71435e31136 ] Since introduced, br_vlan_rtnl_init() has been ignoring the returned value of rtnl_register_module(), which could fail silently. Handling the error allows users to view a module as an all-or-nothing thing in terms of the rtnetlink functionality. This prevents syzkaller from reporting spurious errors from its tests, where OOM often occurs and module is automatically loaded. Let's handle the errors by rtnl_register_many(). Fixes: 8dcea187088b ("net: bridge: vlan: add rtm definitions and dump support") Fixes: f26b296585dc ("net: bridge: vlan: add new rtm message support") Fixes: adb3ce9bcb0f ("net: bridge: vlan: add del rtm message support") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
4 days | rtnetlink: Add bulk registration helpers for rtnetlink message handlers. | Kuniyuki Iwashima | 1 | -0/+29
[ Upstream commit 07cc7b0b942bf55ef1a471470ecda8d2a6a6541f ] Before commit addf9b90de22 ("net: rtnetlink: use rcu to free rtnl message handlers"), once rtnl_msg_handlers[protocol] was allocated, the following rtnl_register_module() for the same protocol never failed. However, after the commit, rtnl_msg_handlers[protocol][msgtype] needs to be allocated in each rtnl_register_module(), so each call could fail. Many callers of rtnl_register_module() do not handle the returned error, so error handling would have to be added in many places. To handle that easily, let's add wrapper functions for bulk registration of rtnetlink message handlers. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Stable-dep-of: 78b7b991838a ("vxlan: Handle error of rtnl_register_module().") Signed-off-by: Sasha Levin <sashal@kernel.org>
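The all-or-nothing behaviour that bulk registration aims for can be sketched in userspace as below. The registry, the handler struct and every function name are hypothetical and do not reproduce the kernel helpers or their signatures; the `fail` field only simulates an allocation failure.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

/* Hypothetical handler description; 'fail' simulates an allocation
 * failure for that message type. */
struct handler_sketch {
    int msgtype;
    int fail;
};

/* Hypothetical registry, indexed by message type. */
static int registered[16];

static int register_one(const struct handler_sketch *h)
{
    if (h->fail)
        return -ENOBUFS;
    registered[h->msgtype] = 1;
    return 0;
}

static void unregister_many(const struct handler_sketch *hs, int n)
{
    int i;

    for (i = 0; i < n; i++)
        registered[hs[i].msgtype] = 0;
}

/* Register all handlers, or none: on the first failure, roll back the
 * handlers that were already registered and return the error. */
static int register_many(const struct handler_sketch *hs, int n)
{
    int i, err;

    for (i = 0; i < n; i++) {
        err = register_one(&hs[i]);
        if (err) {
            unregister_many(hs, i);
            return err;
        }
    }
    return 0;
}
```

A module init function can then check one return value instead of one per message type, and a failed load leaves no partially registered handlers behind.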
4 days | net: do not delay dst_entries_add() in dst_release() | Eric Dumazet | 1 | -5/+12
[ Upstream commit ac888d58869bb99753e7652be19a151df9ecb35d ] dst_entries_add() uses per-cpu data that might be freed at netns dismantle from ip6_route_net_exit() calling dst_entries_destroy(). Before ip6_route_net_exit() can be called, we release all the dsts associated with this netns, via calls to dst_release(), which waits an rcu grace period before calling dst_destroy(). dst_entries_add() use in dst_destroy() is racy, because dst_entries_destroy() could have been called already. Decrementing the number of dsts must happen sooner. Notes: 1) in CONFIG_XFRM case, dst_destroy() can call dst_release_immediate(child), this might also cause UAF if the child does not have DST_NOCOUNT set. IPSEC maintainers might take a look and see how to address this. 2) There is also discussion about removing this count of dst, which might happen in future kernels. Fixes: f88649721268 ("ipv4: fix dst race in sk_dst_get()") Closes: https://lore.kernel.org/lkml/CANn89iLCCGsP7SFn9HKpvnKu96Td4KD08xf7aGtiYgZnkjaL=w@mail.gmail.com/T/ Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org> Tested-by: Linux Kernel Functional Testing <lkft@linaro.org> Tested-by: Naresh Kamboju <naresh.kamboju@linaro.org> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Xin Long <lucien.xin@gmail.com> Cc: Steffen Klassert <steffen.klassert@secunet.com> Reviewed-by: Xin Long <lucien.xin@gmail.com> Link: https://patch.msgid.link/20241008143110.1064899-1-edumazet@google.com Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
4 days | netfilter: fib: check correct rtable in vrf setups | Florian Westphal | 2 | -5/+4
[ Upstream commit 05ef7055debc804e8083737402127975e7244fc4 ] We need to init l3mdev unconditionally, else main routing table is searched and incorrect result is returned unless strict (iif keyword) matching is requested. Next patch adds a selftest for this. Fixes: 2a8a7c0eaa87 ("netfilter: nft_fib: Fix for rpath check with VRF devices") Closes: https://bugzilla.netfilter.org/show_bug.cgi?id=1761 Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
4 days | netfilter: xtables: avoid NFPROTO_UNSPEC where needed | Florian Westphal | 16 | -165/+422
[ Upstream commit 0bfcb7b71e735560077a42847f69597ec7dcc326 ] syzbot managed to call xt_cluster match via ebtables: WARNING: CPU: 0 PID: 11 at net/netfilter/xt_cluster.c:72 xt_cluster_mt+0x196/0x780 [..] ebt_do_table+0x174b/0x2a40 Module registers to NFPROTO_UNSPEC, but it assumes ipv4/ipv6 packet processing. As this is only useful to restrict locally terminating TCP/UDP traffic, register this for ipv4 and ipv6 family only. Pablo points out that this is a general issue, direct users of the set/getsockopt interface can call into targets/matches that were only intended for use with ip(6)tables. Check all UNSPEC matches and targets for similar issues:
- matches and targets are fine except if they assume skb_network_header() is valid -- this is only true when called from inet layer: ip(6) stack pulls the ip/ipv6 header into linear data area.
- targets that return XT_CONTINUE or other xtables verdicts must be restricted too, they are incompatible with the ebtables traverser, e.g. EBT_CONTINUE is a completely different value than XT_CONTINUE.
Most matches/targets are changed to register for NFPROTO_IPV4/IPV6, as they are provided for use by ip(6)tables. The MARK target is also used by arptables, so register for NFPROTO_ARP too. While at it, bail out if connbytes fails to enable the corresponding conntrack family. This change passes the selftests in iptables.git. Reported-by: syzbot+256c348558aa5cf611a9@syzkaller.appspotmail.com Closes: https://lore.kernel.org/netfilter-devel/66fec2e2.050a0220.9ec68.0047.GAE@google.com/ Fixes: 0269ea493734 ("netfilter: xtables: add cluster match") Signed-off-by: Florian Westphal <fw@strlen.de> Co-developed-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
4 days | sctp: ensure sk_state is set to CLOSED if hashing fails in sctp_listen_start | Xin Long | 1 | -5/+13
[ Upstream commit 4d5c70e6155d5eae198bade4afeab3c1b15073b6 ] If hashing fails in sctp_listen_start(), the socket remains in the LISTENING state, even though it was not added to the hash table. This can lead to a scenario where a socket appears to be listening without actually being accessible. This patch ensures that if the hashing operation fails, the sk_state is set back to CLOSED before returning an error. Note that there is no need to undo the autobind operation if hashing fails, as the bind port can still be used for next listen() call on the same socket. Fixes: 76c6d988aeb3 ("sctp: add sock_reuseport for the sock in __sctp_hash_endpoint") Reported-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
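The state rollback described above follows a common pattern: publish the new state, attempt the step that can fail, and restore the old state on error. A minimal userspace sketch, with hypothetical types and names rather than the SCTP code, could look like this:

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical socket states and endpoint, for illustration only. */
enum state_sketch { ST_CLOSED, ST_LISTENING };

struct ep_sketch {
    enum state_sketch state;
    int hashed; /* non-zero once reachable through the hash table */
};

typedef int (*hash_fn)(struct ep_sketch *ep);

/* Simulated hashing step that always fails. */
static int hash_fail(struct ep_sketch *ep)
{
    (void)ep;
    return -EADDRINUSE;
}

/* Simulated hashing step that succeeds. */
static int hash_ok(struct ep_sketch *ep)
{
    ep->hashed = 1;
    return 0;
}

/* Move to LISTENING, then hash; if hashing fails, go back to CLOSED so
 * the socket never appears to listen while being unreachable. */
static int listen_start_sketch(struct ep_sketch *ep, hash_fn hash)
{
    int err;

    ep->state = ST_LISTENING;
    err = hash(ep);
    if (err) {
        ep->state = ST_CLOSED;
        return err;
    }
    return 0;
}
```

Only the state is undone in this sketch; other setup performed before the hashing step is left alone, in line with the note above that the bound port can be reused by a later listen() call.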
4 days | net/sched: accept TCA_STAB only for root qdisc | Eric Dumazet | 1 | -1/+6
[ Upstream commit 3cb7cf1540ddff5473d6baeb530228d19bc97b8a ] Most qdiscs maintain their backlog using qdisc_pkt_len(skb) on the assumption it is invariant between the enqueue() and dequeue() handlers. Unfortunately syzbot can crash a host rather easily using a TBF + SFQ combination, with an STAB on SFQ [1] We can't support TCA_STAB on arbitrary level, this would require to maintain per-qdisc storage. [1] [ 88.796496] BUG: kernel NULL pointer dereference, address: 0000000000000000 [ 88.798611] #PF: supervisor read access in kernel mode [ 88.799014] #PF: error_code(0x0000) - not-present page [ 88.799506] PGD 0 P4D 0 [ 88.799829] Oops: Oops: 0000 [#1] SMP NOPTI [ 88.800569] CPU: 14 UID: 0 PID: 2053 Comm: b371744477 Not tainted 6.12.0-rc1-virtme #1117 [ 88.801107] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014 [ 88.801779] RIP: 0010:sfq_dequeue (net/sched/sch_sfq.c:272 net/sched/sch_sfq.c:499) sch_sfq [ 88.802544] Code: 0f b7 50 12 48 8d 04 d5 00 00 00 00 48 89 d6 48 29 d0 48 8b 91 c0 01 00 00 48 c1 e0 03 48 01 c2 66 83 7a 1a 00 7e c0 48 8b 3a <4c> 8b 07 4c 89 02 49 89 50 08 48 c7 47 08 00 00 00 00 48 c7 07 00 All code ======== 0: 0f b7 50 12 movzwl 0x12(%rax),%edx 4: 48 8d 04 d5 00 00 00 lea 0x0(,%rdx,8),%rax b: 00 c: 48 89 d6 mov %rdx,%rsi f: 48 29 d0 sub %rdx,%rax 12: 48 8b 91 c0 01 00 00 mov 0x1c0(%rcx),%rdx 19: 48 c1 e0 03 shl $0x3,%rax 1d: 48 01 c2 add %rax,%rdx 20: 66 83 7a 1a 00 cmpw $0x0,0x1a(%rdx) 25: 7e c0 jle 0xffffffffffffffe7 27: 48 8b 3a mov (%rdx),%rdi 2a:* 4c 8b 07 mov (%rdi),%r8 <-- trapping instruction 2d: 4c 89 02 mov %r8,(%rdx) 30: 49 89 50 08 mov %rdx,0x8(%r8) 34: 48 c7 47 08 00 00 00 movq $0x0,0x8(%rdi) 3b: 00 3c: 48 rex.W 3d: c7 .byte 0xc7 3e: 07 (bad) ... 
Code starting with the faulting instruction =========================================== 0: 4c 8b 07 mov (%rdi),%r8 3: 4c 89 02 mov %r8,(%rdx) 6: 49 89 50 08 mov %rdx,0x8(%r8) a: 48 c7 47 08 00 00 00 movq $0x0,0x8(%rdi) 11: 00 12: 48 rex.W 13: c7 .byte 0xc7 14: 07 (bad) ... [ 88.803721] RSP: 0018:ffff9a1f892b7d58 EFLAGS: 00000206 [ 88.804032] RAX: 0000000000000000 RBX: ffff9a1f8420c800 RCX: ffff9a1f8420c800 [ 88.804560] RDX: ffff9a1f81bc1440 RSI: 0000000000000000 RDI: 0000000000000000 [ 88.805056] RBP: ffffffffc04bb0e0 R08: 0000000000000001 R09: 00000000ff7f9a1f [ 88.805473] R10: 000000000001001b R11: 0000000000009a1f R12: 0000000000000140 [ 88.806194] R13: 0000000000000001 R14: ffff9a1f886df400 R15: ffff9a1f886df4ac [ 88.806734] FS: 00007f445601a740(0000) GS:ffff9a2e7fd80000(0000) knlGS:0000000000000000 [ 88.807225] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 88.807672] CR2: 0000000000000000 CR3: 000000050cc46000 CR4: 00000000000006f0 [ 88.808165] Call Trace: [ 88.808459] <TASK> [ 88.808710] ? __die (arch/x86/kernel/dumpstack.c:421 arch/x86/kernel/dumpstack.c:434) [ 88.809261] ? page_fault_oops (arch/x86/mm/fault.c:715) [ 88.809561] ? exc_page_fault (./arch/x86/include/asm/irqflags.h:26 ./arch/x86/include/asm/irqflags.h:87 ./arch/x86/include/asm/irqflags.h:147 arch/x86/mm/fault.c:1489 arch/x86/mm/fault.c:1539) [ 88.809806] ? asm_exc_page_fault (./arch/x86/include/asm/idtentry.h:623) [ 88.810074] ? 
sfq_dequeue (net/sched/sch_sfq.c:272 net/sched/sch_sfq.c:499) sch_sfq [ 88.810411] sfq_reset (net/sched/sch_sfq.c:525) sch_sfq [ 88.810671] qdisc_reset (./include/linux/skbuff.h:2135 ./include/linux/skbuff.h:2441 ./include/linux/skbuff.h:3304 ./include/linux/skbuff.h:3310 net/sched/sch_generic.c:1036) [ 88.810950] tbf_reset (./include/linux/timekeeping.h:169 net/sched/sch_tbf.c:334) sch_tbf [ 88.811208] qdisc_reset (./include/linux/skbuff.h:2135 ./include/linux/skbuff.h:2441 ./include/linux/skbuff.h:3304 ./include/linux/skbuff.h:3310 net/sched/sch_generic.c:1036) [ 88.811484] netif_set_real_num_tx_queues (./include/linux/spinlock.h:396 ./include/net/sch_generic.h:768 net/core/dev.c:2958) [ 88.811870] __tun_detach (drivers/net/tun.c:590 drivers/net/tun.c:673) [ 88.812271] tun_chr_close (drivers/net/tun.c:702 drivers/net/tun.c:3517) [ 88.812505] __fput (fs/file_table.c:432 (discriminator 1)) [ 88.812735] task_work_run (kernel/task_work.c:230) [ 88.813016] do_exit (kernel/exit.c:940) [ 88.813372] ? trace_hardirqs_on (kernel/trace/trace_preemptirq.c:58 (discriminator 4)) [ 88.813639] ? handle_mm_fault (./arch/x86/include/asm/irqflags.h:42 ./arch/x86/include/asm/irqflags.h:97 ./arch/x86/include/asm/irqflags.h:155 ./include/linux/memcontrol.h:1022 ./include/linux/memcontrol.h:1045 ./include/linux/memcontrol.h:1052 mm/memory.c:5928 mm/memory.c:6088) [ 88.813867] do_group_exit (kernel/exit.c:1070) [ 88.814138] __x64_sys_exit_group (kernel/exit.c:1099) [ 88.814490] x64_sys_call (??:?) 
[ 88.814791] do_syscall_64 (arch/x86/entry/common.c:52 (discriminator 1) arch/x86/entry/common.c:83 (discriminator 1)) [ 88.815012] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130) [ 88.815495] RIP: 0033:0x7f44560f1975 Fixes: 175f9c1bba9b ("net_sched: Add size table for qdiscs") Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Daniel Borkmann <daniel@iogearbox.net> Link: https://patch.msgid.link/20241007184130.3960565-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
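The restriction itself is a validation step at qdisc creation time: a size table is accepted only when the qdisc is attached at the root. The sketch below is a userspace illustration with hypothetical names, a hypothetical root marker and an assumed error code; it is not the kernel's netlink parsing code.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical marker for "attach at the root", illustration only. */
#define PARENT_ROOT_SKETCH 0xFFFFFFFFu

/* Hypothetical, already-parsed creation request. */
struct qdisc_req_sketch {
    uint32_t parent; /* where the new qdisc is attached */
    int has_stab;    /* non-zero when a size table was supplied */
};

/* Reject a size table on anything but the root qdisc, since packet
 * lengths are then no longer invariant between enqueue and dequeue
 * for the qdiscs stacked above. */
static int qdisc_create_check(const struct qdisc_req_sketch *req)
{
    if (req->has_stab && req->parent != PARENT_ROOT_SKETCH)
        return -EINVAL;
    return 0;
}
```

Rejecting the request up front avoids the alternative mentioned in the message, per-qdisc storage of the adjusted packet length.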
4 days | Bluetooth: RFCOMM: FIX possible deadlock in rfcomm_sk_state_change | Luiz Augusto von Dentz | 1 | -2/+0
[ Upstream commit 08d1914293dae38350b8088980e59fbc699a72fe ] rfcomm_sk_state_change attempts to use sock_lock so it must never be called with it locked, but rfcomm_sock_ioctl always attempts to lock it, causing the following trace: ====================================================== WARNING: possible circular locking dependency detected 6.8.0-syzkaller-08951-gfe46a7dd189e #0 Not tainted ------------------------------------------------------ syz-executor386/5093 is trying to acquire lock: ffff88807c396258 (sk_lock-AF_BLUETOOTH-BTPROTO_RFCOMM){+.+.}-{0:0}, at: lock_sock include/net/sock.h:1671 [inline] ffff88807c396258 (sk_lock-AF_BLUETOOTH-BTPROTO_RFCOMM){+.+.}-{0:0}, at: rfcomm_sk_state_change+0x5b/0x310 net/bluetooth/rfcomm/sock.c:73 but task is already holding lock: ffff88807badfd28 (&d->lock){+.+.}-{3:3}, at: __rfcomm_dlc_close+0x226/0x6a0 net/bluetooth/rfcomm/core.c:491 Reported-by: syzbot+d7ce59b06b3eb14fd218@syzkaller.appspotmail.com Tested-by: syzbot+d7ce59b06b3eb14fd218@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=d7ce59b06b3eb14fd218 Fixes: 3241ad820dbb ("[Bluetooth] Add timestamp support to L2CAP, RFCOMM and SCO") Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
4 days | netfilter: br_netfilter: fix panic with metadata_dst skb | Andy Roulin | 1 | -0/+5
[ Upstream commit f9ff7665cd128012868098bbd07e28993e314fdb ] Fix a kernel panic in the br_netfilter module when sending untagged traffic via a VxLAN device. This happens during the check for fragmentation in br_nf_dev_queue_xmit. It is dependent on: 1) the br_netfilter module being loaded; 2) net.bridge.bridge-nf-call-iptables set to 1; 3) a bridge with a VxLAN (single-vxlan-device) netdevice as a bridge port; 4) untagged frames with size higher than the VxLAN MTU forwarded/flooded. When forwarding the untagged packet to the VxLAN bridge port, before the netfilter hooks are called, br_handle_egress_vlan_tunnel is called and changes the skb_dst to the tunnel dst. The tunnel_dst is a metadata type of dst, i.e., skb_valid_dst(skb) is false, and metadata->dst.dev is NULL. Then in the br_netfilter hooks, in br_nf_dev_queue_xmit, there's a check for frames that need to be fragmented: frames with higher MTU than the VxLAN device end up calling br_nf_ip_fragment, which in turn calls ip_skb_dst_mtu. The ip_dst_mtu tries to use the skb_dst(skb) as if it was a valid dst with valid dst->dev, thus the crash. This case was never supported in the first place, so drop the packet instead. PING 10.0.0.2 (10.0.0.2) from 0.0.0.0 h1-eth0: 2000(2028) bytes of data.
[ 176.291791] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000110 [ 176.292101] Mem abort info: [ 176.292184] ESR = 0x0000000096000004 [ 176.292322] EC = 0x25: DABT (current EL), IL = 32 bits [ 176.292530] SET = 0, FnV = 0 [ 176.292709] EA = 0, S1PTW = 0 [ 176.292862] FSC = 0x04: level 0 translation fault [ 176.293013] Data abort info: [ 176.293104] ISV = 0, ISS = 0x00000004, ISS2 = 0x00000000 [ 176.293488] CM = 0, WnR = 0, TnD = 0, TagAccess = 0 [ 176.293787] GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0 [ 176.293995] user pgtable: 4k pages, 48-bit VAs, pgdp=0000000043ef5000 [ 176.294166] [0000000000000110] pgd=0000000000000000, p4d=0000000000000000 [ 176.294827] Internal error: Oops: 0000000096000004 [#1] PREEMPT SMP [ 176.295252] Modules linked in: vxlan ip6_udp_tunnel udp_tunnel veth br_netfilter bridge stp llc ipv6 crct10dif_ce [ 176.295923] CPU: 0 PID: 188 Comm: ping Not tainted 6.8.0-rc3-g5b3fbd61b9d1 #2 [ 176.296314] Hardware name: linux,dummy-virt (DT) [ 176.296535] pstate: 80000005 (Nzcv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--) [ 176.296808] pc : br_nf_dev_queue_xmit+0x390/0x4ec [br_netfilter] [ 176.297382] lr : br_nf_dev_queue_xmit+0x2ac/0x4ec [br_netfilter] [ 176.297636] sp : ffff800080003630 [ 176.297743] x29: ffff800080003630 x28: 0000000000000008 x27: ffff6828c49ad9f8 [ 176.298093] x26: ffff6828c49ad000 x25: 0000000000000000 x24: 00000000000003e8 [ 176.298430] x23: 0000000000000000 x22: ffff6828c4960b40 x21: ffff6828c3b16d28 [ 176.298652] x20: ffff6828c3167048 x19: ffff6828c3b16d00 x18: 0000000000000014 [ 176.298926] x17: ffffb0476322f000 x16: ffffb7e164023730 x15: 0000000095744632 [ 176.299296] x14: ffff6828c3f1c880 x13: 0000000000000002 x12: ffffb7e137926a70 [ 176.299574] x11: 0000000000000001 x10: ffff6828c3f1c898 x9 : 0000000000000000 [ 176.300049] x8 : ffff6828c49bf070 x7 : 0008460f18d5f20e x6 : f20e0100bebafeca [ 176.300302] x5 : ffff6828c7f918fe x4 : ffff6828c49bf070 x3 : 0000000000000000 [ 176.300586] x2 : 
0000000000000000 x1 : ffff6828c3c7ad00 x0 : ffff6828c7f918f0 [ 176.300889] Call trace: [ 176.301123] br_nf_dev_queue_xmit+0x390/0x4ec [br_netfilter] [ 176.301411] br_nf_post_routing+0x2a8/0x3e4 [br_netfilter] [ 176.301703] nf_hook_slow+0x48/0x124 [ 176.302060] br_forward_finish+0xc8/0xe8 [bridge] [ 176.302371] br_nf_hook_thresh+0x124/0x134 [br_netfilter] [ 176.302605] br_nf_forward_finish+0x118/0x22c [br_netfilter] [ 176.302824] br_nf_forward_ip.part.0+0x264/0x290 [br_netfilter] [ 176.303136] br_nf_forward+0x2b8/0x4e0 [br_netfilter] [ 176.303359] nf_hook_slow+0x48/0x124 [ 176.303803] __br_forward+0xc4/0x194 [bridge] [ 176.304013] br_flood+0xd4/0x168 [bridge] [ 176.304300] br_handle_frame_finish+0x1d4/0x5c4 [bridge] [ 176.304536] br_nf_hook_thresh+0x124/0x134 [br_netfilter] [ 176.304978] br_nf_pre_routing_finish+0x29c/0x494 [br_netfilter] [ 176.305188] br_nf_pre_routing+0x250/0x524 [br_netfilter] [ 176.305428] br_handle_frame+0x244/0x3cc [bridge] [ 176.305695] __netif_receive_skb_core.constprop.0+0x33c/0xecc [ 176.306080] __netif_receive_skb_one_core+0x40/0x8c [ 176.306197] __netif_receive_skb+0x18/0x64 [ 176.306369] process_backlog+0x80/0x124 [ 176.306540] __napi_poll+0x38/0x17c [ 176.306636] net_rx_action+0x124/0x26c [ 176.306758] __do_softirq+0x100/0x26c [ 176.307051] ____do_softirq+0x10/0x1c [ 176.307162] call_on_irq_stack+0x24/0x4c [ 176.307289] do_softirq_own_stack+0x1c/0x2c [ 176.307396] do_softirq+0x54/0x6c [ 176.307485] __local_bh_enable_ip+0x8c/0x98 [ 176.307637] __dev_queue_xmit+0x22c/0xd28 [ 176.307775] neigh_resolve_output+0xf4/0x1a0 [ 176.308018] ip_finish_output2+0x1c8/0x628 [ 176.308137] ip_do_fragment+0x5b4/0x658 [ 176.308279] ip_fragment.constprop.0+0x48/0xec [ 176.308420] __ip_finish_output+0xa4/0x254 [ 176.308593] ip_finish_output+0x34/0x130 [ 176.308814] ip_output+0x6c/0x108 [ 176.308929] ip_send_skb+0x50/0xf0 [ 176.309095] ip_push_pending_frames+0x30/0x54 [ 176.309254] raw_sendmsg+0x758/0xaec [ 176.309568] inet_sendmsg+0x44/0x70 [ 176.309667] 
__sys_sendto+0x110/0x178
[ 176.309758] __arm64_sys_sendto+0x28/0x38
[ 176.309918] invoke_syscall+0x48/0x110
[ 176.310211] el0_svc_common.constprop.0+0x40/0xe0
[ 176.310353] do_el0_svc+0x1c/0x28
[ 176.310434] el0_svc+0x34/0xb4
[ 176.310551] el0t_64_sync_handler+0x120/0x12c
[ 176.310690] el0t_64_sync+0x190/0x194
[ 176.311066] Code: f9402e61 79402aa2 927ff821 f9400023 (f9408860)
[ 176.315743] ---[ end trace 0000000000000000 ]---
[ 176.316060] Kernel panic - not syncing: Oops: Fatal exception in interrupt
[ 176.316371] Kernel Offset: 0x37e0e3000000 from 0xffff800080000000
[ 176.316564] PHYS_OFFSET: 0xffff97d780000000
[ 176.316782] CPU features: 0x0,88000203,3c020000,0100421b
[ 176.317210] Memory Limit: none
[ 176.317527] ---[ end Kernel panic - not syncing: Oops: Fatal Exception in interrupt ]---

Fixes: 11538d039ac6 ("bridge: vlan dst_metadata hooks in ingress and egress paths")
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Andy Roulin <aroulin@nvidia.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/20241001154400.22787-2-aroulin@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
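The "drop instead of fragment" decision described in this commit can be sketched in userspace C. This is only a model under stated assumptions: the `model_*` types and `model_queue_xmit()` are hypothetical stand-ins, not the kernel's structures or the patch itself. It shows the validity check running before anything would read `dst->dev`.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for kernel types; not the real definitions. */
struct model_dev {
	unsigned int mtu;
};

struct model_dst {
	bool is_metadata;       /* true for a tunnel metadata dst */
	struct model_dev *dev;  /* NULL for a metadata dst */
};

struct model_skb {
	struct model_dst *dst;
	unsigned int len;
};

enum xmit_verdict { XMIT_OK, XMIT_FRAGMENT, XMIT_DROP };

/* Models the idea behind skb_valid_dst(): a dst exists and is not metadata. */
static bool model_valid_dst(const struct model_skb *skb)
{
	return skb->dst != NULL && !skb->dst->is_metadata;
}

/* Decide what to do with a frame leaving a bridge port with the given MTU. */
static enum xmit_verdict model_queue_xmit(const struct model_skb *skb,
					  unsigned int port_mtu)
{
	if (skb->len <= port_mtu)
		return XMIT_OK;

	/* Fragmentation is needed: refuse before anything reads dst->dev. */
	if (!model_valid_dst(skb))
		return XMIT_DROP;

	return XMIT_FRAGMENT;
}
```

Small frames are unaffected by the check in this model; only oversized frames that carry a metadata dst are dropped.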
4 days | rxrpc: Fix uninitialised variable in rxrpc_send_data() | David Howells | 1 | -5/+5
[ Upstream commit 7a310f8d7dfe2d92a1f31ddb5357bfdd97eed273 ] Fix the uninitialised txb variable in rxrpc_send_data() by moving the code that loads it above all the jumps to maybe_error, txb being stored back into call->tx_pending right before the normal return. Fixes: b0f571ecd794 ("rxrpc: Fix locking in rxrpc's sendmsg") Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Closes: https://lists.infradead.org/pipermail/linux-afs/2024-October/008896.html Signed-off-by: David Howells <dhowells@redhat.com> cc: Marc Dionne <marc.dionne@auristor.com> cc: linux-afs@lists.infradead.org Link: https://patch.msgid.link/20241001132702.3122709-3-dhowells@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
4 days | tcp: fix TFO SYN_RECV to not zero retrans_stamp with retransmits out | Neal Cardwell | 1 | -2/+9
[ Upstream commit 27c80efcc20486c82698f05f00e288b44513c86b ] Fix tcp_rcv_synrecv_state_fastopen() to not zero retrans_stamp if retransmits are outstanding. tcp_fastopen_synack_timer() sets retrans_stamp, so typically we'll need to zero retrans_stamp here to prevent spurious retransmits_timed_out(). The logic to zero retrans_stamp is from this 2019 commit: commit cd736d8b67fb ("tcp: fix retrans timestamp on passive Fast Open") However, in the corner case where the ACK of our TFO SYNACK carried some SACK blocks that caused us to enter TCP_CA_Recovery then that non-zero retrans_stamp corresponds to the active fast recovery, and we need to leave retrans_stamp with its current non-zero value, for correct ETIMEDOUT and undo behavior. Fixes: cd736d8b67fb ("tcp: fix retrans timestamp on passive Fast Open") Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20241001200517.2756803-4-ncardwell.sw@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
4 days | tcp: new TCP_INFO stats for RTO events | Aananth V | 4 | -2/+43
[ Upstream commit 3868ab0f192581eff978501a05f3dc2e01541d77 ] The 2023 SIGCOMM paper "Improving Network Availability with Protective ReRoute" has indicated Linux TCP's RTO-triggered txhash rehashing can effectively reduce application disruption during outages. To better measure the efficacy of this feature, this patch adds three more detailed stats during RTO recovery and exports via TCP_INFO. Applications and monitoring systems can leverage this data to measure the network path diversity and end-to-end repair latency during network outages to improve their network infrastructure. The following counters are added to tcp_sock in order to track RTO events over the lifetime of a TCP socket. 1. u16 total_rto - Counts the total number of RTO timeouts. 2. u16 total_rto_recoveries - Counts the total number of RTO recoveries. 3. u32 total_rto_time - Counts the total time spent (ms) in RTO recoveries. (time spent in CA_Loss and CA_Recovery states) To compute total_rto_time, we add a new u32 rto_stamp field to tcp_sock. rto_stamp records the start timestamp (ms) of the last RTO recovery (CA_Loss). Corresponding fields are also added to the tcp_info struct. Signed-off-by: Aananth V <aananthv@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Stable-dep-of: 27c80efcc204 ("tcp: fix TFO SYN_RECV to not zero retrans_stamp with retransmits out") Signed-off-by: Sasha Levin <sashal@kernel.org>
4 days | tcp: fix tcp_enter_recovery() to zero retrans_stamp when it's safe | Neal Cardwell | 1 | -0/+13
[ Upstream commit b41b4cbd9655bcebcce941bef3601db8110335be ]

Fix tcp_enter_recovery() so that if there are no retransmits out then we zero retrans_stamp when entering fast recovery. This is necessary to fix two buggy behaviors.

Currently a non-zero retrans_stamp value can persist across multiple back-to-back loss recovery episodes. This is because we generally only clear retrans_stamp if we are completely done with loss recoveries, and get to tcp_try_to_open() and find !tcp_any_retrans_done(sk). This behavior causes two bugs:

(1) When a loss recovery episode (CA_Loss or CA_Recovery) is followed immediately by a new CA_Recovery, the retrans_stamp value can persist and can be a time before this new CA_Recovery episode starts. That means that timestamp-based undo will be using the wrong retrans_stamp (a value that is too old) when comparing incoming TS ecr values to retrans_stamp to see if the current fast recovery episode can be undone.

(2) If there is a roughly minutes-long sequence of back-to-back fast recovery episodes, one after another (e.g. in a shallow-buffered or policed bottleneck), where each fast recovery successfully makes forward progress and recovers one window of sequence space (but leaves at least one retransmit in flight at the end of the recovery), followed by several RTOs, then the ETIMEDOUT check may be using the wrong retrans_stamp (a value set at the start of the first fast recovery in the sequence). This can cause a very premature ETIMEDOUT, killing the connection prematurely.

This commit changes the code to zero retrans_stamp when entering fast recovery, when this is known to be safe (no retransmits are out in the network). That ensures that when starting a fast recovery episode, and it is safe to do so, retrans_stamp is set when we send the fast retransmit packet.
That addresses both bug (1) and bug (2) by ensuring that (if no retransmits are out when we start a fast recovery) we use the initial fast retransmit of this fast recovery as the time value for undo and ETIMEDOUT calculations. This makes intuitive sense, since the start of a new fast recovery episode (in a scenario where no lost packets are out in the network) means that the connection has made forward progress since the last RTO or fast recovery, and we should thus "restart the clock" used for both undo and ETIMEDOUT logic. Note that if when we start fast recovery there *are* retransmits out in the network, there can still be undesirable (1)/(2) issues. For example, after this patch we can still have the (1) and (2) problems in cases like this: + round 1: sender sends flight 1 + round 2: sender receives SACKs and enters fast recovery 1, retransmits some packets in flight 1 and then sends some new data as flight 2 + round 3: sender receives some SACKs for flight 2, notes losses, and retransmits some packets to fill the holes in flight 2 + fast recovery has some lost retransmits in flight 1 and continues for one or more rounds sending retransmits for flight 1 and flight 2 + fast recovery 1 completes when snd_una reaches high_seq at end of flight 1 + there are still holes in the SACK scoreboard in flight 2, so we enter fast recovery 2, but some retransmits in the flight 2 sequence range are still in flight (retrans_out > 0), so we can't execute the new retrans_stamp=0 added here to clear retrans_stamp It's not yet clear how to fix these remaining (1)/(2) issues in an efficient way without breaking undo behavior, given that retrans_stamp is currently used for undo and ETIMEDOUT. Perhaps the optimal (but expensive) strategy would be to set retrans_stamp to the timestamp of the earliest outstanding retransmit when entering fast recovery. But at least this commit makes things better. 
Note that this does not change the semantics of retrans_stamp; it simply makes retrans_stamp accurate in some cases where it was not before: (1) Some loss recovery, followed by an immediate entry into a fast recovery, where there are no retransmits out when entering the fast recovery. (2) When a TFO server has a SYNACK retransmit that sets retrans_stamp, and then the ACK that completes the 3-way handshake has SACK blocks that trigger a fast recovery. In this case when entering fast recovery we want to zero out the retrans_stamp from the TFO SYNACK retransmit, and set the retrans_stamp based on the timestamp of the fast recovery. We introduce a tcp_retrans_stamp_cleanup() helper, because this two-line sequence already appears in 3 places and is about to appear in 2 more as a result of this bug fix patch series. Once this bug fix patches series in the net branch makes it into the net-next branch we'll update the 3 other call sites to use the new helper. This is a long-standing issue. The Fixes tag below is chosen to be the oldest commit at which the patch will apply cleanly, which is from Linux v3.5 in 2012. Fixes: 1fbc340514fc ("tcp: early retransmit: tcp_enter_recovery()") Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20241001200517.2756803-3-ncardwell.sw@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
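The "restart the clock only when it is safe" rule from this commit can be modelled in a few lines of C. This is a simplified sketch, not the kernel code: `struct model_tcp` and the `model_*` functions are hypothetical, and the safety condition is reduced to `retrans_out == 0`, whereas the commit message refers to tcp_any_retrans_done().

```c
#include <stdint.h>

/* Hypothetical, reduced view of the per-connection state involved. */
struct model_tcp {
	uint32_t retrans_stamp;  /* time of first retransmit; 0 means unset */
	uint32_t retrans_out;    /* retransmitted segments still in flight */
};

/* Clear the stamp only if no retransmits are out in the network. */
static void model_retrans_stamp_cleanup(struct model_tcp *tp)
{
	if (tp->retrans_out == 0)
		tp->retrans_stamp = 0;
}

/* Entering fast recovery: reset the stamp when that is safe. */
static void model_enter_recovery(struct model_tcp *tp)
{
	model_retrans_stamp_cleanup(tp);
}

/* Sending a retransmit: the first one of an episode sets the stamp. */
static void model_fast_retransmit(struct model_tcp *tp, uint32_t now)
{
	if (tp->retrans_stamp == 0)
		tp->retrans_stamp = now;
	tp->retrans_out++;
}
```

In this model a stale stamp left by an earlier episode is replaced by the time of the new fast retransmit, while a stamp that still corresponds to in-flight retransmits is kept, which is the remaining (1)/(2) case the commit message describes.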
4 days | tcp: fix to allow timestamp undo if no retransmits were sent | Neal Cardwell | 1 | -2/+16
[ Upstream commit e37ab7373696e650d3b6262a5b882aadad69bb9e ] Fix the TCP loss recovery undo logic in tcp_packet_delayed() so that it can trigger undo even if TSQ prevents a fast recovery episode from reaching tcp_retransmit_skb(). Geumhwan Yu <geumhwan.yu@samsung.com> recently reported that after this commit from 2019: commit bc9f38c8328e ("tcp: avoid unconditional congestion window undo on SYN retransmit") ...and before this fix we could have buggy scenarios like the following: + Due to reordering, a TCP connection receives some SACKs and enters a spurious fast recovery. + TSQ prevents all invocations of tcp_retransmit_skb(), because many skbs are queued in lower layers of the sending machine's network stack; thus tp->retrans_stamp remains 0. + The connection receives a TCP timestamp ECR value echoing a timestamp before the fast recovery, indicating that the fast recovery was spurious. + The connection fails to undo the spurious fast recovery because tp->retrans_stamp is 0, and thus tcp_packet_delayed() returns false, due to the new logic in the 2019 commit: commit bc9f38c8328e ("tcp: avoid unconditional congestion window undo on SYN retransmit") This fix tweaks the logic to be more similar to the tcp_packet_delayed() logic before bc9f38c8328e, except that we take care not to be fooled by the FLAG_SYN_ACKED code path zeroing out tp->retrans_stamp (the bug noted and fixed by Yuchung in bc9f38c8328e). Note that this returns the high-level behavior of tcp_packet_delayed() to again match the comment for the function, which says: "Nothing was retransmitted or returned timestamp is less than timestamp of the first retransmission." Note that this comment is in the original 2005-04-16 Linux git commit, so this is evidently long-standing behavior. 
Fixes: bc9f38c8328e ("tcp: avoid unconditional congestion window undo on SYN retransmit") Reported-by: Geumhwan Yu <geumhwan.yu@samsung.com> Diagnosed-by: Geumhwan Yu <geumhwan.yu@samsung.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20241001200517.2756803-2-ncardwell.sw@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
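The undo predicate discussed above can be sketched as a small userspace model. The names here (`struct model_tcp_undo`, `model_packet_delayed()`, `ts_before()`) are hypothetical, and representing the FLAG_SYN_ACKED exception as a check on a SYN_SENT state is an assumption of this sketch rather than something the commit message spells out.

```c
#include <stdbool.h>
#include <stdint.h>

enum model_state { MODEL_SYN_SENT, MODEL_ESTABLISHED };

/* Hypothetical, reduced view of the state the undo check looks at. */
struct model_tcp_undo {
	enum model_state state;
	uint32_t retrans_stamp;  /* 0 means no retransmit was recorded */
	bool saw_tsecr;          /* a timestamp echo reply was received */
	uint32_t rcv_tsecr;      /* the echoed timestamp value */
};

/* Wraparound-safe "a is earlier than b" for 32-bit timestamps. */
static bool ts_before(uint32_t a, uint32_t b)
{
	return (int32_t)(a - b) < 0;
}

/* True if the recovery episode looks spurious and may be undone. */
static bool model_packet_delayed(const struct model_tcp_undo *tp)
{
	/* Echoed timestamp predates the first retransmission. */
	if (tp->retrans_stamp && tp->saw_tsecr &&
	    ts_before(tp->rcv_tsecr, tp->retrans_stamp))
		return true;

	/*
	 * Nothing was retransmitted (for example, TSQ held everything
	 * back). A zero stamp during the SYN exchange is ignored, since
	 * it may have been cleared after a retransmitted SYN was ACKed.
	 */
	if (!tp->retrans_stamp && tp->state != MODEL_SYN_SENT)
		return true;

	return false;
}
```

The second branch is the one the reported TSQ scenario depends on: with `retrans_stamp` still 0 on an established connection, the model allows the undo.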
4 days | netfilter: nf_reject: Fix build warning when CONFIG_BRIDGE_NETFILTER=n | Simon Horman | 2 | -9/+6
[ Upstream commit fc56878ca1c288e49b5cbb43860a5938e3463654 ]

If CONFIG_BRIDGE_NETFILTER is not enabled, which is the case for x86_64 defconfig, then building nf_reject_ipv4.c and nf_reject_ipv6.c with W=1 using gcc-14 results in the following warnings, which are treated as errors:

net/ipv4/netfilter/nf_reject_ipv4.c: In function 'nf_send_reset':
net/ipv4/netfilter/nf_reject_ipv4.c:243:23: error: variable 'niph' set but not used [-Werror=unused-but-set-variable]
  243 |         struct iphdr *niph;
      |                       ^~~~
cc1: all warnings being treated as errors
net/ipv6/netfilter/nf_reject_ipv6.c: In function 'nf_send_reset6':
net/ipv6/netfilter/nf_reject_ipv6.c:286:25: error: variable 'ip6h' set but not used [-Werror=unused-but-set-variable]
  286 |         struct ipv6hdr *ip6h;
      |                         ^~~~
cc1: all warnings being treated as errors

Address this by reducing the scope of these local variables to where they are used, which is code only compiled when CONFIG_BRIDGE_NETFILTER is enabled.

Compile tested and run through netfilter selftests.

Reported-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Closes: https://lore.kernel.org/netfilter-devel/20240906145513.567781-1-andriy.shevchenko@linux.intel.com/
Signed-off-by: Simon Horman <horms@kernel.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
4 days | netfilter: nf_nat: don't try nat source port reallocation for reverse dir clash | Florian Westphal | 1 | -2/+118
[ Upstream commit d8f84a9bc7c4e07fdc4edc00f9e868b8db974ccb ]

A conntrack entry can be inserted into the connection tracking table if there is no existing entry with an identical tuple in either direction.

Example:
INITIATOR -> NAT/PAT -> RESPONDER

Initiator passes through NAT/PAT ("us") and SNAT is done (saddr rewrite). Then, later, NAT/PAT machine itself also wants to connect to RESPONDER.

This will not work if the SNAT done earlier has same IP:PORT source pair.

Conntrack table has:
ORIGINAL: $IP_INITIATOR:$SPORT -> $IP_RESPONDER:$DPORT
REPLY:    $IP_RESPONDER:$DPORT -> $IP_NAT:$SPORT

and new locally originating connection wants:
ORIGINAL: $IP_NAT:$SPORT -> $IP_RESPONDER:$DPORT
REPLY:    $IP_RESPONDER:$DPORT -> $IP_NAT:$SPORT

This is handled by the NAT engine which will do a source port reallocation for the locally originating connection that is colliding with an existing tuple by attempting a source port rewrite.

This is done even if this new connection attempt did not go through a masquerade/snat rule.

There is a rare race condition with connection-less protocols like UDP, where we do the port reallocation even though it's not needed.

This happens when new packets from the same, pre-existing flow are received in both directions at the exact same time on different CPUs after the conntrack table was flushed (or conntrack becomes active for first time).

With strict ordering/single cpu, the first packet creates new ct entry and second packet is resolved as established reply packet.

With parallel processing, both packets are picked up as new and both get their own ct entry.

In this case, the 'reply' packet (picked up as ORIGINAL) can be mangled by NAT engine because a port collision is detected.

This change isn't enough to prevent a packet drop later during nf_conntrack_confirm(): the existing clash resolution strategy will not detect such reverse clash case. This is resolved by a followup patch.
Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
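The reverse-direction clash described above can be illustrated with a minimal tuple model. Everything here is a hypothetical sketch (`struct model_tuple`, `ct_new()`, `is_reverse_clash()`), assuming flows with no NAT applied, so that the reply tuple is simply the inverted original; it does not reproduce the detection logic of the actual patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flow tuple: addresses and ports in one direction. */
struct model_tuple {
	uint32_t saddr, daddr;
	uint16_t sport, dport;
};

/* Hypothetical conntrack entry with both directions. */
struct model_ct {
	struct model_tuple original;
	struct model_tuple reply;
};

static bool tuple_equal(const struct model_tuple *a,
			const struct model_tuple *b)
{
	return a->saddr == b->saddr && a->daddr == b->daddr &&
	       a->sport == b->sport && a->dport == b->dport;
}

/* Swap source and destination to get the opposite direction. */
static struct model_tuple tuple_invert(const struct model_tuple *t)
{
	struct model_tuple inv = {
		.saddr = t->daddr, .daddr = t->saddr,
		.sport = t->dport, .dport = t->sport,
	};
	return inv;
}

/* New entry for an un-NATed flow: reply is the inverted original. */
static struct model_ct ct_new(struct model_tuple original)
{
	struct model_ct ct = {
		.original = original,
		.reply = tuple_invert(&original),
	};
	return ct;
}

/*
 * Both packets of one flow were treated as "new": the candidate's
 * original direction is the existing entry's reply direction.
 */
static bool is_reverse_clash(const struct model_ct *existing,
			     const struct model_ct *candidate)
{
	return tuple_equal(&candidate->original, &existing->reply) &&
	       tuple_equal(&candidate->reply, &existing->original);
}
```

In such a case the two entries describe the same flow seen from opposite ends, so rewriting the source port of the second one would only break it, which is the situation the commit avoids.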
4 days | Bluetooth: Fix usage of __hci_cmd_sync_status | Luiz Augusto von Dentz | 1 | -15/+12
[ Upstream commit 87be7b189b2c50d4b51512f59e4e97db4eedee8a ] __hci_cmd_sync_status shall only be used if hci_req_sync_lock is _not_ required which is not the case of hci_dev_cmd so it needs to use hci_cmd_sync_status which uses hci_req_sync_lock internally. Fixes: f1a8f402f13f ("Bluetooth: L2CAP: Fix deadlock") Reported-by: Pauli Virtanen <pav@iki.fi> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
4 days | libceph: init the cursor when preparing sparse read in msgr2 | Xiubo Li | 1 | -0/+3
[ Upstream commit 321e3c3de53c7530cd518219d01f04e7e32a9d23 ] The cursor is no longer initialized in the OSD client, causing the sparse read state machine to fall into an infinite loop. The cursor should be initialized in IN_S_PREPARE_SPARSE_DATA state. [ idryomov: use msg instead of con->in_msg, changelog ] Link: https://tracker.ceph.com/issues/64607 Fixes: 8e46a2d068c9 ("libceph: just wait for more data to be available on the socket") Signed-off-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Ilya Dryomov <idryomov@gmail.com> Tested-by: Luis Henriques <lhenriques@suse.de> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
11 days | rxrpc: Fix a race between socket set up and I/O thread creation | David Howells | 3 | -4/+10
commit bc212465326e8587325f520a052346f0b57360e6 upstream. In rxrpc_open_socket(), it sets up the socket and then sets up the I/O thread that will handle it. This is a problem, however, as there's a gap between the two phases in which a packet may come into rxrpc_encap_rcv() from the UDP packet but we oops when trying to wake the not-yet created I/O thread. As a quick fix, just make rxrpc_encap_rcv() discard the packet if there's no I/O thread yet. A better, but more intrusive fix would perhaps be to rearrange things such that the socket creation is done by the I/O thread. Fixes: a275da62e8c1 ("rxrpc: Create a per-local endpoint receive queue and I/O thread") Signed-off-by: David Howells <dhowells@redhat.com> cc: yuxuanzhe@outlook.com cc: Marc Dionne <marc.dionne@auristor.com> cc: Simon Horman <horms@kernel.org> cc: linux-afs@lists.infradead.org Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20241001132702.3122709-2-dhowells@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
11 days | net: dsa: fix netdev_priv() dereference before check on non-DSA netdevice events | Vladimir Oltean | 1 | -2/+5
[ Upstream commit 844f104790bd69c2e4dbb9ee3eba46fde1fcea7b ] After the blamed commit, we started doing this dereference for every NETDEV_CHANGEUPPER and NETDEV_PRECHANGEUPPER event in the system. static inline struct dsa_port *dsa_user_to_port(const struct net_device *dev) { struct dsa_user_priv *p = netdev_priv(dev); return p->dp; } Which is obviously bogus, because not all net_devices have a netdev_priv() of type struct dsa_user_priv. But struct dsa_user_priv is fairly small, and p->dp means dereferencing 8 bytes starting with offset 16. Most drivers allocate that much private memory anyway, making our access not fault, and we discard the bogus data quickly afterwards, so this wasn't caught. But the dummy interface is somewhat special in that it calls alloc_netdev() with a priv size of 0. So every netdev_priv() dereference is invalid, and we get this when we emit a NETDEV_PRECHANGEUPPER event with a VLAN as its new upper: $ ip link add dummy1 type dummy $ ip link add link dummy1 name dummy1.100 type vlan id 100 [ 43.309174] ================================================================== [ 43.316456] BUG: KASAN: slab-out-of-bounds in dsa_user_prechangeupper+0x30/0xe8 [ 43.323835] Read of size 8 at addr ffff3f86481d2990 by task ip/374 [ 43.330058] [ 43.342436] Call trace: [ 43.366542] dsa_user_prechangeupper+0x30/0xe8 [ 43.371024] dsa_user_netdevice_event+0xb38/0xee8 [ 43.375768] notifier_call_chain+0xa4/0x210 [ 43.379985] raw_notifier_call_chain+0x24/0x38 [ 43.384464] __netdev_upper_dev_link+0x3ec/0x5d8 [ 43.389120] netdev_upper_dev_link+0x70/0xa8 [ 43.393424] register_vlan_dev+0x1bc/0x310 [ 43.397554] vlan_newlink+0x210/0x248 [ 43.401247] rtnl_newlink+0x9fc/0xe30 [ 43.404942] rtnetlink_rcv_msg+0x378/0x580 Avoid the kernel oops by dereferencing after the type check, as customary. 
Fixes: 4c3f80d22b2e ("net: dsa: walk through all changeupper notifier functions") Reported-and-tested-by: syzbot+d81bcd883824180500c8@syzkaller.appspotmail.com Closes: https://lore.kernel.org/netdev/0000000000001d4255060e87545c@google.com/ Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20240110003354.2796778-1-vladimir.oltean@nxp.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> (cherry picked from commit 844f104790bd69c2e4dbb9ee3eba46fde1fcea7b) [Harshit: CVE-2024-26596; Resolve conflicts due to missing commit: 6ca80638b90c ("net: dsa: Use conduit and user terms") in 6.6.y, used dsa_slave_to_port() instead of dsa_user_to_port()] Signed-off-by: Harshit Mogalapalli <harshit.m.mogalapalli@oracle.com> Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
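The "dereference after the type check" ordering can be shown with a small userspace model. The types and functions below (`struct model_netdev`, `model_port_for_event()`, and so on) are hypothetical stand-ins; in particular, the `is_dsa_user` flag merely represents whatever type check the driver performs.

```c
#include <stdbool.h>
#include <stddef.h>

struct model_port {
	int index;
};

/* Private data layout that only DSA user devices actually have. */
struct model_dsa_priv {
	struct model_port *dp;
};

/* Hypothetical net device: priv is NULL when the priv size is 0. */
struct model_netdev {
	bool is_dsa_user;  /* stand-in for the driver's type check */
	void *priv;
};

/* Only valid for DSA user devices: interprets priv as DSA data. */
static struct model_port *model_user_to_port(const struct model_netdev *dev)
{
	const struct model_dsa_priv *p = dev->priv;

	return p->dp;
}

/* Event handler helper: check the device type first, then dereference. */
static struct model_port *model_port_for_event(const struct model_netdev *dev)
{
	if (!dev->is_dsa_user)
		return NULL;  /* not ours: never touch priv */

	return model_user_to_port(dev);
}
```

With the check first, a device like the dummy interface, which has no private area at all, is rejected before its private data would be interpreted as the wrong type.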
11 days | netfilter: nf_tables: restore set elements when delete set fails | Pablo Neira Ayuso | 5 | -20/+42
From abort path, nft_mapelem_activate() needs to restore refcounters to the original state. Currently, it uses the set->ops->walk() to iterate over these set elements. The existing set iterator skips inactive elements in the next generation; this does not work from the abort path to restore the original state since it has to skip active elements instead (not inactive ones).

This patch moves the check for inactive elements to the set iterator callback, then it reverses the logic for the .activate case which needs to skip active elements.

Toggle next generation bit for elements when delete set command is invoked and call nft_clear() from .activate (abort) path to restore the next generation bit.

The splat below shows an object in mappings memleak:

[43929.457523] ------------[ cut here ]------------
[43929.457532] WARNING: CPU: 0 PID: 1139 at include/net/netfilter/nf_tables.h:1237 nft_setelem_data_deactivate+0xe4/0xf0 [nf_tables]
[...]
[43929.458014] RIP: 0010:nft_setelem_data_deactivate+0xe4/0xf0 [nf_tables]
[43929.458076] Code: 83 f8 01 77 ab 49 8d 7c 24 08 e8 37 5e d0 de 49 8b 6c 24 08 48 8d 7d 50 e8 e9 5c d0 de 8b 45 50 8d 50 ff 89 55 50 85 c0 75 86 <0f> 0b eb 82 0f 0b eb b3 0f 1f 40 00 90 90 90 90 90 90 90 90 90 90
[43929.458081] RSP: 0018:ffff888140f9f4b0 EFLAGS: 00010246
[43929.458086] RAX: 0000000000000000 RBX: ffff8881434f5288 RCX: dffffc0000000000
[43929.458090] RDX: 00000000ffffffff RSI: ffffffffa26d28a7 RDI: ffff88810ecc9550
[43929.458093] RBP: ffff88810ecc9500 R08: 0000000000000001 R09: ffffed10281f3e8f
[43929.458096] R10: 0000000000000003 R11: ffff0000ffff0000 R12: ffff8881434f52a0
[43929.458100] R13: ffff888140f9f5f4 R14: ffff888151c7a800 R15: 0000000000000002
[43929.458103] FS: 00007f0c687c4740(0000) GS:ffff888390800000(0000) knlGS:0000000000000000
[43929.458107] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[43929.458111] CR2: 00007f58dbe5b008 CR3: 0000000123602005 CR4: 00000000001706f0
[43929.458114] Call Trace:
[43929.458118] <TASK>
[43929.458121] ? __warn+0x9f/0x1a0 [43929.458127] ? nft_setelem_data_deactivate+0xe4/0xf0 [nf_tables] [43929.458188] ? report_bug+0x1b1/0x1e0 [43929.458196] ? handle_bug+0x3c/0x70 [43929.458200] ? exc_invalid_op+0x17/0x40 [43929.458211] ? nft_setelem_data_deactivate+0xd7/0xf0 [nf_tables] [43929.458271] ? nft_setelem_data_deactivate+0xe4/0xf0 [nf_tables] [43929.458332] nft_mapelem_deactivate+0x24/0x30 [nf_tables] [43929.458392] nft_rhash_walk+0xdd/0x180 [nf_tables] [43929.458453] ? __pfx_nft_rhash_walk+0x10/0x10 [nf_tables] [43929.458512] ? rb_insert_color+0x2e/0x280 [43929.458520] nft_map_deactivate+0xdc/0x1e0 [nf_tables] [43929.458582] ? __pfx_nft_map_deactivate+0x10/0x10 [nf_tables] [43929.458642] ? __pfx_nft_mapelem_deactivate+0x10/0x10 [nf_tables] [43929.458701] ? __rcu_read_unlock+0x46/0x70 [43929.458709] nft_delset+0xff/0x110 [nf_tables] [43929.458769] nft_flush_table+0x16f/0x460 [nf_tables] [43929.458830] nf_tables_deltable+0x501/0x580 [nf_tables] Fixes: 628bd3e49cba ("netfilter: nf_tables: drop map element references from preparation phase") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> (cherry picked from commit e79b47a8615d42c68aaeb68971593333667382ed) [Vegard: CVE-2024-27012; fixed conflicts due to missing commits 0e1ea651c9717ddcd8e0648d8468477a31867b0a ("netfilter: nf_tables: shrink memory consumption of set elements") and 9dad402b89e81a0516bad5e0ac009b7a0a80898f ("netfilter: nf_tables: expose opaque set element as struct nft_elem_priv") so we pass the correct types and values to nft_setelem_data_deactivate(), nft_setelem_validate(), nft_set_elem_ext(), etc.] Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
11 days | netfilter: nf_tables: fix memleak in map from abort path | Pablo Neira Ayuso | 1 | -2/+14
[ Upstream commit 86a1471d7cde792941109b93b558b5dc078b9ee9 ] The delete set command does not rely on the transaction object for element removal, therefore, a combination of delete element + delete set from the abort path could result in restoring twice the refcount of the mapping. Check for inactive element in the next generation for the delete element command in the abort path, skip restoring state if next generation bit has been already cleared. This is similar to the activate logic using the set walk iterator. [ 6170.286929] ------------[ cut here ]------------ [ 6170.286939] WARNING: CPU: 6 PID: 790302 at net/netfilter/nf_tables_api.c:2086 nf_tables_chain_destroy+0x1f7/0x220 [nf_tables] [ 6170.287071] Modules linked in: [...] [ 6170.287633] CPU: 6 PID: 790302 Comm: kworker/6:2 Not tainted 6.9.0-rc3+ #365 [ 6170.287768] RIP: 0010:nf_tables_chain_destroy+0x1f7/0x220 [nf_tables] [ 6170.287886] Code: df 48 8d 7d 58 e8 69 2e 3b df 48 8b 7d 58 e8 80 1b 37 df 48 8d 7d 68 e8 57 2e 3b df 48 8b 7d 68 e8 6e 1b 37 df 48 89 ef eb c4 <0f> 0b 48 83 c4 08 5b 5d 41 5c 41 5d 41 5e 41 5f c3 cc cc cc cc 0f [ 6170.287895] RSP: 0018:ffff888134b8fd08 EFLAGS: 00010202 [ 6170.287904] RAX: 0000000000000001 RBX: ffff888125bffb28 RCX: dffffc0000000000 [ 6170.287912] RDX: 0000000000000003 RSI: ffffffffa20298ab RDI: ffff88811ebe4750 [ 6170.287919] RBP: ffff88811ebe4700 R08: ffff88838e812650 R09: fffffbfff0623a55 [ 6170.287926] R10: ffffffff8311d2af R11: 0000000000000001 R12: ffff888125bffb10 [ 6170.287933] R13: ffff888125bffb10 R14: dead000000000122 R15: dead000000000100 [ 6170.287940] FS: 0000000000000000(0000) GS:ffff888390b00000(0000) knlGS:0000000000000000 [ 6170.287948] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 6170.287955] CR2: 00007fd31fc00710 CR3: 0000000133f60004 CR4: 00000000001706f0 [ 6170.287962] Call Trace: [ 6170.287967] <TASK> [ 6170.287973] ? __warn+0x9f/0x1a0 [ 6170.287986] ? nf_tables_chain_destroy+0x1f7/0x220 [nf_tables] [ 6170.288092] ? 
report_bug+0x1b1/0x1e0 [ 6170.287986] ? nf_tables_chain_destroy+0x1f7/0x220 [nf_tables] [ 6170.288092] ? report_bug+0x1b1/0x1e0 [ 6170.288104] ? handle_bug+0x3c/0x70 [ 6170.288112] ? exc_invalid_op+0x17/0x40 [ 6170.288120] ? asm_exc_invalid_op+0x1a/0x20 [ 6170.288132] ? nf_tables_chain_destroy+0x2b/0x220 [nf_tables] [ 6170.288243] ? nf_tables_chain_destroy+0x1f7/0x220 [nf_tables] [ 6170.288366] ? nf_tables_chain_destroy+0x2b/0x220 [nf_tables] [ 6170.288483] nf_tables_trans_destroy_work+0x588/0x590 [nf_tables] Fixes: 591054469b3e ("netfilter: nf_tables: revisit chain/object refcounting from elements") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> (cherry picked from commit 86a1471d7cde792941109b93b558b5dc078b9ee9) [Vegard: CVE-2024-27011; fixed conflicts due to missing commits 0e1ea651c9717ddcd8e0648d8468477a31867b0a ("netfilter: nf_tables: shrink memory consumption of set elements") and 9dad402b89e81a0516bad5e0ac009b7a0a80898f ("netfilter: nf_tables: expose opaque set element as struct nft_elem_priv") so we pass the correct types and values to nft_setelem_active_next() + nft_set_elem_ext()] Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
11 daysmac802154: Fix potential RCU dereference issue in mac802154_scan_workerJiawei Ye1-1/+3
commit bff1709b3980bd7f80be6786f64cc9a9ee9e56da upstream. In the `mac802154_scan_worker` function, the `scan_req->type` field was accessed after the RCU read-side critical section was unlocked. According to RCU usage rules, this is illegal and can lead to unpredictable behavior, such as accessing memory that has been updated or causing use-after-free issues. This possible bug was identified using a static analysis tool developed by myself, specifically designed to detect RCU-related issues. To address this, the `scan_req->type` value is now stored in a local variable `scan_req_type` while still within the RCU read-side critical section. The `scan_req_type` is then used after the RCU lock is released, ensuring that the type value is safely accessed without violating RCU rules. Fixes: e2c3e6f53a7a ("mac802154: Handle active scanning") Cc: stable@vger.kernel.org Signed-off-by: Jiawei Ye <jiawei.ye@foxmail.com> Acked-by: Miquel Raynal <miquel.raynal@bootlin.com> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Link: https://lore.kernel.org/tencent_3B2F4F2B4DA30FAE2F51A9634A16B3AD4908@qq.com Signed-off-by: Stefan Schmidt <stefan@datenfreihafen.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
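The pattern described above (copy the field into a local variable before leaving the read-side critical section) can be sketched in plain C. This is only an illustration: demo_read_lock(), demo_read_unlock(), struct demo_scan_req and demo_snapshot_scan_type() are hypothetical stand-ins so that the sketch builds outside the kernel, and they are not the mac802154 code or the kernel RCU API.

```c
#include <stddef.h>

/* Hypothetical stand-ins for rcu_read_lock()/rcu_read_unlock(); they only
 * track nesting so the sketch builds and runs outside the kernel. */
static int demo_read_depth;
static void demo_read_lock(void)   { demo_read_depth++; }
static void demo_read_unlock(void) { demo_read_depth--; }

enum demo_scan_type { DEMO_SCAN_PASSIVE, DEMO_SCAN_ACTIVE };

struct demo_scan_req {
	enum demo_scan_type type;
};

/* Pointer that a concurrent updater could replace and free in real code. */
static struct demo_scan_req *demo_current_req;

/* Copies the request type while the read-side section is still open.
 * Returns 0 on success, -1 if no request is pending. */
static int demo_snapshot_scan_type(enum demo_scan_type *out)
{
	struct demo_scan_req *req;
	int ret = -1;

	demo_read_lock();
	req = demo_current_req;
	if (req) {
		*out = req->type;	/* read the field inside the section */
		ret = 0;
	}
	demo_read_unlock();

	/* After this point only *out is used; req must not be dereferenced. */
	return ret;
}
```

The caller then branches on the copied value only, which is the property the commit message relies on.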
11 daysBluetooth: hci_event: Align BR/EDR JUST_WORKS paring with LELuiz Augusto von Dentz1-8/+5
commit b25e11f978b63cb7857890edb3a698599cddb10e upstream. This aligns the BR/EDR JUST_WORKS method with LE, which since 92516cd97fd4 ("Bluetooth: Always request for user confirmation for Just Works") always requests user confirmation with confirm_hint set, since the likes of bluetoothd have a dedicated policy around the JUST_WORKS method (e.g. main.conf:JustWorksRepairing). CVE: CVE-2024-8805 Cc: stable@vger.kernel.org Fixes: ba15a58b179e ("Bluetooth: Fix SSP acceptor just-works confirmation without MITM") Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com> Tested-by: Kiran K <kiran.k@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
11 daysgso: fix udp gso fraglist segmentation after pull from frag_listWillem de Bruijn1-2/+20
commit a1e40ac5b5e9077fe1f7ae0eb88034db0f9ae1ab upstream. Detect gso fraglist skbs with corrupted geometry (see below) and pass these to skb_segment instead of skb_segment_list, as the first can segment them correctly. Valid SKB_GSO_FRAGLIST skbs - consist of two or more segments - the head_skb holds the protocol headers plus first gso_size - one or more frag_list skbs hold exactly one segment - all but the last must be gso_size Optional datapath hooks such as NAT and BPF (bpf_skb_pull_data) can modify these skbs, breaking these invariants. In extreme cases they pull all data into skb linear. For UDP, this causes a NULL ptr deref in __udpv4_gso_segment_list_csum at udp_hdr(seg->next)->dest. Detect invalid geometry due to pull, by checking head_skb size. Don't just drop, as this may blackhole a destination. Convert to be able to pass to regular skb_segment. Link: https://lore.kernel.org/netdev/20240428142913.18666-1-shiming.cheng@mediatek.com/ Fixes: 9fd1ff5d2ac7 ("udp: Support UDP fraglist GRO/GSO.") Signed-off-by: Willem de Bruijn <willemb@google.com> Cc: stable@vger.kernel.org Link: https://patch.msgid.link/20241001171752.107580-1-willemdebruijn.kernel@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
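The four invariants listed above for a valid SKB_GSO_FRAGLIST skb can be written out as a small predicate. This is a sketch over a simplified model, not the kernel's sk_buff layout: struct demo_seg, struct demo_fraglist and demo_fraglist_is_valid() are hypothetical names, and the rule that a segment never exceeds gso_size is an assumption. The patch itself is described as checking only the head_skb size; the sketch checks every listed invariant for illustration.

```c
#include <stdbool.h>
#include <stddef.h>

/* Simplified model: one entry per frag_list skb, payload bytes only. */
struct demo_seg {
	size_t payload_len;
};

struct demo_fraglist {
	size_t head_payload_len;	/* payload in the head, headers excluded */
	size_t gso_size;
	const struct demo_seg *frags;
	size_t nr_frags;
};

static bool demo_fraglist_is_valid(const struct demo_fraglist *fl)
{
	size_t i;

	/* Two or more segments: the head plus at least one frag_list entry. */
	if (fl->gso_size == 0 || fl->nr_frags == 0)
		return false;

	/* The head holds the headers plus exactly the first gso_size bytes. */
	if (fl->head_payload_len != fl->gso_size)
		return false;

	for (i = 0; i < fl->nr_frags; i++) {
		size_t len = fl->frags[i].payload_len;
		bool last = (i == fl->nr_frags - 1);

		/* Assumed: a segment is never empty and never exceeds gso_size. */
		if (len == 0 || len > fl->gso_size)
			return false;
		/* All but the last segment must be exactly gso_size. */
		if (!last && len != fl->gso_size)
			return false;
	}
	return true;
}
```

In this model, a pull that moves extra data into the head changes head_payload_len, so the predicate fails and the caller would take the regular segmentation path.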
11 dayswifi: mac80211: fix RCU list iterationsJohannes Berg4-4/+8
[ Upstream commit ac35180032fbc5d80b29af00ba4881815ceefcb6 ] There are a number of places where RCU list iteration is used, but that aren't (always) called with RCU held. Use just list_for_each_entry() in most, and annotate iface iteration with the required locks. Reviewed-by: Miriam Rachel Korenblit <miriam.rachel.korenblit@intel.com> Link: https://patch.msgid.link/20240827094939.ed8ac0b2f897.I8443c9c3c0f8051841353491dae758021b53115e@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
11 daystcp: avoid reusing FIN_WAIT2 when trying to find port in connect() processJason Xing1-0/+3
[ Upstream commit 0d9e5df4a257afc3a471a82961ace9a22b88295a ] We found that a close-wait socket was reset by the other side because a new connection reused the same port, which was unexpected, so we investigated the underlying reason. The following experiment was conducted in a test environment. We limited the port range to 40000-40010 and delayed the call to close() after receiving a FIN from the active-close side, which makes it easy to reproduce what happened in production. Here are three connections captured by tcpdump: 127.0.0.1.40002 > 127.0.0.1.9999: Flags [S], seq 2965525191 127.0.0.1.9999 > 127.0.0.1.40002: Flags [S.], seq 2769915070 127.0.0.1.40002 > 127.0.0.1.9999: Flags [.], ack 1 127.0.0.1.40002 > 127.0.0.1.9999: Flags [F.], seq 1, ack 1 // a few seconds later, within 60 seconds 127.0.0.1.40002 > 127.0.0.1.9999: Flags [S], seq 2965590730 127.0.0.1.9999 > 127.0.0.1.40002: Flags [.], ack 2 127.0.0.1.40002 > 127.0.0.1.9999: Flags [R], seq 2965525193 // later, very quickly 127.0.0.1.40002 > 127.0.0.1.9999: Flags [S], seq 2965590730 127.0.0.1.9999 > 127.0.0.1.40002: Flags [S.], seq 3120990805 127.0.0.1.40002 > 127.0.0.1.9999: Flags [.], ack 1 As we can see, the first flow is reset because: 1) the client starts a new connection (the second one) 2) the client looks for a suitable port and finds a timewait socket (its state is timewait, substate is fin_wait2) 3) the client occupies that timewait port to send a SYN 4) the server finds a corresponding close-wait socket in the ehash table, then replies with a challenge ack 5) the client sends an RST to terminate this old close-wait socket. The port selection algorithm should not choose a FIN_WAIT2 socket when tcp_tw_reuse is turned on, because unread data may remain on the server side. If one side hasn't called close() yet, we should not consider the connection expendable and treat it at will. 
Even though the server sometimes cannot call close() as soon as we would expect, the connection should not be terminated lightly, especially not because of a second, unrelated connection. After this patch, we can see the expected failure if we start a connection when all the ports are occupied in fin_wait2 state: "Ncat: Cannot assign requested address." Reported-by: Jade Dong <jadedong@tencent.com> Signed-off-by: Jason Xing <kernelxing@tencent.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20240823001152.31004-1-kerneljasonxing@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
11 daysnetpoll: Ensure clean state on setup failuresBreno Leitao1-5/+10
[ Upstream commit ae5a0456e0b4cfd7e61619e55251ffdf1bc7adfb ] Modify netpoll_setup() and __netpoll_setup() to ensure that the netpoll structure (np) is left in a clean state if setup fails for any reason. This prevents carrying over misconfigured fields in case of partial setup success. Key changes: - np->dev is now set only after successful setup, ensuring it's always NULL if netpoll is not configured or if netpoll_setup() fails. - np->local_ip is zeroed if netpoll setup doesn't complete successfully. - Added DEBUG_NET_WARN_ON_ONCE() checks to catch unexpected states. - Reordered some operations in __netpoll_setup() for better logical flow. These changes improve the reliability of netpoll configuration, since it assures that the structure is fully initialized or totally unset. Suggested-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Breno Leitao <leitao@debian.org> Link: https://patch.msgid.link/20240822111051.179850-2-leitao@debian.org Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
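The "fully initialized or totally unset" idea above can be illustrated with a small setup routine that publishes its result only after every step has succeeded and wipes partial state otherwise. This is a sketch under stated assumptions: struct demo_dev, struct demo_netpoll and demo_netpoll_setup() are hypothetical and much simpler than the real netpoll structures; only the field names dev and local_ip are taken from the text.

```c
#include <stddef.h>
#include <string.h>

struct demo_dev {
	int usable;			/* hypothetical readiness flag */
};

struct demo_netpoll {
	struct demo_dev *dev;		/* NULL whenever setup has not succeeded */
	unsigned char local_ip[16];
};

/* Returns 0 on success, -1 on failure. On failure np is left clean. */
static int demo_netpoll_setup(struct demo_netpoll *np,
			      struct demo_dev *candidate)
{
	if (!candidate || !candidate->usable)
		goto fail;

	/* Publish the device only once nothing else can fail. */
	np->dev = candidate;
	return 0;

fail:
	/* Do not carry half-configured fields into a later attempt. */
	np->dev = NULL;
	memset(np->local_ip, 0, sizeof(np->local_ip));
	return -1;
}
```

A caller can then treat np->dev == NULL as a reliable "not configured" signal.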
11 daysipv4: Mask upper DSCP bits and ECN bits in NETLINK_FIB_LOOKUP familyIdo Schimmel1-1/+1
[ Upstream commit 8fed54758cd248cd311a2b5c1e180abef1866237 ] The NETLINK_FIB_LOOKUP netlink family can be used to perform a FIB lookup according to user provided parameters and communicate the result back to user space. However, unlike other users of the FIB lookup API, the upper DSCP bits and the ECN bits of the DS field are not masked, which can result in the wrong result being returned. Solve this by masking the upper DSCP bits and the ECN bits using IPTOS_RT_MASK. The structure that communicates the request and the response is not exported to user space, so it is unlikely that this netlink family is actually in use [1]. [1] https://lore.kernel.org/netdev/ZpqpB8vJU%2FQ6LSqa@debian/ Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Guillaume Nault <gnault@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
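The effect of the mask on a DS field value can be shown with a few lines of C. The constants below are local copies that mirror, to my understanding, how the uapi headers define IPTOS_TOS_MASK (0x1E) and IPTOS_RT_MASK (IPTOS_TOS_MASK & ~3); treat those values as an assumption and check them against the headers in your tree. demo_mask_tos() is a hypothetical helper, not a kernel function.

```c
#include <stdint.h>

/* Local copies for illustration; assumed to match the uapi definitions. */
#define DEMO_IPTOS_TOS_MASK	0x1E
#define DEMO_IPTOS_RT_MASK	(DEMO_IPTOS_TOS_MASK & ~3)	/* 0x1C */

/* Clears the three upper DSCP bits (7..5) and the two ECN bits (1..0). */
static uint8_t demo_mask_tos(uint8_t dsfield)
{
	return dsfield & DEMO_IPTOS_RT_MASK;
}
```

With these values, a DS field of 0xFF is reduced to 0x1C before the lookup, while a value of 0x10 passes through unchanged.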
11 daysipv4: Check !in_dev earlier for ioctl(SIOCSIFADDR).Kuniyuki Iwashima1-4/+2
[ Upstream commit e3af3d3c5b26c33a7950e34e137584f6056c4319 ] dev->ip_ptr could be NULL if we set an invalid MTU. Even then, if we issue ioctl(SIOCSIFADDR) for a new IPv4 address, devinet_ioctl() allocates struct in_ifaddr and fails later in inet_set_ifa() because in_dev is NULL. Let's move the check earlier. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20240809235406.50187-2-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
11 daystipc: guard against string buffer overrunSimon Horman1-2/+6
[ Upstream commit 6555a2a9212be6983d2319d65276484f7c5f431a ] Smatch reports that copying media_name and if_name to name_parts may overwrite the destination. .../bearer.c:166 bearer_name_validate() error: strcpy() 'media_name' too large for 'name_parts->media_name' (32 vs 16) .../bearer.c:167 bearer_name_validate() error: strcpy() 'if_name' too large for 'name_parts->if_name' (1010102 vs 16) This does seem to be the case so guard against this possibility by using strscpy() and failing if truncation occurs. Introduced by commit b97bf3fd8f6a ("[TIPC] Initial merge") Compile tested only. Reviewed-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20240801-tipic-overrun-v2-1-c5b869d1f074@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
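The "copy with a bound and fail on truncation" approach described above can be sketched in userspace C. strscpy() is a kernel helper and is not available here, so demo_bounded_copy() below is a hypothetical analogue written for this example; its return convention (length on success, -1 on truncation) is a choice made for the sketch.

```c
#include <stddef.h>
#include <string.h>

/* Copies src into dst, always NUL-terminating when dst_size > 0.
 * Returns the copied length, or -1 if src did not fit. */
static long demo_bounded_copy(char *dst, const char *src, size_t dst_size)
{
	size_t len;

	if (dst_size == 0)
		return -1;

	/* Measure src, but never look past what dst could hold. */
	for (len = 0; len < dst_size && src[len] != '\0'; len++)
		;

	if (len == dst_size) {
		/* No room for the terminator: truncate and report it. */
		memcpy(dst, src, dst_size - 1);
		dst[dst_size - 1] = '\0';
		return -1;
	}

	memcpy(dst, src, len + 1);	/* includes the terminator */
	return (long)len;
}
```

A validator in the spirit of the fix would reject the name as soon as either copy returns a negative value, instead of continuing with a truncated string.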
11 daysnet: sched: consistently use rcu_replace_pointer() in taprio_change()Dmitry Antipov1-1/+3
[ Upstream commit d5c4546062fd6f5dbce575c7ea52ad66d1968678 ] According to Vinicius (and carefully looking through the whole https://syzkaller.appspot.com/bug?extid=b65e0af58423fc8a73aa once again), txtime branch of 'taprio_change()' is not going to race against 'advance_sched()'. But using 'rcu_replace_pointer()' in the former may be a good idea as well. Suggested-by: Vinicius Costa Gomes <vinicius.gomes@intel.com> Signed-off-by: Dmitry Antipov <dmantipov@yandex.ru> Acked-by: Vinicius Costa Gomes <vinicius.gomes@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
11 dayswifi: cfg80211: Set correct chandef when starting CACIssam Hamdi1-1/+14
[ Upstream commit 20361712880396e44ce80aaeec2d93d182035651 ] When starting CAC in a mode other than AP mode, it returns a "WARNING: CPU: 0 PID: 63 at cfg80211_chandef_dfs_usable+0x20/0xaf [cfg80211]" caused by chandef.chan being NULL at the end of CAC. Solution: Ensure the channel definition is set for the different modes when starting CAC to avoid getting a NULL 'chan' at the end of CAC. Call Trace: ? show_regs.part.0+0x14/0x16 ? __warn+0x67/0xc0 ? cfg80211_chandef_dfs_usable+0x20/0xaf [cfg80211] ? report_bug+0xa7/0x130 ? exc_overflow+0x30/0x30 ? handle_bug+0x27/0x50 ? exc_invalid_op+0x18/0x60 ? handle_exception+0xf6/0xf6 ? exc_overflow+0x30/0x30 ? cfg80211_chandef_dfs_usable+0x20/0xaf [cfg80211] ? exc_overflow+0x30/0x30 ? cfg80211_chandef_dfs_usable+0x20/0xaf [cfg80211] ? regulatory_propagate_dfs_state.cold+0x1b/0x4c [cfg80211] ? cfg80211_propagate_cac_done_wk+0x1a/0x30 [cfg80211] ? process_one_work+0x165/0x280 ? worker_thread+0x120/0x3f0 ? kthread+0xc2/0xf0 ? process_one_work+0x280/0x280 ? kthread_complete_and_exit+0x20/0x20 ? ret_from_fork+0x19/0x24 Reported-by: Kretschmer Mathias <mathias.kretschmer@fit.fraunhofer.de> Signed-off-by: Issam Hamdi <ih@simonwunderlich.de> Link: https://patch.msgid.link/20240816142418.3381951-1-ih@simonwunderlich.de [shorten subject, remove OCB, reorder cases to match previous list] Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
11 daysBluetooth: L2CAP: Fix not validating setsockopt user inputLuiz Augusto von Dentz1-32/+20
[ Upstream commit 4f3951242ace5efc7131932e2e01e6ac6baed846 ] Check user input length before copying data. Fixes: 33575df7be67 ("Bluetooth: move l2cap_sock_setsockopt() to l2cap_sock.c") Fixes: 3ee7b7cd8390 ("Bluetooth: Add BT_MODE socket option") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
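"Check user input length before copying data" amounts to comparing the caller-supplied option length with the size of the destination before any bytes are read. A minimal userspace sketch follows; demo_copy_opt() is a hypothetical helper written for this example and is not the function the Bluetooth code uses.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Copies exactly dst_size bytes from the caller's buffer, but only if the
 * caller declared at least that many. Returns 0 or -EINVAL. */
static int demo_copy_opt(void *dst, size_t dst_size,
			 const void *optval, size_t optlen)
{
	if (optval == NULL || optlen < dst_size)
		return -EINVAL;	/* a short buffer would be over-read */

	memcpy(dst, optval, dst_size);
	return 0;
}
```

The destination is left untouched when the check fails, so a rejected call cannot leave a partially written option behind.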
11 daysBluetooth: ISO: Fix not validating setsockopt user inputLuiz Augusto von Dentz1-24/+12
[ Upstream commit 9e8742cdfc4b0e65266bb4a901a19462bda9285e ] Check user input length before copying data. Fixes: ccf74f2390d6 ("Bluetooth: Add BTPROTO_ISO socket type") Fixes: 0731c5ab4d51 ("Bluetooth: ISO: Add support for BT_PKT_STATUS") Fixes: f764a6c2c1e4 ("Bluetooth: ISO: Add broadcast support") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
11 daysBluetooth: hci_sock: Fix not validating setsockopt user inputLuiz Augusto von Dentz1-13/+8
[ Upstream commit b2186061d6043d6345a97100460363e990af0d46 ] Check user input length before copying data. Fixes: 09572fca7223 ("Bluetooth: hci_sock: Add support for BT_{SND,RCV}BUF") Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com> Signed-off-by: Sasha Levin <sashal@kernel.org>