summaryrefslogtreecommitdiff
path: root/net/ipv4
AgeCommit message (Collapse)AuthorFilesLines
2022-12-14ipv4: Fix incorrect route flushing when table ID 0 is usedIdo Schimmel1-0/+3
[ Upstream commit c0d999348e01df03e0a7f550351f3907fabbf611 ] Cited commit added the table ID to the FIB info structure, but did not properly initialize it when table ID 0 is used. This can lead to a route in the default VRF with a preferred source address not being flushed when the address is deleted. Consider the following example: # ip address add dev dummy1 192.0.2.1/28 # ip address add dev dummy1 192.0.2.17/28 # ip route add 198.51.100.0/24 via 192.0.2.2 src 192.0.2.17 metric 100 # ip route add table 0 198.51.100.0/24 via 192.0.2.2 src 192.0.2.17 metric 200 # ip route show 198.51.100.0/24 198.51.100.0/24 via 192.0.2.2 dev dummy1 src 192.0.2.17 metric 100 198.51.100.0/24 via 192.0.2.2 dev dummy1 src 192.0.2.17 metric 200 Both routes are installed in the default VRF, but they are using two different FIB info structures. One with a metric of 100 and table ID of 254 (main) and one with a metric of 200 and table ID of 0. Therefore, when the preferred source address is deleted from the default VRF, the second route is not flushed: # ip address del dev dummy1 192.0.2.17/28 # ip route show 198.51.100.0/24 198.51.100.0/24 via 192.0.2.2 dev dummy1 src 192.0.2.17 metric 200 Fix by storing a table ID of 254 instead of 0 in the route configuration structure. Add a test case that fails before the fix: # ./fib_tests.sh -t ipv4_del_addr IPv4 delete address route tests Regular FIB info TEST: Route removed from VRF when source address deleted [ OK ] TEST: Route in default VRF not removed [ OK ] TEST: Route removed in default VRF when source address deleted [ OK ] TEST: Route in VRF is not removed by address delete [ OK ] Identical FIB info with different table ID TEST: Route removed from VRF when source address deleted [ OK ] TEST: Route in default VRF not removed [ OK ] TEST: Route removed in default VRF when source address deleted [ OK ] TEST: Route in VRF is not removed by address delete [ OK ] Table ID 0 TEST: Route removed in default VRF when source address deleted [FAIL] Tests passed: 8 Tests failed: 1 And passes after: # ./fib_tests.sh -t ipv4_del_addr IPv4 delete address route tests Regular FIB info TEST: Route removed from VRF when source address deleted [ OK ] TEST: Route in default VRF not removed [ OK ] TEST: Route removed in default VRF when source address deleted [ OK ] TEST: Route in VRF is not removed by address delete [ OK ] Identical FIB info with different table ID TEST: Route removed from VRF when source address deleted [ OK ] TEST: Route in default VRF not removed [ OK ] TEST: Route removed in default VRF when source address deleted [ OK ] TEST: Route in VRF is not removed by address delete [ OK ] Table ID 0 TEST: Route removed in default VRF when source address deleted [ OK ] Tests passed: 9 Tests failed: 0 Fixes: 5a56a0b3a45d ("net: Don't delete routes in different VRFs") Reported-by: Donald Sharp <sharpd@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-12-14ipv4: Fix incorrect route flushing when source address is deletedIdo Schimmel1-0/+1
[ Upstream commit f96a3d74554df537b6db5c99c27c80e7afadc8d1 ] Cited commit added the table ID to the FIB info structure, but did not prevent structures with different table IDs from being consolidated. This can lead to routes being flushed from a VRF when an address is deleted from a different VRF. Fix by taking the table ID into account when looking for a matching FIB info. This is already done for FIB info structures backed by a nexthop object in fib_find_info_nh(). Add test cases that fail before the fix: # ./fib_tests.sh -t ipv4_del_addr IPv4 delete address route tests Regular FIB info TEST: Route removed from VRF when source address deleted [ OK ] TEST: Route in default VRF not removed [ OK ] TEST: Route removed in default VRF when source address deleted [ OK ] TEST: Route in VRF is not removed by address delete [ OK ] Identical FIB info with different table ID TEST: Route removed from VRF when source address deleted [FAIL] TEST: Route in default VRF not removed [ OK ] RTNETLINK answers: File exists TEST: Route removed in default VRF when source address deleted [ OK ] TEST: Route in VRF is not removed by address delete [FAIL] Tests passed: 6 Tests failed: 2 And pass after: # ./fib_tests.sh -t ipv4_del_addr IPv4 delete address route tests Regular FIB info TEST: Route removed from VRF when source address deleted [ OK ] TEST: Route in default VRF not removed [ OK ] TEST: Route removed in default VRF when source address deleted [ OK ] TEST: Route in VRF is not removed by address delete [ OK ] Identical FIB info with different table ID TEST: Route removed from VRF when source address deleted [ OK ] TEST: Route in default VRF not removed [ OK ] TEST: Route removed in default VRF when source address deleted [ OK ] TEST: Route in VRF is not removed by address delete [ OK ] Tests passed: 8 Tests failed: 0 Fixes: 5a56a0b3a45d ("net: Don't delete routes in different VRFs") Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-12-14ip_gre: do not report erspan version on GRE interfaceHangbin Liu1-19/+29
[ Upstream commit ee496694b9eea651ae1aa4c4667d886cdf74aa3b ] Although the type I ERSPAN is based on the barebones IP + GRE encapsulation and no extra ERSPAN header. Report erspan version on GRE interface looks unreasonable. Fix this by separating the erspan and gre fill info. IPv6 GRE does not have this info as IPv6 only supports erspan version 1 and 2. Reported-by: Jianlin Shi <jishi@redhat.com> Fixes: f989d546a2d5 ("erspan: Add type I version 0 support.") Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> Acked-by: William Tu <u9012063@gmail.com> Link: https://lore.kernel.org/r/20221203032858.3130339-1-liuhangbin@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-12-08ipv4: Fix route deletion when nexthop info is not specifiedIdo Schimmel1-3/+5
[ Upstream commit d5082d386eee7e8ec46fa8581932c81a4961dcef ] When the kernel receives a route deletion request from user space it tries to delete a route that matches the route attributes specified in the request. If only prefix information is specified in the request, the kernel should delete the first matching FIB alias regardless of its associated FIB info. However, an error is currently returned when the FIB info is backed by a nexthop object: # ip nexthop add id 1 via 192.0.2.2 dev dummy10 # ip route add 198.51.100.0/24 nhid 1 # ip route del 198.51.100.0/24 RTNETLINK answers: No such process Fix by matching on such a FIB info when legacy nexthop attributes are not specified in the request. An earlier check already covers the case where a nexthop ID is specified in the request. Add tests that cover these flows. Before the fix: # ./fib_nexthops.sh -t ipv4_fcnal ... TEST: Delete route when not specifying nexthop attributes [FAIL] Tests passed: 11 Tests failed: 1 After the fix: # ./fib_nexthops.sh -t ipv4_fcnal ... TEST: Delete route when not specifying nexthop attributes [ OK ] Tests passed: 12 Tests failed: 0 No regressions in other tests: # ./fib_nexthops.sh ... Tests passed: 228 Tests failed: 0 # ./fib_tests.sh ... Tests passed: 186 Tests failed: 0 Cc: stable@vger.kernel.org Reported-by: Jonas Gorski <jonas.gorski@gmail.com> Tested-by: Jonas Gorski <jonas.gorski@gmail.com> Fixes: 493ced1ac47c ("ipv4: Allow routes to use nexthop objects") Fixes: 6bf92d70e690 ("net: ipv4: fix route with nexthop object delete warning") Fixes: 61b91eb33a69 ("ipv4: Handle attempt to delete multipath route when fib_info contains an nh reference") Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20221124210932.2470010-1-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-12-08ipv4: Handle attempt to delete multipath route when fib_info contains an nh ↵David Ahern1-4/+4
reference [ Upstream commit 61b91eb33a69c3be11b259c5ea484505cd79f883 ] Gwangun Jung reported a slab-out-of-bounds access in fib_nh_match: fib_nh_match+0xf98/0x1130 linux-6.0-rc7/net/ipv4/fib_semantics.c:961 fib_table_delete+0x5f3/0xa40 linux-6.0-rc7/net/ipv4/fib_trie.c:1753 inet_rtm_delroute+0x2b3/0x380 linux-6.0-rc7/net/ipv4/fib_frontend.c:874 Separate nexthop objects are mutually exclusive with the legacy multipath spec. Fix fib_nh_match to return if the config for the to be deleted route contains a multipath spec while the fib_info is using a nexthop object. Fixes: 493ced1ac47c ("ipv4: Allow routes to use nexthop objects") Fixes: 6bf92d70e690 ("net: ipv4: fix route with nexthop object delete warning") Reported-by: Gwangun Jung <exsociety@gmail.com> Signed-off-by: David Ahern <dsahern@kernel.org> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Tested-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net> Stable-dep-of: d5082d386eee ("ipv4: Fix route deletion when nexthop info is not specified") Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-12-02tcp: configurable source port perturb table sizeGleb Mazovetskiy2-5/+15
[ Upstream commit aeac4ec8f46d610a10adbaeff5e2edf6a88ffc62 ] On embedded systems with little memory and no relevant security concerns, it is beneficial to reduce the size of the table. Reducing the size from 2^16 to 2^8 saves 255 KiB of kernel RAM. Makes the table size configurable as an expert option. The size was previously increased from 2^8 to 2^16 in commit 4c2c8f03a5ab ("tcp: increase source port perturb table to 2^16"). Signed-off-by: Gleb Mazovetskiy <glex.spb@gmail.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-12-02ipv4: Fix error return code in fib_table_insert()Ziyang Xuan1-1/+3
[ Upstream commit 568fe84940ac0e4e0b2cd7751b8b4911f7b9c215 ] In fib_table_insert(), if the alias was already inserted, but node not exist, the error code should be set before return from error handling path. Fixes: a6c76c17df02 ("ipv4: Notify route after insertion to the routing table") Signed-off-by: Ziyang Xuan <william.xuanziyang@huawei.com> Link: https://lore.kernel.org/r/20221120072838.2167047-1-william.xuanziyang@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-12-02dccp/tcp: Reset saddr on failure after inet6?_hash_connect().Kuniyuki Iwashima1-0/+2
[ Upstream commit 77934dc6db0d2b111a8f2759e9ad2fb67f5cffa5 ] When connect() is called on a socket bound to the wildcard address, we change the socket's saddr to a local address. If the socket fails to connect() to the destination, we have to reset the saddr. However, when an error occurs after inet_hash6?_connect() in (dccp|tcp)_v[46]_conect(), we forget to reset saddr and leave the socket bound to the address. From the user's point of view, whether saddr is reset or not varies with errno. Let's fix this inconsistent behaviour. Note that after this patch, the repro [0] will trigger the WARN_ON() in inet_csk_get_port() again, but this patch is not buggy and rather fixes a bug papering over the bhash2's bug for which we need another fix. For the record, the repro causes -EADDRNOTAVAIL in inet_hash6_connect() by this sequence: s1 = socket() s1.setsockopt(SOL_SOCKET, SO_REUSEADDR, 1) s1.bind(('127.0.0.1', 10000)) s1.sendto(b'hello', MSG_FASTOPEN, (('127.0.0.1', 10000))) # or s1.connect(('127.0.0.1', 10000)) s2 = socket() s2.setsockopt(SOL_SOCKET, SO_REUSEADDR, 1) s2.bind(('0.0.0.0', 10000)) s2.connect(('127.0.0.1', 10000)) # -EADDRNOTAVAIL s2.listen(32) # WARN_ON(inet_csk(sk)->icsk_bind2_hash != tb2); [0]: https://syzkaller.appspot.com/bug?extid=015d756bbd1f8b5c8f09 Fixes: 3df80d9320bc ("[DCCP]: Introduce DCCPv6") Fixes: 7c657876b63c ("[DCCP]: Initial implementation") Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Acked-by: Joanne Koong <joannelkoong@gmail.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-12-02netfilter: conntrack: Fix data-races around ct markDaniel Xu1-2/+2
[ Upstream commit 52d1aa8b8249ff477aaa38b6f74a8ced780d079c ] nf_conn:mark can be read from and written to in parallel. Use READ_ONCE()/WRITE_ONCE() for reads and writes to prevent unwanted compiler optimizations. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Daniel Xu <dxu@dxuuu.xyz> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-12-02xfrm: replay: Fix ESN wrap around for GSOChristian Langrock1-0/+3
[ Upstream commit 4b549ccce941798703f159b227aa28c716aa78fa ] When using GSO it can happen that the wrong seq_hi is used for the last packets before the wrap around. This can lead to double usage of a sequence number. To avoid this, we should serialize this last GSO packet. Fixes: d7dbefc45cf5 ("xfrm: Add xfrm_replay_overflow functions for offloading") Co-developed-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: Christian Langrock <christian.langrock@secunet.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-12-02xfrm: fix "disable_policy" on ipv4 early demuxEyal Birger1-0/+5
[ Upstream commit 3a5913183aa1b14148c723bda030e6102ad73008 ] The commit in the "Fixes" tag tried to avoid a case where policy check is ignored due to dst caching in next hops. However, when the traffic is locally consumed, the dst may be cached in a local TCP or UDP socket as part of early demux. In this case the "disable_policy" flag is not checked as ip_route_input_noref() was only called before caching, and thus, packets after the initial packet in a flow will be dropped if not matching policies. Fix by checking the "disable_policy" flag also when a valid dst is already available. Link: https://bugzilla.kernel.org/show_bug.cgi?id=216557 Reported-by: Monil Patel <monil191989@gmail.com> Fixes: e6175a2ed1f1 ("xfrm: fix "disable_policy" flag use when arriving from different devices") Signed-off-by: Eyal Birger <eyal.birger@gmail.com> ---- v2: use dev instead of skb->dev Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-11-26tcp: cdg: allow tcp_cdg_release() to be called multiple timesEric Dumazet1-0/+2
commit 72e560cb8c6f80fc2b4afc5d3634a32465e13a51 upstream. Apparently, mptcp is able to call tcp_disconnect() on an already disconnected flow. This is generally fine, unless current congestion control is CDG, because it might trigger a double-free [1] Instead of fixing MPTCP, and future bugs, we can make tcp_disconnect() more resilient. [1] BUG: KASAN: double-free in slab_free mm/slub.c:3539 [inline] BUG: KASAN: double-free in kfree+0xe2/0x580 mm/slub.c:4567 CPU: 0 PID: 3645 Comm: kworker/0:7 Not tainted 6.0.0-syzkaller-02734-g0326074ff465 #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 09/22/2022 Workqueue: events mptcp_worker Call Trace: <TASK> __dump_stack lib/dump_stack.c:88 [inline] dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:106 print_address_description mm/kasan/report.c:317 [inline] print_report.cold+0x2ba/0x719 mm/kasan/report.c:433 kasan_report_invalid_free+0x81/0x190 mm/kasan/report.c:462 ____kasan_slab_free+0x18b/0x1c0 mm/kasan/common.c:356 kasan_slab_free include/linux/kasan.h:200 [inline] slab_free_hook mm/slub.c:1759 [inline] slab_free_freelist_hook+0x8b/0x1c0 mm/slub.c:1785 slab_free mm/slub.c:3539 [inline] kfree+0xe2/0x580 mm/slub.c:4567 tcp_disconnect+0x980/0x1e20 net/ipv4/tcp.c:3145 __mptcp_close_ssk+0x5ca/0x7e0 net/mptcp/protocol.c:2327 mptcp_do_fastclose net/mptcp/protocol.c:2592 [inline] mptcp_worker+0x78c/0xff0 net/mptcp/protocol.c:2627 process_one_work+0x991/0x1610 kernel/workqueue.c:2289 worker_thread+0x665/0x1080 kernel/workqueue.c:2436 kthread+0x2e4/0x3a0 kernel/kthread.c:376 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:306 </TASK> Allocated by task 3671: kasan_save_stack+0x1e/0x40 mm/kasan/common.c:38 kasan_set_track mm/kasan/common.c:45 [inline] set_alloc_info mm/kasan/common.c:437 [inline] ____kasan_kmalloc mm/kasan/common.c:516 [inline] ____kasan_kmalloc mm/kasan/common.c:475 [inline] __kasan_kmalloc+0xa9/0xd0 mm/kasan/common.c:525 kmalloc_array include/linux/slab.h:640 [inline] kcalloc include/linux/slab.h:671 [inline] tcp_cdg_init+0x10d/0x170 net/ipv4/tcp_cdg.c:380 tcp_init_congestion_control+0xab/0x550 net/ipv4/tcp_cong.c:193 tcp_reinit_congestion_control net/ipv4/tcp_cong.c:217 [inline] tcp_set_congestion_control+0x96c/0xaa0 net/ipv4/tcp_cong.c:391 do_tcp_setsockopt+0x505/0x2320 net/ipv4/tcp.c:3513 tcp_setsockopt+0xd4/0x100 net/ipv4/tcp.c:3801 mptcp_setsockopt+0x35f/0x2570 net/mptcp/sockopt.c:844 __sys_setsockopt+0x2d6/0x690 net/socket.c:2252 __do_sys_setsockopt net/socket.c:2263 [inline] __se_sys_setsockopt net/socket.c:2260 [inline] __x64_sys_setsockopt+0xba/0x150 net/socket.c:2260 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd Freed by task 16: kasan_save_stack+0x1e/0x40 mm/kasan/common.c:38 kasan_set_track+0x21/0x30 mm/kasan/common.c:45 kasan_set_free_info+0x20/0x30 mm/kasan/generic.c:370 ____kasan_slab_free mm/kasan/common.c:367 [inline] ____kasan_slab_free+0x166/0x1c0 mm/kasan/common.c:329 kasan_slab_free include/linux/kasan.h:200 [inline] slab_free_hook mm/slub.c:1759 [inline] slab_free_freelist_hook+0x8b/0x1c0 mm/slub.c:1785 slab_free mm/slub.c:3539 [inline] kfree+0xe2/0x580 mm/slub.c:4567 tcp_cleanup_congestion_control+0x70/0x120 net/ipv4/tcp_cong.c:226 tcp_v4_destroy_sock+0xdd/0x750 net/ipv4/tcp_ipv4.c:2254 tcp_v6_destroy_sock+0x11/0x20 net/ipv6/tcp_ipv6.c:1969 inet_csk_destroy_sock+0x196/0x440 net/ipv4/inet_connection_sock.c:1157 tcp_done+0x23b/0x340 net/ipv4/tcp.c:4649 tcp_rcv_state_process+0x40e7/0x4990 net/ipv4/tcp_input.c:6624 tcp_v6_do_rcv+0x3fc/0x13c0 net/ipv6/tcp_ipv6.c:1525 tcp_v6_rcv+0x2e8e/0x3830 net/ipv6/tcp_ipv6.c:1759 ip6_protocol_deliver_rcu+0x2db/0x1950 net/ipv6/ip6_input.c:439 ip6_input_finish+0x14c/0x2c0 net/ipv6/ip6_input.c:484 NF_HOOK include/linux/netfilter.h:302 [inline] NF_HOOK include/linux/netfilter.h:296 [inline] ip6_input+0x9c/0xd0 net/ipv6/ip6_input.c:493 dst_input include/net/dst.h:455 [inline] ip6_rcv_finish+0x193/0x2c0 net/ipv6/ip6_input.c:79 ip_sabotage_in net/bridge/br_netfilter_hooks.c:874 [inline] ip_sabotage_in+0x1fa/0x260 net/bridge/br_netfilter_hooks.c:865 nf_hook_entry_hookfn include/linux/netfilter.h:142 [inline] nf_hook_slow+0xc5/0x1f0 net/netfilter/core.c:614 nf_hook.constprop.0+0x3ac/0x650 include/linux/netfilter.h:257 NF_HOOK include/linux/netfilter.h:300 [inline] ipv6_rcv+0x9e/0x380 net/ipv6/ip6_input.c:309 __netif_receive_skb_one_core+0x114/0x180 net/core/dev.c:5485 __netif_receive_skb+0x1f/0x1c0 net/core/dev.c:5599 netif_receive_skb_internal net/core/dev.c:5685 [inline] netif_receive_skb+0x12f/0x8d0 net/core/dev.c:5744 NF_HOOK include/linux/netfilter.h:302 [inline] NF_HOOK include/linux/netfilter.h:296 [inline] br_pass_frame_up+0x303/0x410 net/bridge/br_input.c:68 br_handle_frame_finish+0x909/0x1aa0 net/bridge/br_input.c:199 br_nf_hook_thresh+0x2f8/0x3d0 net/bridge/br_netfilter_hooks.c:1041 br_nf_pre_routing_finish_ipv6+0x695/0xef0 net/bridge/br_netfilter_ipv6.c:207 NF_HOOK include/linux/netfilter.h:302 [inline] br_nf_pre_routing_ipv6+0x417/0x7c0 net/bridge/br_netfilter_ipv6.c:237 br_nf_pre_routing+0x1496/0x1fe0 net/bridge/br_netfilter_hooks.c:507 nf_hook_entry_hookfn include/linux/netfilter.h:142 [inline] nf_hook_bridge_pre net/bridge/br_input.c:255 [inline] br_handle_frame+0x9c9/0x12d0 net/bridge/br_input.c:399 __netif_receive_skb_core+0x9fe/0x38f0 net/core/dev.c:5379 __netif_receive_skb_one_core+0xae/0x180 net/core/dev.c:5483 __netif_receive_skb+0x1f/0x1c0 net/core/dev.c:5599 process_backlog+0x3a0/0x7c0 net/core/dev.c:5927 __napi_poll+0xb3/0x6d0 net/core/dev.c:6494 napi_poll net/core/dev.c:6561 [inline] net_rx_action+0x9c1/0xd90 net/core/dev.c:6672 __do_softirq+0x1d0/0x9c8 kernel/softirq.c:571 Fixes: 2b0a8c9eee81 ("tcp: add CDG congestion control") Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-11-16tcp: prohibit TCP_REPAIR_OPTIONS if data was already sentLu Wei1-1/+1
[ Upstream commit 0c175da7b0378445f5ef53904247cfbfb87e0b78 ] If setsockopt with option name of TCP_REPAIR_OPTIONS and opt_code of TCPOPT_SACK_PERM is called to enable sack after data is sent and dupacks are received , it will trigger a warning in function tcp_verify_left_out() as follows: ============================================ WARNING: CPU: 8 PID: 0 at net/ipv4/tcp_input.c:2132 tcp_timeout_mark_lost+0x154/0x160 tcp_enter_loss+0x2b/0x290 tcp_retransmit_timer+0x50b/0x640 tcp_write_timer_handler+0x1c8/0x340 tcp_write_timer+0xe5/0x140 call_timer_fn+0x3a/0x1b0 __run_timers.part.0+0x1bf/0x2d0 run_timer_softirq+0x43/0xb0 __do_softirq+0xfd/0x373 __irq_exit_rcu+0xf6/0x140 The warning is caused in the following steps: 1. a socket named socketA is created 2. socketA enters repair mode without build a connection 3. socketA calls connect() and its state is changed to TCP_ESTABLISHED directly 4. socketA leaves repair mode 5. socketA calls sendmsg() to send data, packets_out and sack_outs(dup ack receives) increase 6. socketA enters repair mode again 7. socketA calls setsockopt with TCPOPT_SACK_PERM to enable sack 8. retransmit timer expires, it calls tcp_timeout_mark_lost(), lost_out increases 9. sack_outs + lost_out > packets_out triggers since lost_out and sack_outs increase repeatly In function tcp_timeout_mark_lost(), tp->sacked_out will be cleared if Step7 not happen and the warning will not be triggered. As suggested by Denis and Eric, TCP_REPAIR_OPTIONS should be prohibited if data was already sent. socket-tcp tests in CRIU has been tested as follows: $ sudo ./test/zdtm.py run -t zdtm/static/socket-tcp* --keep-going \ --ignore-taint socket-tcp* represent all socket-tcp tests in test/zdtm/static/. Fixes: b139ba4e90dc ("tcp: Repair connection-time negotiated parameters") Signed-off-by: Lu Wei <luwei32@huawei.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-11-16bpf, sockmap: Fix sk->sk_forward_alloc warn_on in sk_stream_kill_queuesWang Yufen1-0/+1
[ Upstream commit d8616ee2affcff37c5d315310da557a694a3303d ] During TCP sockmap redirect pressure test, the following warning is triggered: WARNING: CPU: 3 PID: 2145 at net/core/stream.c:205 sk_stream_kill_queues+0xbc/0xd0 CPU: 3 PID: 2145 Comm: iperf Kdump: loaded Tainted: G W 5.10.0+ #9 Call Trace: inet_csk_destroy_sock+0x55/0x110 inet_csk_listen_stop+0xbb/0x380 tcp_close+0x41b/0x480 inet_release+0x42/0x80 __sock_release+0x3d/0xa0 sock_close+0x11/0x20 __fput+0x9d/0x240 task_work_run+0x62/0x90 exit_to_user_mode_prepare+0x110/0x120 syscall_exit_to_user_mode+0x27/0x190 entry_SYSCALL_64_after_hwframe+0x44/0xa9 The reason we observed is that: When the listener is closing, a connection may have completed the three-way handshake but not accepted, and the client has sent some packets. The child sks in accept queue release by inet_child_forget()->inet_csk_destroy_sock(), but psocks of child sks have not released. To fix, add sock_map_destroy to release psocks. Signed-off-by: Wang Yufen <wangyufen@huawei.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Jakub Sitnicki <jakub@cloudflare.com> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/bpf/20220524075311.649153-1-wangyufen@huawei.com Stable-dep-of: 8bbabb3fddcd ("bpf, sock_map: Move cancel_work_sync() out of sock lock") Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-11-16bpf, sockmap: Fix the sk->sk_forward_alloc warning of sk_stream_kill_queuesWang Yufen1-3/+5
[ Upstream commit 8ec95b94716a1e4d126edc3fb2bc426a717e2dba ] When running `test_sockmap` selftests, the following warning appears: WARNING: CPU: 2 PID: 197 at net/core/stream.c:205 sk_stream_kill_queues+0xd3/0xf0 Call Trace: <TASK> inet_csk_destroy_sock+0x55/0x110 tcp_rcv_state_process+0xd28/0x1380 ? tcp_v4_do_rcv+0x77/0x2c0 tcp_v4_do_rcv+0x77/0x2c0 __release_sock+0x106/0x130 __tcp_close+0x1a7/0x4e0 tcp_close+0x20/0x70 inet_release+0x3c/0x80 __sock_release+0x3a/0xb0 sock_close+0x14/0x20 __fput+0xa3/0x260 task_work_run+0x59/0xb0 exit_to_user_mode_prepare+0x1b3/0x1c0 syscall_exit_to_user_mode+0x19/0x50 do_syscall_64+0x48/0x90 entry_SYSCALL_64_after_hwframe+0x44/0xae The root case is in commit 84472b436e76 ("bpf, sockmap: Fix more uncharged while msg has more_data"), where I used msg->sg.size to replace the tosend, causing breakage: if (msg->apply_bytes && msg->apply_bytes < tosend) tosend = psock->apply_bytes; Fixes: 84472b436e76 ("bpf, sockmap: Fix more uncharged while msg has more_data") Reported-by: Jakub Sitnicki <jakub@cloudflare.com> Signed-off-by: Wang Yufen <wangyufen@huawei.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Jakub Sitnicki <jakub@cloudflare.com> Link: https://lore.kernel.org/bpf/1667266296-8794-1-git-send-email-wangyufen@huawei.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-11-10tcp/udp: Make early_demux back namespacified.Kuniyuki Iwashima3-84/+26
commit 11052589cf5c0bab3b4884d423d5f60c38fcf25d upstream. Commit e21145a9871a ("ipv4: namespacify ip_early_demux sysctl knob") made it possible to enable/disable early_demux on a per-netns basis. Then, we introduced two knobs, tcp_early_demux and udp_early_demux, to switch it for TCP/UDP in commit dddb64bcb346 ("net: Add sysctl to toggle early demux for tcp and udp"). However, the .proc_handler() was wrong and actually disabled us from changing the behaviour in each netns. We can execute early_demux if net.ipv4.ip_early_demux is on and each proto .early_demux() handler is not NULL. When we toggle (tcp|udp)_early_demux, the change itself is saved in each netns variable, but the .early_demux() handler is a global variable, so the handler is switched based on the init_net's sysctl variable. Thus, netns (tcp|udp)_early_demux knobs have nothing to do with the logic. Whether we CAN execute proto .early_demux() is always decided by init_net's sysctl knob, and whether we DO it or not is by each netns ip_early_demux knob. This patch namespacifies (tcp|udp)_early_demux again. For now, the users of the .early_demux() handler are TCP and UDP only, and they are called directly to avoid retpoline. So, we can remove the .early_demux() handler from inet6?_protos and need not dereference them in ip6?_rcv_finish_core(). If another proto needs .early_demux(), we can restore it at that time. Fixes: dddb64bcb346 ("net: Add sysctl to toggle early demux for tcp and udp") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://lore.kernel.org/r/20220713175207.7727-1-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-11-03nh: fix scope used to find saddr when adding non gw nhNicolas Dichtel1-1/+1
[ Upstream commit bac0f937c343d651874f83b265ca8f5070ed4f06 ] As explained by Julian, fib_nh_scope is related to fib_nh_gw4, but fib_info_update_nhc_saddr() needs the scope of the route, which is the scope "before" fib_nh_scope, ie fib_nh_scope - 1. This patch fixes the problem described in commit 747c14307214 ("ip: fix dflt addr selection for connected nexthop"). Fixes: 597cfe4fc339 ("nexthop: Add support for IPv4 nexthops") Link: https://lore.kernel.org/netdev/6c8a44ba-c2d5-cdf-c5c7-5baf97cba38@ssi.bg/ Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Reviewed-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-11-03tcp: fix indefinite deferral of RTO with SACK renegingNeal Cardwell1-1/+2
[ Upstream commit 3d2af9cce3133b3bc596a9d065c6f9d93419ccfb ] This commit fixes a bug that can cause a TCP data sender to repeatedly defer RTOs when encountering SACK reneging. The bug is that when we're in fast recovery in a scenario with SACK reneging, every time we get an ACK we call tcp_check_sack_reneging() and it can note the apparent SACK reneging and rearm the RTO timer for srtt/2 into the future. In some SACK reneging scenarios that can happen repeatedly until the receive window fills up, at which point the sender can't send any more, the ACKs stop arriving, and the RTO fires at srtt/2 after the last ACK. But that can take far too long (O(10 secs)), since the connection is stuck in fast recovery with a low cwnd that cannot grow beyond ssthresh, even if more bandwidth is available. This fix changes the logic in tcp_check_sack_reneging() to only rearm the RTO timer if data is cumulatively ACKed, indicating forward progress. This avoids this kind of nearly infinite loop of RTO timer re-arming. In addition, this meets the goals of tcp_check_sack_reneging() in handling Windows TCP behavior that looks temporarily like SACK reneging but is not really. Many thanks to Jakub Kicinski and Neil Spring, who reported this issue and provided critical packet traces that enabled root-causing this issue. Also, many thanks to Jakub Kicinski for testing this fix. Fixes: 5ae344c949e7 ("tcp: reduce spurious retransmits due to transient SACK reneging") Reported-by: Jakub Kicinski <kuba@kernel.org> Reported-by: Neil Spring <ntspring@fb.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Cc: Yuchung Cheng <ycheng@google.com> Tested-by: Jakub Kicinski <kuba@kernel.org> Link: https://lore.kernel.org/r/20221021170821.1093930-1-ncardwell.kernel@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-11-03tcp: fix a signed-integer-overflow bug in tcp_add_backlog()Lu Wei1-1/+3
[ Upstream commit ec791d8149ff60c40ad2074af3b92a39c916a03f ] The type of sk_rcvbuf and sk_sndbuf in struct sock is int, and in tcp_add_backlog(), the variable limit is caculated by adding sk_rcvbuf, sk_sndbuf and 64 * 1024, it may exceed the max value of int and overflow. This patch reduces the limit budget by halving the sndbuf to solve this issue since ACK packets are much smaller than the payload. Fixes: c9c3321257e1 ("tcp: add tcp_add_backlog()") Signed-off-by: Lu Wei <luwei32@huawei.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Acked-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-11-03tcp: minor optimization in tcp_add_backlog()Eric Dumazet1-3/+2
[ Upstream commit d519f350967a60b85a574ad8aeac43f2b4384746 ] If packet is going to be coalesced, sk_sndbuf/sk_rcvbuf values are not used. Defer their access to the point we need them. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Stable-dep-of: ec791d8149ff ("tcp: fix a signed-integer-overflow bug in tcp_add_backlog()") Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-10-29udp: Update reuse->has_conns under reuseport_lock.Kuniyuki Iwashima2-2/+2
[ Upstream commit 69421bf98482d089e50799f45e48b25ce4a8d154 ] When we call connect() for a UDP socket in a reuseport group, we have to update sk->sk_reuseport_cb->has_conns to 1. Otherwise, the kernel could select a unconnected socket wrongly for packets sent to the connected socket. However, the current way to set has_conns is illegal and possible to trigger that problem. reuseport_has_conns() changes has_conns under rcu_read_lock(), which upgrades the RCU reader to the updater. Then, it must do the update under the updater's lock, reuseport_lock, but it doesn't for now. For this reason, there is a race below where we fail to set has_conns resulting in the wrong socket selection. To avoid the race, let's split the reader and updater with proper locking. cpu1 cpu2 +----+ +----+ __ip[46]_datagram_connect() reuseport_grow() . . |- reuseport_has_conns(sk, true) |- more_reuse = __reuseport_alloc(more_socks_size) | . | | |- rcu_read_lock() | |- reuse = rcu_dereference(sk->sk_reuseport_cb) | | | | | /* reuse->has_conns == 0 here */ | | |- more_reuse->has_conns = reuse->has_conns | |- reuse->has_conns = 1 | /* more_reuse->has_conns SHOULD BE 1 HERE */ | | | | | |- rcu_assign_pointer(reuse->socks[i]->sk_reuseport_cb, | | | more_reuse) | `- rcu_read_unlock() `- kfree_rcu(reuse, rcu) | |- sk->sk_state = TCP_ESTABLISHED Note the likely(reuse) in reuseport_has_conns_set() is always true, but we put the test there for ease of review. [0] For the record, usually, sk_reuseport_cb is changed under lock_sock(). The only exception is reuseport_grow() & TCP reqsk migration case. 1) shutdown() TCP listener, which is moved into the latter part of reuse->socks[] to migrate reqsk. 2) New listen() overflows reuse->socks[] and call reuseport_grow(). 3) reuse->max_socks overflows u16 with the new listener. 4) reuseport_grow() pops the old shutdown()ed listener from the array and update its sk->sk_reuseport_cb as NULL without lock_sock(). shutdown()ed TCP sk->sk_reuseport_cb can be changed without lock_sock(), but, reuseport_has_conns_set() is called only for UDP under lock_sock(), so likely(reuse) never be false in reuseport_has_conns_set(). [0]: https://lore.kernel.org/netdev/CANn89iLja=eQHbsM_Ta2sQF0tOGU8vAGrh_izRuuHjuO1ouUag@mail.gmail.com/ Fixes: acdcecc61285 ("udp: correct reuseport selection with connected sockets") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://lore.kernel.org/r/20221014182625.89913-1-kuniyu@amazon.com Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-10-26tcp: annotate data-race around tcp_md5sig_pool_populatedEric Dumazet1-4/+10
[ Upstream commit aacd467c0a576e5e44d2de4205855dc0fe43f6fb ] tcp_md5sig_pool_populated can be read while another thread changes its value. The race has no consequence because allocations are protected with tcp_md5sig_mutex. This patch adds READ_ONCE() and WRITE_ONCE() to document the race and silence KCSAN. Reported-by: Abhishek Shah <abhishek.shah@columbia.edu> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-10-26once: add DO_ONCE_SLOW() for sleepable contextsEric Dumazet1-2/+2
[ Upstream commit 62c07983bef9d3e78e71189441e1a470f0d1e653 ] Christophe Leroy reported a ~80ms latency spike happening at first TCP connect() time. This is because __inet_hash_connect() uses get_random_once() to populate a perturbation table which became quite big after commit 4c2c8f03a5ab ("tcp: increase source port perturb table to 2^16") get_random_once() uses DO_ONCE(), which block hard irqs for the duration of the operation. This patch adds DO_ONCE_SLOW() which uses a mutex instead of a spinlock for operations where we prefer to stay in process context. Then __inet_hash_connect() can use get_random_slow_once() to populate its perturbation table. Fixes: 4c2c8f03a5ab ("tcp: increase source port perturb table to 2^16") Fixes: 190cc82489f4 ("tcp: change source port randomizarion at connect() time") Reported-by: Christophe Leroy <christophe.leroy@csgroup.eu> Link: https://lore.kernel.org/netdev/CANn89iLAEYBaoYajy0Y9UmGFff5GPxDUoG-ErVB2jDdRNQ5Tug@mail.gmail.com/T/#t Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Willy Tarreau <w@1wt.eu> Tested-by: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-10-26tcp: fix tcp_cwnd_validate() to not forget is_cwnd_limitedNeal Cardwell2-7/+14
[ Upstream commit f4ce91ce12a7c6ead19b128ffa8cff6e3ded2a14 ] This commit fixes a bug in the tracking of max_packets_out and is_cwnd_limited. This bug can cause the connection to fail to remember that is_cwnd_limited is true, causing the connection to fail to grow cwnd when it should, causing throughput to be lower than it should be. The following event sequence is an example that triggers the bug: (a) The connection is cwnd_limited, but packets_out is not at its peak due to TSO deferral deciding not to send another skb yet. In such cases the connection can advance max_packets_seq and set tp->is_cwnd_limited to true and max_packets_out to a small number. (b) Then later in the round trip the connection is pacing-limited (not cwnd-limited), and packets_out is larger. In such cases the connection would raise max_packets_out to a bigger number but (unexpectedly) flip tp->is_cwnd_limited from true to false. This commit fixes that bug. One straightforward fix would be to separately track (a) the next window after max_packets_out reaches a maximum, and (b) the next window after tp->is_cwnd_limited is set to true. But this would require consuming an extra u32 sequence number. Instead, to save space we track only the most important information. Specifically, we track the strongest available signal of the degree to which the cwnd is fully utilized: (1) If the connection is cwnd-limited then we remember that fact for the current window. (2) If the connection not cwnd-limited then we track the maximum number of outstanding packets in the current window. In particular, note that the new logic cannot trigger the buggy (a)/(b) sequence above because with the new logic a condition where tp->packets_out > tp->max_packets_out can only trigger an update of tp->is_cwnd_limited if tp->is_cwnd_limited is false. This first showed up in a testing of a BBRv2 dev branch, but this buggy behavior highlighted a general issue with the tcp_cwnd_validate() logic that can cause cwnd to fail to increase at the proper rate for any TCP congestion control, including Reno or CUBIC. Fixes: ca8a22634381 ("tcp: make cwnd-limited checks measurement-based, and gentler") Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Kevin(Yudong) Yang <yyd@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-10-26netfilter: nft_fib: Fix for rpath check with VRF devicesPhil Sutter1-0/+3
[ Upstream commit 2a8a7c0eaa8747c16aa4a48d573aa920d5c00a5c ] Analogous to commit b575b24b8eee3 ("netfilter: Fix rpfilter dropping vrf packets by mistake") but for nftables fib expression: Add special treatment of VRF devices so that typical reverse path filtering via 'fib saddr . iif oif' expression works as expected. Fixes: f6d0cbcf09c50 ("netfilter: nf_tables: add fib expression") Signed-off-by: Phil Sutter <phil@nwl.cc> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-09-23net: Find dst with sk's xfrm policy not ctl_sksewookseo2-1/+3
commit e22aa14866684f77b4f6b6cae98539e520ddb731 upstream. If we set XFRM security policy by calling setsockopt with option IPV6_XFRM_POLICY, the policy will be stored in 'sock_policy' in 'sock' struct. However tcp_v6_send_response doesn't look up dst_entry with the actual socket but looks up with tcp control socket. This may cause a problem that a RST packet is sent without ESP encryption & peer's TCP socket can't receive it. This patch will make the function look up dest_entry with actual socket, if the socket has XFRM policy(sock_policy), so that the TCP response packet via this function can be encrypted, & aligned on the encrypted TCP socket. Tested: We encountered this problem when a TCP socket which is encrypted in ESP transport mode encryption, receives challenge ACK at SYN_SENT state. After receiving challenge ACK, TCP needs to send RST to establish the socket at next SYN try. But the RST was not encrypted & peer TCP socket still remains on ESTABLISHED state. So we verified this with test step as below. [Test step] 1. Making a TCP state mismatch between client(IDLE) & server(ESTABLISHED). 2. Client tries a new connection on the same TCP ports(src & dst). 3. Server will return challenge ACK instead of SYN,ACK. 4. Client will send RST to server to clear the SOCKET. 5. Client will retransmit SYN to server on the same TCP ports. [Expected result] The TCP connection should be established. Cc: Maciej Żenczykowski <maze@google.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Steffen Klassert <steffen.klassert@secunet.com> Cc: Sehee Lee <seheele@google.com> Signed-off-by: Sewook Seo <sewookseo@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-09-15tcp: fix early ETIMEDOUT after spurious non-SACK RTONeal Cardwell1-7/+18
[ Upstream commit 686dc2db2a0fdc1d34b424ec2c0a735becd8d62b ] Fix a bug reported and analyzed by Nagaraj Arankal, where the handling of a spurious non-SACK RTO could cause a connection to fail to clear retrans_stamp, causing a later RTO to very prematurely time out the connection with ETIMEDOUT. Here is the buggy scenario, expanding upon Nagaraj Arankal's excellent report: (*1) Send one data packet on a non-SACK connection (*2) Because no ACK packet is received, the packet is retransmitted and we enter CA_Loss; but this retransmission is spurious. (*3) The ACK for the original data is received. The transmitted packet is acknowledged. The TCP timestamp is before the retrans_stamp, so tcp_may_undo() returns true, and tcp_try_undo_loss() returns true without changing state to Open (because tcp_is_sack() is false), and tcp_process_loss() returns without calling tcp_try_undo_recovery(). Normally after undoing a CA_Loss episode, tcp_fastretrans_alert() would see that the connection has returned to CA_Open and fall through and call tcp_try_to_open(), which would set retrans_stamp to 0. However, for non-SACK connections we hold the connection in CA_Loss, so do not fall through to call tcp_try_to_open() and do not set retrans_stamp to 0. So retrans_stamp is (erroneously) still non-zero. At this point the first "retransmission event" has passed and been recovered from. Any future retransmission is a completely new "event". However, retrans_stamp is erroneously still set. (And we are still in CA_Loss, which is correct.) (*4) After 16 minutes (to correspond with tcp_retries2=15), a new data packet is sent. Note: No data is transmitted between (*3) and (*4) and we disabled keep alives. The socket's timeout SHOULD be calculated from this point in time, but instead it's calculated from the prior "event" 16 minutes ago (step (*2)). (*5) Because no ACK packet is received, the packet is retransmitted. (*6) At the time of the 2nd retransmission, the socket returns ETIMEDOUT, prematurely, because retrans_stamp is (erroneously) too far in the past (set at the time of (*2)). This commit fixes this bug by ensuring that we reuse in tcp_try_undo_loss() the same careful logic for non-SACK connections that we have in tcp_try_undo_recovery(). To avoid duplicating logic, we factor out that logic into a new tcp_is_non_sack_preventing_reopen() helper and call that helper from both undo functions. Fixes: da34ac7626b5 ("tcp: only undo on partial ACKs in CA_Loss") Reported-by: Nagaraj Arankal <nagaraj.p.arankal@hpe.com> Link: https://lore.kernel.org/all/SJ0PR84MB1847BE6C24D274C46A1B9B0EB27A9@SJ0PR84MB1847.NAMPRD84.PROD.OUTLOOK.COM/ Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20220903121023.866900-1-ncardwell.kernel@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-09-15tcp: TX zerocopy should not sense pfmemalloc statusEric Dumazet1-1/+1
[ Upstream commit 3261400639463a853ba2b3be8bd009c2a8089775 ] We got a recent syzbot report [1] showing a possible misuse of pfmemalloc page status in TCP zerocopy paths. Indeed, for pages coming from user space or other layers, using page_is_pfmemalloc() is moot, and possibly could give false positives. There has been attempts to make page_is_pfmemalloc() more robust, but not using it in the first place in this context is probably better, removing cpu cycles. Note to stable teams : You need to backport 84ce071e38a6 ("net: introduce __skb_fill_page_desc_noacc") as a prereq. Race is more probable after commit c07aea3ef4d4 ("mm: add a signature in struct page") because page_is_pfmemalloc() is now using low order bit from page->lru.next, which can change more often than page->index. Low order bit should never be set for lru.next (when used as an anchor in LRU list), so KCSAN report is mostly a false positive. Backporting to older kernel versions seems not necessary. [1] BUG: KCSAN: data-race in lru_add_fn / tcp_build_frag write to 0xffffea0004a1d2c8 of 8 bytes by task 18600 on cpu 0: __list_add include/linux/list.h:73 [inline] list_add include/linux/list.h:88 [inline] lruvec_add_folio include/linux/mm_inline.h:105 [inline] lru_add_fn+0x440/0x520 mm/swap.c:228 folio_batch_move_lru+0x1e1/0x2a0 mm/swap.c:246 folio_batch_add_and_move mm/swap.c:263 [inline] folio_add_lru+0xf1/0x140 mm/swap.c:490 filemap_add_folio+0xf8/0x150 mm/filemap.c:948 __filemap_get_folio+0x510/0x6d0 mm/filemap.c:1981 pagecache_get_page+0x26/0x190 mm/folio-compat.c:104 grab_cache_page_write_begin+0x2a/0x30 mm/folio-compat.c:116 ext4_da_write_begin+0x2dd/0x5f0 fs/ext4/inode.c:2988 generic_perform_write+0x1d4/0x3f0 mm/filemap.c:3738 ext4_buffered_write_iter+0x235/0x3e0 fs/ext4/file.c:270 ext4_file_write_iter+0x2e3/0x1210 call_write_iter include/linux/fs.h:2187 [inline] new_sync_write fs/read_write.c:491 [inline] vfs_write+0x468/0x760 fs/read_write.c:578 ksys_write+0xe8/0x1a0 fs/read_write.c:631 __do_sys_write fs/read_write.c:643 [inline] __se_sys_write fs/read_write.c:640 [inline] __x64_sys_write+0x3e/0x50 fs/read_write.c:640 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x2b/0x70 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd read to 0xffffea0004a1d2c8 of 8 bytes by task 18611 on cpu 1: page_is_pfmemalloc include/linux/mm.h:1740 [inline] __skb_fill_page_desc include/linux/skbuff.h:2422 [inline] skb_fill_page_desc include/linux/skbuff.h:2443 [inline] tcp_build_frag+0x613/0xb20 net/ipv4/tcp.c:1018 do_tcp_sendpages+0x3e8/0xaf0 net/ipv4/tcp.c:1075 tcp_sendpage_locked net/ipv4/tcp.c:1140 [inline] tcp_sendpage+0x89/0xb0 net/ipv4/tcp.c:1150 inet_sendpage+0x7f/0xc0 net/ipv4/af_inet.c:833 kernel_sendpage+0x184/0x300 net/socket.c:3561 sock_sendpage+0x5a/0x70 net/socket.c:1054 pipe_to_sendpage+0x128/0x160 fs/splice.c:361 splice_from_pipe_feed fs/splice.c:415 [inline] __splice_from_pipe+0x222/0x4d0 fs/splice.c:559 splice_from_pipe fs/splice.c:594 [inline] generic_splice_sendpage+0x89/0xc0 fs/splice.c:743 do_splice_from fs/splice.c:764 [inline] direct_splice_actor+0x80/0xa0 fs/splice.c:931 splice_direct_to_actor+0x305/0x620 fs/splice.c:886 do_splice_direct+0xfb/0x180 fs/splice.c:974 do_sendfile+0x3bf/0x910 fs/read_write.c:1249 __do_sys_sendfile64 fs/read_write.c:1317 [inline] __se_sys_sendfile64 fs/read_write.c:1303 [inline] __x64_sys_sendfile64+0x10c/0x150 fs/read_write.c:1303 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x2b/0x70 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd value changed: 0x0000000000000000 -> 0xffffea0004a1d288 Reported by Kernel Concurrency Sanitizer on: CPU: 1 PID: 18611 Comm: syz-executor.4 Not tainted 6.0.0-rc2-syzkaller-00248-ge022620b5d05-dirty #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 07/22/2022 Fixes: c07aea3ef4d4 ("mm: add a signature in struct page") Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Shakeel Butt <shakeelb@google.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-09-15rxrpc: Fix ICMP/ICMP6 error handlingDavid Howells2-0/+3
[ Upstream commit ac56a0b48da86fd1b4389632fb7c4c8a5d86eefa ] Because rxrpc pretends to be a tunnel on top of a UDP/UDP6 socket, allowing it to siphon off UDP packets early in the handling of received UDP packets thereby avoiding the packet going through the UDP receive queue, it doesn't get ICMP packets through the UDP ->sk_error_report() callback. In fact, it doesn't appear that there's any usable option for getting hold of ICMP packets. Fix this by adding a new UDP encap hook to distribute error messages for UDP tunnels. If the hook is set, then the tunnel driver will be able to see ICMP packets. The hook provides the offset into the packet of the UDP header of the original packet that caused the notification. An alternative would be to call the ->error_handler() hook - but that requires that the skbuff be cloned (as ip_icmp_error() or ipv6_cmp_error() do, though isn't really necessary or desirable in rxrpc's case is we want to parse them there and then, not queue them). Changes ======= ver #3) - Fixed an uninitialised variable. ver #2) - Fixed some missing CONFIG_AF_RXRPC_IPV6 conditionals. Fixes: 5271953cad31 ("rxrpc: Use the UDP encap_rcv hook") Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-09-08ip: fix triggering of 'icmp redirect'Nicolas Dichtel1-2/+2
commit eb55dc09b5dd040232d5de32812cc83001a23da6 upstream. __mkroute_input() uses fib_validate_source() to trigger an icmp redirect. My understanding is that fib_validate_source() is used to know if the src address and the gateway address are on the same link. For that, fib_validate_source() returns 1 (same link) or 0 (not the same network). __mkroute_input() is the only user of these positive values, all other callers only look if the returned value is negative. Since the below patch, fib_validate_source() didn't return anymore 1 when both addresses are on the same network, because the route lookup returns RT_SCOPE_LINK instead of RT_SCOPE_HOST. But this is, in fact, right. Let's adapat the test to return 1 again when both addresses are on the same link. CC: stable@vger.kernel.org Fixes: 747c14307214 ("ip: fix dflt addr selection for connected nexthop") Reported-by: kernel test robot <yujie.liu@intel.com> Reported-by: Heng Qi <hengqi@linux.alibaba.com> Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20220829100121.3821-1-nicolas.dichtel@6wind.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-09-08tcp: annotate data-race around challenge_timestampEric Dumazet1-2/+2
[ Upstream commit 8c70521238b7863c2af607e20bcba20f974c969b ] challenge_timestamp can be read an written by concurrent threads. This was expected, but we need to annotate the race to avoid potential issues. Following patch moves challenge_timestamp and challenge_count to per-netns storage to provide better isolation. Fixes: 354e4aa391ed ("tcp: RFC 5961 5.2 Blind Data Injection Attack Mitigation") Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-08-31net: Fix data-races around sysctl_devconf_inherit_init_net.Kuniyuki Iwashima1-6/+10
[ Upstream commit a5612ca10d1aa05624ebe72633e0c8c792970833 ] While reading sysctl_devconf_inherit_init_net, it can be changed concurrently. Thus, we need to add READ_ONCE() to its readers. Fixes: 856c395cfa63 ("net: introduce a knob to control whether to inherit devconf config") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-08-31net: Fix data-races around sysctl_max_skb_frags.Kuniyuki Iwashima1-2/+2
[ Upstream commit 657b991afb89d25fe6c4783b1b75a8ad4563670d ] While reading sysctl_max_skb_frags, it can be changed concurrently. Thus, we need to add READ_ONCE() to its readers. Fixes: 5f74f82ea34c ("net:Add sysctl_max_skb_frags") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-08-31tcp: expose the tcp_mark_push() and tcp_skb_entail() helpersPaolo Abeni1-4/+4
[ Upstream commit 04d8825c30b718781197c8f07b1915a11bfb8685 ] the tcp_skb_entail() helper is actually skb_entail(), renamed to provide proper scope. The two helper will be used by the next patch. RFC -> v1: - rename skb_entail to tcp_skb_entail (Eric) Acked-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-08-31net: Fix data-races around sysctl_optmem_max.Kuniyuki Iwashima1-3/+3
[ Upstream commit 7de6d09f51917c829af2b835aba8bb5040f8e86a ] While reading sysctl_optmem_max, it can be changed concurrently. Thus, we need to add READ_ONCE() to its readers. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-08-31net: Fix data-races around sysctl_[rw]mem_(max|default).Kuniyuki Iwashima2-2/+2
[ Upstream commit 1227c1771dd2ad44318aa3ab9e3a293b3f34ff2a ] While reading sysctl_[rw]mem_(max|default), they can be changed concurrently. Thus, we need to add READ_ONCE() to its readers. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-08-17tcp: fix over estimation in sk_forced_mem_schedule()Eric Dumazet1-3/+4
commit c4ee118561a0f74442439b7b5b486db1ac1ddfeb upstream. sk_forced_mem_schedule() has a bug similar to ones fixed in commit 7c80b038d23e ("net: fix sk_wmem_schedule() and sk_rmem_schedule() errors") While this bug has little chance to trigger in old kernels, we need to fix it before the following patch. Fixes: d83769a580f1 ("tcp: fix possible deadlock in tcp_send_fin()") Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Reviewed-by: Wei Wang <weiwan@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-08-17ipv6: add READ_ONCE(sk->sk_bound_dev_if) in INET6_MATCH()Eric Dumazet1-1/+1
[ Upstream commit 5d368f03280d3678433a7f119efe15dfbbb87bc8 ] INET6_MATCH() runs without holding a lock on the socket. We probably need to annotate most reads. This patch makes INET6_MATCH() an inline function to ease our changes. v2: inline function only defined if IS_ENABLED(CONFIG_IPV6) Change the name to inet6_match(), this is no longer a macro. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-08-17inet: add READ_ONCE(sk->sk_bound_dev_if) in INET_MATCH()Eric Dumazet2-12/+6
[ Upstream commit 4915d50e300e96929d2462041d6f6c6f061167fd ] INET_MATCH() runs without holding a lock on the socket. We probably need to annotate most reads. This patch makes INET_MATCH() an inline function to ease our changes. v2: We remove the 32bit version of it, as modern compilers should generate the same code really, no need to try to be smarter. Also make 'struct net *net' the first argument. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-08-17tcp: make retransmitted SKB fit into the send windowYonglong Li1-7/+16
[ Upstream commit 536a6c8e05f95e3d1118c40ae8b3022ee2d05d52 ] current code of __tcp_retransmit_skb only check TCP_SKB_CB(skb)->seq in send window, and TCP_SKB_CB(skb)->seq_end maybe out of send window. If receiver has shrunk his window, and skb is out of new window, it should retransmit a smaller portion of the payload. test packetdrill script: 0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3 +0 fcntl(3, F_GETFL) = 0x2 (flags O_RDWR) +0 fcntl(3, F_SETFL, O_RDWR|O_NONBLOCK) = 0 +0 connect(3, ..., ...) = -1 EINPROGRESS (Operation now in progress) +0 > S 0:0(0) win 65535 <mss 1460,sackOK,TS val 100 ecr 0,nop,wscale 8> +.05 < S. 0:0(0) ack 1 win 6000 <mss 1000,nop,nop,sackOK> +0 > . 1:1(0) ack 1 +0 write(3, ..., 10000) = 10000 +0 > . 1:2001(2000) ack 1 win 65535 +0 > . 2001:4001(2000) ack 1 win 65535 +0 > . 4001:6001(2000) ack 1 win 65535 +.05 < . 1:1(0) ack 4001 win 1001 and tcpdump show: 192.168.226.67.55 > 192.0.2.1.8080: Flags [.], seq 1:2001, ack 1, win 65535, length 2000 192.168.226.67.55 > 192.0.2.1.8080: Flags [.], seq 2001:4001, ack 1, win 65535, length 2000 192.168.226.67.55 > 192.0.2.1.8080: Flags [P.], seq 4001:5001, ack 1, win 65535, length 1000 192.168.226.67.55 > 192.0.2.1.8080: Flags [.], seq 5001:6001, ack 1, win 65535, length 1000 192.0.2.1.8080 > 192.168.226.67.55: Flags [.], ack 4001, win 1001, length 0 192.168.226.67.55 > 192.0.2.1.8080: Flags [.], seq 5001:6001, ack 1, win 65535, length 1000 192.168.226.67.55 > 192.0.2.1.8080: Flags [P.], seq 4001:5001, ack 1, win 65535, length 1000 when cient retract window to 1001, send window is [4001,5002], but TLP send 5001-6001 packet which is out of send window. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Yonglong Li <liyonglong@chinatelecom.cn> Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/1657532838-20200-1-git-send-email-liyonglong@chinatelecom.cn Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-08-03ipv4: Fix data-races around sysctl_fib_notify_on_flag_change.Kuniyuki Iwashima1-2/+5
[ Upstream commit 96b9bd8c6d125490f9adfb57d387ef81a55a103e ] While reading sysctl_fib_notify_on_flag_change, it can be changed concurrently. Thus, we need to add READ_ONCE() to its readers. Fixes: 680aea08e78c ("net: ipv4: Emit notification when fib hardware flags are changed") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-08-03tcp: Fix data-races around sysctl_tcp_reflect_tos.Kuniyuki Iwashima1-2/+2
[ Upstream commit 870e3a634b6a6cb1543b359007aca73fe6a03ac5 ] While reading sysctl_tcp_reflect_tos, it can be changed concurrently. Thus, we need to add READ_ONCE() to its readers. Fixes: ac8f1710c12b ("tcp: reflect tos value received in SYN to the socket") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Acked-by: Wei Wang <weiwan@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-08-03tcp: Fix a data-race around sysctl_tcp_comp_sack_nr.Kuniyuki Iwashima1-1/+1
[ Upstream commit 79f55473bfc8ac51bd6572929a679eeb4da22251 ] While reading sysctl_tcp_comp_sack_nr, it can be changed concurrently. Thus, we need to add READ_ONCE() to its reader. Fixes: 9c21d2fc41c0 ("tcp: add tcp_comp_sack_nr sysctl") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-08-03tcp: Fix a data-race around sysctl_tcp_comp_sack_slack_ns.Kuniyuki Iwashima1-1/+1
[ Upstream commit 22396941a7f343d704738360f9ef0e6576489d43 ] While reading sysctl_tcp_comp_sack_slack_ns, it can be changed concurrently. Thus, we need to add READ_ONCE() to its reader. Fixes: a70437cc09a1 ("tcp: add hrtimer slack to sack compression") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-08-03tcp: Fix a data-race around sysctl_tcp_comp_sack_delay_ns.Kuniyuki Iwashima1-1/+2
[ Upstream commit 4866b2b0f7672b6d760c4b8ece6fb56f965dcc8a ] While reading sysctl_tcp_comp_sack_delay_ns, it can be changed concurrently. Thus, we need to add READ_ONCE() to its reader. Fixes: 6d82aa242092 ("tcp: add tcp_comp_sack_delay_ns sysctl") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-08-03net: Fix data-races around sysctl_[rw]mem(_offset)?.Kuniyuki Iwashima3-10/+11
[ Upstream commit 02739545951ad4c1215160db7fbf9b7a918d3c0b ] While reading these sysctl variables, they can be changed concurrently. Thus, we need to add READ_ONCE() to their readers. - .sysctl_rmem - .sysctl_rwmem - .sysctl_rmem_offset - .sysctl_wmem_offset - sysctl_tcp_rmem[1, 2] - sysctl_tcp_wmem[1, 2] - sysctl_decnet_rmem[1] - sysctl_decnet_wmem[1] - sysctl_tipc_rmem[1] Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-08-03tcp: Fix data-races around sk_pacing_rate.Kuniyuki Iwashima1-2/+2
[ Upstream commit 59bf6c65a09fff74215517aecffbbdcd67df76e3 ] While reading sysctl_tcp_pacing_(ss|ca)_ratio, they can be changed concurrently. Thus, we need to add READ_ONCE() to their readers. Fixes: 43e122b014c9 ("tcp: refine pacing rate determination") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-08-03tcp: Fix a data-race around sysctl_tcp_invalid_ratelimit.Kuniyuki Iwashima1-1/+2
[ Upstream commit 2afdbe7b8de84c28e219073a6661080e1b3ded48 ] While reading sysctl_tcp_invalid_ratelimit, it can be changed concurrently. Thus, we need to add READ_ONCE() to its reader. Fixes: 032ee4236954 ("tcp: helpers to mitigate ACK loops by rate-limiting out-of-window dupacks") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-08-03tcp: Fix a data-race around sysctl_tcp_autocorking.Kuniyuki Iwashima1-1/+1
[ Upstream commit 85225e6f0a76e6745bc841c9f25169c509b573d8 ] While reading sysctl_tcp_autocorking, it can be changed concurrently. Thus, we need to add READ_ONCE() to its reader. Fixes: f54b311142a9 ("tcp: auto corking") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-08-03tcp: Fix a data-race around sysctl_tcp_min_rtt_wlen.Kuniyuki Iwashima1-1/+1
[ Upstream commit 1330ffacd05fc9ac4159d19286ce119e22450ed2 ] While reading sysctl_tcp_min_rtt_wlen, it can be changed concurrently. Thus, we need to add READ_ONCE() to its reader. Fixes: f672258391b4 ("tcp: track min RTT using windowed min-filter") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>