Age | Commit message (Collapse) | Author | Files | Lines |
|
commit 92f1655aa2b2294d0b49925f3b875a634bd3b59e upstream.
__dst_negative_advice() does not enforce proper RCU rules when
sk->dst_cache must be cleared, leading to possible UAF.
RCU rules are that we must first clear sk->sk_dst_cache,
then call dst_release(old_dst).
Note that sk_dst_reset(sk) is implementing this protocol correctly,
while __dst_negative_advice() uses the wrong order.
Given that ip6_negative_advice() has special logic
against RTF_CACHE, this means each of the three ->negative_advice()
existing methods must perform the sk_dst_reset() themselves.
Note the check against NULL dst is centralized in
__dst_negative_advice(), there is no need to duplicate
it in various callbacks.
Many thanks to Clement Lecigne for tracking this issue.
This old bug became visible after the blamed commit, using UDP sockets.
Fixes: a87cb3e48ee8 ("net: Facility to report route quality of connected sockets")
Reported-by: Clement Lecigne <clecigne@google.com>
Diagnosed-by: Clement Lecigne <clecigne@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Tom Herbert <tom@herbertland.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20240528114353.1794151-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
[Lee: Stable backport]
Signed-off-by: Lee Jones <lee@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit bb487272380d120295e955ad8acfcbb281b57642 upstream.
Problem
=========
After commit 67f695134703 ("ipv6: Move setting default metric for routes"),
we noticed that the logic of assigning the default value of fc_metirc
changed in the ioctl process. That is, when users use ioctl(fd, SIOCADDRT,
rt) with a non-zero metric to add a route, then they may fail to delete a
route with passing in a metric value of 0 to the kernel by ioctl(fd,
SIOCDELRT, rt). But iproute can succeed in deleting it.
As a reference, when using iproute tools by netlink to delete routes with
a metric parameter equals 0, like the command as follows:
ip -6 route del fe80::/64 via fe81::5054:ff:fe11:3451 dev eth0 metric 0
the user can still succeed in deleting the route entry with the smallest
metric.
Root Reason
===========
After commit 67f695134703 ("ipv6: Move setting default metric for routes"),
When ioctl() pass in SIOCDELRT with a zero metric, rtmsg_to_fib6_config()
will set a defalut value (1024) to cfg->fc_metric in kernel, and in
ip6_route_del() and the line 4074 at net/ipv3/route.c, it will check by
if (cfg->fc_metric && cfg->fc_metric != rt->fib6_metric)
continue;
and the condition is true and skip the later procedure (deleting route)
because cfg->fc_metric != rt->fib6_metric. But before that commit,
cfg->fc_metric is still zero there, so the condition is false and it
will do the following procedure (deleting).
Solution
========
In order to keep a consistent behaviour across netlink() and ioctl(), we
should allow to delete a route with a metric value of 0. So we only do
the default setting of fc_metric in route adding.
CC: stable@vger.kernel.org # 5.4+
Fixes: 67f695134703 ("ipv6: Move setting default metric for routes")
Co-developed-by: Fan Yu <fan.yu9@zte.com.cn>
Signed-off-by: Fan Yu <fan.yu9@zte.com.cn>
Signed-off-by: xu xin <xu.xin16@zte.com.cn>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20240514201102055dD2Ba45qKbLlUMxu_DTHP@zte.com.cn
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 685f7d531264599b3f167f1e94bbd22f120e5fab ]
syzbot found another use-after-free in ip6_route_mpath_notify() [1]
Commit f7225172f25a ("net/ipv6: prevent use after free in
ip6_route_mpath_notify") was not able to fix the root cause.
We need to defer the fib6_info_release() calls after
ip6_route_mpath_notify(), in the cleanup phase.
[1]
BUG: KASAN: slab-use-after-free in rt6_fill_node+0x1460/0x1ac0
Read of size 4 at addr ffff88809a07fc64 by task syz-executor.2/23037
CPU: 0 PID: 23037 Comm: syz-executor.2 Not tainted 6.8.0-rc4-syzkaller-01035-gea7f3cfaa588 #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/25/2024
Call Trace:
<TASK>
__dump_stack lib/dump_stack.c:88 [inline]
dump_stack_lvl+0x1e7/0x2e0 lib/dump_stack.c:106
print_address_description mm/kasan/report.c:377 [inline]
print_report+0x167/0x540 mm/kasan/report.c:488
kasan_report+0x142/0x180 mm/kasan/report.c:601
rt6_fill_node+0x1460/0x1ac0
inet6_rt_notify+0x13b/0x290 net/ipv6/route.c:6184
ip6_route_mpath_notify net/ipv6/route.c:5198 [inline]
ip6_route_multipath_add net/ipv6/route.c:5404 [inline]
inet6_rtm_newroute+0x1d0f/0x2300 net/ipv6/route.c:5517
rtnetlink_rcv_msg+0x885/0x1040 net/core/rtnetlink.c:6597
netlink_rcv_skb+0x1e3/0x430 net/netlink/af_netlink.c:2543
netlink_unicast_kernel net/netlink/af_netlink.c:1341 [inline]
netlink_unicast+0x7ea/0x980 net/netlink/af_netlink.c:1367
netlink_sendmsg+0xa3b/0xd70 net/netlink/af_netlink.c:1908
sock_sendmsg_nosec net/socket.c:730 [inline]
__sock_sendmsg+0x221/0x270 net/socket.c:745
____sys_sendmsg+0x525/0x7d0 net/socket.c:2584
___sys_sendmsg net/socket.c:2638 [inline]
__sys_sendmsg+0x2b0/0x3a0 net/socket.c:2667
do_syscall_64+0xf9/0x240
entry_SYSCALL_64_after_hwframe+0x6f/0x77
RIP: 0033:0x7f73dd87dda9
Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 e1 20 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b0 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007f73de6550c8 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
RAX: ffffffffffffffda RBX: 00007f73dd9ac050 RCX: 00007f73dd87dda9
RDX: 0000000000000000 RSI: 0000000020000140 RDI: 0000000000000005
RBP: 00007f73dd8ca47a R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 000000000000006e R14: 00007f73dd9ac050 R15: 00007ffdbdeb7858
</TASK>
Allocated by task 23037:
kasan_save_stack mm/kasan/common.c:47 [inline]
kasan_save_track+0x3f/0x80 mm/kasan/common.c:68
poison_kmalloc_redzone mm/kasan/common.c:372 [inline]
__kasan_kmalloc+0x98/0xb0 mm/kasan/common.c:389
kasan_kmalloc include/linux/kasan.h:211 [inline]
__do_kmalloc_node mm/slub.c:3981 [inline]
__kmalloc+0x22e/0x490 mm/slub.c:3994
kmalloc include/linux/slab.h:594 [inline]
kzalloc include/linux/slab.h:711 [inline]
fib6_info_alloc+0x2e/0xf0 net/ipv6/ip6_fib.c:155
ip6_route_info_create+0x445/0x12b0 net/ipv6/route.c:3758
ip6_route_multipath_add net/ipv6/route.c:5298 [inline]
inet6_rtm_newroute+0x744/0x2300 net/ipv6/route.c:5517
rtnetlink_rcv_msg+0x885/0x1040 net/core/rtnetlink.c:6597
netlink_rcv_skb+0x1e3/0x430 net/netlink/af_netlink.c:2543
netlink_unicast_kernel net/netlink/af_netlink.c:1341 [inline]
netlink_unicast+0x7ea/0x980 net/netlink/af_netlink.c:1367
netlink_sendmsg+0xa3b/0xd70 net/netlink/af_netlink.c:1908
sock_sendmsg_nosec net/socket.c:730 [inline]
__sock_sendmsg+0x221/0x270 net/socket.c:745
____sys_sendmsg+0x525/0x7d0 net/socket.c:2584
___sys_sendmsg net/socket.c:2638 [inline]
__sys_sendmsg+0x2b0/0x3a0 net/socket.c:2667
do_syscall_64+0xf9/0x240
entry_SYSCALL_64_after_hwframe+0x6f/0x77
Freed by task 16:
kasan_save_stack mm/kasan/common.c:47 [inline]
kasan_save_track+0x3f/0x80 mm/kasan/common.c:68
kasan_save_free_info+0x4e/0x60 mm/kasan/generic.c:640
poison_slab_object+0xa6/0xe0 mm/kasan/common.c:241
__kasan_slab_free+0x34/0x70 mm/kasan/common.c:257
kasan_slab_free include/linux/kasan.h:184 [inline]
slab_free_hook mm/slub.c:2121 [inline]
slab_free mm/slub.c:4299 [inline]
kfree+0x14a/0x380 mm/slub.c:4409
rcu_do_batch kernel/rcu/tree.c:2190 [inline]
rcu_core+0xd76/0x1810 kernel/rcu/tree.c:2465
__do_softirq+0x2bb/0x942 kernel/softirq.c:553
Last potentially related work creation:
kasan_save_stack+0x3f/0x60 mm/kasan/common.c:47
__kasan_record_aux_stack+0xae/0x100 mm/kasan/generic.c:586
__call_rcu_common kernel/rcu/tree.c:2715 [inline]
call_rcu+0x167/0xa80 kernel/rcu/tree.c:2829
fib6_info_release include/net/ip6_fib.h:341 [inline]
ip6_route_multipath_add net/ipv6/route.c:5344 [inline]
inet6_rtm_newroute+0x114d/0x2300 net/ipv6/route.c:5517
rtnetlink_rcv_msg+0x885/0x1040 net/core/rtnetlink.c:6597
netlink_rcv_skb+0x1e3/0x430 net/netlink/af_netlink.c:2543
netlink_unicast_kernel net/netlink/af_netlink.c:1341 [inline]
netlink_unicast+0x7ea/0x980 net/netlink/af_netlink.c:1367
netlink_sendmsg+0xa3b/0xd70 net/netlink/af_netlink.c:1908
sock_sendmsg_nosec net/socket.c:730 [inline]
__sock_sendmsg+0x221/0x270 net/socket.c:745
____sys_sendmsg+0x525/0x7d0 net/socket.c:2584
___sys_sendmsg net/socket.c:2638 [inline]
__sys_sendmsg+0x2b0/0x3a0 net/socket.c:2667
do_syscall_64+0xf9/0x240
entry_SYSCALL_64_after_hwframe+0x6f/0x77
The buggy address belongs to the object at ffff88809a07fc00
which belongs to the cache kmalloc-512 of size 512
The buggy address is located 100 bytes inside of
freed 512-byte region [ffff88809a07fc00, ffff88809a07fe00)
The buggy address belongs to the physical page:
page:ffffea0002681f00 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x9a07c
head:ffffea0002681f00 order:2 entire_mapcount:0 nr_pages_mapped:0 pincount:0
flags: 0xfff00000000840(slab|head|node=0|zone=1|lastcpupid=0x7ff)
page_type: 0xffffffff()
raw: 00fff00000000840 ffff888014c41c80 dead000000000122 0000000000000000
raw: 0000000000000000 0000000080100010 00000001ffffffff 0000000000000000
page dumped because: kasan: bad access detected
page_owner tracks the page as allocated
page last allocated via order 2, migratetype Unmovable, gfp_mask 0x1d20c0(__GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC|__GFP_HARDWALL), pid 23028, tgid 23027 (syz-executor.4), ts 2340253595219, free_ts 2339107097036
set_page_owner include/linux/page_owner.h:31 [inline]
post_alloc_hook+0x1ea/0x210 mm/page_alloc.c:1533
prep_new_page mm/page_alloc.c:1540 [inline]
get_page_from_freelist+0x33ea/0x3580 mm/page_alloc.c:3311
__alloc_pages+0x255/0x680 mm/page_alloc.c:4567
__alloc_pages_node include/linux/gfp.h:238 [inline]
alloc_pages_node include/linux/gfp.h:261 [inline]
alloc_slab_page+0x5f/0x160 mm/slub.c:2190
allocate_slab mm/slub.c:2354 [inline]
new_slab+0x84/0x2f0 mm/slub.c:2407
___slab_alloc+0xd17/0x13e0 mm/slub.c:3540
__slab_alloc mm/slub.c:3625 [inline]
__slab_alloc_node mm/slub.c:3678 [inline]
slab_alloc_node mm/slub.c:3850 [inline]
__do_kmalloc_node mm/slub.c:3980 [inline]
__kmalloc+0x2e0/0x490 mm/slub.c:3994
kmalloc include/linux/slab.h:594 [inline]
kzalloc include/linux/slab.h:711 [inline]
new_dir fs/proc/proc_sysctl.c:956 [inline]
get_subdir fs/proc/proc_sysctl.c:1000 [inline]
sysctl_mkdir_p fs/proc/proc_sysctl.c:1295 [inline]
__register_sysctl_table+0xb30/0x1440 fs/proc/proc_sysctl.c:1376
neigh_sysctl_register+0x416/0x500 net/core/neighbour.c:3859
devinet_sysctl_register+0xaf/0x1f0 net/ipv4/devinet.c:2644
inetdev_init+0x296/0x4d0 net/ipv4/devinet.c:286
inetdev_event+0x338/0x15c0 net/ipv4/devinet.c:1555
notifier_call_chain+0x18f/0x3b0 kernel/notifier.c:93
call_netdevice_notifiers_extack net/core/dev.c:1987 [inline]
call_netdevice_notifiers net/core/dev.c:2001 [inline]
register_netdevice+0x15b2/0x1a20 net/core/dev.c:10340
br_dev_newlink+0x27/0x100 net/bridge/br_netlink.c:1563
rtnl_newlink_create net/core/rtnetlink.c:3497 [inline]
__rtnl_newlink net/core/rtnetlink.c:3717 [inline]
rtnl_newlink+0x158f/0x20a0 net/core/rtnetlink.c:3730
page last free pid 11583 tgid 11583 stack trace:
reset_page_owner include/linux/page_owner.h:24 [inline]
free_pages_prepare mm/page_alloc.c:1140 [inline]
free_unref_page_prepare+0x968/0xa90 mm/page_alloc.c:2346
free_unref_page+0x37/0x3f0 mm/page_alloc.c:2486
kasan_depopulate_vmalloc_pte+0x74/0x90 mm/kasan/shadow.c:415
apply_to_pte_range mm/memory.c:2619 [inline]
apply_to_pmd_range mm/memory.c:2663 [inline]
apply_to_pud_range mm/memory.c:2699 [inline]
apply_to_p4d_range mm/memory.c:2735 [inline]
__apply_to_page_range+0x8ec/0xe40 mm/memory.c:2769
kasan_release_vmalloc+0x9a/0xb0 mm/kasan/shadow.c:532
__purge_vmap_area_lazy+0x163f/0x1a10 mm/vmalloc.c:1770
drain_vmap_area_work+0x40/0xd0 mm/vmalloc.c:1804
process_one_work kernel/workqueue.c:2633 [inline]
process_scheduled_works+0x913/0x1420 kernel/workqueue.c:2706
worker_thread+0xa5f/0x1000 kernel/workqueue.c:2787
kthread+0x2ef/0x390 kernel/kthread.c:388
ret_from_fork+0x4b/0x80 arch/x86/kernel/process.c:147
ret_from_fork_asm+0x1b/0x30 arch/x86/entry/entry_64.S:242
Memory state around the buggy address:
ffff88809a07fb00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
ffff88809a07fb80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
>ffff88809a07fc00: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
^
ffff88809a07fc80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
ffff88809a07fd00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
Fixes: 3b1137fe7482 ("net: ipv6: Change notifications for multipath add to RTA_MULTIPATH")
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20240303144801.702646-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
commit af6d10345ca76670c1b7c37799f0d5576ccef277 upstream.
In ip6_dst_gc() replace:
if (entries > gc_thresh)
With:
if (entries > ops->gc_thresh)
Sending Ipv6 packets in a loop via a raw socket triggers an issue where a
route is cloned by ip6_rt_cache_alloc() for each packet sent. This quickly
consumes the Ipv6 max_size threshold which defaults to 4096 resulting in
these warnings:
[1] 99.187805] dst_alloc: 7728 callbacks suppressed
[2] Route cache is full: consider increasing sysctl net.ipv6.route.max_size.
.
.
[300] Route cache is full: consider increasing sysctl net.ipv6.route.max_size.
When this happens the packet is dropped and sendto() gets a network is
unreachable error:
remaining pkt 200557 errno 101
remaining pkt 196462 errno 101
.
.
remaining pkt 126821 errno 101
Implement David Aherns suggestion to remove max_size check seeing that Ipv6
has a GC to manage memory usage. Ipv4 already does not check max_size.
Here are some memory comparisons for Ipv4 vs Ipv6 with the patch:
Test by running 5 instances of a program that sends UDP packets to a raw
socket 5000000 times. Compare Ipv4 and Ipv6 performance with a similar
program.
Ipv4:
Before test:
MemFree: 29427108 kB
Slab: 237612 kB
ip6_dst_cache 1912 2528 256 32 2 : tunables 0 0 0
xfrm_dst_cache 0 0 320 25 2 : tunables 0 0 0
ip_dst_cache 2881 3990 192 42 2 : tunables 0 0 0
During test:
MemFree: 29417608 kB
Slab: 247712 kB
ip6_dst_cache 1912 2528 256 32 2 : tunables 0 0 0
xfrm_dst_cache 0 0 320 25 2 : tunables 0 0 0
ip_dst_cache 44394 44394 192 42 2 : tunables 0 0 0
After test:
MemFree: 29422308 kB
Slab: 238104 kB
ip6_dst_cache 1912 2528 256 32 2 : tunables 0 0 0
xfrm_dst_cache 0 0 320 25 2 : tunables 0 0 0
ip_dst_cache 3048 4116 192 42 2 : tunables 0 0 0
Ipv6 with patch:
Errno 101 errors are not observed anymore with the patch.
Before test:
MemFree: 29422308 kB
Slab: 238104 kB
ip6_dst_cache 1912 2528 256 32 2 : tunables 0 0 0
xfrm_dst_cache 0 0 320 25 2 : tunables 0 0 0
ip_dst_cache 3048 4116 192 42 2 : tunables 0 0 0
During Test:
MemFree: 29431516 kB
Slab: 240940 kB
ip6_dst_cache 11980 12064 256 32 2 : tunables 0 0 0
xfrm_dst_cache 0 0 320 25 2 : tunables 0 0 0
ip_dst_cache 3048 4116 192 42 2 : tunables 0 0 0
After Test:
MemFree: 29441816 kB
Slab: 238132 kB
ip6_dst_cache 1902 2432 256 32 2 : tunables 0 0 0
xfrm_dst_cache 0 0 320 25 2 : tunables 0 0 0
ip_dst_cache 3048 4116 192 42 2 : tunables 0 0 0
Tested-by: Andrea Mayer <andrea.mayer@uniroma2.it>
Signed-off-by: Jon Maxwell <jmaxwell37@gmail.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20230112012532.311021-1-jmaxwell37@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Cc: "Jitindar Singh, Suraj" <surajjs@amazon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 4cc59f386991ec9374cb4bc83dbe1c0b5a95033f ]
In function rt6_nlmsg_size(), the length of nexthop is calculated
by multipling the nexthop length of fib6_info and the number of
siblings. However if the fib6_info has no lwtunnel but the siblings
have lwtunnels, the nexthop length is less than it should be, and
it will trigger a warning in inet6_rt_notify() as follows:
WARNING: CPU: 0 PID: 6082 at net/ipv6/route.c:6180 inet6_rt_notify+0x120/0x130
......
Call Trace:
<TASK>
fib6_add_rt2node+0x685/0xa30
fib6_add+0x96/0x1b0
ip6_route_add+0x50/0xd0
inet6_rtm_newroute+0x97/0xa0
rtnetlink_rcv_msg+0x156/0x3d0
netlink_rcv_skb+0x5a/0x110
netlink_unicast+0x246/0x350
netlink_sendmsg+0x250/0x4c0
sock_sendmsg+0x66/0x70
___sys_sendmsg+0x7c/0xd0
__sys_sendmsg+0x5d/0xb0
do_syscall_64+0x3f/0x90
entry_SYSCALL_64_after_hwframe+0x72/0xdc
This bug can be reproduced by script:
ip -6 addr add 2002::2/64 dev ens2
ip -6 route add 100::/64 via 2002::1 dev ens2 metric 100
for i in 10 20 30 40 50 60 70;
do
ip link add link ens2 name ipv_$i type ipvlan
ip -6 addr add 2002::$i/64 dev ipv_$i
ifconfig ipv_$i up
done
for i in 10 20 30 40 50 60;
do
ip -6 route append 100::/64 encap ip6 dst 2002::$i via 2002::1
dev ipv_$i metric 100
done
ip -6 route append 100::/64 via 2002::1 dev ipv_70 metric 100
This patch fixes it by adding nexthop_len of every siblings using
rt6_nh_nlmsg_size().
Fixes: beb1afac518d ("net: ipv6: Add support to dump multipath routes via RTA_MULTIPATH attribute")
Signed-off-by: Lu Wei <luwei32@huawei.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20230222083629.335683-2-luwei32@huawei.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 768b3c745fe5789f2430bdab02f35a9ad1148d97 ]
During the initialization of ip6_route_net_init_late(), if file
ipv6_route or rt6_stats fails to be created, the initialization is
successful by default. Therefore, the ipv6_route or rt6_stats file
doesn't be found during the remove in ip6_route_net_exit_late(). It
will cause WRNING.
The following is the stack information:
name 'rt6_stats'
WARNING: CPU: 0 PID: 9 at fs/proc/generic.c:712 remove_proc_entry+0x389/0x460
Modules linked in:
Workqueue: netns cleanup_net
RIP: 0010:remove_proc_entry+0x389/0x460
PKRU: 55555554
Call Trace:
<TASK>
ops_exit_list+0xb0/0x170
cleanup_net+0x4ea/0xb00
process_one_work+0x9bf/0x1710
worker_thread+0x665/0x1080
kthread+0x2e4/0x3a0
ret_from_fork+0x1f/0x30
</TASK>
Fixes: cdb1876192db ("[NETNS][IPV6] route6 - create route6 proc files for the namespace")
Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20221102020610.351330-1-shaozhengchao@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit bdf00bf24bef9be1ca641a6390fd5487873e0d2e ]
While reading nexthop_compat_mode, it can be changed concurrently.
Thus, we need to add READ_ONCE() to its readers.
Fixes: 4f80116d3df3 ("net: ipv4: add sysctl for nexthop api compatibility mode")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
commit 3b0dc529f56b5f2328244130683210be98f16f7f upstream.
When routes corresponding to addresses are restored by
fixup_permanent_addr(), the dst_nopolicy parameter was not set.
The typical use case is a user that configures an address on a down
interface and then put this interface up.
Let's take care of this flag in addrconf_f6i_alloc(), so that every callers
benefit ont it.
CC: stable@kernel.org
CC: David Forster <dforster@brocade.com>
Fixes: df789fe75206 ("ipv6: Provide ipv6 version of "disable_policy" sysctl")
Reported-by: Siwar Zitouni <siwar.zitouni@6wind.com>
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20220623120015.32640-1-nicolas.dichtel@6wind.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 9cb7c013420f98fa6fd12fc6a5dc055170c108db ]
Reads and Writes to ip6_rt_gc_expire always have been racy,
as syzbot reported lately [1]
There is a possible risk of under-flow, leading
to unexpected high value passed to fib6_run_gc(),
although I have not observed this in the field.
Hosts hitting ip6_dst_gc() very hard are under pretty bad
state anyway.
[1]
BUG: KCSAN: data-race in ip6_dst_gc / ip6_dst_gc
read-write to 0xffff888102110744 of 4 bytes by task 13165 on cpu 1:
ip6_dst_gc+0x1f3/0x220 net/ipv6/route.c:3311
dst_alloc+0x9b/0x160 net/core/dst.c:86
ip6_dst_alloc net/ipv6/route.c:344 [inline]
icmp6_dst_alloc+0xb2/0x360 net/ipv6/route.c:3261
mld_sendpack+0x2b9/0x580 net/ipv6/mcast.c:1807
mld_send_cr net/ipv6/mcast.c:2119 [inline]
mld_ifc_work+0x576/0x800 net/ipv6/mcast.c:2651
process_one_work+0x3d3/0x720 kernel/workqueue.c:2289
worker_thread+0x618/0xa70 kernel/workqueue.c:2436
kthread+0x1a9/0x1e0 kernel/kthread.c:376
ret_from_fork+0x1f/0x30
read-write to 0xffff888102110744 of 4 bytes by task 11607 on cpu 0:
ip6_dst_gc+0x1f3/0x220 net/ipv6/route.c:3311
dst_alloc+0x9b/0x160 net/core/dst.c:86
ip6_dst_alloc net/ipv6/route.c:344 [inline]
icmp6_dst_alloc+0xb2/0x360 net/ipv6/route.c:3261
mld_sendpack+0x2b9/0x580 net/ipv6/mcast.c:1807
mld_send_cr net/ipv6/mcast.c:2119 [inline]
mld_ifc_work+0x576/0x800 net/ipv6/mcast.c:2651
process_one_work+0x3d3/0x720 kernel/workqueue.c:2289
worker_thread+0x618/0xa70 kernel/workqueue.c:2436
kthread+0x1a9/0x1e0 kernel/kthread.c:376
ret_from_fork+0x1f/0x30
value changed: 0x00000bb3 -> 0x00000ba9
Reported by Kernel Concurrency Sanitizer on:
CPU: 0 PID: 11607 Comm: kworker/0:21 Not tainted 5.18.0-rc1-syzkaller-00037-g42e7a03d3bad-dirty #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Workqueue: mld mld_ifc_work
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20220413181333.649424-1-eric.dumazet@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 1158f79f82d437093aeed87d57df0548bdd68146 ]
VRF devices are the loopbacks for VRFs, and a loopback can not be
assigned to a VRF. Accordingly, the condition in ip6_pkt_drop should
be '||' not '&&'.
Fixes: 1d3fd8a10bed ("vrf: Use orig netdev to count Ip6InNoRoutes and a fresh route lookup when sending dest unreach")
Reported-by: Pudak, Filip <Filip.Pudak@windriver.com>
Reported-by: Xiao, Jiguang <Jiguang.Xiao@windriver.com>
Signed-off-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20220404150908.2937-1-dsahern@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
commit aafc2e3285c2d7a79b7ee15221c19fbeca7b1509 upstream.
struct fib6_node's fn_sernum field can be
read while other threads change it.
Add READ_ONCE()/WRITE_ONCE() annotations.
Do not change existing smp barriers in fib6_get_cookie_safe()
and __fib6_update_sernum_upto_root()
syzbot reported:
BUG: KCSAN: data-race in fib6_clean_node / inet6_csk_route_socket
write to 0xffff88813df62e2c of 4 bytes by task 1920 on cpu 1:
fib6_clean_node+0xc2/0x260 net/ipv6/ip6_fib.c:2178
fib6_walk_continue+0x38e/0x430 net/ipv6/ip6_fib.c:2112
fib6_walk net/ipv6/ip6_fib.c:2160 [inline]
fib6_clean_tree net/ipv6/ip6_fib.c:2240 [inline]
__fib6_clean_all+0x1a9/0x2e0 net/ipv6/ip6_fib.c:2256
fib6_flush_trees+0x6c/0x80 net/ipv6/ip6_fib.c:2281
rt_genid_bump_ipv6 include/net/net_namespace.h:488 [inline]
addrconf_dad_completed+0x57f/0x870 net/ipv6/addrconf.c:4230
addrconf_dad_work+0x908/0x1170
process_one_work+0x3f6/0x960 kernel/workqueue.c:2307
worker_thread+0x616/0xa70 kernel/workqueue.c:2454
kthread+0x1bf/0x1e0 kernel/kthread.c:359
ret_from_fork+0x1f/0x30
read to 0xffff88813df62e2c of 4 bytes by task 15701 on cpu 0:
fib6_get_cookie_safe include/net/ip6_fib.h:285 [inline]
rt6_get_cookie include/net/ip6_fib.h:306 [inline]
ip6_dst_store include/net/ip6_route.h:234 [inline]
inet6_csk_route_socket+0x352/0x3c0 net/ipv6/inet6_connection_sock.c:109
inet6_csk_xmit+0x91/0x1e0 net/ipv6/inet6_connection_sock.c:121
__tcp_transmit_skb+0x1323/0x1840 net/ipv4/tcp_output.c:1402
tcp_transmit_skb net/ipv4/tcp_output.c:1420 [inline]
tcp_write_xmit+0x1450/0x4460 net/ipv4/tcp_output.c:2680
__tcp_push_pending_frames+0x68/0x1c0 net/ipv4/tcp_output.c:2864
tcp_push+0x2d9/0x2f0 net/ipv4/tcp.c:725
mptcp_push_release net/mptcp/protocol.c:1491 [inline]
__mptcp_push_pending+0x46c/0x490 net/mptcp/protocol.c:1578
mptcp_sendmsg+0x9ec/0xa50 net/mptcp/protocol.c:1764
inet6_sendmsg+0x5f/0x80 net/ipv6/af_inet6.c:643
sock_sendmsg_nosec net/socket.c:705 [inline]
sock_sendmsg net/socket.c:725 [inline]
kernel_sendmsg+0x97/0xd0 net/socket.c:745
sock_no_sendpage+0x84/0xb0 net/core/sock.c:3086
inet_sendpage+0x9d/0xc0 net/ipv4/af_inet.c:834
kernel_sendpage+0x187/0x200 net/socket.c:3492
sock_sendpage+0x5a/0x70 net/socket.c:1007
pipe_to_sendpage+0x128/0x160 fs/splice.c:364
splice_from_pipe_feed fs/splice.c:418 [inline]
__splice_from_pipe+0x207/0x500 fs/splice.c:562
splice_from_pipe fs/splice.c:597 [inline]
generic_splice_sendpage+0x94/0xd0 fs/splice.c:746
do_splice_from fs/splice.c:767 [inline]
direct_splice_actor+0x80/0xa0 fs/splice.c:936
splice_direct_to_actor+0x345/0x650 fs/splice.c:891
do_splice_direct+0x106/0x190 fs/splice.c:979
do_sendfile+0x675/0xc40 fs/read_write.c:1245
__do_sys_sendfile64 fs/read_write.c:1310 [inline]
__se_sys_sendfile64 fs/read_write.c:1296 [inline]
__x64_sys_sendfile64+0x102/0x140 fs/read_write.c:1296
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x44/0xd0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x44/0xae
value changed: 0x0000026f -> 0x00000271
Reported by Kernel Concurrency Sanitizer on:
CPU: 0 PID: 15701 Comm: syz-executor.2 Not tainted 5.16.0-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
The Fixes tag I chose is probably arbitrary, I do not think
we need to backport this patch to older kernels.
Fixes: c5cff8561d2d ("ipv6: add rcu grace period before freeing fib6_node")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Link: https://lore.kernel.org/r/20220120174112.1126644-1-eric.dumazet@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 95bdba23b5b4aa75fe3e6c84335e638641c707bb ]
As Nicolas noted, if gateway validation fails walking the multipath
attribute the code should jump to the cleanup to free previously
allocated memory.
Fixes: 1ff15a710a86 ("ipv6: Check attribute length for RTA_GATEWAY when deleting multipath route")
Signed-off-by: David Ahern <dsahern@kernel.org>
Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Link: https://lore.kernel.org/r/20220103170555.94638-1-dsahern@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit e30a845b0376eb51c9c94f56bbd53b2e08ba822f ]
ip6_route_multipath_del loop continues processing the multipath
attribute even if delete of a nexthop path fails. For consistency,
do the same if the gateway attribute is invalid.
Fixes: 1ff15a710a86 ("ipv6: Check attribute length for RTA_GATEWAY when deleting multipath route")
Signed-off-by: David Ahern <dsahern@kernel.org>
Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Link: https://lore.kernel.org/r/20220103171911.94739-1-dsahern@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
commit 8bda81a4d400cf8a72e554012f0d8c45e07a3904 upstream.
lwtunnel_valid_encap_type_attr is used to validate encap attributes
within a multipath route. Add length validation checking to the type.
lwtunnel_valid_encap_type_attr is called converting attributes to
fib{6,}_config struct which means it is used before fib_get_nhs,
ip6_route_multipath_add, and ip6_route_multipath_del - other
locations that use rtnh_ok and then nla_get_u16 on RTA_ENCAP_TYPE
attribute.
Fixes: 9ed59592e3e3 ("lwtunnel: fix autoload of lwt modules")
Signed-off-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 1ff15a710a862db1101b97810af14aedc835a86a upstream.
Make sure RTA_GATEWAY for IPv6 multipath route has enough bytes to hold
an IPv6 address.
Fixes: 6b9ea5a64ed5 ("ipv6: fix multipath route replace error recovery")
Signed-off-by: David Ahern <dsahern@kernel.org>
Cc: Roopa Prabhu <roopa@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 4619bcf91399f00a40885100fb61d594d8454033 upstream.
Commit referenced in the Fixes tag used nla_memcpy for RTA_GATEWAY as
does the current nla_get_in6_addr. nla_memcpy protects against accessing
memory greater than what is in the attribute, but there is no check
requiring the attribute to have an IPv6 address. Add it.
Fixes: 51ebd3181572 ("ipv6: add support of equal cost multipath (ECMP)")
Signed-off-by: David Ahern <dsahern@kernel.org>
Cc: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 8837cbbf854246f5f4d565f21e6baa945d37aded ]
We need a way to release a fib6_nh's per-cpu dsts when replacing
nexthops otherwise we can end up with stale per-cpu dsts which hold net
device references, so add a new IPv6 stub called fib6_nh_release_dsts.
It must be used after an RCU grace period, so no new dsts can be created
through a group's nexthop entry.
Similar to fib6_nh_release it shouldn't be used if fib6_nh_init has failed
so it doesn't need a dummy stub when IPv6 is not enabled.
Fixes: 7bf4796dd099 ("nexthops: add support for replace")
Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 597aa16c782496bf74c5dc3b45ff472ade6cee64 ]
Multipath RTA_FLOW is embedded in nexthop. Dump it in fib_add_nexthop()
to get the length of rtnexthop correct.
Fixes: b0f60193632e ("ipv4: Refactor nexthop attributes in fib_dump_info")
Signed-off-by: Xiao Liang <shaw.leon@gmail.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit a00df2caffed3883c341d5685f830434312e4a43 ]
Even after commit 4785305c05b2 ("ipv6: use siphash in rt6_exception_hash()"),
an attacker can still use brute force to learn some secrets from a victim
linux host.
One way to defeat these attacks is to make the max depth of the hash
table bucket a random value.
Before this patch, each bucket of the hash table used to store exceptions
could contain 6 items under attack.
After the patch, each bucket would contains a random number of items,
between 6 and 10. The attacker can no longer infer secrets.
This is slightly increasing memory size used by the hash table,
we do not expect this to be a problem.
Following patch is dealing with the same issue in IPv4.
Fixes: 35732d01fe31 ("ipv6: introduce a hash table to store dst cache")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Keyu Man <kman001@ucr.edu>
Cc: Wei Wang <weiwan@google.com>
Cc: Martin KaFai Lau <kafai@fb.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 4785305c05b25a242e5314cc821f54ade4c18810 ]
A group of security researchers brought to our attention
the weakness of hash function used in rt6_exception_hash()
Lets use siphash instead of Jenkins Hash, to considerably
reduce security risks.
Following patch deals with IPv4.
Fixes: 35732d01fe31 ("ipv6: introduce a hash table to store dst cache")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Keyu Man <kman001@ucr.edu>
Cc: Wei Wang <weiwan@google.com>
Cc: Martin KaFai Lau <kafai@fb.com>
Acked-by: Wei Wang <weiwan@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 8fb4792f091e608a0a1d353dfdf07ef55a719db5 ]
While running the self-tests on a KASAN enabled kernel, I observed a
slab-out-of-bounds splat very similar to the one reported in
commit 821bbf79fe46 ("ipv6: Fix KASAN: slab-out-of-bounds Read in
fib6_nh_flush_exceptions").
We additionally need to take care of fib6_metrics initialization
failure when the caller provides an nh.
The fix is similar, explicitly free the route instead of calling
fib6_info_release on a half-initialized object.
Fixes: f88d8ea67fbdb ("ipv6: Plumb support for nexthop object in a fib6_info")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 821bbf79fe46a8b1d18aa456e8ed0a3c208c3754 ]
Reported by syzbot:
HEAD commit: 90c911ad Merge tag 'fixes' of git://git.kernel.org/pub/scm..
git tree: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git master
dashboard link: https://syzkaller.appspot.com/bug?extid=123aa35098fd3c000eb7
compiler: Debian clang version 11.0.1-2
==================================================================
BUG: KASAN: slab-out-of-bounds in fib6_nh_get_excptn_bucket net/ipv6/route.c:1604 [inline]
BUG: KASAN: slab-out-of-bounds in fib6_nh_flush_exceptions+0xbd/0x360 net/ipv6/route.c:1732
Read of size 8 at addr ffff8880145c78f8 by task syz-executor.4/17760
CPU: 0 PID: 17760 Comm: syz-executor.4 Not tainted 5.12.0-rc8-syzkaller #0
Call Trace:
<IRQ>
__dump_stack lib/dump_stack.c:79 [inline]
dump_stack+0x202/0x31e lib/dump_stack.c:120
print_address_description+0x5f/0x3b0 mm/kasan/report.c:232
__kasan_report mm/kasan/report.c:399 [inline]
kasan_report+0x15c/0x200 mm/kasan/report.c:416
fib6_nh_get_excptn_bucket net/ipv6/route.c:1604 [inline]
fib6_nh_flush_exceptions+0xbd/0x360 net/ipv6/route.c:1732
fib6_nh_release+0x9a/0x430 net/ipv6/route.c:3536
fib6_info_destroy_rcu+0xcb/0x1c0 net/ipv6/ip6_fib.c:174
rcu_do_batch kernel/rcu/tree.c:2559 [inline]
rcu_core+0x8f6/0x1450 kernel/rcu/tree.c:2794
__do_softirq+0x372/0x7a6 kernel/softirq.c:345
invoke_softirq kernel/softirq.c:221 [inline]
__irq_exit_rcu+0x22c/0x260 kernel/softirq.c:422
irq_exit_rcu+0x5/0x20 kernel/softirq.c:434
sysvec_apic_timer_interrupt+0x91/0xb0 arch/x86/kernel/apic/apic.c:1100
</IRQ>
asm_sysvec_apic_timer_interrupt+0x12/0x20 arch/x86/include/asm/idtentry.h:632
RIP: 0010:lock_acquire+0x1f6/0x720 kernel/locking/lockdep.c:5515
Code: f6 84 24 a1 00 00 00 02 0f 85 8d 02 00 00 f7 c3 00 02 00 00 49 bd 00 00 00 00 00 fc ff df 74 01 fb 48 c7 44 24 40 0e 36 e0 45 <4b> c7 44 3d 00 00 00 00 00 4b c7 44 3d 09 00 00 00 00 43 c7 44 3d
RSP: 0018:ffffc90009e06560 EFLAGS: 00000206
RAX: 1ffff920013c0cc0 RBX: 0000000000000246 RCX: dffffc0000000000
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
RBP: ffffc90009e066e0 R08: dffffc0000000000 R09: fffffbfff1f992b1
R10: fffffbfff1f992b1 R11: 0000000000000000 R12: 0000000000000000
R13: dffffc0000000000 R14: 0000000000000000 R15: 1ffff920013c0cb4
rcu_lock_acquire+0x2a/0x30 include/linux/rcupdate.h:267
rcu_read_lock include/linux/rcupdate.h:656 [inline]
ext4_get_group_info+0xea/0x340 fs/ext4/ext4.h:3231
ext4_mb_prefetch+0x123/0x5d0 fs/ext4/mballoc.c:2212
ext4_mb_regular_allocator+0x8a5/0x28f0 fs/ext4/mballoc.c:2379
ext4_mb_new_blocks+0xc6e/0x24f0 fs/ext4/mballoc.c:4982
ext4_ext_map_blocks+0x2be3/0x7210 fs/ext4/extents.c:4238
ext4_map_blocks+0xab3/0x1cb0 fs/ext4/inode.c:638
ext4_getblk+0x187/0x6c0 fs/ext4/inode.c:848
ext4_bread+0x2a/0x1c0 fs/ext4/inode.c:900
ext4_append+0x1a4/0x360 fs/ext4/namei.c:67
ext4_init_new_dir+0x337/0xa10 fs/ext4/namei.c:2768
ext4_mkdir+0x4b8/0xc00 fs/ext4/namei.c:2814
vfs_mkdir+0x45b/0x640 fs/namei.c:3819
ovl_do_mkdir fs/overlayfs/overlayfs.h:161 [inline]
ovl_mkdir_real+0x53/0x1a0 fs/overlayfs/dir.c:146
ovl_create_real+0x280/0x490 fs/overlayfs/dir.c:193
ovl_workdir_create+0x425/0x600 fs/overlayfs/super.c:788
ovl_make_workdir+0xed/0x1140 fs/overlayfs/super.c:1355
ovl_get_workdir fs/overlayfs/super.c:1492 [inline]
ovl_fill_super+0x39ee/0x5370 fs/overlayfs/super.c:2035
mount_nodev+0x52/0xe0 fs/super.c:1413
legacy_get_tree+0xea/0x180 fs/fs_context.c:592
vfs_get_tree+0x86/0x270 fs/super.c:1497
do_new_mount fs/namespace.c:2903 [inline]
path_mount+0x196f/0x2be0 fs/namespace.c:3233
do_mount fs/namespace.c:3246 [inline]
__do_sys_mount fs/namespace.c:3454 [inline]
__se_sys_mount+0x2f9/0x3b0 fs/namespace.c:3431
do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x4665f9
Code: ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 bc ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007f68f2b87188 EFLAGS: 00000246 ORIG_RAX: 00000000000000a5
RAX: ffffffffffffffda RBX: 000000000056bf60 RCX: 00000000004665f9
RDX: 00000000200000c0 RSI: 0000000020000000 RDI: 000000000040000a
RBP: 00000000004bfbb9 R08: 0000000020000100 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 000000000056bf60
R13: 00007ffe19002dff R14: 00007f68f2b87300 R15: 0000000000022000
Allocated by task 17768:
kasan_save_stack mm/kasan/common.c:38 [inline]
kasan_set_track mm/kasan/common.c:46 [inline]
set_alloc_info mm/kasan/common.c:427 [inline]
____kasan_kmalloc+0xc2/0xf0 mm/kasan/common.c:506
kasan_kmalloc include/linux/kasan.h:233 [inline]
__kmalloc+0xb4/0x380 mm/slub.c:4055
kmalloc include/linux/slab.h:559 [inline]
kzalloc include/linux/slab.h:684 [inline]
fib6_info_alloc+0x2c/0xd0 net/ipv6/ip6_fib.c:154
ip6_route_info_create+0x55d/0x1a10 net/ipv6/route.c:3638
ip6_route_add+0x22/0x120 net/ipv6/route.c:3728
inet6_rtm_newroute+0x2cd/0x2260 net/ipv6/route.c:5352
rtnetlink_rcv_msg+0xb34/0xe70 net/core/rtnetlink.c:5553
netlink_rcv_skb+0x1f0/0x460 net/netlink/af_netlink.c:2502
netlink_unicast_kernel net/netlink/af_netlink.c:1312 [inline]
netlink_unicast+0x7de/0x9b0 net/netlink/af_netlink.c:1338
netlink_sendmsg+0xaa6/0xe90 net/netlink/af_netlink.c:1927
sock_sendmsg_nosec net/socket.c:654 [inline]
sock_sendmsg net/socket.c:674 [inline]
____sys_sendmsg+0x5a2/0x900 net/socket.c:2350
___sys_sendmsg net/socket.c:2404 [inline]
__sys_sendmsg+0x319/0x400 net/socket.c:2433
do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
entry_SYSCALL_64_after_hwframe+0x44/0xae
Last potentially related work creation:
kasan_save_stack+0x27/0x50 mm/kasan/common.c:38
kasan_record_aux_stack+0xee/0x120 mm/kasan/generic.c:345
__call_rcu kernel/rcu/tree.c:3039 [inline]
call_rcu+0x1b1/0xa30 kernel/rcu/tree.c:3114
fib6_info_release include/net/ip6_fib.h:337 [inline]
ip6_route_info_create+0x10c4/0x1a10 net/ipv6/route.c:3718
ip6_route_add+0x22/0x120 net/ipv6/route.c:3728
inet6_rtm_newroute+0x2cd/0x2260 net/ipv6/route.c:5352
rtnetlink_rcv_msg+0xb34/0xe70 net/core/rtnetlink.c:5553
netlink_rcv_skb+0x1f0/0x460 net/netlink/af_netlink.c:2502
netlink_unicast_kernel net/netlink/af_netlink.c:1312 [inline]
netlink_unicast+0x7de/0x9b0 net/netlink/af_netlink.c:1338
netlink_sendmsg+0xaa6/0xe90 net/netlink/af_netlink.c:1927
sock_sendmsg_nosec net/socket.c:654 [inline]
sock_sendmsg net/socket.c:674 [inline]
____sys_sendmsg+0x5a2/0x900 net/socket.c:2350
___sys_sendmsg net/socket.c:2404 [inline]
__sys_sendmsg+0x319/0x400 net/socket.c:2433
do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
entry_SYSCALL_64_after_hwframe+0x44/0xae
Second to last potentially related work creation:
kasan_save_stack+0x27/0x50 mm/kasan/common.c:38
kasan_record_aux_stack+0xee/0x120 mm/kasan/generic.c:345
insert_work+0x54/0x400 kernel/workqueue.c:1331
__queue_work+0x981/0xcc0 kernel/workqueue.c:1497
queue_work_on+0x111/0x200 kernel/workqueue.c:1524
queue_work include/linux/workqueue.h:507 [inline]
call_usermodehelper_exec+0x283/0x470 kernel/umh.c:433
kobject_uevent_env+0x1349/0x1730 lib/kobject_uevent.c:617
kvm_uevent_notify_change+0x309/0x3b0 arch/x86/kvm/../../../virt/kvm/kvm_main.c:4809
kvm_destroy_vm arch/x86/kvm/../../../virt/kvm/kvm_main.c:877 [inline]
kvm_put_kvm+0x9c/0xd10 arch/x86/kvm/../../../virt/kvm/kvm_main.c:920
kvm_vcpu_release+0x53/0x60 arch/x86/kvm/../../../virt/kvm/kvm_main.c:3120
__fput+0x352/0x7b0 fs/file_table.c:280
task_work_run+0x146/0x1c0 kernel/task_work.c:140
tracehook_notify_resume include/linux/tracehook.h:189 [inline]
exit_to_user_mode_loop kernel/entry/common.c:174 [inline]
exit_to_user_mode_prepare+0x10b/0x1e0 kernel/entry/common.c:208
__syscall_exit_to_user_mode_work kernel/entry/common.c:290 [inline]
syscall_exit_to_user_mode+0x26/0x70 kernel/entry/common.c:301
entry_SYSCALL_64_after_hwframe+0x44/0xae
The buggy address belongs to the object at ffff8880145c7800
which belongs to the cache kmalloc-192 of size 192
The buggy address is located 56 bytes to the right of
192-byte region [ffff8880145c7800, ffff8880145c78c0)
The buggy address belongs to the page:
page:ffffea00005171c0 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x145c7
flags: 0xfff00000000200(slab)
raw: 00fff00000000200 ffffea00006474c0 0000000200000002 ffff888010c41a00
raw: 0000000000000000 0000000080100010 00000001ffffffff 0000000000000000
page dumped because: kasan: bad access detected
Memory state around the buggy address:
ffff8880145c7780: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
ffff8880145c7800: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>ffff8880145c7880: 00 00 00 00 fc fc fc fc fc fc fc fc fc fc fc fc
^
ffff8880145c7900: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
ffff8880145c7980: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
==================================================================
In the ip6_route_info_create function, in the case that the nh pointer
is not NULL, the fib6_nh in fib6_info has not been allocated.
Therefore, when trying to free fib6_info in this error case using
fib6_info_release, the function will call fib6_info_destroy_rcu,
which it will access fib6_nh_release(f6i->fib6_nh);
However, f6i->fib6_nh doesn't have any refcount yet given the lack of allocation
causing the reported memory issue above.
Therefore, releasing the empty pointer directly instead would be the solution.
Fixes: f88d8ea67fbdb ("ipv6: Plumb support for nexthop object in a fib6_info")
Fixes: 706ec91916462 ("ipv6: Fix nexthop refcnt leak when creating ipv6 route info")
Signed-off-by: Coco Li <lixiaoyan@google.com>
Cc: David Ahern <dsahern@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
commit 864db232dc7036aa2de19749c3d5be0143b24f8f upstream.
nlh is being checked for validtity two times when it is dereferenced in
this function. Check for validity again when updating the flags through
nlh pointer to make the dereferencing safe.
CC: <stable@vger.kernel.org>
Addresses-Coverity: ("NULL pointer dereference")
Signed-off-by: Muhammad Usama Anjum <musamaanjum@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit c4c877b2732466b4c63217baad05c96f775912c7 ]
Move generic blackhole dst ops to the core and use them from both
ipv4_dst_blackhole_ops and ip6_dst_blackhole_ops where possible. No
functional change otherwise. We need these also in other locations
and having to define them over and over again is not great.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
Minor conflicts in net/mptcp/protocol.h and
tools/testing/selftests/net/Makefile.
In both cases code was added on both sides in the same place
so just keep both.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
When a ICMPV6_PKT_TOOBIG report a next-hop MTU that is less than the IPv6
minimum link MTU, the estimated path MTU is reduced to the minimum link
MTU. This behaviour breaks TAHI IPv6 Core Conformance Test v6LC4.1.6:
Packet Too Big Less than IPv6 MTU.
Referring to RFC 8201 section 4: "If a node receives a Packet Too Big
message reporting a next-hop MTU that is less than the IPv6 minimum link
MTU, it must discard it. A node must not reduce its estimate of the Path
MTU below the IPv6 minimum link MTU on receipt of a Packet Too Big
message."
Drop the path MTU update if reported MTU is less than the minimum link MTU.
Signed-off-by: Georg Kohmann <geokohma@cisco.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Two minor conflicts:
1) net/ipv4/route.c, adding a new local variable while
moving another local variable and removing it's
initial assignment.
2) drivers/net/dsa/microchip/ksz9477.c, overlapping changes.
One pretty prints the port mode differently, whilst another
changes the driver to try and obtain the port mode from
the port node rather than the switch node.
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Replace a comma between expression statements by a semicolon.
Signed-off-by: Xu Wang <vulab@iscas.ac.cn>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The variable err is being initialized with a value that is never read and
it is being updated later with a new value. The initialization is redundant
and can be removed. Also re-order variable declarations in reverse
Christmas tree ordering.
Addresses-Coverity: ("Unused value")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Daniel Borkmann says:
====================
pull-request: bpf-next 2020-08-04
The following pull-request contains BPF updates for your *net-next* tree.
We've added 73 non-merge commits during the last 9 day(s) which contain
a total of 135 files changed, 4603 insertions(+), 1013 deletions(-).
The main changes are:
1) Implement bpf_link support for XDP. Also add LINK_DETACH operation for the BPF
syscall allowing processes with BPF link FD to force-detach, from Andrii Nakryiko.
2) Add BPF iterator for map elements and to iterate all BPF programs for efficient
in-kernel inspection, from Yonghong Song and Alexei Starovoitov.
3) Separate bpf_get_{stack,stackid}() helpers for perf events in BPF to avoid
unwinder errors, from Song Liu.
4) Allow cgroup local storage map to be shared between programs on the same
cgroup. Also extend BPF selftests with coverage, from YiFei Zhu.
5) Add BPF exception tables to ARM64 JIT in order to be able to JIT BPF_PROBE_MEM
load instructions, from Jean-Philippe Brucker.
6) Follow-up fixes on BPF socket lookup in combination with reuseport group
handling. Also add related BPF selftests, from Jakub Sitnicki.
7) Allow to use socket storage in BPF_PROG_TYPE_CGROUP_SOCK-typed programs for
socket create/release as well as bind functions, from Stanislav Fomichev.
8) Fix an info leak in xsk_getsockopt() when retrieving XDP stats via old struct
xdp_statistics, from Peilin Ye.
9) Fix PT_REGS_RC{,_CORE}() macros in libbpf for MIPS arch, from Jerry Crunchtime.
10) Extend BPF kernel test infra with skb->family and skb->{local,remote}_ip{4,6}
fields and allow user space to specify skb->dev via ifindex, from Dmitry Yakunin.
11) Fix a bpftool segfault due to missing program type name and make it more robust
to prevent them in future gaps, from Quentin Monnet.
12) Consolidate cgroup helper functions across selftests and fix a v6 localhost
resolver issue, from John Fastabend.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Resolved kernel/bpf/btf.c using instructions from merge commit
69138b34a7248d2396ab85c8652e20c0c39beaba
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
ip6_route_info_create() invokes nexthop_get(), which increases the
refcount of the "nh".
When ip6_route_info_create() returns, local variable "nh" becomes
invalid, so the refcount should be decreased to keep refcount balanced.
The reference counting issue happens in one exception handling path of
ip6_route_info_create(). When nexthops can not be used with source
routing, the function forgets to decrease the refcnt increased by
nexthop_get(), causing a refcnt leak.
Fix this issue by pulling up the error source routing handling when
nexthops can not be used with source routing.
Fixes: f88d8ea67fbd ("ipv6: Plumb support for nexthop object in a fib6_info")
Signed-off-by: Xiyu Yang <xiyuyang19@fudan.edu.cn>
Signed-off-by: Xin Tan <tanxin.ctf@gmail.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
There is no functionality change for this patch.
Struct bpf_iter_reg is used to register a bpf_iter target,
which includes information for both prog_load, link_create
and seq_file creation.
This patch puts fields related seq_file creation into
a different structure. This will be useful for map
elements iterator where one iterator covers different
map types and different map types may have different
seq_ops, init/fini private_data function and
private_data size.
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200723184109.590030-1-yhs@fb.com
|
|
One additional field btf_id is added to struct
bpf_ctx_arg_aux to store the precomputed btf_ids.
The btf_id is computed at build time with
BTF_ID_LIST or BTF_ID_LIST_GLOBAL macro definitions.
All existing bpf iterators are changed to used
pre-compute btf_ids.
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200720163403.1393551-1-yhs@fb.com
|
|
All conflicts seemed rather trivial, with some guidance from
Saeed Mameed on the tc_ct.c one.
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Thomas reported a regression with IPv6 and anycast using the following
reproducer:
echo 1 > /proc/sys/net/ipv6/conf/all/forwarding
ip -6 a add fc12::1/16 dev lo
sleep 2
echo "pinging lo"
ping6 -c 2 fc12::
The conversion of addrconf_f6i_alloc to use ip6_route_info_create missed
the use of fib6_is_reject which checks addresses added to the loopback
interface and sets the REJECT flag as needed. Update fib6_is_reject for
loopback checks to handle RTF_ANYCAST addresses.
Fixes: c7a1ce397ada ("ipv6: Change addrconf_f6i_alloc to use ip6_route_info_create")
Reported-by: thomas.gambier@nexedi.com
Signed-off-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Brian reported a crash in IPv6 code when using rpfilter with a setup
running FRR and external nexthop objects. The root cause of the crash
is fib6_select_path setting fib6_nh in the result to NULL because of
an improper check for nexthop objects.
More specifically, rpfilter invokes ip6_route_lookup with flowi6_oif
set causing fib6_select_path to be called with have_oif_match set.
fib6_select_path has early check on have_oif_match and jumps to the
out label which presumes a builtin fib6_nh. This path is invalid for
nexthop objects; for external nexthops fib6_select_path needs to just
return if the fib6_nh has already been set in the result otherwise it
returns after the call to nexthop_path_fib6_result. Update the check
on have_oif_match to not bail on external nexthops.
Update selftests for this problem.
Fixes: f88d8ea67fbd ("ipv6: Plumb support for nexthop object in a fib6_info")
Reported-by: Brian Rak <brak@choopa.com>
Signed-off-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
It was reported that a considerable amount of cycles were spent on the
expensive indirect calls on fib6_rule_lookup. This patch introduces an
inline helper called pol_route_func that uses the indirect_call_wrappers
to avoid the indirect calls.
This patch saves around 50ns per call.
Performance was measured on the receiver by checking the amount of
syncookies that server was able to generate under a synflood load.
Traffic was generated using trafgen[1] which was pushing around 1Mpps on
a single queue. Receiver was using only one rx queue which help to
create a bottle neck and make the experiment rx-bounded.
These are the syncookies generated over 10s from the different runs:
Whithout the patch:
TcpExtSyncookiesSent 3553749 0.0
TcpExtSyncookiesSent 3550895 0.0
TcpExtSyncookiesSent 3553845 0.0
TcpExtSyncookiesSent 3541050 0.0
TcpExtSyncookiesSent 3539921 0.0
TcpExtSyncookiesSent 3557659 0.0
TcpExtSyncookiesSent 3526812 0.0
TcpExtSyncookiesSent 3536121 0.0
TcpExtSyncookiesSent 3529963 0.0
TcpExtSyncookiesSent 3536319 0.0
With the patch:
TcpExtSyncookiesSent 3611786 0.0
TcpExtSyncookiesSent 3596682 0.0
TcpExtSyncookiesSent 3606878 0.0
TcpExtSyncookiesSent 3599564 0.0
TcpExtSyncookiesSent 3601304 0.0
TcpExtSyncookiesSent 3609249 0.0
TcpExtSyncookiesSent 3617437 0.0
TcpExtSyncookiesSent 3608765 0.0
TcpExtSyncookiesSent 3620205 0.0
TcpExtSyncookiesSent 3601895 0.0
Without the patch the average is 354263 pkt/s or 2822 ns/pkt and with
the patch the average is 360738 pkt/s or 2772 ns/pkt which gives an
estimate of 50 ns per packet.
[1] http://netsniff-ng.org/
Changelog since v1:
- Change ordering in the ICW (Paolo Abeni)
Cc: Luigi Rizzo <lrizzo@google.com>
Cc: Paolo Abeni <pabeni@redhat.com>
Reported-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Brian Vazquez <brianvv@google.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
This patch introduces ecmp nexthops and nexthop groups
for mac fdb entries. In subsequent patches this is used
by the vxlan driver fdb entries. The use case is
E-VPN multihoming [1,2,3] which requires bridged vxlan traffic
to be load balanced to remote switches (vteps) belonging to
the same multi-homed ethernet segment (This is analogous to
a multi-homed LAG but over vxlan).
Changes include new nexthop flag NHA_FDB for nexthops
referenced by fdb entries. These nexthops only have ip.
This patch includes appropriate checks to avoid routes
referencing such nexthops.
example:
$ip nexthop add id 12 via 172.16.1.2 fdb
$ip nexthop add id 13 via 172.16.1.3 fdb
$ip nexthop add id 102 group 12/13 fdb
$bridge fdb add 02:02:00:00:00:13 dev vxlan1000 nhid 101 self
[1] E-VPN https://tools.ietf.org/html/rfc7432
[2] E-VPN VxLAN: https://tools.ietf.org/html/rfc8365
[3] LPC talk with mention of nexthop groups for L2 ecmp
http://vger.kernel.org/lpc_net2018_talks/scaling_bridge_fdb_database_slidesV3.pdf
v4 - fixed uninitialized variable reported by kernel test robot
Reported-by: kernel test robot <rong.a.chen@intel.com>
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Prepare for better compat ioctl handling by moving the user copy out
of ipv6_route_ioctl.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Move the bpf verifier trace check into the new switch statement in
HEAD.
Resolve the overlapping changes in hinic, where bug fixes overlap
the addition of VF support.
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Alexei Starovoitov says:
====================
pull-request: bpf-next 2020-05-14
The following pull-request contains BPF updates for your *net-next* tree.
The main changes are:
1) Merged tag 'perf-for-bpf-2020-05-06' from tip tree that includes CAP_PERFMON.
2) support for narrow loads in bpf_sock_addr progs and additional
helpers in cg-skb progs, from Andrey.
3) bpf benchmark runner, from Andrii.
4) arm and riscv JIT optimizations, from Luke.
5) bpf iterator infrastructure, from Yonghong.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Commit b121b341e598 ("bpf: Add PTR_TO_BTF_ID_OR_NULL
support") adds a field btf_id_or_null_non0_off to
bpf_prog->aux structure to indicate that the
first ctx argument is PTR_TO_BTF_ID reg_type and
all others are PTR_TO_BTF_ID_OR_NULL.
This approach does not really scale if we have
other different reg types in the future, e.g.,
a pointer to a buffer.
This patch enables bpf_iter targets registering ctx argument
reg types which may be different from the default one.
For example, for pointers to structures, the default reg_type
is PTR_TO_BTF_ID for tracing program. The target can register
a particular pointer type as PTR_TO_BTF_ID_OR_NULL which can
be used by the verifier to enforce accesses.
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20200513180221.2949882-1-yhs@fb.com
|
|
Change func bpf_iter_unreg_target() parameter from target
name to target reg_info, similar to bpf_iter_reg_target().
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20200513180220.2949737-1-yhs@fb.com
|
|
Currently bpf_iter_reg_target takes parameters from target
and allocates memory to save them. This is really not
necessary, esp. in the future we may grow information
passed from targets to bpf_iter manager.
The patch refactors the code so target reg_info
becomes static and bpf_iter manager can just take
a reference to it.
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200513180219.2949605-1-yhs@fb.com
|
|
This patch added netlink and ipv6_route targets, using
the same seq_ops (except show() and minor changes for stop())
for /proc/net/{netlink,ipv6_route}.
The net namespace for these targets are the current net
namespace at file open stage, similar to
/proc/net/{netlink,ipv6_route} reference counting
the net namespace at seq_file open stage.
Since module is not supported for now, ipv6_route is
supported only if the IPV6 is built-in, i.e., not compiled
as a module. The restriction can be lifted once module
is properly supported for bpf_iter.
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20200509175910.2476329-1-yhs@fb.com
|
|
We currently have to adjust ipv6 route gc_thresh/max_size depending
on number of cpus on a server, this makes very little sense.
If the kernels sets /proc/sys/net/ipv6/route/gc_thresh to 1024
and /proc/sys/net/ipv6/route/max_size to 4096, then we better
not track the percpu dst that our implementation uses.
Only routes not added (directly or indirectly) by the admin
should be tracked and limited.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Martin KaFai Lau <kafai@fb.com>
Cc: David Ahern <dsahern@kernel.org>
Cc: Maciej Żenczykowski <maze@google.com>
Acked-by: Wei Wang <weiwan@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
percpu_counter_add() uses a default batch size which is quite big
on platforms with 256 cpus. (2*256 -> 512)
This means dst_entries_get_fast() can be off by +/- 2*(nr_cpus^2)
(131072 on servers with 256 cpus)
Reduce the batch size to something more reasonable, and
add logic to ip6_dst_gc() to call dst_entries_get_slow()
before calling the _very_ expensive fib6_run_gc() function.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
This reverts commit 19bda36c4299ce3d7e5bce10bebe01764a655a6d:
| ipv6: add mtu lock check in __ip6_rt_update_pmtu
|
| Prior to this patch, ipv6 didn't do mtu lock check in ip6_update_pmtu.
| It leaded to that mtu lock doesn't really work when receiving the pkt
| of ICMPV6_PKT_TOOBIG.
|
| This patch is to add mtu lock check in __ip6_rt_update_pmtu just as ipv4
| did in __ip_rt_update_pmtu.
The above reasoning is incorrect. IPv6 *requires* icmp based pmtu to work.
There's already a comment to this effect elsewhere in the kernel:
$ git grep -p -B1 -A3 'RTAX_MTU lock'
net/ipv6/route.c=4813=
static int rt6_mtu_change_route(struct fib6_info *f6i, void *p_arg)
...
/* In IPv6 pmtu discovery is not optional,
so that RTAX_MTU lock cannot disable it.
We still use this lock to block changes
caused by addrconf/ndisc.
*/
This reverts to the pre-4.9 behaviour.
Cc: Eric Dumazet <edumazet@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Xin Long <lucien.xin@gmail.com>
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Maciej Żenczykowski <maze@google.com>
Fixes: 19bda36c4299 ("ipv6: add mtu lock check in __ip6_rt_update_pmtu")
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Conflicts were all overlapping changes.
Signed-off-by: David S. Miller <davem@davemloft.net>
|