path: root/include/linux/mroute.h
Age | Commit message | Author | Files | Lines
2017-01-13 | ipmr: improve hash scalability | Nikolay Aleksandrov | 1 | -15/+42
Recently we started using ipmr with thousands of entries and easily hit soft lockups on smaller devices. The reason is that the hash function uses the high order bits from the src and dst, but those don't change in many common cases; also, the hash table is only 64 elements, so with thousands of entries it doesn't scale at all.

This patch migrates the hash table to rhashtable, and in particular the rhl interface, which allows duplicate elements to be chained. That is needed because the MFC_PROXY support (*,G; *,*,oif cases) allows multiple duplicate entries to be added with different interfaces (IMO wrong, but it's been in for a long time).

And here are some results from tests I've run in a VM:

mr_table size (default, allocated for all namespaces):
  Before         After
  49304 bytes    2400 bytes

Add 65000 routes (the diff is much larger on smaller devices):
  Before    After
  1m42s     58s

Forwarding 256 byte packets with 65000 routes (test done in a VM):
  Before                 After
  3 Mbps / ~1465 pps     122 Mbps / ~59000 pps

As a bonus we no longer see the soft lockups on smaller devices which showed up even with 2000 entries before.

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
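The scaling problem described above can be illustrated with a small userspace sketch. The hash below is a hypothetical stand-in, not the kernel's actual macro: it only assumes what the message states, namely that the bucket is derived from high-order address bits and that the table has 64 lines. All `toy_` names are invented for illustration.

```c
#include <stdint.h>

#define TOY_LINES 64u   /* table size named in the commit message */

/* Hypothetical hash that looks only at high-order bits of src and dst. */
static unsigned toy_high_bits_hash(uint32_t src, uint32_t dst)
{
    return ((src >> 24) ^ (dst >> 26)) & (TOY_LINES - 1);
}

/* Count how many distinct buckets n consecutive group addresses land in. */
static unsigned toy_buckets_used(uint32_t src, uint32_t first_group, unsigned n)
{
    unsigned char seen[TOY_LINES] = {0};
    unsigned used = 0;

    for (unsigned i = 0; i < n; i++) {
        unsigned b = toy_high_bits_hash(src, first_group + i);

        if (!seen[b]) {
            seen[b] = 1;
            used++;
        }
    }
    return used;
}
```

With one source and 65000 consecutive groups starting at 239.0.0.0, every entry in this sketch falls into a single bucket, so each lookup degenerates into a walk of one long chain.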
2017-01-09 | net: ipmr: Remove nowait arg to ipmr_get_route | David Ahern | 1 | -1/+1
ipmr_get_route has 1 caller and the nowait arg is 0. Remove the arg and simplify ipmr_get_route accordingly.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-26 | ipmr, ip6mr: fix scheduling while atomic and a deadlock with ipmr_get_route | Nikolay Aleksandrov | 1 | -1/+1
Since the commit below the ipmr/ip6mr rtnl_unicast() code uses the portid instead of the previous dst_pid which was copied from in_skb's portid. Since the skb is new the portid is 0 at that point so the packets are sent to the kernel and we get scheduling while atomic or a deadlock (depending on where it happens) by trying to acquire rtnl two times. Also since this is RTM_GETROUTE, it can be triggered by a normal user.

Here's the sleeping while atomic trace:

[ 7858.212557] BUG: sleeping function called from invalid context at kernel/locking/mutex.c:620
[ 7858.212748] in_atomic(): 1, irqs_disabled(): 0, pid: 0, name: swapper/0
[ 7858.212881] 2 locks held by swapper/0/0:
[ 7858.213013] #0: (((&mrt->ipmr_expire_timer))){+.-...}, at: [<ffffffff810fbbf5>] call_timer_fn+0x5/0x350
[ 7858.213422] #1: (mfc_unres_lock){+.....}, at: [<ffffffff8161e005>] ipmr_expire_process+0x25/0x130
[ 7858.213807] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.8.0-rc7+ #179
[ 7858.213934] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140531_083030-gandalf 04/01/2014
[ 7858.214108] 0000000000000000 ffff88005b403c50 ffffffff813a7804 0000000000000000
[ 7858.214412] ffffffff81a1338e ffff88005b403c78 ffffffff810a4a72 ffffffff81a1338e
[ 7858.214716] 000000000000026c 0000000000000000 ffff88005b403ca8 ffffffff810a4b9f
[ 7858.215251] Call Trace:
[ 7858.215412] <IRQ> [<ffffffff813a7804>] dump_stack+0x85/0xc1
[ 7858.215662] [<ffffffff810a4a72>] ___might_sleep+0x192/0x250
[ 7858.215868] [<ffffffff810a4b9f>] __might_sleep+0x6f/0x100
[ 7858.216072] [<ffffffff8165bea3>] mutex_lock_nested+0x33/0x4d0
[ 7858.216279] [<ffffffff815a7a5f>] ? netlink_lookup+0x25f/0x460
[ 7858.216487] [<ffffffff8157474b>] rtnetlink_rcv+0x1b/0x40
[ 7858.216687] [<ffffffff815a9a0c>] netlink_unicast+0x19c/0x260
[ 7858.216900] [<ffffffff81573c70>] rtnl_unicast+0x20/0x30
[ 7858.217128] [<ffffffff8161cd39>] ipmr_destroy_unres+0xa9/0xf0
[ 7858.217351] [<ffffffff8161e06f>] ipmr_expire_process+0x8f/0x130
[ 7858.217581] [<ffffffff8161dfe0>] ? ipmr_net_init+0x180/0x180
[ 7858.217785] [<ffffffff8161dfe0>] ? ipmr_net_init+0x180/0x180
[ 7858.217990] [<ffffffff810fbc95>] call_timer_fn+0xa5/0x350
[ 7858.218192] [<ffffffff810fbbf5>] ? call_timer_fn+0x5/0x350
[ 7858.218415] [<ffffffff8161dfe0>] ? ipmr_net_init+0x180/0x180
[ 7858.218656] [<ffffffff810fde10>] run_timer_softirq+0x260/0x640
[ 7858.218865] [<ffffffff8166379b>] ? __do_softirq+0xbb/0x54f
[ 7858.219068] [<ffffffff816637c8>] __do_softirq+0xe8/0x54f
[ 7858.219269] [<ffffffff8107a948>] irq_exit+0xb8/0xc0
[ 7858.219463] [<ffffffff81663452>] smp_apic_timer_interrupt+0x42/0x50
[ 7858.219678] [<ffffffff816625bc>] apic_timer_interrupt+0x8c/0xa0
[ 7858.219897] <EOI> [<ffffffff81055f16>] ? native_safe_halt+0x6/0x10
[ 7858.220165] [<ffffffff810d64dd>] ? trace_hardirqs_on+0xd/0x10
[ 7858.220373] [<ffffffff810298e3>] default_idle+0x23/0x190
[ 7858.220574] [<ffffffff8102a20f>] arch_cpu_idle+0xf/0x20
[ 7858.220790] [<ffffffff810c9f8c>] default_idle_call+0x4c/0x60
[ 7858.221016] [<ffffffff810ca33b>] cpu_startup_entry+0x39b/0x4d0
[ 7858.221257] [<ffffffff8164f995>] rest_init+0x135/0x140
[ 7858.221469] [<ffffffff81f83014>] start_kernel+0x50e/0x51b
[ 7858.221670] [<ffffffff81f82120>] ? early_idt_handler_array+0x120/0x120
[ 7858.221894] [<ffffffff81f8243f>] x86_64_start_reservations+0x2a/0x2c
[ 7858.222113] [<ffffffff81f8257c>] x86_64_start_kernel+0x13b/0x14a

Fixes: 2942e9005056 ("[RTNETLINK]: Use rtnl_unicast() for rtnetlink unicasts")
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-17 | net: ipmr/ip6mr: add support for keeping an entry age | Nikolay Aleksandrov | 1 | -0/+1
In preparation for hardware offloading of ipmr/ip6mr we need an interface that allows checking (and later updating) the age of entries. Relying on stats alone can show activity but not the actual age of the entry. Furthermore, when there are tens of thousands of entries, a lot of the hardware implementations only support "hit" bits which are cleared on read to denote that the entry was active and shouldn't be aged out; these can then be naturally translated into an age timestamp and will be compatible with the software forwarding age. Using a lastuse entry doesn't affect performance because the members in that cache line are written to along with the age.

Since all new users are encouraged to use ipmr via netlink, this is exported via the RTA_EXPIRES attribute. Also do a minor local variable declaration style adjustment - arrange them longest to shortest.

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
CC: Roopa Prabhu <roopa@cumulusnetworks.com>
CC: Shrijeet Mukherjee <shm@cumulusnetworks.com>
CC: Satish Ashok <sashok@cumulusnetworks.com>
CC: Donald Sharp <sharpd@cumulusnetworks.com>
CC: David S. Miller <davem@davemloft.net>
CC: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
CC: James Morris <jmorris@namei.org>
CC: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
CC: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
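How a clear-on-read "hit" bit might be folded into an age timestamp can be sketched as follows. This is only an illustration under assumptions: the struct, the function names and the tick-based clock are hypothetical, and only the `lastuse` name is taken from the message above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical cache entry; only the lastuse idea comes from the commit. */
struct toy_mfc_entry {
    uint64_t lastuse;   /* time of last observed activity, in ticks */
};

/* Fold a hardware hit bit (already read and cleared) into the timestamp. */
static void toy_update_age(struct toy_mfc_entry *e, bool hw_hit, uint64_t now)
{
    if (hw_hit)
        e->lastuse = now;
}

/* Age of the entry in ticks; 0 if the clock has not advanced past lastuse. */
static uint64_t toy_entry_age(const struct toy_mfc_entry *e, uint64_t now)
{
    return now > e->lastuse ? now - e->lastuse : 0;
}
```

A poller would call `toy_update_age()` once per entry per sweep, so software-forwarded and hardware-forwarded entries end up reporting age in the same unit.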
2015-11-30 | net: ipmr: move struct mr_table and VIF_EXISTS to mroute.h | Nikolay Aleksandrov | 1 | -2/+19
Move the definitions of VIF_EXISTS() and struct mr_table to mroute.h

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-30 | net: ipmr: adjust mroute.h style and drop extern | Nikolay Aleksandrov | 1 | -26/+21
Remove extra spaces and tabs, adjust function definitions, remove an unnecessary ifdef (already used below, just move code) and drop extern from the functions.

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-30 | net: ipmr: remove unused MFC_NOTIFY flag and make the flags enum | Nikolay Aleksandrov | 1 | -3/+7
MFC_NOTIFY was introduced in kernel 2.1.68 but afaik it hasn't been used and I couldn't find any users currently so just remove it. Only MFC_STATIC is left, so move it into an enum, add a description and use BIT().

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
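The resulting flag definition plausibly looks like the sketch below. It is an approximation, not the header's exact text: `BIT()` is redefined locally because the kernel macro is not available in userspace, and `toy_is_static()` is a hypothetical helper added only to show a flag test.

```c
/* Local stand-in for the kernel's BIT() macro so the sketch builds alone. */
#define BIT(nr) (1UL << (nr))

/* mfc_flags values; MFC_STATIC is the only flag left after this commit. */
enum {
    MFC_STATIC = BIT(0),   /* entry was added statically */
};

/* Hypothetical helper: test whether an entry's flags mark it as static. */
static int toy_is_static(unsigned long mfc_flags)
{
    return (mfc_flags & MFC_STATIC) != 0;
}
```

Using an enum with `BIT()` keeps each flag's bit position explicit and leaves room for later flags without renumbering.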
2013-01-21 | mcast: define and use MRT[6]_MAX in ip[6]_mroute_opt() | Nicolas Dichtel | 1 | -1/+1
This will ease further addition of new MRT[6]_* values and avoid updating in6.h each time. Note that we reduce the maximum value from 210 to 209, but 210 does not match any known value in ip[6]_mroute_setsockopt().

Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Acked-by: David L Stevens <dlstevens@us.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
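A minimal sketch of the range check this commit describes is shown below. The `TOY_` names are hypothetical; the upper bound of 209 comes from the message, while the base of 200 is an assumption based on the MRT_BASE value in the UAPI header.

```c
/* Assumed base of the MRT_* option range (MRT_BASE in the UAPI header). */
#define TOY_MRT_BASE 200
/* Upper bound after the commit: 209 rather than 210. */
#define TOY_MRT_MAX  (TOY_MRT_BASE + 9)

/* Return non-zero if opt falls inside the multicast routing option range. */
static int toy_mroute_opt(int opt)
{
    return opt >= TOY_MRT_BASE && opt <= TOY_MRT_MAX;
}
```

Deriving the bound from a single `*_MAX` definition means a new option only has to be added in one place.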
2012-10-13 | UAPI: (Scripted) Disintegrate include/linux | David Howells | 1 | -145/+1
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Michael Kerrisk <mtk.manpages@gmail.com>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Acked-by: Dave Jones <davej@redhat.com>
2011-05-04 | ipv4: Pass explicit saddr/daddr args to ipmr_get_route(). | David S. Miller | 1 | -0/+1
This eliminates the need to use rt->rt_{src,dst}.

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-01-30 | net: Add compat ioctl support for the ipv4 multicast ioctl SIOCGETSGCNT | Eric W. Biederman | 1 | -0/+1
SIOCGETSGCNT is not a unique ioctl value as it maps to SIOCPROTOPRIVATE + 1, which unfortunately means the existing infrastructure for compat networking ioctls is insufficient. A trivial compat ioctl implementation would conflict with:

SIOCAX25ADDUID
SIOCAIPXPRISLT
SIOCGETSGCNT_IN6
SIOCGETSGCNT
SIOCRSSCAUSE
SIOCX25SSUBSCRIP
SIOCX25SDTEFACILITIES

To make this work I have updated the compat_ioctl decode path to mirror the normal ioctl decode path. I have added an ipv4 inet_compat_ioctl function so that I can have ipv4 specific compat ioctls. I have added a compat_ioctl function into struct proto so I can break out ioctls by which kind of ip socket I am using. I have added a compat_raw_ioctl function because SIOCGETSGCNT only works on raw sockets. I have added an ipmr_compat_ioctl that mirrors the normal ipmr_ioctl.

This was necessary because unfortunately the struct layout for the SIOCGETSGCNT has unsigned longs in it, so it changes between 32bit and 64bit kernels.

This change was sufficient to run a 32bit ip multicast routing daemon on a 64bit kernel.

Reported-by: Bill Fenner <fenner@aristanetworks.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
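The layout problem mentioned above can be shown with two hypothetical structs. They are illustrative stand-ins, not the kernel's definitions: field names and the `toy_` prefix are assumptions, and the only property taken from the message is that the native request carries `unsigned long` counters.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical native layout: counter width follows the ABI
 * (4 bytes on a 32-bit build, 8 bytes on a typical 64-bit build). */
struct toy_sg_req {
    uint32_t src;             /* source address */
    uint32_t grp;             /* group address */
    unsigned long pktcnt;
    unsigned long bytecnt;
    unsigned long wrong_if;
};

/* What a 32-bit caller passes: every field is exactly 32 bits wide. */
struct toy_compat_sg_req {
    uint32_t src;
    uint32_t grp;
    uint32_t pktcnt;
    uint32_t bytecnt;
    uint32_t wrong_if;
};

/* Widen a 32-bit request into the native form, field by field. */
static struct toy_sg_req toy_from_compat(const struct toy_compat_sg_req *c)
{
    struct toy_sg_req n = {
        .src = c->src,
        .grp = c->grp,
        .pktcnt = c->pktcnt,
        .bytecnt = c->bytecnt,
        .wrong_if = c->wrong_if,
    };
    return n;
}
```

Because the two layouts differ in size and field offsets on a 64-bit kernel, a compat handler cannot reuse the caller's buffer directly and has to translate each field.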
2010-10-04 | ipmr: RCU protection for mfc_cache_array | Eric Dumazet | 1 | -0/+1
Use RCU & RTNL protection for mfc_cache_array[]

ipmr_cache_find() is called under rcu_read_lock();

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-14 | ipv4: ipmr: support multiple tables | Patrick McHardy | 1 | -1/+2
This patch adds support for multiple independent multicast routing instances, named "tables".

Userspace multicast routing daemons can bind to a specific table instance by issuing a setsockopt call using a new option MRT_TABLE. The table number is stored in the raw socket data and affects all following ipmr setsockopt(), getsockopt() and ioctl() calls. By default, a single table (RT_TABLE_DEFAULT) is created with a default routing rule pointing to it. Newly created pimreg devices have the table number appended ("pimregX"), with the exception of devices created in the default table, which are named just "pimreg" for compatibility reasons.

Packets are directed to a specific table instance using routing rules, similar to how regular routing rules work. Currently iif, oif and mark are supported as keys; source and destination addresses could be supported additionally.

Example usage:

- bind pimd/xorp/... to a specific table:

    uint32_t table = 123;
    setsockopt(fd, IPPROTO_IP, MRT_TABLE, &table, sizeof(table));

- create routing rules directing packets to the new table:

    # ip mrule add iif eth0 lookup 123
    # ip mrule add oif eth0 lookup 123

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-14 | ipv4: ipmr: convert struct mfc_cache to struct list_head | Patrick McHardy | 1 | -1/+1
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-14 | ipv4: ipmr: remove net pointer from struct mfc_cache | Patrick McHardy | 1 | -15/+0
Now that cache entries in unres_queue don't need to be distinguished by their network namespace pointer anymore, we can remove it from struct mfc_cache and pass the namespace as function argument to the functions that need it.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-11-04 | net: cleanup include/linux | Eric Dumazet | 1 | -12/+6
This cleanup patch puts struct/union/enum opening braces on the first line to ease grep games.

    struct something
    {

becomes:

    struct something {

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-10-07 | add vif using local interface index instead of IP | Ilia K | 1 | -4/+9
When a routing daemon wants to enable forwarding of multicast traffic it performs something like:

    struct vifctl vc = {
            .vifc_vifi = 1,
            .vifc_flags = 0,
            .vifc_threshold = 1,
            .vifc_rate_limit = 0,
            .vifc_lcl_addr = ip, /* <--- ip address of physical interface, e.g. eth0 */
            .vifc_rmt_addr.s_addr = htonl(INADDR_ANY),
    };
    setsockopt(fd, IPPROTO_IP, MRT_ADD_VIF, &vc, sizeof(vc));

This leads (in the kernel) to a call to vif_add(), which searches for the (physical) device using the assigned IP address:

    dev = ip_dev_find(net, vifc->vifc_lcl_addr.s_addr);

The current API (struct vifctl) does not allow specifying an interface in any way other than by its IP, and if more than one interface has the specified IP only the first one will be found.

The attached patch (against 2.6.30.4) allows specifying an interface by its index, instead of IP address:

    struct vifctl vc = {
            .vifc_vifi = 1,
            .vifc_flags = VIFF_USE_IFINDEX, /* NEW */
            .vifc_threshold = 1,
            .vifc_rate_limit = 0,
            .vifc_lcl_ifindex = if_nametoindex("eth0"), /* NEW */
            .vifc_rmt_addr.s_addr = htonl(INADDR_ANY),
    };
    setsockopt(fd, IPPROTO_IP, MRT_ADD_VIF, &vc, sizeof(vc));

Signed-off-by: Ilia K. <mail4ilia@gmail.com>

=== modified file 'include/linux/mroute.h'

Signed-off-by: David S. Miller <davem@davemloft.net>
2009-10-01 | net: Make setsockopt() optlen be unsigned. | David S. Miller | 1 | -2/+2
This provides safety against negative optlen at the type level instead of depending upon (sometimes non-trivial) checks against this sprinkled all over the place, in each and every implementation.

Based upon work done by Arjan van de Ven and feedback from Linus Torvalds.

Signed-off-by: David S. Miller <davem@davemloft.net>
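The benefit of an unsigned length can be sketched with a hypothetical option handler. Nothing here is kernel code: `toy_copy_opt()` and `TOY_OPT_MAX` are invented for illustration, and the sketch only assumes that a handler copies `optlen` bytes into a fixed-size buffer.

```c
#include <errno.h>
#include <string.h>

#define TOY_OPT_MAX 16u   /* hypothetical size of the destination buffer */

/* Copy an option value; dst must have room for TOY_OPT_MAX bytes.
 * With an unsigned optlen, one upper-bound check covers every bad length:
 * a caller that passes -1 arrives here as a very large value. */
static int toy_copy_opt(char *dst, const char *optval, unsigned int optlen)
{
    if (optlen > TOY_OPT_MAX)
        return -EINVAL;
    memcpy(dst, optval, optlen);
    return 0;
}
```

With a signed length, the same handler would also need an explicit `optlen < 0` check, which is the kind of scattered check the commit aims to make unnecessary.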
2009-01-23 | netns: ipmr: enable namespace support in ipv4 multicast routing code | Benjamin Thery | 1 | -1/+2
This last patch makes the appropriate changes to use and propagate the network namespace where needed in IPv4 multicast routing code.

This consists mainly in replacing all the remaining init_net occurrences with the current netns pointer retrieved from sockets, net devices or mfc_caches depending on the routines' contexts.

Some routines receive a new 'struct net' parameter to propagate the current netns:
* vif_add/vif_delete
* ipmr_new_tunnel
* mroute_clean_tables
* ipmr_cache_find
* ipmr_cache_report
* ipmr_cache_unresolved
* ipmr_mfc_add/ipmr_mfc_delete
* ipmr_get_route
* rt_fill_info (in route.c)

Signed-off-by: Benjamin Thery <benjamin.thery@bull.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-01-23 | netns: ipmr: store netns in struct mfc_cache | Benjamin Thery | 1 | -0/+15
This patch stores into struct mfc_cache the network namespace each mfc_cache belongs to. The new member is mfc_net.

mfc_net is assigned at cache allocation and doesn't change during the rest of the cache entry life.

A new net parameter is added to ipmr_cache_alloc/ipmr_cache_alloc_unres.

This will help to retrieve the current netns around the IPv4 multicast routing code.

At the moment, all mfc_cache are allocated in init_net.

Signed-off-by: Benjamin Thery <benjamin.thery@bull.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-08-30 | net: Unbreak userspace usage of linux/mroute.h | David S. Miller | 1 | -1/+1
Nothing in linux/pim.h should be exported to userspace.

This should fix the XORP build failure reported by Jose Calhariz, the Debian package maintainer.

Nothing originally in linux/mroute.h was exported to userspace ever, but some of this stuff started to be when it was moved into this new linux/pim.h, and that was wrong. If we didn't provide these definitions for 10 years we can reasonably expect that applications defined this stuff locally or used GLIBC headers providing the protocol definitions. And as such the only result of this can be conflict and userland build breakage.

Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-03 | ipv4,ipv6 mroute: Add some helper inline functions to remove ugly ifdefs. | YOSHIFUJI Hideaki | 1 | -0/+27
ip{,v6}_mroute_{set,get}sockopt() should not matter by optimization but it would be better not to depend on optimization semantically.

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
2008-07-03 | ipv4: Do cleanup for ip_mr_init | Wang Chen | 1 | -2/+1
Same as ip6_mr_init(), make ip_mr_init() return errno if it fails. But do not do error handling in inet_init(), just print a msg.

Signed-off-by: Wang Chen <wangchen@cn.fujitsu.com>
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
2008-04-04 | [IPV4] MROUTE: Adjust include files for user-space. | YOSHIFUJI Hideaki | 1 | -0/+3
<linux/mroute.h> needs <linux/types.h>. Avoid including <linux/in.h> in user-space, which conflicts with standard <netinet/in.h>. Add basic struct and constant in <linux/pim.h>.

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
2008-04-04 | [IPV4] MROUTE: Move PIM definitions to <linux/pim.h>. | YOSHIFUJI Hideaki | 1 | -21/+1
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
2007-11-07 | [IPV4]: Clean the ip_sockglue.c from some ugly ifdefs | Pavel Emelyanov | 1 | -0/+12
The #ifdef CONFIG_IP_MROUTE is sometimes placed inside the if-s, which looks completely bad. Similar ifdefs inside the functions look a bit better, but they are also not recommended to be used.

Provide an ifdef-ed ip_mroute_opt() helper to clean up the code.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-09-29 | [IPV4]: mroute annotations | Al Viro | 1 | -5/+5
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-09-29 | [IPV4]: PIMv2 header annotations | Al Viro | 1 | -2/+2
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-04-17 | Linux-2.6.12-rc2 (tag: v2.6.12-rc2) | Linus Torvalds | 1 | -0/+226
Initial git repository build. I'm not bothering with the full history, even though we have it. We can create a separate "historical" git archive of that later if we want to, and in the meantime it's about 3.2GB when imported into git - space that would just make the early git days unnecessarily complicated, when we don't have a lot of good infrastructure for it. Let it rip!