summaryrefslogtreecommitdiff
path: root/drivers/net/vrf.c
AgeCommit message (Collapse)AuthorFilesLines
2017-06-18net: remove DST_NOCACHE flagWei Wang1-1/+1
DST_NOCACHE flag check has been removed from dst_release() and dst_hold_safe() in a previous patch because all the dst are now ref counted properly and can be released based on refcnt only. Looking at the rest of the DST_NOCACHE use, all of them can now be removed or replaced with other checks. So this patch gets rid of all the DST_NOCACHE usage and remove this flag completely. Signed-off-by: Wei Wang <weiwan@google.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-18ipv6: take dst->__refcnt for insertion into fib6 treeWei Wang1-4/+0
In IPv6 routing code, struct rt6_info is created for each static route and RTF_CACHE route and inserted into fib6 tree. In both cases, dst ref count is not taken. As explained in the previous patch, this leads to the need of the dst garbage collector. This patch holds ref count of dst before inserting the route into fib6 tree and properly releases the dst when deleting it from the fib6 tree as a preparation in order to fully get rid of dst gc later. Also, correct fib6_age() logic to check dst->__refcnt to be 1 to indicate no user is referencing the dst. And remove dst_hold() in vrf_rt6_create() as ip6_dst_alloc() already puts dst->__refcnt to 1. Signed-off-by: Wei Wang <weiwan@google.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-16networking: make skb_push & __skb_push return void pointersJohannes Berg1-1/+1
It seems like a historic accident that these return unsigned char *, and in many places that means casts are required, more often than not. Make these functions return void * and remove all the casts across the tree, adding a (u8 *) cast only where the unsigned char pointer was used directly, all done with the following spatch: @@ expression SKB, LEN; typedef u8; identifier fn = { skb_push, __skb_push, skb_push_rcsum }; @@ - *(fn(SKB, LEN)) + *(u8 *)fn(SKB, LEN) @@ expression E, SKB, LEN; identifier fn = { skb_push, __skb_push, skb_push_rcsum }; type T; @@ - E = ((T *)(fn(SKB, LEN))) + E = fn(SKB, LEN) @@ expression SKB, LEN; identifier fn = { skb_push, __skb_push, skb_push_rcsum }; @@ - fn(SKB, LEN)[0] + *(u8 *)fn(SKB, LEN) Note that the last part there converts from push(...)[0] to the more idiomatic *(u8 *)push(...). Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-09net: vrf: Make add_fib_rules per network namespace flagDavid Ahern1-4/+32
Commit 1aa6c4f6b8cd8 ("net: vrf: Add l3mdev rules on first device create") adds the l3mdev FIB rule the first time a VRF device is created. However, it only creates the rule once and only in the namespace the first device is created - which may not be init_net. Fix by using the net_generic capability to make the add_fib_rules flag per network namespace. Fixes: 1aa6c4f6b8cd8 ("net: vrf: Add l3mdev rules on first device create") Reported-by: Petr Machata <petrm@mellanox.com> Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-07net: Fix inconsistent teardown and release of private netdev state.David S. Miller1-1/+1
Network devices can allocate reasources and private memory using netdev_ops->ndo_init(). However, the release of these resources can occur in one of two different places. Either netdev_ops->ndo_uninit() or netdev->destructor(). The decision of which operation frees the resources depends upon whether it is necessary for all netdev refs to be released before it is safe to perform the freeing. netdev_ops->ndo_uninit() presumably can occur right after the NETDEV_UNREGISTER notifier completes and the unicast and multicast address lists are flushed. netdev->destructor(), on the other hand, does not run until the netdev references all go away. Further complicating the situation is that netdev->destructor() almost universally does also a free_netdev(). This creates a problem for the logic in register_netdevice(). Because all callers of register_netdevice() manage the freeing of the netdev, and invoke free_netdev(dev) if register_netdevice() fails. If netdev_ops->ndo_init() succeeds, but something else fails inside of register_netdevice(), it does call ndo_ops->ndo_uninit(). But it is not able to invoke netdev->destructor(). This is because netdev->destructor() will do a free_netdev() and then the caller of register_netdevice() will do the same. However, this means that the resources that would normally be released by netdev->destructor() will not be. Over the years drivers have added local hacks to deal with this, by invoking their destructor parts by hand when register_netdevice() fails. Many drivers do not try to deal with this, and instead we have leaks. Let's close this hole by formalizing the distinction between what private things need to be freed up by netdev->destructor() and whether the driver needs unregister_netdevice() to perform the free_netdev(). netdev->priv_destructor() performs all actions to free up the private resources that used to be freed by netdev->destructor(), except for free_netdev(). netdev->needs_free_netdev is a boolean that indicates whether free_netdev() should be done at the end of unregister_netdevice(). Now, register_netdevice() can sanely release all resources after ndo_ops->ndo_init() succeeds, by invoking both ndo_ops->ndo_uninit() and netdev->priv_destructor(). And at the end of unregister_netdevice(), we invoke netdev->priv_destructor() and optionally call free_netdev(). Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-11driver: vrf: Fix one possible use-after-free issueGao Feng1-1/+2
The current codes only deal with the case that the skb is dropped, it may meet one use-after-free issue when NF_HOOK returns 0 that means the skb is stolen by one netfilter rule or hook. When one netfilter rule or hook stoles the skb and return NF_STOLEN, it means the skb is taken by the rule, and other modules should not touch this skb ever. Maybe the skb is queued or freed directly by the rule. Now uses the nf_hook instead of NF_HOOK to get the result of netfilter, and check the return value of nf_hook. Only when its value equals 1, it means the skb could go ahead. Or reset the skb as NULL. BTW, because vrf_rcv_finish is empty function, so needn't invoke it even though nf_hook returns 1. But we need to modify vrf_rcv_finish to deal with the NF_STOLEN case. There are two cases when skb is stolen. 1. The skb is stolen and freed directly. There is nothing we need to do, and vrf_rcv_finish isn't invoked. 2. The skb is queued and reinjected again. The vrf_rcv_finish would be invoked as okfn, so need to free the skb in it. Signed-off-by: Gao Feng <gfree.wind@vip.163.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-27net: vrf: Do not allow looback to be moved to a VRFDavid Ahern1-0/+6
Moving the loopback into a VRF breaks networking for the default VRF. Since the VRF device is the loopback for VRF domains, there is no reason to move the loopback. Given the repercussions, block attempts to set lo into a VRF. Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Reviewed-by: Greg Rose <gvrose8192@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-20Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller1-1/+1
A function in kernel/bpf/syscall.c which got a bug fix in 'net' was moved to kernel/bpf/verifier.c in 'net-next'. Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-17net: rtnetlink: plumb extended ack to doit functionDavid Ahern1-2/+2
Add netlink_ext_ack arg to rtnl_doit_func. Pass extack arg to nlmsg_parse for doit functions that call it directly. This is the first step to using extended error reporting in rtnetlink. >From here individual subsystems can be updated to set netlink_ext_ack as needed. Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-17net: vrf: Fix setting NLM_F_EXCL flag when adding l3mdev ruleDavid Ahern1-1/+1
Only need 1 l3mdev FIB rule. Fix setting NLM_F_EXCL in the nlmsghdr. Fixes: 1aa6c4f6b8cd8 ("net: vrf: Add l3mdev rules on first device create") Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-24Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller1-1/+3
Conflicts: drivers/net/ethernet/broadcom/genet/bcmmii.c drivers/net/hyperv/netvsc.c kernel/bpf/hashtab.c Almost entirely overlapping changes. Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-22net: vrf: performance improvements for IPv6David Ahern1-10/+56
The VRF driver allows users to implement device based features for an entire domain. For example, a qdisc or netfilter rules can be attached to a VRF device or tcpdump can be used to view packets for all devices in the L3 domain. The device-based features come with a performance penalty, most notably in the Tx path. The VRF driver uses the l3mdev_l3_out hook to switch the dst on an skb to its private dst. This allows the skb to traverse the xmit stack with the device set to the VRF device which in turn enables the netfilter and qdisc features. The VRF driver then performs the FIB lookup again and reinserts the packet. This patch avoids the redirect for IPv6 packets if a qdisc has not been attached to a VRF device which is the default config. In this case the netfilter hooks and network taps are directly traversed in the l3mdev_l3_out handler. If a qdisc is attached to a VRF device, then the redirect using the vrf dst is done. Additional overhead is removed by only checking packet taps if a socket is open on the device (vrf_dev->ptype_all list is not empty). Packet sockets bound to any device will still get a copy of the packet via the real ingress or egress interface. The end result of this change is a decrease in the overhead of VRF for the default, baseline case (ie., no netfilter rules, no packet sockets, no qdisc) from a +3% improvement for UDP which has a lookup per packet (VRF being better than no l3mdev) to ~2% loss for TCP_CRR which connects a socket for each request-response. Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-22net: vrf: performance improvements for IPv4David Ahern1-10/+96
The VRF driver allows users to implement device based features for an entire domain. For example, a qdisc or netfilter rules can be attached to a VRF device or tcpdump can be used to view packets for all devices in the L3 domain. The device-based features come with a performance penalty, most notably in the Tx path. The VRF driver uses the l3mdev_l3_out hook to switch the dst on an skb to its private dst. This allows the skb to traverse the xmit stack with the device set to the VRF device which in turn enables the netfilter and qdisc features. The VRF driver then performs the FIB lookup again and reinserts the packet. This patch avoids the redirect for IPv4 packets if a qdisc has not been attached to a VRF device which is the default config. In this case the netfilter hooks and network taps are directly traversed in the l3mdev_l3_out handler. If a qdisc is attached to a VRF device, then the redirect using the vrf dst is done. Additional overhead is removed by only checking packet taps if a socket is open on the device (vrf_dev->ptype_all list is not empty). Packet sockets bound to any device will still get a copy of the packet via the real ingress or egress interface. The end result of this change is a decrease in the overhead of VRF for the default, baseline case (ie., no netfilter rules, no packet sockets, no qdisc) to ~3% for UDP which has a lookup per packet and < 1% overhead for connected sockets that leverage early demux and avoid FIB lookups. Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-22net: vrf: Reset rt6i_idev in local dst after putDavid Ahern1-1/+3
The VRF driver takes a reference to the inet6_dev on the VRF device for its rt6_local dst when handling local traffic through the VRF device as a loopback. When the device is deleted the driver does a put on the idev but does not reset rt6i_idev in the rt6_info struct. When the dst is destroyed, dst_destroy calls ip6_dst_destroy which does a second put for what is essentially the same reference causing it to be prematurely freed. Reset rt6i_idev after the put in the vrf driver. Fixes: b4869aa2f881e ("net: vrf: ipv6 support for local traffic to local addresses") Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-16net: vrf: Set slave's private flag before linkingIdo Schimmel1-2/+6
Allow listeners of the subsequent CHANGEUPPER notification to retrieve the VRF's table ID by calling l3mdev_fib_table() with the slave netdev. Without this change, the netdev won't be considered an L3 slave and the function would return 0. This is consistent with other master device such as bridge and bond that set the slave's private flag before linking. It also makes do_vrf_{add,del}_slave() symmetric. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Acked-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-09vrf: Fix use-after-free in vrf_xmitDavid Ahern1-1/+2
KASAN detected a use-after-free: [ 269.467067] BUG: KASAN: use-after-free in vrf_xmit+0x7f1/0x827 [vrf] at addr ffff8800350a21c0 [ 269.467067] Read of size 4 by task ssh/1879 [ 269.467067] CPU: 1 PID: 1879 Comm: ssh Not tainted 4.10.0+ #249 [ 269.467067] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140531_083030-gandalf 04/01/2014 [ 269.467067] Call Trace: [ 269.467067] dump_stack+0x81/0xb6 [ 269.467067] kasan_object_err+0x21/0x78 [ 269.467067] kasan_report+0x2f7/0x450 [ 269.467067] ? vrf_xmit+0x7f1/0x827 [vrf] [ 269.467067] ? ip_output+0xa4/0xdb [ 269.467067] __asan_load4+0x6b/0x6d [ 269.467067] vrf_xmit+0x7f1/0x827 [vrf] ... Which corresponds to the skb access after xmit handling. Fix by saving skb->len and using the saved value to update stats. Fixes: 193125dbd8eb2 ("net: Introduce VRF device driver") Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-12net: rename dst_neigh_output back to neigh_outputJulian Anastasov1-2/+2
After the dst->pending_confirm flag was removed, we do not need anymore to provide dst arg to dst_neigh_output. So, rename it to neigh_output as before commit 5110effee8fd ("net: Do delayed neigh confirmation."). Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-07net: add dst_pending_confirm flag to skbuffJulian Anastasov1-1/+4
Add new skbuff flag to allow protocols to confirm neighbour. When same struct dst_entry can be used for many different neighbours we can not use it for pending confirmations. Add sock_confirm_neigh() helper to confirm the neighbour and use it for IPv4, IPv6 and VRF before dst_neigh_output. Signed-off-by: Julian Anastasov <ja@ssi.bg> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-11Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller1-0/+4
Two AF_* families adding entries to the lockdep tables at the same time. Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-11net: vrf: do not allow table id 0David Ahern1-0/+2
Frank reported that vrf devices can be created with a table id of 0. This breaks many of the run time table id checks and should not be allowed. Detect this condition at create time and fail with EINVAL. Fixes: 193125dbd8eb ("net: Introduce VRF device driver") Reported-by: Frank Kellermann <frank.kellermann@atos.net> Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-11net: ipv4: Fix multipath selection with vrfDavid Ahern1-0/+2
fib_select_path does not call fib_select_multipath if oif is set in the flow struct. For VRF use cases oif is always set, so multipath route selection is bypassed. Use the FLOWI_FLAG_SKIP_NH_OIF to skip the oif check similar to what is done in fib_table_lookup. Add saddr and proto to the flow struct for the fib lookup done by the VRF driver to better match hash computation for a flow. Fixes: 613d09b30f8b ("net: Use VRF device index for lookups on TX") Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-09net: make ndo_get_stats64 a void functionstephen hemminger1-3/+2
The network device operation for reading statistics is only called in one place, and it ignores the return value. Having a structure return value is potentially confusing because some future driver could incorrectly assume that the return value was used. Fix all drivers with ndo_get_stats64 to have a void function. Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-04net: vrf: Add missing Rx countersDavid Ahern1-0/+3
The move from rx-handler to L3 receive handler inadvertantly dropped the rx counters. Restore them. Fixes: 74b20582ac38 ("net: l3mdev: Add hook in ip and ipv6") Reported-by: Dinesh Dutt <ddutt@cumulusnetworks.com> Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-17net: vrf: Drop conntrack data after pass through VRF device on TxDavid Ahern1-0/+4
Locally originated traffic in a VRF fails in the presence of a POSTROUTING rule. For example, $ iptables -t nat -A POSTROUTING -s 11.1.1.0/24 -j MASQUERADE $ ping -I red -c1 11.1.1.3 ping: Warning: source address might be selected on device other than red. PING 11.1.1.3 (11.1.1.3) from 11.1.1.2 red: 56(84) bytes of data. ping: sendmsg: Operation not permitted Worse, the above causes random corruption resulting in a panic in random places (I have not seen a consistent backtrace). Call nf_reset to drop the conntrack info following the pass through the VRF device. The nf_reset is needed on Tx but not Rx because of the order in which NF_HOOK's are hit: on Rx the VRF device is after the real ingress device and on Tx it is is before the real egress device. Connection tracking should be tied to the real egress device and not the VRF device. Fixes: 8f58336d3f78a ("net: Add ethernet header for pass through VRF device") Fixes: 35402e3136634 ("net: Add IPv6 support to VRF device") Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-17net: vrf: Fix NAT within a VRFDavid Ahern1-2/+0
Connection tracking with VRF is broken because the pass through the VRF device drops the connection tracking info. Removing the call to nf_reset allows DNAT and MASQUERADE to work across interfaces within a VRF. Fixes: 73e20b761acf ("net: vrf: Add support for PREROUTING rules on vrf device") Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-01net: Enable support for VRF with ipv4 multicastDavid Ahern1-5/+18
Enable support for IPv4 multicast: - similar to unicast the flow struct is updated to L3 master device if relevant prior to calling fib_rules_lookup. The table id is saved to the lookup arg so the rule action for ipmr can return the table associated with the device. - ip_mr_forward needs to check for master device mismatch as well since the skb->dev is set to it - allow multicast address on VRF device for Rx by checking for the daddr in the VRF device as well as the original ingress device - on Tx need to drop to __mkroute_output when FIB lookup fails for multicast destination address. - if CONFIG_IP_MROUTE_MULTIPLE_TABLES is enabled VRF driver creates IPMR FIB rules on first device create similar to FIB rules. In addition the VRF driver does not divert IPv4 multicast packets: it breaks on Tx since the fib lookup fails on the mcast address. With this patch, ipmr forwarding and local rx/tx work. Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-17net: Require exact match for TCP socket lookups if dif is l3mdevDavid Ahern1-0/+2
Currently, socket lookups for l3mdev (vrf) use cases can match a socket that is bound to a port but not a device (ie., a global socket). If the sysctl tcp_l3mdev_accept is not set this leads to ack packets going out based on the main table even though the packet came in from an L3 domain. The end result is that the connection does not establish creating confusion for users since the service is running and a socket shows in ss output. Fix by requiring an exact dif to sk_bound_dev_if match if the skb came through an interface enslaved to an l3mdev device and the tcp_l3mdev_accept is not set. skb's through an l3mdev interface are marked by setting a flag in inet{6}_skb_parm. The IPv6 variant is already set; this patch adds the flag for IPv4. Using an skb flag avoids a device lookup on the dif. The flag is set in the VRF driver using the IP{6}CB macros. For IPv4, the inet_skb_parm struct is moved in the cb per commit 971f10eca186, so the match function in the TCP stack needs to use TCP_SKB_CB. For IPv6, the move is done after the socket lookup, so IP6CB is used. The flags field in inet_skb_parm struct needs to be increased to add another flag. There is currently a 1-byte hole following the flags, so it can be expanded to u16 without increasing the size of the struct. Fixes: 193125dbd8eb ("net: Introduce VRF device driver") Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-17net: vrf: Remove RT_FL_TOSDavid Ahern1-3/+0
No longer used after d66f6c0a8f3c0 ("net: ipv4: Remove l3mdev_get_saddr") Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-11net: flow: Remove FLOWI_FLAG_L3MDEV_SRC flagDavid Ahern1-3/+2
No longer used Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-11net: l3mdev: remove get_rtable methodDavid Ahern1-21/+0
No longer used Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-11net: ipv6: Remove l3mdev_get_saddr6David Ahern1-41/+0
No longer needed Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-11net: ipv4: Remove l3mdev_get_saddrDavid Ahern1-38/+0
No longer needed Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-11net: vrf: Flip IPv6 output path from FIB lookup hook to out hookDavid Ahern1-42/+82
Flip the IPv6 output path to use the l3mdev tx out hook. The VRF dst is not returned on the first FIB lookup. Instead, the dst on the skb is switched at the beginning of the IPv6 output processing to send the packet to the VRF driver on xmit. Link scope addresses (linklocal and multicast) need special handling: specifically the oif the flow struct can not be changed because we want the lookup tied to the enslaved interface. ie., the source address and the returned route MUST point to the interface scope passed in. Convert the existing vrf_get_rt6_dst to handle only link scope addresses. Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-11net: vrf: Flip IPv4 output path from FIB lookup hook to out hookDavid Ahern1-1/+63
Flip the IPv4 output path to use the l3mdev tx out hook. The VRF dst is not returned on the first FIB lookup. Instead, the dst on the skb is switched at the beginning of the IPv4 output processing to send the packet to the VRF driver on xmit. Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-05net: vrf: Add support for PREROUTING rules on vrf deviceDavid Ahern1-0/+21
Add support for PREROUTING rules with skb->dev set to the vrf device. INPUT rules are already allowed. Provides symmetry with the output path which allows POSTROUTING rules. Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-18net: vrf: Implement get_saddr for IPv6David Ahern1-0/+41
IPv6 source address selection needs to consider the real egress route. Similar to IPv4 implement a get_saddr6 method which is called if source address has not been set. The get_saddr6 method does a full lookup which means pulling a route from the VRF FIB table and properly considering linklocal/multicast destination addresses. Lookup failures (eg., unreachable) then cause the source address selection to fail which gets propagated back to the caller. Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-16net: vrf: Switch dst dev to loopback on device deleteDavid Ahern1-13/+42
Attempting to delete a VRF device with a socket bound to it can stall: unregister_netdevice: waiting for red to become free. Usage count = 1 The unregister is waiting for the dst to be released and with it references to the vrf device. Similar to dst_ifdown switch the dst dev to loopback on delete for all of the dst's for the vrf device and release the references to the vrf device. Fixes: 193125dbd8eb2 ("net: Introduce VRF device driver") Fixes: 35402e3136634 ("net: Add IPv6 support to VRF device") Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-16net: vrf: Update flags and features settingsDavid Ahern1-0/+14
1. Default VRF devices to not having a qdisc (IFF_NO_QUEUE). Users can add one as desired. 2. Disable adding a VLAN to a VRF device. 3. Enable offloads and hardware features similar to other logical devices (e.g., dummy, veth) Change provides a significant boost in TCP stream Tx performance, from ~2,700 Mbps to ~18,100 Mbps and makes throughput close to the performance without a VRF (18,500 Mbps). netperf TCP_STREAM benchmark using qemu with virtio+vhost for the NICs Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-15net: vrf: Handle ipv6 multicast and link-local addressesDavid Ahern1-5/+93
IPv6 multicast and link-local addresses require special handling by the VRF driver: 1. Rather than using the VRF device index and full FIB lookups, packets to/from these addresses should use direct FIB lookups based on the VRF device table. 2. fail sends/receives on a VRF device to/from a multicast address (e.g, make ping6 ff02::1%<vrf> fail) 3. move the setting of the flow oif to the first dst lookup and revert the change in icmpv6_echo_reply made in ca254490c8dfd ("net: Add VRF support to IPv6 stack"). Linklocal/mcast addresses require use of the skb->dev. With this change connections into and out of a VRF enslaved device work for multicast and link-local addresses work (icmp, tcp, and udp) e.g., 1. packets into VM with VRF config: ping6 -c3 fe80::e0:f9ff:fe1c:b974%br1 ping6 -c3 ff02::1%br1 ssh -6 fe80::e0:f9ff:fe1c:b974%br1 2. packets going out a VRF enslaved device: ping6 -c3 fe80::18f8:83ff:fe4b:7a2e%eth1 ping6 -c3 ff02::1%eth1 ssh -6 root@fe80::18f8:83ff:fe4b:7a2e%eth1 Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-15net: l3mdev: Remove const from flowi6 arg to get_rt6_dstDavid Ahern1-1/+1
Allow drivers to pass flow arg to functions where the arg is not const and allow the driver to make updates as needed (eg., setting oif). Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-10net: vrf: Fix crash when IPv6 is disabled at boot timeDavid Ahern1-0/+7
Frank Kellermann reported a kernel crash with 4.5.0 when IPv6 is disabled at boot using the kernel option ipv6.disable=1. Using current net-next with the boot option: $ ip link add red type vrf table 1001 Generates: [12210.919584] BUG: unable to handle kernel NULL pointer dereference at 0000000000000748 [12210.921341] IP: [<ffffffff814b30e3>] fib6_get_table+0x2c/0x5a [12210.922537] PGD b79e3067 PUD bb32b067 PMD 0 [12210.923479] Oops: 0000 [#1] SMP [12210.924001] Modules linked in: ipvlan 8021q garp mrp stp llc [12210.925130] CPU: 3 PID: 1177 Comm: ip Not tainted 4.7.0-rc1+ #235 [12210.926168] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140531_083030-gandalf 04/01/2014 [12210.928065] task: ffff8800b9ac4640 ti: ffff8800bacac000 task.ti: ffff8800bacac000 [12210.929328] RIP: 0010:[<ffffffff814b30e3>] [<ffffffff814b30e3>] fib6_get_table+0x2c/0x5a [12210.930697] RSP: 0018:ffff8800bacaf888 EFLAGS: 00010202 [12210.931563] RAX: 0000000000000748 RBX: ffffffff81a9e280 RCX: ffff8800b9ac4e28 [12210.932688] RDX: 00000000000000e9 RSI: 0000000000000002 RDI: 0000000000000286 [12210.933820] RBP: ffff8800bacaf898 R08: ffff8800b9ac4df0 R09: 000000000052001b [12210.934941] R10: 00000000657c0000 R11: 000000000000c649 R12: 00000000000003e9 [12210.936032] R13: 00000000000003e9 R14: ffff8800bace7800 R15: ffff8800bb3ec000 [12210.937103] FS: 00007faa1766c700(0000) GS:ffff88013ac00000(0000) knlGS:0000000000000000 [12210.938321] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [12210.939166] CR2: 0000000000000748 CR3: 00000000b79d6000 CR4: 00000000000406e0 [12210.940278] Stack: [12210.940603] ffff8800bb3ec000 ffffffff81a9e280 ffff8800bacaf8c8 ffffffff814b3135 [12210.941818] ffff8800bb3ec000 ffffffff81a9e280 ffffffff81a9e280 ffff8800bace7800 [12210.943040] ffff8800bacaf8f0 ffffffff81397c88 ffff8800bb3ec000 ffffffff81a9e280 [12210.944288] Call Trace: [12210.944688] [<ffffffff814b3135>] fib6_new_table+0x24/0x8a [12210.945516] [<ffffffff81397c88>] vrf_dev_init+0xd4/0x162 [12210.946328] [<ffffffff814091e1>] register_netdevice+0x100/0x396 [12210.947209] [<ffffffff8139823d>] vrf_newlink+0x40/0xb3 [12210.948001] [<ffffffff814187f0>] rtnl_newlink+0x5d3/0x6d5 ... The problem above is due to the fact that the fib hash table is not allocated when IPv6 is disabled at boot. As for the VRF driver it should not do any IPv6 initializations if IPv6 is disabled, so it needs to know if IPv6 is disabled at boot. The disable parameter is private to the IPv6 module, so provide an accessor for modules to determine if IPv6 was disabled at boot time. Fixes: 35402e3136634 ("net: Add IPv6 support to VRF device") Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-09net: vrf: call netdev_lockdep_set_classes()Eric Dumazet1-1/+1
In case a qdisc is used on a vrf device, we need to use different lockdep classes to avoid false positives. Use the new netdev_lockdep_set_classes() generic helper. Reported-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Tested-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-08net: vrf: Add l3mdev rules on first device createDavid Ahern1-1/+105
Add l3mdev rule per address family when the first VRF device is created. The rules are installed with a default preference of 1000. Users can replace the default rule as desired. Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-08net: vrf: ipv6 support for local traffic to local addressesDavid Ahern1-4/+85
Add support for locally originated traffic to VRF-local IPv6 addresses. Similar to IPv4 a local dst is set on the skb and the packet is reinserted with a call to netif_rx. With this patch, ping, tcp and udp packets to a local IPv6 address are successfully routed: $ ip addr show dev eth1 4: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master red state UP group default qlen 1000 link/ether 02:e0:f9:1c:b9:74 brd ff:ff:ff:ff:ff:ff inet 10.100.1.1/24 brd 10.100.1.255 scope global eth1 valid_lft forever preferred_lft forever inet6 2100:1::1/120 scope global valid_lft forever preferred_lft forever inet6 fe80::e0:f9ff:fe1c:b974/64 scope link valid_lft forever preferred_lft forever $ ping6 -c1 -I red 2100:1::1 ping6: Warning: source address might be selected on device other than red. PING 2100:1::1(2100:1::1) from 2100:1::1 red: 56 data bytes 64 bytes from 2100:1::1: icmp_seq=1 ttl=64 time=0.098 ms ip6_input is exported so the VRF driver can use it for the dst input function. The dst_alloc function for IPv4 defaults to setting the input and output functions; IPv6's does not. VRF does not need to duplicate the Rx path so just export the ipv6 input function. Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-08net: vrf: ipv4 support for local traffic to local addressesDavid Ahern1-2/+98
Add support for locally originated traffic to VRF-local addresses. If destination device for an skb is the loopback or VRF device then set its dst to a local version of the VRF cached dst_entry and call netif_rx to insert the packet onto the rx queue - similar to what is done for loopback. This patch handles IPv4 support; follow on patch handles IPv6. With this patch, ping, tcp and udp packets to a local IPv4 address are successfully routed: $ ip addr show dev eth1 4: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master red state UP group default qlen 1000 link/ether 02:e0:f9:1c:b9:74 brd ff:ff:ff:ff:ff:ff inet 10.100.1.1/24 brd 10.100.1.255 scope global eth1 valid_lft forever preferred_lft forever inet6 2100:1::1/120 scope global valid_lft forever preferred_lft forever inet6 fe80::e0:f9ff:fe1c:b974/64 scope link valid_lft forever preferred_lft forever $ ping -c1 -I red 10.100.1.1 ping: Warning: source address might be selected on device other than red. PING 10.100.1.1 (10.100.1.1) from 10.100.1.1 red: 56(84) bytes of data. 64 bytes from 10.100.1.1: icmp_seq=1 ttl=64 time=0.057 ms This patch also enables use of IPv4 loopback address on the VRF device: $ ip addr add dev red 127.0.0.1/8 $ ping -c1 -I red 127.0.0.1 PING 127.0.0.1 (127.0.0.1) from 127.0.0.1 red: 56(84) bytes of data. 64 bytes from 127.0.0.1: icmp_seq=1 ttl=64 time=0.058 ms Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-08net: vrf: Minor refactoring for local address patchesDavid Ahern1-27/+18
Move the stripping of the ethernet header from is_ip_tx_frame into the ipv4 and ipv6 outbound functions and collapse vrf_send_v4_prep into vrf_process_v4_outbound. Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-07net: Revert vrf-local changes.David S. Miller1-201/+33
This reverts commit 2fb7ea455d57e22110c54fc2de0656b6f744263c. It results in build errors because ip6_input is not a symbol exported to modules. Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-07net: vrf: ipv6 support for local traffic to local addressesDavid Ahern1-4/+85
Add support for locally originated traffic to VRF-local IPv6 addresses. Similar to IPv4 a local dst is set on the skb and the packet is reinserted with a call to netif_rx. With this patch, ping, tcp and udp packets to a local IPv6 address are successfully routed: $ ip addr show dev eth1 4: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master red state UP group default qlen 1000 link/ether 02:e0:f9:1c:b9:74 brd ff:ff:ff:ff:ff:ff inet 10.100.1.1/24 brd 10.100.1.255 scope global eth1 valid_lft forever preferred_lft forever inet6 2100:1::1/120 scope global valid_lft forever preferred_lft forever inet6 fe80::e0:f9ff:fe1c:b974/64 scope link valid_lft forever preferred_lft forever $ ping6 -c1 -I red 2100:1::1 ping6: Warning: source address might be selected on device other than red. PING 2100:1::1(2100:1::1) from 2100:1::1 red: 56 data bytes 64 bytes from 2100:1::1: icmp_seq=1 ttl=64 time=0.098 ms ip6_input is exported so the VRF driver can use it for the dst input function. The dst_alloc function for IPv4 defaults to setting the input and output functions; IPv6's does not. VRF does not need to duplicate the Rx path so just export the ipv6 input function. Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-07net: vrf: ipv4 support for local traffic to local addressesDavid Ahern1-2/+98
Add support for locally originated traffic to VRF-local addresses. If destination device for an skb is the loopback or VRF device then set its dst to a local version of the VRF cached dst_entry and call netif_rx to insert the packet onto the rx queue - similar to what is done for loopback. This patch handles IPv4 support; follow on patch handles IPv6. With this patch, ping, tcp and udp packets to a local IPv4 address are successfully routed: $ ip addr show dev eth1 4: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master red state UP group default qlen 1000 link/ether 02:e0:f9:1c:b9:74 brd ff:ff:ff:ff:ff:ff inet 10.100.1.1/24 brd 10.100.1.255 scope global eth1 valid_lft forever preferred_lft forever inet6 2100:1::1/120 scope global valid_lft forever preferred_lft forever inet6 fe80::e0:f9ff:fe1c:b974/64 scope link valid_lft forever preferred_lft forever $ ping -c1 -I red 10.100.1.1 ping: Warning: source address might be selected on device other than red. PING 10.100.1.1 (10.100.1.1) from 10.100.1.1 red: 56(84) bytes of data. 64 bytes from 10.100.1.1: icmp_seq=1 ttl=64 time=0.057 ms This patch also enables use of IPv4 loopback address on the VRF device: $ ip addr add dev red 127.0.0.1/8 $ ping -c1 -I red 127.0.0.1 PING 127.0.0.1 (127.0.0.1) from 127.0.0.1 red: 56(84) bytes of data. 64 bytes from 127.0.0.1: icmp_seq=1 ttl=64 time=0.058 ms Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-07net: vrf: Minor refactoring for local address patchesDavid Ahern1-27/+18
Move the stripping of the ethernet header from is_ip_tx_frame into the ipv4 and ipv6 outbound functions. If the packet is destined to a local address the header is retained since the packet is sent back to netif_rx. Collapse vrf_send_v4_prep into vrf_process_v4_outbound. Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>