summaryrefslogtreecommitdiff
path: root/net
AgeCommit message (Collapse)AuthorFilesLines
2017-01-28Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller53-269/+408
Two trivial overlapping changes conflicts in MPLS and mlx5. Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-27Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds52-268/+407
Pull networking fixes from David Miller: 1) GTP fixes from Andreas Schultz (missing genl module alias, clear IP DF on transmit). 2) Netfilter needs to reflect the fwmark when sending resets, from Pau Espin Pedrol. 3) nftable dump OOPS fix from Liping Zhang. 4) Fix erroneous setting of VIRTIO_NET_HDR_F_DATA_VALID on transmit, from Rolf Neugebauer. 5) Fix build error of ipt_CLUSTERIP when procfs is disabled, from Arnd Bergmann. 6) Fix regression in handling of NETIF_F_SG in harmonize_features(), from Eric Dumazet. 7) Fix RTNL deadlock wrt. lwtunnel module loading, from David Ahern. 8) tcp_fastopen_create_child() needs to setup tp->max_window, from Alexey Kodanev. 9) Missing kmemdup() failure check in ipv6 segment routing code, from Eric Dumazet. 10) Don't execute unix_bind() under the bindlock, otherwise we deadlock with splice. From WANG Cong. 11) ip6_tnl_parse_tlv_enc_lim() potentially reallocates the skb buffer, therefore callers must reload cached header pointers into that skb. Fix from Eric Dumazet. 12) Fix various bugs in legacy IRQ fallback handling in alx driver, from Tobias Regnery. 13) Do not allow lwtunnel drivers to be unloaded while they are referenced by active instances, from Robert Shearman. 14) Fix truncated PHY LED trigger names, from Geert Uytterhoeven. 15) Fix a few regressions from virtio_net XDP support, from John Fastabend and Jakub Kicinski. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (102 commits) ISDN: eicon: silence misleading array-bounds warning net: phy: micrel: add support for KSZ8795 gtp: fix cross netns recv on gtp socket gtp: clear DF bit on GTP packet tx gtp: add genl family modules alias tcp: don't annotate mark on control socket from tcp_v6_send_response() ravb: unmap descriptors when freeing rings virtio_net: reject XDP programs using header adjustment virtio_net: use dev_kfree_skb for small buffer XDP receive r8152: check rx after napi is enabled r8152: re-schedule napi for tx r8152: avoid start_xmit to schedule napi when napi is disabled r8152: avoid start_xmit to call napi_schedule during autosuspend net: dsa: Bring back device detaching in dsa_slave_suspend() net: phy: leds: Fix truncated LED trigger names net: phy: leds: Break dependency of phy.h on phy_led_triggers.h net: phy: leds: Clear phy_num_led_triggers on failure to avoid crash net-next: ethernet: mediatek: change the compatible string Documentation: devicetree: change the mediatek ethernet compatible string bnxt_en: Fix RTNL lock usage on bnxt_get_port_module_status(). ...
2017-01-27net: adjust skb->truesize in pskb_expand_head()Eric Dumazet3-10/+14
Slava Shwartsman reported a warning in skb_try_coalesce(), when we detect skb->truesize is completely wrong. In his case, issue came from IPv6 reassembly coping with malicious datagrams, that forced various pskb_may_pull() to reallocate a bigger skb->head than the one allocated by NIC driver before entering GRO layer. Current code does not change skb->truesize, leaving this burden to callers if they care enough. Blindly changing skb->truesize in pskb_expand_head() is not easy, as some producers might track skb->truesize, for example in xmit path for back pressure feedback (sk->sk_wmem_alloc) We can detect the cases where it should be safe to change skb->truesize : 1) skb is not attached to a socket. 2) If it is attached to a socket, destructor is sock_edemux() My audit gave only two callers doing their own skb->truesize manipulation. I had to remove skb parameter in sock_edemux macro when CONFIG_INET is not set to avoid a compile error. Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Slava Shwartsman <slavash@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-27tcp: don't annotate mark on control socket from tcp_v6_send_response()Pablo Neira5-9/+9
Unlike ipv4, this control socket is shared by all cpus so we cannot use it as scratchpad area to annotate the mark that we pass to ip6_xmit(). Add a new parameter to ip6_xmit() to indicate the mark. The SCTP socket family caches the flowi6 structure in the sctp_transport structure, so we cannot use to carry the mark unless we later on reset it back, which I discarded since it looks ugly to me. Fixes: bf99b4ded5f8 ("tcp: fix mark propagation with fwmark_reflect enabled") Suggested-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-27net/ipv6: support more tunnel interfaces for EUI64 link-local generationFelix Jia3-0/+12
Signed-off-by: Felix Jia <felix.jia@alliedtelesis.co.nz> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-27net/ipv6: allow sysctl to change link-local address generation modeFelix Jia1-20/+84
The address generation mode for IPv6 link-local can only be configured by netlink messages. This patch adds the ability to change the address generation mode via sysctl. v1 -> v2 Removed the rtnl lock and switch to use RCU lock to iterate through the netdev list. v2 -> v3 Removed the addrgenmode variable from the idev structure and use the systcl storage for the flag. Simplifed the logic for sysctl handling by removing the supported for all operation. Added support for more types of tunnel interfaces for link-local address generation. Based the patches from net-next. v3 -> v4 Removed unnecessary whitespace changes. Signed-off-by: Felix Jia <felix.jia@alliedtelesis.co.nz> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-27net: ipv6: ignore null_entry on route dumpsDavid Ahern1-1/+5
lkp-robot reported a BUG: [ 10.151226] BUG: unable to handle kernel NULL pointer dereference at 00000198 [ 10.152525] IP: rt6_fill_node+0x164/0x4b8 [ 10.153307] *pdpt = 0000000012ee5001 *pde = 0000000000000000 [ 10.153309] [ 10.154492] Oops: 0000 [#1] [ 10.154987] CPU: 0 PID: 909 Comm: netifd Not tainted 4.10.0-rc4-00722-g41e8c70ee162-dirty #10 [ 10.156482] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140531_083030-gandalf 04/01/2014 [ 10.158254] task: d0deb000 task.stack: d0e0c000 [ 10.159059] EIP: rt6_fill_node+0x164/0x4b8 [ 10.159780] EFLAGS: 00010296 CPU: 0 [ 10.160404] EAX: 00000000 EBX: d10c2358 ECX: c1f7c6cc EDX: c1f6ff44 [ 10.161469] ESI: 00000000 EDI: c2059900 EBP: d0e0dc4c ESP: d0e0dbe4 [ 10.162534] DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 0068 [ 10.163482] CR0: 80050033 CR2: 00000198 CR3: 10d94660 CR4: 000006b0 [ 10.164535] Call Trace: [ 10.164993] ? paravirt_sched_clock+0x9/0xd [ 10.165727] ? sched_clock+0x9/0xc [ 10.166329] ? sched_clock_cpu+0x19/0xe9 [ 10.166991] ? lock_release+0x13e/0x36c [ 10.167652] rt6_dump_route+0x4c/0x56 [ 10.168276] fib6_dump_node+0x1d/0x3d [ 10.168913] fib6_walk_continue+0xab/0x167 [ 10.169611] fib6_walk+0x2a/0x40 [ 10.170182] inet6_dump_fib+0xfb/0x1e0 [ 10.170855] netlink_dump+0xcd/0x21f This happens when the loopback device is set down and a ipv6 fib route dump is requested. ip6_null_entry is the root of all ipv6 fib tables making it integrated into the table and hence passed to the ipv6 route dump code. The null_entry route uses the loopback device for dst.dev but may not have rt6i_idev set because of the order in which initializations are done -- ip6_route_net_init is run before addrconf_init has initialized the loopback device. Fixing the initialization order is a much bigger problem with no obvious solution thus far. The BUG is triggered when the loopback is set down and the netif_running check added by a1a22c1206 fails. The fill_node descends to checking rt->rt6i_idev for ignore_routes_with_linkdown and since rt6i_idev is NULL it faults. The null_entry route should not be processed in a dump request. Catch and ignore. This check is done in rt6_dump_route as it is the highest place in the callchain with knowledge of both the route and the network namespace. Fixes: a1a22c1206("net: ipv6: Keep nexthop of multipath route on admin down") Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-27net: ipv6: remove skb_reserve in getrouteDavid Ahern1-6/+0
Remove skb_reserve and skb_reset_mac_header from inet6_rtm_getroute. The allocated skb is not passed through the routing engine (like it is for IPv4) and has not since the beginning of git time. Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-26net: dsa: Move ports assignment closer to error checkingFlorian Fainelli1-1/+2
Move the assignment of ports in _dsa_register_switch() closer to where it is checked, no functional change. Re-order declarations to be preserve the inverted christmas tree style. Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-26net: dsa: Suffix function manipulating device_node with _dnFlorian Fainelli1-8/+8
Make it clear that these functions take a device_node structure pointer Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-26net: dsa: Make most functions take a dsa_port argumentFlorian Fainelli3-36/+44
In preparation for allowing platform data, and therefore no valid device_node pointer, make most DSA functions takes a pointer to a dsa_port structure whenever possible. While at it, introduce a dsa_port_is_valid() helper function which checks whether port->dn is NULL or not at the moment. Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-26net: dsa: Pass device pointer to dsa_register_switchFlorian Fainelli1-3/+4
In preparation for allowing dsa_register_switch() to be supplied with device/platform data, pass down a struct device pointer instead of a struct device_node. Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-26Merge tag 'batadv-next-for-davem-20170126' of ↵David S. Miller59-69/+85
git://git.open-mesh.org/linux-merge Simon Wunderlich says: ==================== This feature/cleanup patchset includes the following patches: - bump version strings, by Simon Wunderlich - ignore self-generated loop detect MAC addresses in translation table, by Simon Wunderlich - install uapi batman_adv.h header, by Sven Eckelmann - bump copyright years, by Sven Eckelmann - Remove an unused variable in translation table code, by Sven Eckelmann - Handle NET_XMIT_CN like NET_XMIT_SUCCESS (revised according to Davids suggestion), and a follow up code clean up, by Gao Feng (2 patches) ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-26Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nfDavid S. Miller19-88/+103
Pablo Neira Ayuso says: ==================== Netfilter fixes for net The following patchset contains a large batch with Netfilter fixes for your net tree, they are: 1) Two patches to solve conntrack garbage collector cpu hogging, one to remove GC_MAX_EVICTS and another to look at the ratio (scanned entries vs. evicted entries) to make a decision on whether to reduce or not the scanning interval. From Florian Westphal. 2) Two patches to fix incorrect set element counting if NLM_F_EXCL is is not set. Moreover, don't decrenent set->nelems from abort patch if -ENFILE which leaks a spare slot in the set. This includes a patch to deconstify the set walk callback to update set->ndeact. 3) Two fixes for the fwmark_reflect sysctl feature: Propagate mark to reply packets both from nf_reject and local stack, from Pau Espin Pedrol. 4) Fix incorrect handling of loopback traffic in rpfilter and nf_tables fib expression, from Liping Zhang. 5) Fix oops on stateful objects netlink dump, when no filter is specified. Also from Liping Zhang. 6) Fix a build error if proc is not available in ipt_CLUSTERIP, related to fix that was applied in the previous batch for net. From Arnd Bergmann. 7) Fix lack of string validation in table, chain, set and stateful object names in nf_tables, from Liping Zhang. Moreover, restrict maximum log prefix length to 127 bytes, otherwise explicitly bail out. 8) Two patches to fix spelling and typos in nf_tables uapi header file and Kconfig, patches from Alexander Alemayhu and William Breathitt Gray. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-26batman-adv: Treat NET_XMIT_CN as transmit successfullyGao Feng1-1/+1
The tc could return NET_XMIT_CN as one congestion notification, but it does not mean the packet is lost. Other modules like ipvlan, macvlan, and others treat NET_XMIT_CN as success too. So batman-adv should handle NET_XMIT_CN also as NET_XMIT_SUCCESS. Signed-off-by: Gao Feng <gfree.wind@gmail.com> [sven@narfation.org: Moved NET_XMIT_CN handling to batadv_send_skb_packet] Signed-off-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2017-01-26batman-adv: Remove one condition check in batadv_route_unicast_packetGao Feng1-5/+4
It could decrease one condition check to collect some statements in the first condition block. Signed-off-by: Gao Feng <gfree.wind@gmail.com> Signed-off-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2017-01-26batman-adv: Remove unused variable in batadv_tt_local_set_flagsSven Eckelmann1-2/+0
Signed-off-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2017-01-26batman-adv: update copyright years for 2017Sven Eckelmann59-59/+59
Signed-off-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2017-01-26batman-adv: don't add loop detect macs to TTSimon Wunderlich2-1/+20
The bridge loop avoidance (BLA) feature of batman-adv sends packets to probe for Mesh/LAN packet loops. Those packets are not sent by real clients and should therefore not be added to the translation table (TT). Signed-off-by: Simon Wunderlich <simon.wunderlich@open-mesh.com>
2017-01-26bridge: move maybe_deliver_addr() inside #ifdefArnd Bergmann1-25/+25
The only caller of this new function is inside of an #ifdef checking for CONFIG_BRIDGE_IGMP_SNOOPING, so we should move the implementation there too, in order to avoid this harmless warning: net/bridge/br_forward.c:177:13: error: 'maybe_deliver_addr' defined but not used [-Werror=unused-function] Fixes: 6db6f0eae605 ("bridge: multicast to unicast") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-26Merge tag 'batadv-net-for-davem-20170125' of git://git.open-mesh.org/linux-mergeDavid S. Miller1-5/+5
Simon Wunderlich says: ==================== Here is a batman-adv bugfix: - fix reference count handling on fragmentation error, by Sven Eckelmann ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-25net: dsa: Bring back device detaching in dsa_slave_suspend()Florian Fainelli1-0/+2
Commit 448b4482c671 ("net: dsa: Add lockdep class to tx queues to avoid lockdep splat") removed the netif_device_detach() call done in dsa_slave_suspend() which is necessary, and paired with a corresponding netif_device_attach(), bring it back. Fixes: 448b4482c671 ("net: dsa: Add lockdep class to tx queues to avoid lockdep splat") Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-25net/tcp-fastopen: make connect()'s return case more consistent with non-TFOWilly Tarreau2-4/+4
Without TFO, any subsequent connect() call after a successful one returns -1 EISCONN. The last API update ensured that __inet_stream_connect() can return -1 EINPROGRESS in response to sendmsg() when TFO is in use to indicate that the connection is now in progress. Unfortunately since this function is used both for connect() and sendmsg(), it has the undesired side effect of making connect() now return -1 EINPROGRESS as well after a successful call, while at the same time poll() returns POLLOUT. This can confuse some applications which happen to call connect() and to check for -1 EISCONN to ensure the connection is usable, and for which EINPROGRESS indicates a need to poll, causing a loop. This problem was encountered in haproxy where a call to connect() is precisely used in certain cases to confirm a connection's readiness. While arguably haproxy's behaviour should be improved here, it seems important to aim at a more robust behaviour when the goal of the new API is to make it easier to implement TFO in existing applications. This patch simply ensures that we preserve the same semantics as in the non-TFO case on the connect() syscall when using TFO, while still returning -1 EINPROGRESS on sendmsg(). For this we simply tell __inet_stream_connect() whether we're doing a regular connect() or in fact connecting for a sendmsg() call. Cc: Wei Wang <weiwan@google.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Eric Dumazet <edumazet@google.com> Signed-off-by: Willy Tarreau <w@1wt.eu> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-25net/tcp-fastopen: Add new API supportWei Wang5-9/+102
This patch adds a new socket option, TCP_FASTOPEN_CONNECT, as an alternative way to perform Fast Open on the active side (client). Prior to this patch, a client needs to replace the connect() call with sendto(MSG_FASTOPEN). This can be cumbersome for applications who want to use Fast Open: these socket operations are often done in lower layer libraries used by many other applications. Changing these libraries and/or the socket call sequences are not trivial. A more convenient approach is to perform Fast Open by simply enabling a socket option when the socket is created w/o changing other socket calls sequence: s = socket() create a new socket setsockopt(s, IPPROTO_TCP, TCP_FASTOPEN_CONNECT …); newly introduced sockopt If set, new functionality described below will be used. Return ENOTSUPP if TFO is not supported or not enabled in the kernel. connect() With cookie present, return 0 immediately. With no cookie, initiate 3WHS with TFO cookie-request option and return -1 with errno = EINPROGRESS. write()/sendmsg() With cookie present, send out SYN with data and return the number of bytes buffered. With no cookie, and 3WHS not yet completed, return -1 with errno = EINPROGRESS. No MSG_FASTOPEN flag is needed. read() Return -1 with errno = EWOULDBLOCK/EAGAIN if connect() is called but write() is not called yet. Return -1 with errno = EWOULDBLOCK/EAGAIN if connection is established but no msg is received yet. Return number of bytes read if socket is established and there is msg received. The new API simplifies life for applications that always perform a write() immediately after a successful connect(). Such applications can now take advantage of Fast Open by merely making one new setsockopt() call at the time of creating the socket. Nothing else about the application's socket call sequence needs to change. Signed-off-by: Wei Wang <weiwan@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Acked-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-25net: Remove __sk_dst_reset() in tcp_v6_connect()Wei Wang1-1/+0
Remove __sk_dst_reset() in the failure handling because __sk_dst_reset() will eventually get called when sk is released. No need to handle it in the protocol specific connect call. This is also to make the code path consistent with ipv4. Signed-off-by: Wei Wang <weiwan@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-25net/tcp-fastopen: refactor cookie check logicWei Wang2-14/+23
Refactor the cookie check logic in tcp_send_syn_data() into a function. This function will be called else where in later changes. Signed-off-by: Wei Wang <weiwan@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Acked-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-25tcp: correct memory barrier usage in tcp_check_space()Jason Baron1-1/+1
sock_reset_flag() maps to __clear_bit() not the atomic version clear_bit(). Thus, we need smp_mb(), smp_mb__after_atomic() is not sufficient. Fixes: 3c7151275c0c ("tcp: add memory barriers to write space paths") Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Jason Baron <jbaron@akamai.com> Acked-by: Eric Dumazet <edumazet@google.com> Reported-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-25tcp: reduce skb overhead in selected placesEric Dumazet2-2/+2
tcp_add_backlog() can use skb_condense() helper to get better gains and less SKB_TRUESIZE() magic. This only happens when socket backlog has to be used. Some attacks involve specially crafted out of order tiny TCP packets, clogging the ofo queue of (many) sockets. Then later, expensive collapse happens, trying to copy all these skbs into single ones. This unfortunately does not work if each skb has no neighbor in TCP sequence order. By using skb_condense() if the skb could not be coalesced to a prior one, we defeat these kind of threats, potentially saving 4K per skb (or more, since this is one page fragment). A typical NAPI driver allocates gro packets with GRO_MAX_HEAD bytes in skb->head, meaning the copy done by skb_condense() is limited to about 200 bytes. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-25tipc: uninitialized return code in tipc_setsockopt()Dan Carpenter1-2/+1
We shuffled some code around and added some new case statements here and now "res" isn't initialized on all paths. Fixes: 01fd12bb189a ("tipc: make replicast a user selectable option") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-25net sched actions: Add support for user cookiesJamal Hadi Salim1-0/+45
Introduce optional 128-bit action cookie. Like all other cookie schemes in the networking world (eg in protocols like http or existing kernel fib protocol field, etc) the idea is to save user state that when retrieved serves as a correlator. The kernel _should not_ intepret it. The user can store whatever they wish in the 128 bits. Sample exercise(showing variable length use of cookie) .. create an accept action with cookie a1b2c3d4 sudo $TC actions add action ok index 1 cookie a1b2c3d4 .. dump all gact actions.. sudo $TC -s actions ls action gact action order 0: gact action pass random type none pass val 0 index 1 ref 1 bind 0 installed 5 sec used 5 sec Action statistics: Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0) backlog 0b 0p requeues 0 cookie a1b2c3d4 .. bind the accept action to a filter.. sudo $TC filter add dev lo parent ffff: protocol ip prio 1 \ u32 match ip dst 127.0.0.1/32 flowid 1:1 action gact index 1 ... send some traffic.. $ ping 127.0.0.1 -c 3 PING 127.0.0.1 (127.0.0.1) 56(84) bytes of data. 64 bytes from 127.0.0.1: icmp_seq=1 ttl=64 time=0.020 ms 64 bytes from 127.0.0.1: icmp_seq=2 ttl=64 time=0.027 ms 64 bytes from 127.0.0.1: icmp_seq=3 ttl=64 time=0.038 ms Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-25sctp: sctp gso should set feature with NETIF_F_SG when calling skb_segmentXin Long1-1/+1
Now sctp gso puts segments into skb's frag_list, then processes these segments in skb_segment. But skb_segment handles them only when gs is enabled, as it's in the same branch with skb's frags. Although almost all the NICs support sg other than some old ones, but since commit 1e16aa3ddf86 ("net: gso: use feature flag argument in all protocol gso handlers"), features &= skb->dev->hw_enc_features, and xfrm_output_gso call skb_segment with features = 0, which means sctp gso would call skb_segment with sg = 0, and skb_segment would not work as expected. This patch is to fix it by setting features param with NETIF_F_SG when calling skb_segment so that it can go the right branch to process the skb's frag_list. Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-25sctp: sctp_addr_id2transport should verify the addr before looking up assocXin Long1-1/+5
sctp_addr_id2transport is a function for sockopt to look up assoc by address. As the address is from userspace, it can be a v4-mapped v6 address. But in sctp protocol stack, it always handles a v4-mapped v6 address as a v4 address. So it's necessary to convert it to a v4 address before looking up assoc by address. This patch is to fix it by calling sctp_verify_addr in which it can do this conversion before calling sctp_endpoint_lookup_assoc, just like what sctp_sendmsg and __sctp_connect do for the address from users. Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-25lwtunnel: Fix oops on state free after encap module unloadRobert Shearman1-1/+5
When attempting to free lwtunnel state after the module for the encap has been unloaded an oops occurs: BUG: unable to handle kernel NULL pointer dereference at 0000000000000008 IP: lwtstate_free+0x18/0x40 [..] task: ffff88003e372380 task.stack: ffffc900001fc000 RIP: 0010:lwtstate_free+0x18/0x40 RSP: 0018:ffff88003fd83e88 EFLAGS: 00010246 RAX: 0000000000000000 RBX: ffff88002bbb3380 RCX: ffff88000c91a300 [..] Call Trace: <IRQ> free_fib_info_rcu+0x195/0x1a0 ? rt_fibinfo_free+0x50/0x50 rcu_process_callbacks+0x2d3/0x850 ? rcu_process_callbacks+0x296/0x850 __do_softirq+0xe4/0x4cb irq_exit+0xb0/0xc0 smp_apic_timer_interrupt+0x3d/0x50 apic_timer_interrupt+0x93/0xa0 [..] Code: e8 6e c6 fc ff 89 d8 5b 5d c3 bb de ff ff ff eb f4 66 90 66 66 66 66 90 55 48 89 e5 53 0f b7 07 48 89 fb 48 8b 04 c5 00 81 d5 81 <48> 8b 40 08 48 85 c0 74 13 ff d0 48 8d 7b 20 be 20 00 00 00 e8 The problem is after the module for the encap can be unloaded the corresponding ops is removed and is thus NULL here. Modules implementing lwtunnel ops should not be allowed to unload while there is state alive using those ops, so grab the module reference for the ops on creating lwtunnel state and of course release the reference when freeing the state. Fixes: 1104d9ba443a ("lwtunnel: Add destroy state operation") Signed-off-by: Robert Shearman <rshearma@brocade.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-25net: Specify the owning module for lwtunnel opsRobert Shearman5-0/+6
Modules implementing lwtunnel ops should not be allowed to unload while there is state alive using those ops, so specify the owning module for all lwtunnel ops. Signed-off-by: Robert Shearman <rshearma@brocade.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-25tipc: fix cleanup at module unloadParthasarathy Bhuvaragan1-3/+1
In tipc_server_stop(), we iterate over the connections with limiting factor as server's idr_in_use. We ignore the fact that this variable is decremented in tipc_close_conn(), leading to premature exit. In this commit, we iterate until the we have no connections left. Acked-by: Ying Xue <ying.xue@windriver.com> Acked-by: Jon Maloy <jon.maloy@ericsson.com> Tested-by: John Thompson <thompa.atl@gmail.com> Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-25tipc: ignore requests when the connection state is not CONNECTEDParthasarathy Bhuvaragan1-6/+7
In tipc_conn_sendmsg(), we first queue the request to the outqueue followed by the connection state check. If the connection is not connected, we should not queue this message. In this commit, we reject the messages if the connection state is not CF_CONNECTED. Acked-by: Ying Xue <ying.xue@windriver.com> Acked-by: Jon Maloy <jon.maloy@ericsson.com> Tested-by: John Thompson <thompa.atl@gmail.com> Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-25tipc: fix nametbl_lock soft lockup at module exitParthasarathy Bhuvaragan1-11/+5
Commit 333f796235a527 ("tipc: fix a race condition leading to subscriber refcnt bug") reveals a soft lockup while acquiring nametbl_lock. Before commit 333f796235a527, we call tipc_conn_shutdown() from tipc_close_conn() in the context of tipc_topsrv_stop(). In that context, we are allowed to grab the nametbl_lock. Commit 333f796235a527, moved tipc_conn_release (renamed from tipc_conn_shutdown) to the connection refcount cleanup. This allows either tipc_nametbl_withdraw() or tipc_topsrv_stop() to the cleanup. Since tipc_exit_net() first calls tipc_topsrv_stop() and then tipc_nametble_withdraw() increases the chances for the later to perform the connection cleanup. The soft lockup occurs in the call chain of tipc_nametbl_withdraw(), when it performs the tipc_conn_kref_release() as it tries to grab nametbl_lock again while holding it already. tipc_nametbl_withdraw() grabs nametbl_lock tipc_nametbl_remove_publ() tipc_subscrp_report_overlap() tipc_subscrp_send_event() tipc_conn_sendmsg() << if (con->flags != CF_CONNECTED) we do conn_put(), triggering the cleanup as refcount=0. >> tipc_conn_kref_release tipc_sock_release tipc_conn_release tipc_subscrb_delete tipc_subscrp_delete tipc_nametbl_unsubscribe << Soft Lockup >> The previous changes in this series fixes the race conditions fixed by commit 333f796235a527. Hence we can now revert the commit. Fixes: 333f796235a52727 ("tipc: fix a race condition leading to subscriber refcnt bug") Reported-and-Tested-by: John Thompson <thompa.atl@gmail.com> Acked-by: Ying Xue <ying.xue@windriver.com> Acked-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-25tipc: fix connection refcount errorParthasarathy Bhuvaragan1-9/+10
Until now, the generic server framework maintains the connection id's per subscriber in server's conn_idr. At tipc_close_conn, we remove the connection id from the server list, but the connection is valid until we call the refcount cleanup. Hence we have a window where the server allocates the same connection to an new subscriber leading to inconsistent reference count. We have another refcount warning we grab the refcount in tipc_conn_lookup() for connections with flag with CF_CONNECTED not set. This usually occurs at shutdown when the we stop the topology server and withdraw TIPC_CFG_SRV publication thereby triggering a withdraw message to subscribers. In this commit, we: 1. remove the connection from the server list at recount cleanup. 2. grab the refcount for a connection only if CF_CONNECTED is set. Tested-by: John Thompson <thompa.atl@gmail.com> Acked-by: Ying Xue <ying.xue@windriver.com> Acked-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-25tipc: add subscription refcount to avoid invalid deleteParthasarathy Bhuvaragan2-54/+71
Until now, the subscribers keep track of the subscriptions using reference count at subscriber level. At subscription cancel or subscriber delete, we delete the subscription only if the timer was pending for the subscription. This approach is incorrect as: 1. del_timer() is not SMP safe, if on CPU0 the check for pending timer returns true but CPU1 might schedule the timer callback thereby deleting the subscription. Thus when CPU0 is scheduled, it deletes an invalid subscription. 2. We export tipc_subscrp_report_overlap(), which accesses the subscription pointer multiple times. Meanwhile the subscription timer can expire thereby freeing the subscription and we might continue to access the subscription pointer leading to memory violations. In this commit, we introduce subscription refcount to avoid deleting an invalid subscription. Reported-and-Tested-by: John Thompson <thompa.atl@gmail.com> Acked-by: Ying Xue <ying.xue@windriver.com> Acked-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-25tipc: fix nametbl_lock soft lockup at node/link eventsParthasarathy Bhuvaragan1-2/+7
We trigger a soft lockup as we grab nametbl_lock twice if the node has a pending node up/down or link up/down event while: - we process an incoming named message in tipc_named_rcv() and perform an tipc_update_nametbl(). - we have pending backlog items in the name distributor queue during a nametable update using tipc_nametbl_publish() or tipc_nametbl_withdraw(). The following are the call chain associated: tipc_named_rcv() Grabs nametbl_lock tipc_update_nametbl() (publish/withdraw) tipc_node_subscribe()/unsubscribe() tipc_node_write_unlock() << lockup occurs if an outstanding node/link event exits, as we grabs nametbl_lock again >> tipc_nametbl_withdraw() Grab nametbl_lock tipc_named_process_backlog() tipc_update_nametbl() << rest as above >> The function tipc_node_write_unlock(), in addition to releasing the lock processes the outstanding node/link up/down events. To do this, we need to grab the nametbl_lock again leading to the lockup. In this commit we fix the soft lockup by introducing a fast variant of node_unlock(), where we just release the lock. We adapt the node_subscribe()/node_unsubscribe() to use the fast variants. Reported-and-Tested-by: John Thompson <thompa.atl@gmail.com> Acked-by: Ying Xue <ying.xue@windriver.com> Acked-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-24netfilter: nf_tables: bump set->ndeact on set flushPablo Neira Ayuso1-0/+1
Add missing set->ndeact update on each deactivated element from the set flush path. Otherwise, sets with fixed size break after flush since accounting breaks. # nft add set x y { type ipv4_addr\; size 2\; } # nft add element x y { 1.1.1.1 } # nft add element x y { 1.1.1.2 } # nft flush set x y # nft add element x y { 1.1.1.1 } <cmdline>:1:1-28: Error: Could not process rule: Too many open files in system Fixes: 8411b6442e59 ("netfilter: nf_tables: support for set flushing") Reported-by: Elise Lennion <elise.lennion@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-01-24netfilter: nf_tables: deconstify walk callback functionPablo Neira Ayuso3-14/+14
The flush operation needs to modify set and element objects, so let's deconstify this. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-01-24netfilter: nf_tables: fix set->nelems counting with no NLM_F_EXCLPablo Neira Ayuso1-7/+9
If the element exists and no NLM_F_EXCL is specified, do not bump set->nelems, otherwise we leak one set element slot. This problem amplifies if the set is full since the abort path always decrements the counter for the -ENFILE case too, giving one spare extra slot. Fix this by moving set->nelems update to nft_add_set_elem() after successful element insertion. Moreover, remove the element if the set is full so there is no need to rely on the abort path to undo things anymore. Fixes: c016c7e45ddf ("netfilter: nf_tables: honor NLM_F_EXCL flag in set element insertion") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-01-24netfilter: nft_log: restrict the log prefix length to 127Liping Zhang2-2/+2
First, log prefix will be truncated to NF_LOG_PREFIXLEN-1, i.e. 127, at nf_log_packet(), so the extra part is useless. Second, after adding a log rule with a very very long prefix, we will fail to dump the nft rules after this _special_ one, but acctually, they do exist. For example: # name_65000=$(printf "%0.sQ" {1..65000}) # nft add rule filter output log prefix "$name_65000" # nft add rule filter output counter # nft add rule filter output counter # nft list chain filter output table ip filter { chain output { type filter hook output priority 0; policy accept; } } So now, restrict the log prefix length to NF_LOG_PREFIXLEN-1. Fixes: 96518518cc41 ("netfilter: add nftables") Signed-off-by: Liping Zhang <zlpnobody@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-01-24net: sctp: fix array overrun read on sctp_timer_tblColin Ian King1-1/+4
Table sctp_timer_tbl is missing a TIMEOUT_RECONF string so add this in. Also compare timeout with the size of the array sctp_timer_tbl rather than SCTP_EVENT_TIMEOUT_MAX. Also add a build time check that SCTP_EVENT_TIMEOUT_MAX is correct so we don't ever get this kind of mismatch between the table and SCTP_EVENT_TIMEOUT_MAX in the future. Kudos to Marcelo Ricardo Leitner for spotting the missing string and suggesting the build time sanity check. Fixes CoverityScan CID#1397639 ("Out-of-bounds read") Fixes: 7b9438de0cd4 ("sctp: add stream reconf timer") Signed-off-by: Colin Ian King <colin.king@canonical.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Reviewed-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-24net: dsa: Drop WARN() in tag_brcm.cFlorian Fainelli1-1/+2
We may be able to see invalid Broadcom tags when the hardware and drivers are misconfigured, or just while exercising the error path. Instead of flooding the console with messages, flat out drop the packet. Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-24ipv6: fix ip6_tnl_parse_tlv_enc_lim()Eric Dumazet1-12/+22
This function suffers from multiple issues. First one is that pskb_may_pull() may reallocate skb->head, so the 'raw' pointer needs either to be reloaded or not used at all. Second issue is that NEXTHDR_DEST handling does not validate that the options are present in skb->data, so we might read garbage or access non existent memory. With help from Willem de Bruijn. Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Dmitry Vyukov <dvyukov@google.com> Cc: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-24ip6_tunnel: must reload ipv6h in ip6ip6_tnl_xmit()Eric Dumazet2-0/+5
Since ip6_tnl_parse_tlv_enc_lim() can call pskb_may_pull(), we must reload any pointer that was related to skb->head (or skb->data), or risk use after free. Fixes: c12b395a4664 ("gre: Support GRE over IPv6") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Dmitry Kozlov <xeb@mail.ru> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-24bpf: allow option for setting bpf_l4_csum_replace from scratchDaniel Borkmann1-3/+4
When programs need to calculate the csum from scratch for small UDP packets and use bpf_l4_csum_replace() to feed the result from helpers like bpf_csum_diff(), then we need a flag besides BPF_F_MARK_MANGLED_0 that would ignore the case of current csum being 0, and which would still allow for the helper to set the csum and transform when needed to CSUM_MANGLED_0. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-24bpf: enable load bytes helper for filter/reuseport progsDaniel Borkmann1-15/+26
BPF_PROG_TYPE_SOCKET_FILTER are used in various facilities such as for SO_REUSEPORT and packet fanout demuxing, packet filtering, kcm, etc, and yet the only facility they can use is BPF_LD with {BPF_ABS, BPF_IND} for single byte/half/word access. Direct packet access is only restricted to tc programs right now, but we can still facilitate usage by allowing skb_load_bytes() helper added back then in 05c74e5e53f6 ("bpf: add bpf_skb_load_bytes helper") that calls skb_header_pointer() similarly to bpf_load_pointer(), but for stack buffers with larger access size. Name the previous sk_filter_func_proto() as bpf_base_func_proto() since this is used everywhere else as well, similarly for the ctx converter, that is, bpf_convert_ctx_access(). Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>