summaryrefslogtreecommitdiff
path: root/net
AgeCommit message (Collapse)AuthorFilesLines
2018-02-27ip_tunnel: Rename & publish init_tunnel_flowPetr Machata1-28/+12
Initializing struct flowi4 is useful for drivers that need to emulate routing decisions made by a tunnel interface. Publish the function (appropriately renamed) so that the drivers in question don't need to cut'n'paste it around. Signed-off-by: Petr Machata <petrm@mellanox.com> Reviewed-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-27net: GRE: Add is_gretap_dev, is_ip6gretap_devPetr Machata2-0/+12
Determining whether a device is a GRE device is easily done by inspecting struct net_device.type. However, for the tap variants, the type is just ARPHRD_ETHER. Therefore introduce two predicate functions that use netdev_ops to tell the tap devices. Signed-off-by: Petr Machata <petrm@mellanox.com> Reviewed-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-27rds: deliver zerocopy completion notification with dataSowmini Varadhan4-25/+53
This commit is an optimization over commit 01883eda72bd ("rds: support for zcopy completion notification") for PF_RDS sockets. RDS applications are predominantly request-response transactions, so it is more efficient to reduce the number of system calls and have zerocopy completion notification delivered as ancillary data on the POLLIN channel. Cookies are passed up as ancillary data (at level SOL_RDS) in a struct rds_zcopy_cookies when the returned value of recvmsg() is greater than, or equal to, 0. A max of RDS_MAX_ZCOOKIES may be passed with each message. This commit removes support for zerocopy completion notification on MSG_ERRQUEUE for PF_RDS sockets. Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Acked-by: Willem de Bruijn <willemb@google.com> Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-27Merge branch 'ieee802154-for-davem-2018-02-26' of ↵David S. Miller1-0/+1
git://git.kernel.org/pub/scm/linux/kernel/git/sschmidt/wpan-next Stefan Schmidt says: ==================== pull-request: ieee802154-next 2018-02-26 An update from ieee802154 for *net-next* Alexander corrected a setting which got lost during some 6lowpan rework a while back and Xue Liu provided us with a new driver for the MCR20A transceiver. If there are any issues let me know. If not, please pull. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-27net: Convert defrag6_net_opsKirill Tkhai1-0/+1
These pernet_operations only unregister nf hooks. So, they are able to be marked as async. Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-27net: Convert ila_net_opsKirill Tkhai1-0/+1
These pernet_operations register and unregister nf hooks. Also they populate and depopulate ila_net_id-pointed hash table. The table is changed by hooks during skb processing and via netlink request. It looks impossible for another net pernet_operations to force the table reading or writing, so, they are able to be marked as async. Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-27net: Convert defrag4_net_opsKirill Tkhai1-0/+1
These pernet_operations only unregister nf hooks. So, they are able to be marked as async. Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-27net: Convert clusterip_net_opsKirill Tkhai1-0/+1
These pernet_operations register and unregister nf hooks, and populate and destroy /proc entry. So, they are able to be marked as async. Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-27net: Convert brnf_net_opsKirill Tkhai1-0/+1
These pernet_operations only unregister nf hooks. So, they are able to be marked as async. Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-27net: Convert cfg802154_pernet_opsKirill Tkhai1-0/+1
These pernet_operations have only exit method, which moves devices from cfg802154_rdev_list to init_net. This may occur in any time from nl802154_wpan_phy_netns(), so we are nice with rtnl_lock() synchronization. Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Acked-by: Stefan Schmidt <stefan@osg.samsung.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-27net: Convert sit_net_opsKirill Tkhai1-0/+1
These pernet_operations are similar to ip6_tnl_net_ops. Exit method unregisters all net sit devices, and it looks like another pernet_operations are not interested in foreign net sit list. Init method registers netdevice. So, it's possible to mark them async. Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-27net: Convert vti6_net_opsKirill Tkhai1-0/+1
These pernet_operations are similar to ip6_tnl_net_ops. Exit method unregisters all net vti6 tunnels, and it looks like another pernet_operations are not interested in foreign net vti6 list. Init method registers netdevice. So, it's possible to mark them async. Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-27net: Convert ip6_tnl_net_opsKirill Tkhai1-0/+1
These pernet_operations are similar to ip6gre_net_ops. Exit method unregisters all net ip6_tnl tunnels, and it looks like another pernet_operations are not interested in foreign net tunnels list. So, it's possible to mark them async. Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-27net: Convert ip6gre_net_opsKirill Tkhai1-0/+1
These pernet_operations are similar to bond_net_ops. Exit method unregisters all net ip6gre devices, and it looks like another pernet_operations are not interested in foreign net ip6gre list or net_generic()->tunnels_wc. Init method registers net device. So, it's possible to mark them async. Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-27net: Convert ipgre_net_ops, ipgre_tap_net_ops, erspan_net_ops, vti_net_ops ↵Kirill Tkhai3-0/+5
and ipip_net_ops These pernet_operations are similar to bond_net_ops. Exit methods unregisters all net ipgre/ipgre_tap/erspan/vti/ipip devices, and it looks like another pernet_operations are not interested in foreign net ipgre/ipgre_tap/erspan/vti/ipip list. Init method also does not intersect with something pernet-specific. So, it's possible to mark them async. Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-27net: Convert br_net_opsKirill Tkhai1-0/+1
These pernet_operations are similar to bond_net_ops. Exit method unregisters all net bridge devices, and it looks like another pernet_operations are not interested in foreign net bridge list. So, it's possible to mark them async. Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-27net: Convert tc_action_net_init() and tc_action_net_exit() based ↵Kirill Tkhai16-0/+17
pernet_operations These pernet_operations are from net/sched directory, and they call only tc_action_net_init() and tc_action_net_exit(): bpf_net_ops connmark_net_ops csum_net_ops gact_net_ops ife_net_ops ipt_net_ops xt_net_ops mirred_net_ops nat_net_ops pedit_net_ops police_net_ops sample_net_ops simp_net_ops skbedit_net_ops skbmod_net_ops tunnel_key_net_ops vlan_net_ops 1)tc_action_net_init() just allocates and initializes per-net memory. 2)There should not be in-flight packets at the time of tc_action_net_exit() call, or another pernet_operations send packets to dying net (except netlink). So, it seems they can be marked as async. Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-27net: Convert sysctl creating and destroying pernet_operationsKirill Tkhai2-0/+2
These pernet_operations create and destroy sysctl tables, and they are able to be executed in parallel with any others: ip_vs_lblc_ops ip_vs_lblcr_ops Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-27net: Convert simple pernet_operationsKirill Tkhai3-0/+3
These pernet_operations make pretty simple actions like variable initialization on init, debug checks on exit, and so on, and they obviously are able to be executed in parallel with any others: vrf_net_ops lockd_net_ops grace_net_ops xfrm6_tunnel_net_ops kcm_net_ops tcf_net_ops Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-27net: Convert synproxy_net_opsKirill Tkhai1-0/+1
These pernet_operations create and destroy /proc entries and allocate extents to template ct, which depend on global nf_ct_ext_types[] array. So, we are able to mark them async. Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-27net: Convert hashlimit_net_ops and recent_net_opsKirill Tkhai2-0/+2
These pernet_operations just create and destroy /proc entries. Also, new /proc entries also may come after new nf rules are added, but this is not possible, when net isn't alive. So, they are safe to be marked as async. Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-27net: Convert /proc creating and destroying pernet_operationsKirill Tkhai6-0/+6
These pernet_operations just create and destroy /proc entries, and they can safely marked as async: pppoe_net_ops vlan_net_ops canbcm_pernet_ops kcm_net_ops pfkey_net_ops pppol2tp_net_ops phonet_net_ops Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-26net: make kmem caches as __ro_after_initAlexey Dobriyan6-8/+11
All kmem caches aren't reallocated once set up. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-24Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller70-337/+368
2018-02-23net: fib_rules: Add new attribute to set protocolDonald Sharp1-4/+11
For ages iproute2 has used `struct rtmsg` as the ancillary header for FIB rules and in the process set the protocol value to RTPROT_BOOT. Until ca56209a66 ("net: Allow a rule to track originating protocol") the kernel rules code ignored the protocol value sent from userspace and always returned 0 in notifications. To avoid incompatibility with existing iproute2, send the protocol as a new attribute. Fixes: cac56209a66 ("net: Allow a rule to track originating protocol") Signed-off-by: Donald Sharp <sharpd@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-23net_sched: gen_estimator: fix broken estimators based on percpu statsEric Dumazet1-0/+1
pfifo_fast got percpu stats lately, uncovering a bug I introduced last year in linux-4.10. I missed the fact that we have to clear our temporary storage before calling __gnet_stats_copy_basic() in the case of percpu stats. Without this fix, rate estimators (tc qd replace dev xxx root est 1sec 4sec pfifo_fast) are utterly broken. Fixes: 1c0d32fde5bd ("net_sched: gen_estimator: complete rewrite of rate estimators") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-23Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfDavid S. Miller1-5/+1
Alexei Starovoitov says: ==================== pull-request: bpf 2018-02-22 The following pull-request contains BPF updates for your *net* tree. The main changes are: 1) two urgent fixes for bpf_tail_call logic for x64 and arm64 JITs, from Daniel. 2) cond_resched points in percpu array alloc/free paths, from Eric. 3) lockdep and other minor fixes, from Yonghong, Arnd, Anders, Li. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-23rds: rds_msg_zcopy should return error of null rm->data.op_mmp_znotifierSowmini Varadhan1-1/+2
if either or both of MSG_ZEROCOPY and SOCK_ZEROCOPY have not been specified, the rm->data.op_mmp_znotifier allocation will be skipped. In this case, it is invalid ot pass down a cmsghdr with RDS_CMSG_ZCOPY_COOKIE, so return EINVAL from rds_msg_zcopy for this case. Reported-by: syzbot+f893ae7bb2f6456dfbc3@syzkaller.appspotmail.com Fixes: 0cebaccef3ac ("rds: zerocopy Tx support.") Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Acked-by: Willem de Bruijn <willemb@google.com> Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-23ipv6 sit: work around bogus gcc-8 -Wrestrict warningArnd Bergmann1-1/+1
gcc-8 has a new warning that detects overlapping input and output arguments in memcpy(). It triggers for sit_init_net() calling ipip6_tunnel_clone_6rd(), which is actually correct: net/ipv6/sit.c: In function 'sit_init_net': net/ipv6/sit.c:192:3: error: 'memcpy' source argument is the same as destination [-Werror=restrict] The problem here is that the logic detecting the memcpy() arguments finds them to be the same, but the conditional that tests for the input and output of ipip6_tunnel_clone_6rd() to be identical is not a compile-time constant. We know that netdev_priv(t->dev) is the same as t for a tunnel device, and comparing "dev" directly here lets the compiler figure out as well that 'dev == sitn->fb_tunnel_dev' when called from sit_init_net(), so it no longer warns. This code is old, so Cc stable to make sure that we don't get the warning for older kernels built with new gcc. Cc: Martin Sebor <msebor@gmail.com> Link: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83456 Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-22rxrpc: Fix send in rxrpc_send_data_packet()David Howells1-1/+1
All the kernel_sendmsg() calls in rxrpc_send_data_packet() need to send both parts of the iov[] buffer, but one of them does not. Fix it so that it does. Without this, short IPv6 rxrpc DATA packets may be seen that have the rxrpc header included, but no payload. Fixes: 5a924b8951f8 ("rxrpc: Don't store the rxrpc header in the Tx queue sk_buffs") Reported-by: Marc Dionne <marc.dionne@auristor.com> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-22Merge tag 'mac80211-next-for-davem-2018-02-22' of ↵David S. Miller16-61/+287
git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next Johannes Berg says: ==================== Various updates across wireless. One thing to note: I've included a new ethertype that wireless uses (ETH_P_PREAUTH) in if_ether.h. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-22Merge tag 'mac80211-for-davem-2018-02-22' of ↵David S. Miller8-24/+38
git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211 Johannes Berg says: ==================== Various fixes across the tree, the shortlog basically says it all: cfg80211: fix cfg80211_beacon_dup -> old bug in this code cfg80211: clear wep keys after disconnection -> certain ways of disconnecting left the keys mac80211: round IEEE80211_TX_STATUS_HEADROOM up to multiple of 4 -> alignment issues with using 14 bytes mac80211: Do not disconnect on invalid operating class -> if the AP has a bogus operating class, let it be mac80211: Fix sending ADDBA response for an ongoing session -> don't send the same frame twice cfg80211: use only 1Mbps for basic rates in mesh -> interop issue with old versions of our code mac80211_hwsim: don't use WQ_MEM_RECLAIM -> it causes splats because it flushes work on a non-reclaim WQ regulatory: add NUL to request alpha2 -> nla_put_string() issue from Kees mac80211: mesh: fix wrong mesh TTL offset calculation -> protocol issue mac80211: fix a possible leak of station stats -> error path might leak memory mac80211: fix calling sleeping function in atomic context -> percpu allocations need to be made with gfp flags ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-22mac80211: Call mgd_prep_tx before transmitting deauthenticationIlan Peer2-1/+16
In multi channel scenarios, when disassociating from the AP before a beacon was heard from the AP, it is not guaranteed that the virtual interface is granted air time for the transmission of the deauthentication frame. This in turn can lead to various issues as the AP might never get the deauthentication frame. To mitigate such possible issues, add a HW flag indicating that the driver requires mac80211 to call the mgd_prep_tx() driver callback to make sure that the virtual interface is granted immediate airtime to be able to transmit the frame, in case that no beacon was heard from the AP. Signed-off-by: Ilan Peer <ilan.peer@intel.com> Signed-off-by: Luca Coelho <luciano.coelho@intel.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2018-02-22mac80211: add get TID helperSara Sharon7-21/+11
Extracting the TID from the QOS header is common enough to justify helper. Signed-off-by: Sara Sharon <sara.sharon@intel.com> Signed-off-by: Luca Coelho <luciano.coelho@intel.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2018-02-22mac80211: support reporting A-MPDU EOF bit value/knownJohannes Berg1-0/+4
Support getting the EOF bit value reported from hardware and writing it out to radiotap. Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2018-02-22net: ipv4: Set addr_type in hash_keys for forwarded caseDavid Ahern1-0/+2
The result of the skb flow dissect is copied from keys to hash_keys to ensure only the intended data is hashed. The original L4 hash patch overlooked setting the addr_type for this case; add it. Fixes: bf4e0a3db97eb ("net: ipv4: add support for ECMP hash policy choice") Reported-by: Ido Schimmel <idosch@idosch.org> Signed-off-by: David Ahern <dsahern@gmail.com> Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Reviewed-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-22tcp_bbr: better deal with suboptimal GSOEric Dumazet1-4/+5
BBR uses tcp_tso_autosize() in an attempt to probe what would be the burst sizes and to adjust cwnd in bbr_target_cwnd() with following gold formula : /* Allow enough full-sized skbs in flight to utilize end systems. */ cwnd += 3 * bbr->tso_segs_goal; But GSO can be lacking or be constrained to very small units (ip link set dev ... gso_max_segs 2) What we really want is to have enough packets in flight so that both GSO and GRO are efficient. So in the case GSO is off or downgraded, we still want to have the same number of packets in flight as if GSO/TSO was fully operational, so that GRO can hopefully be working efficiently. To fix this issue, we make tcp_tso_autosize() unaware of sk->sk_gso_max_segs Only tcp_tso_segs() has to enforce the gso_max_segs limit. Tested: ethtool -K eth0 tso off gso off tc qd replace dev eth0 root pfifo_fast Before patch: for f in {1..5}; do ./super_netperf 1 -H lpaa24 -- -K bbr; done     691  (ss -temoi shows cwnd is stuck around 6 )     667     651     631     517 After patch : # for f in {1..5}; do ./super_netperf 1 -H lpaa24 -- -K bbr; done    1733 (ss -temoi shows cwnd is around 386 )    1778    1746    1781    1718 Fixes: 0f8782ea1497 ("tcp_bbr: add BBR congestion control") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Oleksandr Natalenko <oleksandr@natalenko.name> Acked-by: Neal Cardwell <ncardwell@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-22netlink: put module reference if dump start failsJason A. Donenfeld1-1/+3
Before, if cb->start() failed, the module reference would never be put, because cb->cb_running is intentionally false at this point. Users are generally annoyed by this because they can no longer unload modules that leak references. Also, it may be possible to tediously wrap a reference counter back to zero, especially since module.c still uses atomic_inc instead of refcount_inc. This patch expands the error path to simply call module_put if cb->start() fails. Fixes: 41c87425a1ac ("netlink: do not set cb_running if dump's start() errs") Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-22bpf: clean up unused-variable warningArnd Bergmann1-5/+1
The only user of this variable is inside of an #ifdef, causing a warning without CONFIG_INET: net/core/filter.c: In function '____bpf_sock_ops_cb_flags_set': net/core/filter.c:3382:6: error: unused variable 'val' [-Werror=unused-variable] int val = argval & BPF_SOCK_OPS_ALL_CB_FLAGS; This replaces the #ifdef with a nicer IS_ENABLED() check that makes the code more readable and avoids the warning. Fixes: b13d88072172 ("bpf: Adds field bpf_sock_ops_cb_flags to tcp_sock") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-02-22net: Allow a rule to track originating protocolDonald Sharp1-1/+6
Allow a rule that is being added/deleted/modified or dumped to contain the originating protocol's id. The protocol is handled just like a routes originating protocol is. This is especially useful because there is starting to be a plethora of different user space programs adding rules. Allow the vrf device to specify that the kernel is the originator of the rule created for this device. Signed-off-by: Donald Sharp <sharpd@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-21Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nfDavid S. Miller54-300/+310
Pablo Neira Ayuso says: ==================== Netfilter fixes for net The following patchset contains large batch with Netfilter fixes for your net tree, mostly due to syzbot report fixups and pr_err() ratelimiting, more specifically, they are: 1) Get rid of superfluous unnecessary check in x_tables before vmalloc(), we don't hit BUG there anymore, patch from Michal Hock, suggested by Andrew Morton. 2) Race condition in proc file creation in ipt_CLUSTERIP, from Cong Wang. 3) Drop socket lock that results in circular locking dependency, patch from Paolo Abeni. 4) Drop packet if case of malformed blob that makes backpointer jump in x_tables, from Florian Westphal. 5) Fix refcount leak due to race in ipt_CLUSTERIP in clusterip_config_find_get(), from Cong Wang. 6) Several patches to ratelimit pr_err() for x_tables since this can be a problem where CAP_NET_ADMIN semantics can protect us in untrusted namespace, from Florian Westphal. 7) Missing .gitignore update for new autogenerated asn1 state machine for the SNMP NAT helper, from Zhu Lingshan. 8) Missing timer initialization in xt_LED, from Paolo Abeni. 9) Do not allow negative port range in NAT, also from Paolo. 10) Lock imbalance in the xt_hashlimit rate match mode, patch from Eric Dumazet. 11) Initialize workqueue before timer in the idletimer match, from Eric Dumazet. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-21tcp: remove dead code after CHECKSUM_PARTIAL adoptionEric Dumazet2-43/+8
Since all skbs in write/rtx queues have CHECKSUM_PARTIAL, we can remove dead code. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-21tcp: remove dead code from tcp_set_skb_tso_segs()Eric Dumazet1-1/+1
We no longer have skbs with skb->ip_summed == CHECKSUM_NONE in TCP write queues. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-21tcp: tcp_sendmsg() only deals with CHECKSUM_PARTIALEric Dumazet1-8/+3
We no longer have skbs with skb->ip_summed == CHECKSUM_NONE in TCP write queues. We can remove dead code in tcp_sendmsg(). Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-21tcp: remove sk_check_csum_caps()Eric Dumazet1-8/+3
Since TCP relies on GSO, we do not need this helper anymore. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-21tcp: remove sk_can_gso() useEric Dumazet2-29/+8
After previous commit, sk_can_gso() is always true. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-21tcp: switch to GSO being always onEric Dumazet2-1/+2
Oleksandr Natalenko reported performance issues with BBR without FQ packet scheduler that were root caused to lack of SG and GSO/TSO on his configuration. In this mode, TCP internal pacing has to setup a high resolution timer for each MSS sent. We could implement in TCP a strategy similar to the one adopted in commit fefa569a9d4b ("net_sched: sch_fq: account for schedule/timers drifts") or decide to finally switch TCP stack to a GSO only mode. This has many benefits : 1) Most TCP developments are done with TSO in mind. 2) Less high-resolution timers needs to be armed for TCP-pacing 3) GSO can benefit of xmit_more hint 4) Receiver GRO is more effective (as if TSO was used for real on sender) -> Lower ACK traffic 5) Write queues have less overhead (one skb holds about 64KB of payload) 6) SACK coalescing just works. 7) rtx rb-tree contains less packets, SACK is cheaper. This patch implements the minimum patch, but we can remove some legacy code as follow ups. Tested: On 40Gbit link, one netperf -t TCP_STREAM BBR+fq: sg on: 26 Gbits/sec sg off: 15.7 Gbits/sec (was 2.3 Gbit before patch) BBR+pfifo_fast: sg on: 24.2 Gbits/sec sg off: 14.9 Gbits/sec (was 0.66 Gbit before patch !!! ) BBR+fq_codel: sg on: 24.4 Gbits/sec sg off: 15 Gbits/sec (was 0.66 Gbit before patch !!! ) Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Oleksandr Natalenko <oleksandr@natalenko.name> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-21rds: send: mark expected switch fall-through in rds_rm_sizeGustavo A. R. Silva1-0/+2
In preparation to enabling -Wimplicit-fallthrough, mark switch cases where we are expecting to fall through. Addresses-Coverity-ID: 1465362 ("Missing break in switch") Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com> Acked-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-21net: sched: add em_ipt ematch for calling xtables matchesEyal Birger3-0/+270
The commit a new tc ematch for using netfilter xtable matches. This allows early classification as well as mirroning/redirecting traffic based on logic implemented in netfilter extensions. Current supported use case is classification based on the incoming IPSec state used during decpsulation using the 'policy' iptables extension (xt_policy). The module dynamically fetches the netfilter match module and calls it using a fake xt_action_param structure based on validated userspace provided parameters. As the xt_policy match does not access skb->data, no skb modifications are needed on match. Signed-off-by: Eyal Birger <eyal.birger@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-21net: sched: report if filter is too large to dumpRoman Kapl1-1/+6
So far, if the filter was too large to fit in the allocated skb, the kernel did not return any error and stopped dumping. Modify the dumper so that it returns -EMSGSIZE when a filter fails to dump and it is the first filter in the skb. If we are not first, we will get a next chance with more room. I understand this is pretty near to being an API change, but the original design (silent truncation) can be considered a bug. Note: The error case can happen pretty easily if you create a filter with 32 actions and have 4kb pages. Also recent versions of iproute try to be clever with their buffer allocation size, which in turn leads to Signed-off-by: Roman Kapl <code@rkapl.cz> Acked-by: Jiri Pirko <jiri@mellanox.com> Acked-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>