2015-10-22  openvswitch: Serialize nested ct actions if provided  (Joe Stringer; 1 file, -4/+10)
If userspace provides a ct action with no nested mark or label, then the storage for these fields is zeroed. Later when actions are requested, such zeroed fields are serialized even though userspace didn't originally specify them. Fix the behaviour by ensuring that no action is serialized in this case, and reject actions where userspace attempts to set these fields with mask=0. This should make netlink marshalling consistent across deserialization/reserialization. Reported-by: Jarno Rajahalme <jrajahalme@nicira.com> Signed-off-by: Joe Stringer <joestringer@nicira.com> Acked-by: Pravin B Shelar <pshelar@nicira.com> Acked-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-22  openvswitch: Mark connections new when not confirmed.  (Joe Stringer; 1 file, -5/+5)
New, related connections are marked as such as part of ovs_ct_lookup(), but they are not marked as "new" if the commit flag is used. Make this consistent by setting the "new" flag whenever !nf_ct_is_confirmed(ct). Reported-by: Jarno Rajahalme <jrajahalme@nicira.com> Signed-off-by: Joe Stringer <joestringer@nicira.com> Acked-by: Thomas Graf <tgraf@suug.ch> Acked-by: Pravin B Shelar <pshelar@nicira.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-22  openvswitch: Clarify conntrack COMMIT behaviour  (Joe Stringer; 1 file, -1/+2)
The presence of this attribute does not modify the ct_state for the current packet, only future packets. Make this more clear in the header definition. Signed-off-by: Joe Stringer <joestringer@nicira.com> Acked-by: Thomas Graf <tgraf@suug.ch> Acked-by: Pravin B Shelar <pshelar@nicira.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-22  openvswitch: Reject ct_state masks for unknown bits  (Joe Stringer; 2 files, -12/+9)
Currently, 0-bits are generated in ct_state where the bit position is undefined, and matches are accepted on these bit-positions. If userspace requests to match the 0-value for this bit then it may expect only a subset of traffic to match this value, whereas currently all packets will have this bit set to 0. Fix this by rejecting such masks. Signed-off-by: Joe Stringer <joestringer@nicira.com> Acked-by: Pravin B Shelar <pshelar@nicira.com> Acked-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
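A minimal sketch of this kind of validation (kernel-context C; the OVS_CS_F_* flags are the uapi ct_state flags, but the mask macro and the function name here are illustrative, not the upstream symbols):

    /* Sketch: reject a ct_state mask that asks to match bits the kernel
     * does not define; the "supported" mask below is illustrative. */
    #define CT_SUPPORTED_MASK (OVS_CS_F_NEW | OVS_CS_F_ESTABLISHED | \
                               OVS_CS_F_RELATED | OVS_CS_F_INVALID | \
                               OVS_CS_F_REPLY_DIR | OVS_CS_F_TRACKED)

    static int validate_ct_state_mask(u32 ct_state_mask)
    {
            if (ct_state_mask & ~CT_SUPPORTED_MASK)
                    return -EINVAL;   /* userspace asked to match an unknown bit */
            return 0;
    }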
2015-10-22  tcp: remove improper preemption check in tcp_xmit_probe_skb()  (Renato Westphal; 1 file, -1/+1)
Commit e520af48c7e5a introduced the following bug when setting the TCP_REPAIR sockoption: [ 2860.657036] BUG: using __this_cpu_add() in preemptible [00000000] code: daemon/12164 [ 2860.657045] caller is __this_cpu_preempt_check+0x13/0x20 [ 2860.657049] CPU: 1 PID: 12164 Comm: daemon Not tainted 4.2.3 #1 [ 2860.657051] Hardware name: Dell Inc. PowerEdge R210 II/0JP7TR, BIOS 2.0.5 03/13/2012 [ 2860.657054] ffffffff81c7f071 ffff880231e9fdf8 ffffffff8185d765 0000000000000002 [ 2860.657058] 0000000000000001 ffff880231e9fe28 ffffffff8146ed91 ffff880231e9fe18 [ 2860.657062] ffffffff81cd1a5d ffff88023534f200 ffff8800b9811000 ffff880231e9fe38 [ 2860.657065] Call Trace: [ 2860.657072] [<ffffffff8185d765>] dump_stack+0x4f/0x7b [ 2860.657075] [<ffffffff8146ed91>] check_preemption_disabled+0xe1/0xf0 [ 2860.657078] [<ffffffff8146edd3>] __this_cpu_preempt_check+0x13/0x20 [ 2860.657082] [<ffffffff817e0bc7>] tcp_xmit_probe_skb+0xc7/0x100 [ 2860.657085] [<ffffffff817e1e2d>] tcp_send_window_probe+0x2d/0x30 [ 2860.657089] [<ffffffff817d1d8c>] do_tcp_setsockopt.isra.29+0x74c/0x830 [ 2860.657093] [<ffffffff817d1e9c>] tcp_setsockopt+0x2c/0x30 [ 2860.657097] [<ffffffff81767b74>] sock_common_setsockopt+0x14/0x20 [ 2860.657100] [<ffffffff817669e1>] SyS_setsockopt+0x71/0xc0 [ 2860.657104] [<ffffffff81865172>] entry_SYSCALL_64_fastpath+0x16/0x75 Since tcp_xmit_probe_skb() can be called from process context, use NET_INC_STATS() instead of NET_INC_STATS_BH(). Fixes: e520af48c7e5 ("tcp: add TCPWinProbe and TCPKeepAlive SNMP counters") Signed-off-by: Renato Westphal <renatow@taghos.com.br> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-22  Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf  (David S. Miller; 5 files, -4/+6)
Pablo Neira Ayuso says:
====================
Netfilter fixes for net

The following patchset contains four Netfilter fixes for net, they are:

1) Fix Kconfig dependencies of the new nf_dup_ipv4 and nf_dup_ipv6.
2) Remove a bogus nh_scope test in the IPv4 rpfilter match that is breaking --accept-local, from Xin Long.
3) Wait for the RCU grace period after dropping the pending packets in the nfqueue, from Florian Westphal.
4) Fix a sleeping allocation while holding spin_lock_bh, from Nikolay Borisov.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-22  netlink: Rightsize IFLA_AF_SPEC size calculation  (Arad, Ronen; 5 files, -28/+11)
if_nlmsg_size() overestimates the minimum allocation size of a netlink dump request (when called from rtnl_calcit()) or the size of the message (when called from rtnl_getlink()). This is because ext_filter_mask is not supported by rtnl_link_get_af_size() and rtnl_link_get_size(). The over-estimation is significant when at least one netdev has many VLANs configured (8 bytes for each configured VLAN). This patch-set "rightsizes" the protocol-specific attribute size calculation by propagating ext_filter_mask to rtnl_link_get_af_size() and adding it as an argument to the get_link_af_size op in rtnl_af_ops. The bridge module already used filtering-aware sizing for notifications. br_get_link_af_size_filtered() is consistent with the modified get_link_af_size op, so it replaces br_get_link_af_size() in br_af_ops. br_get_link_af_size() becomes unused and is thus removed. Signed-off-by: Ronen Arad <ronen.arad@intel.com> Acked-by: Sridhar Samudrala <sridhar.samudrala@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-22  tipc: conditionally expand buffer headroom over udp tunnel  (Jon Paul Maloy; 1 file, -0/+5)
In commit d999297c3dbbe ("tipc: reduce locking scope during packet reception") we altered the packet retransmission function. Since then, when retransmitting packets, we create a clone of the original buffer using __pskb_copy(skb, MIN_H_SIZE), where MIN_H_SIZE is the size of the area we want to have copied, but also the smallest possible TIPC packet size. The value of MIN_H_SIZE is 24. Unfortunately, __pskb_copy() also has the effect that the headroom of the cloned buffer takes the size MIN_H_SIZE. This is too small for carrying the packet over the UDP tunnel bearer, which requires a minimum headroom of 28 bytes. A change to just use pskb_copy() lets the clone inherit the original headroom of 80 bytes, but also assumes that the copied data area is of at least that size, something that is not always the case. So that is not a viable solution. We now fix this by adding a check for sufficient headroom in the transmit function of udp_media.c, and expanding it when necessary. Fixes: commit d999297c3dbbe ("tipc: reduce locking scope during packet reception") Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
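The headroom check described here boils down to something like the following sketch (function name and the way the required overhead is passed in are illustrative; the real driver's error handling is elided):

    /* Sketch: expand the headroom of a cloned buffer if it is too small to
     * carry the UDP tunnel encapsulation headers. */
    static int tipc_udp_ensure_headroom(struct sk_buff *skb, unsigned int need)
    {
            if (skb_headroom(skb) < need &&
                pskb_expand_head(skb, need - skb_headroom(skb), 0, GFP_ATOMIC))
                    return -ENOMEM;
            return 0;
    }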
2015-10-22  net: cavium: change NET_VENDOR_CAVIUM to bool  (Andreas Schwab; 1 file, -1/+1)
CONFIG_NET_VENDOR_CAVIUM is only used to hide/show config options and to include subdirectories in the build, so it doesn't make sense to make it tristate. Signed-off-by: Andreas Schwab <schwab@suse.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-22  tipc: allow non-linear first fragment buffer  (Jon Paul Maloy; 1 file, -3/+9)
The current code for message reassembly is erroneously assuming that the first arriving fragment buffer always is linear, and then goes ahead resetting the fragment list of that buffer in anticipation of more arriving fragments. However, if the buffer already happens to be non-linear, we will inadvertently drop the already attached fragment list, and later on trigger a BUG() in __pskb_pull_tail(). We see this happen when running fragmented TIPC multicast across UDP, something made possible since commit d0f91938bede ("tipc: add ip/udp media type"). We fix this by not resetting the fragment list when the buffer is non-linear, and by initializing our private fragment list tail pointer to the tail of the existing fragment list. Fixes: commit d0f91938bede ("tipc: add ip/udp media type") Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
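Conceptually the fix looks roughly like this (a sketch of the idea in kernel-context C, not the actual tipc_buf_append() diff):

    /* Sketch: only reset the frag_list of the head fragment when it is
     * linear; otherwise keep the attached chain and append behind its tail. */
    struct sk_buff *tail = NULL;

    if (!skb_is_nonlinear(head)) {
            skb_shinfo(head)->frag_list = NULL;   /* old behaviour, still fine */
    } else {
            struct sk_buff *it;

            for (it = skb_shinfo(head)->frag_list; it; it = it->next)
                    tail = it;                    /* remember the existing tail */
    }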
2015-10-22  openvswitch: Allocate memory for ovs internal device stats.  (James Morse; 1 file, -3/+43)
"openvswitch: Remove vport stats" removed the per-vport statistics, in order to use the netdev's statistics fields. "openvswitch: Fix ovs_vport_get_stats()" fixed the export of these stats to user-space, by using the provided netdev_ops to collate them - but ovs internal devices still use an unallocated dev->tstats field to count packets, which are no longer exported by this api. Allocate the dev->tstats field for ovs internal devices, and wire up ndo_get_stats64 with the original implementation of ovs_vport_get_stats(). On its own, "openvswitch: Fix ovs_vport_get_stats()" fixes the OOPs, unmasking a full-on panic on arm64: =============%<============== [<ffffffbffc00ce4c>] internal_dev_recv+0xa8/0x170 [openvswitch] [<ffffffbffc0008b4>] do_output.isra.31+0x60/0x19c [openvswitch] [<ffffffbffc000bf8>] do_execute_actions+0x208/0x11c0 [openvswitch] [<ffffffbffc001c78>] ovs_execute_actions+0xc8/0x238 [openvswitch] [<ffffffbffc003dfc>] ovs_packet_cmd_execute+0x21c/0x288 [openvswitch] [<ffffffc0005e8c5c>] genl_family_rcv_msg+0x1b0/0x310 [<ffffffc0005e8e60>] genl_rcv_msg+0xa4/0xe4 [<ffffffc0005e7ddc>] netlink_rcv_skb+0xb0/0xdc [<ffffffc0005e8a94>] genl_rcv+0x38/0x50 [<ffffffc0005e76c0>] netlink_unicast+0x164/0x210 [<ffffffc0005e7b70>] netlink_sendmsg+0x304/0x368 [<ffffffc0005a21c0>] sock_sendmsg+0x30/0x4c [SNIP] Kernel panic - not syncing: Fatal exception in interrupt =============%<============== Fixes: 8c876639c985 ("openvswitch: Remove vport stats.") Signed-off-by: James Morse <james.morse@arm.com> Acked-by: Pravin B Shelar <pshelar@nicira.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-22  net: Really fix vti6 with oif in dst lookups  (David Ahern; 2 files, -1/+5)
6e28b000825d ("net: Fix vti use case with oif in dst lookups for IPv6") is missing the checks on FLOWI_FLAG_SKIP_NH_OIF. Add them. Fixes: 42a7b32b73d6 ("xfrm: Add oif to dst lookups") Cc: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Acked-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-22  tipc: extend broadcast link window size  (Jon Paul Maloy; 1 file, -3/+5)
The default fixed broadcast window size is currently set to 20 packets. This is a very low value, set at a time when we were still testing on 10 Mb/s hubs, and a change to it is long overdue. Commit 7845989cb4b3da1db ("net: tipc: fix stall during bclink wakeup procedure") revealed a problem with this low value. For messages of importance LOW, the backlog queue limit will be calculated to 30 packets, while a single, maximum sized message of 66000 bytes, carried across a 1500 MTU network, consists of 46 packets. This leads to the following scenario (among others leading to the same situation):

1: Msg 1 of 46 packets is sent. 20 packets go to the transmit queue, 26 packets to the backlog queue.
2: Msg 2 of 46 packets is attempted sent, but rejected because there is no more space in the backlog queue at this level. The sender is added to the wakeup queue with a "pending packets chain size" number of 46.
3: Some packets in the transmit queue are acked and released. We try to wake up the sender, but the pending size of 46 is bigger than the LOW wakeup limit of 30, so this doesn't happen.
5: Subsequent acks release all the remaining buffers. Each time we test for the wakeup criteria and find that 46 still is larger than 30, even after both the transmit and the backlog queues are empty.
6: The sender is never woken up and given a chance to send its message. He is stuck.

We could now loosen the wakeup criteria (used by link_prepare_wakeup()) to become equal to the send criteria (used by tipc_link_xmit()), i.e., by ignoring the "pending packets chain size" value altogether, or we can just increase the queue limits so that the criteria can be satisfied anyway. There are good reasons (potentially multiple waiting senders) to not opt for the former solution, so we choose the latter one. This commit fixes the problem by giving the broadcast link window a default value of 50 packets. We also introduce a new minimum link window size BCLINK_MIN_WIN of 32, which is enough to always avoid the described situation. Finally, in order to not break any existing users which may set the window explicitly, we enforce that the window is set to the new minimum value in case the user is trying to set it to anything lower. Fixes: 7845989cb4b3da1db ("net: tipc: fix stall during bclink wakeup procedure") Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Reviewed-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-21  Bluetooth: Remove unnecessary hci_explicit_connect_lookup function  (Johan Hedberg; 3 files, -22/+3)
There's only one user of this helper, which can be replaced with a call to hci_pend_le_action_lookup() and a check for params->explicit_connect. Signed-off-by: Johan Hedberg <johan.hedberg@intel.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-10-21  Bluetooth: Remove redundant (and possibly wrong) flag clearing  (Johan Hedberg; 1 file, -1/+0)
There's no need to clear the HCI_CONN_ENCRYPT_PEND flag in smp_failure. In fact, this may cause the encryption tracking to get out of sync as this has nothing to do with HCI activity. Signed-off-by: Johan Hedberg <johan.hedberg@intel.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-10-21  Bluetooth: Add hdev helper variable to hci_le_create_connection_cancel  (Johan Hedberg; 1 file, -6/+7)
The hci_le_create_connection_cancel() function needs to use the hdev pointer in many places so add a variable for it to avoid the need to dereference the hci_conn every time. Signed-off-by: Johan Hedberg <johan.hedberg@intel.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-10-21  Bluetooth: Remove unnecessary indentation in unpair_device()  (Johan Hedberg; 1 file, -23/+32)
Instead of doing all of the LE-specific handling in an else-branch in unpair_device() create a 'done' label for the BR/EDR branch to jump to and then remove the else-branch completely. Signed-off-by: Johan Hedberg <johan.hedberg@intel.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-10-21  Bluetooth: 6lowpan: Use hci_conn_hash_lookup_le() when possible  (Johan Hedberg; 1 file, -1/+1)
Use the new hci_conn_hash_lookup_le() API to look up LE connections. This way we're guaranteed exact matches that also take into account the address type. Signed-off-by: Johan Hedberg <johan.hedberg@intel.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-10-21  Bluetooth: Use hci_conn_hash_lookup_le() when possible  (Johan Hedberg; 3 files, -11/+11)
Use the new hci_conn_hash_lookup_le() API to look up LE connections. This way we're guaranteed exact matches that also take into account the address type. Signed-off-by: Johan Hedberg <johan.hedberg@intel.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-10-21  Bluetooth: Add hci_conn_hash_lookup_le() helper function  (Johan Hedberg; 1 file, -0/+24)
Many of the existing LE connection lookups are forced to use hci_conn_hash_lookup_ba() which doesn't take into account the address type. What's worse, most of the users don't bother checking that the returned address type matches what was wanted. This patch adds a new helper API to look up LE connections based on their address and address type, paving the way to have the hci_conn_hash_lookup_ba() users converted to do more precise lookups. Signed-off-by: Johan Hedberg <johan.hedberg@intel.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-10-21  Bluetooth: Add le_addr_type() helper function  (Johan Hedberg; 1 file, -38/+18)
The mgmt code needs to convert from mgmt/L2CAP address types to HCI in many places. Having a dedicated helper function for this simplifies code by shortening it and removing unnecessary 'addr_type' variables. Signed-off-by: Johan Hedberg <johan.hedberg@intel.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
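Such a helper is essentially a two-value mapping from the mgmt/L2CAP LE address types to the HCI ones; a sketch of what it could look like (kernel-context C; the exact body of the upstream helper may differ):

    static u8 le_addr_type(u8 mgmt_addr_type)
    {
            if (mgmt_addr_type == BDADDR_LE_PUBLIC)
                    return ADDR_LE_DEV_PUBLIC;
            else
                    return ADDR_LE_DEV_RANDOM;
    }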
2015-10-21  Adding switchdev ageing notification on port bridged  (Elad Raz; 1 file, -0/+12)
Configure the ageing time in the HW for a newly bridged device. CC: Scott Feldman <sfeldma@gmail.com> CC: Jiri Pirko <jiri@resnulli.us> Signed-off-by: Elad Raz <eladr@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Acked-by: Scott Feldman <sfeldma@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-21  irda: precedence bug in irlmp_seq_hb_idx()  (Dan Carpenter; 1 file, -1/+1)
This is decrementing the pointer, instead of the value stored in the pointer. KASan detects it as an out of bounds reference. Reported-by: "Berry Cheng 程君(成淼)" <chengmiao.cj@alibaba-inc.com> Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
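The bug class is plain C operator precedence: postfix -- binds tighter than the dereference, so "*p--" moves the pointer instead of changing the value. A tiny standalone example (not the irlmp code itself):

    #include <stdio.h>

    int main(void)
    {
            int counts[3] = { 5, 5, 5 };
            int *p = &counts[1];

            *p--;      /* decrements the POINTER: p now points at counts[0];
                          the value that was loaded is discarded */
            (*p)--;    /* decrements the VALUE p points at: counts[0] becomes 4 */

            printf("%d %d %d\n", counts[0], counts[1], counts[2]);  /* prints: 4 5 5 */
            return 0;
    }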
2015-10-21  xen-netfront: update num_queues to real created  (Joe Jin; 1 file, -7/+7)
Sometimes xennet_create_queues() may fail to create all requested queues; we need to update num_queues to the number actually created to avoid a NULL pointer dereference. Signed-off-by: Joe Jin <joe.jin@oracle.com> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Wei Liu <wei.liu2@citrix.com> Cc: Ian Campbell <ian.campbell@citrix.com> Cc: David S. Miller <davem@davemloft.net> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-21  vsock: fix missing cleanup when misc_register failed  (Gao feng; 1 file, -3/+4)
Reset the transport and unlock if misc_register() failed. Signed-off-by: Gao feng <omarapazanadi@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-21  Merge branch 'mv643xx-fixes'  (David S. Miller; 1 file, -8/+40)
Philipp Kirchhofer says:
====================
net: mv643xx_eth: TSO TX data corruption fixes

As previously discussed [1] the mv643xx_eth driver has some issues with data corruption when using TCP segmentation offload (TSO). The following patch set improves this situation by fixing two data corruption bugs in the TSO TX path. Before applying the patches, repeatedly accessing large files located on an SMB share on my NSA325 NAS with TSO enabled resulted in different hash sums, which confirmed that data corruption is happening during file transfer. After applying the patches the hash sums were the same. As this is my first patch submission please feel free to point out any issues with the patch set.

[1] http://thread.gmane.org/gmane.linux.network/336530
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-21  net: mv643xx_eth: Defer writing the first TX descriptor when using TSO  (Philipp Kirchhofer; 1 file, -3/+23)
To prevent a race between the TX DMA engine and the CPU the writing of the first transmit descriptor must be deferred until all following descriptors have been updated. The network card may otherwise start transmitting before all packet descriptors are set up correctly, which leads to data corruption or an aborted transmit operation. This deferral is already done in the non-TSO TX path, implement it also in the TSO TX path. Signed-off-by: Philipp Kirchhofer <philipp@familie-kirchhofer.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-21  net: mv643xx_eth: Ensure proper data alignment in TSO TX path  (Philipp Kirchhofer; 1 file, -5/+17)
The TX DMA engine requires that buffers with a size of 8 bytes or smaller must be 64 bit aligned. This requirement may be violated when doing TSO, as in this case larger skb frags can be broken up and transmitted in small parts with then inappropriate alignment. Fix this by checking for proper alignment before handing a buffer to the DMA engine. If the data is misaligned realign it by copying it into the TSO header data area. Signed-off-by: Philipp Kirchhofer <philipp@familie-kirchhofer.de> Signed-off-by: David S. Miller <davem@davemloft.net>
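The constraint can be expressed as a small helper like the following (a sketch; the names are illustrative, and per the description above the real driver copies the misaligned bytes into the TSO header data area):

    #include <stdbool.h>
    #include <stdint.h>
    #include <string.h>

    /* Buffers of 8 bytes or less handed to the TX DMA engine must be
     * 64-bit aligned; copy small misaligned chunks into aligned scratch. */
    static bool tx_buf_needs_realign(const void *data, size_t len)
    {
            return len <= 8 && ((uintptr_t)data & 7);
    }

    static const void *tx_buf_align(const void *data, size_t len, void *scratch)
    {
            if (!tx_buf_needs_realign(data, len))
                    return data;
            memcpy(scratch, data, len);   /* scratch must be 8-byte aligned */
            return scratch;
    }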
2015-10-21  Merge branch 'tcp-rack'  (David S. Miller; 11 files, -86/+286)
Yuchung Cheng says:
====================
RACK loss detection

RACK (Recent ACK) loss recovery uses the notion of time instead of packet sequence (FACK) or counts (dupthresh). It's inspired by the FACK heuristic in tcp_mark_lost_retrans(): when a limited transmit (new data packet) is sacked in recovery, then any retransmission sent before that newly sacked packet was sent must have been lost, since at least one round trip time has elapsed. But that existing heuristic from tcp_mark_lost_retrans() has several limitations:

1) it can't detect tail drops since it depends on limited transmit
2) it's disabled upon reordering (assumes no reordering)
3) it's only enabled in fast recovery but not timeout recovery

RACK addresses these limitations with a core idea: an unacknowledged packet P1 is deemed lost if a packet P2 that was sent later is s/acked, since at least one round trip has passed. Since RACK cares about the time sequence instead of the data sequence of packets, it can detect tail drops when a later retransmission is s/acked, while FACK or dupthresh can't. For reordering RACK uses a dynamically adjusted reordering window ("reo_wnd") to reduce false positives on every (small) degree of reordering, similar to the delayed Early Retransmit. In the current patch set RACK is only a supplemental loss detection and does not trigger fast recovery. However we are developing RACK to replace or consolidate FACK/dupthresh, early retransmit, and thin-dupack. These heuristics all implicitly bear the time notion. For example, the delayed Early Retransmit is simply applying RACK to trigger the fast recovery with small inflight. RACK requires measuring the minimum RTT. Tracking a global min is less robust due to traffic engineering pathing changes. Therefore it uses a windowed filter by Kathleen Nichols. The min RTT can also be useful for various other purposes like congestion control or stat monitoring. This patch has been used on Google servers for well over 1 year. RACK has also been implemented in the QUIC protocol. We are submitting an IETF draft as well.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-21  tcp: use RACK to detect losses  (Yuchung Cheng; 5 files, -2/+109)
This patch implements the second half of RACK that uses the most recent transmit time among all delivered packets to detect losses. tcp_rack_mark_lost() is called upon receiving a dubious ACK. It then checks if a not-yet-sacked packet was sent at least "reo_wnd" prior to the sent time of the most recently delivered. If so the packet is deemed lost. The "reo_wnd" reordering window starts with 1msec for fast loss detection and changes to min-RTT/4 when reordering is observed. We found 1msec accommodates a tiny degree of reordering (<3 pkts) well on faster links. We use min-RTT instead of SRTT because reordering is more of a path property but SRTT can be inflated by self-inflicted congestion. The factor of 4 is borrowed from the delayed early retransmit and seems to work reasonably well. Since RACK is still experimental, it is now used as a supplemental loss detection on top of existing algorithms. It is only effective after the fast recovery starts or after the timeout occurs. The fast recovery is still triggered by FACK and/or dupack threshold instead of RACK. We introduce a new sysctl net.ipv4.tcp_recovery for future experiments of loss recoveries. For now RACK can be disabled by setting it to 0. Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
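Stripped of kernel details, the rule tcp_rack_mark_lost() applies can be sketched like this (illustrative standalone C, not the kernel code):

    #include <stdbool.h>
    #include <stdint.h>

    struct pkt {
            uint64_t xmit_us;    /* time this packet was (re)transmitted */
            bool     delivered;  /* ACKed or SACKed */
            bool     lost;
    };

    /* rack_mstamp_us is the transmit time of the most recently sent packet
     * that has been delivered; reo_wnd_us is the reordering window. */
    static void rack_mark_lost(struct pkt *p, int n,
                               uint64_t rack_mstamp_us, uint64_t reo_wnd_us)
    {
            for (int i = 0; i < n; i++) {
                    if (p[i].delivered || p[i].lost)
                            continue;
                    if (p[i].xmit_us + reo_wnd_us < rack_mstamp_us)
                            p[i].lost = true;  /* sent "reo_wnd" before a delivered pkt */
            }
    }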
2015-10-21  tcp: track the packet timings in RACK  (Yuchung Cheng; 6 files, -0/+60)
This patch is the first half of the RACK loss recovery. RACK loss recovery uses the notion of time instead of packet sequence (FACK) or counts (dupthresh). It's inspired by the previous FACK heuristic in tcp_mark_lost_retrans(): when a limited transmit (new data packet) is sacked, then the current retransmitted sequence below the newly sacked sequence must have been lost, since at least one round trip time has elapsed. But it has several limitations:

1) can't detect tail drops since it depends on limited transmit
2) is disabled upon reordering (assumes no reordering)
3) only enabled in fast recovery but not timeout recovery

RACK (Recent ACK) addresses these limitations with the notion of time instead: a packet P1 is lost if a later packet P2 is s/acked, as at least one round trip has passed. Since RACK cares about the time sequence instead of the data sequence of packets, it can detect tail drops when a later retransmission is s/acked while FACK or dupthresh can't. For reordering RACK uses a dynamically adjusted reordering window ("reo_wnd") to reduce false positives on every (small) degree of reordering. This patch implements tcp_rack_advance() which tracks the most recent transmission time among the packets that have been delivered (ACKed or SACKed) in tp->rack.mstamp. This timestamp is the key to determine which packet has been lost. Consider an example that the sender sends six packets:

T1: P1 (lost)
T2: P2
T3: P3
T4: P4
T100: sack of P2. rack.mstamp = T2
T101: retransmit P1
T102: sack of P2,P3,P4. rack.mstamp = T4
T205: ACK of P4 since the hole is repaired. rack.mstamp = T101

We need to be careful about spurious retransmission because it may falsely advance tp->rack.mstamp by an RTT or an RTO, causing RACK to falsely mark all packets lost, just like a spurious timeout. We identify spurious retransmission by the ACK's TS echo value. If the TS option is not applicable but the retransmission is acknowledged less than min-RTT ago, it is likely to be spurious. We refrain from using the transmission time of these spurious retransmissions. The second half is implemented in the next patch that marks packet lost using RACK timestamp. Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
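The "advance" half described above reduces to keeping a running maximum of the transmit times of delivered packets, ignoring deliveries that look like spurious retransmissions; a sketch (standalone C, not the kernel implementation):

    #include <stdbool.h>
    #include <stdint.h>

    struct rack {
            uint64_t mstamp_us;  /* xmit time of the latest-sent delivered pkt */
    };

    static void rack_advance(struct rack *r, uint64_t pkt_xmit_us,
                             bool likely_spurious_rexmit)
    {
            if (likely_spurious_rexmit)
                    return;               /* don't trust this timestamp */
            if (pkt_xmit_us > r->mstamp_us)
                    r->mstamp_us = pkt_xmit_us;
    }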
2015-10-21  tcp: skb_mstamp_after helper  (Yuchung Cheng; 1 file, -0/+9)
A helper to prepare for the first main RACK patch. Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
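A helper like this is, at its core, the usual wrap-safe "is t1 later than t0" comparison on unsigned timestamps; the underlying idiom looks like this (sketch, not the kernel definition):

    #include <stdbool.h>
    #include <stdint.h>

    /* True if t1 is after t0, tolerating 32-bit wrap-around, the same
     * time_after()-style trick such an skb timestamp helper builds on. */
    static bool stamp_after(uint32_t t1, uint32_t t0)
    {
            return (int32_t)(t1 - t0) > 0;
    }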
2015-10-21  tcp: add tcp_tsopt_ecr_before helper  (Yuchung Cheng; 1 file, -2/+7)
A helper to prepare for the main RACK patch. Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-21  tcp: remove tcp_mark_lost_retrans()  (Yuchung Cheng; 3 files, -73/+0)
Remove the existing lost retransmit detection because RACK subsumes it completely. This also stops overloading the ack_seq field of the skb control block. Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-21  tcp: track min RTT using windowed min-filter  (Yuchung Cheng; 7 files, -5/+100)
Kathleen Nichols' algorithm for tracking the minimum RTT of a data stream over some measurement window. It uses constant space and constant time per update. Yet it almost always delivers the same minimum as an implementation that has to keep all the data in the window. The measurement window is tunable via sysctl.net.ipv4.tcp_min_rtt_wlen with a default value of 5 minutes. The algorithm keeps track of the best, 2nd best & 3rd best min values, maintaining an invariant that the measurement time of the n'th best >= n-1'th best. It also makes sure that the three values are widely separated in the time window since that bounds the worst-case error when that data is monotonically increasing over the window. Upon getting a new min, we can forget everything earlier because it has no value - the new min is less than everything else in the window by definition and it's the most recent. So we restart fresh on every new min and overwrite the 2nd & 3rd choices. The same property holds for the 2nd & 3rd best. Therefore we have to maintain two invariants to maximize the information in the samples, one on values (1st.v <= 2nd.v <= 3rd.v) and the other on times (now-win <= 1st.t <= 2nd.t <= 3rd.t <= now). These invariants determine the structure of the code. The RTT input to the windowed filter is the minimum RTT measured from ACK or SACK, or as the last resort from TCP timestamps. The accessor tcp_min_rtt() returns the minimum RTT seen in the window. ~0U indicates it is not available. The minimum is 1usec even if the true RTT is below that. Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
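A standalone sketch of the windowed min-filter idea described above (simplified types and aging rules; the kernel keeps these samples per connection for the min RTT):

    #include <stdint.h>
    #include <stdio.h>

    struct mm_sample { uint32_t t, v; };
    struct minmax    { struct mm_sample s[3]; };   /* best, 2nd best, 3rd best */

    static void minmax_reset(struct minmax *m, uint32_t t, uint32_t v)
    {
            struct mm_sample val = { t, v };
            m->s[0] = m->s[1] = m->s[2] = val;
    }

    static uint32_t minmax_running_min(struct minmax *m, uint32_t win,
                                       uint32_t t, uint32_t v)
    {
            struct mm_sample val = { t, v };
            uint32_t dt;

            /* new global min, or nothing left inside the window: start over */
            if (v <= m->s[0].v || t - m->s[2].t > win) {
                    minmax_reset(m, t, v);
                    return m->s[0].v;
            }
            if (v <= m->s[1].v)
                    m->s[2] = m->s[1] = val;
            else if (v <= m->s[2].v)
                    m->s[2] = val;

            dt = t - m->s[0].t;                 /* age of the current best */
            if (dt > win) {                     /* best has aged out: shift */
                    m->s[0] = m->s[1];
                    m->s[1] = m->s[2];
                    m->s[2] = val;
                    if (t - m->s[0].t > win) {  /* 2nd best was too old too */
                            m->s[0] = m->s[1];
                            m->s[1] = m->s[2];
                            m->s[2] = val;
                    }
            } else if (m->s[1].t == m->s[0].t && dt > win / 4) {
                    m->s[2] = m->s[1] = val;    /* refresh the 2nd choice */
            } else if (m->s[2].t == m->s[1].t && dt > win / 2) {
                    m->s[2] = val;              /* refresh the 3rd choice */
            }
            return m->s[0].v;
    }

    int main(void)
    {
            struct minmax m;

            minmax_reset(&m, 0, 100);
            printf("%u\n", minmax_running_min(&m, 300, 50, 80));    /* 80 */
            printf("%u\n", minmax_running_min(&m, 300, 200, 95));   /* 80 */
            printf("%u\n", minmax_running_min(&m, 300, 400, 120));  /* 95: old min aged out */
            return 0;
    }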
2015-10-21  tcp: apply Kern's check on RTTs used for congestion control  (Yuchung Cheng; 1 file, -4/+1)
Currently ca_seq_rtt_us does not use Kern's check. Fix that by checking if any packet acked is a retransmit, for both RTT used for RTT estimation and congestion control. Fixes: 5b08e47ca ("tcp: prefer packet timing to TS-ECR for RTT") Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-21  Merge branch 'smsc-energy-detect'  (David S. Miller; 4 files, -9/+50)
Heiko Schocher says:
====================
net, phy, smsc: add possibility to disable energy detect mode

On some boards the energy enable detect mode leads to trouble with some switches, so make the enabling of this mode configurable through DT. Therefore the property "smsc,disable-energy-detect" is introduced. Patch 1 introduces phy-handle support for the ti,cpsw driver. This is needed now for the smsc phy. Patch 2 adds the disable energy mode functionality to the smsc phy.

Changes in v2:
- add comments from Florian Fainelli
- I did not change the disable property name into enable because I fear breaking existing behaviour
- add smsc vendor prefix
- remove CONFIG_OF and use __maybe_unused
- introduce "phy-handle" ability into the ti,cpsw driver, so I can remove the bogus "if (!of_node && dev->parent->of_node) of_node = dev->parent->of_node;" construct. Therefore a new patch for the ti,cpsw driver is necessary.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-21  net: phy: smsc: disable energy detect mode  (Heiko Schocher; 2 files, -5/+38)
On some boards the energy enable detect mode leads to trouble with some switches, so make the enabling of this mode configurable through DT. Signed-off-by: Heiko Schocher <hs@denx.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-21  drivers: net: cpsw: add phy-handle parsing  (Heiko Schocher; 2 files, -4/+12)
Add the ability to parse "phy-handle". This is needed for PHYs which have a DT node and need to parse DT properties. Signed-off-by: Heiko Schocher <hs@denx.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-21  bcm63xx_enet: check 1000BASE-T advertisement configuration  (Simon Arlott; 1 file, -14/+19)
If a gigabit ethernet PHY is connected to a fast ethernet MAC, then it can detect 1000 support from the partner but not use it. This results in a forced speed of 1000 and RX/TX failure. Check for 1000BASE-T support and then check the advertisement configuration before setting the MAC speed to 1000mbit. Signed-off-by: Simon Arlott <simon@fire.lp0.eu> Signed-off-by: David S. Miller <davem@davemloft.net>
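The check described above amounts to looking at both the PHY's supported mask and its 1000BASE-T advertisement before selecting gigabit; a sketch in phylib terms (illustrative, not the driver's exact code):

    /* Sketch: report whether 1000BASE-T is both supported and advertised. */
    static bool phy_advertising_gigabit(struct phy_device *phydev)
    {
            int ctrl1000;

            if (!(phydev->supported & (SUPPORTED_1000baseT_Full |
                                       SUPPORTED_1000baseT_Half)))
                    return false;

            ctrl1000 = phy_read(phydev, MII_CTRL1000);
            if (ctrl1000 < 0)
                    return false;

            return ctrl1000 & (ADVERTISE_1000FULL | ADVERTISE_1000HALF);
    }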
2015-10-21  Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue  (David S. Miller; 16 files, -219/+611)
Jeff Kirsher says:
====================
Intel Wired LAN Driver Updates 2015-10-19

This series contains updates to i40e and i40evf only.

Kiran adds a spinlock around code accessing the VSI MAC filter list to ensure that we are synchronizing access to the filter list, otherwise we can end up with multiple accesses at the same time which can cause the VSI MAC filter list to get into an unstable or corrupted state.

Jesse fixes overlong BIT defines, where the RSS enabling call was mistakenly missed. Also fixes a bug where the enable function was enabling the interrupt twice while trying to update the two interrupt throttle rate thresholds for Rx and Tx, while refactoring the IRQ enable function to simplify reading the flow. Addressed the high CPU utilization of some small streaming workloads in which the driver should reduce CPU usage.

Anjali fixes two X722 issues with respect to EEPROM checksum verify and reading NVM version info. Fixed where a mask value was accidentally replaced with a bit mask, causing Flow Director sideband to be broken.

Alex Duyck fixes areas of the drivers which run from hard interrupt context or with interrupts already disabled in netpoll, so use napi_schedule_irqoff() instead of napi_schedule().

Mitch fixes the VF drivers to not easily give up when they are not able to communicate with the PF driver.

Carolyn fixes a problem where our tools' MAC loopback test would fail after driver unbind because the hardware was configured for multiqueue and the unbind operation did not clear this configuration. Also fixed an issue where the NVMUpdate tool gets bad data from the PHY when using the PHY NVM feature because of contention on the MDIO interface from getting PHY capability calls from the driver during regular operations.

Catherine fixed an issue where we were checking if autoneg was allowed to change before checking if autoneg was changing; these checks need to be in the reverse order.

Jean Sacren fixes up a function header comment to align the kernel-docs with the actual code.

v2: Cleaned up the use of spin_is_locked() in patch 1 based on feedback from David Miller, since it always evaluates to zero on uni-processor builds.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-21  Bluetooth: Fix missing hdev locking for LE scan cleanup  (Johan Hedberg; 2 files, -9/+44)
The hci_conn objects don't have a dedicated lock themselves but rely on the caller to hold the hci_dev lock for most types of access. The hci_conn_timeout() function has so far sent certain HCI commands based on the hci_conn state which has been possible without holding the hci_dev lock. The recent changes to do LE scanning before connect attempts added even more operations to hci_conn and hci_dev from hci_conn_timeout, thereby exposing potential race conditions with the hci_dev and hci_conn states. As an example of such a race, here there's a timeout but an l2cap_sock_connect() call manages to race with the cleanup routine: [Oct21 08:14] l2cap_chan_timeout: chan ee4b12c0 state BT_CONNECT [ +0.000004] l2cap_chan_close: chan ee4b12c0 state BT_CONNECT [ +0.000002] l2cap_chan_del: chan ee4b12c0, conn f3141580, err 111, state BT_CONNECT [ +0.000002] l2cap_sock_teardown_cb: chan ee4b12c0 state BT_CONNECT [ +0.000005] l2cap_chan_put: chan ee4b12c0 orig refcnt 4 [ +0.000010] hci_conn_drop: hcon f53d56e0 orig refcnt 1 [ +0.000013] l2cap_chan_put: chan ee4b12c0 orig refcnt 3 [ +0.000063] hci_conn_timeout: hcon f53d56e0 state BT_CONNECT [ +0.000049] hci_conn_params_del: addr ee:0d:30:09:53:1f (type 1) [ +0.000002] hci_chan_list_flush: hcon f53d56e0 [ +0.000001] hci_chan_del: hci0 hcon f53d56e0 chan f4e7ccc0 [ +0.004528] l2cap_sock_create: sock e708fc00 [ +0.000023] l2cap_chan_create: chan ee4b1770 [ +0.000001] l2cap_chan_hold: chan ee4b1770 orig refcnt 1 [ +0.000002] l2cap_sock_init: sk ee4b3390 [ +0.000029] l2cap_sock_bind: sk ee4b3390 [ +0.000010] l2cap_sock_setsockopt: sk ee4b3390 [ +0.000037] l2cap_sock_connect: sk ee4b3390 [ +0.000002] l2cap_chan_connect: 00:02:72:d9:e5:8b -> ee:0d:30:09:53:1f (type 2) psm 0x00 [ +0.000002] hci_get_route: 00:02:72:d9:e5:8b -> ee:0d:30:09:53:1f [ +0.000001] hci_dev_hold: hci0 orig refcnt 8 [ +0.000003] hci_conn_hold: hcon f53d56e0 orig refcnt 0 Above the l2cap_chan_connect() shouldn't have been able to reach the hci_conn f53d56e0 anymore but since hci_conn_timeout didn't do proper locking that's not the case. The end result is a reference to hci_conn that's not in the conn_hash list, resulting in list corruption when trying to remove it later: [Oct21 08:15] l2cap_chan_timeout: chan ee4b1770 state BT_CONNECT [ +0.000004] l2cap_chan_close: chan ee4b1770 state BT_CONNECT [ +0.000003] l2cap_chan_del: chan ee4b1770, conn f3141580, err 111, state BT_CONNECT [ +0.000001] l2cap_sock_teardown_cb: chan ee4b1770 state BT_CONNECT [ +0.000005] l2cap_chan_put: chan ee4b1770 orig refcnt 4 [ +0.000002] hci_conn_drop: hcon f53d56e0 orig refcnt 1 [ +0.000015] l2cap_chan_put: chan ee4b1770 orig refcnt 3 [ +0.000038] hci_conn_timeout: hcon f53d56e0 state BT_CONNECT [ +0.000003] hci_chan_list_flush: hcon f53d56e0 [ +0.000002] hci_conn_hash_del: hci0 hcon f53d56e0 [ +0.000001] ------------[ cut here ]------------ [ +0.000461] WARNING: CPU: 0 PID: 1782 at lib/list_debug.c:56 __list_del_entry+0x3f/0x71() [ +0.000839] list_del corruption, f53d56e0->prev is LIST_POISON2 (00000200) The necessary fix is unfortunately more complicated than just adding hci_dev_lock/unlock calls to the hci_conn_timeout() call path. Particularly, the hci_conn_del() API, which expects the hci_dev lock to be held, performs a cancel_delayed_work_sync(&hcon->disc_work) which would lead to a deadlock if the hci_conn_timeout() call path tries to acquire the same lock. This patch solves the problem by deferring the cleanup work to a separate work callback. 
To protect against the hci_dev or hci_conn going away meanwhile, temporary references are taken with the help of hci_dev_hold() and hci_conn_get(). Signed-off-by: Johan Hedberg <johan.hedberg@intel.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org> Cc: stable@vger.kernel.org # 4.3
2015-10-21  mac80211: move station statistics into sub-structs  (Johannes Berg; 11 files, -185/+178)
Group station statistics by where they're (mostly) updated (TX, RX and TX-status) and group them into sub-structs of the struct sta_info. Also rename the variables since the grouping now makes it obvious where they belong. This makes it easier to identify where the statistics are updated in the code, and thus easier to think about them. Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2015-10-21  mac80211: move beacon_loss_count into ifmgd  (Johannes Berg; 5 files, -14/+12)
There's little point in keeping (and even sending to userspace) the beacon_loss_count value per station, since it can only apply to the AP on a managed-mode connection. Move the value to ifmgd, advertise it only in managed mode, and remove it from ethtool as it's available through better interfaces. Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2015-10-21  mac80211: remove sta->last_ack_signal  (Johannes Berg; 3 files, -7/+0)
This file only feeds a debugfs file that isn't very useful, so remove it. If necessary, we can add other ways to get this information, for example in the NL80211_CMD_PROBE_CLIENT response. Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2015-10-21  Merge ath-next from ath.git  (Kalle Valo; 39 files, -548/+1855)
Major changes:

ath10k
* add board 2 API support for automatically choosing correct board file
* data path optimisations
* disable PCI power save for qca988x and QCA99x0 due to interop reasons

wil6210
* BlockAckReq support
* firmware crashdump using devcoredump
* capture all frames with sniffer
2015-10-21  brcm80211: Add support for brcm4371  (Eric Caruso; 3 files, -0/+12)
This is a new Broadcom chip and we should be able to recognize it. Signed-off-by: Eric Caruso <ejcaruso@google.com> Acked-by: Arend van Spriel <arend@broadcom.com> Signed-off-by: Kalle Valo <kvalo@codeaurora.org>
2015-10-21  brcmfmac: Properly set carrier state of netdev.  (Hante Meuleman; 3 files, -6/+31)
Use the netif_carrier api to correctly set carrier state on the different modes. Reviewed-by: Arend Van Spriel <arend@broadcom.com> Reviewed-by: Pieter-Paul Giesberts <pieterpg@broadcom.com> Reviewed-by: Franky (Zhenhui) Lin <frankyl@broadcom.com> Signed-off-by: Hante Meuleman <meuleman@broadcom.com> Signed-off-by: Arend van Spriel <arend@broadcom.com> Signed-off-by: Kalle Valo <kvalo@codeaurora.org>
2015-10-21  brcmfmac: Remove unused state AP creating.  (Hante Meuleman; 2 files, -5/+0)
Reviewed-by: Arend Van Spriel <arend@broadcom.com> Reviewed-by: Pieter-Paul Giesberts <pieterpg@broadcom.com> Signed-off-by: Hante Meuleman <meuleman@broadcom.com> Signed-off-by: Arend van Spriel <arend@broadcom.com> Signed-off-by: Kalle Valo <kvalo@codeaurora.org>
2015-10-21  brcmfmac: Move brcmf_c_preinit_dcmds prototype to correct file.  (Hante Meuleman; 3 files, -3/+4)
Reviewed-by: Arend Van Spriel <arend@broadcom.com> Reviewed-by: Pieter-Paul Giesberts <pieterpg@broadcom.com> Signed-off-by: Hante Meuleman <meuleman@broadcom.com> Signed-off-by: Arend van Spriel <arend@broadcom.com> Signed-off-by: Kalle Valo <kvalo@codeaurora.org>