path: root/net
Age  Commit message  Author  Files  Lines
2014-05-19  netfilter: nf_tables: use new transaction infrastructure to handle sets  Pablo Neira Ayuso  2  -18/+115
This patch reworks the nf_tables API so that set updates are included in the same batch that contains rule updates. This speeds up rule-set updates, since we skip a dialog of four messages between kernel and user-space (two in each direction). The old sequence was:
1) create the set and send a netlink message to the kernel.
2) process the response from the kernel that contains the allocated name.
3) add the set elements and send a netlink message to the kernel.
4) process the response from the kernel (to check for errors).
The new sequence is:
1) add the set to the batch.
2) add the set elements to the batch.
3) add the rule that points to the set.
4) send the batch to the kernel.
This also introduces an internal set ID (NFTA_SET_ID) that is unique in the batch, so that set elements and rules can refer to new sets. Backward compatibility has been retained only in userspace: new nft versions can talk to the kernel in both the new and the old fashion.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
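The role of a batch-local set ID can be illustrated with a small userspace sketch. Everything here is hypothetical (the `msg_kind` values, `struct batch_msg`, and `batch_refs_resolve()` are invented for illustration and are not the nf_tables netlink API); it also assumes every element and rule in the batch refers to a set created in that same batch.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical message kinds; the real netlink message types differ. */
enum msg_kind { MSG_NEWSET, MSG_NEWSETELEM, MSG_NEWRULE };

/* One entry in a batch. set_id plays the role of NFTA_SET_ID: it only
 * has to be unique within the batch, so no kernel round trip is needed
 * to learn an allocated name before elements and rules can refer to it. */
struct batch_msg {
    enum msg_kind kind;
    uint32_t set_id;
};

/* Return 1 if every element or rule message refers to a set that was
 * added earlier in the same batch, 0 otherwise. */
int batch_refs_resolve(const struct batch_msg *msgs, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (msgs[i].kind == MSG_NEWSET)
            continue;

        int found = 0;
        for (size_t j = 0; j < i; j++) {
            if (msgs[j].kind == MSG_NEWSET &&
                msgs[j].set_id == msgs[i].set_id) {
                found = 1;
                break;
            }
        }
        if (!found)
            return 0;
    }
    return 1;
}
```

A batch of "new set 1, new element in set 1, new rule using set 1" resolves, while a batch that adds an element before its set does not.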
2014-05-19  netfilter: nf_tables: add message type to transactions  Pablo Neira Ayuso  1  -31/+43
The patch adds the message type to the transaction to simplify the commit and abort routines. Yet another step forward in the generalisation of the transaction infrastructure. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2014-05-19  netfilter: nf_tables: relocate commit and abort routines in the source file  Pablo Neira Ayuso  1  -80/+80
Move the commit and abort routines to the bottom of the source code file. This change is required by the follow-up patches that add the set, chain and table transaction support. This patch is just a cleanup to access several functions without having to declare their prototypes. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2014-05-19  netfilter: nf_tables: generalise transaction infrastructure  Pablo Neira Ayuso  1  -54/+69
This patch generalises the existing rule transaction infrastructure so it can be used to handle set, table and chain object transactions as well. The transaction provides a data area that stores private information depending on the transaction type. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
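A transaction object with a type-dependent private data area can be sketched in userspace C as a common header followed by a flexible array member. The names below (`struct trans`, `trans_alloc()`, `struct trans_set`) are illustrative assumptions, not the kernel's definitions.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical message types for the sketch. */
enum trans_type { TRANS_NEWRULE, TRANS_NEWSET };

/* Generic transaction: a common header followed by a private data
 * area whose layout depends on msg_type. */
struct trans {
    enum trans_type msg_type;
    size_t data_size;
    _Alignas(max_align_t) unsigned char data[];
};

/* Example private payload for a set transaction (hypothetical). */
struct trans_set {
    unsigned int set_id;
};

/* Allocate a zeroed transaction with room for data_size private bytes. */
struct trans *trans_alloc(enum trans_type type, size_t data_size)
{
    struct trans *t = calloc(1, sizeof(*t) + data_size);

    if (t == NULL)
        return NULL;
    t->msg_type = type;
    t->data_size = data_size;
    return t;
}

/* Typed accessor for the private area of a set transaction. */
static inline struct trans_set *trans_set_data(struct trans *t)
{
    return (struct trans_set *)t->data;
}
```

The commit and abort paths can then switch on `msg_type` and use the matching accessor, which is the kind of simplification the message-type patch in this series describes.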
2014-05-19  netfilter: nf_tables: deconstify table and chain in context structure  Pablo Neira Ayuso  1  -29/+29
The new transaction infrastructure updates the family, table and chain objects in the context structure, so let's deconstify them. While at it, move the context structure initialization routine to the top of the source file, as it will also be used from the table and chain routines. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2014-05-19  can: add hash based access to single EFF frame filters  Oliver Hartkopp  3  -9/+75
In contrast to the direct access to the single SFF frame filters (which are indexed by the SFF CAN ID itself), the single EFF frame filters are arranged in a singly linked hlist. To reduce the hlist traversal in the case of many filter subscriptions, a hash based access is introduced for single EFF filters. Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net> Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
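One way to hash a 29-bit extended (EFF) CAN identifier into a small bucket index is to XOR-fold it. The sketch below assumes a 10-bit table (1024 buckets); the constant names and the bucket count are assumptions made for illustration, not values confirmed by this log.

```c
#include <stdint.h>

#define EFF_HASH_BITS 10            /* assumed: 1024 buckets */
#define EFF_ID_MASK   0x1FFFFFFFu   /* 29-bit extended CAN identifier */

/* Fold a 29-bit EFF identifier into EFF_HASH_BITS bits by XOR-ing its
 * three 10-bit slices, so every identifier bit influences the bucket. */
unsigned int eff_hash(uint32_t can_id)
{
    uint32_t id = can_id & EFF_ID_MASK;
    uint32_t hash = id;

    hash ^= id >> EFF_HASH_BITS;
    hash ^= id >> (2 * EFF_HASH_BITS);

    return hash & ((1u << EFF_HASH_BITS) - 1);
}
```

Each bucket would then hold a short hlist of filters, so a lookup walks only the entries that share the folded index instead of the whole list.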
2014-05-19  can: proc: make array printing function indenpendent from sff frames  Oliver Hartkopp  2  -13/+19
The can_rcvlist_sff_proc_show_one() function which prints the array of filters for the single SFF CAN identifiers is prepared to be used by a second caller. Therefore it is also renamed to properly describe its future functionality. Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net> Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2014-05-19  Merge tag 'batman-adv-for-davem' of git://git.open-mesh.org/linux-merge  David S. Miller  5  -5/+26
Included changes:
- fix codestyle to respect new checkpatch warnings
- increase internal version number
2014-05-19  net: rds: Use time_after() for time comparison  Manuel Schölling  2  -4/+4
To be future-proof and for better readability the time comparisons are modified to use time_after() instead of raw math. Signed-off-by: Manuel Schölling <manuel.schoelling@gmx.de> Signed-off-by: David S. Miller <davem@davemloft.net>
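The reason a helper is preferable to raw math is counter wraparound. The userspace sketch below mirrors the idea behind `time_after()` with a hypothetical `ticks_after()` on a 32-bit counter; it relies on the usual two's-complement conversion when casting to a signed type.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if tick count a is later than b, even if the counter wrapped
 * between the two samples. Valid while the two samples are less than
 * half the counter range apart. */
static inline bool ticks_after(uint32_t a, uint32_t b)
{
    return (int32_t)(b - a) < 0;
}
```

With a = 2 and b = 0xFFFFFFF0 (the counter wrapped after b was sampled), a raw `a > b` comparison is false, while `ticks_after(a, b)` is true.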
2014-05-19  ipv4: minor spelling fix  stephen hemminger  1  -1/+1
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-05-19  bridge: fix spelling of promiscuous  stephen hemminger  1  -1/+1
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-05-19  ethtool: Disallow ETHTOOL_SRSSH with both indir table and hash key unchanged  Ben Hutchings  1  -1/+4
This would be a no-op, so there is no reason to request it. This also allows conversion of the current implementations of ethtool_ops::{get,set}_rxfh_indir to ethtool_ops::{get,set}_rxfh with no change other than their parameters. Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2014-05-19  ethtool: Name the 'no change' value for setting RSS hash key but not indir table  Ben Hutchings  1  -5/+7
We usually allocate special values of u32 fields starting from the top down, so also change the value to 0xffffffff. As these operations haven't been included in a stable release yet, it's not too late to change. Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2014-05-19  ethtool: Return immediately on error in ethtool_copy_validate_indir()  Ben Hutchings  1  -9/+7
We must return -EFAULT immediately rather than continuing into the loop. Similarly, we may as well return -EINVAL directly. Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
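The early-return pattern can be sketched in userspace as follows. `copy_in()` is a hypothetical stand-in for `copy_from_user()`, and the function signature is simplified for illustration; it is not the kernel's `ethtool_copy_validate_indir()`.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Stand-in for copy_from_user(): returns non-zero on failure. A NULL
 * source simulates a faulting user pointer (sketch assumption). */
static int copy_in(uint32_t *dst, const uint32_t *src, size_t n)
{
    if (src == NULL)
        return 1;
    memcpy(dst, src, n * sizeof(*dst));
    return 0;
}

/* Copy an indirection table and check that every entry names a valid
 * ring. Each failure returns at once with its own error code. */
int copy_validate_indir(uint32_t *indir, const uint32_t *user,
                        size_t size, uint32_t ring_count)
{
    if (copy_in(indir, user, size))
        return -EFAULT;     /* never fall through into the loop */

    for (size_t i = 0; i < size; i++)
        if (indir[i] >= ring_count)
            return -EINVAL;

    return 0;
}
```

Returning directly keeps the validation loop from running over a buffer that was never filled.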
2014-05-19  net: bridge: fix build  Alexei Starovoitov  1  -1/+1
Fix the build when BRIDGE_VLAN_FILTERING is not set. Fixes: 2796d0c648c94 ("bridge: Automatically manage port promiscuous mode") Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-05-18  SUNRPC: Fix a module reference issue in rpcsec_gss  Trond Myklebust  1  -3/+1
We're not taking a reference in the case where _gss_mech_get_by_pseudoflavor loops without finding the correct rpcsec_gss flavour, so why are we releasing it? Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-05-18  batman-adv: Start new development cycle  Simon Wunderlich  1  -1/+1
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de> Signed-off-by: Antonio Quartulli <antonio@meshcoding.com>
2014-05-18  batman-adv: remove semi-colon after macro definition  Antonio Quartulli  2  -4/+4
Reported by checkpatch with the following warning: "WARNING: macros should not use a trailing semicolon" Signed-off-by: Antonio Quartulli <antonio@meshcoding.com> Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
2014-05-18  batman-adv: add blank line between declarations and the rest of the code  Antonio Quartulli  4  -0/+21
Reported by checkpatch with the following message: "WARNING: Missing a blank line after declarations" Signed-off-by: Antonio Quartulli <antonio@meshcoding.com> Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
2014-05-17  bonding: Fix stacked device detection in arp monitoring  Vlad Yasevich  1  -0/+26
Prior to commit fbd929f2dce460456807a51e18d623db3db9f077 (bonding: support QinQ for bond arp interval), the arp monitoring code allowed for proper detection of devices stacked on top of vlans. Since that commit, the code can still detect a device stacked on top of a single vlan, but not a device stacked on top of a Q-in-Q configuration. The search will only set the inner vlan tag if the route device is the vlan device. However, this is not always the case, as it is possible to extend the stacked configuration. With this patch it is possible to provision devices on top of a Q-in-Q vlan configuration that should be used as a source of ARP monitoring information. For example:
    ip link add link bond0 vlan10 type vlan proto 802.1q id 10
    ip link add link vlan10 vlan100 type vlan proto 802.1q id 100
    ip link add link vlan100 type macvlan
Note: This patch limits the number of stacked VLANs to 2, just like before. The original, however, had another issue: with more than 2 levels of VLANs, we would end up generating incorrectly tagged traffic. This is no longer possible.
Fixes: fbd929f2dce460456807a51e18d623db3db9f077 (bonding: support QinQ for bond arp interval)
CC: Jay Vosburgh <j.vosburgh@gmail.com>
CC: Veaceslav Falico <vfalico@redhat.com>
CC: Andy Gospodarek <andy@greyhouse.net>
CC: Ding Tianhong <dingtianhong@huawei.com>
CC: Patric McHardy <kaber@trash.net>
Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-05-17  vlan: Fix lockdep warning with stacked vlan devices.  Vlad Yasevich  3  -44/+10
This reverts commit dc8eaaa006350d24030502a4521542e74b5cb39f (vlan: Fix lockdep warning when vlan dev handle notification). Instead, we use the new API to find the lock subclass of our vlan device. This way we can support configurations where vlans are interspersed with other devices: bond -> vlan -> macvlan -> vlan Signed-off-by: Vlad Yasevich <vyasevic@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-05-17  net: Find the nesting level of a given device by type.  Vlad Yasevich  1  -0/+50
Multiple devices in the kernel can be stacked/nested and they need to know their nesting level for the purposes of lockdep. This patch provides a generic function that determines a nesting level of a particular device by its type (ex: vlan, macvlan, etc). We only care about nesting of the same type of devices. For example: eth0 <- vlan0.10 <- macvlan0 <- vlan1.20 The nesting level of vlan1.20 would be 1, since there is another vlan in the stack under it. Signed-off-by: Vlad Yasevich <vyasevic@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
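The nesting rule in the example can be reproduced with a toy model. This sketch assumes each device has at most one lower device and uses invented names (`struct dev`, `nest_level()`); the kernel function works on real `net_device` lower-device lists.

```c
#include <stddef.h>

enum dev_type { DEV_ETHER, DEV_VLAN, DEV_MACVLAN };

/* Toy device: at most one lower device (sketch assumption). */
struct dev {
    enum dev_type type;
    const struct dev *lower;
};

/* Count the devices of the given type at and below dev, minus one.
 * The first device of a type is at level 0; the result is -1 if the
 * stack contains no device of that type. */
int nest_level(const struct dev *dev, enum dev_type type)
{
    int level = -1;

    for (; dev != NULL; dev = dev->lower)
        if (dev->type == type)
            level++;

    return level;
}
```

For the stack eth0 <- vlan0.10 <- macvlan0 <- vlan1.20, this gives level 1 for vlan1.20 and level 0 for vlan0.10, matching the commit message.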
2014-05-17  pktgen: Use seq_puts() where seq_printf() is not needed  Thomas Graf  1  -25/+25
Signed-off-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-05-17  net: gro: make sure skb->cb[] initial content has not to be zero  Eric Dumazet  2  -2/+3
Starting from linux-3.13, GRO attempts to build full size skbs. Problem is the commit assumed one particular field in skb->cb[] was clean, but it is not the case on some stacked devices. Timo reported a crash in case traffic is decrypted before reaching a GRE device. Fix this by initializing NAPI_GRO_CB(skb)->last at the right place, this also removes one conditional. Thanks a lot to Timo for providing full reports and bisecting this. Fixes: 8a29111c7ca6 ("net: gro: allow to build full sized skb") Bisected-by: Timo Teras <timo.teras@iki.fi> Signed-off-by: Eric Dumazet <edumazet@google.com> Tested-by: Timo Teräs <timo.teras@iki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-05-17  ieee802154, mac802154: implement devkey record option  Phoebe Buckheister  1  -0/+38
The 802.15.4-2011 standard states that for each key, a list of devices that use this key shall be kept. Previous patches have only considered two options:
 * a device "uses" (or may use) all keys, rendering the list useless
 * a device is restricted to a certain set of keys
Another option would be that a device *may* use all keys, but need not do so, and we are interested in the actual set of keys the device uses. Recording keys used by any given device may have a noticeable performance impact and might not be needed as often. The common case, in which a device will not switch keys too often, should still perform well.
Signed-off-by: Phoebe Buckheister <phoebe.buckheister@itwm.fraunhofer.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-05-17  ieee802154: add netlink interfaces for llsec  Phoebe Buckheister  4  -0/+862
This patch adds user-visible interfaces for the llsec infrastructure. For the added methods, the only major difference between the add/remove implementations lies in how the specific object is parsed, and for dump requests, in how objects are written into netlink messages. To save on boilerplate code, table dumps are routed through a helper function that handles netlink dump state, leaving the actual dumping code to care only about iterating over the table to be dumped and filling netlink messages. For add/remove methods, the boilerplate required to work is not quite as large, but still enough to also move into a local helper. Signed-off-by: Phoebe Buckheister <phoebe.buckheister@itwm.fraunhofer.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-05-17  mac802154: propagate device address changes to llsec  Phoebe Buckheister  2  -3/+47
Signed-off-by: Phoebe Buckheister <phoebe.buckheister@itwm.fraunhofer.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-05-17  mac802154: add llsec configuration functions  Phoebe Buckheister  3  -0/+238
Signed-off-by: Phoebe Buckheister <phoebe.buckheister@itwm.fraunhofer.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-05-17  ieee802154: add dgram sockopts for security control  Phoebe Buckheister  1  -0/+66
Allow datagram sockets to override the security settings of the device they send from on a per-socket basis. Requires CAP_NET_ADMIN or CAP_NET_RAW, since raw sockets can send arbitrary packets anyway. Signed-off-by: Phoebe Buckheister <phoebe.buckheister@itwm.fraunhofer.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-05-17  mac802154: integrate llsec with wpan devices  Phoebe Buckheister  2  -28/+100
Signed-off-by: Phoebe Buckheister <phoebe.buckheister@itwm.fraunhofer.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-05-17  mac802154: add llsec decryption method  Phoebe Buckheister  2  -0/+248
Signed-off-by: Phoebe Buckheister <phoebe.buckheister@itwm.fraunhofer.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-05-17  mac802154: add llsec encryption method  Phoebe Buckheister  2  -0/+255
Signed-off-by: Phoebe Buckheister <phoebe.buckheister@itwm.fraunhofer.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-05-17  mac802154: add llsec structures and mutators  Phoebe Buckheister  4  -1/+637
This patch adds containers and mutators for the major ieee802154_llsec structures to mac802154. Most of the (rather simple) ieee802154_llsec structs are wrapped only to provide an rcu_head for orderly disposal, but some structs - llsec keys notably - require more complex bookkeeping. Since each llsec key may be referenced by a number of llsec key table entries (with differing key ids, but the same actual key), we want to save memory and not allocate crypto transforms for each entry in the table. Thus, the mac802154 llsec key is reference-counted instead. Further, each key will have four associated crypto transforms - three CCM transforms for the authsizes 4/8/16 and one CTR transform for unauthenticated encryption. If we had a CCM* transform that allowed authsize 0, and authsize as part of requests instead of transforms, this would not be necessary. Signed-off-by: Phoebe Buckheister <phoebe.buckheister@itwm.fraunhofer.de> Signed-off-by: David S. Miller <davem@davemloft.net>
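The sharing scheme can be pictured as several key table entries pointing at one reference-counted key object. The sketch below is a single-threaded illustration with invented names (`struct key_sketch`, `key_get()`, `key_put()`); it uses a plain counter where kernel code would use an atomic reference count, and it leaves out the crypto transforms entirely.

```c
#include <stdlib.h>
#include <string.h>

#define KEY_LEN 16

/* One shared key. In the design described above, the expensive crypto
 * transforms would live here, allocated once per key rather than once
 * per key table entry. */
struct key_sketch {
    unsigned int refs;
    unsigned char material[KEY_LEN];
};

/* Create a key holding one reference. */
struct key_sketch *key_alloc(const unsigned char *material)
{
    struct key_sketch *k = malloc(sizeof(*k));

    if (k == NULL)
        return NULL;
    k->refs = 1;
    memcpy(k->material, material, KEY_LEN);
    return k;
}

/* Take another reference, e.g. for a second table entry. */
struct key_sketch *key_get(struct key_sketch *k)
{
    k->refs++;
    return k;
}

/* Drop a reference. Returns 1 if this was the last one and the key
 * was freed, 0 otherwise. */
int key_put(struct key_sketch *k)
{
    if (--k->refs > 0)
        return 0;
    free(k);
    return 1;
}
```

Two table entries with different key ids but the same key material would each hold one reference, and the key is released only when the last entry goes away.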
2014-05-17  mac802154: update Kconfig  Phoebe Buckheister  1  -0/+4
Link-layer security requires AES CCM for authenticated modes and AES CTR for the unauthenticated encryption mode. Signed-off-by: Phoebe Buckheister <phoebe.buckheister@itwm.fraunhofer.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-05-17  Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/jesse/openvswitch  David S. Miller  11  -184/+176
Jesse Gross says:
====================
A set of OVS changes for net-next/3.16. The major change here is a switch from per-CPU to per-NUMA flow statistics. This improves scalability by reducing kernel overhead in flow setup and maintenance.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-05-17  bridge: Automatically manage port promiscuous mode.  Vlad Yasevich  4  -7/+116
There exist configurations where the administrator or another management entity has foreknowledge of all the mac addresses of the end systems that are being bridged together. In these environments, the administrator can statically configure known addresses in the bridge FDB and disable flooding and learning on ports. This makes it possible to turn off promiscuous mode on the interfaces connected to the bridge.
Here is why disabling flooding and learning allows us to control promiscuity. Consider port X. All traffic coming into this port from outside the bridge (ingress) will be either forwarded through other ports of the bridge (egress) or dropped. Forwarding (egress) is defined by FDB entries and by flooding in the event that no FDB entry exists. In the event that flooding is disabled, only FDB entries define the egress. Once learning is disabled, only static FDB entries provided by a management entity define the egress. If we provide information from these static FDBs to the ingress port X, then we'll be able to accept all traffic that can be successfully forwarded and drop all the other traffic sooner, without spending CPU cycles to process it.
Another way to define the above is with the following equations:
    ingress = egress + drop
expanding egress:
    ingress = static FDB + learned FDB + flooding + drop
disabling flooding and learning, we are left with:
    ingress = static FDB + drop
By adding addresses from the static FDB entries to the MAC address filter of an ingress port X, we fully define what the bridge can process without dropping and can thus turn off promiscuous mode, dropping packets sooner.
There have been suggestions that we may want to allow learning and update the filters with learned addresses as well. This would require mac-level authentication similar to 802.1x to prevent attacks against the hw filters, as they are a limited resource.
Additionally, if the user places the bridge device in promiscuous mode, all ports are placed in promiscuous mode regardless of the changes to flooding and learning. Since the above functionality depends on full static configuration, we also require that vlan filtering be enabled to take advantage of this. The reason is that the bridge has to be able to receive and process VLAN-tagged frames, and there are only 2 ways to accomplish this right now: promiscuous mode or vlan filtering.
Suggested-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
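The conditions in this message can be condensed into a small decision function. The sketch below is a deliberately conservative simplification built only from the text above: the struct and function names are invented, and the real bridge code may let individual ports leave promiscuous mode in more cases than this.

```c
#include <stdbool.h>
#include <stddef.h>

/* Per-port discovery settings (hypothetical layout). */
struct port {
    bool flood;      /* unknown traffic is flooded out of this port */
    bool learning;   /* addresses are learned from this port */
};

struct bridge {
    bool promisc;          /* user put the bridge device in promisc mode */
    bool vlan_filtering;   /* vlan filtering is enabled */
    const struct port *ports;
    size_t nports;
};

/* A port that floods or learns relies on automatic discovery. */
static bool port_is_auto(const struct port *p)
{
    return p->flood || p->learning;
}

/* Conservative rule: ports must stay promiscuous if the bridge device
 * is promiscuous, if vlan filtering is off, or if any port still uses
 * flooding or learning. Otherwise egress is fully described by static
 * FDB entries and the ports can rely on their MAC filters. */
bool bridge_ports_need_promisc(const struct bridge *br)
{
    if (br->promisc || !br->vlan_filtering)
        return true;

    for (size_t i = 0; i < br->nports; i++)
        if (port_is_auto(&br->ports[i]))
            return true;

    return false;
}
```

This corresponds to the last equation: once flooding and learning are off everywhere, ingress reduces to "static FDB + drop", so the static addresses are all a port needs to accept.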
2014-05-17  bridge: Add addresses from static fdbs to non-promisc ports  Vlad Yasevich  1  -6/+69
When a static fdb entry is created, add the mac address from this fdb entry to any ports that are currently running in non-promiscuous mode. These ports need this data so that they can receive traffic destined to these addresses. By default ports start in promiscuous mode, so this feature is disabled. Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Vlad Yasevich <vyasevic@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-05-17  bridge: Introduce BR_PROMISC flag  Vlad Yasevich  2  -1/+3
Introduce a BR_PROMISC per-port flag that will help us track if the current port is supposed to be in promiscuous mode or not. For now, always start in promiscuous mode. Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Vlad Yasevich <vyasevic@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-05-17  bridge: Add functionality to sync static fdb entries to hw  Vlad Yasevich  2  -0/+58
Add code that allows static fdb entries to be synced to the hw list for a specified port. This will be used later to program ports that can function in non-promiscuous mode. Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Vlad Yasevich <vyasevic@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-05-17  bridge: Keep track of ports capable of automatic discovery.  Vlad Yasevich  4  -1/+36
By default, ports on the bridge are capable of automatic discovery of nodes located behind the port. This is accomplished via flooding of unknown traffic (BR_FLOOD) and learning the mac addresses from these packets (BR_LEARNING). If the above functionality is disabled by turning off these flags, the port requires static configuration in the form of static FDB entries to function properly. This patch adds functionality to keep track of all ports capable of automatic discovery. This will later be used to control promiscuity settings. Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Vlad Yasevich <vyasevic@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-05-17  bridge: Turn flag change macro into a function.  Vlad Yasevich  1  -10/+17
Turn the flag change macro into a function to allow easier updates and to reduce space. Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Vlad Yasevich <vyasevic@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-05-17  ipv4: ip_tunnels: disable cache for nbma gre tunnels  Timo Teräs  1  -1/+2
The connected check fails to check for ip_gre nbma mode tunnels properly. ip_gre creates temporary tnl_params with daddr specified to pass in the actual target on a per-packet basis from the neighbor layer. Detect these tunnels by inspecting the actual tunnel configuration. Minimal test case:
    ip route add 192.168.1.1/32 via 10.0.0.1
    ip route add 192.168.1.2/32 via 10.0.0.2
    ip tunnel add nbma0 mode gre key 1 tos c0
    ip addr add 172.17.0.0/16 dev nbma0
    ip link set nbma0 up
    ip neigh add 172.17.0.1 lladdr 192.168.1.1 dev nbma0
    ip neigh add 172.17.0.2 lladdr 192.168.1.2 dev nbma0
    ping 172.17.0.1
    ping 172.17.0.2
The second ping should be going to 192.168.1.2 via 10.0.0.2, but the cached gre tunnel level route is used and it is actually going to 192.168.1.1 via 10.0.0.1. The lladdrs need to resolve to separate dst entries for the bug to trigger. The test case uses separate route entries, but this can also happen when the route entry is the same: if there is a nexthop exception, or if the GRE tunnel is IPsec'ed, in which case the dst points to an xfrm bundle unique to the gre lladdr.
Fixes: 7d442fab0a67 ("ipv4: Cache dst in tunnels")
Signed-off-by: Timo Teräs <timo.teras@iki.fi>
Cc: Tom Herbert <therbert@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-05-17  ip_tunnel: don't add tunnel twice  Duan Jiong  1  -4/+2
When using the command "ip tunnel add" to add a tunnel, the tunnel will be added twice, through ip_tunnel_create() and ip_tunnel_update(). Because the second is unnecessary, we can just break after adding the tunnel through ip_tunnel_create(). Signed-off-by: Duan Jiong <duanj.fnst@cn.fujitsu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-05-17  net/dsa/dsa.c: increment chip_index during of_node handling on dsa_of_probe()  Fabian Godehardt  1  -1/+2
Adding more than one chip in the device tree currently causes the probing routine to always use the first chip's data pointer. Signed-off-by: Fabian Godehardt <fg@emlix.com> Acked-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-05-17  net: ipv6: make "ip -6 route get mark xyz" work.  Lorenzo Colitti  1  -0/+3
Currently, "ip -6 route get mark xyz" ignores the mark passed in by userspace. Make it honour the mark, just like IPv4 does. Signed-off-by: Lorenzo Colitti <lorenzo@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-05-17  net/openvswitch: Use with RCU_INIT_POINTER(x, NULL) in vport-gre.c  Monam Agarwal  1  -1/+1
This patch replaces rcu_assign_pointer(x, NULL) with RCU_INIT_POINTER(x, NULL). rcu_assign_pointer() ensures that the initialization of a structure is carried out before storing a pointer to that structure. In the case of the NULL pointer, there is no structure to initialize, so rcu_assign_pointer(p, NULL) can be safely converted to RCU_INIT_POINTER(p, NULL). Signed-off-by: Monam Agarwal <monamagarwal123@gmail.com> Signed-off-by: Jesse Gross <jesse@nicira.com>
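The ordering argument can be illustrated with C11 atomics in userspace. This is an analogy only, not the kernel RCU API: a release store stands in for the ordering that `rcu_assign_pointer()` provides, and a relaxed store stands in for `RCU_INIT_POINTER()`. All names are invented for the sketch.

```c
#include <stdatomic.h>
#include <stddef.h>

struct conf {
    int value;
};

static _Atomic(struct conf *) current_conf;

/* Publishing a real object: release ordering makes the initialisation
 * of *c visible to readers before the pointer itself. */
void conf_publish(struct conf *c)
{
    atomic_store_explicit(&current_conf, c, memory_order_release);
}

/* Clearing the pointer: there is no object whose initialisation must
 * be ordered before the store, so a relaxed store is enough. */
void conf_clear(void)
{
    atomic_store_explicit(&current_conf, (struct conf *)NULL,
                          memory_order_relaxed);
}

/* Reader side: acquire pairs with the release in conf_publish(). */
struct conf *conf_get(void)
{
    return atomic_load_explicit(&current_conf, memory_order_acquire);
}
```

A reader that observes NULL simply has nothing to dereference, which is why the cheaper store is safe in that case.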
2014-05-17  openvswitch: Use TCP flags in the flow key for stats.  Jarno Rajahalme  1  -7/+5
We already extract the TCP flags for the key, might as well use that for stats. Signed-off-by: Jarno Rajahalme <jrajahalme@nicira.com> Acked-by: Pravin B Shelar <pshelar@nicira.com> Signed-off-by: Jesse Gross <jesse@nicira.com>
2014-05-17  openvswitch: Fix output of SCTP mask.  Jarno Rajahalme  1  -4/+4
The 'output' argument of the ovs_nla_put_flow() is the one from which the bits are written to the netlink attributes. For SCTP we accidentally used the bits from the 'swkey' instead. This caused the mask attributes to include the bits from the actual flow key instead of the mask. Signed-off-by: Jarno Rajahalme <jrajahalme@nicira.com> Acked-by: Pravin B Shelar <pshelar@nicira.com> Signed-off-by: Jesse Gross <jesse@nicira.com>
2014-05-17  openvswitch: Per NUMA node flow stats.  Jarno Rajahalme  4  -55/+122
Keep kernel flow stats for each NUMA node rather than each (logical) CPU. This avoids using the per-CPU allocator, removes most of the kernel-side OVS locking overhead that otherwise sits at the top of perf reports, and allows OVS to scale better with a higher number of threads.
With 9 handlers and 4 revalidators, the netperf TCP_CRR test flow setup rate doubles on a server with two hyper-threaded physical CPUs (16 logical cores each) compared to the current OVS master. Tested with a non-trivial flow table with a TCP port match rule forcing all new connections with unique port numbers to OVS userspace. The IP addresses are still wildcarded, so the kernel flows are not considered as exact match 5-tuple flows. This type of flow can be expected to appear in large numbers as the result of more effective wildcarding made possible by improvements in the OVS userspace flow classifier.
Perf results for this test (master):
Events: 305K cycles
+ 8.43% ovs-vswitchd [kernel.kallsyms] [k] mutex_spin_on_owner
+ 5.64% ovs-vswitchd [kernel.kallsyms] [k] __ticket_spin_lock
+ 4.75% ovs-vswitchd ovs-vswitchd [.] find_match_wc
+ 3.32% ovs-vswitchd libpthread-2.15.so [.] pthread_mutex_lock
+ 2.61% ovs-vswitchd [kernel.kallsyms] [k] pcpu_alloc_area
+ 2.19% ovs-vswitchd ovs-vswitchd [.] flow_hash_in_minimask_range
+ 2.03% swapper [kernel.kallsyms] [k] intel_idle
+ 1.84% ovs-vswitchd libpthread-2.15.so [.] pthread_mutex_unlock
+ 1.64% ovs-vswitchd ovs-vswitchd [.] classifier_lookup
+ 1.58% ovs-vswitchd libc-2.15.so [.] 0x7f4e6
+ 1.07% ovs-vswitchd [kernel.kallsyms] [k] memset
+ 1.03% netperf [kernel.kallsyms] [k] __ticket_spin_lock
+ 0.92% swapper [kernel.kallsyms] [k] __ticket_spin_lock
...
And after this patch:
Events: 356K cycles
+ 6.85% ovs-vswitchd ovs-vswitchd [.] find_match_wc
+ 4.63% ovs-vswitchd libpthread-2.15.so [.] pthread_mutex_lock
+ 3.06% ovs-vswitchd [kernel.kallsyms] [k] __ticket_spin_lock
+ 2.81% ovs-vswitchd ovs-vswitchd [.] flow_hash_in_minimask_range
+ 2.51% ovs-vswitchd libpthread-2.15.so [.] pthread_mutex_unlock
+ 2.27% ovs-vswitchd ovs-vswitchd [.] classifier_lookup
+ 1.84% ovs-vswitchd libc-2.15.so [.] 0x15d30f
+ 1.74% ovs-vswitchd [kernel.kallsyms] [k] mutex_spin_on_owner
+ 1.47% swapper [kernel.kallsyms] [k] intel_idle
+ 1.34% ovs-vswitchd ovs-vswitchd [.] flow_hash_in_minimask
+ 1.33% ovs-vswitchd ovs-vswitchd [.] rule_actions_unref
+ 1.16% ovs-vswitchd ovs-vswitchd [.] hindex_node_with_hash
+ 1.16% ovs-vswitchd ovs-vswitchd [.] do_xlate_actions
+ 1.09% ovs-vswitchd ovs-vswitchd [.] ofproto_rule_ref
+ 1.01% netperf [kernel.kallsyms] [k] __ticket_spin_lock
...
There is a small increase in kernel spinlock overhead due to the same spinlock being shared between multiple cores of the same physical CPU, but that is barely visible in the netperf TCP_CRR test performance (maybe ~1% performance drop, hard to tell exactly due to variance in the test results) when testing for kernel module throughput (with no userspace activity and a handful of kernel flows).
On flow setup, a single stats instance is allocated (for NUMA node 0). As CPUs from multiple NUMA nodes start updating stats, new NUMA-node specific stats instances are allocated. This allocation on the packet processing code path is made to never block or look for emergency memory pools, minimizing the allocation latency. If the allocation fails, the existing preallocated stats instance is used. Also, if only CPUs from one NUMA node are updating the preallocated stats instance, no additional stats instances are allocated. This eliminates the need to pre-allocate stats instances that will not be used, also relieving the stats reader from the burden of reading stats that are never used.
Signed-off-by: Jarno Rajahalme <jrajahalme@nicira.com>
Acked-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: Jesse Gross <jesse@nicira.com>
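The lazy allocation with fallback described in the last paragraph can be sketched in userspace as follows. The names and the `MAX_NODES` bound are assumptions for illustration. The sketch leaves out locking, the non-blocking allocation flags, and the refinement where a single remote node keeps using the preallocated instance.

```c
#include <stdint.h>
#include <stdlib.h>

#define MAX_NODES 4   /* assumed upper bound for the sketch */

struct flow_stats {
    uint64_t packets;
    uint64_t bytes;
};

struct flow {
    /* stats[0] is allocated at flow setup; others appear on demand. */
    struct flow_stats *stats[MAX_NODES];
};

/* Flow setup: preallocate only the node 0 instance. */
int flow_init(struct flow *f)
{
    for (int i = 0; i < MAX_NODES; i++)
        f->stats[i] = NULL;
    f->stats[0] = calloc(1, sizeof(*f->stats[0]));
    return f->stats[0] != NULL ? 0 : -1;
}

/* Packet path: use the instance of the updating node, allocating it on
 * first use and falling back to node 0 if the allocation fails.
 * node must be in the range [0, MAX_NODES). */
void flow_stats_update(struct flow *f, int node, uint64_t len)
{
    struct flow_stats *s = f->stats[node];

    if (s == NULL) {
        s = calloc(1, sizeof(*s));
        if (s != NULL)
            f->stats[node] = s;
        else
            s = f->stats[0];
    }
    s->packets++;
    s->bytes += len;
}

/* Reader: only instances that were actually allocated are visited. */
uint64_t flow_total_packets(const struct flow *f)
{
    uint64_t total = 0;

    for (int i = 0; i < MAX_NODES; i++)
        if (f->stats[i] != NULL)
            total += f->stats[i]->packets;
    return total;
}

void flow_destroy(struct flow *f)
{
    for (int i = 0; i < MAX_NODES; i++) {
        free(f->stats[i]);
        f->stats[i] = NULL;
    }
}
```

Because unused nodes never get an instance, the reader skips them cheaply, which is the benefit the commit message attributes to avoiding pre-allocation.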
2014-05-17  openvswitch: Remove 5-tuple optimization.  Jarno Rajahalme  7  -113/+32
The 5-tuple optimization becomes unnecessary with a later per-NUMA node stats patch. Remove it first to make the changes easier to grasp. Signed-off-by: Jarno Rajahalme <jrajahalme@nicira.com> Signed-off-by: Jesse Gross <jesse@nicira.com>