summaryrefslogtreecommitdiff
AgeCommit message (Collapse)AuthorFilesLines
2017-02-04lib: Introduce priority array area managerJiri Pirko7-0/+871
This introduces a infrastructure for management of linear priority areas. Priority order in an array matters, however order of items inside a priority group does not matter. As an initial implementation, L-sort algorithm is used. It is quite trivial. More advanced algorithm called P-sort will be introduced as a follow-up. The infrastructure is prepared for other algos. Alongside this, a testing module is introduced as well. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-04list: introduce list_for_each_entry_from_reverse helperJiri Pirko1-0/+13
Similar to list_for_each_entry_continue and its reverse variant list_for_each_entry_continue_reverse, introduce reverse helper for list_for_each_entry_from. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Acked-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-04mlxsw: resources: Add ACL related resourcesJiri Pirko1-2/+18
Add couple of resource limits related to ACL. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Reviewed-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-04mlxsw: spectrum: Introduce basic set of flexible key blocksJiri Pirko1-0/+109
Introduce basic set of Spectrum flexible key blocks. It contains blocks needed to carry all elements defined so far. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Reviewed-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-04mlxsw: core: Introduce flexible actions supportJiri Pirko3-1/+753
Each entry which is matched during ACL lookup points to an action set. This action set contains up to three separate actions. If more actions are needed to be chained, the extended set is created to hold them in KVD linear area. This patch implements handling of sets and encoding of actions. Currectly, only two actions are supported. Drop and forward. Forward action uses PBS pointer to KVD linear area, so the action code needs to take care of this as well. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Reviewed-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-04mlxsw: core: Introduce flexible keys supportJiri Pirko3-1/+714
Hardware supports matching on so called "flexible keys". The idea is to assemble an optimal key to use for matching according to the fields in packet (elements) requested by user. Certain sets of elements are combined into pre-defined blocks. There is a picker to find needed blocks. Keys consist of 1..n blocks. Alongside with that, an initial portion of elements is introduced in order to be able to offload basic cls_flower rules. Picked keys are cached so multiple rules could share them. There is an encode function provided that takes care of encoding key and mask values according to given key. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Reviewed-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-04mlxsw: reg: Add Policy-Engine Extended Flexible Action RegisterJiri Pirko1-3/+36
PEFA register is used for accessing an extended flexible action entry in the central KVD Linear Database. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Reviewed-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-04mlxsw: reg: Add Policy-Engine Policy Based Switching RegisterJiri Pirko1-0/+31
The PPBS register retrieves and sets Policy Based Switching Table entries. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Reviewed-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-04mlxsw: reg: Add Policy-Engine Rules Copy RegisterJiri Pirko1-0/+77
The PRCR register is used for accessing rules within a TCAM region. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Reviewed-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-04mlxsw: reg: Add Policy-Engine Port Binding TableJiri Pirko1-0/+63
The PPBT is used for configuration of the Port Binding Table. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Reviewed-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-04mlxsw: reg: Add Policy-Engine TCAM Entry Register Version 2Jiri Pirko1-0/+100
The PTCE-V2 register is used for accessing rules within a TCAM region. It is a new version of PTCE in order to support wider key, mask and action within a TCAM region. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Reviewed-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-04mlxsw: reg: Add Policy-Engine TCAM Allocation RegisterJiri Pirko1-0/+106
The PTAR register is used for allocation of regions in the TCAM. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Reviewed-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-04mlxsw: reg: Add Policy-Engine ACL Group Table registerJiri Pirko1-0/+54
The PAGT register is used for configuration of the ACL Group Table. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Reviewed-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-04mlxsw: reg: Add Policy-Engine ACL RegisterJiri Pirko1-2/+45
The PACL register is used for configuration of the ACL. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Reviewed-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-04mlxsw: item: Add helpers for getting pointer into payload for char buffer itemJiri Pirko1-0/+19
Sometimes it is handy to get a pointer to a char buffer item and use it direcly to write/read data. So add these helpers. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Reviewed-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-04mlxsw: item: Add 8bit item helpersJiri Pirko1-2/+77
Item heplers for 8bit values are needed, let's add them. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Reviewed-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-04bonding: Remove unnecessary returned value checkZhu Yanjun1-6/+4
The function bond_info_query alwarys returns 0. As such, in the function bond_do_ioctl, it is not necessary to check the returned value. So the interface type of the function bond_info_query is changed to void. The redundant check is removed. Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-04tcp: clear pfmemalloc on outgoing skbEric Dumazet1-0/+7
Josef Bacik diagnosed following problem : I was seeing random disconnects while testing NBD over loopback. This turned out to be because NBD sets pfmemalloc on it's socket, however the receiving side is a user space application so does not have pfmemalloc set on its socket. This means that sk_filter_trim_cap will simply drop this packet, under the assumption that the other side will simply retransmit. Well we do retransmit, and then the packet is just dropped again for the same reason. It seems the better way to address this problem is to clear pfmemalloc in the TCP transmit path. pfmemalloc strict control really makes sense on the receive path. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-04Merge tag 'usb-serial-4.10-rc7' of ↵Greg Kroah-Hartman2-0/+2
git://git.kernel.org/pub/scm/linux/kernel/git/johan/usb-serial into usb-linus Johan writes: USB-serial fixes for v4.10-rc7 One more device ID for pl2303. Signed-off-by: Johan Hovold <johan@kernel.org>
2017-02-04cxgb4: get rid of custom busy poll codeEric Dumazet4-174/+6
In linux-4.5, busy polling was implemented in core NAPI stack, meaning that all custom implementation can be removed from drivers. Not only we remove lot of code, we also remove one spin_lock() from driver fast path. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Ganesh Goudar <ganeshgr@chelsio.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-04myri10ge: get rid of custom busy poll codeEric Dumazet1-193/+4
Compared to custom busy_poll, the generic NAPI one is simpler and removes a lot of code. It removes one atomic in the fast path (when busy poll is not in action) since we do not have to use an extra spinlock. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-04net: use a work queue to defer net_disable_timestamp() workEric Dumazet1-18/+13
Dmitry reported a warning [1] showing that we were calling net_disable_timestamp() -> static_key_slow_dec() from a non process context. Grabbing a mutex while holding a spinlock or rcu_read_lock() is not allowed. As Cong suggested, we now use a work queue. It is possible netstamp_clear() exits while netstamp_needed_deferred is not zero, but it is probably not worth trying to do better than that. netstamp_needed_deferred atomic tracks the exact number of deferred decrements. [1] [ INFO: suspicious RCU usage. ] 4.10.0-rc5+ #192 Not tainted ------------------------------- ./include/linux/rcupdate.h:561 Illegal context switch in RCU read-side critical section! other info that might help us debug this: rcu_scheduler_active = 2, debug_locks = 0 2 locks held by syz-executor14/23111: #0: (sk_lock-AF_INET6){+.+.+.}, at: [<ffffffff83a35c35>] lock_sock include/net/sock.h:1454 [inline] #0: (sk_lock-AF_INET6){+.+.+.}, at: [<ffffffff83a35c35>] rawv6_sendmsg+0x1e65/0x3ec0 net/ipv6/raw.c:919 #1: (rcu_read_lock){......}, at: [<ffffffff83ae2678>] nf_hook include/linux/netfilter.h:201 [inline] #1: (rcu_read_lock){......}, at: [<ffffffff83ae2678>] __ip6_local_out+0x258/0x840 net/ipv6/output_core.c:160 stack backtrace: CPU: 2 PID: 23111 Comm: syz-executor14 Not tainted 4.10.0-rc5+ #192 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:15 [inline] dump_stack+0x2ee/0x3ef lib/dump_stack.c:51 lockdep_rcu_suspicious+0x139/0x180 kernel/locking/lockdep.c:4452 rcu_preempt_sleep_check include/linux/rcupdate.h:560 [inline] ___might_sleep+0x560/0x650 kernel/sched/core.c:7748 __might_sleep+0x95/0x1a0 kernel/sched/core.c:7739 mutex_lock_nested+0x24f/0x1730 kernel/locking/mutex.c:752 atomic_dec_and_mutex_lock+0x119/0x160 kernel/locking/mutex.c:1060 __static_key_slow_dec+0x7a/0x1e0 kernel/jump_label.c:149 static_key_slow_dec+0x51/0x90 kernel/jump_label.c:174 net_disable_timestamp+0x3b/0x50 net/core/dev.c:1728 sock_disable_timestamp+0x98/0xc0 net/core/sock.c:403 __sk_destruct+0x27d/0x6b0 net/core/sock.c:1441 sk_destruct+0x47/0x80 net/core/sock.c:1460 __sk_free+0x57/0x230 net/core/sock.c:1468 sock_wfree+0xae/0x120 net/core/sock.c:1645 skb_release_head_state+0xfc/0x200 net/core/skbuff.c:655 skb_release_all+0x15/0x60 net/core/skbuff.c:668 __kfree_skb+0x15/0x20 net/core/skbuff.c:684 kfree_skb+0x16e/0x4c0 net/core/skbuff.c:705 inet_frag_destroy+0x121/0x290 net/ipv4/inet_fragment.c:304 inet_frag_put include/net/inet_frag.h:133 [inline] nf_ct_frag6_gather+0x1106/0x3840 net/ipv6/netfilter/nf_conntrack_reasm.c:617 ipv6_defrag+0x1be/0x2b0 net/ipv6/netfilter/nf_defrag_ipv6_hooks.c:68 nf_hook_entry_hookfn include/linux/netfilter.h:102 [inline] nf_hook_slow+0xc3/0x290 net/netfilter/core.c:310 nf_hook include/linux/netfilter.h:212 [inline] __ip6_local_out+0x489/0x840 net/ipv6/output_core.c:160 ip6_local_out+0x2d/0x170 net/ipv6/output_core.c:170 ip6_send_skb+0xa1/0x340 net/ipv6/ip6_output.c:1722 ip6_push_pending_frames+0xb3/0xe0 net/ipv6/ip6_output.c:1742 rawv6_push_pending_frames net/ipv6/raw.c:613 [inline] rawv6_sendmsg+0x2d1a/0x3ec0 net/ipv6/raw.c:927 inet_sendmsg+0x164/0x5b0 net/ipv4/af_inet.c:744 sock_sendmsg_nosec net/socket.c:635 [inline] sock_sendmsg+0xca/0x110 net/socket.c:645 sock_write_iter+0x326/0x600 net/socket.c:848 do_iter_readv_writev+0x2e3/0x5b0 fs/read_write.c:695 do_readv_writev+0x42c/0x9b0 fs/read_write.c:872 vfs_writev+0x87/0xc0 fs/read_write.c:911 do_writev+0x110/0x2c0 fs/read_write.c:944 SYSC_writev fs/read_write.c:1017 [inline] SyS_writev+0x27/0x30 fs/read_write.c:1014 entry_SYSCALL_64_fastpath+0x1f/0xc2 RIP: 0033:0x445559 RSP: 002b:00007f6f46fceb58 EFLAGS: 00000292 ORIG_RAX: 0000000000000014 RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 0000000000445559 RDX: 0000000000000001 RSI: 0000000020f1eff0 RDI: 0000000000000005 RBP: 00000000006e19c0 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000292 R12: 0000000000700000 R13: 0000000020f59000 R14: 0000000000000015 R15: 0000000000020400 BUG: sleeping function called from invalid context at kernel/locking/mutex.c:752 in_atomic(): 1, irqs_disabled(): 0, pid: 23111, name: syz-executor14 INFO: lockdep is turned off. CPU: 2 PID: 23111 Comm: syz-executor14 Not tainted 4.10.0-rc5+ #192 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:15 [inline] dump_stack+0x2ee/0x3ef lib/dump_stack.c:51 ___might_sleep+0x47e/0x650 kernel/sched/core.c:7780 __might_sleep+0x95/0x1a0 kernel/sched/core.c:7739 mutex_lock_nested+0x24f/0x1730 kernel/locking/mutex.c:752 atomic_dec_and_mutex_lock+0x119/0x160 kernel/locking/mutex.c:1060 __static_key_slow_dec+0x7a/0x1e0 kernel/jump_label.c:149 static_key_slow_dec+0x51/0x90 kernel/jump_label.c:174 net_disable_timestamp+0x3b/0x50 net/core/dev.c:1728 sock_disable_timestamp+0x98/0xc0 net/core/sock.c:403 __sk_destruct+0x27d/0x6b0 net/core/sock.c:1441 sk_destruct+0x47/0x80 net/core/sock.c:1460 __sk_free+0x57/0x230 net/core/sock.c:1468 sock_wfree+0xae/0x120 net/core/sock.c:1645 skb_release_head_state+0xfc/0x200 net/core/skbuff.c:655 skb_release_all+0x15/0x60 net/core/skbuff.c:668 __kfree_skb+0x15/0x20 net/core/skbuff.c:684 kfree_skb+0x16e/0x4c0 net/core/skbuff.c:705 inet_frag_destroy+0x121/0x290 net/ipv4/inet_fragment.c:304 inet_frag_put include/net/inet_frag.h:133 [inline] nf_ct_frag6_gather+0x1106/0x3840 net/ipv6/netfilter/nf_conntrack_reasm.c:617 ipv6_defrag+0x1be/0x2b0 net/ipv6/netfilter/nf_defrag_ipv6_hooks.c:68 nf_hook_entry_hookfn include/linux/netfilter.h:102 [inline] nf_hook_slow+0xc3/0x290 net/netfilter/core.c:310 nf_hook include/linux/netfilter.h:212 [inline] __ip6_local_out+0x489/0x840 net/ipv6/output_core.c:160 ip6_local_out+0x2d/0x170 net/ipv6/output_core.c:170 ip6_send_skb+0xa1/0x340 net/ipv6/ip6_output.c:1722 ip6_push_pending_frames+0xb3/0xe0 net/ipv6/ip6_output.c:1742 rawv6_push_pending_frames net/ipv6/raw.c:613 [inline] rawv6_sendmsg+0x2d1a/0x3ec0 net/ipv6/raw.c:927 inet_sendmsg+0x164/0x5b0 net/ipv4/af_inet.c:744 sock_sendmsg_nosec net/socket.c:635 [inline] sock_sendmsg+0xca/0x110 net/socket.c:645 sock_write_iter+0x326/0x600 net/socket.c:848 do_iter_readv_writev+0x2e3/0x5b0 fs/read_write.c:695 do_readv_writev+0x42c/0x9b0 fs/read_write.c:872 vfs_writev+0x87/0xc0 fs/read_write.c:911 do_writev+0x110/0x2c0 fs/read_write.c:944 SYSC_writev fs/read_write.c:1017 [inline] SyS_writev+0x27/0x30 fs/read_write.c:1014 entry_SYSCALL_64_fastpath+0x1f/0xc2 RIP: 0033:0x445559 Fixes: b90e5794c5bd ("net: dont call jump_label_dec from irq context") Suggested-by: Cong Wang <xiyou.wangcong@gmail.com> Reported-by: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-04be2net: get rid of custom busy poll codeEric Dumazet2-147/+9
Compared to custom busy_poll, the generic NAPI one is better, since it allows to use GRO, and it removes a lot of code and extra locked operations in fast path. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Sathya Perla <sathya.perla@broadcom.com> Cc: Ajit Khaparde <ajit.khaparde@broadcom.com> Cc: Sriharsha Basavapatna <sriharsha.basavapatna@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-04net: ipv6: Set protocol to kernel for local routesDavid Ahern1-0/+1
IPv6 stack does not set the protocol for local routes, so those routes show up with proto "none": $ ip -6 ro ls table local local ::1 dev lo proto none metric 0 pref medium local 2100:3:: dev lo proto none metric 0 pref medium local 2100:3::4 dev lo proto none metric 0 pref medium local fe80:: dev lo proto none metric 0 pref medium ... Set rt6i_protocol to RTPROT_KERNEL for consistency with IPv4. Now routes show up with proto "kernel": $ ip -6 ro ls table local local ::1 dev lo proto kernel metric 0 pref medium local 2100:3:: dev lo proto kernel metric 0 pref medium local 2100:3::4 dev lo proto kernel metric 0 pref medium local fe80:: dev lo proto kernel metric 0 pref medium ... Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-03trace: rename trace_print_hex_seq arg and add kdocDaniel Borkmann3-5/+16
Steven suggested to improve trace_print_hex_seq() a bit after commit 2acae0d5b0f7 ("trace: add variant without spacing in trace_print_hex_seq") in two ways: i) by adding a kdoc comment for the helper function itself and ii) by renaming 'spacing' argument into 'concatenate' to better denote that we don't add spaces between each hex bytes. Suggested-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-03MAINTAINERS: add Ivan as a switchdev maintainerJiri Pirko1-0/+1
Ivan will be taking care of switchdev code from now on. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Acked-by: Ivan Vecera <ivecera@redhat.com> Acked-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-03Merge branch 'bridge-per-vlan-dst_metadata-support'David S. Miller15-121/+863
Roopa Prabhu says: ==================== bridge: per vlan dst_metadata support High level summary: lwt and dst_metadata have enabled vxlan l3 deployments to use a single vxlan netdev for multiple vnis eliminating the scalability problem with using a single vxlan netdev per vni. This series tries to do the same for vxlan netdevs in pure l2 bridged networks. Use-case/deployment and details are below. Deployment scerario details: As we know VXLAN is used to build layer 2 virtual networks across the underlay layer3 infrastructure. A VXLAN tunnel endpoint (VTEP) originates and terminates VXLAN tunnels. And a VTEP can be a TOR switch or a vswitch in the hypervisor. This patch series mainly focuses on the TOR switch configured as a Vtep. Vxlan segment ID (vni) along with vlan id is used to identify layer 2 segments in a vxlan overlay network. Vxlan bridging is the function provided by Vteps to terminate vxlan tunnels and map the vxlan vni to traditional end host vlan. This is covered in the "VXLAN Deployment Scenarios" in sections 6 and 6.1 in RFC 7348. To provide vxlan bridging function, a vtep has to map vlan to a vni. The rfc says that the ingress VTEP device shall remove the IEEE 802.1Q VLAN tag in the original Layer 2 packet if there is one before encapsulating the packet into the VXLAN format to transmit it through the underlay network. The remote VTEP devices have information about the VLAN in which the packet will be placed based on their own VLAN-to-VXLAN VNI mapping configurations. Existing solution: Without this patch series one can deploy such a vtep configuration by adding the local ports and vxlan netdevs into a vlan filtering bridge. The local ports are configured as trunk ports carrying all vlans. A vxlan netdev per vni is added to the bridge. Vlan mapping to vni is achieved by configuring the vlan as pvid on the corresponding vxlan netdev. The vxlan netdev only receives traffic corresponding to the vlan it is mapped to. This configuration maps traffic belonging to a vlan to the corresponding vxlan segment. ----------------------------------- | bridge | | | ----------------------------------- |100,200 |100 (pvid) |200 (pvid) | | | swp1 vxlan1000 vxlan2000 This provides the required vxlan bridging function but poses a scalability problem with using a separate vxlan netdev for each vni. Solution in this patch series: The Goal is to use a single vxlan device to carry all vnis similar to the vxlan collect metadata mode but additionally allowing the bridge and vxlan driver to carry all the forwarding information and also learn. This implementation uses the existing dst_metadata infrastructure to map vlan to a tunnel id. - vxlan driver changes: - enable collect metadata mode to be used with learning, replication and fdb - A single fdb table hashed by (mac, vni) - rx path already has the vni - tx path expects a vni in the packet with dst_metadata and relies on learnt or static forwarding information table to forward the packet - Bridge driver changes: per vlan dst_metadata support: - Our use case is vxlan and 1-1 mapping between vlan and vni, but I have kept the api generic for any tunnel info - Uapi to configure/unconfigure/dump per vlan tunnel data - new bridge port flag to turn this feature on/off. off by default - ingress hook: - if port is a tunnel port, use tunnel info in attached dst_metadata to map it to a local vlan - egress hook: - if port is a tunnel port, use tunnel info attached to vlan to set dst_metadata on the skb Other approaches tried and vetoed: - tc vlan push/pop and tunnel metadata dst: - though tc can be used to do part of this, these patches address a deployment case where bridge driver vlan filtering and forwarding information database along with vxlan driver forwarding information table and learning are required. - making vxlan driver understand vlan-vni mapping: - I had a series almost ready with this one but soon realized it duplicated a lot of vlan handling code in the vxlan driver ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-03bridge: vlan dst_metadata hooks in ingress and egress pathsRoopa Prabhu6-2/+82
- ingress hook: - if port is a tunnel port, use tunnel info in attached dst_metadata to map it to a local vlan - egress hook: - if port is a tunnel port, use tunnel info attached to vlan to set dst_metadata on the skb CC: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-03bridge: per vlan dst_metadata netlink supportRoopa Prabhu7-48/+641
This patch adds support to attach per vlan tunnel info dst metadata. This enables bridge driver to map vlan to tunnel_info at ingress and egress. It uses the kernel dst_metadata infrastructure. The initial use case is vlan to vni bridging, but the api is generic to extend to any tunnel_info in the future: - Uapi to configure/unconfigure/dump per vlan tunnel data - netlink functions to configure vlan and tunnel_info mapping - Introduces bridge port flag BR_LWT_VLAN to enable attach/detach dst_metadata to bridged packets on ports. off by default. - changes to existing code is mainly refactor some existing vlan handling netlink code + hooks for new vlan tunnel code - I have kept the vlan tunnel code isolated in separate files. - most of the netlink vlan tunnel code is handling of vlan-tunid ranges (follows the vlan range handling code). To conserve space vlan-tunid by default are always dumped in ranges if applicable. Use case: example use for this is a vxlan bridging gateway or vtep which maps vlans to vn-segments (or vnis). iproute2 example (patched and pruned iproute2 output to just show relevant fdb entries): example shows same host mac learnt on two vni's and vlan 100 maps to vni 1000, vlan 101 maps to vni 1001 before (netdev per vni): $bridge fdb show | grep "00:02:00:00:00:03" 00:02:00:00:00:03 dev vxlan1001 vlan 101 master bridge 00:02:00:00:00:03 dev vxlan1001 dst 12.0.0.8 self 00:02:00:00:00:03 dev vxlan1000 vlan 100 master bridge 00:02:00:00:00:03 dev vxlan1000 dst 12.0.0.8 self after this patch with collect metdata in bridged mode (single netdev): $bridge fdb show | grep "00:02:00:00:00:03" 00:02:00:00:00:03 dev vxlan0 vlan 101 master bridge 00:02:00:00:00:03 dev vxlan0 src_vni 1001 dst 12.0.0.8 self 00:02:00:00:00:03 dev vxlan0 vlan 100 master bridge 00:02:00:00:00:03 dev vxlan0 src_vni 1000 dst 12.0.0.8 self CC: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-03bridge: uapi: add per vlan tunnel infoRoopa Prabhu3-0/+13
New nested netlink attribute to associate tunnel info per vlan. This is used by bridge driver to send tunnel metadata to bridge ports in vlan tunnel mode. This patch also adds new per port flag IFLA_BRPORT_VLAN_TUNNEL to enable vlan tunnel mode. off by default. One example use for this is a vxlan bridging gateway or vtep which maps vlans to vn-segments (or vnis). User can configure per-vlan tunnel information which the bridge driver can use to bridge vlan into the corresponding vn-segment. Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-03vxlan: support fdb and learning in COLLECT_METADATA modeRoopa Prabhu2-71/+126
Vxlan COLLECT_METADATA mode today solves the per-vni netdev scalability problem in l3 networks. It expects all forwarding information to be present in dst_metadata. This patch series enhances collect metadata mode to include the case where only vni is present in dst_metadata, and the vxlan driver can then use the rest of the forwarding information datbase to make forwarding decisions. There is no change to default COLLECT_METADATA behaviour. These changes only apply to COLLECT_METADATA when used with the bridging use-case with a special dst_metadata tunnel info flag (eg: where vxlan device is part of a bridge). For all this to work, the vxlan driver will need to now support a single fdb table hashed by mac + vni. This series essentially makes this happen. use-case and workflow: vxlan collect metadata device participates in bridging vlan to vn-segments. Bridge driver above the vxlan device, sends the vni corresponding to the vlan in the dst_metadata. vxlan driver will lookup forwarding database with (mac + vni) for the required remote destination information to forward the packet. Changes introduced by this patch: - allow learning and forwarding database state in vxlan netdev in COLLECT_METADATA mode. Current behaviour is not changed by default. tunnel info flag IP_TUNNEL_INFO_BRIDGE is used to support the new bridge friendly mode. - A single fdb table hashed by (mac, vni) to allow fdb entries with multiple vnis in the same fdb table - rx path already has the vni - tx path expects a vni in the packet with dst_metadata - prior to this series, fdb remote_dsts carried remote vni and the vxlan device carrying the fdb table represented the source vni. With the vxlan device now representing multiple vnis, this patch adds a src vni attribute to the fdb entry. The remote vni already uses NDA_VNI attribute. This patch introduces NDA_SRC_VNI netlink attribute to represent the src vni in a multi vni fdb table. iproute2 example (patched and pruned iproute2 output to just show relevant fdb entries): example shows same host mac learnt on two vni's. before (netdev per vni): $bridge fdb show | grep "00:02:00:00:00:03" 00:02:00:00:00:03 dev vxlan1001 dst 12.0.0.8 self 00:02:00:00:00:03 dev vxlan1000 dst 12.0.0.8 self after this patch with collect metadata in bridged mode (single netdev): $bridge fdb show | grep "00:02:00:00:00:03" 00:02:00:00:00:03 dev vxlan0 src_vni 1001 dst 12.0.0.8 self 00:02:00:00:00:03 dev vxlan0 src_vni 1000 dst 12.0.0.8 self Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-03ip_tunnels: new IP_TUNNEL_INFO_BRIDGE flag for ip_tunnel_info modeRoopa Prabhu1-0/+1
New ip_tunnel_info flag to represent bridged tunnel metadata. Used by bridge driver later in the series to pass per vlan dst metadata to bridge ports. Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-03Merge branch 'ife-to-module'David S. Miller13-90/+276
Yotam Gigi says: ==================== Extract IFE logic to module Extract ife logic from the tc_ife action into an independent module, and make the tc_ife action use it. This way, the ife encapsulation can be used by other modules other than tc_ife action. v1->v2: Fix duplicate symbol error by introducing a new patch that makes the original symbol static. The symbol ife_tlv_meta_extract is exported in act_ife, though not being used by any other module. As the symbol is being moved to the new ife module, introducing the new module creates duplicate symbol. To fix it, add a new patch (1/3) that makes the ife_tlv_meta_extract symbol static in act_ife, thus the symbol does not collide. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-03net/sched: act_ife: Change to use ife moduleYotam Gigi4-88/+34
Use the encode/decode functionality from the ife module instead of using implementation inside the act_ife. Reviewed-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: Yotam Gigi <yotamg@mellanox.com> Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Roman Mashak <mrv@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-03net: Introduce ife encapsulation moduleYotam Gigi9-0/+242
This module is responsible for the ife encapsulation protocol encode/decode logics. That module can: - ife_encode: encode skb and reserve space for the ife meta header - ife_decode: decode skb and extract the meta header size - ife_tlv_meta_encode - encodes one tlv entry into the reserved ife header space. - ife_tlv_meta_decode - decodes one tlv entry from the packet - ife_tlv_meta_next - advance to the next tlv Reviewed-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: Yotam Gigi <yotamg@mellanox.com> Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Roman Mashak <mrv@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-03net/sched: act_ife: Unexport ife_tlv_meta_encodeYotam Gigi2-4/+2
As the function ife_tlv_meta_encode is not used by any other module, unexport it and make it static for the act_ife module. Signed-off-by: Yotam Gigi <yotamg@mellanox.com> Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Roman Mashak <mrv@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-03Merge tag 'mmc-v4.10-rc6' of ↵Linus Torvalds1-1/+2
git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc Pull MMC fix from Ulf Hansson: "MMC host: sdhci: Avoid hang when receiving spurious CARD_INT interrupts" * tag 'mmc-v4.10-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc: mmc: sdhci: Ignore unexpected CARD_INT interrupts
2017-02-03acpi, nfit: fix acpi_nfit_flush_probe() crashDan Williams1-1/+5
We queue an on-stack work item to 'nfit_wq' and wait for it to complete as part of a 'flush_probe' request. However, if the user cancels the wait we need to make sure the item is flushed from the queue otherwise we are leaving an out-of-scope stack address on the work list. BUG: unable to handle kernel paging request at ffffbcb3c72f7cd0 IP: [<ffffffffa9413a7b>] __list_add+0x1b/0xb0 [..] RIP: 0010:[<ffffffffa9413a7b>] [<ffffffffa9413a7b>] __list_add+0x1b/0xb0 RSP: 0018:ffffbcb3c7ba7c00 EFLAGS: 00010046 [..] Call Trace: [<ffffffffa90bb11a>] insert_work+0x3a/0xc0 [<ffffffffa927fdda>] ? seq_open+0x5a/0xa0 [<ffffffffa90bb30a>] __queue_work+0x16a/0x460 [<ffffffffa90bbb08>] queue_work_on+0x38/0x40 [<ffffffffc0cf2685>] acpi_nfit_flush_probe+0x95/0xc0 [nfit] [<ffffffffc0cf25d0>] ? nfit_visible+0x40/0x40 [nfit] [<ffffffffa9571495>] wait_probe_show+0x25/0x60 [<ffffffffa9546b30>] dev_attr_show+0x20/0x50 Fixes: 7ae0fa439faf ("nfit, libnvdimm: async region scrub workqueue") Cc: <stable@vger.kernel.org> Reviewed-by: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2017-02-03Merge tag 'drm-fixes-for-v4.10-rc7' of ↵Linus Torvalds24-165/+171
git://people.freedesktop.org/~airlied/linux Pull drm fixes from Dave Airlie: "Another fixes pull for v4.10, it's a bit big due to the backport of the VMA fixes for i915 that should fix the oops on shutdown problems that you've worked around. There are also two drm core connector registration fixes, a bunch of nouveau regression fixes and two AMD fixes" * tag 'drm-fixes-for-v4.10-rc7' of git://people.freedesktop.org/~airlied/linux: drm/radeon: Fix vram_size/visible values in DRM_RADEON_GEM_INFO ioctl drm/amdgpu/si: fix crash on headless asics drm/i915: Track pinned vma in intel_plane_state drm/atomic: Unconditionally call prepare_fb. drm/atomic: Fix double free in drm_atomic_state_default_clear drm/nouveau/kms/nv50: request vblank events for commits that send completion events drm/nouveau/nv1a,nv1f/disp: fix memory clock rate retrieval drm/nouveau/disp/gt215: Fix HDA ELD handling (thus, HDMI audio) on gt215 drm/nouveau/nouveau/led: prevent compiling the led-code if nouveau=y and leds=m drm/nouveau/disp/mcp7x: disable dptmds workaround drm/nouveau: prevent userspace from deleting client object drm/nouveau/fence/g84-: protect against concurrent access to semaphore buffers drm: Don't race connector registration drm: prevent double-(un)registration for connectors
2017-02-03Merge tag 'powerpc-4.10-3' of ↵Linus Torvalds11-62/+11
git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux Pull powerpc fixes from Michael Ellerman: "The main change is we're reverting the initial stack protector support we merged this cycle. It turns out to not work on toolchains built with libc support, and fixing it will be need to wait for another release. And the rest are all fairly minor: - Some pasemi machines were not booting due to a missing error check in prom_find_boot_cpu() - In EEH we were checking a pointer rather than the bool it pointed to - The clang build was broken by a BUILD_BUG_ON() we added. - The radix (Power9 only) version of map_kernel_page() was broken if our memory size was a multiple of 2MB, which it generally isn't Thanks to: Darren Stevens, Gavin Shan, Reza Arbab" * tag 'powerpc-4.10-3' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: powerpc/mm: Use the correct pointer when setting a 2MB pte powerpc: Fix build failure with clang due to BUILD_BUG_ON() powerpc: Revert the initial stack protector support powerpc/eeh: Fix wrong flag passed to eeh_unfreeze_pe() powerpc: Add missing error check to prom_find_boot_cpu()
2017-02-03Merge tag 'trace-v4.10-rc2-2' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace Pull tracing fix from Steven Rostedt: "Simple fix of s/static struct __init/static __init struct/" * tag 'trace-v4.10-rc2-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: tracing/kprobes: Fix __init annotation
2017-02-03Merge branch 'modversions' (modversions fixes for powerpc from Ard)Linus Torvalds12-62/+93
Merge kcrctab entry fixes from Ard Biesheuvel: "This is a followup to [0] 'modversions: redefine kcrctab entries as relative CRC pointers', but since relative CRC pointers do not work in modules, and are actually only needed by powerpc with CONFIG_RELOCATABLE=y, I have made it a Kconfig selectable feature instead. First it introduces the MODULE_REL_CRCS Kconfig symbol, and adds the kbuild handling of it, i.e., modpost, genksyms and kallsyms. Then it switches all architectures to 32-bit CRC entries in kcrctab, where all architectures except powerpc with CONFIG_RELOCATABLE=y use absolute ELF symbol references as before" [0] http://marc.info/?l=linux-arch&m=148493613415294&w=2 * emailed patches from Ard Biesheuvel: module: unify absolute krctab definitions for 32-bit and 64-bit modversions: treat symbol CRCs as 32 bit quantities kbuild: modversions: add infrastructure for emitting relative CRCs
2017-02-03log2: make order_base_2() behave correctly on const input value zeroArd Biesheuvel1-1/+12
The function order_base_2() is defined (according to the comment block) as returning zero on input zero, but subsequently passes the input into roundup_pow_of_two(), which is explicitly undefined for input zero. This has gone unnoticed until now, but optimization passes in GCC 7 may produce constant folded function instances where a constant value of zero is passed into order_base_2(), resulting in link errors against the deliberately undefined '____ilog2_NaN'. So update order_base_2() to adhere to its own documented interface. [ See http://marc.info/?l=linux-kernel&m=147672952517795&w=2 and follow-up discussion for more background. The gcc "optimization pass" is really just broken, but now the GCC trunk problem seems to have escaped out of just specially built daily images, so we need to work around it in mainline. - Linus ] Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-03KVM: x86: do not save guest-unsupported XSAVE stateRadim Krčmář1-0/+1
Saving unsupported state prevents migration when the new host does not support a XSAVE feature of the original host, even if the feature is not exposed to the guest. We've masked host features with guest-visible features before, with 4344ee981e21 ("KVM: x86: only copy XSAVE state for the supported features") and dropped it when implementing XSAVES. Do it again. Fixes: df1daba7d1cb ("KVM: x86: support XSAVES usage in the host") Cc: stable@vger.kernel.org Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-02-03module: unify absolute krctab definitions for 32-bit and 64-bitArd Biesheuvel1-7/+0
The previous patch introduced a separate inline asm version of the krcrctab declaration template for use with 64-bit architectures, which cannot refer to ELF symbols using 32-bit quantities. This declaration should be equivalent to the C one for 32-bit architectures, but just in case - unify them in a separate patch, which can simply be dropped if it turns out to break anything. Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-03modversions: treat symbol CRCs as 32 bit quantitiesArd Biesheuvel7-52/+53
The modversion symbol CRCs are emitted as ELF symbols, which allows us to easily populate the kcrctab sections by relying on the linker to associate each kcrctab slot with the correct value. This has a couple of downsides: - Given that the CRCs are treated as memory addresses, we waste 4 bytes for each CRC on 64 bit architectures, - On architectures that support runtime relocation, a R_<arch>_RELATIVE relocation entry is emitted for each CRC value, which identifies it as a quantity that requires fixing up based on the actual runtime load offset of the kernel. This results in corrupted CRCs unless we explicitly undo the fixup (and this is currently being handled in the core module code) - Such runtime relocation entries take up 24 bytes of __init space each, resulting in a x8 overhead in [uncompressed] kernel size for CRCs. Switching to explicit 32 bit values on 64 bit architectures fixes most of these issues, given that 32 bit values are not treated as quantities that require fixing up based on the actual runtime load offset. Note that on some ELF64 architectures [such as PPC64], these 32-bit values are still emitted as [absolute] runtime relocatable quantities, even if the value resolves to a build time constant. Since relative relocations are always resolved at build time, this patch enables MODULE_REL_CRCS on powerpc when CONFIG_RELOCATABLE=y, which turns the absolute CRC references into relative references into .rodata where the actual CRC value is stored. So redefine all CRC fields and variables as u32, and redefine the __CRC_SYMBOL() macro for 64 bit builds to emit the CRC reference using inline assembler (which is necessary since 64-bit C code cannot use 32-bit types to hold memory addresses, even if they are ultimately resolved using values that do not exceed 0xffffffff). To avoid potential problems with legacy 32-bit architectures using legacy toolchains, the equivalent C definition of the kcrctab entry is retained for 32-bit architectures. Note that this mostly reverts commit d4703aefdbc8 ("module: handle ppc64 relocating kcrctabs when CONFIG_RELOCATABLE=y") Acked-by: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-03kbuild: modversions: add infrastructure for emitting relative CRCsArd Biesheuvel5-5/+42
This add the kbuild infrastructure that will allow architectures to emit vmlinux symbol CRCs as 32-bit offsets to another location in the kernel where the actual value is stored. This works around problems with CRCs being mistaken for relocatable symbols on kernels that self relocate at runtime (i.e., powerpc with CONFIG_RELOCATABLE=y) For the kbuild side of things, this comes down to the following: - introducing a Kconfig symbol MODULE_REL_CRCS - adding a -R switch to genksyms to instruct it to emit the CRC symbols as references into the .rodata section - making modpost distinguish such references from absolute CRC symbols by the section index (SHN_ABS) - making kallsyms disregard non-absolute symbols with a __crc_ prefix Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-03tcp: add tcp_mss_clamp() helperEric Dumazet5-23/+17
Small cleanup factorizing code doing the TCP_MAXSEG clamping. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-03hns_enet: use cpumask_var_t for on-stack maskArnd Bergmann1-10/+15
On large SMP builds, we can run into a build warning: drivers/net/ethernet/hisilicon/hns/hns_enet.c: In function 'hns_set_irq_affinity.isra.27': drivers/net/ethernet/hisilicon/hns/hns_enet.c:1242:1: warning: the frame size of 1032 bytes is larger than 1024 bytes [-Wframe-larger-than=] The solution here is to use cpumask_var_t, which can use dynamic allocation when CONFIG_CPUMASK_OFFSTACK is enabled. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-03virtio_net: remove custom busy_pollEric Dumazet1-41/+0
Generic NAPI busy polling allows us to remove custom implementations found in drivers. It is possible further optimization could be done by testing napi_complete_done() return value. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>