diff options
author | Linus Torvalds <torvalds@linux-foundation.org> | 2015-06-25 02:49:49 +0300 |
---|---|---|
committer | Linus Torvalds <torvalds@linux-foundation.org> | 2015-06-25 02:49:49 +0300 |
commit | e0456717e483bb8a9431b80a5bdc99a928b9b003 (patch) | |
tree | 5eb5add2bafd1f20326d70f5cb3b711d00a40b10 /Documentation/networking | |
parent | 98ec21a01896751b673b6c731ca8881daa8b2c6d (diff) | |
parent | 1ea2d020ba477cb7011a7174e8501a9e04a325d4 (diff) | |
download | linux-e0456717e483bb8a9431b80a5bdc99a928b9b003.tar.xz |
Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
Pull networking updates from David Miller:
1) Add TX fast path in mac80211, from Johannes Berg.
2) Add TSO/GRO support to ibmveth, from Thomas Falcon
3) Move away from cached routes in ipv6, just like ipv4, from Martin
KaFai Lau.
4) Lots of new rhashtable tests, from Thomas Graf.
5) Run ingress qdisc lockless, from Alexei Starovoitov.
6) Allow servers to fetch TCP packet headers for SYN packets of new
connections, for fingerprinting. From Eric Dumazet.
7) Add mode parameter to pktgen, for testing receive. From Alexei
Starovoitov.
8) Cache access optimizations via simplifications of build_skb(), from
Alexander Duyck.
9) Move page frag allocator under mm/, also from Alexander.
10) Add xmit_more support to hv_netvsc, from KY Srinivasan.
11) Add a counter guard in case we try to perform endless reclassify
loops in the packet scheduler.
12) Extern flow dissector to be programmable and use it in new "Flower"
classifier. From Jiri Pirko.
13) AF_PACKET fanout rollover fixes, performance improvements, and new
statistics. From Willem de Bruijn.
14) Add netdev driver for GENEVE tunnels, from John W Linville.
15) Add ingress netfilter hooks and filtering, from Pablo Neira Ayuso.
16) Fix handling of epoll edge triggers in TCP, from Eric Dumazet.
17) Add an ECN retry fallback for the initial TCP handshake, from Daniel
Borkmann.
18) Add tail call support to BPF, from Alexei Starovoitov.
19) Add several pktgen helper scripts, from Jesper Dangaard Brouer.
20) Add zerocopy support to AF_UNIX, from Hannes Frederic Sowa.
21) Favor even port numbers for allocation to connect() requests, and
odd port numbers for bind(0), in an effort to help avoid
ip_local_port_range exhaustion. From Eric Dumazet.
22) Add Cavium ThunderX driver, from Sunil Goutham.
23) Allow bpf programs to access skb_iif and dev->ifindex SKB metadata,
from Alexei Starovoitov.
24) Add support for T6 chips in cxgb4vf driver, from Hariprasad Shenai.
25) Double TCP Small Queues default to 256K to accomodate situations
like the XEN driver and wireless aggregation. From Wei Liu.
26) Add more entropy inputs to flow dissector, from Tom Herbert.
27) Add CDG congestion control algorithm to TCP, from Kenneth Klette
Jonassen.
28) Convert ipset over to RCU locking, from Jozsef Kadlecsik.
29) Track and act upon link status of ipv4 route nexthops, from Andy
Gospodarek.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1670 commits)
bridge: vlan: flush the dynamically learned entries on port vlan delete
bridge: multicast: add a comment to br_port_state_selection about blocking state
net: inet_diag: export IPV6_V6ONLY sockopt
stmmac: troubleshoot unexpected bits in des0 & des1
net: ipv4 sysctl option to ignore routes when nexthop link is down
net: track link-status of ipv4 nexthops
net: switchdev: ignore unsupported bridge flags
net: Cavium: Fix MAC address setting in shutdown state
drivers: net: xgene: fix for ACPI support without ACPI
ip: report the original address of ICMP messages
net/mlx5e: Prefetch skb data on RX
net/mlx5e: Pop cq outside mlx5e_get_cqe
net/mlx5e: Remove mlx5e_cq.sqrq back-pointer
net/mlx5e: Remove extra spaces
net/mlx5e: Avoid TX CQE generation if more xmit packets expected
net/mlx5e: Avoid redundant dev_kfree_skb() upon NOP completion
net/mlx5e: Remove re-assignment of wq type in mlx5e_enable_rq()
net/mlx5e: Use skb_shinfo(skb)->gso_segs rather than counting them
net/mlx5e: Static mapping of netdev priv resources to/from netdev TX queues
net/mlx4_en: Use HW counters for rx/tx bytes/packets in PF device
...
Diffstat (limited to 'Documentation/networking')
-rw-r--r-- | Documentation/networking/bonding.txt | 84 | ||||
-rw-r--r-- | Documentation/networking/can.txt | 3 | ||||
-rw-r--r-- | Documentation/networking/dctcp.txt | 1 | ||||
-rw-r--r-- | Documentation/networking/ieee802154.txt | 32 | ||||
-rw-r--r-- | Documentation/networking/ip-sysctl.txt | 25 | ||||
-rw-r--r-- | Documentation/networking/pktgen.txt | 150 | ||||
-rw-r--r-- | Documentation/networking/switchdev.txt | 419 | ||||
-rw-r--r-- | Documentation/networking/tc-actions-env-rules.txt | 6 |
8 files changed, 590 insertions, 130 deletions
diff --git a/Documentation/networking/bonding.txt b/Documentation/networking/bonding.txt index 83bf4986baea..334b49ef02d1 100644 --- a/Documentation/networking/bonding.txt +++ b/Documentation/networking/bonding.txt @@ -51,6 +51,7 @@ Table of Contents 3.4 Configuring Bonding Manually via Sysfs 3.5 Configuration with Interfaces Support 3.6 Overriding Configuration for Special Cases +3.7 Configuring LACP for 802.3ad mode in a more secure way 4. Querying Bonding Configuration 4.1 Bonding Configuration @@ -178,6 +179,27 @@ active_slave active slave, or the empty string if there is no active slave or the current mode does not use an active slave. +ad_actor_sys_prio + + In an AD system, this specifies the system priority. The allowed range + is 1 - 65535. If the value is not specified, it takes 65535 as the + default value. + + This parameter has effect only in 802.3ad mode and is available through + SysFs interface. + +ad_actor_system + + In an AD system, this specifies the mac-address for the actor in + protocol packet exchanges (LACPDUs). The value cannot be NULL or + multicast. It is preferred to have the local-admin bit set for this + mac but driver does not enforce it. If the value is not given then + system defaults to using the masters' mac address as actors' system + address. + + This parameter has effect only in 802.3ad mode and is available through + SysFs interface. + ad_select Specifies the 802.3ad aggregation selection logic to use. The @@ -220,6 +242,21 @@ ad_select This option was added in bonding version 3.4.0. +ad_user_port_key + + In an AD system, the port-key has three parts as shown below - + + Bits Use + 00 Duplex + 01-05 Speed + 06-15 User-defined + + This defines the upper 10 bits of the port key. The values can be + from 0 - 1023. If not given, the system defaults to 0. + + This parameter has effect only in 802.3ad mode and is available through + SysFs interface. + all_slaves_active Specifies that duplicate frames (received on inactive ports) should be @@ -1622,6 +1659,53 @@ output port selection. This feature first appeared in bonding driver version 3.7.0 and support for output slave selection was limited to round-robin and active-backup modes. +3.7 Configuring LACP for 802.3ad mode in a more secure way +---------------------------------------------------------- + +When using 802.3ad bonding mode, the Actor (host) and Partner (switch) +exchange LACPDUs. These LACPDUs cannot be sniffed, because they are +destined to link local mac addresses (which switches/bridges are not +supposed to forward). However, most of the values are easily predictable +or are simply the machine's MAC address (which is trivially known to all +other hosts in the same L2). This implies that other machines in the L2 +domain can spoof LACPDU packets from other hosts to the switch and potentially +cause mayhem by joining (from the point of view of the switch) another +machine's aggregate, thus receiving a portion of that hosts incoming +traffic and / or spoofing traffic from that machine themselves (potentially +even successfully terminating some portion of flows). Though this is not +a likely scenario, one could avoid this possibility by simply configuring +few bonding parameters: + + (a) ad_actor_system : You can set a random mac-address that can be used for + these LACPDU exchanges. The value can not be either NULL or Multicast. + Also it's preferable to set the local-admin bit. Following shell code + generates a random mac-address as described above. + + # sys_mac_addr=$(printf '%02x:%02x:%02x:%02x:%02x:%02x' \ + $(( (RANDOM & 0xFE) | 0x02 )) \ + $(( RANDOM & 0xFF )) \ + $(( RANDOM & 0xFF )) \ + $(( RANDOM & 0xFF )) \ + $(( RANDOM & 0xFF )) \ + $(( RANDOM & 0xFF ))) + # echo $sys_mac_addr > /sys/class/net/bond0/bonding/ad_actor_system + + (b) ad_actor_sys_prio : Randomize the system priority. The default value + is 65535, but system can take the value from 1 - 65535. Following shell + code generates random priority and sets it. + + # sys_prio=$(( 1 + RANDOM + RANDOM )) + # echo $sys_prio > /sys/class/net/bond0/bonding/ad_actor_sys_prio + + (c) ad_user_port_key : Use the user portion of the port-key. The default + keeps this empty. These are the upper 10 bits of the port-key and value + ranges from 0 - 1023. Following shell code generates these 10 bits and + sets it. + + # usr_port_key=$(( RANDOM & 0x3FF )) + # echo $usr_port_key > /sys/class/net/bond0/bonding/ad_user_port_key + + 4 Querying Bonding Configuration ================================= diff --git a/Documentation/networking/can.txt b/Documentation/networking/can.txt index 5abad1e921ca..b48d4a149411 100644 --- a/Documentation/networking/can.txt +++ b/Documentation/networking/can.txt @@ -268,6 +268,9 @@ solution for a couple of reasons: struct can_frame { canid_t can_id; /* 32 bit CAN_ID + EFF/RTR/ERR flags */ __u8 can_dlc; /* frame payload length in byte (0 .. 8) */ + __u8 __pad; /* padding */ + __u8 __res0; /* reserved / padding */ + __u8 __res1; /* reserved / padding */ __u8 data[8] __attribute__((aligned(8))); }; diff --git a/Documentation/networking/dctcp.txt b/Documentation/networking/dctcp.txt index 0d5dfbc89ec9..13a857753208 100644 --- a/Documentation/networking/dctcp.txt +++ b/Documentation/networking/dctcp.txt @@ -8,6 +8,7 @@ the data center network to provide multi-bit feedback to the end hosts. To enable it on end hosts: sysctl -w net.ipv4.tcp_congestion_control=dctcp + sysctl -w net.ipv4.tcp_ecn_fallback=0 (optional) All switches in the data center network running DCTCP must support ECN marking and be configured for marking when reaching defined switch buffer diff --git a/Documentation/networking/ieee802154.txt b/Documentation/networking/ieee802154.txt index 22bbc7225f8e..1700756af057 100644 --- a/Documentation/networking/ieee802154.txt +++ b/Documentation/networking/ieee802154.txt @@ -30,8 +30,8 @@ int sd = socket(PF_IEEE802154, SOCK_DGRAM, 0); The address family, socket addresses etc. are defined in the include/net/af_ieee802154.h header or in the special header -in our userspace package (see either linux-zigbee sourceforge download page -or git tree at git://linux-zigbee.git.sourceforge.net/gitroot/linux-zigbee). +in the userspace package (see either http://wpan.cakelab.org/ or the +git tree at https://github.com/linux-wpan/wpan-tools). One can use SOCK_RAW for passing raw data towards device xmit function. YMMV. @@ -49,15 +49,6 @@ Like with WiFi, there are several types of devices implementing IEEE 802.15.4. Those types of devices require different approach to be hooked into Linux kernel. -MLME - MAC Level Management -============================ - -Most of IEEE 802.15.4 MLME interfaces are directly mapped on netlink commands. -See the include/net/nl802154.h header. Our userspace tools package -(see above) provides CLI configuration utility for radio interfaces and simple -coordinator for IEEE 802.15.4 networks as an example users of MLME protocol. - - HardMAC ======= @@ -75,8 +66,6 @@ net_device with a pointer to struct ieee802154_mlme_ops instance. The fields assoc_req, assoc_resp, disassoc_req, start_req, and scan_req are optional. All other fields are required. -We provide an example of simple HardMAC driver at drivers/ieee802154/fakehard.c - SoftMAC ======= @@ -89,7 +78,8 @@ stack interface for network sniffers (e.g. WireShark). This layer is going to be extended soon. -See header include/net/mac802154.h and several drivers in drivers/ieee802154/. +See header include/net/mac802154.h and several drivers in +drivers/net/ieee802154/. Device drivers API @@ -114,18 +104,17 @@ Moreover IEEE 802.15.4 device operations structure should be filled. Fake drivers ============ -In addition there are two drivers available which simulate real devices with -HardMAC (fakehard) and SoftMAC (fakelb - IEEE 802.15.4 loopback driver) -interfaces. This option provides possibility to test and debug stack without -usage of real hardware. +In addition there is a driver available which simulates a real device with +SoftMAC (fakelb - IEEE 802.15.4 loopback driver) interface. This option +provides possibility to test and debug stack without usage of real hardware. -See sources in drivers/ieee802154 folder for more details. +See sources in drivers/net/ieee802154 folder for more details. 6LoWPAN Linux implementation ============================ -The IEEE 802.15.4 standard specifies an MTU of 128 bytes, yielding about 80 +The IEEE 802.15.4 standard specifies an MTU of 127 bytes, yielding about 80 octets of actual MAC payload once security is turned on, on a wireless link with a link throughput of 250 kbps or less. The 6LoWPAN adaptation format [RFC4944] was specified to carry IPv6 datagrams over such constrained links, @@ -140,7 +129,8 @@ In Semptember 2011 the standard update was published - [RFC6282]. It deprecates HC1 and HC2 compression and defines IPHC encoding format which is used in this Linux implementation. -All the code related to 6lowpan you may find in files: net/ieee802154/6lowpan.* +All the code related to 6lowpan you may find in files: net/6lowpan/* +and net/ieee802154/6lowpan/* To setup 6lowpan interface you need (busybox release > 1.17.0): 1. Add IEEE802.15.4 interface and initialize PANid; diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt index 071fb18dc57c..5fae7704daab 100644 --- a/Documentation/networking/ip-sysctl.txt +++ b/Documentation/networking/ip-sysctl.txt @@ -267,6 +267,15 @@ tcp_ecn - INTEGER but do not request ECN on outgoing connections. Default: 2 +tcp_ecn_fallback - BOOLEAN + If the kernel detects that ECN connection misbehaves, enable fall + back to non-ECN. Currently, this knob implements the fallback + from RFC3168, section 6.1.1.1., but we reserve that in future, + additional detection mechanisms could be implemented under this + knob. The value is not used, if tcp_ecn or per route (or congestion + control) ECN settings are disabled. + Default: 1 (fallback enabled) + tcp_fack - BOOLEAN Enable FACK congestion avoidance and fast retransmission. The value is not used, if tcp_sack is not enabled. @@ -742,8 +751,10 @@ IP Variables: ip_local_port_range - 2 INTEGERS Defines the local port range that is used by TCP and UDP to choose the local port. The first number is the first, the - second the last local port number. The default values are - 32768 and 61000 respectively. + second the last local port number. + If possible, it is better these numbers have different parity. + (one even and one odd values) + The default values are 32768 and 60999 respectively. ip_local_reserved_ports - list of comma separated ranges Specify the ports which are reserved for known third-party @@ -766,7 +777,7 @@ ip_local_reserved_ports - list of comma separated ranges ip_local_port_range, e.g.: $ cat /proc/sys/net/ipv4/ip_local_port_range - 32000 61000 + 32000 60999 $ cat /proc/sys/net/ipv4/ip_local_reserved_ports 8080,9148 @@ -1213,6 +1224,14 @@ auto_flowlabels - BOOLEAN FALSE: disabled Default: false +flowlabel_state_ranges - BOOLEAN + Split the flow label number space into two ranges. 0-0x7FFFF is + reserved for the IPv6 flow manager facility, 0x80000-0xFFFFF + is reserved for stateless flow labels as described in RFC6437. + TRUE: enabled + FALSE: disabled + Default: true + anycast_src_echo_reply - BOOLEAN Controls the use of anycast addresses as source addresses for ICMPv6 echo reply diff --git a/Documentation/networking/pktgen.txt b/Documentation/networking/pktgen.txt index 0344f1d45b37..f4be85e96005 100644 --- a/Documentation/networking/pktgen.txt +++ b/Documentation/networking/pktgen.txt @@ -1,6 +1,6 @@ - HOWTO for the linux packet generator + HOWTO for the linux packet generator ------------------------------------ Enable CONFIG_NET_PKTGEN to compile and build pktgen either in-kernel @@ -50,17 +50,33 @@ For ixgbe use e.g. "30" resulting in approx 33K interrupts/sec (1/30*10^6): # ethtool -C ethX rx-usecs 30 -Viewing threads -=============== -/proc/net/pktgen/kpktgend_0 -Name: kpktgend_0 max_before_softirq: 10000 -Running: -Stopped: eth1 -Result: OK: max_before_softirq=10000 +Kernel threads +============== +Pktgen creates a thread for each CPU with affinity to that CPU. +Which is controlled through procfile /proc/net/pktgen/kpktgend_X. + +Example: /proc/net/pktgen/kpktgend_0 + + Running: + Stopped: eth4@0 + Result: OK: add_device=eth4@0 + +Most important are the devices assigned to the thread. + +The two basic thread commands are: + * add_device DEVICE@NAME -- adds a single device + * rem_device_all -- remove all associated devices + +When adding a device to a thread, a corrosponding procfile is created +which is used for configuring this device. Thus, device names need to +be unique. -Most important are the devices assigned to the thread. Note that a -device can only belong to one thread. +To support adding the same device to multiple threads, which is useful +with multi queue NICs, a the device naming scheme is extended with "@": + device@something +The part after "@" can be anything, but it is custom to use the thread +number. Viewing devices =============== @@ -69,29 +85,32 @@ The Params section holds configured information. The Current section holds running statistics. The Result is printed after a run or after interruption. Example: -/proc/net/pktgen/eth1 +/proc/net/pktgen/eth4@0 -Params: count 10000000 min_pkt_size: 60 max_pkt_size: 60 - frags: 0 delay: 0 clone_skb: 1000000 ifname: eth1 + Params: count 100000 min_pkt_size: 60 max_pkt_size: 60 + frags: 0 delay: 0 clone_skb: 64 ifname: eth4@0 flows: 0 flowlen: 0 - dst_min: 10.10.11.2 dst_max: - src_min: src_max: - src_mac: 00:00:00:00:00:00 dst_mac: 00:04:23:AC:FD:82 - udp_src_min: 9 udp_src_max: 9 udp_dst_min: 9 udp_dst_max: 9 - src_mac_count: 0 dst_mac_count: 0 - Flags: -Current: - pkts-sofar: 10000000 errors: 39664 - started: 1103053986245187us stopped: 1103053999346329us idle: 880401us - seq_num: 10000011 cur_dst_mac_offset: 0 cur_src_mac_offset: 0 - cur_saddr: 0x10a0a0a cur_daddr: 0x20b0a0a - cur_udp_dst: 9 cur_udp_src: 9 + queue_map_min: 0 queue_map_max: 0 + dst_min: 192.168.81.2 dst_max: + src_min: src_max: + src_mac: 90:e2:ba:0a:56:b4 dst_mac: 00:1b:21:3c:9d:f8 + udp_src_min: 9 udp_src_max: 109 udp_dst_min: 9 udp_dst_max: 9 + src_mac_count: 0 dst_mac_count: 0 + Flags: UDPSRC_RND NO_TIMESTAMP QUEUE_MAP_CPU + Current: + pkts-sofar: 100000 errors: 0 + started: 623913381008us stopped: 623913396439us idle: 25us + seq_num: 100001 cur_dst_mac_offset: 0 cur_src_mac_offset: 0 + cur_saddr: 192.168.8.3 cur_daddr: 192.168.81.2 + cur_udp_dst: 9 cur_udp_src: 42 + cur_queue_map: 0 flows: 0 -Result: OK: 13101142(c12220741+d880401) usec, 10000000 (60byte,0frags) - 763292pps 390Mb/sec (390805504bps) errors: 39664 + Result: OK: 15430(c15405+d25) usec, 100000 (60byte,0frags) + 6480562pps 3110Mb/sec (3110669760bps) errors: 0 -Configuring threads and devices -================================ + +Configuring devices +=================== This is done via the /proc interface, and most easily done via pgset as defined in the sample scripts. @@ -126,7 +145,7 @@ Examples: To select queue 1 of a given device, use queue_map_min=1 and queue_map_max=1 - pgset "src_mac_count 1" Sets the number of MACs we'll range through. + pgset "src_mac_count 1" Sets the number of MACs we'll range through. The 'minimum' MAC is what you set with srcmac. pgset "dst_mac_count 1" Sets the number of MACs we'll range through. @@ -145,6 +164,7 @@ Examples: UDPCSUM, IPSEC # IPsec encapsulation (needs CONFIG_XFRM) NODE_ALLOC # node specific memory allocation + NO_TIMESTAMP # disable timestamping pgset spi SPI_VALUE Set specific SA used to transform packet. @@ -192,24 +212,43 @@ Examples: pgset "rate 300M" set rate to 300 Mb/s pgset "ratep 1000000" set rate to 1Mpps + pgset "xmit_mode netif_receive" RX inject into stack netif_receive_skb() + Works with "burst" but not with "clone_skb". + Default xmit_mode is "start_xmit". + Sample scripts ============== -A collection of small tutorial scripts for pktgen is in the -samples/pktgen directory: +A collection of tutorial scripts and helpers for pktgen is in the +samples/pktgen directory. The helper parameters.sh file support easy +and consistant parameter parsing across the sample scripts. + +Usage example and help: + ./pktgen_sample01_simple.sh -i eth4 -m 00:1B:21:3C:9D:F8 -d 192.168.8.2 + +Usage: ./pktgen_sample01_simple.sh [-vx] -i ethX + -i : ($DEV) output interface/device (required) + -s : ($PKT_SIZE) packet size + -d : ($DEST_IP) destination IP + -m : ($DST_MAC) destination MAC-addr + -t : ($THREADS) threads to start + -c : ($SKB_CLONE) SKB clones send before alloc new SKB + -b : ($BURST) HW level bursting of SKBs + -v : ($VERBOSE) verbose + -x : ($DEBUG) debug + +The global variables being set are also listed. E.g. the required +interface/device parameter "-i" sets variable $DEV. Copy the +pktgen_sampleXX scripts and modify them to fit your own needs. + +The old scripts: -pktgen.conf-1-1 # 1 CPU 1 dev pktgen.conf-1-2 # 1 CPU 2 dev -pktgen.conf-2-1 # 2 CPU's 1 dev -pktgen.conf-2-2 # 2 CPU's 2 dev pktgen.conf-1-1-rdos # 1 CPU 1 dev w. route DoS pktgen.conf-1-1-ip6 # 1 CPU 1 dev ipv6 pktgen.conf-1-1-ip6-rdos # 1 CPU 1 dev ipv6 w. route DoS pktgen.conf-1-1-flows # 1 CPU 1 dev multiple flows. -Run in shell: ./pktgen.conf-X-Y -This does all the setup including sending. - Interrupt affinity =================== @@ -217,6 +256,9 @@ Note that when adding devices to a specific CPU it is a good idea to also assign /proc/irq/XX/smp_affinity so that the TX interrupts are bound to the same CPU. This reduces cache bouncing when freeing skbs. +Plus using the device flag QUEUE_MAP_CPU, which maps the SKBs TX queue +to the running threads CPU (directly from smp_processor_id()). + Enable IPsec ============ Default IPsec transformation with ESP encapsulation plus transport mode @@ -237,18 +279,19 @@ Current commands and configuration options start stop +reset ** Thread commands: add_device rem_device_all -max_before_softirq ** Device commands: count clone_skb +burst debug frags @@ -257,10 +300,17 @@ delay src_mac_count dst_mac_count -pkt_size +pkt_size min_pkt_size max_pkt_size +queue_map_min +queue_map_max +skb_priority + +tos (ipv4) +traffic_class (ipv6) + mpls udp_src_min @@ -269,6 +319,8 @@ udp_src_max udp_dst_min udp_dst_max +node + flag IPSRC_RND IPDST_RND @@ -287,6 +339,9 @@ flag UDPCSUM IPSEC NODE_ALLOC + NO_TIMESTAMP + +spi (ipsec) dst_min dst_max @@ -299,8 +354,10 @@ src_mac clear_counters -dst6 src6 +dst6 +dst6_max +dst6_min flows flowlen @@ -308,6 +365,17 @@ flowlen rate ratep +xmit_mode <start_xmit|netif_receive> + +vlan_cfi +vlan_id +vlan_p + +svlan_cfi +svlan_id +svlan_p + + References: ftp://robur.slu.se/pub/Linux/net-development/pktgen-testing/ ftp://robur.slu.se/pub/Linux/net-development/pktgen-testing/examples/ diff --git a/Documentation/networking/switchdev.txt b/Documentation/networking/switchdev.txt index f981a9295a39..c5d7ade10ff2 100644 --- a/Documentation/networking/switchdev.txt +++ b/Documentation/networking/switchdev.txt @@ -1,59 +1,360 @@ -Switch (and switch-ish) device drivers HOWTO -=========================== - -Please note that the word "switch" is here used in very generic meaning. -This include devices supporting L2/L3 but also various flow offloading chips, -including switches embedded into SR-IOV NICs. - -Lets describe a topology a bit. Imagine the following example: - - +----------------------------+ +---------------+ - | SOME switch chip | | CPU | - +----------------------------+ +---------------+ - port1 port2 port3 port4 MNGMNT | PCI-E | - | | | | | +---------------+ - PHY PHY | | | | NIC0 NIC1 - | | | | | | - | | +- PCI-E -+ | | - | +------- MII -------+ | - +------------- MII ------------+ - -In this example, there are two independent lines between the switch silicon -and CPU. NIC0 and NIC1 drivers are not aware of a switch presence. They are -separate from the switch driver. SOME switch chip is by managed by a driver -via PCI-E device MNGMNT. Note that MNGMNT device, NIC0 and NIC1 may be -connected to some other type of bus. - -Now, for the previous example show the representation in kernel: - - +----------------------------+ +---------------+ - | SOME switch chip | | CPU | - +----------------------------+ +---------------+ - sw0p0 sw0p1 sw0p2 sw0p3 MNGMNT | PCI-E | - | | | | | +---------------+ - PHY PHY | | | | eth0 eth1 - | | | | | | - | | +- PCI-E -+ | | - | +------- MII -------+ | - +------------- MII ------------+ - -Lets call the example switch driver for SOME switch chip "SOMEswitch". This -driver takes care of PCI-E device MNGMNT. There is a netdevice instance sw0pX -created for each port of a switch. These netdevices are instances -of "SOMEswitch" driver. sw0pX netdevices serve as a "representation" -of the switch chip. eth0 and eth1 are instances of some other existing driver. - -The only difference of the switch-port netdevice from the ordinary netdevice -is that is implements couple more NDOs: - - ndo_switch_parent_id_get - This returns the same ID for two port netdevices - of the same physical switch chip. This is - mandatory to be implemented by all switch drivers - and serves the caller for recognition of a port - netdevice. - ndo_switch_parent_* - Functions that serve for a manipulation of the switch - chip itself (it can be though of as a "parent" of the - port, therefore the name). They are not port-specific. - Caller might use arbitrary port netdevice of the same - switch and it will make no difference. - ndo_switch_port_* - Functions that serve for a port-specific manipulation. +Ethernet switch device driver model (switchdev) +=============================================== +Copyright (c) 2014 Jiri Pirko <jiri@resnulli.us> +Copyright (c) 2014-2015 Scott Feldman <sfeldma@gmail.com> + + +The Ethernet switch device driver model (switchdev) is an in-kernel driver +model for switch devices which offload the forwarding (data) plane from the +kernel. + +Figure 1 is a block diagram showing the components of the switchdev model for +an example setup using a data-center-class switch ASIC chip. Other setups +with SR-IOV or soft switches, such as OVS, are possible. + + + User-space tools + + user space | + +-------------------------------------------------------------------+ + kernel | Netlink + | + +--------------+-------------------------------+ + | Network stack | + | (Linux) | + | | + +----------------------------------------------+ + + sw1p2 sw1p4 sw1p6 + sw1p1 + sw1p3 + sw1p5 + eth1 + + | + | + | + + | | | | | | | + +--+----+----+----+-+--+----+---+ +-----+-----+ + | Switch driver | | mgmt | + | (this document) | | driver | + | | | | + +--------------+----------------+ +-----------+ + | + kernel | HW bus (eg PCI) + +-------------------------------------------------------------------+ + hardware | + +--------------+---+------------+ + | Switch device (sw1) | + | +----+ +--------+ + | | v offloaded data path | mgmt port + | | | | + +--|----|----+----+----+----+---+ + | | | | | | + + + + + + + + p1 p2 p3 p4 p5 p6 + + front-panel ports + + + Fig 1. + + +Include Files +------------- + +#include <linux/netdevice.h> +#include <net/switchdev.h> + + +Configuration +------------- + +Use "depends NET_SWITCHDEV" in driver's Kconfig to ensure switchdev model +support is built for driver. + + +Switch Ports +------------ + +On switchdev driver initialization, the driver will allocate and register a +struct net_device (using register_netdev()) for each enumerated physical switch +port, called the port netdev. A port netdev is the software representation of +the physical port and provides a conduit for control traffic to/from the +controller (the kernel) and the network, as well as an anchor point for higher +level constructs such as bridges, bonds, VLANs, tunnels, and L3 routers. Using +standard netdev tools (iproute2, ethtool, etc), the port netdev can also +provide to the user access to the physical properties of the switch port such +as PHY link state and I/O statistics. + +There is (currently) no higher-level kernel object for the switch beyond the +port netdevs. All of the switchdev driver ops are netdev ops or switchdev ops. + +A switch management port is outside the scope of the switchdev driver model. +Typically, the management port is not participating in offloaded data plane and +is loaded with a different driver, such as a NIC driver, on the management port +device. + +Port Netdev Naming +^^^^^^^^^^^^^^^^^^ + +Udev rules should be used for port netdev naming, using some unique attribute +of the port as a key, for example the port MAC address or the port PHYS name. +Hard-coding of kernel netdev names within the driver is discouraged; let the +kernel pick the default netdev name, and let udev set the final name based on a +port attribute. + +Using port PHYS name (ndo_get_phys_port_name) for the key is particularly +useful for dynamically-named ports where the device names its ports based on +external configuration. For example, if a physical 40G port is split logically +into 4 10G ports, resulting in 4 port netdevs, the device can give a unique +name for each port using port PHYS name. The udev rule would be: + +SUBSYSTEM=="net", ACTION=="add", DRIVER="<driver>", ATTR{phys_port_name}!="", \ + NAME="$attr{phys_port_name}" + +Suggested naming convention is "swXpYsZ", where X is the switch name or ID, Y +is the port name or ID, and Z is the sub-port name or ID. For example, sw1p1s0 +would be sub-port 0 on port 1 on switch 1. + +Switch ID +^^^^^^^^^ + +The switchdev driver must implement the switchdev op switchdev_port_attr_get +for SWITCHDEV_ATTR_PORT_PARENT_ID for each port netdev, returning the same +physical ID for each port of a switch. The ID must be unique between switches +on the same system. The ID does not need to be unique between switches on +different systems. + +The switch ID is used to locate ports on a switch and to know if aggregated +ports belong to the same switch. + +Port Features +^^^^^^^^^^^^^ + +NETIF_F_NETNS_LOCAL + +If the switchdev driver (and device) only supports offloading of the default +network namespace (netns), the driver should set this feature flag to prevent +the port netdev from being moved out of the default netns. A netns-aware +driver/device would not set this flag and be responsible for partitioning +hardware to preserve netns containment. This means hardware cannot forward +traffic from a port in one namespace to another port in another namespace. + +Port Topology +^^^^^^^^^^^^^ + +The port netdevs representing the physical switch ports can be organized into +higher-level switching constructs. The default construct is a standalone +router port, used to offload L3 forwarding. Two or more ports can be bonded +together to form a LAG. Two or more ports (or LAGs) can be bridged to bridge +L2 networks. VLANs can be applied to sub-divide L2 networks. L2-over-L3 +tunnels can be built on ports. These constructs are built using standard Linux +tools such as the bridge driver, the bonding/team drivers, and netlink-based +tools such as iproute2. + +The switchdev driver can know a particular port's position in the topology by +monitoring NETDEV_CHANGEUPPER notifications. For example, a port moved into a +bond will see it's upper master change. If that bond is moved into a bridge, +the bond's upper master will change. And so on. The driver will track such +movements to know what position a port is in in the overall topology by +registering for netdevice events and acting on NETDEV_CHANGEUPPER. + +L2 Forwarding Offload +--------------------- + +The idea is to offload the L2 data forwarding (switching) path from the kernel +to the switchdev device by mirroring bridge FDB entries down to the device. An +FDB entry is the {port, MAC, VLAN} tuple forwarding destination. + +To offloading L2 bridging, the switchdev driver/device should support: + + - Static FDB entries installed on a bridge port + - Notification of learned/forgotten src mac/vlans from device + - STP state changes on the port + - VLAN flooding of multicast/broadcast and unknown unicast packets + +Static FDB Entries +^^^^^^^^^^^^^^^^^^ + +The switchdev driver should implement ndo_fdb_add, ndo_fdb_del and ndo_fdb_dump +to support static FDB entries installed to the device. Static bridge FDB +entries are installed, for example, using iproute2 bridge cmd: + + bridge fdb add ADDR dev DEV [vlan VID] [self] + +The driver should use the helper switchdev_port_fdb_xxx ops for ndo_fdb_xxx +ops, and handle add/delete/dump of SWITCHDEV_OBJ_PORT_FDB object using +switchdev_port_obj_xxx ops. + +XXX: what should be done if offloading this rule to hardware fails (for +example, due to full capacity in hardware tables) ? + +Note: by default, the bridge does not filter on VLAN and only bridges untagged +traffic. To enable VLAN support, turn on VLAN filtering: + + echo 1 >/sys/class/net/<bridge>/bridge/vlan_filtering + +Notification of Learned/Forgotten Source MAC/VLANs +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +The switch device will learn/forget source MAC address/VLAN on ingress packets +and notify the switch driver of the mac/vlan/port tuples. The switch driver, +in turn, will notify the bridge driver using the switchdev notifier call: + + err = call_switchdev_notifiers(val, dev, info); + +Where val is SWITCHDEV_FDB_ADD when learning and SWITCHDEV_FDB_DEL when +forgetting, and info points to a struct switchdev_notifier_fdb_info. On +SWITCHDEV_FDB_ADD, the bridge driver will install the FDB entry into the +bridge's FDB and mark the entry as NTF_EXT_LEARNED. The iproute2 bridge +command will label these entries "offload": + + $ bridge fdb + 52:54:00:12:35:01 dev sw1p1 master br0 permanent + 00:02:00:00:02:00 dev sw1p1 master br0 offload + 00:02:00:00:02:00 dev sw1p1 self + 52:54:00:12:35:02 dev sw1p2 master br0 permanent + 00:02:00:00:03:00 dev sw1p2 master br0 offload + 00:02:00:00:03:00 dev sw1p2 self + 33:33:00:00:00:01 dev eth0 self permanent + 01:00:5e:00:00:01 dev eth0 self permanent + 33:33:ff:00:00:00 dev eth0 self permanent + 01:80:c2:00:00:0e dev eth0 self permanent + 33:33:00:00:00:01 dev br0 self permanent + 01:00:5e:00:00:01 dev br0 self permanent + 33:33:ff:12:35:01 dev br0 self permanent + +Learning on the port should be disabled on the bridge using the bridge command: + + bridge link set dev DEV learning off + +Learning on the device port should be enabled, as well as learning_sync: + + bridge link set dev DEV learning on self + bridge link set dev DEV learning_sync on self + +Learning_sync attribute enables syncing of the learned/forgotton FDB entry to +the bridge's FDB. It's possible, but not optimal, to enable learning on the +device port and on the bridge port, and disable learning_sync. + +To support learning and learning_sync port attributes, the driver implements +switchdev op switchdev_port_attr_get/set for SWITCHDEV_ATTR_PORT_BRIDGE_FLAGS. +The driver should initialize the attributes to the hardware defaults. + +FDB Ageing +^^^^^^^^^^ + +There are two FDB ageing models supported: 1) ageing by the device, and 2) +ageing by the kernel. Ageing by the device is preferred if many FDB entries +are supported. The driver calls call_switchdev_notifiers(SWITCHDEV_FDB_DEL, +...) to age out the FDB entry. In this model, ageing by the kernel should be +turned off. XXX: how to turn off ageing in kernel on a per-port basis or +otherwise prevent the kernel from ageing out the FDB entry? + +In the kernel ageing model, the standard bridge ageing mechanism is used to age +out stale FDB entries. To keep an FDB entry "alive", the driver should refresh +the FDB entry by calling call_switchdev_notifiers(SWITCHDEV_FDB_ADD, ...). The +notification will reset the FDB entry's last-used time to now. The driver +should rate limit refresh notifications, for example, no more than once a +second. If the FDB entry expires, fdb_delete is called to remove entry from +the device. + +STP State Change on Port +^^^^^^^^^^^^^^^^^^^^^^^^ + +Internally or with a third-party STP protocol implementation (e.g. mstpd), the +bridge driver maintains the STP state for ports, and will notify the switch +driver of STP state change on a port using the switchdev op +switchdev_attr_port_set for SWITCHDEV_ATTR_PORT_STP_UPDATE. + +State is one of BR_STATE_*. The switch driver can use STP state updates to +update ingress packet filter list for the port. For example, if port is +DISABLED, no packets should pass, but if port moves to BLOCKED, then STP BPDUs +and other IEEE 01:80:c2:xx:xx:xx link-local multicast packets can pass. + +Note that STP BDPUs are untagged and STP state applies to all VLANs on the port +so packet filters should be applied consistently across untagged and tagged +VLANs on the port. + +Flooding L2 domain +^^^^^^^^^^^^^^^^^^ + +For a given L2 VLAN domain, the switch device should flood multicast/broadcast +and unknown unicast packets to all ports in domain, if allowed by port's +current STP state. The switch driver, knowing which ports are within which +vlan L2 domain, can program the switch device for flooding. The packet should +also be sent to the port netdev for processing by the bridge driver. The +bridge should not reflood the packet to the same ports the device flooded. +XXX: the mechanism to avoid duplicate flood packets is being discuseed. + +It is possible for the switch device to not handle flooding and push the +packets up to the bridge driver for flooding. This is not ideal as the number +of ports scale in the L2 domain as the device is much more efficient at +flooding packets that software. + +IGMP Snooping +^^^^^^^^^^^^^ + +XXX: complete this section + + +L3 Routing Offload +------------------ + +Offloading L3 routing requires that device be programmed with FIB entries from +the kernel, with the device doing the FIB lookup and forwarding. The device +does a longest prefix match (LPM) on FIB entries matching route prefix and +forwards the packet to the matching FIB entry's nexthop(s) egress ports. + +To program the device, the driver implements support for +SWITCHDEV_OBJ_IPV[4|6]_FIB object using switchdev_port_obj_xxx ops. +switchdev_port_obj_add is used for both adding a new FIB entry to the device, +or modifying an existing entry on the device. + +XXX: Currently, only SWITCHDEV_OBJ_IPV4_FIB objects are supported. + +SWITCHDEV_OBJ_IPV4_FIB object passes: + + struct switchdev_obj_ipv4_fib { /* IPV4_FIB */ + u32 dst; + int dst_len; + struct fib_info *fi; + u8 tos; + u8 type; + u32 nlflags; + u32 tb_id; + } ipv4_fib; + +to add/modify/delete IPv4 dst/dest_len prefix on table tb_id. The *fi +structure holds details on the route and route's nexthops. *dev is one of the +port netdevs mentioned in the routes next hop list. If the output port netdevs +referenced in the route's nexthop list don't all have the same switch ID, the +driver is not called to add/modify/delete the FIB entry. + +Routes offloaded to the device are labeled with "offload" in the ip route +listing: + + $ ip route show + default via 192.168.0.2 dev eth0 + 11.0.0.0/30 dev sw1p1 proto kernel scope link src 11.0.0.2 offload + 11.0.0.4/30 via 11.0.0.1 dev sw1p1 proto zebra metric 20 offload + 11.0.0.8/30 dev sw1p2 proto kernel scope link src 11.0.0.10 offload + 11.0.0.12/30 via 11.0.0.9 dev sw1p2 proto zebra metric 20 offload + 12.0.0.2 proto zebra metric 30 offload + nexthop via 11.0.0.1 dev sw1p1 weight 1 + nexthop via 11.0.0.9 dev sw1p2 weight 1 + 12.0.0.3 via 11.0.0.1 dev sw1p1 proto zebra metric 20 offload + 12.0.0.4 via 11.0.0.9 dev sw1p2 proto zebra metric 20 offload + 192.168.0.0/24 dev eth0 proto kernel scope link src 192.168.0.15 + +XXX: add/mod/del IPv6 FIB API + +Nexthop Resolution +^^^^^^^^^^^^^^^^^^ + +The FIB entry's nexthop list contains the nexthop tuple (gateway, dev), but for +the switch device to forward the packet with the correct dst mac address, the +nexthop gateways must be resolved to the neighbor's mac address. Neighbor mac +address discovery comes via the ARP (or ND) process and is available via the +arp_tbl neighbor table. To resolve the routes nexthop gateways, the driver +should trigger the kernel's neighbor resolution process. See the rocker +driver's rocker_port_ipv4_resolve() for an example. + +The driver can monitor for updates to arp_tbl using the netevent notifier +NETEVENT_NEIGH_UPDATE. The device can be programmed with resolved nexthops +for the routes as arp_tbl updates. diff --git a/Documentation/networking/tc-actions-env-rules.txt b/Documentation/networking/tc-actions-env-rules.txt index 70d6cf608251..f37814693ad3 100644 --- a/Documentation/networking/tc-actions-env-rules.txt +++ b/Documentation/networking/tc-actions-env-rules.txt @@ -8,14 +8,8 @@ For example if your action queues a packet to be processed later, or intentionally branches by redirecting a packet, then you need to clone the packet. -There are certain fields in the skb tc_verd that need to be reset so we -avoid loops, etc. A few are generic enough that skb_act_clone() -resets them for you, so invoke skb_act_clone() rather than skb_clone(). - 2) If you munge any packet thou shalt call pskb_expand_head in the case someone else is referencing the skb. After that you "own" the skb. -You must also tell us if it is ok to munge the packet (TC_OK2MUNGE), -this way any action downstream can stomp on the packet. 3) Dropping packets you don't own is a no-no. You simply return TC_ACT_SHOT to the caller and they will drop it. |