diff options
author | Linus Torvalds <torvalds@linux-foundation.org> | 2018-12-28 00:04:52 +0300 |
---|---|---|
committer | Linus Torvalds <torvalds@linux-foundation.org> | 2018-12-28 00:04:52 +0300 |
commit | e0c38a4d1f196a4b17d2eba36afff8f656a4f1de (patch) | |
tree | b26a69fabef0160adb127416a9744217700feeb7 /Documentation | |
parent | 7f9f852c75e7d776b078813586c76a2bc7dca993 (diff) | |
parent | 90cadbbf341dd5b2df991c33a6bd6341f3a53788 (diff) | |
download | linux-e0c38a4d1f196a4b17d2eba36afff8f656a4f1de.tar.xz |
Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
Pull networking updates from David Miller:
1) New ipset extensions for matching on destination MAC addresses, from
Stefano Brivio.
2) Add ipv4 ttl and tos, plus ipv6 flow label and hop limit offloads to
nfp driver. From Stefano Brivio.
3) Implement GRO for plain UDP sockets, from Paolo Abeni.
4) Lots of work from Michał Mirosław to eliminate the VLAN_TAG_PRESENT
bit so that we could support the entire vlan_tci value.
5) Rework the IPSEC policy lookups to better optimize more usecases,
from Florian Westphal.
6) Infrastructure changes eliminating direct manipulation of SKB lists
wherever possible, and to always use the appropriate SKB list
helpers. This work is still ongoing...
7) Lots of PHY driver and state machine improvements and
simplifications, from Heiner Kallweit.
8) Various TSO deferral refinements, from Eric Dumazet.
9) Add ntuple filter support to aquantia driver, from Dmitry Bogdanov.
10) Batch dropping of XDP packets in tuntap, from Jason Wang.
11) Lots of cleanups and improvements to the r8169 driver from Heiner
Kallweit, including support for ->xmit_more. This driver has been
getting some much needed love since he started working on it.
12) Lots of new forwarding selftests from Petr Machata.
13) Enable VXLAN learning in mlxsw driver, from Ido Schimmel.
14) Packed ring support for virtio, from Tiwei Bie.
15) Add new Aquantia AQtion USB driver, from Dmitry Bezrukov.
16) Add XDP support to dpaa2-eth driver, from Ioana Ciocoi Radulescu.
17) Implement coalescing on TCP backlog queue, from Eric Dumazet.
18) Implement carrier change in tun driver, from Nicolas Dichtel.
19) Support msg_zerocopy in UDP, from Willem de Bruijn.
20) Significantly improve garbage collection of neighbor objects when
the table has many PERMANENT entries, from David Ahern.
21) Remove egdev usage from nfp and mlx5, and remove the facility
completely from the tree as it no longer has any users. From Oz
Shlomo and others.
22) Add a NETDEV_PRE_CHANGEADDR so that drivers can veto the change and
therefore abort the operation before the commit phase (which is the
NETDEV_CHANGEADDR event). From Petr Machata.
23) Add indirect call wrappers to avoid retpoline overhead, and use them
in the GRO code paths. From Paolo Abeni.
24) Add support for netlink FDB get operations, from Roopa Prabhu.
25) Support bloom filter in mlxsw driver, from Nir Dotan.
26) Add SKB extension infrastructure. This consolidates the handling of
the auxiliary SKB data used by IPSEC and bridge netfilter, and is
designed to support the needs to MPTCP which could be integrated in
the future.
27) Lots of XDP TX optimizations in mlx5 from Tariq Toukan.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1845 commits)
net: dccp: fix kernel crash on module load
drivers/net: appletalk/cops: remove redundant if statement and mask
bnx2x: Fix NULL pointer dereference in bnx2x_del_all_vlans() on some hw
net/net_namespace: Check the return value of register_pernet_subsys()
net/netlink_compat: Fix a missing check of nla_parse_nested
ieee802154: lowpan_header_create check must check daddr
net/mlx4_core: drop useless LIST_HEAD
mlxsw: spectrum: drop useless LIST_HEAD
net/mlx5e: drop useless LIST_HEAD
iptunnel: Set tun_flags in the iptunnel_metadata_reply from src
net/mlx5e: fix semicolon.cocci warnings
staging: octeon: fix build failure with XFRM enabled
net: Revert recent Spectre-v1 patches.
can: af_can: Fix Spectre v1 vulnerability
packet: validate address length if non-zero
nfc: af_nfc: Fix Spectre v1 vulnerability
phonet: af_phonet: Fix Spectre v1 vulnerability
net: core: Fix Spectre v1 vulnerability
net: minor cleanup in skb_ext_add()
net: drop the unused helper skb_ext_get()
...
Diffstat (limited to 'Documentation')
60 files changed, 1455 insertions, 44 deletions
diff --git a/Documentation/devicetree/bindings/misc/fsl,qoriq-mc.txt b/Documentation/devicetree/bindings/misc/fsl,qoriq-mc.txt index 01fdc33a41d0..bb7e896cb644 100644 --- a/Documentation/devicetree/bindings/misc/fsl,qoriq-mc.txt +++ b/Documentation/devicetree/bindings/misc/fsl,qoriq-mc.txt @@ -10,7 +10,7 @@ such as network interfaces, crypto accelerator instances, L2 switches, etc. For an overview of the DPAA2 architecture and fsl-mc bus see: -Documentation/networking/dpaa2/overview.rst +Documentation/networking/device_drivers/freescale/dpaa2/overview.rst As described in the above overview, all DPAA2 objects in a DPRC share the same hardware "isolation context" and a 10-bit value called an ICID diff --git a/Documentation/devicetree/bindings/net/broadcom-bluetooth.txt b/Documentation/devicetree/bindings/net/broadcom-bluetooth.txt index 4194ff7e6ee6..c26f4e11037c 100644 --- a/Documentation/devicetree/bindings/net/broadcom-bluetooth.txt +++ b/Documentation/devicetree/bindings/net/broadcom-bluetooth.txt @@ -10,6 +10,8 @@ device the slave device is attached to. Required properties: - compatible: should contain one of the following: + * "brcm,bcm20702a1" + * "brcm,bcm4330-bt" * "brcm,bcm43438-bt" Optional properties: @@ -18,8 +20,13 @@ Optional properties: - shutdown-gpios: GPIO specifier, used to enable the BT module - device-wakeup-gpios: GPIO specifier, used to wakeup the controller - host-wakeup-gpios: GPIO specifier, used to wakeup the host processor - - clocks: clock specifier if external clock provided to the controller - - clock-names: should be "extclk" + - clocks: 1 or 2 clocks as defined in clock-names below, in that order + - clock-names: names for clock inputs, matching the clocks given + - "extclk": deprecated, replaced by "txco" + - "txco": external reference clock (not a standalone crystal) + - "lpo": external low power 32.768 kHz clock + - vbat-supply: phandle to regulator supply for VBAT + - vddio-supply: phandle to regulator supply for VDDIO Example: diff --git a/Documentation/devicetree/bindings/net/can/fsl-flexcan.txt b/Documentation/devicetree/bindings/net/can/fsl-flexcan.txt index bfc0c433654f..bc77477c6878 100644 --- a/Documentation/devicetree/bindings/net/can/fsl-flexcan.txt +++ b/Documentation/devicetree/bindings/net/can/fsl-flexcan.txt @@ -24,6 +24,14 @@ Optional properties: if this property is present then controller is assumed to be big endian. +- fsl,stop-mode: register bits of stop mode control, the format is + <&gpr req_gpr req_bit ack_gpr ack_bit>. + gpr is the phandle to general purpose register node. + req_gpr is the gpr register offset of CAN stop request. + req_bit is the bit offset of CAN stop request. + ack_gpr is the gpr register offset of CAN stop acknowledge. + ack_bit is the bit offset of CAN stop acknowledge. + Example: can@1c000 { diff --git a/Documentation/devicetree/bindings/net/can/xilinx_can.txt b/Documentation/devicetree/bindings/net/can/xilinx_can.txt index 060e2d46bad9..100cc40b8510 100644 --- a/Documentation/devicetree/bindings/net/can/xilinx_can.txt +++ b/Documentation/devicetree/bindings/net/can/xilinx_can.txt @@ -6,6 +6,7 @@ Required properties: - "xlnx,zynq-can-1.0" for Zynq CAN controllers - "xlnx,axi-can-1.00.a" for Axi CAN controllers - "xlnx,canfd-1.0" for CAN FD controllers + - "xlnx,canfd-2.0" for CAN FD 2.0 controllers - reg : Physical base address and size of the controller registers map. - interrupts : Property with a value describing the interrupt diff --git a/Documentation/devicetree/bindings/net/dsa/ksz.txt b/Documentation/devicetree/bindings/net/dsa/ksz.txt index ac145b885e95..0f407fb371ce 100644 --- a/Documentation/devicetree/bindings/net/dsa/ksz.txt +++ b/Documentation/devicetree/bindings/net/dsa/ksz.txt @@ -8,6 +8,10 @@ Required properties: - "microchip,ksz9477" - "microchip,ksz9897" +Optional properties: + +- reset-gpios : Should be a gpio specifier for a reset line + See Documentation/devicetree/bindings/net/dsa/dsa.txt for a list of additional required and optional properties. diff --git a/Documentation/devicetree/bindings/net/icplus-ip101ag.txt b/Documentation/devicetree/bindings/net/icplus-ip101ag.txt new file mode 100644 index 000000000000..a784592bbb15 --- /dev/null +++ b/Documentation/devicetree/bindings/net/icplus-ip101ag.txt @@ -0,0 +1,19 @@ +IC Plus Corp. IP101A / IP101G Ethernet PHYs + +There are different models of the IP101G Ethernet PHY: +- IP101GR (32-pin QFN package) +- IP101G (die only, no package) +- IP101GA (48-pin LQFP package) + +There are different models of the IP101A Ethernet PHY (which is the +predecessor of the IP101G): +- IP101A (48-pin LQFP package) +- IP101AH (48-pin LQFP package) + +Optional properties for the IP101GR (32-pin QFN package): + +- icplus,select-rx-error: + pin 21 ("RXER/INTR_32") will output the receive error status. + interrupts are not routed outside the PHY in this mode. +- icplus,select-interrupt: + pin 21 ("RXER/INTR_32") will output the interrupt signal. diff --git a/Documentation/devicetree/bindings/net/mediatek-dwmac.txt b/Documentation/devicetree/bindings/net/mediatek-dwmac.txt new file mode 100644 index 000000000000..8a08621a5b54 --- /dev/null +++ b/Documentation/devicetree/bindings/net/mediatek-dwmac.txt @@ -0,0 +1,78 @@ +MediaTek DWMAC glue layer controller + +This file documents platform glue layer for stmmac. +Please see stmmac.txt for the other unchanged properties. + +The device node has following properties. + +Required properties: +- compatible: Should be "mediatek,mt2712-gmac" for MT2712 SoC +- reg: Address and length of the register set for the device +- interrupts: Should contain the MAC interrupts +- interrupt-names: Should contain a list of interrupt names corresponding to + the interrupts in the interrupts property, if available. + Should be "macirq" for the main MAC IRQ +- clocks: Must contain a phandle for each entry in clock-names. +- clock-names: The name of the clock listed in the clocks property. These are + "axi", "apb", "mac_main", "ptp_ref" for MT2712 SoC +- mac-address: See ethernet.txt in the same directory +- phy-mode: See ethernet.txt in the same directory +- mediatek,pericfg: A phandle to the syscon node that control ethernet + interface and timing delay. + +Optional properties: +- mediatek,tx-delay-ps: TX clock delay macro value. Default is 0. + It should be defined for RGMII/MII interface. +- mediatek,rx-delay-ps: RX clock delay macro value. Default is 0. + It should be defined for RGMII/MII/RMII interface. +Both delay properties need to be a multiple of 170 for RGMII interface, +or will round down. Range 0~31*170. +Both delay properties need to be a multiple of 550 for MII/RMII interface, +or will round down. Range 0~31*550. + +- mediatek,rmii-rxc: boolean property, if present indicates that the RMII + reference clock, which is from external PHYs, is connected to RXC pin + on MT2712 SoC. + Otherwise, is connected to TXC pin. +- mediatek,txc-inverse: boolean property, if present indicates that + 1. tx clock will be inversed in MII/RGMII case, + 2. tx clock inside MAC will be inversed relative to reference clock + which is from external PHYs in RMII case, and it rarely happen. +- mediatek,rxc-inverse: boolean property, if present indicates that + 1. rx clock will be inversed in MII/RGMII case. + 2. reference clock will be inversed when arrived at MAC in RMII case. +- assigned-clocks: mac_main and ptp_ref clocks +- assigned-clock-parents: parent clocks of the assigned clocks + +Example: + eth: ethernet@1101c000 { + compatible = "mediatek,mt2712-gmac"; + reg = <0 0x1101c000 0 0x1300>; + interrupts = <GIC_SPI 237 IRQ_TYPE_LEVEL_LOW>; + interrupt-names = "macirq"; + phy-mode ="rgmii"; + mac-address = [00 55 7b b5 7d f7]; + clock-names = "axi", + "apb", + "mac_main", + "ptp_ref", + "ptp_top"; + clocks = <&pericfg CLK_PERI_GMAC>, + <&pericfg CLK_PERI_GMAC_PCLK>, + <&topckgen CLK_TOP_ETHER_125M_SEL>, + <&topckgen CLK_TOP_ETHER_50M_SEL>; + assigned-clocks = <&topckgen CLK_TOP_ETHER_125M_SEL>, + <&topckgen CLK_TOP_ETHER_50M_SEL>; + assigned-clock-parents = <&topckgen CLK_TOP_ETHERPLL_125M>, + <&topckgen CLK_TOP_APLL1_D3>; + mediatek,pericfg = <&pericfg>; + mediatek,tx-delay-ps = <1530>; + mediatek,rx-delay-ps = <1530>; + mediatek,rmii-rxc; + mediatek,txc-inverse; + mediatek,rxc-inverse; + snps,txpbl = <32>; + snps,rxpbl = <32>; + snps,reset-gpio = <&pio 87 GPIO_ACTIVE_LOW>; + snps,reset-active-low; + }; diff --git a/Documentation/devicetree/bindings/net/renesas,ravb.txt b/Documentation/devicetree/bindings/net/renesas,ravb.txt index 3530256a879c..7ad36213093e 100644 --- a/Documentation/devicetree/bindings/net/renesas,ravb.txt +++ b/Documentation/devicetree/bindings/net/renesas,ravb.txt @@ -18,6 +18,7 @@ Required properties: R-Car Gen2 and RZ/G1 devices. - "renesas,etheravb-r8a774a1" for the R8A774A1 SoC. + - "renesas,etheravb-r8a774c0" for the R8A774C0 SoC. - "renesas,etheravb-r8a7795" for the R8A7795 SoC. - "renesas,etheravb-r8a7796" for the R8A7796 SoC. - "renesas,etheravb-r8a77965" for the R8A77965 SoC. diff --git a/Documentation/devicetree/bindings/net/wireless/qcom,ath10k.txt b/Documentation/devicetree/bindings/net/wireless/qcom,ath10k.txt index 2196d1ab3c8c..ae661e65354e 100644 --- a/Documentation/devicetree/bindings/net/wireless/qcom,ath10k.txt +++ b/Documentation/devicetree/bindings/net/wireless/qcom,ath10k.txt @@ -21,10 +21,22 @@ can be provided per device. SNOC based devices (i.e. wcn3990) uses compatible string "qcom,wcn3990-wifi". -Optional properties: - reg: Address and length of the register set for the device. - reg-names: Must include the list of following reg names, "membase" +- interrupts: reference to the list of 17 interrupt numbers for "qcom,ipq4019-wifi" + compatible target. + reference to the list of 12 interrupt numbers for "qcom,wcn3990-wifi" + compatible target. + Must contain interrupt-names property per entry for + "qcom,ath10k", "qcom,ipq4019-wifi" compatible targets. + +- interrupt-names: Must include the entries for MSI interrupt + names ("msi0" to "msi15") and legacy interrupt + name ("legacy") for "qcom,ath10k", "qcom,ipq4019-wifi" + compatible targets. + +Optional properties: - resets: Must contain an entry for each entry in reset-names. See ../reset/reseti.txt for details. - reset-names: Must include the list of following reset names, @@ -37,12 +49,9 @@ Optional properties: - clocks: List of clock specifiers, must contain an entry for each required entry in clock-names. - clock-names: Should contain the clock names "wifi_wcss_cmd", "wifi_wcss_ref", - "wifi_wcss_rtc". -- interrupts: List of interrupt lines. Must contain an entry - for each entry in the interrupt-names property. -- interrupt-names: Must include the entries for MSI interrupt - names ("msi0" to "msi15") and legacy interrupt - name ("legacy"), + "wifi_wcss_rtc" for "qcom,ipq4019-wifi" compatible target and + "cxo_ref_clk_pin" for "qcom,wcn3990-wifi" + compatible target. - qcom,msi_addr: MSI interrupt address. - qcom,msi_base: Base value to add before writing MSI data into MSI address register. @@ -55,14 +64,25 @@ Optional properties: - qcom,ath10k-pre-calibration-data : pre calibration data as an array, the length can vary between hw versions. - <supply-name>-supply: handle to the regulator device tree node - optional "supply-name" is "vdd-0.8-cx-mx". + optional "supply-name" are "vdd-0.8-cx-mx", + "vdd-1.8-xo", "vdd-1.3-rfa" and "vdd-3.3-ch0". - memory-region: Usage: optional Value type: <phandle> Definition: reference to the reserved-memory for the msa region used by the wifi firmware running in Q6. +- iommus: + Usage: optional + Value type: <prop-encoded-array> + Definition: A list of phandle and IOMMU specifier pairs. +- ext-fem-name: + Usage: Optional + Value type: string + Definition: Name of external front end module used. Some valid FEM names + for example: "microsemi-lx5586", "sky85703-11" + and "sky85803" etc. -Example (to supply the calibration data alone): +Example (to supply PCI based wifi block details): In this example, the node is defined as child node of the PCI controller. @@ -74,10 +94,10 @@ pci { #address-cells = <3>; device_type = "pci"; - ath10k@0,0 { + wifi@0,0 { reg = <0 0 0 0 0>; - device_type = "pci"; qcom,ath10k-calibration-data = [ 01 02 03 ... ]; + ext-fem-name = "microsemi-lx5586"; }; }; }; @@ -138,21 +158,25 @@ wifi@18000000 { compatible = "qcom,wcn3990-wifi"; reg = <0x18800000 0x800000>; reg-names = "membase"; - clocks = <&clock_gcc clk_aggre2_noc_clk>; - clock-names = "smmu_aggre2_noc_clk" + clocks = <&clock_gcc clk_rf_clk2_pin>; + clock-names = "cxo_ref_clk_pin"; interrupts = - <0 130 0 /* CE0 */ >, - <0 131 0 /* CE1 */ >, - <0 132 0 /* CE2 */ >, - <0 133 0 /* CE3 */ >, - <0 134 0 /* CE4 */ >, - <0 135 0 /* CE5 */ >, - <0 136 0 /* CE6 */ >, - <0 137 0 /* CE7 */ >, - <0 138 0 /* CE8 */ >, - <0 139 0 /* CE9 */ >, - <0 140 0 /* CE10 */ >, - <0 141 0 /* CE11 */ >; + <GIC_SPI 414 IRQ_TYPE_LEVEL_HIGH>, + <GIC_SPI 415 IRQ_TYPE_LEVEL_HIGH>, + <GIC_SPI 416 IRQ_TYPE_LEVEL_HIGH>, + <GIC_SPI 417 IRQ_TYPE_LEVEL_HIGH>, + <GIC_SPI 418 IRQ_TYPE_LEVEL_HIGH>, + <GIC_SPI 419 IRQ_TYPE_LEVEL_HIGH>, + <GIC_SPI 420 IRQ_TYPE_LEVEL_HIGH>, + <GIC_SPI 421 IRQ_TYPE_LEVEL_HIGH>, + <GIC_SPI 422 IRQ_TYPE_LEVEL_HIGH>, + <GIC_SPI 423 IRQ_TYPE_LEVEL_HIGH>, + <GIC_SPI 424 IRQ_TYPE_LEVEL_HIGH>, + <GIC_SPI 425 IRQ_TYPE_LEVEL_HIGH>; vdd-0.8-cx-mx-supply = <&pm8998_l5>; + vdd-1.8-xo-supply = <&vreg_l7a_1p8>; + vdd-1.3-rfa-supply = <&vreg_l17a_1p3>; + vdd-3.3-ch0-supply = <&vreg_l25a_3p3>; memory-region = <&wifi_msa_mem>; + iommus = <&apps_smmu 0x0040 0x1>; }; diff --git a/Documentation/devicetree/bindings/vendor-prefixes.txt b/Documentation/devicetree/bindings/vendor-prefixes.txt index a2f4451cf3f9..1c9de314197e 100644 --- a/Documentation/devicetree/bindings/vendor-prefixes.txt +++ b/Documentation/devicetree/bindings/vendor-prefixes.txt @@ -171,6 +171,7 @@ holtek Holtek Semiconductor, Inc. hwacom HwaCom Systems Inc. i2se I2SE GmbH ibm International Business Machines (IBM) +icplus IC Plus Corp. idt Integrated Device Technologies, Inc. ifi Ingenieurburo Fur Ic-Technologie (I/F/I) ilitek ILI Technology Corporation (ILITEK) diff --git a/Documentation/networking/3c509.txt b/Documentation/networking/device_drivers/3com/3c509.txt index fbf722e15ac3..fbf722e15ac3 100644 --- a/Documentation/networking/3c509.txt +++ b/Documentation/networking/device_drivers/3com/3c509.txt diff --git a/Documentation/networking/vortex.txt b/Documentation/networking/device_drivers/3com/vortex.txt index ad3dead052a4..587f3fcfbcae 100644 --- a/Documentation/networking/vortex.txt +++ b/Documentation/networking/device_drivers/3com/vortex.txt @@ -1,4 +1,4 @@ -Documentation/networking/vortex.txt +Documentation/networking/device_drivers/3com/vortex.txt Andrew Morton 30 April 2000 diff --git a/Documentation/networking/ena.txt b/Documentation/networking/device_drivers/amazon/ena.txt index 2b4b6f57e549..2b4b6f57e549 100644 --- a/Documentation/networking/ena.txt +++ b/Documentation/networking/device_drivers/amazon/ena.txt diff --git a/Documentation/networking/cxgb.txt b/Documentation/networking/device_drivers/chelsio/cxgb.txt index 20a887615c4a..20a887615c4a 100644 --- a/Documentation/networking/cxgb.txt +++ b/Documentation/networking/device_drivers/chelsio/cxgb.txt diff --git a/Documentation/networking/cs89x0.txt b/Documentation/networking/device_drivers/cirrus/cs89x0.txt index 0e190180eec8..0e190180eec8 100644 --- a/Documentation/networking/cs89x0.txt +++ b/Documentation/networking/device_drivers/cirrus/cs89x0.txt diff --git a/Documentation/networking/dm9000.txt b/Documentation/networking/device_drivers/davicom/dm9000.txt index 5552e2e575c5..5552e2e575c5 100644 --- a/Documentation/networking/dm9000.txt +++ b/Documentation/networking/device_drivers/davicom/dm9000.txt diff --git a/Documentation/networking/de4x5.txt b/Documentation/networking/device_drivers/dec/de4x5.txt index c8e4ca9b2c3e..c8e4ca9b2c3e 100644 --- a/Documentation/networking/de4x5.txt +++ b/Documentation/networking/device_drivers/dec/de4x5.txt diff --git a/Documentation/networking/dmfe.txt b/Documentation/networking/device_drivers/dec/dmfe.txt index 25320bf19c86..25320bf19c86 100644 --- a/Documentation/networking/dmfe.txt +++ b/Documentation/networking/device_drivers/dec/dmfe.txt diff --git a/Documentation/networking/dl2k.txt b/Documentation/networking/device_drivers/dlink/dl2k.txt index cba74f7a3abc..cba74f7a3abc 100644 --- a/Documentation/networking/dl2k.txt +++ b/Documentation/networking/device_drivers/dlink/dl2k.txt diff --git a/Documentation/networking/dpaa.txt b/Documentation/networking/device_drivers/freescale/dpaa.txt index f88194f71c54..f88194f71c54 100644 --- a/Documentation/networking/dpaa.txt +++ b/Documentation/networking/device_drivers/freescale/dpaa.txt diff --git a/Documentation/networking/dpaa2/dpio-driver.rst b/Documentation/networking/device_drivers/freescale/dpaa2/dpio-driver.rst index 13588104161b..a188466b6698 100644 --- a/Documentation/networking/dpaa2/dpio-driver.rst +++ b/Documentation/networking/device_drivers/freescale/dpaa2/dpio-driver.rst @@ -19,8 +19,8 @@ pool management for network interfaces. This document provides an overview the Linux DPIO driver, its subcomponents, and its APIs. -See Documentation/networking/dpaa2/overview.rst for a general overview of DPAA2 -and the general DPAA2 driver architecture in Linux. +See Documentation/networking/device_drivers/freescale/dpaa2/overview.rst for +a general overview of DPAA2 and the general DPAA2 driver architecture in Linux. Driver Overview --------------- diff --git a/Documentation/networking/dpaa2/ethernet-driver.rst b/Documentation/networking/device_drivers/freescale/dpaa2/ethernet-driver.rst index 90ec940749e8..cb4c9a0c5a17 100644 --- a/Documentation/networking/dpaa2/ethernet-driver.rst +++ b/Documentation/networking/device_drivers/freescale/dpaa2/ethernet-driver.rst @@ -33,7 +33,7 @@ hardware resources, like queues, do not have a corresponding MC object and are treated as internal resources of other objects. For a more detailed description of the DPAA2 architecture and its object -abstractions see *Documentation/networking/dpaa2/overview.rst*. +abstractions see *Documentation/networking/device_drivers/freescale/dpaa2/overview.rst*. Each Linux net device is built on top of a Datapath Network Interface (DPNI) object and uses Buffer Pools (DPBPs), I/O Portals (DPIOs) and Concentrators diff --git a/Documentation/networking/dpaa2/index.rst b/Documentation/networking/device_drivers/freescale/dpaa2/index.rst index 67bd87fe6c53..67bd87fe6c53 100644 --- a/Documentation/networking/dpaa2/index.rst +++ b/Documentation/networking/device_drivers/freescale/dpaa2/index.rst diff --git a/Documentation/networking/dpaa2/overview.rst b/Documentation/networking/device_drivers/freescale/dpaa2/overview.rst index d638b5a8aadd..d638b5a8aadd 100644 --- a/Documentation/networking/dpaa2/overview.rst +++ b/Documentation/networking/device_drivers/freescale/dpaa2/overview.rst diff --git a/Documentation/networking/gianfar.txt b/Documentation/networking/device_drivers/freescale/gianfar.txt index ba1daea7f2e4..ba1daea7f2e4 100644 --- a/Documentation/networking/gianfar.txt +++ b/Documentation/networking/device_drivers/freescale/gianfar.txt diff --git a/Documentation/networking/e100.rst b/Documentation/networking/device_drivers/intel/e100.rst index 5e2839b4ec92..5e2839b4ec92 100644 --- a/Documentation/networking/e100.rst +++ b/Documentation/networking/device_drivers/intel/e100.rst diff --git a/Documentation/networking/e1000.rst b/Documentation/networking/device_drivers/intel/e1000.rst index 6379d4d20771..6379d4d20771 100644 --- a/Documentation/networking/e1000.rst +++ b/Documentation/networking/device_drivers/intel/e1000.rst diff --git a/Documentation/networking/e1000e.rst b/Documentation/networking/device_drivers/intel/e1000e.rst index 33554e5416c5..33554e5416c5 100644 --- a/Documentation/networking/e1000e.rst +++ b/Documentation/networking/device_drivers/intel/e1000e.rst diff --git a/Documentation/networking/fm10k.rst b/Documentation/networking/device_drivers/intel/fm10k.rst index bf5e5942f28d..bf5e5942f28d 100644 --- a/Documentation/networking/fm10k.rst +++ b/Documentation/networking/device_drivers/intel/fm10k.rst diff --git a/Documentation/networking/i40e.rst b/Documentation/networking/device_drivers/intel/i40e.rst index 0cc16c525d10..0cc16c525d10 100644 --- a/Documentation/networking/i40e.rst +++ b/Documentation/networking/device_drivers/intel/i40e.rst diff --git a/Documentation/networking/iavf.rst b/Documentation/networking/device_drivers/intel/iavf.rst index f8b42b64eb28..f8b42b64eb28 100644 --- a/Documentation/networking/iavf.rst +++ b/Documentation/networking/device_drivers/intel/iavf.rst diff --git a/Documentation/networking/ice.rst b/Documentation/networking/device_drivers/intel/ice.rst index 4d118b827bbb..4d118b827bbb 100644 --- a/Documentation/networking/ice.rst +++ b/Documentation/networking/device_drivers/intel/ice.rst diff --git a/Documentation/networking/igb.rst b/Documentation/networking/device_drivers/intel/igb.rst index ba16b86d5593..e87a4a72ea2d 100644 --- a/Documentation/networking/igb.rst +++ b/Documentation/networking/device_drivers/intel/igb.rst @@ -177,6 +177,25 @@ rate limit using the IProute2 tool. Download the latest version of the IProute2 tool from Sourceforge if your version does not have all the features you require. +Credit Based Shaper (Qav Mode) +------------------------------ +When enabling the CBS qdisc in the hardware offload mode, traffic shaping using +the CBS (described in the IEEE 802.1Q-2018 Section 8.6.8.2 and discussed in the +Annex L) algorithm will run in the i210 controller, so it's more accurate and +uses less CPU. + +When using offloaded CBS, and the traffic rate obeys the configured rate +(doesn't go above it), CBS should have little to no effect in the latency. + +The offloaded version of the algorithm has some limits, caused by how the idle +slope is expressed in the adapter's registers. It can only represent idle slopes +in 16.38431 kbps units, which means that if a idle slope of 2576kbps is +requested, the controller will be configured to use a idle slope of ~2589 kbps, +because the driver rounds the value up. For more details, see the comments on +:c:func:`igb_config_tx_modes()`. + +NOTE: This feature is exclusive to i210 models. + Support ======= diff --git a/Documentation/networking/igbvf.rst b/Documentation/networking/device_drivers/intel/igbvf.rst index a8a9ffa4f8d3..a8a9ffa4f8d3 100644 --- a/Documentation/networking/igbvf.rst +++ b/Documentation/networking/device_drivers/intel/igbvf.rst diff --git a/Documentation/networking/README.ipw2100 b/Documentation/networking/device_drivers/intel/ipw2100.txt index 6f85e1d06031..6f85e1d06031 100644 --- a/Documentation/networking/README.ipw2100 +++ b/Documentation/networking/device_drivers/intel/ipw2100.txt diff --git a/Documentation/networking/README.ipw2200 b/Documentation/networking/device_drivers/intel/ipw2200.txt index b7658bed4906..b7658bed4906 100644 --- a/Documentation/networking/README.ipw2200 +++ b/Documentation/networking/device_drivers/intel/ipw2200.txt diff --git a/Documentation/networking/ixgb.rst b/Documentation/networking/device_drivers/intel/ixgb.rst index 8bd80e27843d..8bd80e27843d 100644 --- a/Documentation/networking/ixgb.rst +++ b/Documentation/networking/device_drivers/intel/ixgb.rst diff --git a/Documentation/networking/ixgbe.rst b/Documentation/networking/device_drivers/intel/ixgbe.rst index 725fc697fd8f..86d887a63606 100644 --- a/Documentation/networking/ixgbe.rst +++ b/Documentation/networking/device_drivers/intel/ixgbe.rst @@ -501,6 +501,19 @@ NOTE: This feature can be disabled for a specific Virtual Function (VF):: ip link set <pf dev> vf <vf id> spoofchk {off|on} +IPsec Offload +------------- +The ixgbe driver supports IPsec Hardware Offload. When creating Security +Associations with "ip xfrm ..." the 'offload' tag option can be used to +register the IPsec SA with the driver in order to get higher throughput in +the secure communications. + +The offload is also supported for ixgbe's VFs, but the VF must be set as +'trusted' and the support must be enabled with:: + + ethtool --set-priv-flags eth<x> vf-ipsec on + ip link set eth<x> vf <y> trust on + Known Issues/Troubleshooting ============================ diff --git a/Documentation/networking/ixgbevf.rst b/Documentation/networking/device_drivers/intel/ixgbevf.rst index 56cde6366c2f..56cde6366c2f 100644 --- a/Documentation/networking/ixgbevf.rst +++ b/Documentation/networking/device_drivers/intel/ixgbevf.rst diff --git a/Documentation/networking/netvsc.txt b/Documentation/networking/device_drivers/microsoft/netvsc.txt index 3bfa635bbbd5..3bfa635bbbd5 100644 --- a/Documentation/networking/netvsc.txt +++ b/Documentation/networking/device_drivers/microsoft/netvsc.txt diff --git a/Documentation/networking/s2io.txt b/Documentation/networking/device_drivers/neterion/s2io.txt index 0362a42f7cf4..0362a42f7cf4 100644 --- a/Documentation/networking/s2io.txt +++ b/Documentation/networking/device_drivers/neterion/s2io.txt diff --git a/Documentation/networking/vxge.txt b/Documentation/networking/device_drivers/neterion/vxge.txt index abfec245f97c..abfec245f97c 100644 --- a/Documentation/networking/vxge.txt +++ b/Documentation/networking/device_drivers/neterion/vxge.txt diff --git a/Documentation/networking/LICENSE.qla3xxx b/Documentation/networking/device_drivers/qlogic/LICENSE.qla3xxx index 2f2077e34d81..2f2077e34d81 100644 --- a/Documentation/networking/LICENSE.qla3xxx +++ b/Documentation/networking/device_drivers/qlogic/LICENSE.qla3xxx diff --git a/Documentation/networking/LICENSE.qlcnic b/Documentation/networking/device_drivers/qlogic/LICENSE.qlcnic index 2ae3b64983ab..2ae3b64983ab 100644 --- a/Documentation/networking/LICENSE.qlcnic +++ b/Documentation/networking/device_drivers/qlogic/LICENSE.qlcnic diff --git a/Documentation/networking/LICENSE.qlge b/Documentation/networking/device_drivers/qlogic/LICENSE.qlge index ce64e4d15b21..ce64e4d15b21 100644 --- a/Documentation/networking/LICENSE.qlge +++ b/Documentation/networking/device_drivers/qlogic/LICENSE.qlge diff --git a/Documentation/networking/rmnet.txt b/Documentation/networking/device_drivers/qualcomm/rmnet.txt index 6b341eaf2062..6b341eaf2062 100644 --- a/Documentation/networking/rmnet.txt +++ b/Documentation/networking/device_drivers/qualcomm/rmnet.txt diff --git a/Documentation/networking/README.sb1000 b/Documentation/networking/device_drivers/sb1000.txt index f92c2aac56a9..f92c2aac56a9 100644 --- a/Documentation/networking/README.sb1000 +++ b/Documentation/networking/device_drivers/sb1000.txt diff --git a/Documentation/networking/smc9.txt b/Documentation/networking/device_drivers/smsc/smc9.txt index d1e15074e43d..d1e15074e43d 100644 --- a/Documentation/networking/smc9.txt +++ b/Documentation/networking/device_drivers/smsc/smc9.txt diff --git a/Documentation/networking/stmmac.txt b/Documentation/networking/device_drivers/stmicro/stmmac.txt index 2bb07078f535..2bb07078f535 100644 --- a/Documentation/networking/stmmac.txt +++ b/Documentation/networking/device_drivers/stmicro/stmmac.txt diff --git a/Documentation/networking/ti-cpsw.txt b/Documentation/networking/device_drivers/ti/cpsw.txt index d4d4c0751a09..d4d4c0751a09 100644 --- a/Documentation/networking/ti-cpsw.txt +++ b/Documentation/networking/device_drivers/ti/cpsw.txt diff --git a/Documentation/networking/tlan.txt b/Documentation/networking/device_drivers/ti/tlan.txt index 34550dfcef74..34550dfcef74 100644 --- a/Documentation/networking/tlan.txt +++ b/Documentation/networking/device_drivers/ti/tlan.txt diff --git a/Documentation/networking/spider_net.txt b/Documentation/networking/device_drivers/toshiba/spider_net.txt index b0b75f8463b3..b0b75f8463b3 100644 --- a/Documentation/networking/spider_net.txt +++ b/Documentation/networking/device_drivers/toshiba/spider_net.txt diff --git a/Documentation/networking/devlink-params.txt b/Documentation/networking/devlink-params.txt index ae444ffe73ac..2d26434ddcf8 100644 --- a/Documentation/networking/devlink-params.txt +++ b/Documentation/networking/devlink-params.txt @@ -40,3 +40,12 @@ msix_vec_per_pf_min [DEVICE, GENERIC] for the device initialization. Value is same across all physical functions (PFs) in the device. Type: u32 + +fw_load_policy [DEVICE, GENERIC] + Controls the device's firmware loading policy. + Valid values: + * DEVLINK_PARAM_FW_LOAD_POLICY_VALUE_DRIVER (0) + Load firmware version preferred by the driver. + * DEVLINK_PARAM_FW_LOAD_POLICY_VALUE_FLASH (1) + Load firmware currently stored in flash. + Type: u8 diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst index bd89dae8d578..6a47629ef8ed 100644 --- a/Documentation/networking/index.rst +++ b/Documentation/networking/index.rst @@ -31,6 +31,7 @@ Contents: net_failover alias bridge + snmp_counter .. only:: subproject diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt index 32b21571adfe..acdfb5d2bcaa 100644 --- a/Documentation/networking/ip-sysctl.txt +++ b/Documentation/networking/ip-sysctl.txt @@ -108,8 +108,8 @@ neigh/default/gc_thresh2 - INTEGER Default: 512 neigh/default/gc_thresh3 - INTEGER - Maximum number of neighbor entries allowed. Increase this - when using large numbers of interfaces and when communicating + Maximum number of non-PERMANENT neighbor entries allowed. Increase + this when using large numbers of interfaces and when communicating with large numbers of directly-connected peers. Default: 1024 @@ -370,6 +370,7 @@ tcp_l3mdev_accept - BOOLEAN derived from the listen socket to be bound to the L3 domain in which the packets originated. Only valid when the kernel was compiled with CONFIG_NET_L3_MASTER_DEV. + Default: 0 (disabled) tcp_low_latency - BOOLEAN This is a legacy option, it has no effect anymore. @@ -758,7 +759,7 @@ tcp_limit_output_bytes - INTEGER flows, for typical pfifo_fast qdiscs. tcp_limit_output_bytes limits the number of bytes on qdisc or device to reduce artificial RTT/cwnd and reduce bufferbloat. - Default: 262144 + Default: 1048576 (16 * 65536) tcp_challenge_ack_limit - INTEGER Limits number of Challenge ACK sent per second, as recommended @@ -773,6 +774,7 @@ udp_l3mdev_accept - BOOLEAN being received regardless of the L3 domain in which they originated. Only valid when the kernel was compiled with CONFIG_NET_L3_MASTER_DEV. + Default: 0 (disabled) udp_mem - vector of 3 INTEGERs: min, pressure, max Number of pages allowed for queueing by all UDP sockets. @@ -799,6 +801,16 @@ udp_wmem_min - INTEGER total pages of UDP sockets exceed udp_mem pressure. The unit is byte. Default: 4K +RAW variables: + +raw_l3mdev_accept - BOOLEAN + Enabling this option allows a "global" bound socket to work + across L3 master domains (e.g., VRFs) with packets capable of + being received regardless of the L3 domain in which they + originated. Only valid when the kernel was compiled with + CONFIG_NET_L3_MASTER_DEV. + Default: 1 (enabled) + CIPSOv4 Variables: cipso_cache_enable - BOOLEAN diff --git a/Documentation/networking/netdev-features.txt b/Documentation/networking/netdev-features.txt index c4a54c162547..58dd1c1e3c65 100644 --- a/Documentation/networking/netdev-features.txt +++ b/Documentation/networking/netdev-features.txt @@ -115,7 +115,7 @@ set, be it TCPv4 (when NETIF_F_TSO is enabled) or TCPv6 (NETIF_F_TSO6). * Transmit UDP segmentation offload -NETIF_F_GSO_UDP_GSO_L4 accepts a single UDP header with a payload that exceeds +NETIF_F_GSO_UDP_L4 accepts a single UDP header with a payload that exceeds gso_size. On segmentation, it segments the payload on gso_size boundaries and replicates the network and UDP headers (fixing up the last one if less than gso_size). diff --git a/Documentation/networking/nf_conntrack-sysctl.txt b/Documentation/networking/nf_conntrack-sysctl.txt index 1669dc2419fd..f75c2ce6e136 100644 --- a/Documentation/networking/nf_conntrack-sysctl.txt +++ b/Documentation/networking/nf_conntrack-sysctl.txt @@ -157,7 +157,16 @@ nf_conntrack_udp_timeout - INTEGER (seconds) default 30 nf_conntrack_udp_timeout_stream - INTEGER (seconds) - default 180 + default 120 This extended timeout will be used in case there is an UDP stream detected. + +nf_conntrack_gre_timeout - INTEGER (seconds) + default 30 + +nf_conntrack_gre_timeout_stream - INTEGER (seconds) + default 180 + + This extended timeout will be used in case there is an GRE stream + detected. diff --git a/Documentation/networking/snmp_counter.rst b/Documentation/networking/snmp_counter.rst new file mode 100644 index 000000000000..f8eb77ddbd44 --- /dev/null +++ b/Documentation/networking/snmp_counter.rst @@ -0,0 +1,1190 @@ +=========== +SNMP counter +=========== + +This document explains the meaning of SNMP counters. + +General IPv4 counters +==================== +All layer 4 packets and ICMP packets will change these counters, but +these counters won't be changed by layer 2 packets (such as STP) or +ARP packets. + +* IpInReceives +Defined in `RFC1213 ipInReceives`_ + +.. _RFC1213 ipInReceives: https://tools.ietf.org/html/rfc1213#page-26 + +The number of packets received by the IP layer. It gets increasing at the +beginning of ip_rcv function, always be updated together with +IpExtInOctets. It will be increased even if the packet is dropped +later (e.g. due to the IP header is invalid or the checksum is wrong +and so on). It indicates the number of aggregated segments after +GRO/LRO. + +* IpInDelivers +Defined in `RFC1213 ipInDelivers`_ + +.. _RFC1213 ipInDelivers: https://tools.ietf.org/html/rfc1213#page-28 + +The number of packets delivers to the upper layer protocols. E.g. TCP, UDP, +ICMP and so on. If no one listens on a raw socket, only kernel +supported protocols will be delivered, if someone listens on the raw +socket, all valid IP packets will be delivered. + +* IpOutRequests +Defined in `RFC1213 ipOutRequests`_ + +.. _RFC1213 ipOutRequests: https://tools.ietf.org/html/rfc1213#page-28 + +The number of packets sent via IP layer, for both single cast and +multicast packets, and would always be updated together with +IpExtOutOctets. + +* IpExtInOctets and IpExtOutOctets +They are Linux kernel extensions, no RFC definitions. Please note, +RFC1213 indeed defines ifInOctets and ifOutOctets, but they +are different things. The ifInOctets and ifOutOctets include the MAC +layer header size but IpExtInOctets and IpExtOutOctets don't, they +only include the IP layer header and the IP layer data. + +* IpExtInNoECTPkts, IpExtInECT1Pkts, IpExtInECT0Pkts, IpExtInCEPkts +They indicate the number of four kinds of ECN IP packets, please refer +`Explicit Congestion Notification`_ for more details. + +.. _Explicit Congestion Notification: https://tools.ietf.org/html/rfc3168#page-6 + +These 4 counters calculate how many packets received per ECN +status. They count the real frame number regardless the LRO/GRO. So +for the same packet, you might find that IpInReceives count 1, but +IpExtInNoECTPkts counts 2 or more. + +* IpInHdrErrors +Defined in `RFC1213 ipInHdrErrors`_. It indicates the packet is +dropped due to the IP header error. It might happen in both IP input +and IP forward paths. + +.. _RFC1213 ipInHdrErrors: https://tools.ietf.org/html/rfc1213#page-27 + +* IpInAddrErrors +Defined in `RFC1213 ipInAddrErrors`_. It will be increased in two +scenarios: (1) The IP address is invalid. (2) The destination IP +address is not a local address and IP forwarding is not enabled + +.. _RFC1213 ipInAddrErrors: https://tools.ietf.org/html/rfc1213#page-27 + +* IpExtInNoRoutes +This counter means the packet is dropped when the IP stack receives a +packet and can't find a route for it from the route table. It might +happen when IP forwarding is enabled and the destination IP address is +not a local address and there is no route for the destination IP +address. + +* IpInUnknownProtos +Defined in `RFC1213 ipInUnknownProtos`_. It will be increased if the +layer 4 protocol is unsupported by kernel. If an application is using +raw socket, kernel will always deliver the packet to the raw socket +and this counter won't be increased. + +.. _RFC1213 ipInUnknownProtos: https://tools.ietf.org/html/rfc1213#page-27 + +* IpExtInTruncatedPkts +For IPv4 packet, it means the actual data size is smaller than the +"Total Length" field in the IPv4 header. + +* IpInDiscards +Defined in `RFC1213 ipInDiscards`_. It indicates the packet is dropped +in the IP receiving path and due to kernel internal reasons (e.g. no +enough memory). + +.. _RFC1213 ipInDiscards: https://tools.ietf.org/html/rfc1213#page-28 + +* IpOutDiscards +Defined in `RFC1213 ipOutDiscards`_. It indicates the packet is +dropped in the IP sending path and due to kernel internal reasons. + +.. _RFC1213 ipOutDiscards: https://tools.ietf.org/html/rfc1213#page-28 + +* IpOutNoRoutes +Defined in `RFC1213 ipOutNoRoutes`_. It indicates the packet is +dropped in the IP sending path and no route is found for it. + +.. _RFC1213 ipOutNoRoutes: https://tools.ietf.org/html/rfc1213#page-29 + +ICMP counters +============ +* IcmpInMsgs and IcmpOutMsgs +Defined by `RFC1213 icmpInMsgs`_ and `RFC1213 icmpOutMsgs`_ + +.. _RFC1213 icmpInMsgs: https://tools.ietf.org/html/rfc1213#page-41 +.. _RFC1213 icmpOutMsgs: https://tools.ietf.org/html/rfc1213#page-43 + +As mentioned in the RFC1213, these two counters include errors, they +would be increased even if the ICMP packet has an invalid type. The +ICMP output path will check the header of a raw socket, so the +IcmpOutMsgs would still be updated if the IP header is constructed by +a userspace program. + +* ICMP named types +| These counters include most of common ICMP types, they are: +| IcmpInDestUnreachs: `RFC1213 icmpInDestUnreachs`_ +| IcmpInTimeExcds: `RFC1213 icmpInTimeExcds`_ +| IcmpInParmProbs: `RFC1213 icmpInParmProbs`_ +| IcmpInSrcQuenchs: `RFC1213 icmpInSrcQuenchs`_ +| IcmpInRedirects: `RFC1213 icmpInRedirects`_ +| IcmpInEchos: `RFC1213 icmpInEchos`_ +| IcmpInEchoReps: `RFC1213 icmpInEchoReps`_ +| IcmpInTimestamps: `RFC1213 icmpInTimestamps`_ +| IcmpInTimestampReps: `RFC1213 icmpInTimestampReps`_ +| IcmpInAddrMasks: `RFC1213 icmpInAddrMasks`_ +| IcmpInAddrMaskReps: `RFC1213 icmpInAddrMaskReps`_ +| IcmpOutDestUnreachs: `RFC1213 icmpOutDestUnreachs`_ +| IcmpOutTimeExcds: `RFC1213 icmpOutTimeExcds`_ +| IcmpOutParmProbs: `RFC1213 icmpOutParmProbs`_ +| IcmpOutSrcQuenchs: `RFC1213 icmpOutSrcQuenchs`_ +| IcmpOutRedirects: `RFC1213 icmpOutRedirects`_ +| IcmpOutEchos: `RFC1213 icmpOutEchos`_ +| IcmpOutEchoReps: `RFC1213 icmpOutEchoReps`_ +| IcmpOutTimestamps: `RFC1213 icmpOutTimestamps`_ +| IcmpOutTimestampReps: `RFC1213 icmpOutTimestampReps`_ +| IcmpOutAddrMasks: `RFC1213 icmpOutAddrMasks`_ +| IcmpOutAddrMaskReps: `RFC1213 icmpOutAddrMaskReps`_ + +.. _RFC1213 icmpInDestUnreachs: https://tools.ietf.org/html/rfc1213#page-41 +.. _RFC1213 icmpInTimeExcds: https://tools.ietf.org/html/rfc1213#page-41 +.. _RFC1213 icmpInParmProbs: https://tools.ietf.org/html/rfc1213#page-42 +.. _RFC1213 icmpInSrcQuenchs: https://tools.ietf.org/html/rfc1213#page-42 +.. _RFC1213 icmpInRedirects: https://tools.ietf.org/html/rfc1213#page-42 +.. _RFC1213 icmpInEchos: https://tools.ietf.org/html/rfc1213#page-42 +.. _RFC1213 icmpInEchoReps: https://tools.ietf.org/html/rfc1213#page-42 +.. _RFC1213 icmpInTimestamps: https://tools.ietf.org/html/rfc1213#page-42 +.. _RFC1213 icmpInTimestampReps: https://tools.ietf.org/html/rfc1213#page-43 +.. _RFC1213 icmpInAddrMasks: https://tools.ietf.org/html/rfc1213#page-43 +.. _RFC1213 icmpInAddrMaskReps: https://tools.ietf.org/html/rfc1213#page-43 + +.. _RFC1213 icmpOutDestUnreachs: https://tools.ietf.org/html/rfc1213#page-44 +.. _RFC1213 icmpOutTimeExcds: https://tools.ietf.org/html/rfc1213#page-44 +.. _RFC1213 icmpOutParmProbs: https://tools.ietf.org/html/rfc1213#page-44 +.. _RFC1213 icmpOutSrcQuenchs: https://tools.ietf.org/html/rfc1213#page-44 +.. _RFC1213 icmpOutRedirects: https://tools.ietf.org/html/rfc1213#page-44 +.. _RFC1213 icmpOutEchos: https://tools.ietf.org/html/rfc1213#page-45 +.. _RFC1213 icmpOutEchoReps: https://tools.ietf.org/html/rfc1213#page-45 +.. _RFC1213 icmpOutTimestamps: https://tools.ietf.org/html/rfc1213#page-45 +.. _RFC1213 icmpOutTimestampReps: https://tools.ietf.org/html/rfc1213#page-45 +.. _RFC1213 icmpOutAddrMasks: https://tools.ietf.org/html/rfc1213#page-45 +.. _RFC1213 icmpOutAddrMaskReps: https://tools.ietf.org/html/rfc1213#page-46 + +Every ICMP type has two counters: 'In' and 'Out'. E.g., for the ICMP +Echo packet, they are IcmpInEchos and IcmpOutEchos. Their meanings are +straightforward. The 'In' counter means kernel receives such a packet +and the 'Out' counter means kernel sends such a packet. + +* ICMP numeric types +They are IcmpMsgInType[N] and IcmpMsgOutType[N], the [N] indicates the +ICMP type number. These counters track all kinds of ICMP packets. The +ICMP type number definition could be found in the `ICMP parameters`_ +document. + +.. _ICMP parameters: https://www.iana.org/assignments/icmp-parameters/icmp-parameters.xhtml + +For example, if the Linux kernel sends an ICMP Echo packet, the +IcmpMsgOutType8 would increase 1. And if kernel gets an ICMP Echo Reply +packet, IcmpMsgInType0 would increase 1. + +* IcmpInCsumErrors +This counter indicates the checksum of the ICMP packet is +wrong. Kernel verifies the checksum after updating the IcmpInMsgs and +before updating IcmpMsgInType[N]. If a packet has bad checksum, the +IcmpInMsgs would be updated but none of IcmpMsgInType[N] would be updated. + +* IcmpInErrors and IcmpOutErrors +Defined by `RFC1213 icmpInErrors`_ and `RFC1213 icmpOutErrors`_ + +.. _RFC1213 icmpInErrors: https://tools.ietf.org/html/rfc1213#page-41 +.. _RFC1213 icmpOutErrors: https://tools.ietf.org/html/rfc1213#page-43 + +When an error occurs in the ICMP packet handler path, these two +counters would be updated. The receiving packet path use IcmpInErrors +and the sending packet path use IcmpOutErrors. When IcmpInCsumErrors +is increased, IcmpInErrors would always be increased too. + +relationship of the ICMP counters +------------------------------- +The sum of IcmpMsgOutType[N] is always equal to IcmpOutMsgs, as they +are updated at the same time. The sum of IcmpMsgInType[N] plus +IcmpInErrors should be equal or larger than IcmpInMsgs. When kernel +receives an ICMP packet, kernel follows below logic: + +1. increase IcmpInMsgs +2. if has any error, update IcmpInErrors and finish the process +3. update IcmpMsgOutType[N] +4. handle the packet depending on the type, if has any error, update + IcmpInErrors and finish the process + +So if all errors occur in step (2), IcmpInMsgs should be equal to the +sum of IcmpMsgOutType[N] plus IcmpInErrors. If all errors occur in +step (4), IcmpInMsgs should be equal to the sum of +IcmpMsgOutType[N]. If the errors occur in both step (2) and step (4), +IcmpInMsgs should be less than the sum of IcmpMsgOutType[N] plus +IcmpInErrors. + +General TCP counters +================== +* TcpInSegs +Defined in `RFC1213 tcpInSegs`_ + +.. _RFC1213 tcpInSegs: https://tools.ietf.org/html/rfc1213#page-48 + +The number of packets received by the TCP layer. As mentioned in +RFC1213, it includes the packets received in error, such as checksum +error, invalid TCP header and so on. Only one error won't be included: +if the layer 2 destination address is not the NIC's layer 2 +address. It might happen if the packet is a multicast or broadcast +packet, or the NIC is in promiscuous mode. In these situations, the +packets would be delivered to the TCP layer, but the TCP layer will discard +these packets before increasing TcpInSegs. The TcpInSegs counter +isn't aware of GRO. So if two packets are merged by GRO, the TcpInSegs +counter would only increase 1. + +* TcpOutSegs +Defined in `RFC1213 tcpOutSegs`_ + +.. _RFC1213 tcpOutSegs: https://tools.ietf.org/html/rfc1213#page-48 + +The number of packets sent by the TCP layer. As mentioned in RFC1213, +it excludes the retransmitted packets. But it includes the SYN, ACK +and RST packets. Doesn't like TcpInSegs, the TcpOutSegs is aware of +GSO, so if a packet would be split to 2 by GSO, TcpOutSegs will +increase 2. + +* TcpActiveOpens +Defined in `RFC1213 tcpActiveOpens`_ + +.. _RFC1213 tcpActiveOpens: https://tools.ietf.org/html/rfc1213#page-47 + +It means the TCP layer sends a SYN, and come into the SYN-SENT +state. Every time TcpActiveOpens increases 1, TcpOutSegs should always +increase 1. + +* TcpPassiveOpens +Defined in `RFC1213 tcpPassiveOpens`_ + +.. _RFC1213 tcpPassiveOpens: https://tools.ietf.org/html/rfc1213#page-47 + +It means the TCP layer receives a SYN, replies a SYN+ACK, come into +the SYN-RCVD state. + +* TcpExtTCPRcvCoalesce +When packets are received by the TCP layer and are not be read by the +application, the TCP layer will try to merge them. This counter +indicate how many packets are merged in such situation. If GRO is +enabled, lots of packets would be merged by GRO, these packets +wouldn't be counted to TcpExtTCPRcvCoalesce. + +* TcpExtTCPAutoCorking +When sending packets, the TCP layer will try to merge small packets to +a bigger one. This counter increase 1 for every packet merged in such +situation. Please refer to the LWN article for more details: +https://lwn.net/Articles/576263/ + +* TcpExtTCPOrigDataSent +This counter is explained by `kernel commit f19c29e3e391`_, I pasted the +explaination below:: + + TCPOrigDataSent: number of outgoing packets with original data (excluding + retransmission but including data-in-SYN). This counter is different from + TcpOutSegs because TcpOutSegs also tracks pure ACKs. TCPOrigDataSent is + more useful to track the TCP retransmission rate. + +* TCPSynRetrans +This counter is explained by `kernel commit f19c29e3e391`_, I pasted the +explaination below:: + + TCPSynRetrans: number of SYN and SYN/ACK retransmits to break down + retransmissions into SYN, fast-retransmits, timeout retransmits, etc. + +* TCPFastOpenActiveFail +This counter is explained by `kernel commit f19c29e3e391`_, I pasted the +explaination below:: + + TCPFastOpenActiveFail: Fast Open attempts (SYN/data) failed because + the remote does not accept it or the attempts timed out. + +.. _kernel commit f19c29e3e391: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=f19c29e3e391a66a273e9afebaf01917245148cd + +* TcpExtListenOverflows and TcpExtListenDrops +When kernel receives a SYN from a client, and if the TCP accept queue +is full, kernel will drop the SYN and add 1 to TcpExtListenOverflows. +At the same time kernel will also add 1 to TcpExtListenDrops. When a +TCP socket is in LISTEN state, and kernel need to drop a packet, +kernel would always add 1 to TcpExtListenDrops. So increase +TcpExtListenOverflows would let TcpExtListenDrops increasing at the +same time, but TcpExtListenDrops would also increase without +TcpExtListenOverflows increasing, e.g. a memory allocation fail would +also let TcpExtListenDrops increase. + +Note: The above explanation is based on kernel 4.10 or above version, on +an old kernel, the TCP stack has different behavior when TCP accept +queue is full. On the old kernel, TCP stack won't drop the SYN, it +would complete the 3-way handshake. As the accept queue is full, TCP +stack will keep the socket in the TCP half-open queue. As it is in the +half open queue, TCP stack will send SYN+ACK on an exponential backoff +timer, after client replies ACK, TCP stack checks whether the accept +queue is still full, if it is not full, moves the socket to the accept +queue, if it is full, keeps the socket in the half-open queue, at next +time client replies ACK, this socket will get another chance to move +to the accept queue. + + +TCP Fast Open +============ +When kernel receives a TCP packet, it has two paths to handler the +packet, one is fast path, another is slow path. The comment in kernel +code provides a good explanation of them, I pasted them below:: + + It is split into a fast path and a slow path. The fast path is + disabled when: + + - A zero window was announced from us + - zero window probing + is only handled properly on the slow path. + - Out of order segments arrived. + - Urgent data is expected. + - There is no buffer space left + - Unexpected TCP flags/window values/header lengths are received + (detected by checking the TCP header against pred_flags) + - Data is sent in both directions. The fast path only supports pure senders + or pure receivers (this means either the sequence number or the ack + value must stay constant) + - Unexpected TCP option. + +Kernel will try to use fast path unless any of the above conditions +are satisfied. If the packets are out of order, kernel will handle +them in slow path, which means the performance might be not very +good. Kernel would also come into slow path if the "Delayed ack" is +used, because when using "Delayed ack", the data is sent in both +directions. When the TCP window scale option is not used, kernel will +try to enable fast path immediately when the connection comes into the +established state, but if the TCP window scale option is used, kernel +will disable the fast path at first, and try to enable it after kernel +receives packets. + +* TcpExtTCPPureAcks and TcpExtTCPHPAcks +If a packet set ACK flag and has no data, it is a pure ACK packet, if +kernel handles it in the fast path, TcpExtTCPHPAcks will increase 1, +if kernel handles it in the slow path, TcpExtTCPPureAcks will +increase 1. + +* TcpExtTCPHPHits +If a TCP packet has data (which means it is not a pure ACK packet), +and this packet is handled in the fast path, TcpExtTCPHPHits will +increase 1. + + +TCP abort +======== + + +* TcpExtTCPAbortOnData +It means TCP layer has data in flight, but need to close the +connection. So TCP layer sends a RST to the other side, indicate the +connection is not closed very graceful. An easy way to increase this +counter is using the SO_LINGER option. Please refer to the SO_LINGER +section of the `socket man page`_: + +.. _socket man page: http://man7.org/linux/man-pages/man7/socket.7.html + +By default, when an application closes a connection, the close function +will return immediately and kernel will try to send the in-flight data +async. If you use the SO_LINGER option, set l_onoff to 1, and l_linger +to a positive number, the close function won't return immediately, but +wait for the in-flight data are acked by the other side, the max wait +time is l_linger seconds. If set l_onoff to 1 and set l_linger to 0, +when the application closes a connection, kernel will send a RST +immediately and increase the TcpExtTCPAbortOnData counter. + +* TcpExtTCPAbortOnClose +This counter means the application has unread data in the TCP layer when +the application wants to close the TCP connection. In such a situation, +kernel will send a RST to the other side of the TCP connection. + +* TcpExtTCPAbortOnMemory +When an application closes a TCP connection, kernel still need to track +the connection, let it complete the TCP disconnect process. E.g. an +app calls the close method of a socket, kernel sends fin to the other +side of the connection, then the app has no relationship with the +socket any more, but kernel need to keep the socket, this socket +becomes an orphan socket, kernel waits for the reply of the other side, +and would come to the TIME_WAIT state finally. When kernel has no +enough memory to keep the orphan socket, kernel would send an RST to +the other side, and delete the socket, in such situation, kernel will +increase 1 to the TcpExtTCPAbortOnMemory. Two conditions would trigger +TcpExtTCPAbortOnMemory: + +1. the memory used by the TCP protocol is higher than the third value of +the tcp_mem. Please refer the tcp_mem section in the `TCP man page`_: + +.. _TCP man page: http://man7.org/linux/man-pages/man7/tcp.7.html + +2. the orphan socket count is higher than net.ipv4.tcp_max_orphans + + +* TcpExtTCPAbortOnTimeout +This counter will increase when any of the TCP timers expire. In such +situation, kernel won't send RST, just give up the connection. + +* TcpExtTCPAbortOnLinger +When a TCP connection comes into FIN_WAIT_2 state, instead of waiting +for the fin packet from the other side, kernel could send a RST and +delete the socket immediately. This is not the default behavior of +Linux kernel TCP stack. By configuring the TCP_LINGER2 socket option, +you could let kernel follow this behavior. + +* TcpExtTCPAbortFailed +The kernel TCP layer will send RST if the `RFC2525 2.17 section`_ is +satisfied. If an internal error occurs during this process, +TcpExtTCPAbortFailed will be increased. + +.. _RFC2525 2.17 section: https://tools.ietf.org/html/rfc2525#page-50 + +TCP Hybrid Slow Start +==================== +The Hybrid Slow Start algorithm is an enhancement of the traditional +TCP congestion window Slow Start algorithm. It uses two pieces of +information to detect whether the max bandwidth of the TCP path is +approached. The two pieces of information are ACK train length and +increase in packet delay. For detail information, please refer the +`Hybrid Slow Start paper`_. Either ACK train length or packet delay +hits a specific threshold, the congestion control algorithm will come +into the Congestion Avoidance state. Until v4.20, two congestion +control algorithms are using Hybrid Slow Start, they are cubic (the +default congestion control algorithm) and cdg. Four snmp counters +relate with the Hybrid Slow Start algorithm. + +.. _Hybrid Slow Start paper: https://pdfs.semanticscholar.org/25e9/ef3f03315782c7f1cbcd31b587857adae7d1.pdf + +* TcpExtTCPHystartTrainDetect +How many times the ACK train length threshold is detected + +* TcpExtTCPHystartTrainCwnd +The sum of CWND detected by ACK train length. Dividing this value by +TcpExtTCPHystartTrainDetect is the average CWND which detected by the +ACK train length. + +* TcpExtTCPHystartDelayDetect +How many times the packet delay threshold is detected. + +* TcpExtTCPHystartDelayCwnd +The sum of CWND detected by packet delay. Dividing this value by +TcpExtTCPHystartDelayDetect is the average CWND which detected by the +packet delay. + +TCP retransmission and congestion control +====================================== +The TCP protocol has two retransmission mechanisms: SACK and fast +recovery. They are exclusive with each other. When SACK is enabled, +the kernel TCP stack would use SACK, or kernel would use fast +recovery. The SACK is a TCP option, which is defined in `RFC2018`_, +the fast recovery is defined in `RFC6582`_, which is also called +'Reno'. + +The TCP congestion control is a big and complex topic. To understand +the related snmp counter, we need to know the states of the congestion +control state machine. There are 5 states: Open, Disorder, CWR, +Recovery and Loss. For details about these states, please refer page 5 +and page 6 of this document: +https://pdfs.semanticscholar.org/0e9c/968d09ab2e53e24c4dca5b2d67c7f7140f8e.pdf + +.. _RFC2018: https://tools.ietf.org/html/rfc2018 +.. _RFC6582: https://tools.ietf.org/html/rfc6582 + +* TcpExtTCPRenoRecovery and TcpExtTCPSackRecovery +When the congestion control comes into Recovery state, if sack is +used, TcpExtTCPSackRecovery increases 1, if sack is not used, +TcpExtTCPRenoRecovery increases 1. These two counters mean the TCP +stack begins to retransmit the lost packets. + +* TcpExtTCPSACKReneging +A packet was acknowledged by SACK, but the receiver has dropped this +packet, so the sender needs to retransmit this packet. In this +situation, the sender adds 1 to TcpExtTCPSACKReneging. A receiver +could drop a packet which has been acknowledged by SACK, although it is +unusual, it is allowed by the TCP protocol. The sender doesn't really +know what happened on the receiver side. The sender just waits until +the RTO expires for this packet, then the sender assumes this packet +has been dropped by the receiver. + +* TcpExtTCPRenoReorder +The reorder packet is detected by fast recovery. It would only be used +if SACK is disabled. The fast recovery algorithm detects recorder by +the duplicate ACK number. E.g., if retransmission is triggered, and +the original retransmitted packet is not lost, it is just out of +order, the receiver would acknowledge multiple times, one for the +retransmitted packet, another for the arriving of the original out of +order packet. Thus the sender would find more ACks than its +expectation, and the sender knows out of order occurs. + +* TcpExtTCPTSReorder +The reorder packet is detected when a hole is filled. E.g., assume the +sender sends packet 1,2,3,4,5, and the receiving order is +1,2,4,5,3. When the sender receives the ACK of packet 3 (which will +fill the hole), two conditions will let TcpExtTCPTSReorder increase +1: (1) if the packet 3 is not re-retransmitted yet. (2) if the packet +3 is retransmitted but the timestamp of the packet 3's ACK is earlier +than the retransmission timestamp. + +* TcpExtTCPSACKReorder +The reorder packet detected by SACK. The SACK has two methods to +detect reorder: (1) DSACK is received by the sender. It means the +sender sends the same packet more than one times. And the only reason +is the sender believes an out of order packet is lost so it sends the +packet again. (2) Assume packet 1,2,3,4,5 are sent by the sender, and +the sender has received SACKs for packet 2 and 5, now the sender +receives SACK for packet 4 and the sender doesn't retransmit the +packet yet, the sender would know packet 4 is out of order. The TCP +stack of kernel will increase TcpExtTCPSACKReorder for both of the +above scenarios. + + +DSACK +===== +The DSACK is defined in `RFC2883`_. The receiver uses DSACK to report +duplicate packets to the sender. There are two kinds of +duplications: (1) a packet which has been acknowledged is +duplicate. (2) an out of order packet is duplicate. The TCP stack +counts these two kinds of duplications on both receiver side and +sender side. + +.. _RFC2883 : https://tools.ietf.org/html/rfc2883 + +* TcpExtTCPDSACKOldSent +The TCP stack receives a duplicate packet which has been acked, so it +sends a DSACK to the sender. + +* TcpExtTCPDSACKOfoSent +The TCP stack receives an out of order duplicate packet, so it sends a +DSACK to the sender. + +* TcpExtTCPDSACKRecv +The TCP stack receives a DSACK, which indicate an acknowledged +duplicate packet is received. + +* TcpExtTCPDSACKOfoRecv +The TCP stack receives a DSACK, which indicate an out of order +duplciate packet is received. + +examples +======= + +ping test +-------- +Run the ping command against the public dns server 8.8.8.8:: + + nstatuser@nstat-a:~$ ping 8.8.8.8 -c 1 + PING 8.8.8.8 (8.8.8.8) 56(84) bytes of data. + 64 bytes from 8.8.8.8: icmp_seq=1 ttl=119 time=17.8 ms + + --- 8.8.8.8 ping statistics --- + 1 packets transmitted, 1 received, 0% packet loss, time 0ms + rtt min/avg/max/mdev = 17.875/17.875/17.875/0.000 ms + +The nstayt result:: + + nstatuser@nstat-a:~$ nstat + #kernel + IpInReceives 1 0.0 + IpInDelivers 1 0.0 + IpOutRequests 1 0.0 + IcmpInMsgs 1 0.0 + IcmpInEchoReps 1 0.0 + IcmpOutMsgs 1 0.0 + IcmpOutEchos 1 0.0 + IcmpMsgInType0 1 0.0 + IcmpMsgOutType8 1 0.0 + IpExtInOctets 84 0.0 + IpExtOutOctets 84 0.0 + IpExtInNoECTPkts 1 0.0 + +The Linux server sent an ICMP Echo packet, so IpOutRequests, +IcmpOutMsgs, IcmpOutEchos and IcmpMsgOutType8 were increased 1. The +server got ICMP Echo Reply from 8.8.8.8, so IpInReceives, IcmpInMsgs, +IcmpInEchoReps and IcmpMsgInType0 were increased 1. The ICMP Echo Reply +was passed to the ICMP layer via IP layer, so IpInDelivers was +increased 1. The default ping data size is 48, so an ICMP Echo packet +and its corresponding Echo Reply packet are constructed by: + +* 14 bytes MAC header +* 20 bytes IP header +* 16 bytes ICMP header +* 48 bytes data (default value of the ping command) + +So the IpExtInOctets and IpExtOutOctets are 20+16+48=84. + +tcp 3-way handshake +------------------ +On server side, we run:: + + nstatuser@nstat-b:~$ nc -lknv 0.0.0.0 9000 + Listening on [0.0.0.0] (family 0, port 9000) + +On client side, we run:: + + nstatuser@nstat-a:~$ nc -nv 192.168.122.251 9000 + Connection to 192.168.122.251 9000 port [tcp/*] succeeded! + +The server listened on tcp 9000 port, the client connected to it, they +completed the 3-way handshake. + +On server side, we can find below nstat output:: + + nstatuser@nstat-b:~$ nstat | grep -i tcp + TcpPassiveOpens 1 0.0 + TcpInSegs 2 0.0 + TcpOutSegs 1 0.0 + TcpExtTCPPureAcks 1 0.0 + +On client side, we can find below nstat output:: + + nstatuser@nstat-a:~$ nstat | grep -i tcp + TcpActiveOpens 1 0.0 + TcpInSegs 1 0.0 + TcpOutSegs 2 0.0 + +When the server received the first SYN, it replied a SYN+ACK, and came into +SYN-RCVD state, so TcpPassiveOpens increased 1. The server received +SYN, sent SYN+ACK, received ACK, so server sent 1 packet, received 2 +packets, TcpInSegs increased 2, TcpOutSegs increased 1. The last ACK +of the 3-way handshake is a pure ACK without data, so +TcpExtTCPPureAcks increased 1. + +When the client sent SYN, the client came into the SYN-SENT state, so +TcpActiveOpens increased 1, the client sent SYN, received SYN+ACK, sent +ACK, so client sent 2 packets, received 1 packet, TcpInSegs increased +1, TcpOutSegs increased 2. + +TCP normal traffic +----------------- +Run nc on server:: + + nstatuser@nstat-b:~$ nc -lkv 0.0.0.0 9000 + Listening on [0.0.0.0] (family 0, port 9000) + +Run nc on client:: + + nstatuser@nstat-a:~$ nc -v nstat-b 9000 + Connection to nstat-b 9000 port [tcp/*] succeeded! + +Input a string in the nc client ('hello' in our example):: + + nstatuser@nstat-a:~$ nc -v nstat-b 9000 + Connection to nstat-b 9000 port [tcp/*] succeeded! + hello + +The client side nstat output:: + + nstatuser@nstat-a:~$ nstat + #kernel + IpInReceives 1 0.0 + IpInDelivers 1 0.0 + IpOutRequests 1 0.0 + TcpInSegs 1 0.0 + TcpOutSegs 1 0.0 + TcpExtTCPPureAcks 1 0.0 + TcpExtTCPOrigDataSent 1 0.0 + IpExtInOctets 52 0.0 + IpExtOutOctets 58 0.0 + IpExtInNoECTPkts 1 0.0 + +The server side nstat output:: + + nstatuser@nstat-b:~$ nstat + #kernel + IpInReceives 1 0.0 + IpInDelivers 1 0.0 + IpOutRequests 1 0.0 + TcpInSegs 1 0.0 + TcpOutSegs 1 0.0 + IpExtInOctets 58 0.0 + IpExtOutOctets 52 0.0 + IpExtInNoECTPkts 1 0.0 + +Input a string in nc client side again ('world' in our exmaple):: + + nstatuser@nstat-a:~$ nc -v nstat-b 9000 + Connection to nstat-b 9000 port [tcp/*] succeeded! + hello + world + +Client side nstat output:: + + nstatuser@nstat-a:~$ nstat + #kernel + IpInReceives 1 0.0 + IpInDelivers 1 0.0 + IpOutRequests 1 0.0 + TcpInSegs 1 0.0 + TcpOutSegs 1 0.0 + TcpExtTCPHPAcks 1 0.0 + TcpExtTCPOrigDataSent 1 0.0 + IpExtInOctets 52 0.0 + IpExtOutOctets 58 0.0 + IpExtInNoECTPkts 1 0.0 + + +Server side nstat output:: + + nstatuser@nstat-b:~$ nstat + #kernel + IpInReceives 1 0.0 + IpInDelivers 1 0.0 + IpOutRequests 1 0.0 + TcpInSegs 1 0.0 + TcpOutSegs 1 0.0 + TcpExtTCPHPHits 1 0.0 + IpExtInOctets 58 0.0 + IpExtOutOctets 52 0.0 + IpExtInNoECTPkts 1 0.0 + +Compare the first client-side nstat and the second client-side nstat, +we could find one difference: the first one had a 'TcpExtTCPPureAcks', +but the second one had a 'TcpExtTCPHPAcks'. The first server-side +nstat and the second server-side nstat had a difference too: the +second server-side nstat had a TcpExtTCPHPHits, but the first +server-side nstat didn't have it. The network traffic patterns were +exactly the same: the client sent a packet to the server, the server +replied an ACK. But kernel handled them in different ways. When the +TCP window scale option is not used, kernel will try to enable fast +path immediately when the connection comes into the established state, +but if the TCP window scale option is used, kernel will disable the +fast path at first, and try to enable it after kerenl receives +packets. We could use the 'ss' command to verify whether the window +scale option is used. e.g. run below command on either server or +client:: + + nstatuser@nstat-a:~$ ss -o state established -i '( dport = :9000 or sport = :9000 ) + Netid Recv-Q Send-Q Local Address:Port Peer Address:Port + tcp 0 0 192.168.122.250:40654 192.168.122.251:9000 + ts sack cubic wscale:7,7 rto:204 rtt:0.98/0.49 mss:1448 pmtu:1500 rcvmss:536 advmss:1448 cwnd:10 bytes_acked:1 segs_out:2 segs_in:1 send 118.2Mbps lastsnd:46572 lastrcv:46572 lastack:46572 pacing_rate 236.4Mbps rcv_space:29200 rcv_ssthresh:29200 minrtt:0.98 + +The 'wscale:7,7' means both server and client set the window scale +option to 7. Now we could explain the nstat output in our test: + +In the first nstat output of client side, the client sent a packet, server +reply an ACK, when kernel handled this ACK, the fast path was not +enabled, so the ACK was counted into 'TcpExtTCPPureAcks'. + +In the second nstat output of client side, the client sent a packet again, +and received another ACK from the server, in this time, the fast path is +enabled, and the ACK was qualified for fast path, so it was handled by +the fast path, so this ACK was counted into TcpExtTCPHPAcks. + +In the first nstat output of server side, fast path was not enabled, +so there was no 'TcpExtTCPHPHits'. + +In the second nstat output of server side, the fast path was enabled, +and the packet received from client qualified for fast path, so it +was counted into 'TcpExtTCPHPHits'. + +TcpExtTCPAbortOnClose +-------------------- +On the server side, we run below python script:: + + import socket + import time + + port = 9000 + + s = socket.socket(socket.AF_INET, socket.SOCK_STREAM) + s.bind(('0.0.0.0', port)) + s.listen(1) + sock, addr = s.accept() + while True: + time.sleep(9999999) + +This python script listen on 9000 port, but doesn't read anything from +the connection. + +On the client side, we send the string "hello" by nc:: + + nstatuser@nstat-a:~$ echo "hello" | nc nstat-b 9000 + +Then, we come back to the server side, the server has received the "hello" +packet, and the TCP layer has acked this packet, but the application didn't +read it yet. We type Ctrl-C to terminate the server script. Then we +could find TcpExtTCPAbortOnClose increased 1 on the server side:: + + nstatuser@nstat-b:~$ nstat | grep -i abort + TcpExtTCPAbortOnClose 1 0.0 + +If we run tcpdump on the server side, we could find the server sent a +RST after we type Ctrl-C. + +TcpExtTCPAbortOnMemory and TcpExtTCPAbortOnTimeout +----------------------------------------------- +Below is an example which let the orphan socket count be higher than +net.ipv4.tcp_max_orphans. +Change tcp_max_orphans to a smaller value on client:: + + sudo bash -c "echo 10 > /proc/sys/net/ipv4/tcp_max_orphans" + +Client code (create 64 connection to server):: + + nstatuser@nstat-a:~$ cat client_orphan.py + import socket + import time + + server = 'nstat-b' # server address + port = 9000 + + count = 64 + + connection_list = [] + + for i in range(64): + s = socket.socket(socket.AF_INET, socket.SOCK_STREAM) + s.connect((server, port)) + connection_list.append(s) + print("connection_count: %d" % len(connection_list)) + + while True: + time.sleep(99999) + +Server code (accept 64 connection from client):: + + nstatuser@nstat-b:~$ cat server_orphan.py + import socket + import time + + port = 9000 + count = 64 + + s = socket.socket(socket.AF_INET, socket.SOCK_STREAM) + s.bind(('0.0.0.0', port)) + s.listen(count) + connection_list = [] + while True: + sock, addr = s.accept() + connection_list.append((sock, addr)) + print("connection_count: %d" % len(connection_list)) + +Run the python scripts on server and client. + +On server:: + + python3 server_orphan.py + +On client:: + + python3 client_orphan.py + +Run iptables on server:: + + sudo iptables -A INPUT -i ens3 -p tcp --destination-port 9000 -j DROP + +Type Ctrl-C on client, stop client_orphan.py. + +Check TcpExtTCPAbortOnMemory on client:: + + nstatuser@nstat-a:~$ nstat | grep -i abort + TcpExtTCPAbortOnMemory 54 0.0 + +Check orphane socket count on client:: + + nstatuser@nstat-a:~$ ss -s + Total: 131 (kernel 0) + TCP: 14 (estab 1, closed 0, orphaned 10, synrecv 0, timewait 0/0), ports 0 + + Transport Total IP IPv6 + * 0 - - + RAW 1 0 1 + UDP 1 1 0 + TCP 14 13 1 + INET 16 14 2 + FRAG 0 0 0 + +The explanation of the test: after run server_orphan.py and +client_orphan.py, we set up 64 connections between server and +client. Run the iptables command, the server will drop all packets from +the client, type Ctrl-C on client_orphan.py, the system of the client +would try to close these connections, and before they are closed +gracefully, these connections became orphan sockets. As the iptables +of the server blocked packets from the client, the server won't receive fin +from the client, so all connection on clients would be stuck on FIN_WAIT_1 +stage, so they will keep as orphan sockets until timeout. We have echo +10 to /proc/sys/net/ipv4/tcp_max_orphans, so the client system would +only keep 10 orphan sockets, for all other orphan sockets, the client +system sent RST for them and delete them. We have 64 connections, so +the 'ss -s' command shows the system has 10 orphan sockets, and the +value of TcpExtTCPAbortOnMemory was 54. + +An additional explanation about orphan socket count: You could find the +exactly orphan socket count by the 'ss -s' command, but when kernel +decide whither increases TcpExtTCPAbortOnMemory and sends RST, kernel +doesn't always check the exactly orphan socket count. For increasing +performance, kernel checks an approximate count firstly, if the +approximate count is more than tcp_max_orphans, kernel checks the +exact count again. So if the approximate count is less than +tcp_max_orphans, but exactly count is more than tcp_max_orphans, you +would find TcpExtTCPAbortOnMemory is not increased at all. If +tcp_max_orphans is large enough, it won't occur, but if you decrease +tcp_max_orphans to a small value like our test, you might find this +issue. So in our test, the client set up 64 connections although the +tcp_max_orphans is 10. If the client only set up 11 connections, we +can't find the change of TcpExtTCPAbortOnMemory. + +Continue the previous test, we wait for several minutes. Because of the +iptables on the server blocked the traffic, the server wouldn't receive +fin, and all the client's orphan sockets would timeout on the +FIN_WAIT_1 state finally. So we wait for a few minutes, we could find +10 timeout on the client:: + + nstatuser@nstat-a:~$ nstat | grep -i abort + TcpExtTCPAbortOnTimeout 10 0.0 + +TcpExtTCPAbortOnLinger +--------------------- +The server side code:: + + nstatuser@nstat-b:~$ cat server_linger.py + import socket + import time + + port = 9000 + + s = socket.socket(socket.AF_INET, socket.SOCK_STREAM) + s.bind(('0.0.0.0', port)) + s.listen(1) + sock, addr = s.accept() + while True: + time.sleep(9999999) + +The client side code:: + + nstatuser@nstat-a:~$ cat client_linger.py + import socket + import struct + + server = 'nstat-b' # server address + port = 9000 + + s = socket.socket(socket.AF_INET, socket.SOCK_STREAM) + s.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER, struct.pack('ii', 1, 10)) + s.setsockopt(socket.SOL_TCP, socket.TCP_LINGER2, struct.pack('i', -1)) + s.connect((server, port)) + s.close() + +Run server_linger.py on server:: + + nstatuser@nstat-b:~$ python3 server_linger.py + +Run client_linger.py on client:: + + nstatuser@nstat-a:~$ python3 client_linger.py + +After run client_linger.py, check the output of nstat:: + + nstatuser@nstat-a:~$ nstat | grep -i abort + TcpExtTCPAbortOnLinger 1 0.0 + +TcpExtTCPRcvCoalesce +------------------- +On the server, we run a program which listen on TCP port 9000, but +doesn't read any data:: + + import socket + import time + port = 9000 + s = socket.socket(socket.AF_INET, socket.SOCK_STREAM) + s.bind(('0.0.0.0', port)) + s.listen(1) + sock, addr = s.accept() + while True: + time.sleep(9999999) + +Save the above code as server_coalesce.py, and run:: + + python3 server_coalesce.py + +On the client, save below code as client_coalesce.py:: + + import socket + server = 'nstat-b' + port = 9000 + s = socket.socket(socket.AF_INET, socket.SOCK_STREAM) + s.connect((server, port)) + +Run:: + + nstatuser@nstat-a:~$ python3 -i client_coalesce.py + +We use '-i' to come into the interactive mode, then a packet:: + + >>> s.send(b'foo') + 3 + +Send a packet again:: + + >>> s.send(b'bar') + 3 + +On the server, run nstat:: + + ubuntu@nstat-b:~$ nstat + #kernel + IpInReceives 2 0.0 + IpInDelivers 2 0.0 + IpOutRequests 2 0.0 + TcpInSegs 2 0.0 + TcpOutSegs 2 0.0 + TcpExtTCPRcvCoalesce 1 0.0 + IpExtInOctets 110 0.0 + IpExtOutOctets 104 0.0 + IpExtInNoECTPkts 2 0.0 + +The client sent two packets, server didn't read any data. When +the second packet arrived at server, the first packet was still in +the receiving queue. So the TCP layer merged the two packets, and we +could find the TcpExtTCPRcvCoalesce increased 1. + +TcpExtListenOverflows and TcpExtListenDrops +---------------------------------------- +On server, run the nc command, listen on port 9000:: + + nstatuser@nstat-b:~$ nc -lkv 0.0.0.0 9000 + Listening on [0.0.0.0] (family 0, port 9000) + +On client, run 3 nc commands in different terminals:: + + nstatuser@nstat-a:~$ nc -v nstat-b 9000 + Connection to nstat-b 9000 port [tcp/*] succeeded! + +The nc command only accepts 1 connection, and the accept queue length +is 1. On current linux implementation, set queue length to n means the +actual queue length is n+1. Now we create 3 connections, 1 is accepted +by nc, 2 in accepted queue, so the accept queue is full. + +Before running the 4th nc, we clean the nstat history on the server:: + + nstatuser@nstat-b:~$ nstat -n + +Run the 4th nc on the client:: + + nstatuser@nstat-a:~$ nc -v nstat-b 9000 + +If the nc server is running on kernel 4.10 or higher version, you +won't see the "Connection to ... succeeded!" string, because kernel +will drop the SYN if the accept queue is full. If the nc client is running +on an old kernel, you would see that the connection is succeeded, +because kernel would complete the 3 way handshake and keep the socket +on half open queue. I did the test on kernel 4.15. Below is the nstat +on the server:: + + nstatuser@nstat-b:~$ nstat + #kernel + IpInReceives 4 0.0 + IpInDelivers 4 0.0 + TcpInSegs 4 0.0 + TcpExtListenOverflows 4 0.0 + TcpExtListenDrops 4 0.0 + IpExtInOctets 240 0.0 + IpExtInNoECTPkts 4 0.0 + +Both TcpExtListenOverflows and TcpExtListenDrops were 4. If the time +between the 4th nc and the nstat was longer, the value of +TcpExtListenOverflows and TcpExtListenDrops would be larger, because +the SYN of the 4th nc was dropped, the client was retrying. + +IpInAddrErrors, IpExtInNoRoutes and IpOutNoRoutes +---------------------------------------------- +server A IP address: 192.168.122.250 +server B IP address: 192.168.122.251 +Prepare on server A, add a route to server B:: + + $ sudo ip route add 8.8.8.8/32 via 192.168.122.251 + +Prepare on server B, disable send_redirects for all interfaces:: + + $ sudo sysctl -w net.ipv4.conf.all.send_redirects=0 + $ sudo sysctl -w net.ipv4.conf.ens3.send_redirects=0 + $ sudo sysctl -w net.ipv4.conf.lo.send_redirects=0 + $ sudo sysctl -w net.ipv4.conf.default.send_redirects=0 + +We want to let sever A send a packet to 8.8.8.8, and route the packet +to server B. When server B receives such packet, it might send a ICMP +Redirect message to server A, set send_redirects to 0 will disable +this behavior. + +First, generate InAddrErrors. On server B, we disable IP forwarding:: + + $ sudo sysctl -w net.ipv4.conf.all.forwarding=0 + +On server A, we send packets to 8.8.8.8:: + + $ nc -v 8.8.8.8 53 + +On server B, we check the output of nstat:: + + $ nstat + #kernel + IpInReceives 3 0.0 + IpInAddrErrors 3 0.0 + IpExtInOctets 180 0.0 + IpExtInNoECTPkts 3 0.0 + +As we have let server A route 8.8.8.8 to server B, and we disabled IP +forwarding on server B, Server A sent packets to server B, then server B +dropped packets and increased IpInAddrErrors. As the nc command would +re-send the SYN packet if it didn't receive a SYN+ACK, we could find +multiple IpInAddrErrors. + +Second, generate IpExtInNoRoutes. On server B, we enable IP +forwarding:: + + $ sudo sysctl -w net.ipv4.conf.all.forwarding=1 + +Check the route table of server B and remove the default route:: + + $ ip route show + default via 192.168.122.1 dev ens3 proto static + 192.168.122.0/24 dev ens3 proto kernel scope link src 192.168.122.251 + $ sudo ip route delete default via 192.168.122.1 dev ens3 proto static + +On server A, we contact 8.8.8.8 again:: + + $ nc -v 8.8.8.8 53 + nc: connect to 8.8.8.8 port 53 (tcp) failed: Network is unreachable + +On server B, run nstat:: + + $ nstat + #kernel + IpInReceives 1 0.0 + IpOutRequests 1 0.0 + IcmpOutMsgs 1 0.0 + IcmpOutDestUnreachs 1 0.0 + IcmpMsgOutType3 1 0.0 + IpExtInNoRoutes 1 0.0 + IpExtInOctets 60 0.0 + IpExtOutOctets 88 0.0 + IpExtInNoECTPkts 1 0.0 + +We enabled IP forwarding on server B, when server B received a packet +which destination IP address is 8.8.8.8, server B will try to forward +this packet. We have deleted the default route, there was no route for +8.8.8.8, so server B increase IpExtInNoRoutes and sent the "ICMP +Destination Unreachable" message to server A. + +Third, generate IpOutNoRoutes. Run ping command on server B:: + + $ ping -c 1 8.8.8.8 + connect: Network is unreachable + +Run nstat on server B:: + + $ nstat + #kernel + IpOutNoRoutes 1 0.0 + +We have deleted the default route on server B. Server B couldn't find +a route for the 8.8.8.8 IP address, so server B increased +IpOutNoRoutes. diff --git a/Documentation/networking/vrf.txt b/Documentation/networking/vrf.txt index 8ff7b4c8f91b..a5f103b083a0 100644 --- a/Documentation/networking/vrf.txt +++ b/Documentation/networking/vrf.txt @@ -103,19 +103,33 @@ VRF device: or to specify the output device using cmsg and IP_PKTINFO. +By default the scope of the port bindings for unbound sockets is +limited to the default VRF. That is, it will not be matched by packets +arriving on interfaces enslaved to an l3mdev and processes may bind to +the same port if they bind to an l3mdev. + TCP & UDP services running in the default VRF context (ie., not bound to any VRF device) can work across all VRF domains by enabling the tcp_l3mdev_accept and udp_l3mdev_accept sysctl options: + sysctl -w net.ipv4.tcp_l3mdev_accept=1 sysctl -w net.ipv4.udp_l3mdev_accept=1 +These options are disabled by default so that a socket in a VRF is only +selected for packets in that VRF. There is a similar option for RAW +sockets, which is enabled by default for reasons of backwards compatibility. +This is so as to specify the output device with cmsg and IP_PKTINFO, but +using a socket not bound to the corresponding VRF. This allows e.g. older ping +implementations to be run with specifying the device but without executing it +in the VRF. This option can be disabled so that packets received in a VRF +context are only handled by a raw socket bound to the VRF, and packets in the +default VRF are only handled by a socket not bound to any VRF: + + sysctl -w net.ipv4.raw_l3mdev_accept=0 + netfilter rules on the VRF device can be used to limit access to services running in the default VRF context as well. -The default VRF does not have limited scope with respect to port bindings. -That is, if a process does a wildcard bind to a port in the default VRF it -owns the port across all VRF domains within the network namespace. - ################################################################################ Using iproute2 for VRFs diff --git a/Documentation/networking/xfrm_device.txt b/Documentation/networking/xfrm_device.txt index 267f55b5f54a..a1c904dc70dc 100644 --- a/Documentation/networking/xfrm_device.txt +++ b/Documentation/networking/xfrm_device.txt @@ -111,9 +111,10 @@ the stack in xfrm_input(). xfrm_state_hold(xs); store the state information into the skb - skb->sp = secpath_dup(skb->sp); - skb->sp->xvec[skb->sp->len++] = xs; - skb->sp->olen++; + sp = secpath_set(skb); + if (!sp) return; + sp->xvec[sp->len++] = xs; + sp->olen++; indicate the success and/or error status of the offload xo = xfrm_offload(skb); |