diff options
author | Linus Torvalds <torvalds@linux-foundation.org> | 2023-04-27 02:07:23 +0300 |
---|---|---|
committer | Linus Torvalds <torvalds@linux-foundation.org> | 2023-04-27 02:07:23 +0300 |
commit | 6e98b09da931a00bf4e0477d0fa52748bf28fcce (patch) | |
tree | 9c658ed95add5693f42f29f63df80a2ede3f6ec2 /Documentation/networking/napi.rst | |
parent | b68ee1c6131c540a62ecd443be89c406401df091 (diff) | |
parent | 9b78d919632b7149d311aaad5a977e4b48b10321 (diff) | |
download | linux-6e98b09da931a00bf4e0477d0fa52748bf28fcce.tar.xz |
Merge tag 'net-next-6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
Pull networking updates from Paolo Abeni:
"Core:
- Introduce a config option to tweak MAX_SKB_FRAGS. Increasing the
default value allows for better BIG TCP performances
- Reduce compound page head access for zero-copy data transfers
- RPS/RFS improvements, avoiding unneeded NET_RX_SOFTIRQ when
possible
- Threaded NAPI improvements, adding defer skb free support and
unneeded softirq avoidance
- Address dst_entry reference count scalability issues, via false
sharing avoidance and optimize refcount tracking
- Add lockless accesses annotation to sk_err[_soft]
- Optimize again the skb struct layout
- Extends the skb drop reasons to make it usable by multiple
subsystems
- Better const qualifier awareness for socket casts
BPF:
- Add skb and XDP typed dynptrs which allow BPF programs for more
ergonomic and less brittle iteration through data and
variable-sized accesses
- Add a new BPF netfilter program type and minimal support to hook
BPF programs to netfilter hooks such as prerouting or forward
- Add more precise memory usage reporting for all BPF map types
- Adds support for using {FOU,GUE} encap with an ipip device
operating in collect_md mode and add a set of BPF kfuncs for
controlling encap params
- Allow BPF programs to detect at load time whether a particular
kfunc exists or not, and also add support for this in light
skeleton
- Bigger batch of BPF verifier improvements to prepare for upcoming
BPF open-coded iterators allowing for less restrictive looping
capabilities
- Rework RCU enforcement in the verifier, add kptr_rcu and enforce
BPF programs to NULL-check before passing such pointers into kfunc
- Add support for kptrs in percpu hashmaps, percpu LRU hashmaps and
in local storage maps
- Enable RCU semantics for task BPF kptrs and allow referenced kptr
tasks to be stored in BPF maps
- Add support for refcounted local kptrs to the verifier for allowing
shared ownership, useful for adding a node to both the BPF list and
rbtree
- Add BPF verifier support for ST instructions in
convert_ctx_access() which will help new -mcpu=v4 clang flag to
start emitting them
- Add ARM32 USDT support to libbpf
- Improve bpftool's visual program dump which produces the control
flow graph in a DOT format by adding C source inline annotations
Protocols:
- IPv4: Allow adding to IPv4 address a 'protocol' tag. Such value
indicates the provenance of the IP address
- IPv6: optimize route lookup, dropping unneeded R/W lock acquisition
- Add the handshake upcall mechanism, allowing the user-space to
implement generic TLS handshake on kernel's behalf
- Bridge: support per-{Port, VLAN} neighbor suppression, increasing
resilience to nodes failures
- SCTP: add support for Fair Capacity and Weighted Fair Queueing
schedulers
- MPTCP: delay first subflow allocation up to its first usage. This
will allow for later better LSM interaction
- xfrm: Remove inner/outer modes from input/output path. These are
not needed anymore
- WiFi:
- reduced neighbor report (RNR) handling for AP mode
- HW timestamping support
- support for randomized auth/deauth TA for PASN privacy
- per-link debugfs for multi-link
- TC offload support for mac80211 drivers
- mac80211 mesh fast-xmit and fast-rx support
- enable Wi-Fi 7 (EHT) mesh support
Netfilter:
- Add nf_tables 'brouting' support, to force a packet to be routed
instead of being bridged
- Update bridge netfilter and ovs conntrack helpers to handle IPv6
Jumbo packets properly, i.e. fetch the packet length from
hop-by-hop extension header. This is needed for BIT TCP support
- The iptables 32bit compat interface isn't compiled in by default
anymore
- Move ip(6)tables builtin icmp matches to the udptcp one. This has
the advantage that icmp/icmpv6 match doesn't load the
iptables/ip6tables modules anymore when iptables-nft is used
- Extended netlink error report for netdevice in flowtables and
netdev/chains. Allow for incrementally add/delete devices to netdev
basechain. Allow to create netdev chain without device
Driver API:
- Remove redundant Device Control Error Reporting Enable, as PCI core
has already error reporting enabled at enumeration time
- Move Multicast DB netlink handlers to core, allowing devices other
then bridge to use them
- Allow the page_pool to directly recycle the pages from safely
localized NAPI
- Implement lockless TX queue stop/wake combo macros, allowing for
further code de-duplication and sanitization
- Add YNL support for user headers and struct attrs
- Add partial YNL specification for devlink
- Add partial YNL specification for ethtool
- Add tc-mqprio and tc-taprio support for preemptible traffic classes
- Add tx push buf len param to ethtool, specifies the maximum number
of bytes of a transmitted packet a driver can push directly to the
underlying device
- Add basic LED support for switch/phy
- Add NAPI documentation, stop relaying on external links
- Convert dsa_master_ioctl() to netdev notifier. This is a
preparatory work to make the hardware timestamping layer selectable
by user space
- Add transceiver support and improve the error messages for CAN-FD
controllers
New hardware / drivers:
- Ethernet:
- AMD/Pensando core device support
- MediaTek MT7981 SoC
- MediaTek MT7988 SoC
- Broadcom BCM53134 embedded switch
- Texas Instruments CPSW9G ethernet switch
- Qualcomm EMAC3 DWMAC ethernet
- StarFive JH7110 SoC
- NXP CBTX ethernet PHY
- WiFi:
- Apple M1 Pro/Max devices
- RealTek rtl8710bu/rtl8188gu
- RealTek rtl8822bs, rtl8822cs and rtl8821cs SDIO chipset
- Bluetooth:
- Realtek RTL8821CS, RTL8851B, RTL8852BS
- Mediatek MT7663, MT7922
- NXP w8997
- Actions Semi ATS2851
- QTI WCN6855
- Marvell 88W8997
- Can:
- STMicroelectronics bxcan stm32f429
Drivers:
- Ethernet NICs:
- Intel (1G, icg):
- add tracking and reporting of QBV config errors
- add support for configuring max SDU for each Tx queue
- Intel (100G, ice):
- refactor mailbox overflow detection to support Scalable IOV
- GNSS interface optimization
- Intel (i40e):
- support XDP multi-buffer
- nVidia/Mellanox:
- add the support for linux bridge multicast offload
- enable TC offload for egress and engress MACVLAN over bond
- add support for VxLAN GBP encap/decap flows offload
- extend packet offload to fully support libreswan
- support tunnel mode in mlx5 IPsec packet offload
- extend XDP multi-buffer support
- support MACsec VLAN offload
- add support for dynamic msix vectors allocation
- drop RX page_cache and fully use page_pool
- implement thermal zone to report NIC temperature
- Netronome/Corigine:
- add support for multi-zone conntrack offload
- Solarflare/Xilinx:
- support offloading TC VLAN push/pop actions to the MAE
- support TC decap rules
- support unicast PTP
- Other NICs:
- Broadcom (bnxt): enforce software based freq adjustments only on
shared PHC NIC
- RealTek (r8169): refactor to addess ASPM issues during NAPI poll
- Micrel (lan8841): add support for PTP_PF_PEROUT
- Cadence (macb): enable PTP unicast
- Engleder (tsnep): add XDP socket zero-copy support
- virtio-net: implement exact header length guest feature
- veth: add page_pool support for page recycling
- vxlan: add MDB data path support
- gve: add XDP support for GQI-QPL format
- geneve: accept every ethertype
- macvlan: allow some packets to bypass broadcast queue
- mana: add support for jumbo frame
- Ethernet high-speed switches:
- Microchip (sparx5): Add support for TC flower templates
- Ethernet embedded switches:
- Broadcom (b54):
- configure 6318 and 63268 RGMII ports
- Marvell (mv88e6xxx):
- faster C45 bus scan
- Microchip:
- lan966x:
- add support for IS1 VCAP
- better TX/RX from/to CPU performances
- ksz9477: add ETS Qdisc support
- ksz8: enhance static MAC table operations and error handling
- sama7g5: add PTP capability
- NXP (ocelot):
- add support for external ports
- add support for preemptible traffic classes
- Texas Instruments:
- add CPSWxG SGMII support for J7200 and J721E
- Intel WiFi (iwlwifi):
- preparation for Wi-Fi 7 EHT and multi-link support
- EHT (Wi-Fi 7) sniffer support
- hardware timestamping support for some devices/firwmares
- TX beacon protection on newer hardware
- Qualcomm 802.11ax WiFi (ath11k):
- MU-MIMO parameters support
- ack signal support for management packets
- RealTek WiFi (rtw88):
- SDIO bus support
- better support for some SDIO devices (e.g. MAC address from
efuse)
- RealTek WiFi (rtw89):
- HW scan support for 8852b
- better support for 6 GHz scanning
- support for various newer firmware APIs
- framework firmware backwards compatibility
- MediaTek WiFi (mt76):
- P2P support
- mesh A-MSDU support
- EHT (Wi-Fi 7) support
- coredump support"
* tag 'net-next-6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2078 commits)
net: phy: hide the PHYLIB_LEDS knob
net: phy: marvell-88x2222: remove unnecessary (void*) conversions
tcp/udp: Fix memleaks of sk and zerocopy skbs with TX timestamp.
net: amd: Fix link leak when verifying config failed
net: phy: marvell: Fix inconsistent indenting in led_blink_set
lan966x: Don't use xdp_frame when action is XDP_TX
tsnep: Add XDP socket zero-copy TX support
tsnep: Add XDP socket zero-copy RX support
tsnep: Move skb receive action to separate function
tsnep: Add functions for queue enable/disable
tsnep: Rework TX/RX queue initialization
tsnep: Replace modulo operation with mask
net: phy: dp83867: Add led_brightness_set support
net: phy: Fix reading LED reg property
drivers: nfc: nfcsim: remove return value check of `dev_dir`
net: phy: dp83867: Remove unnecessary (void*) conversions
net: ethtool: coalesce: try to make user settings stick twice
net: mana: Check if netdev/napi_alloc_frag returns single page
net: mana: Rename mana_refill_rxoob and remove some empty lines
net: veth: add page_pool stats
...
Diffstat (limited to 'Documentation/networking/napi.rst')
-rw-r--r-- | Documentation/networking/napi.rst | 254 |
1 files changed, 254 insertions, 0 deletions
diff --git a/Documentation/networking/napi.rst b/Documentation/networking/napi.rst new file mode 100644 index 000000000000..a7a047742e93 --- /dev/null +++ b/Documentation/networking/napi.rst @@ -0,0 +1,254 @@ +.. SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause) + +.. _napi: + +==== +NAPI +==== + +NAPI is the event handling mechanism used by the Linux networking stack. +The name NAPI no longer stands for anything in particular [#]_. + +In basic operation the device notifies the host about new events +via an interrupt. +The host then schedules a NAPI instance to process the events. +The device may also be polled for events via NAPI without receiving +interrupts first (:ref:`busy polling<poll>`). + +NAPI processing usually happens in the software interrupt context, +but there is an option to use :ref:`separate kernel threads<threaded>` +for NAPI processing. + +All in all NAPI abstracts away from the drivers the context and configuration +of event (packet Rx and Tx) processing. + +Driver API +========== + +The two most important elements of NAPI are the struct napi_struct +and the associated poll method. struct napi_struct holds the state +of the NAPI instance while the method is the driver-specific event +handler. The method will typically free Tx packets that have been +transmitted and process newly received packets. + +.. _drv_ctrl: + +Control API +----------- + +netif_napi_add() and netif_napi_del() add/remove a NAPI instance +from the system. The instances are attached to the netdevice passed +as argument (and will be deleted automatically when netdevice is +unregistered). Instances are added in a disabled state. + +napi_enable() and napi_disable() manage the disabled state. +A disabled NAPI can't be scheduled and its poll method is guaranteed +to not be invoked. napi_disable() waits for ownership of the NAPI +instance to be released. + +The control APIs are not idempotent. Control API calls are safe against +concurrent use of datapath APIs but an incorrect sequence of control API +calls may result in crashes, deadlocks, or race conditions. For example, +calling napi_disable() multiple times in a row will deadlock. + +Datapath API +------------ + +napi_schedule() is the basic method of scheduling a NAPI poll. +Drivers should call this function in their interrupt handler +(see :ref:`drv_sched` for more info). A successful call to napi_schedule() +will take ownership of the NAPI instance. + +Later, after NAPI is scheduled, the driver's poll method will be +called to process the events/packets. The method takes a ``budget`` +argument - drivers can process completions for any number of Tx +packets but should only process up to ``budget`` number of +Rx packets. Rx processing is usually much more expensive. + +In other words, it is recommended to ignore the budget argument when +performing TX buffer reclamation to ensure that the reclamation is not +arbitrarily bounded; however, it is required to honor the budget argument +for RX processing. + +.. warning:: + + The ``budget`` argument may be 0 if core tries to only process Tx completions + and no Rx packets. + +The poll method returns the amount of work done. If the driver still +has outstanding work to do (e.g. ``budget`` was exhausted) +the poll method should return exactly ``budget``. In that case, +the NAPI instance will be serviced/polled again (without the +need to be scheduled). + +If event processing has been completed (all outstanding packets +processed) the poll method should call napi_complete_done() +before returning. napi_complete_done() releases the ownership +of the instance. + +.. warning:: + + The case of finishing all events and using exactly ``budget`` + must be handled carefully. There is no way to report this + (rare) condition to the stack, so the driver must either + not call napi_complete_done() and wait to be called again, + or return ``budget - 1``. + + If the ``budget`` is 0 napi_complete_done() should never be called. + +Call sequence +------------- + +Drivers should not make assumptions about the exact sequencing +of calls. The poll method may be called without the driver scheduling +the instance (unless the instance is disabled). Similarly, +it's not guaranteed that the poll method will be called, even +if napi_schedule() succeeded (e.g. if the instance gets disabled). + +As mentioned in the :ref:`drv_ctrl` section - napi_disable() and subsequent +calls to the poll method only wait for the ownership of the instance +to be released, not for the poll method to exit. This means that +drivers should avoid accessing any data structures after calling +napi_complete_done(). + +.. _drv_sched: + +Scheduling and IRQ masking +-------------------------- + +Drivers should keep the interrupts masked after scheduling +the NAPI instance - until NAPI polling finishes any further +interrupts are unnecessary. + +Drivers which have to mask the interrupts explicitly (as opposed +to IRQ being auto-masked by the device) should use the napi_schedule_prep() +and __napi_schedule() calls: + +.. code-block:: c + + if (napi_schedule_prep(&v->napi)) { + mydrv_mask_rxtx_irq(v->idx); + /* schedule after masking to avoid races */ + __napi_schedule(&v->napi); + } + +IRQ should only be unmasked after a successful call to napi_complete_done(): + +.. code-block:: c + + if (budget && napi_complete_done(&v->napi, work_done)) { + mydrv_unmask_rxtx_irq(v->idx); + return min(work_done, budget - 1); + } + +napi_schedule_irqoff() is a variant of napi_schedule() which takes advantage +of guarantees given by being invoked in IRQ context (no need to +mask interrupts). Note that PREEMPT_RT forces all interrupts +to be threaded so the interrupt may need to be marked ``IRQF_NO_THREAD`` +to avoid issues on real-time kernel configurations. + +Instance to queue mapping +------------------------- + +Modern devices have multiple NAPI instances (struct napi_struct) per +interface. There is no strong requirement on how the instances are +mapped to queues and interrupts. NAPI is primarily a polling/processing +abstraction without specific user-facing semantics. That said, most networking +devices end up using NAPI in fairly similar ways. + +NAPI instances most often correspond 1:1:1 to interrupts and queue pairs +(queue pair is a set of a single Rx and single Tx queue). + +In less common cases a NAPI instance may be used for multiple queues +or Rx and Tx queues can be serviced by separate NAPI instances on a single +core. Regardless of the queue assignment, however, there is usually still +a 1:1 mapping between NAPI instances and interrupts. + +It's worth noting that the ethtool API uses a "channel" terminology where +each channel can be either ``rx``, ``tx`` or ``combined``. It's not clear +what constitutes a channel; the recommended interpretation is to understand +a channel as an IRQ/NAPI which services queues of a given type. For example, +a configuration of 1 ``rx``, 1 ``tx`` and 1 ``combined`` channel is expected +to utilize 3 interrupts, 2 Rx and 2 Tx queues. + +User API +======== + +User interactions with NAPI depend on NAPI instance ID. The instance IDs +are only visible to the user thru the ``SO_INCOMING_NAPI_ID`` socket option. +It's not currently possible to query IDs used by a given device. + +Software IRQ coalescing +----------------------- + +NAPI does not perform any explicit event coalescing by default. +In most scenarios batching happens due to IRQ coalescing which is done +by the device. There are cases where software coalescing is helpful. + +NAPI can be configured to arm a repoll timer instead of unmasking +the hardware interrupts as soon as all packets are processed. +The ``gro_flush_timeout`` sysfs configuration of the netdevice +is reused to control the delay of the timer, while +``napi_defer_hard_irqs`` controls the number of consecutive empty polls +before NAPI gives up and goes back to using hardware IRQs. + +.. _poll: + +Busy polling +------------ + +Busy polling allows a user process to check for incoming packets before +the device interrupt fires. As is the case with any busy polling it trades +off CPU cycles for lower latency (production uses of NAPI busy polling +are not well known). + +Busy polling is enabled by either setting ``SO_BUSY_POLL`` on +selected sockets or using the global ``net.core.busy_poll`` and +``net.core.busy_read`` sysctls. An io_uring API for NAPI busy polling +also exists. + +IRQ mitigation +--------------- + +While busy polling is supposed to be used by low latency applications, +a similar mechanism can be used for IRQ mitigation. + +Very high request-per-second applications (especially routing/forwarding +applications and especially applications using AF_XDP sockets) may not +want to be interrupted until they finish processing a request or a batch +of packets. + +Such applications can pledge to the kernel that they will perform a busy +polling operation periodically, and the driver should keep the device IRQs +permanently masked. This mode is enabled by using the ``SO_PREFER_BUSY_POLL`` +socket option. To avoid system misbehavior the pledge is revoked +if ``gro_flush_timeout`` passes without any busy poll call. + +The NAPI budget for busy polling is lower than the default (which makes +sense given the low latency intention of normal busy polling). This is +not the case with IRQ mitigation, however, so the budget can be adjusted +with the ``SO_BUSY_POLL_BUDGET`` socket option. + +.. _threaded: + +Threaded NAPI +------------- + +Threaded NAPI is an operating mode that uses dedicated kernel +threads rather than software IRQ context for NAPI processing. +The configuration is per netdevice and will affect all +NAPI instances of that device. Each NAPI instance will spawn a separate +thread (called ``napi/${ifc-name}-${napi-id}``). + +It is recommended to pin each kernel thread to a single CPU, the same +CPU as the CPU which services the interrupt. Note that the mapping +between IRQs and NAPI instances may not be trivial (and is driver +dependent). The NAPI instance IDs will be assigned in the opposite +order than the process IDs of the kernel threads. + +Threaded NAPI is controlled by writing 0/1 to the ``threaded`` file in +netdev's sysfs directory. + +.. rubric:: Footnotes + +.. [#] NAPI was originally referred to as New API in 2.4 Linux. |