diff options
author | Achiad Shochat <achiad@mellanox.com> | 2015-07-23 23:35:59 +0300 |
---|---|---|
committer | David S. Miller <davem@davemloft.net> | 2015-07-27 10:29:17 +0300 |
commit | 88a85f99e51fb2373259ab83c8bb130a9bbf3804 (patch) | |
tree | f27980b2cdd3c9601a5953c7d268eabc18452a2e /include/linux/mlx5 | |
parent | 58d522912ac7d25b63f468fa4a4e8bb059c5144e (diff) | |
download | linux-88a85f99e51fb2373259ab83c8bb130a9bbf3804.tar.xz |
net/mlx5e: TX latency optimization to save DMA reads
A regular TX WQE execution involves two or more DMA reads -
one to fetch the WQE, and another one per WQE gather entry.
These DMA reads obviously increase the TX latency.
There are two mlx5 mechanisms to bypass these DMA reads:
1) Inline WQE
2) Blue Flame (BF)
An inline WQE contains a whole packet, thus saves the DMA read/s
of the regular WQE gather entry/s. Inline WQE support was already
added in the previous commit.
A BF WQE is written directly to the device I/O mapped memory, thus
enables saving the DMA read that fetches the WQE.
The BF WQE I/O write must be in cache line granularity, thus uses
the CPU write combining mechanism.
A BF WQE I/O write acts also as a TX doorbell for notifying the
device of new TX WQEs.
A BF WQE is written to the same I/O mapped address as the regular TX
doorbell, thus this address is being mapped twice - once by ioremap()
and once by io_mapping_map_wc().
While both mechanisms reduce the TX latency, they both consume more CPU
cycles than a regular WQE:
- A BF WQE must still be written to host memory, in addition to being
written directly to the device I/O mapped memory.
- An inline WQE involves copying the SKB data into it.
To handle this tradeoff, we introduce here a heuristic algorithm that
strives to avoid using these two mechanisms in case the TX queue is
being back-pressured by the device, and limit their usage rate otherwise.
An inline WQE will always be "Blue Flamed" (written directly to the
device I/O mapped memory) while a BF WQE may not be inlined (may contain
gather entries).
Preliminary testing using netperf UDP_RR shows that the latency goes down
from 17.5us to 16.9us, while the message rate (tested with pktgen) stays
the same.
Signed-off-by: Achiad Shochat <achiad@mellanox.com>
Signed-off-by: Amir Vadai <amirv@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Diffstat (limited to 'include/linux/mlx5')
-rw-r--r-- | include/linux/mlx5/driver.h | 4 |
1 files changed, 3 insertions, 1 deletions
diff --git a/include/linux/mlx5/driver.h b/include/linux/mlx5/driver.h index 1c0d5d062d7c..5fe0cae1a515 100644 --- a/include/linux/mlx5/driver.h +++ b/include/linux/mlx5/driver.h @@ -380,7 +380,7 @@ struct mlx5_uar { u32 index; struct list_head bf_list; unsigned free_bf_bmap; - void __iomem *wc_map; + void __iomem *bf_map; void __iomem *map; }; @@ -435,6 +435,8 @@ struct mlx5_priv { struct mlx5_uuar_info uuari; MLX5_DECLARE_DOORBELL_LOCK(cq_uar_lock); + struct io_mapping *bf_mapping; + /* pages stuff */ struct workqueue_struct *pg_wq; struct rb_root page_root; |