| author | David S. Miller <davem@davemloft.net> | 2016-07-20 07:46:34 +0300 |
|---|---|---|
| committer | David S. Miller <davem@davemloft.net> | 2016-07-20 07:46:34 +0300 |
| commit | 22b3548861fb21ad79e0d3afeee123b0eb3912cc (patch) | |
| tree | f776e9a5cb46b98c4148f83a3f960022dd3c142e /include/linux | |
| parent | ddbcb79493d96bd0d98987f4f6602f0f96665518 (diff) | |
| parent | 764cbccef8c9cb95e869ba2bb8371c42685c934a (diff) | |
| download | linux-22b3548861fb21ad79e0d3afeee123b0eb3912cc.tar.xz | |
Merge branch 'xdp'
Brenden Blanco says:
====================
Add driver bpf hook for early packet drop and forwarding
This patch set introduces new infrastructure for programmatically
processing packets in the earliest stages of rx, as part of an effort
others are calling eXpress Data Path (XDP) [1]. Start this effort by
introducing a new bpf program type for early packet filtering, before
even an skb has been allocated.
Extend on this with the ability to modify packet data and send back out
on the same port.
Patch 1 adds an API for bulk bpf prog refcnt increment.
Patch 2 introduces the new prog type and helpers for validating the bpf
program. A new userspace struct is defined containing only data and
data_end as fields, with others to follow in the future.
In patch 3, create a new ndo to pass the fd to supported drivers.
In patch 4, expose a new rtnl option to userspace.
In patch 5, enable support in mlx4 driver.
In patch 6, create a sample drop and count program. With single core,
achieved ~20 Mpps drop rate on a 40G ConnectX3-Pro. This includes
packet data access, bpf array lookup, and increment (a sketch of such a
program follows after this list).
In patch 7, add a page recycle facility to mlx4 rx, enabled when xdp is
active.
In patch 8, add the XDP_TX type to bpf.h
In patch 9, add a helper in the tx path for writing the tx_desc
In patch 10, add support in mlx4 for packet data write and forwarding
In patch 11, turn on packet write support in the bpf verifier
In patch 12, add a sample program for packet write and forwarding (also
sketched below). With single core, achieved ~10 Mpps rewrite and forwarding.
[1] https://github.com/iovisor/bpf-docs/blob/master/Express_Data_Path.pdf
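To make the patch 6 sample concrete, here is a minimal sketch of a
drop-and-count program of the kind described above (illustrative only:
the map layout, section names and the bpf_helpers.h header are
assumptions, not the exact samples/bpf code). The program only sees the
data/data_end context introduced in patch 2.
    #include <linux/bpf.h>
    #include <linux/if_ether.h>
    #include "bpf_helpers.h"        /* assumed: SEC() and helper declarations */

    /* single per-CPU counter slot, bumped for every dropped packet */
    struct bpf_map_def SEC("maps") drop_cnt = {
        .type        = BPF_MAP_TYPE_PERCPU_ARRAY,
        .key_size    = sizeof(__u32),
        .value_size  = sizeof(long),
        .max_entries = 1,
    };

    SEC("xdp")
    int xdp_drop_count(struct xdp_md *ctx)
    {
        void *data_end = (void *)(long)ctx->data_end;
        void *data = (void *)(long)ctx->data;
        __u32 key = 0;
        long *value;

        /* touch packet data; the verifier insists on this bounds check */
        if (data + sizeof(struct ethhdr) > data_end)
            return XDP_DROP;

        value = bpf_map_lookup_elem(&drop_cnt, &key);
        if (value)
            *value += 1;

        return XDP_DROP;
    }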
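Similarly, a rough sketch of the patch 12 rewrite-and-forward case (again
an assumption about shape, not the exact sample): swap the Ethernet
addresses in place and return XDP_TX so the frame goes back out the same
port.
    #include <linux/bpf.h>
    #include <linux/if_ether.h>
    #include "bpf_helpers.h"        /* assumed helper header */

    SEC("xdp")
    int xdp_tx_swap_mac(struct xdp_md *ctx)
    {
        void *data_end = (void *)(long)ctx->data_end;
        void *data = (void *)(long)ctx->data;
        struct ethhdr *eth = data;
        unsigned char tmp[ETH_ALEN];

        if (data + sizeof(*eth) > data_end)
            return XDP_DROP;

        /* packet write: swap src/dst MAC, then bounce out the same port */
        __builtin_memcpy(tmp, eth->h_dest, ETH_ALEN);
        __builtin_memcpy(eth->h_dest, eth->h_source, ETH_ALEN);
        __builtin_memcpy(eth->h_source, tmp, ETH_ALEN);

        return XDP_TX;
    }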
v10:
1/12: Add bulk refcnt api.
5/12: Move prog from priv to ring. This attribute is still only set
globally, but the path to finer granularity should be clear. No lock
is taken, so some rings may operate on older programs for a time (one
napi loop). Looked into options such as napi_synchronize, but they
were deemed too slow (calls to msleep).
Rename prog to xdp_prog. Add xdp_ring_num to help with accounting,
used more heavily in later patches.
7/12: Adjust to use per-ring xdp prog. Use priv->xdp_ring_num where
before priv->prog was used to determine buffer allocations.
9/12: Add cpu_to_be16 to vlan_tag in mlx4_en_xmit(). Remove an unused variable
from mlx4_en_xmit and unused params from build_inline_wqe.
v9:
4/11: Add missing newline in en_err message.
6/11: Move page_cache cleanup from mlx4_en_destroy_rx_ring to
mlx4_en_deactivate_rx_ring. Move mlx4_en_moderation_update back to
static. Remove calls to mlx4_en_alloc/free_resources in mlx4_xdp_set.
Adopt instead the approach of mlx4_en_change_mtu to use a watchdog.
9/11: Use a per-ring function pointer in tx to separate out the code
for regular and recycle paths of tx completion handling. Add a helper
function to init the recycle ring and callback, called just after
activating tx. Remove extra tx ring resource requirement, and instead
steal from the upper rings. This helps to avoid needing
mlx4_en_alloc_resources. Add some hopefully meaningful error
messages for the various error cases. Reverted some of the
hard-to-follow logic that was accounting for the extra tx rings.
v8:
1/11: Reduce WARN_ONCE to single line. Also, change act param of that
function to u32 to match return type of bpf_prog_run_xdp.
2/11: Clarify locking semantics in ndo comment.
4/11: Add en_err warning in mlx4_xdp_set on num_frags/mtu violation.
v7:
Addressing two of the major discussion points: return codes and ndo.
The rest will be taken as todo items for separate patches.
Add an XDP_ABORTED type, which explicitly falls through to DROP. The
same result must be taken for the default case as well, as it is now
well-defined API behavior.
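In driver terms, any out-of-range return value can simply share the
XDP_ABORTED/XDP_DROP handling. A sketch of the rx-side dispatch (the
function and enum names here are made up, not the mlx4 code):
    #include <linux/filter.h>

    enum rx_verdict { RX_PASS, RX_XMIT, RX_RECYCLE };

    static enum rx_verdict rx_handle_xdp_act(u32 act)
    {
        switch (act) {
        case XDP_PASS:
            return RX_PASS;      /* continue to the normal skb path */
        case XDP_TX:
            return RX_XMIT;      /* queue on the reserved xdp tx ring */
        default:
            bpf_warn_invalid_xdp_action(act);
            /* fall through: unknown actions behave like XDP_ABORTED */
        case XDP_ABORTED:
        case XDP_DROP:
            return RX_RECYCLE;   /* drop and recycle the page */
        }
    }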
Merge ndo_xdp_* into a single ndo. The style is similar to
ndo_setup_tc, but with a less unidirectional naming convention. The IFLA
parameter names are unchanged.
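With the type/union convention, a driver's ndo_xdp then just dispatches
on the command. Roughly like the following (the mydrv_* helpers are
hypothetical; only the dispatch shape is the point):
    #include <linux/errno.h>
    #include <linux/netdevice.h>

    /* hypothetical driver hooks; only the dispatch shape matters here */
    static int mydrv_xdp_set(struct net_device *dev, struct bpf_prog *prog);
    static bool mydrv_xdp_attached(struct net_device *dev);

    static int mydrv_ndo_xdp(struct net_device *dev, struct netdev_xdp *xdp)
    {
        switch (xdp->command) {
        case XDP_SETUP_PROG:
            return mydrv_xdp_set(dev, xdp->prog);
        case XDP_QUERY_PROG:
            xdp->prog_attached = mydrv_xdp_attached(dev);
            return 0;
        default:
            return -EINVAL;
        }
    }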
TODOs:
Add ethtool per-ring stats for aborted, default cases, maybe even drop
and tx as well.
Avoid duplicate dma sync operation in XDP_PASS case as mentioned by
Saeed.
1/12: Add XDP_ABORTED enum, reword API comment, and update commit
message.
2/12: Rewrite ndo_xdp_*() into single ndo_xdp() with type/union style
calling convention.
3/12: Switch to ndo_xdp callback.
4/12: Add XDP_ABORTED case as a fall-through to XDP_DROP. Implement
ndo_xdp.
12/12: Dropped, this will need some more work.
v6:
2/12: drop unnecessary netif_device_present check
4/12, 6/12, 9/12: Reorder default case statement above drop case to
remove some copy/paste.
v5:
0/12: Rebase and remove previous 1/13 patch
1/12: Fix nits from Daniel. Left the (void *) cast as-is, to be fixed
in future. Add bpf_warn_invalid_xdp_action() helper, to be used when
out of bounds action is returned by the program. Add a comment to
bpf.h denoting the undefined nature of out of bounds returns.
2/12: Switch to using bpf_prog_get_type(). Rename ndo_xdp_get() to
ndo_xdp_attached().
3/12: Add IFLA_XDP as a nested type, and add the associated nla_policy
for the new subtypes IFLA_XDP_FD and IFLA_XDP_ATTACHED (see the policy
sketch after this list).
4/12: Fixup the use of READ_ONCE in the ndos. Add a user of
bpf_warn_invalid_xdp_action helper.
5/12: Adjust to using the nested netlink options.
6/12: kbuild was complaining about overflow of u16 on the tile
architecture, so bump frag_stride to u32. The page_offset member that
is computed from this was already u32.
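For reference, the policy for the new nested attributes ends up looking
roughly like this (the attribute types are my assumption of the obvious
choice, not quoted from the patch):
    #include <net/netlink.h>
    #include <linux/if_link.h>

    static const struct nla_policy ifla_xdp_policy[IFLA_XDP_MAX + 1] = {
        [IFLA_XDP_FD]       = { .type = NLA_S32 },
        [IFLA_XDP_ATTACHED] = { .type = NLA_U8 },
    };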
v4:
2/12: Add inline helper for calling xdp bpf prog under rcu
3/12: Add detail to ndo comments
5/12: Remove mlx4_call_xdp and use inline helper instead.
6/12: Fix checkpatch complaints
9/12: Introduce new patch 9/12 with common helper for tx_desc write
Refactor to use common tx_desc write helper
11/12: Fix checkpatch complaints
v3:
Rewrite from v2 trying to incorporate feedback from multiple sources.
Specifically, add ability to forward packets out the same port and
allow packet modification.
For packet forwarding, the driver reserves a dedicated set of tx rings
for exclusive use by xdp. Upon completion, the pages on this ring are
recycled directly back to a small per-rx-ring page cache without
being dma unmapped.
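The recycle cache can be pictured as a small per-rx-ring stack of
still-mapped pages; a sketch of the idea (structure and names are
illustrative, not the mlx4 implementation):
    #include <linux/mm_types.h>

    #define XDP_PAGE_CACHE_SIZE 256          /* assumed per-ring cache size */

    /* pages stay DMA-mapped for as long as they sit in the cache */
    struct xdp_page_cache {
        int index;
        struct page *buf[XDP_PAGE_CACHE_SIZE];
    };

    /* tx completion: try to hand the page back to its rx ring */
    static bool xdp_page_cache_put(struct xdp_page_cache *cache, struct page *page)
    {
        if (cache->index >= XDP_PAGE_CACHE_SIZE)
            return false;                    /* full: caller unmaps and frees */
        cache->buf[cache->index++] = page;
        return true;
    }

    /* rx refill: reuse a recycled page when one is available */
    static struct page *xdp_page_cache_get(struct xdp_page_cache *cache)
    {
        if (!cache->index)
            return NULL;                     /* empty: caller allocates fresh */
        return cache->buf[--cache->index];
    }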
Use of the percpu skb is dropped in favor of a lightweight struct
xdp_buff. The direct packet access feature is leveraged to remove
dependence on the skb.
The mlx4 driver implementation allocates a page-per-packet and maps it
in PCI_DMA_BIDIRECTIONAL mode when the bpf program is activated.
Naming is converted to use "xdp" instead of "phys_dev".
v2:
1/5: Drop xdp from types, instead consistently use bpf_phys_dev_
Introduce enum for return values from phys_dev hook
2/5: Move prog->type check to just before invoking ndo
Change ndo to take a bpf_prog * instead of fd
Add ndo_bpf_get rather than keeping a bool in the netdev struct
3/5: Use ndo_bpf_get to fetch bool
4/5: Enforce that only 1 frag is ever given to bpf prog by disallowing
mtu to increase beyond FRAG_SZ0 when bpf prog is running, or conversely
to set a bpf prog when priv->num_frags > 1
Rename pseudo_skb to bpf_phys_dev_md
Implement ndo_bpf_get
Add dma sync just before invoking prog
Check for explicit bpf return code rather than nonzero
Remove increment of rx_dropped
5/5: Use explicit bpf return code in example
Update commit log with higher pps numbers
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Diffstat (limited to 'include/linux')
| -rw-r--r-- | include/linux/bpf.h | 1 |
| -rw-r--r-- | include/linux/filter.h | 18 |
| -rw-r--r-- | include/linux/mlx4/qp.h | 18 |
| -rw-r--r-- | include/linux/netdevice.h | 34 |
4 files changed, 63 insertions, 8 deletions
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index c13e92b00bf5..75a5ae6bee07 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -224,6 +224,7 @@ void bpf_register_map_type(struct bpf_map_type_list *tl);
 struct bpf_prog *bpf_prog_get(u32 ufd);
 struct bpf_prog *bpf_prog_get_type(u32 ufd, enum bpf_prog_type type);
+struct bpf_prog *bpf_prog_add(struct bpf_prog *prog, int i);
 struct bpf_prog *bpf_prog_inc(struct bpf_prog *prog);
 void bpf_prog_put(struct bpf_prog *prog);
diff --git a/include/linux/filter.h b/include/linux/filter.h
index 6fc31ef1da2d..15d816a8b755 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -368,6 +368,11 @@ struct bpf_skb_data_end {
 	void *data_end;
 };
 
+struct xdp_buff {
+	void *data;
+	void *data_end;
+};
+
 /* compute the linear packet data range [data, data_end) which
  * will be accessed by cls_bpf and act_bpf programs
  */
@@ -429,6 +434,18 @@ static inline u32 bpf_prog_run_clear_cb(const struct bpf_prog *prog,
 	return BPF_PROG_RUN(prog, skb);
 }
 
+static inline u32 bpf_prog_run_xdp(const struct bpf_prog *prog,
+				   struct xdp_buff *xdp)
+{
+	u32 ret;
+
+	rcu_read_lock();
+	ret = BPF_PROG_RUN(prog, (void *)xdp);
+	rcu_read_unlock();
+
+	return ret;
+}
+
 static inline unsigned int bpf_prog_size(unsigned int proglen)
 {
 	return max(sizeof(struct bpf_prog),
@@ -509,6 +526,7 @@ bool bpf_helper_changes_skb_data(void *func);
 
 struct bpf_prog *bpf_patch_insn_single(struct bpf_prog *prog, u32 off,
 				       const struct bpf_insn *patch, u32 len);
+void bpf_warn_invalid_xdp_action(u32 act);
 
 #ifdef CONFIG_BPF_JIT
 extern int bpf_jit_enable;
diff --git a/include/linux/mlx4/qp.h b/include/linux/mlx4/qp.h
index 587cdf943b52..deaa2217214d 100644
--- a/include/linux/mlx4/qp.h
+++ b/include/linux/mlx4/qp.h
@@ -291,16 +291,18 @@ enum {
 	MLX4_WQE_CTRL_FORCE_LOOPBACK	= 1 << 0,
 };
 
+union mlx4_wqe_qpn_vlan {
+	struct {
+		__be16	vlan_tag;
+		u8	ins_vlan;
+		u8	fence_size;
+	};
+	__be32	bf_qpn;
+};
+
 struct mlx4_wqe_ctrl_seg {
 	__be32			owner_opcode;
-	union {
-		struct {
-			__be16	vlan_tag;
-			u8	ins_vlan;
-			u8	fence_size;
-		};
-		__be32	bf_qpn;
-	};
+	union mlx4_wqe_qpn_vlan	qpn_vlan;
 	/*
 	 * High 24 bits are SRC remote buffer; low 8 bits are flags:
 	 * [7] SO (strong ordering)
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 49736a31acaa..fab9a1c2a2ac 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -63,6 +63,7 @@ struct wpan_dev;
 struct mpls_dev;
 /* UDP Tunnel offloads */
 struct udp_tunnel_info;
+struct bpf_prog;
 
 void netdev_set_default_ethtool_ops(struct net_device *dev,
 				    const struct ethtool_ops *ops);
@@ -799,6 +800,33 @@ struct tc_to_netdev {
 	};
 };
 
+/* These structures hold the attributes of xdp state that are being passed
+ * to the netdevice through the xdp op.
+ */
+enum xdp_netdev_command {
+	/* Set or clear a bpf program used in the earliest stages of packet
+	 * rx. The prog will have been loaded as BPF_PROG_TYPE_XDP. The callee
+	 * is responsible for calling bpf_prog_put on any old progs that are
+	 * stored. In case of error, the callee need not release the new prog
+	 * reference, but on success it takes ownership and must bpf_prog_put
+	 * when it is no longer used.
+	 */
+	XDP_SETUP_PROG,
+	/* Check if a bpf program is set on the device. The callee should
+	 * return true if a program is currently attached and running.
+	 */
+	XDP_QUERY_PROG,
+};
+
+struct netdev_xdp {
+	enum xdp_netdev_command command;
+	union {
+		/* XDP_SETUP_PROG */
+		struct bpf_prog *prog;
+		/* XDP_QUERY_PROG */
+		bool prog_attached;
+	};
+};
+
 /*
  * This structure defines the management hooks for network devices.
@@ -1087,6 +1115,9 @@ struct tc_to_netdev {
  *	appropriate rx headroom value allows avoiding skb head copy on
  *	forward. Setting a negative value resets the rx headroom to the
  *	default value.
+ * int (*ndo_xdp)(struct net_device *dev, struct netdev_xdp *xdp);
+ *	This function is used to set or query state related to XDP on the
+ *	netdevice. See definition of enum xdp_netdev_command for details.
  *
  */
 struct net_device_ops {
@@ -1271,6 +1302,8 @@ struct net_device_ops {
 						       struct sk_buff *skb);
 	void			(*ndo_set_rx_headroom)(struct net_device *dev,
 						       int needed_headroom);
+	int			(*ndo_xdp)(struct net_device *dev,
+					   struct netdev_xdp *xdp);
 };
 
 /**
@@ -3257,6 +3290,7 @@ int dev_get_phys_port_id(struct net_device *dev,
 int dev_get_phys_port_name(struct net_device *dev,
 			   char *name, size_t len);
 int dev_change_proto_down(struct net_device *dev, bool proto_down);
+int dev_change_xdp_fd(struct net_device *dev, int fd);
 struct sk_buff *validate_xmit_skb_list(struct sk_buff *skb, struct net_device *dev);
 struct sk_buff *dev_hard_start_xmit(struct sk_buff *skb, struct net_device *dev,
 				    struct netdev_queue *txq, int *ret);
