summaryrefslogtreecommitdiff
AgeCommit message (Collapse)AuthorFilesLines
2016-07-02net/mlx5e: Introduce SRIOV VF representorsHadar Hen Zion6-19/+574
Implement the relevant profile functions to create mlx5e driver instance serving as VF representor. When SRIOV offloads mode is enabled, each VF will have a representor netdevice instance on the host. To do that, we also export set of shared service functions from en_main.c, such that they can be used by both NIC and repsresentors netdevs. The newly created representor netdevice has a basic set of net_device_ops which are the same ndo functions as the NIC netdevice and an ndo of it's own for phys port name. The profiling infrastructure allow sharing code between the NIC and the vport representor even though the representor has only a subset of the NIC functionality. The VF reps and the PF which is used in that mode to represent the uplink, expose switchdev ops. Currently the only op supposed is attr get for the port parent ID which here serves to identify net-devices belonging to the same HW E-Switch. Other than that, no offloading is implemented and hence switching functionality is achieved if one sets SW switching rules, e.g using tc, bridge or ovs. Port phys name (ndo_get_phys_port_name) is implemented to allow exporting to user-space the VF vport number and along with the switchdev port parent id (phys_switch_id) enable a udev base consistent naming scheme: SUBSYSTEM=="net", ACTION=="add", ATTR{phys_switch_id}=="<phys_switch_id>", \ ATTR{phys_port_name}!="", NAME="$PF_NIC$attr{phys_port_name}" where phys_switch_id is exposed by the PF (and VF reps) and $PF_NIC is the name of the PF netdevice. Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-02net/mlx5: Add Representors registration APIHadar Hen Zion5-7/+97
Introduce E-Switch registration/unregister representors functions. Those functions are called by the mlx5e driver when the PF NIC is created upon pci probe action regardless of the E-Switch mode (NONE, LEGACY or OFFLOADS). Adding basic E-Switch database that will hold the vport represntors upon creation. This patch doesn't add any new functionality. Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-02net/mlx5e: Add support for multiple profilesHadar Hen Zion2-118/+240
To allow support in representor netdevices where we create more than one netdevice per NIC, add profiles to the mlx5e driver. The profiling allows for creation of mlx5e instances with different characteristics. Each profile implements its own behavior using set of function pointers defined in struct mlx5e_profile. This is done to allow for avoiding complex per profix branching in the code. Currently only the profile for the conventional NIC is implemented, which is of use when a netdev is created upon pci probe. This patch doesn't add any new functionality. Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-02net/mlx5e: Mark enabled RQTs instances explicitlyHadar Hen Zion3-23/+37
In the current driver implementation two types of receive queue tables (RQTs) are in use - direct and indirect. Change the driver to mark each new created RQT (direct or indirect) as "enabled". This behaviour is needed for introducing new mlx5e instances which serve to represent SRIOV VFs. The VF representors will have only one type of RQTs (direct). An "enabled" flag is added to each RQT to allow better handling and code sharing between the representors and the nic netdevices. This patch doesn't add any new functionality. Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com> Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-02net/mlx5e: TIRs management refactoringHadar Hen Zion6-57/+77
The current refresh tirs self loopback mechanism, refreshes all the tirs belonging to the same mlx5e instance to prevent self loopback by packets sent over any ring of that instance. This mechanism relies on all the tirs/tises of an instance to be created with the same transport domain number (tdn). Change the driver to refresh all the tirs created under the same tdn regardless of which mlx5e netdev instance they belong to. This behaviour is needed for introducing new mlx5e instances which serve to represent SRIOV VFs. The representors and the PF share vport used for E-Switch management, and we want to avoid NIC level HW loopback between them, e.g when sending broadcast packets. To achieve that, both the representors and the PF NIC will share the tdn. This patch doesn't add any new functionality. Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com> Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-02net/mlx5e: Create NIC global resources only onceHadar Hen Zion5-90/+171
To allow creating more than one netdev over the same PCI function, we change the driver such that global NIC resources are created once and later be shared amongst all the mlx5e netdevs running over that port. Move the CQ UAR, PD (pdn), Transport Domain (tdn), MKey resources from being kept in the mlx5e priv part to a new resources structure (mlx5e_resources) placed under the mlx5_core device. This patch doesn't add any new functionality. Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com> Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-02net/mlx5e: Add devlink based SRIOV mode changesOr Gerlitz2-9/+124
Implement handlers for the devlink commands to get and set the SRIOV E-Switch mode. When turning to the switchdev/offloads mode, we disable the e-switch and enable it again in the new mode, create the NIC offloads table and create VF reps. When turning to legacy mode, we remove the VF reps and the offloads table, and re-initiate the e-switch in it's legacy mode. The actual creation/removal of the VF reps is done in downstream patches. Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-02net/mlx5: Add devlink interfaceOr Gerlitz4-4/+37
The devlink interface is initially used to set/get the mode of the SRIOV e-switch. Currently, these are only stubs for get/set, down-stream patch will actually fill them out. Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-02net/devlink: Add E-Switch mode controlOr Gerlitz3-0/+98
Add the commands to set and show the mode of SRIOV E-Switch, two modes are supported: * legacy: operating in the "old" L2 based mode (DMAC --> VF vport) * switchdev: the E-Switch is referred to as whitebox switch configured using standard tools such as tc, bridge, openvswitch etc. To allow working with the tools, for each VF, a VF representor netdevice is created by the E-Switch manager vendor device driver instance (e.g PF). Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-02net/mlx5: E-Switch, Add API to create vport rx rulesOr Gerlitz2-0/+89
Add the API to create vport rx rules of the form packet meta-data :: vport == $VPORT --> $TIR where the TIR is opened by this VF representor. This logic will by used for packets that didn't match any rule in the e-switch datapath and should be received into the host OS through the netdevice that represents the VF they were sent from. Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-02net/mlx5: E-Switch, Add offloads tableOr Gerlitz2-0/+36
Belongs to the NIC offloads name-space, and to be used as part of the SRIOV offloads logic to steer packets that hit the e-switch miss rule to the TIR of the relevant VF representor. Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-02net/mlx5: Introduce offloads steering namespaceOr Gerlitz2-1/+11
Add a new namespace (MLX5_FLOW_NAMESPACE_OFFLOADS) to be populated with flow steering rules that deal with rules that have have to be executed before the EN NIC steering rules are matched. The namespace is located after the bypass name-space and before the kernel name-space. Therefore, it precedes the HW processing done for rules set for the kernel NIC name-space. Under SRIOV, it would allow us to match on e-switch missed packet and forward them to the relevant VF representor TIR. Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: Amir Vadai <amir@vadai.me> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-02net/mlx5: E-Switch, Add API to create send-to-vport rulesOr Gerlitz2-1/+41
Add the API to create send-to-vport e-switch rules of the form packet meta-data :: send-queue-number == $SQN and source-vport == 0 --> $VPORT These rules are to be used for a send-to-vport logic which conceptually bypasses the "normal" steering rules currently present at the e-switch datapath. Such rule should apply only for packets that originate in the e-switch manager vport (0) and are sent for a given SQN which is used by a given VF representor device, and hence the matching logic. Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-02net/mlx5: E-Switch, Add miss rule for offloads modeOr Gerlitz2-0/+41
In the sriov offloads mode, packets that are not matched by any other rule should be sent towards the e-switch manager for further processing. Add such "miss" rule which matches ANY packet as the last rule in the e-switch FDB and programs the HW to send the packet to vport 0 where the e-switch manager runs. Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-02net/mlx5: E-Switch, Add support for the sriov offloads modeOr Gerlitz4-20/+168
Unlike the legacy mode, here, forwarding rules are not learned by the driver per events on macs set by VFs/VMs into their vports, but rather should be programmed by higher-level SW entities. Saying that, still, in the offloads mode (SRIOV_OFFLOADS), two flow groups are created by the driver for management (slow path) purposes: The first group will be used for sending packets over e-switch vports from the host OS where the e-switch management code runs, to be received by VFs. The second group will be used by a miss rule which forwards packets toward the e-switch manager. Further logic will trap these packets such that the receiving net-device as seen by the networking stack is the representor of the vport that sent the packet over the e-switch data-path. Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-02net/mlx5: E-Switch, Add operational mode to the SRIOV e-SwitchOr Gerlitz3-29/+46
Define three modes for the SRIOV e-switch operation, none (SRIOV_NONE, none of the VF vports are enabled), legacy (SRIOV_LEGACY, the current mode) and sriov offloads (SRIOV_OFFLOADS). Currently, when in SRIOV, only the legacy mode is supported, where steering rules are of the form: destination mac --> VF vport This patch does not change any functionality. Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-02Merge tag 'batadv-next-for-davem-20160701' of ↵David S. Miller26-192/+704
git://git.open-mesh.org/linux-merge Simon Wunderlich says: ==================== This feature patchset includes the following changes: - two patches with minimal clean up work by Antonio Quartulli and Simon Wunderlich - eight patches of B.A.T.M.A.N. V, API and documentation clean up work, by Antonio Quartulli and Marek Lindner - Andrew Lunn fixed the skb priority adoption when forwarding fragmented packets (two patches) - Multicast optimization support is now enabled for bridges which comes with some protocol updates, by Linus Luessing ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-01Merge branch 'hns-next'David S. Miller10-69/+158
Yisen Zhuang says: ==================== net: hns: fix the typo of hns This series includes typo fixes which review by Andy, adding the hns maintainer to MAINTAINERS, as below: > from Daode: adds the maintainer for hns driver; > from Daode: fix the typo of hns reviewed by Andy Shevchenko; > from Kejian: one remove redundant function and two fix to get configuration from DT. changlog: v2 -> v3: match all files in and below drivers/net/ethernet/hisilicon/ v1 -> v2: fix the indentations reviewed by David. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-01net: hns: get reset registers from DTKejian Yan1-14/+66
Since the registers of subctrl may be different, it is better to mv the registers from hns mdio driver routine to device tree node. Signed-off-by: Kejian Yan <yankejian@huawei.com> Signed-off-by: Yisen Zhuang <Yisen.Zhuang@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-01net: hns: add media-type property for hnsKejian Yan5-3/+48
It is PORT_TP type if the service port is GE mode. It is wrong to judge the port type by using if it is service port. Adding the media type to know port type. Reported-by: Jinchuan Tian <tianjinchuan1@huawei.com> Signed-off-by: Kejian Yan <yankejian@huawei.com> Signed-off-by: Yisen Zhuang <Yisen.Zhuang@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-01net: hns: remove redundant hns_mac_dev_to_enet_if()Kejian Yan1-15/+0
The sequence of hns_mac_dev_to_enet_if() is the same as hns_get_enet_interface(), and hns_get_enet_interface() is called by initialization to get the mac mode. And the mode is not changed anywhere. Thus add hns_mac_dev_to_enet_if() function to get the mac mode is obviously redundant. Reported-by: Jinchuan Tian <tianjinchuan1@huawei.com> Signed-off-by: Kejian Yan <yankejian@huawei.com> Signed-off-by: Yisen Zhuang <Yisen.Zhuang@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-01net: hns: normalize two different loopDaode Huang1-9/+9
There are two approaches to assign data, one does 2 loops, another does 1 loop. This patch normalize the different methods to 1 loop. Signed-off-by: Daode Huang <huangdaode@hisilicon.com> Signed-off-by: Yisen Zhuang <Yisen.Zhuang@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-01net: hns: add a space before "*/"Daode Huang1-2/+2
In comment line, some time miss a space before */, so this patch adds a space before */. Signed-off-by: Daode Huang <huangdaode@hisilicon.com> Signed-off-by: Yisen Zhuang <Yisen.Zhuang@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-01net: hns: delete redundant parentheseDaode Huang1-2/+2
According to the previous review comments from Andy, this patch deletes the redundant parens in the patch. Signed-off-by: Daode Huang <huangdaode@hisilicon.com> Signed-off-by: Yisen Zhuang <Yisen.Zhuang@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-01net: hns: change code style from a = a + x to a += xDaode Huang1-16/+16
This patch fixes the code style in hns driver. Change it from "buff = buff + xxx" to "buff += xxx". The reveiw comments is from andy. Reviewed-by: Andriy Shevchenko <andriy.shevchenko@linux.intel.com> Signed-off-by: Daode Huang <huangdaode@hisilicon.com> Signed-off-by: Yisen Zhuang <Yisen.Zhuang@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-01net: hns: fix code style about hns driverDaode Huang1-9/+7
This patch fixes code sytle of hns driver to make it simple. Signed-off-by: Daode Huang <huangdaode@hisilicon.com> Signed-off-by: Yisen Zhuang <Yisen.Zhuang@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-01MAINTAINERS: add maintainers for hns driverDaode Huang1-0/+9
This patch adds maintainers for hisilicon network subsystem driver Signed-off-by: Daode Huang <huangdaode@hisilicon.com> Signed-off-by: Yisen Zhuang <Yisen.Zhuang@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-01Merge branch 'rds-multipath-datastructures'David S. Miller17-174/+211
Sowmini Varadhan says: ==================== RDS:TCP data structure changes for multipath support The second installment of changes to enable multipath support in RDS-TCP. This series implements the changes in rds-tcp so that the rds_conn_path has a pointer to the rds_tcp_connection in cp_transport_data. Struct rds_tcp_connection keeps track of the inet_sk per path in t_sock. The ->sk_user_data in turn is a pointer to the rds_conn_path. With this set of changes, rds_tcp has the needed plumbing to handle multiple paths(socket) per rds_connection. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-01RDS: Do not send a pong to an incoming ping with 0 src portSowmini Varadhan1-0/+4
RDS ping messages are sent with a non-zero src port to a zero dst port, so that the rds pong messages can be sent back to the originators src port. However if a confused/malicious sender sends a ping with a 0 src port, we'd have an infinite ping-pong loop. To avoid this, the receiver should ignore ping messages with a 0 src port. Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-01RDS: TCP: Simplify reconnect to avoid duelling reconnnect attemptsSowmini Varadhan2-3/+6
When reconnecting, the peer with the smaller IP address will initiate the reconnect, to avoid needless duelling SYN issues. Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-01RDS: TCP: Hooks to set up a single connection pathSowmini Varadhan9-17/+20
This patch adds ->conn_path_connect callbacks in the rds_transport that are used to set up a single connection path. Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-01RDS: TCP: make receive path use the rds_conn_pathSowmini Varadhan9-22/+26
The ->sk_user_data contains a pointer to the rds_conn_path for the socket. Use this consistently in the rds_tcp_data_ready callbacks to get the rds_conn_path for rds_recv_incoming. Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-01RDS: TCP: make ->sk_user_data point to a rds_conn_pathSowmini Varadhan6-40/+41
The socket callbacks should all operate on a struct rds_conn_path, in preparation for a MP capable RDS-TCP. Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-01RDS: TCP: Refactor connection destruction to handle multiple pathsSowmini Varadhan1-7/+39
A single rds_connection may have multiple rds_conn_paths that have to be carefully and correctly destroyed, for both rmmod and netns-delete cases. For both cases, we extract a single rds_tcp_connection for each conn into a temporary list, and then invoke rds_conn_destroy() which iteratively dismantles every path in the rds_connection. For the netns deletion case, we additionally have to make sure that we do not leave a socket in TIME_WAIT state, as this will hold up the netns deletion. Thus we call rds_tcp_conn_paths_destroy() to reset state quickly. Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-01RDS: TCP: Make rds_tcp_connection track the rds_conn_pathSowmini Varadhan5-42/+48
The struct rds_tcp_connection is the transport-specific private data structure that tracks TCP information per rds_conn_path. Modify this structure to have a back-pointer to the rds_conn_path for which it is the ->cp_transport_data. Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-01RDS: TCP: Remove dead logic around c_passive in rds-tcpSowmini Varadhan1-6/+1
The c_passive bit is only intended for the IB transport and will never be encountered in rds-tcp, so remove the dead logic that predicates on this bit. Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-01RDS: Rework path specific indirectionsSowmini Varadhan12-40/+29
Refactor code to avoid separate indirections for single-path and multipath transports. All transports (both single and mp-capable) will get a pointer to the rds_conn_path, and can trivially derive the rds_connection from the ->cp_conn. Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-01Merge branch 'bpf-cgroup2'David S. Miller12-1/+506
Martin KaFai Lau says: ==================== cgroup: bpf: cgroup2 membership test on skb This series is to implement a bpf-way to check the cgroup2 membership of a skb (sk_buff). It is similar to the feature added in netfilter: c38c4597e4bf ("netfilter: implement xt_cgroup cgroup2 path match") The current target is the tc-like usage. v3: - Remove WARN_ON_ONCE(!rcu_read_lock_held()) - Stop BPF_MAP_TYPE_CGROUP_ARRAY usage in patch 2/4 - Avoid mounting bpf fs manually in patch 4/4 - Thanks for Daniel's review and the above suggestions - Check CONFIG_SOCK_CGROUP_DATA instead of CONFIG_CGROUPS. Thanks to the kbuild bot's report. Patch 2/4 only needs CONFIG_CGROUPS while patch 3/4 needs CONFIG_SOCK_CGROUP_DATA. Since a single bpf cgrp2 array alone is not useful for now, CONFIG_SOCK_CGROUP_DATA is also used in patch 2/4. We can fine tune it later if we find other use cases for the cgrp2 array. - Return EAGAIN instead of ENOENT if the cgrp2 array entry is NULL. It is to distinguish these two cases: 1) the userland has not populated this array entry yet. or 2) not finding cgrp2 from the skb. - Be-lated thanks to Alexei and Tejun on reviewing v1 and giving advice on this work. v2: - Fix two return cases in cgroup_get_from_fd() - Fix compilation errors when CONFIG_CGROUPS is not used: - arraymap.c: avoid registering BPF_MAP_TYPE_CGROUP_ARRAY - filter.c: tc_cls_act_func_proto() returns NULL on BPF_FUNC_skb_in_cgroup - Add comments to BPF_FUNC_skb_in_cgroup and cgroup_get_from_fd() ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-01cgroup: bpf: Add an example to do cgroup checking in BPFMartin KaFai Lau5-0/+367
test_cgrp2_array_pin.c: A userland program that creates a bpf_map (BPF_MAP_TYPE_GROUP_ARRAY), pouplates/updates it with a cgroup2's backed fd and pins it to a bpf-fs's file. The pinned file can be loaded by tc and then used by the bpf prog later. This program can also update an existing pinned array and it could be useful for debugging/testing purpose. test_cgrp2_tc_kern.c: A bpf prog which should be loaded by tc. It is to demonstrate the usage of bpf_skb_in_cgroup. test_cgrp2_tc.sh: A script that glues the test_cgrp2_array_pin.c and test_cgrp2_tc_kern.c together. The idea is like: 1. Load the test_cgrp2_tc_kern.o by tc 2. Use test_cgrp2_array_pin.c to populate a BPF_MAP_TYPE_CGROUP_ARRAY with a cgroup fd 3. Do a 'ping -6 ff02::1%ve' to ensure the packet has been dropped because of a match on the cgroup Most of the lines in test_cgrp2_tc.sh is the boilerplate to setup the cgroup/bpf-fs/net-devices/netns...etc. It is not bulletproof on errors but should work well enough and give enough debug info if things did not go well. Signed-off-by: Martin KaFai Lau <kafai@fb.com> Cc: Alexei Starovoitov <ast@fb.com> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Tejun Heo <tj@kernel.org> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-01cgroup: bpf: Add bpf_skb_in_cgroup_protoMartin KaFai Lau3-1/+56
Adds a bpf helper, bpf_skb_in_cgroup, to decide if a skb->sk belongs to a descendant of a cgroup2. It is similar to the feature added in netfilter: commit c38c4597e4bf ("netfilter: implement xt_cgroup cgroup2 path match") The user is expected to populate a BPF_MAP_TYPE_CGROUP_ARRAY which will be used by the bpf_skb_in_cgroup. Modifications to the bpf verifier is to ensure BPF_MAP_TYPE_CGROUP_ARRAY and bpf_skb_in_cgroup() are always used together. Signed-off-by: Martin KaFai Lau <kafai@fb.com> Cc: Alexei Starovoitov <ast@fb.com> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Tejun Heo <tj@kernel.org> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-01cgroup: bpf: Add BPF_MAP_TYPE_CGROUP_ARRAYMartin KaFai Lau4-1/+48
Add a BPF_MAP_TYPE_CGROUP_ARRAY and its bpf_map_ops's implementations. To update an element, the caller is expected to obtain a cgroup2 backed fd by open(cgroup2_dir) and then update the array with that fd. Signed-off-by: Martin KaFai Lau <kafai@fb.com> Cc: Alexei Starovoitov <ast@fb.com> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Tejun Heo <tj@kernel.org> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-01cgroup: Add cgroup_get_from_fdMartin KaFai Lau2-0/+36
Add a helper function to get a cgroup2 from a fd. It will be stored in a bpf array (BPF_MAP_TYPE_CGROUP_ARRAY) which will be introduced in the later patch. Signed-off-by: Martin KaFai Lau <kafai@fb.com> Cc: Alexei Starovoitov <ast@fb.com> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Tejun Heo <tj@kernel.org> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-01Merge branch 'bpf-robustify'David S. Miller9-61/+34
Daniel Borkmann says: ==================== Further robustify putting BPF progs This series addresses a potential issue reported to us by Jann Horn with regards to putting progs. First patch moves progs generally under RCU destruction and second patch refactors getting of progs to simplify code a bit. For details, please see individual patches. Note, we think that addressing this one in net-next should be sufficient. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-01bpf: refactor bpf_prog_get and type check into helperDaniel Borkmann7-44/+31
Since bpf_prog_get() and program type check is used in a couple of places, refactor this into a small helper function that we can make use of. Since the non RO prog->aux part is not used in performance critical paths and a program destruction via RCU is rather very unlikley when doing the put, we shouldn't have an issue just doing the bpf_prog_get() + prog->type != type check, but actually not taking the ref at all (due to being in fdget() / fdput() section of the bpf fd) is even cleaner and makes the diff smaller as well, so just go for that. Callsites are changed to make use of the new helper where possible. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-01bpf: generally move prog destruction to RCU deferralDaniel Borkmann4-19/+5
Jann Horn reported following analysis that could potentially result in a very hard to trigger (if not impossible) UAF race, to quote his event timeline: - Set up a process with threads T1, T2 and T3 - Let T1 set up a socket filter F1 that invokes another filter F2 through a BPF map [tail call] - Let T1 trigger the socket filter via a unix domain socket write, don't wait for completion - Let T2 call PERF_EVENT_IOC_SET_BPF with F2, don't wait for completion - Now T2 should be behind bpf_prog_get(), but before bpf_prog_put() - Let T3 close the file descriptor for F2, dropping the reference count of F2 to 2 - At this point, T1 should have looked up F2 from the map, but not finished executing it - Let T3 remove F2 from the BPF map, dropping the reference count of F2 to 1 - Now T2 should call bpf_prog_put() (wrong BPF program type), dropping the reference count of F2 to 0 and scheduling bpf_prog_free_deferred() via schedule_work() - At this point, the BPF program could be freed - BPF execution is still running in a freed BPF program While at PERF_EVENT_IOC_SET_BPF time it's only guaranteed that the perf event fd we're doing the syscall on doesn't disappear from underneath us for whole syscall time, it may not be the case for the bpf fd used as an argument only after we did the put. It needs to be a valid fd pointing to a BPF program at the time of the call to make the bpf_prog_get() and while T2 gets preempted, F2 must have dropped reference to 1 on the other CPU. The fput() from the close() in T3 should also add additionally delay to the reference drop via exit_task_work() when bpf_prog_release() gets called as well as scheduling bpf_prog_free_deferred(). That said, it makes nevertheless sense to move the BPF prog destruction generally after RCU grace period to guarantee that such scenario above, but also others as recently fixed in ceb56070359b ("bpf, perf: delay release of BPF prog after grace period") with regards to tail calls won't happen. Integrating bpf_prog_free_deferred() directly into the RCU callback is not allowed since the invocation might happen from either softirq or process context, so we're not permitted to block. Reviewing all bpf_prog_put() invocations from eBPF side (note, cBPF -> eBPF progs don't use this for their destruction) with call_rcu() look good to me. Since we don't know whether at the time of attaching the program, we're already part of a tail call map, we need to use RCU variant. However, due to this, there won't be severely more stress on the RCU callback queue: situations with above bpf_prog_get() and bpf_prog_put() combo in practice normally won't lead to releases, but even if they would, enough effort/ cycles have to be put into loading a BPF program into the kernel already. Reported-by: Jann Horn <jannh@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-01atm: horizon: Use setup_timerAmitoj Kaur Chawla1-3/+1
Convert a call to init_timer and accompanying intializations of the timer's data and function fields to a call to setup_timer. The Coccinelle semantic patch that fixes this problem is as follows: @@ expression t,d,f,e1; identifier x1; statement S1; @@ ( -t.data = d; | -t.function = f; | -init_timer(&t); +setup_timer(&t,f,d); | -init_timer_on_stack(&t); +setup_timer_on_stack(&t,f,d); ) <... when != S1 t.x1 = e1; ...> Signed-off-by: Amitoj Kaur Chawla <amitoj1606@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-01Merge branch 'qed-next'David S. Miller3-58/+133
Manish Chopra says: ==================== qede: Enhancements This patch series have few small fastpath features support and code refactoring. Note - regarding get/set tunable configuration via ethtool Surprisingly, there is NO ethtool application support for such configuration given that we have kernel support. Do let us know if we need to add support for that in user ethtool. Please consider applying this series to "net-next". ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-01qede: Bump up driver version to 8.10.1.20Manish Chopra1-1/+1
Signed-off-by: Manish Chopra <manish.chopra@qlogic.com> Signed-off-by: Yuval Mintz <Yuval.Mintz@qlogic.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-01qede: Add get/set rx copy break tunable supportManish Chopra3-1/+51
Signed-off-by: Manish <manish.chopra@qlogic.com> Signed-off-by: Yuval Mintz <Yuval.Mintz@qlogic.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-01qede: Utilize xmit_moreManish Chopra1-14/+23
This patch uses xmit_more optimization to reduce number of TX doorbells write per packet. Signed-off-by: Manish <manish.chopra@qlogic.com> Signed-off-by: Yuval Mintz <Yuval.Mintz@qlogic.com> Signed-off-by: David S. Miller <davem@davemloft.net>