Age, Commit message, Author, Files, Lines
2017-01-24Merge branch 'vxlan-fdb-fixes'David S. Miller1-3/+7
Roopa Prabhu says: ==================== vxlan: misc fdb fixes ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-24vxlan: do not age static remote mac entriesBalakrishnan Raman1-1/+1
Mac aging is applicable only for dynamically learnt remote mac entries. Check for user configured static remote mac entries and skip aging. Signed-off-by: Balakrishnan Raman <ramanb@cumulusnetworks.com> Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
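The aging rule described above can be sketched as a small model. This is only an illustration, not the driver's code: the `FdbEntry` type, the `STATIC`/`DYNAMIC` labels and the `age_out()` helper are hypothetical names introduced here.

```python
from dataclasses import dataclass

# Hypothetical labels for illustration; the driver tracks this differently.
STATIC = "static"    # user-configured remote MAC entry
DYNAMIC = "dynamic"  # dynamically learnt remote MAC entry


@dataclass
class FdbEntry:
    mac: str
    kind: str         # STATIC or DYNAMIC
    last_used: float  # time of the last hit, in seconds


def age_out(fdb, now, age_interval):
    """Return the entries that survive one aging pass."""
    survivors = []
    for entry in fdb:
        if entry.kind == STATIC:
            # Static entries are skipped by the aging logic entirely.
            survivors.append(entry)
        elif now - entry.last_used < age_interval:
            # Dynamic entries survive only while recently used.
            survivors.append(entry)
    return survivors
```

With an interval of 30 seconds, a static entry last used 100 seconds ago stays in the table, while a dynamic entry of the same age is dropped.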
2017-01-24vxlan: don't flush static fdb entries on admin downRoopa Prabhu1-2/+6
This patch skips flushing static fdb entries in ndo_stop, but flushes all fdb entries during vxlan device delete. This is consistent with the bridge driver fdb Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
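A minimal sketch of the two flush behaviours, assuming a table that maps each MAC to a flag saying whether it is static. The `flush()` helper and its `do_all` parameter are hypothetical names for illustration, not the driver's API.

```python
def flush(fdb, do_all):
    """Return the fdb table after a flush.

    fdb maps a MAC string to True (static) or False (dynamic).
    do_all=False models admin down: only dynamic entries are removed.
    do_all=True models device delete: every entry is removed.
    """
    if do_all:
        return {}
    return {mac: is_static for mac, is_static in fdb.items() if is_static}
```

Calling `flush(table, do_all=False)` on admin down keeps the user-configured entries, so they are still present when the device is brought up again.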
2017-01-24net: dsa: Drop WARN() in tag_brcm.cFlorian Fainelli1-1/+2
We may be able to see invalid Broadcom tags when the hardware and drivers are misconfigured, or just while exercising the error path. Instead of flooding the console with messages, flat out drop the packet. Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-24net: ks8851: Drop eeprom_size structure memberStephen Boyd1-7/+0
After commit 51b7b1c34e19 (KSZ8851-SNL: Add ethtool support for EEPROM via eeprom_93cx6, 2011-11-21) this structure member is unused. Delete it. Signed-off-by: Stephen Boyd <stephen.boyd@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-24Merge branch 'ip6_tnl_parse_tlv_enc_lim-fixes'David S. Miller2-12/+27
Eric Dumazet says: ==================== ipv6: fix ip6_tnl_parse_tlv_enc_lim() issues First patch fixes ip6_tnl_parse_tlv_enc_lim() callers, bug added in linux-3.7 Second patch fixes ip6_tnl_parse_tlv_enc_lim() itself, bug predates linux-2.6.12 Based on a report from Dmitry Vyukov, thanks to KASAN. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-24ipv6: fix ip6_tnl_parse_tlv_enc_lim()Eric Dumazet1-12/+22
This function suffers from multiple issues. First one is that pskb_may_pull() may reallocate skb->head, so the 'raw' pointer needs either to be reloaded or not used at all. Second issue is that NEXTHDR_DEST handling does not validate that the options are present in skb->data, so we might read garbage or access non existent memory. With help from Willem de Bruijn. Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Dmitry Vyukov <dvyukov@google.com> Cc: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-24ip6_tunnel: must reload ipv6h in ip6ip6_tnl_xmit()Eric Dumazet2-0/+5
Since ip6_tnl_parse_tlv_enc_lim() can call pskb_may_pull(), we must reload any pointer that was related to skb->head (or skb->data), or risk use after free. Fixes: c12b395a4664 ("gre: Support GRE over IPv6") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Dmitry Kozlov <xeb@mail.ru> Signed-off-by: David S. Miller <davem@davemloft.net>
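The stale-pointer hazard can be modelled in a few lines. This is a toy sketch, not kernel code: the `Skb` class and its `may_pull()` method are hypothetical stand-ins for an skb and pskb_may_pull(). Python keeps the old buffer alive, so the model shows stale data rather than a real use after free.

```python
class Skb:
    """Toy packet buffer whose linear area may be reallocated."""

    def __init__(self, linear: bytes, paged: bytes):
        self.head = bytearray(linear)  # linear area, like skb->head
        self._paged = paged            # bytes not yet in the linear area

    def may_pull(self, length: int) -> bool:
        """Make `length` bytes linear; may replace self.head."""
        if length <= len(self.head):
            return True
        need = length - len(self.head)
        if need > len(self._paged):
            return False
        # Reallocation: a brand-new buffer replaces the old one.
        self.head = bytearray(self.head) + bytearray(self._paged[:need])
        self._paged = self._paged[need:]
        return True


skb = Skb(b"\x60\x00", b"\x11\x22\x33")
stale = skb.head   # cached before the call, like an old ipv6h pointer
skb.may_pull(4)
fresh = skb.head   # reloaded after the call
```

After the call, `stale` still refers to the old two-byte buffer, while `fresh` sees all four bytes. In C the old buffer may already have been freed, which is why the pointer has to be reloaded.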
2017-01-24Merge branch 'bpf-misc'David S. Miller6-52/+364
Daniel Borkmann says: ==================== Misc BPF improvements This series adds various misc improvements to BPF, f.e. allowing skb_load_bytes() helper to be used with filter/reuseport programs to facilitate programming, test cases for program tag, etc. For details, please see individual patches. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-24bpf: enable verifier to better track const alu opsDaniel Borkmann2-19/+127
William reported a couple of issues in relation to direct packet access. The typical scheme is to check for data + [off] <= data_end, where [off] can be either an immediate or come from a tracked register that contains an immediate; depending on the branch, we can then access the data. However, when [off] is calculated in a more "complex" way, either for the mentioned test itself or for the access after the test, the verifier will stop tracking the CONST_IMM marked register and will mark it as an UNKNOWN_VALUE one. When that UNKNOWN_VALUE typed register is added to a pkt() marked register, the verifier then bails out in check_packet_ptr_add() as it finds the register's imm value below 48. In the first example below, that is due to evaluate_reg_imm_alu() not handling right shifts and thus marking the register as UNKNOWN_VALUE via helper __mark_reg_unknown_value() that resets imm to 0. In the second case the same happens at the time when r4 is set to r4 &= r5, where it transitions to UNKNOWN_VALUE from evaluate_reg_imm_alu(). Later on, r4 is shifted right by 3 inside evaluate_reg_alu(), where the register's imm turns into 3. That is, for registers with type UNKNOWN_VALUE, imm of 0 means that we don't know what value the register has, and for imm > 0 it means that the value has [imm] upper zero bits. F.e. when shifting an UNKNOWN_VALUE register by 3 to the right, no matter what value it had, we know that the 3 uppermost bits must be zero now. This is to make sure that ALU operations with unknown registers don't overflow. Meaning, once we know that we have more than 48 upper zero bits, or, in other words, cannot go beyond 0xffff offset with ALU ops, such an addition will track the target register as a new pkt() register with a new id, but 0 offset and 0 range, so for that a new data/data_end test will be required.
If the source register is a CONST_IMM one that is to be added to the pkt() register, or the source instruction is an add instruction with an immediate value, then it will get added if it stays within the max 0xffff bounds. From there, the pkt() type can be accessed should reg->off + imm be within the access range of pkt().

  [...]
  from 28 to 30: R0=imm1,min_value=1,max_value=1 R1=pkt(id=0,off=0,r=22) R2=pkt_end R3=imm144,min_value=144,max_value=144 R4=imm0,min_value=0,max_value=0 R5=inv48,min_value=2054,max_value=2054 R10=fp
  30: (bf) r5 = r3
  31: (07) r5 += 23
  32: (77) r5 >>= 3
  33: (bf) r6 = r1
  34: (0f) r6 += r5
  cannot add integer value with 0 upper zero bits to ptr_to_packet

  [...]
  from 52 to 80: R0=imm1,min_value=1,max_value=1 R1=pkt(id=0,off=0,r=34) R2=pkt_end R3=inv R4=imm272 R5=inv56,min_value=17,max_value=17 R6=pkt(id=0,off=26,r=34) R10=fp
  80: (07) r4 += 71
  81: (18) r5 = 0xfffffff8
  83: (5f) r4 &= r5
  84: (77) r4 >>= 3
  85: (0f) r1 += r4
  cannot add integer value with 3 upper zero bits to ptr_to_packet

Thus, to get the above use-cases working, evaluate_reg_imm_alu() has been extended for further ALU ops. This is fine, because we only operate strictly within the realm of CONST_IMM types, so here we don't care about overflows as they will happen in the simulated but also real execution, and interaction with pkt() in check_packet_ptr_add() will check the actual imm value once added to pkt(), but it's irrelevant before. With regards to 06c1c049721a ("bpf: allow helpers access to variable memory") that works on UNKNOWN_VALUE registers, the verifier now becomes a bit smarter as it can better resolve ALU ops, so we need to adapt two test cases there, as min/max bound tracking only becomes necessary when registers were spilled to stack. So while the mask was set before to track the upper bound for the UNKNOWN_VALUE case, it's now resolved directly as CONST_IMM, and such constructs are only necessary when f.e. registers are spilled.
For commit 6b17387307ba ("bpf: recognize 64bit immediate loads as consts") that initially enabled dw load tracking only for nfp jit/analyzer, I did a couple of tests on large, complex programs and we don't increase complexity badly (my tests were in ~3% range on avg). I've added a couple of tests similar to the affected code above, and it works fine with the verifier now. Reported-by: William Tu <u9012063@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Cc: Gianluca Borello <g.borello@gmail.com> Cc: William Tu <u9012063@gmail.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
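To make the arithmetic concrete, the following sketch folds the ALU steps from the two traces over their known starting constants, and computes the bound implied by a number of upper zero bits. It is a simplified model, not the verifier's implementation: `eval_const_alu()`, `max_with_upper_zero_bits()` and the op names are hypothetical.

```python
MASK64 = (1 << 64) - 1


def eval_const_alu(value, ops):
    """Fold (op, immediate) ALU steps over a known 64-bit constant."""
    for op, imm in ops:
        if op == "add":
            value = (value + imm) & MASK64
        elif op == "and":
            value &= imm
        elif op == "rsh":
            value >>= imm
        else:
            raise ValueError(f"unsupported op: {op}")
    return value


def max_with_upper_zero_bits(zero_bits):
    """Largest 64-bit value that has `zero_bits` upper zero bits."""
    return (1 << (64 - zero_bits)) - 1


# First trace: r5 = 144; r5 += 23; r5 >>= 3
first = eval_const_alu(144, [("add", 23), ("rsh", 3)])
# Second trace: r4 = 272; r4 += 71; r4 &= 0xfffffff8; r4 >>= 3
second = eval_const_alu(272, [("add", 71), ("and", 0xfffffff8), ("rsh", 3)])
```

Under this model the two offsets resolve to 20 and 42, both far below 0xffff, and `max_with_upper_zero_bits(48)` equals 0xffff, which matches the 48-bit threshold mentioned in the commit message.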
2017-01-24bpf: add prog tag test case to bpf selftestsDaniel Borkmann2-2/+204
Add the test case used to compare the results from fdinfo with af_alg's output on the tag. Tests are from min to max sized programs, with and without maps included. # ./test_tag test_tag: OK (40945 tests) Tested on x86_64 and s390x. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-24bpf: allow option for setting bpf_l4_csum_replace from scratchDaniel Borkmann2-3/+5
When programs need to calculate the csum from scratch for small UDP packets and use bpf_l4_csum_replace() to feed the result from helpers like bpf_csum_diff(), then we need a flag besides BPF_F_MARK_MANGLED_0 that would ignore the case of current csum being 0, and which would still allow for the helper to set the csum and transform when needed to CSUM_MANGLED_0. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-24bpf: enable load bytes helper for filter/reuseport progsDaniel Borkmann1-15/+26
BPF_PROG_TYPE_SOCKET_FILTER are used in various facilities such as for SO_REUSEPORT and packet fanout demuxing, packet filtering, kcm, etc, and yet the only facility they can use is BPF_LD with {BPF_ABS, BPF_IND} for single byte/half/word access. Direct packet access is only restricted to tc programs right now, but we can still facilitate usage by allowing skb_load_bytes() helper added back then in 05c74e5e53f6 ("bpf: add bpf_skb_load_bytes helper") that calls skb_header_pointer() similarly to bpf_load_pointer(), but for stack buffers with larger access size. Name the previous sk_filter_func_proto() as bpf_base_func_proto() since this is used everywhere else as well, similarly for the ctx converter, that is, bpf_convert_ctx_access(). Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-24bpf: simplify __is_valid_access test on cbDaniel Borkmann1-13/+2
The __is_valid_access() test for cb[] from 62c7989b24db ("bpf: allow b/h/w/dw access for bpf's cb in ctx") was done unnecessarily complex, we can just simplify it the same way as recent fix from 2d071c643f1c ("bpf, trace: make ctx access checks more robust") did. Overflow can never happen as size is 1/2/4/8 depending on access. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-24virtio_net: fix PAGE_SIZE > 64kMichael S. Tsirkin1-1/+9
I don't have any guests with PAGE_SIZE > 64k but the code seems to be clearly broken in that case as PAGE_SIZE / MERGEABLE_BUFFER_ALIGN will need more than 8 bit and so the code in mergeable_ctx_to_buf_address does not give us the actual true size. Cc: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
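The overflow can be illustrated with a small model of packing a buffer size into the low bits of an aligned address. This is a sketch under assumptions: the alignment of 256 bytes (8 free low bits) and the helper names below are chosen for illustration and are not taken from the driver source.

```python
ALIGN = 256  # assumed buffer alignment, leaving 8 low bits for the size


def buf_to_ctx(buf_addr, truesize):
    """Pack (truesize / ALIGN - 1) into the low bits of an aligned address."""
    return buf_addr | (truesize // ALIGN - 1)


def ctx_to_truesize(ctx):
    """Recover the size from the low bits of the context."""
    return ((ctx & (ALIGN - 1)) + 1) * ALIGN


def ctx_to_addr(ctx):
    """Recover the buffer address by masking off the low bits."""
    return ctx & ~(ALIGN - 1)
```

A 64 KiB buffer at address 0x10000 round-trips correctly. A 128 KiB buffer needs 9 bits for its size field, so the extra bit spills into the address: the model decodes a size of 64 KiB and an address of 0x10100.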
2017-01-24af_unix: move unix_mknod() out of bindlockWANG Cong1-11/+16
Dmitry reported a deadlock scenario:

  unix_bind() path:
  u->bindlock ==> sb_writer

  do_splice() path:
  sb_writer ==> pipe->mutex ==> u->bindlock

In the unix_bind() code path, unix_mknod() does not have to be done with u->bindlock held, since it is a pure fs operation, so we can just move unix_mknod() out. Reported-by: Dmitry Vyukov <dvyukov@google.com> Tested-by: Dmitry Vyukov <dvyukov@google.com> Cc: Rainer Weikusat <rweikusat@mobileactivedefense.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
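The lock-ordering problem in this report can be checked mechanically. The sketch below builds a "taken before" graph from each code path's acquisition order and looks for a cycle. It is an illustrative model: `has_lock_cycle()` is a hypothetical helper, and the "after" paths assume unix_mknod() takes and releases sb_writer before u->bindlock is acquired.

```python
def has_lock_cycle(paths):
    """Return True if the lock acquisition orders in `paths` form a cycle.

    Each path is a list of lock names in the order they are taken while
    the earlier ones are still held.
    """
    edges = {}
    for path in paths:
        for held, taken in zip(path, path[1:]):
            edges.setdefault(held, set()).add(taken)

    def reachable(src, dst, seen):
        if src == dst:
            return True
        seen.add(src)
        return any(
            reachable(nxt, dst, seen)
            for nxt in edges.get(src, ())
            if nxt not in seen
        )

    # A cycle exists if, for some edge a -> b, a is reachable from b.
    return any(
        reachable(taken, held, set())
        for held, targets in edges.items()
        for taken in targets
    )


before = [
    ["u->bindlock", "sb_writer"],                 # unix_bind()
    ["sb_writer", "pipe->mutex", "u->bindlock"],  # do_splice()
]
after = [
    ["sb_writer"],                                # unix_mknod(), lock dropped
    ["u->bindlock"],                              # rest of unix_bind()
    ["sb_writer", "pipe->mutex", "u->bindlock"],  # do_splice()
]
```

In this model `before` contains a cycle and `after` does not, which is the effect of moving the fs operation out from under the bind lock.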
2017-01-24IB/vmw_pvrdma: Fix incorrect cleanup on pvrdma_pci_probe error pathAdit Ranadive1-3/+1
If the interrupt allocation failed we should start freeing the CQ rings rather than unregistering the netdev notifier. Fixes: 29c8d9eba550 ("IB: Add vmw_pvrdma driver") Signed-off-by: Adit Ranadive <aditr@vmware.com> Reviewed-by: Yuval Shaia <yuval.shaia@oracle.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2017-01-24IB/vmw_pvrdma: Don't leak info from alloc_ucontextAdit Ranadive1-1/+1
Clear out the user response struct correctly. Fixes: 29c8d9eba550 ("IB: Add vmw_pvrdma driver") Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Adit Ranadive <aditr@vmware.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2017-01-24net/mlx5e: CQE compression control code reuseShaker Daibes4-20/+10
This patch is intended for code reuse of mlx5e_modify_rx_cqe_compression function. Signed-off-by: Shaker Daibes <shakerd@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2017-01-24net/mlx5e: Reduce memory consumption on kdump kernelKamal Heib3-11/+21
Reduce memory consumption on kdump kernel by decreasing the number of channels to 1 and the size of RQs and SQs to the minimal values. Signed-off-by: Kamal Heib <kamalh@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2017-01-24IB/mlx5: Enable Eth VFs to query their min-inline value for user-spaceOr Gerlitz2-1/+22
For some mlx5 HW models (CX4, CX4Lx), the VF driver needs to put part of the packet headers on the TX descriptor so the e-switch can do proper matching and steering. This is called "min-inline"; it's advertised to the VF by the FW and also enforced on them by the HW, such that if they don't obey, their packets are dropped. SRIOV VF libmlx5 instances should take into account the min-inline value of their vports. To that end, we provide this value through the vendor response part of the init_ucontext command. The min-inline value is reported in a way which will let newer libmlx5 instances realize that they are running over an older kernel and act accordingly (e.g. apply some educated guess). Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Reviewed-by: Matan Barak <matanb@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2017-01-24net/mlx5: Push min-inline mode resolution helper into the coreOr Gerlitz3-17/+19
So we can use that from the IB driver too in downstream patches. This patch doesn't change any functionality. Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2017-01-24net/mlx5: Add support for setting VF min rateMohamad Haj Yahia4-13/+104
Add support for SRIOV VF min rate guarantee by using the TSAR BW share weights mechanism. The TSAR BW share vport attribute represents the weight of that vport among the other vports weights which means that the actual vport BW percentage is the same vport weight percentage among the total vports weights sum. Signed-off-by: Mohamad Haj Yahia <mohamad@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
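The weight-to-bandwidth relation stated above is a plain proportion. A minimal sketch, assuming a mapping from vport name to BW share weight (the `bw_share_percent()` helper and the vport names are hypothetical):

```python
def bw_share_percent(weights):
    """Map each vport's weight to its percentage of the total weight."""
    total = sum(weights.values())
    if total == 0:
        return {vport: 0.0 for vport in weights}
    return {vport: 100.0 * w / total for vport, w in weights.items()}
```

For example, weights of 1 and 3 on two vports correspond to 25% and 75% of the bandwidth respectively.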
2017-01-24net/mlx5: E-Switch, Enlarge the FDB size for the switchdev modeOr Gerlitz1-5/+8
The E-Switch FDB size was hard coded to 8k. Change it to be min(max eswitch table size, max flow counters * num flow groups) where the max values are read from the firmware and the number of flow groups is hard-coded as before this change. We don't know upfront the division of flows to groups. This setup allows each group to grow up to the size we want to support (we mandate pairing of flows with counters for offloading). Thus, we don't expect multiple occurrences of a group, which would in turn add steering hops. Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Reviewed-by: Roi Dayan <roid@mellanox.com> Tested-by: Roi Dayan <roid@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
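The sizing rule can be written out directly. The function and parameter names below are hypothetical, and the numbers in the usage note are made up for illustration rather than read from any firmware.

```python
def fdb_size(max_table_size, max_flow_counters, num_flow_groups):
    """min(max eswitch table size, max flow counters * num flow groups)."""
    return min(max_table_size, max_flow_counters * num_flow_groups)
```

With an assumed table limit of 65536 entries, 4096 flow counters and 4 flow groups, the result is 16384, so the counter budget is the limiting factor. With 32768 counters the table limit wins instead.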
2017-01-24net/mlx5e: Support SRIOV TC encapsulation offloads for IPv6 tunnelsOr Gerlitz1-8/+151
Add the missing parts for offloading IPv6 tunnels. This includes route and neigh lookups and construction of the IPv6 tunnel headers. Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Reviewed-by: Hadar Hen Zion <hadarh@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2017-01-24net/mlx5e: Maximize ip tunnel key usage on the TC offloading pathOr Gerlitz1-9/+6
Use more fields out of the tunnel key (e.g the tunnel source IP address) provided by upper layers for the route lookup done on the encap offload path. Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Reviewed-by: Hadar Hen Zion <hadarh@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2017-01-24net/mlx5e: Use the full tunnel key info for encapsulation offload house-keepingOr Gerlitz2-30/+17
Currently we use a subset of the input tunnel key fields (id, ip daddr, dst port) which are provided by upper layers to identify flows that should go through the same encapsulation and maintain the HW encapsulation table. This is redundant and can lead to wrong results. Instead, keep a copy of the ip tunnel info provided by the user through TC and have the tunnel key part as the key to our internal hash. Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Reviewed-by: Hadar Hen Zion <hadarh@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
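Why a subset key can go wrong is easiest to see with two flows that differ only in a field outside the subset. The sketch below is a model with hypothetical names: the `TunnelKey` fields are illustrative and do not mirror the kernel's tunnel key layout.

```python
from typing import NamedTuple


class TunnelKey(NamedTuple):
    """Illustrative tunnel key; field names are assumptions."""
    tun_id: int
    src: str
    dst: str
    ttl: int
    tp_dst: int


def subset_key(key):
    """Old behaviour: only id, destination address and destination port."""
    return (key.tun_id, key.dst, key.tp_dst)


def full_key(key):
    """New behaviour: the whole tunnel key identifies the encapsulation."""
    return key


def build_encap_table(flows, key_fn):
    """Group flow names by the encapsulation entry they would share."""
    table = {}
    for name, key in flows:
        table.setdefault(key_fn(key), []).append(name)
    return table


flows = [
    ("flow_a", TunnelKey(42, "10.0.0.1", "10.0.0.2", 64, 4789)),
    ("flow_b", TunnelKey(42, "10.0.0.1", "10.0.0.2", 32, 4789)),
]
```

With `subset_key` the two flows collapse into one encapsulation entry even though their TTLs differ. With `full_key` they get separate entries.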
2017-01-24net/mlx5e: TC ipv4 tunnel encap offload cosmetic changesOr Gerlitz1-6/+4
Move around some settings of variables as pre-step to make things more robust and clear for the ipv6 case in down-stream patch. This patch doesn't change any functionality. Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Reviewed-by: Hadar Hen Zion <hadarh@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2017-01-24net/mlx5e: Add TC offloads matching on IPv6 encapsulation headersOr Gerlitz1-3/+27
Enhance the parsing of offloaded TC rules to set HW matching on outer IPv6 encapsulation headers. This effectively adds support for TC tunnel key release action (decapsulation) of SRIOV offloads over IPv6 tunnels. Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Reviewed-by: Hadar Hen Zion <hadarh@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2017-01-24net/mlx5: Use exact encap header size for the FW input bufferOr Gerlitz1-2/+5
The current code is allocating the max encap size supported by the firmware and not the size requested by the caller, fix that. Also, spare a warning when the size of the encapsulation headers is bigger than what is supported by the firmware. Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Reviewed-by: Hadar Hen Zion <hadarh@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2017-01-24phy: marvell: remove conflicting initializerArnd Bergmann1-7/+6
One line was apparently pasted incorrectly during a new feature patch: drivers/net/phy/marvell.c:2090:15: error: initialized field overwritten [-Werror=override-init] .features = PHY_GBIT_FEATURES, I'm removing the extraneous line here to avoid the W=1 warning and restore the previous flags value, and I'm slightly reordering the lines for consistency to make it less likely to happen again in the future. The ordering in the array is still not the same as in the structure definition, instead I picked the order that is most common in this file and that seems to make more sense here. Fixes: 0b04680fdae4 ("phy: marvell: Add support for temperature sensor") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-24net: dummy: Introduce dummy virtual functionsPhil Sutter1-2/+215
The idea for this was born when testing VF support in iproute2, which was impeded by hardware requirements. In fact, not every VF-capable hardware driver implements all netdev ops, so testing the interface is still hard to do even with a well-sorted hardware shelf. To overcome this and allow for testing the user-kernel interface, this patch allows turning dummy into a PF with a configurable number of VFs. Since my patch series 'bus-agnostic-num-vf' has been accepted, implementing the required interfaces is pretty straightforward: Iff the 'num_vfs' module parameter was given a value >0, a dummy bus type is registered which implements the 'num_vf()' callback. Additionally, a dummy parent device common to all dummy devices is registered which sits on the above dummy bus. Joint work with Sabrina Dubroca. Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: Phil Sutter <phil@nwl.cc> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-24net: broadcom: bnx2x: use new api ethtool_{get|set}_link_ksettingsPhilippe Reynes1-90/+109
The ethtool api {get|set}_settings is deprecated. We move this driver to the new api {get|set}_link_ksettings. As I don't have the hardware, I'd be very pleased if someone could test this patch. Signed-off-by: Philippe Reynes <tremyfr@gmail.com> Acked-by: Yuval Mintz <Yuval.Mintz@cavium.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-24Merge branch 'packet-sampling-offload'David S. Miller19-0/+929
Jiri Pirko says: ==================== Add support for offloading packet-sampling Yotam says: The first patch introduces the psample module, a netlink channel dedicated to packet sampling implemented using generic netlink. This module provides a generic way for kernel modules to sample packets, while not being tied to any specific subsystem like NFLOG. The second patch adds the sample tc action, which uses psample to randomly sample packets that match a classifier. The user can configure the psample group number, the sampling rate and the packet's truncation (to save kernel-user traffic). The last two patches add the support for offloading the matchall-sample tc command in the mlxsw driver, for ingress qdiscs. An example for psample usage can be found in the libpsample project at: https://github.com/Mellanox/libpsample v1->v2: - Reword first patch's commit message - Fix typo in comment in second patch - Change order of tc_sample uapi enum to match convention - Rename act_sample action callback tcf_sample -> tcf_sample_act ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-24mlxsw: spectrum: Add packet sample offloading supportYotam Gigi3-0/+122
Using the MPSC register, add the functions that configure port-based packet sampling in hardware and the necessary datatypes in the mlxsw_sp_port struct. In addition, add the necessary trap for sampled packets and integrate with matchall offloading to allow offloading of the sample tc action. The current offload support is for the tc command: tc filter add dev <DEV> parent ffff: \ matchall skip_sw \ action sample rate <RATE> group <GROUP> [trunc <SIZE>] Where only ingress qdiscs are supported, and only a combination of matchall classifier and sample action will lead to activating hardware packet sampling. Signed-off-by: Yotam Gigi <yotamg@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Reviewed-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-24mlxsw: reg: add the Monitoring Packet Sampling Configuration RegisterYotam Gigi1-0/+41
The MPSC register allows configuring ingress packet sampling on a specific port of the mlxsw device. The sampled packets are then trapped via the PKT_SAMPLE trap. Signed-off-by: Yotam Gigi <yotamg@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Reviewed-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-24net/sched: Introduce sample tc actionYotam Gigi6-0/+364
This action allows the user to sample traffic matched by tc classifier. The sampling consists of choosing packets randomly and sampling them using the psample module. The user can configure the psample group number, the sampling rate and the packet's truncation (to save kernel-user traffic). Example: To sample ingress traffic from interface eth1, one may use the commands: tc qdisc add dev eth1 handle ffff: ingress tc filter add dev eth1 parent ffff: \ matchall action sample rate 12 group 4 Where the first command adds an ingress qdisc and the second starts sampling randomly with an average of one sampled packet per 12 packets on dev eth1 to psample group 4. Signed-off-by: Yotam Gigi <yotamg@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Reviewed-by: Simon Horman <simon.horman@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
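The sampling behaviour described above (on average one packet in `rate`, optionally truncated) can be sketched as follows. This is a model, not the action's code: `sample_packets()` and its parameters are hypothetical, and the random draw is only one plausible way to get the stated average.

```python
import random


def sample_packets(packets, rate, trunc_size=None, rng=None):
    """Pick on average one packet out of `rate`, optionally truncated.

    Returns a list of (data, original_length) tuples.
    """
    rng = rng or random.Random()
    sampled = []
    for pkt in packets:
        # Each packet is chosen independently with probability 1/rate.
        if rng.randrange(rate) != 0:
            continue
        data = pkt if trunc_size is None else pkt[:trunc_size]
        sampled.append((data, len(pkt)))
    return sampled
```

With `rate=12`, as in the tc example, roughly one in twelve packets is reported over a long run. With `rate=1` every packet is sampled, which is convenient for testing the truncation path.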
2017-01-24net: Introduce psample, a new genetlink channel for packet samplingYotam Gigi9-0/+402
Add a general way for kernel modules to sample packets, without being tied to any specific subsystem. This netlink channel can be used by tc, iptables, etc. and allows standardizing packet sampling in the kernel. For every sampled packet, the psample module adds the following metadata fields:

PSAMPLE_ATTR_IIFINDEX - the packet's input ifindex, if applicable
PSAMPLE_ATTR_OIFINDEX - the packet's output ifindex, if applicable
PSAMPLE_ATTR_ORIGSIZE - the packet's original size, in case it has been truncated during sampling
PSAMPLE_ATTR_SAMPLE_GROUP - the packet's sample group, which is set by the user who initiated the sampling. This field allows the user to differentiate between several samplers working simultaneously and filter packets relevant to them
PSAMPLE_ATTR_GROUP_SEQ - sequence counter of the last sent packet. The sequence is kept for each group
PSAMPLE_ATTR_SAMPLE_RATE - the sampling rate used for sampling the packets
PSAMPLE_ATTR_DATA - the actual packet bits

The sampled packets are sent to the PSAMPLE_NL_MCGRP_SAMPLE multicast group. In addition, add the GET_GROUPS netlink command which allows the user to see the current sample groups, their refcount and sequence number. This command currently supports only netlink dump mode. Signed-off-by: Yotam Gigi <yotamg@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com> Reviewed-by: Simon Horman <simon.horman@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
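The metadata fields listed in this commit message can be pictured as a per-packet record with a per-group sequence counter. The sketch below is a user-space model only: the `PsampleModel` class is hypothetical, the counter is assumed to start at 0, and ORIGSIZE is always included here for simplicity.

```python
from collections import defaultdict


class PsampleModel:
    """Toy model of building per-sample metadata."""

    def __init__(self):
        self._group_seq = defaultdict(int)  # one counter per sample group

    def build(self, group, packet, rate, trunc_size=None,
              iifindex=None, oifindex=None):
        data = packet if trunc_size is None else packet[:trunc_size]
        msg = {
            "PSAMPLE_ATTR_SAMPLE_GROUP": group,
            "PSAMPLE_ATTR_GROUP_SEQ": self._group_seq[group],
            "PSAMPLE_ATTR_SAMPLE_RATE": rate,
            "PSAMPLE_ATTR_ORIGSIZE": len(packet),
            "PSAMPLE_ATTR_DATA": data,
        }
        # Interface indexes are attached only when applicable.
        if iifindex is not None:
            msg["PSAMPLE_ATTR_IIFINDEX"] = iifindex
        if oifindex is not None:
            msg["PSAMPLE_ATTR_OIFINDEX"] = oifindex
        self._group_seq[group] += 1
        return msg
```

A listener that receives these records can use the group number to separate concurrent samplers and the sequence number to notice gaps within one group.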
2017-01-24mlxsw: spectrum_router: Correctly reallocate adjacency entriesIdo Schimmel1-4/+6
mlxsw_sp_nexthop_group_mac_update() is called in one of two cases: 1) When the MAC of a nexthop needs to be updated 2) When the size of a nexthop group has changed In the second case the adjacency entries for the nexthop group need to be reallocated from the adjacency table. In this case we must write to the entries the MAC addresses of all the nexthops that should be offloaded and not only those whose MAC changed. Otherwise, these entries would be filled with garbage data, resulting in packet loss. Fixes: a7ff87acd995 ("mlxsw: spectrum_router: Implement next-hop routing") Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
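The difference between the two cases can be modelled with a list of adjacency slots, where `None` stands for an entry that was never written after reallocation. This is an illustrative sketch: `update_group()`, its `fixed` switch and the dictionary keys are hypothetical names.

```python
def update_group(old_entries, nexthops, reallocate, fixed=True):
    """Return the adjacency entries after an update.

    On reallocation the group gets fresh, unwritten slots (None).
    With fixed=False only nexthops flagged "update" are written,
    which models the behaviour before this fix.
    """
    entries = [None] * len(nexthops) if reallocate else list(old_entries)
    for i, nh in enumerate(nexthops):
        if not nh["should_offload"]:
            continue
        if nh["update"] or (fixed and reallocate):
            entries[i] = nh["mac"]
    return entries


nexthops = [
    {"mac": "aa", "should_offload": True, "update": False},
    {"mac": "bb", "should_offload": True, "update": True},
]
```

When the group grows from one nexthop to two, the old behaviour leaves the first slot unwritten, while the fixed behaviour writes both MAC addresses.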
2017-01-24r8152: don't execute runtime suspend if the tx is not emptyhayeswang1-1/+3
Runtime suspend shouldn't be executed if the tx queue is not empty, because the device is not idle. Signed-off-by: Hayes Wang <hayeswang@realtek.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-24Merge branch 'mdio_module_driver-misc'David S. Miller2-13/+2
Florian Fainelli says: ==================== net: couple mdio_module_driver changes Small patch series fixing a comment for mdio_module_driver and finally utilizing it in b53_mdio. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-24net: dsa: b53: Utilize mdio_module_driverFlorian Fainelli1-12/+1
Eliminate a bit of boilerplate code. Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-24net: phy: Fix typo for MDIO module boilerplate commentFlorian Fainelli1-1/+1
The module boilerplate macro is named mdio_module_driver and not module_mdio_driver, fix that. Fixes: a9049e0c513c ("mdio: Add support for mdio drivers.") Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-24Merge branch 'stmmac-dwmac-meson8b-configurable-RGMII-TX-delay'David S. Miller2-6/+30
Martin Blumenstingl says: ==================== stmmac: dwmac-meson8b: configurable RGMII TX delay

Currently the dwmac-meson8b stmmac glue driver uses a hardcoded 1/4 cycle (= 2ns) TX clock delay. This seems to work fine for many boards (for example Odroid-C2 or Amlogic's reference boards) but there are some others where TX traffic is simply broken. There are probably multiple reasons why it's working on some boards while it's broken on others:
- some of Amlogic's reference boards are using a Micrel PHY
- hardware circuit design
- maybe more...

iperf3 results on my Mecool BB2 board (Meson GXM, RTL8211F PHY) with TX clock delay disabled on the MAC (as it's enabled in the PHY driver). TX throughput was virtually zero before:

$ iperf3 -c 192.168.1.100 -R
Connecting to host 192.168.1.100, port 5201
Reverse mode, remote host 192.168.1.100 is sending
[ 4] local 192.168.1.206 port 52828 connected to 192.168.1.100 port 5201
[ ID] Interval Transfer Bandwidth
[ 4] 0.00-1.00 sec 108 MBytes 901 Mbits/sec
[ 4] 1.00-2.00 sec 94.2 MBytes 791 Mbits/sec
[ 4] 2.00-3.00 sec 96.5 MBytes 810 Mbits/sec
[ 4] 3.00-4.00 sec 96.2 MBytes 808 Mbits/sec
[ 4] 4.00-5.00 sec 96.6 MBytes 810 Mbits/sec
[ 4] 5.00-6.00 sec 96.5 MBytes 810 Mbits/sec
[ 4] 6.00-7.00 sec 96.6 MBytes 810 Mbits/sec
[ 4] 7.00-8.00 sec 96.5 MBytes 809 Mbits/sec
[ 4] 8.00-9.00 sec 105 MBytes 884 Mbits/sec
[ 4] 9.00-10.00 sec 111 MBytes 934 Mbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bandwidth Retr
[ 4] 0.00-10.00 sec 1000 MBytes 839 Mbits/sec 0 sender
[ 4] 0.00-10.00 sec 998 MBytes 837 Mbits/sec receiver
iperf Done.

$ iperf3 -c 192.168.1.100
Connecting to host 192.168.1.100, port 5201
[ 4] local 192.168.1.206 port 52832 connected to 192.168.1.100 port 5201
[ ID] Interval Transfer Bandwidth Retr Cwnd
[ 4] 0.00-1.01 sec 99.5 MBytes 829 Mbits/sec 117 139 KBytes
[ 4] 1.01-2.00 sec 105 MBytes 884 Mbits/sec 129 70.7 KBytes
[ 4] 2.00-3.01 sec 107 MBytes 889 Mbits/sec 106 187 KBytes
[ 4] 3.01-4.01 sec 105 MBytes 878 Mbits/sec 92 143 KBytes
[ 4] 4.01-5.00 sec 105 MBytes 882 Mbits/sec 140 129 KBytes
[ 4] 5.00-6.01 sec 106 MBytes 883 Mbits/sec 115 195 KBytes
[ 4] 6.01-7.00 sec 102 MBytes 863 Mbits/sec 133 70.7 KBytes
[ 4] 7.00-8.01 sec 106 MBytes 884 Mbits/sec 143 97.6 KBytes
[ 4] 8.01-9.01 sec 104 MBytes 875 Mbits/sec 124 107 KBytes
[ 4] 9.01-10.01 sec 105 MBytes 876 Mbits/sec 90 139 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bandwidth Retr
[ 4] 0.00-10.01 sec 1.02 GBytes 874 Mbits/sec 1189 sender
[ 4] 0.00-10.01 sec 1.02 GBytes 873 Mbits/sec receiver
iperf Done.

I get similar TX throughput on my Meson GXBB "MXQ Pro+" board when I disable the PHY's TX-delay and configure a 4ns TX-delay on the MAC. So changes to at least the RTL8211F PHY driver are needed to get it working properly in all situations.

Changes since v4:
- add a fallback of 2ns (the value which was previously hardcoded) for the TX delay so we are backwards-compatible with older .dts'
- update the documentation with the new fallback value and add a small note that the "amlogic,tx-delay" property is ignored when the phy-mode is "rmii".

Changes since v3:
- rebased to apply against current net-next branch (fixes a conflict with d2ed0a7755fe14c7 "net: ethernet: stmmac: fix of-node and fixed-link-phydev leaks")

Changes since v2:
- moved all .dts patches (3-7) to a separate series
- removed the default 2ns TX delay when phy-mode RGMII is specified
- (rebased against current net-next)

Changes since v1:
- renamed the devicetree property "amlogic,tx-delay" to "amlogic,tx-delay-ns", which makes the .dts easier to read as we can simply specify human-readable values instead of having "preprocessor defines and calculation in human brain". Thanks to Andrew Lunn for the suggestion!
- improved documentation to indicate when the MAC TX-delay should be configured and how to use the PHY's TX-delay
- changed the default TX-delay in the dwmac-meson8b driver from 2ns to 0ns when any of the rgmii-*id modes are used (the 2ns default value still applies for phy-mode "rgmii")
- added patches to properly reset the PHY on Meson GXBB devices and to use a configuration similar to the one we use on Meson GXL devices (by passing a phy-handle to stmmac and defining the PHY in the mdio0 bus - patch 3-6)
- add the "amlogic,tx-delay-ns" property to all boards which are using the RGMII PHY (patch 7)

==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-24net: stmmac: dwmac-meson8b: make the RGMII TX delay configurableMartin Blumenstingl1-6/+14
Prior to this patch we were using a hardcoded RGMII TX clock delay of 2ns (= 1/4 cycle of the 125MHz RGMII TX clock). This value works for many boards, but unfortunately not for all (due to the way the actual circuit is designed, sometimes because the TX delay is enabled in the PHY, etc.). Making the TX delay on the MAC side configurable allows us to support all possible hardware combinations. This allows fixing a compatibility issue on some boards, where the RTL8211F PHY is configured to generate the TX delay. We can now turn off the TX delay in the MAC, because otherwise we would be applying the delay twice (which results in non-working TX traffic). Signed-off-by: Martin Blumenstingl <martin.blumenstingl@googlemail.com> Tested-by: Neil Armstrong <narmstrong@baylibre.com> Signed-off-by: David S. Miller <davem@davemloft.net>
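The arithmetic behind "2ns = 1/4 cycle of the 125MHz RGMII TX clock" can be sketched in C as below. The 8ns period follows from the 125MHz clock; everything else (the helper names, the 2-bit field, its bit position, and the resulting 0 to 6ns range) is a hypothetical layout chosen for illustration and is not taken from the patch.

```c
#include <stdint.h>

/* One cycle of the 125 MHz RGMII TX clock lasts 8 ns, so 1/4 cycle is 2 ns. */
#define RGMII_TX_CLK_PERIOD_NS	8u
#define QUARTER_CYCLE_NS	(RGMII_TX_CLK_PERIOD_NS / 4u)

/* Hypothetical register layout: a 2-bit delay field starting at bit 5. */
#define EXAMPLE_TXDLY_SHIFT	5
#define EXAMPLE_TXDLY_MASK	(0x3u << EXAMPLE_TXDLY_SHIFT)

/*
 * Convert a delay in nanoseconds to a count of quarter cycles.
 * Returns 0..3, or -1 if delay_ns is not a multiple of 2 ns or exceeds
 * what the assumed 2-bit field can hold (6 ns).
 */
static int tx_delay_ns_to_quarter_cycles(unsigned int delay_ns)
{
	if (delay_ns % QUARTER_CYCLE_NS != 0 ||
	    delay_ns > 3u * QUARTER_CYCLE_NS)
		return -1;

	return (int)(delay_ns / QUARTER_CYCLE_NS);
}

/* Replace the delay field in reg, leaving all other bits untouched. */
static uint32_t set_tx_delay_field(uint32_t reg, unsigned int quarter_cycles)
{
	reg &= ~EXAMPLE_TXDLY_MASK;
	reg |= (quarter_cycles << EXAMPLE_TXDLY_SHIFT) & EXAMPLE_TXDLY_MASK;
	return reg;
}
```

With this layout, a delay of 0 clears the field, which corresponds to the case where the PHY already generates the TX delay and the MAC must not add a second one.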
2017-01-24net: dt-bindings: add RGMII TX delay configuration to meson8b-dwmacMartin Blumenstingl1-0/+16
This allows configuring the RGMII TX clock delay. The RGMII clock is generated by underlying hardware of the Meson 8b / GXBB DWMAC glue. The configuration depends on the actual hardware (no delay may be needed due to the design of the actual circuit, the PHY might add this delay, etc.). Signed-off-by: Martin Blumenstingl <martin.blumenstingl@googlemail.com> Tested-by: Neil Armstrong <narmstrong@baylibre.com> Acked-by: Rob Herring <robh@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-24net: dsa: Fix inverted test for multiple CPU interfaceAndrew Lunn1-1/+1
Remove the wrong !, otherwise we get false positives about having multiple CPU interfaces. Fixes: b22de490869d ("net: dsa: store CPU switch structure in the tree") Signed-off-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
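The class of bug described here, a stray negation that makes a "CPU interface already registered" check fire on the first registration, can be illustrated with a small self-contained C sketch. The struct and function names are hypothetical and do not reproduce the DSA code.

```c
#include <stddef.h>

/* Hypothetical stand-ins for a switch and the tree that tracks its CPU switch. */
struct example_switch { int id; };
struct example_tree { struct example_switch *cpu_switch; };

/* Buggy variant: the stray '!' rejects the very first CPU interface. */
static int register_cpu_switch_buggy(struct example_tree *tree,
				     struct example_switch *sw)
{
	if (!tree->cpu_switch)
		return -1;	/* false positive: nothing was registered yet */

	tree->cpu_switch = sw;
	return 0;
}

/* Fixed variant: complain only when a CPU switch is already recorded. */
static int register_cpu_switch(struct example_tree *tree,
			       struct example_switch *sw)
{
	if (tree->cpu_switch != NULL)
		return -1;	/* a second CPU interface really was found */

	tree->cpu_switch = sw;
	return 0;
}
```

On an empty tree the buggy variant returns -1 immediately, while the fixed variant accepts the first switch and rejects only a second one.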
2017-01-24Documentation: net: phy: improve explanation when to specify the PHY IDMartin Blumenstingl1-2/+3
The old description basically read like "ethernet-phy-idAAAA.BBBB" can be specified when you know the actual PHY ID. However, specifying this has a side-effect: it forces Linux to bind to a certain PHY driver (the one that matches the ID given in the compatible string), ignoring the ID which is reported by the actual PHY. Whenever a device is shipped with (multiple) different PHYs during its production lifetime then explicitly specifying "ethernet-phy-idAAAA.BBBB" could break certain revisions of that device. Signed-off-by: Martin Blumenstingl <martin.blumenstingl@googlemail.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Acked-by: Rob Herring <robh@kernel.org> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
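To make the side-effect concrete: a compatible string of the form "ethernet-phy-idAAAA.BBBB" carries a full 32-bit PHY ID, with AAAA as the upper and BBBB as the lower 16 bits, so a parser can derive the ID without ever asking the hardware. The C sketch below shows one way to do that; the function name is hypothetical, the ID in the usage note is just an example value, and the parsing is deliberately lenient (it also accepts fewer than four hex digits per half).

```c
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical helper: extract the 32-bit PHY ID from a compatible string
 * of the form "ethernet-phy-idAAAA.BBBB" (AAAA and BBBB are hex digits).
 * Returns 0 and stores the ID on success, -1 if the string does not match.
 */
static int parse_phy_id_compatible(const char *compat, uint32_t *phy_id)
{
	unsigned int upper, lower;

	if (sscanf(compat, "ethernet-phy-id%4x.%4x", &upper, &lower) != 2)
		return -1;

	/* AAAA forms the upper 16 bits of the ID, BBBB the lower 16 bits. */
	*phy_id = ((uint32_t)(upper & 0xFFFFu) << 16) | (lower & 0xFFFFu);
	return 0;
}
```

For example, "ethernet-phy-id001c.c916" yields 0x001cc916, whereas a generic string such as "ethernet-phy-ieee802.3-c22" is rejected, leaving the ID to be read from the PHY itself.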
2017-01-24net: phy: marvell: Add Wake from LAN support for 88E1510 PHYJingju Hou1-0/+2
Signed-off-by: Jingju Hou <houjingj@marvell.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-24bridge: multicast to unicastFelix Fietkau8-29/+114
Implements an optional, per bridge port flag and feature to deliver multicast packets to each host on the corresponding port individually via unicast. This is done by copying the packet per host and changing the multicast destination MAC to a unicast one accordingly. multicast-to-unicast works on top of the multicast snooping feature of the bridge, which means unicast copies are only delivered to hosts which are interested in the group and have previously signalled this via IGMP/MLD reports. This feature is intended for interface types which have a more reliable and/or efficient way to deliver unicast packets than broadcast ones (e.g. wifi). However, it should only be enabled on interfaces where no IGMPv2/MLDv1 report suppression takes place. This feature is disabled by default. The initial patch and idea are from Felix Fietkau. Signed-off-by: Felix Fietkau <nbd@nbd.name> [linus.luessing@c0d3.blue: various bug + style fixes, commit message] Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue> Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
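The core idea, one copy of the frame per interested host with the destination MAC rewritten, can be sketched in plain C as below. This is a userspace illustration on raw byte buffers, not the bridge implementation: the struct, the function name, and the flat output buffer are assumptions, and the list of hosts is simply passed in rather than learned from IGMP/MLD snooping.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define EXAMPLE_ETH_ALEN 6	/* length of an Ethernet MAC address */

/* Hypothetical record for one host that reported interest in the group. */
struct example_mac {
	uint8_t addr[EXAMPLE_ETH_ALEN];
};

/*
 * Write one copy of `frame` per host into `out`, back to back, and replace
 * the destination MAC (the first 6 bytes of an Ethernet frame) in each copy
 * with that host's unicast address.
 * `out` must provide n_hosts * frame_len bytes.
 * Returns the number of copies written, or 0 if the frame is too short to
 * contain destination and source addresses.
 */
static size_t multicast_to_unicast(const uint8_t *frame, size_t frame_len,
				   const struct example_mac *hosts,
				   size_t n_hosts, uint8_t *out)
{
	size_t i;

	if (frame_len < 2 * EXAMPLE_ETH_ALEN)
		return 0;

	for (i = 0; i < n_hosts; i++) {
		uint8_t *copy = out + i * frame_len;

		memcpy(copy, frame, frame_len);			/* duplicate the frame */
		memcpy(copy, hosts[i].addr, EXAMPLE_ETH_ALEN);	/* rewrite the destination */
	}

	return n_hosts;
}
```

Everything after the destination address, including the source MAC and the payload, is left unchanged in each copy.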