summaryrefslogtreecommitdiff
AgeCommit message (Collapse)AuthorFilesLines
2019-03-02MIPS: eBPF: Fix icache flush end addressPaul Burton1-1/+1
The MIPS eBPF JIT calls flush_icache_range() in order to ensure the icache observes the code that we just wrote. Unfortunately it gets the end address calculation wrong due to some bad pointer arithmetic. The struct jit_ctx target field is of type pointer to u32, and as such adding one to it will increment the address being pointed to by 4 bytes. Therefore in order to find the address of the end of the code we simply need to add the number of 4 byte instructions emitted, but we mistakenly add the number of instructions multiplied by 4. This results in the call to flush_icache_range() operating on a memory region 4x larger than intended, which is always wasteful and can cause crashes if we overrun into an unmapped page. Fix this by correcting the pointer arithmetic to remove the bogus multiplication, and use braces to remove the need for a set of brackets whilst also making it obvious that the target field is a pointer. Signed-off-by: Paul Burton <paul.burton@mips.com> Fixes: b6bd53f9c4e8 ("MIPS: Add missing file for eBPF JIT.") Cc: Alexei Starovoitov <ast@kernel.org> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Martin KaFai Lau <kafai@fb.com> Cc: Song Liu <songliubraving@fb.com> Cc: Yonghong Song <yhs@fb.com> Cc: netdev@vger.kernel.org Cc: bpf@vger.kernel.org Cc: linux-mips@vger.kernel.org Cc: stable@vger.kernel.org # v4.13+ Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-03-01net/mlx5: Update the list of the PCI supported devicesEran Ben Elisha1-0/+2
Add the upcoming ConnectX-6 Dx. In addition, add "ConnectX Family mlx5Gen Virtual Function" device ID. Every new HCA VF will be identified with this device ID. Different VF models will be distinguished by their revision id. Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com> Reviewed-by: Aya Levin <ayal@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2019-03-01net/mlx5e: Set peer flow needed also for multipathRoi Dayan1-2/+9
Update the predicate that determines if to duplicate rules installed on vport reps to account also for the multipath case. Signed-off-by: Roi Dayan <roid@mellanox.com> Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2019-03-01net/mlx5e: Update check for merged eswitch deviceRoi Dayan1-4/+3
The current check only validates if both netdevs use the same ops which means both are vf reps or both uplink reps. Unlike the case where the two uplinks are bonded (VF LAG), under multipath scheme the switchdev parent id is not unified between the uplink reps (and all the associated vf reps). However, we still want to duplicate in the driver encap flows, adjust the merged eswitch check for that matter. Signed-off-by: Roi Dayan <roid@mellanox.com> Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2019-03-01net/mlx5e: Use hint to resolve route when in HW multipath modeRoi Dayan1-0/+12
As part of creating the tunnel headers while offloading TC encap rules, we resolve the route and neighbour in order to get the source / destination mac. Since the way we offload multipath route is by having two HW rules, one per uplink port, doing naive route lookup might get us a "wrong" routing path which goes through the peer uplink and this will get us eventually to create a wrong L2 header for the tunnel. To avoid that, we use a device hint to get the correct route. Signed-off-by: Roi Dayan <roid@mellanox.com> Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2019-03-01net/mlx5e: Always query offloaded tc peer rule counterRoi Dayan1-11/+15
Under multipath when encap rules are duplicated to HW in the driver, it's possible for one flow to be currently un-offloaded (e.g. lack of next-hop route or neigh entry) while the other flow is offloaded. As such, we move to query the counters of both flows at all times. Signed-off-by: Roi Dayan <roid@mellanox.com> Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2019-03-01net/mlx5e: Re-attempt to offload flows on multipath port affinity eventsRoi Dayan4-12/+71
Under multipath it's possible for us to offload the flow only through the e-switch for which proper route through the uplink exists. When the port is up and the next-hop route is set again we want to offload through it as well. We generate SW event from the FIB event handler when multipath port affinity changes. The tc offloads code gets this event, goes over the flows which were marked as of having missing route and attempts to offload them. Signed-off-by: Roi Dayan <roid@mellanox.com> Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2019-03-01net/mlx5: Emit port affinity event for multipath offloadsRoi Dayan2-0/+12
Under multipath offload scheme, as part of handling fib events, emit mlx5 port affinity event on the enabled ports which will be handled by the tc offloads code. Signed-off-by: Roi Dayan <roid@mellanox.com> Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2019-03-01net/mlx5e: Allow one failure when offloading tc encap rules under multipathRoi Dayan1-2/+12
In a similar manner to uplink/VF LAG, under multipath we add encap peer rule on the second port as well. However, unlike the LAG case, we do want to allow failure for adding one of the rules. This happens due to using a routing hint while doing the route lookup when one path (next hop device) is down. Introduce a new flag to indicate that route lookup failed for encap flow. Note that a flow may still not be offloaded to hw due to missing neighbour, in that case, the neigh update event will take care of it. Signed-off-by: Roi Dayan <roid@mellanox.com> Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2019-03-01net/mlx5e: Don't inherit flow flags on peer flow creationRoi Dayan1-3/+4
Currently the peer flow inherits the flags from the original flow after we've set it. At this time the flags are set according to the flow state, e.g marked as going to slow path and such. Even if not getting us to real bugs now, this opens the door to get us to troubles later. Future proof the code and avoid the inheritance, use the peer flags as were set on input when we started adding the original flow. Signed-off-by: Roi Dayan <roid@mellanox.com> Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2019-03-01net/mlx5e: Activate HW multipath and handle port affinity based on FIB eventsRoi Dayan6-0/+326
To support multipath offload we are going to track SW multipath route and related nexthops. To do that we register to FIB notifier and handle the route and next-hops events and reflect that as port affinity to HW. When there is a new multipath route entry that all next-hops are the ports of an HCA we will activate LAG in HW. Egress wise, we use HW LAG as the means to emulate multipath on current HW which doesn't support port selection based on xmit hash. In the presence of multiple VFs which use multiple SQs (send queues) this yields fairly good distribution. HA wise, HW LAG buys us the ability for a given RQ (receive queue) to receive traffic from both ports and for SQs to migrate xmitting over the active port if their base port fails. When the route entry is being updated to single path we will update the HW port affinity to use that port only. If a next-hop becomes dead we update the HW port affinity to the living port. When all next-hops are alive again we reset the affinity to default. Due to FW/HW limitations, when a route is deleted we are not disabling the HW LAG since doing so will not allow us to enable it again while VFs are bounded. Typically this is just a temporary state when a routing daemon removes dead routes and later adds them back as needed. This patch only handles events for AF_INET. Signed-off-by: Roi Dayan <roid@mellanox.com> Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2019-03-01net/mlx5: Add multipath modeRoi Dayan4-2/+28
In order to offload ecmp-on-host scheme where next-hop routes are used, we will make use of HW LAG. Add accessor function to let upper layers in the driver to realize if the lag acts in multi-path mode. Signed-off-by: Roi Dayan <roid@mellanox.com> Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2019-03-01net/mlx5: Use own workqueue for lag netdev events processingRoi Dayan2-1/+9
Instead of using the system workqueue, allocate our own workqueue. This workqueue will be used to handle more work in the next patch. This patch doesn't change functionality. Signed-off-by: Roi Dayan <roid@mellanox.com> Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2019-03-01net/mlx5: Expose lag operations in header fileRoi Dayan2-48/+68
The change is a refactoring step towards a multipath use case. Signed-off-by: Roi Dayan <roid@mellanox.com> Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2019-03-01net/mlx5: Use unsigned int bit instead of bool as a struct memberRoi Dayan1-1/+1
This fix checkpatch check CHECK: Avoid using bool structure members because of possible alignment issues Signed-off-by: Roi Dayan <roid@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2019-03-01net/mlx5e: Don't make internal use of errno to denote missing neighRoi Dayan2-14/+22
EAGAIN is treated as a specific case when we consider the attachment successful but wait for neigh event before offloading the flow. This can result in unwanted behavior when sub calls on the offloading path will return EAGAIN and we pass this error up. Instead of attaching to a specific error code return a boolean value from the attach encap operation saying if the encap is valid or not. Signed-off-by: Roi Dayan <roid@mellanox.com> Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2019-03-01net/mlx5e: Cleanup attach encap functionRoi Dayan1-14/+17
Remove the tunnel info argument which we can get from the other args. Also reorder the args to have input args first and output args later. This patch doesn't change functionality. Signed-off-by: Roi Dayan <roid@mellanox.com> Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2019-03-01net/mlx5e: Declare mlx5e_tx_reporter_recover_from_ctx as staticEran Ben Elisha1-1/+1
Function mlx5e_tx_reporter_recover_from_ctx is only used within mlx5e tx reporter, move it to be statically declared in en/reporter_tx.c. Fixes: de8650a82071 ("net/mlx5e: Add tx reporter support") Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com> Reported-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2019-03-01Merge branch 'nfp-control-processor-DMA-support-and-RJ45'David S. Miller3-46/+243
Jakub Kicinski says: ==================== nfp: control processor DMA support and RJ45 This series starts with adding support for reporting twisted pair media type in ethtool. Remaining patches add support for using DMA with the control/service processor. Currently we always copy the command data into card's memory. DMA support allows us to have the NSP read the data from host memory by itself. Unfortunately, the FW loading and flashing cannot directly map the buffers for DMA because (a) the firmware ABI returns const buffers, and (b) the buffers may be vmalloc()ed in many mysterious/unmappable way. So just bite the bullet - allocate new host buffer for the command and copy. As Dirk explains, the NSP now supports updating all FWs at once which means the max flashing time grew significantly. He bumps the max wait to avoid timeouts. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-01nfp: nsp: set higher timeout for flash bundleDirk van der Merwe1-4/+1
The management firmware now supports being passed a bundle with multiple components to be stored in flash at once. This makes it easier to update all components to a known state with a single user command, however, this also has the potential to increase the time required to perform the update significantly. The management firmware only updates the components out of a bundle which are outdated, however, we need to make sure we can handle the absolute worst case where a CPLD update can take a long time to perform. We set a very conservative total timeout of 900s which already adds a contingency. Signed-off-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com> Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-01nfp: nsp: allow the use of DMA bufferJakub Kicinski1-5/+191
Newer versions of NSP can access host memory. Simplest access type requires all data to be in one contiguous area. Since we don't have the guarantee on where callers of the NSP ABI will allocate their buffers we allocate a bounce buffer and copy the data in and out. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-01nfp: nsp: move default buffer handling into its own functionJakub Kicinski1-42/+51
DMA version of NSP communication is coming, move the code which copies data into the NFP buffer into a separate function. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-01nfp: nsp: use fractional size of the bufferJakub Kicinski1-6/+7
NSP expresses the buffer size in MB and 4 kB blocks. For small buffers the kB part may make a difference, so count it in. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-01nfp: report RJ45 connector in ethtoolJakub Kicinski2-0/+4
Add support for reporting twisted pair port type. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-01lan743x: Fix TX Stall IssueBryan Whitehead1-4/+12
It has been observed that tx queue stalls while downloading from certain web sites (example www.speedtest.net) The cause has been tracked down to a corner case where dma descriptors where not setup properly. And there for a tx completion interrupt was not signaled. This fix corrects the problem by properly marking the end of a multi descriptor transmission. Fixes: 23f0703c125b ("lan743x: Add main source files for new lan743x driver") Signed-off-by: Bryan Whitehead <Bryan.Whitehead@microchip.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-01net: phy: phylink: fix uninitialized variable in phylink_get_mac_stateHeiner Kallweit1-0/+4
When debugging an issue I found implausible values in state->pause. Reason in that state->pause isn't initialized and later only single bits are changed. Also the struct itself isn't initialized in phylink_resolve(). So better initialize state->pause and other not yet initialized fields. v2: - use right function name in subject v3: - initialize additional fields Fixes: 9525ae83959b ("phylink: add phylink infrastructure") Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-01net: aquantia: regression on cpus with high cores: set mode with 8 queuesDmitry Bogdanov4-0/+29
Recently the maximum number of queues was increased up to 8, but NIC was not fully configured for 8 queues. In setups with more than 4 CPU cores parts of TX traffic gets lost if the kernel routes it to queues 4th-8th. This patch sets a tx hw traffic mode with 8 queues. Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=202651 Fixes: 71a963cfc50b ("net: aquantia: increase max number of hw queues") Reported-by: Nicholas Johnson <nicholas.johnson@outlook.com.au> Signed-off-by: Dmitry Bogdanov <dmitry.bogdanov@aquantia.com> Signed-off-by: Igor Russkikh <igor.russkikh@aquantia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-01selftests: fixes for UDP GROPaolo Abeni2-17/+33
The current implementation for UDP GRO tests is racy: the receiver may flush the RX queue while the sending is still transmitting and incorrectly report RX errors, with a wrong number of packet received. Add explicit timeouts to the receiver for both connection activation (first packet received for UDP) and reception completion, so that in the above critical scenario the receiver will wait for the transfer completion. Fixes: 3327a9c46352 ("selftests: add functionals test for UDP GRO") Signed-off-by: Paolo Abeni <pabeni@redhat.com> Acked-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-01net: marvell: neta: disable comphy when setting modeMarek Behún1-5/+23
The comphy driver for Armada 3700 by Miquèl Raynal (which is currently in linux-next) does not actually set comphy mode when phy_set_mode_ext is called. The mode is set at next call of phy_power_on. Update the driver to semantics similar to mvpp2: helper mvneta_comphy_init sets comphy mode and powers it on. When mode is to be changed in mvneta_mac_config, first power the comphy off, then call mvneta_comphy_init (which sets the mode to new one). Only do this when new mode is different from old mode. This should also work for Armada 38x, since in that comphy driver methods power_on and power_off are unimplemented. Signed-off-by: Marek Behún <marek.behun@nic.cz> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-01Merge branch 'enetc-Add-mdio-support-and-device-tree-nodes'David S. Miller7-1/+340
Claudiu Manoil says: ==================== enetc: Add mdio support and device tree nodes This is the missing part to enable PCI probing of the ENETC ethernet ports on the LS1028A SoC and external traffic on the LS1028A RDB board. It's one of the first items on the TODO list for the recently merged ENETC ethernet driver. v3: Add DT bindings doc for ENETC connections v4: none ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-01dt-bindings: net: freescale: enetc: Add connection bindings for ENETC ↵Claudiu Manoil1-0/+69
ethernet nodes Define connection bindings (external PHY connections and internal links) for the ENETC on-chip ethernet controllers. Signed-off-by: Claudiu Manoil <claudiu.manoil@nxp.com> Reviewed-by: Rob Herring <robh@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-01enetc: Add ENETC PF level external MDIO supportClaudiu Manoil4-1/+219
Each ENETC PF has its own MDIO interface, the corresponding MDIO registers are mapped in the ENETC's Port register block. The current patch adds a driver for these PF level MDIO buses, so that each PF can manage directly its own external link. Signed-off-by: Alex Marginean <alexandru.marginean@nxp.com> Signed-off-by: Claudiu Manoil <claudiu.manoil@nxp.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-01arm64: dts: fsl: ls1028a-rdb: Add ENETC external eth ports for the LS1028A ↵Claudiu Manoil1-0/+17
RDB board The LS1028A RDB board features an Atheros PHY connected over SGMII to the ENETC PF0 (or Port0). ENETC Port1 (PF1) has no external connection on this board, so it can be disabled for now. Signed-off-by: Alex Marginean <alexandru.marginean@nxp.com> Signed-off-by: Claudiu Manoil <claudiu.manoil@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-01arm64: dts: fsl: ls1028a: Add PCI IERC node and ENETC endpointsClaudiu Manoil1-0/+35
The LS1028A SoC features a PCI Integrated Endpoint Root Complex (IERC) defining several integrated PCI devices, including the ENETC ethernet controller integrated endpoints (IEPs). The IERC implements ECAM (Enhanced Configuration Access Mechanism) to provide access to the PCIe config space of the IEPs. This means the the IEPs (including ENETC) do not support the standard PCIe BARs, instead the Enhanced Allocation (EA) capability structures in the ECAM space are used to fix the base addresses in the system, and the PCI subsystem uses these structures for device enumeration and discovery. The "ranges" entries contain basic information from these EA capabily structures required by the kernel for device enumeration. The current patch also enables the first 2 ENETC PFs (Physiscal Functions) and the associated VFs (Virtual Functions), 2 VFs for each PF. Each of these ENETC PFs has an external ethernet port on the LS1028A SoC. Signed-off-by: Alex Marginean <alexandru.marginean@nxp.com> Signed-off-by: Claudiu Manoil <claudiu.manoil@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-01bpf: drop refcount if bpf_map_new_fd() fails in map_create()Peng Sun1-2/+2
In bpf/syscall.c, map_create() first set map->usercnt to 1, a file descriptor is supposed to return to userspace. When bpf_map_new_fd() fails, drop the refcount. Fixes: bd5f5f4ecb78 ("bpf: Add BPF_MAP_GET_FD_BY_ID") Signed-off-by: Peng Sun <sironhide0null@gmail.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-03-01netfilter: nf_tables: merge ipv4 and ipv6 nat chain typesFlorian Westphal9-194/+111
Merge the ipv4 and ipv6 nat chain type. This is the last missing piece which allows to provide inet family support for nat in a follow patch. The kconfig knobs for ipv4/ipv6 nat chain are removed, the nat chain type will be built unconditionally if NFT_NAT expression is enabled. Before: text data bss dec hex filename 1576 896 0 2472 9a8 nft_chain_nat_ipv4.ko 1697 896 0 2593 a21 nft_chain_nat_ipv6.ko After: text data bss dec hex filename 1832 896 0 2728 aa8 nft_chain_nat.ko Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-03-01netfilter: nf_tables: nat: merge nft_masq protocol specific modulesFlorian Westphal9-236/+168
The family specific masq modules are way too small to warrant an extra module, just place all of them in nft_masq. before: text data bss dec hex filename 1001 832 0 1833 729 nft_masq.ko 766 896 0 1662 67e nft_masq_ipv4.ko 764 896 0 1660 67c nft_masq_ipv6.ko after: 2010 960 0 2970 b9a nft_masq.ko Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-03-01netfilter: nf_tables: nat: merge nft_redir protocol specific modulesFlorian Westphal9-217/+143
before: text data bss dec hex filename 990 832 0 1822 71e nft_redir.ko 697 896 0 1593 639 nft_redir_ipv4.ko 713 896 0 1609 649 nft_redir_ipv6.ko after: text data bss dec hex filename 1910 960 0 2870 b36 nft_redir.ko size is reduced, all helpers from nft_redir.ko can be made static. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-03-01netfilter: xt_IDLETIMER: fix sysfs callback function typeSami Tolvanen1-10/+4
Use struct device_attribute instead of struct idletimer_tg_attr, and the correct callback function type to avoid indirect call mismatches with Control Flow Integrity checking. Signed-off-by: Sami Tolvanen <samitolvanen@google.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-03-01netfilter: nf_conntrack: ensure that CONNTRACK_LOCKS is power of 2Li RongQing1-0/+1
CONNTRACK_LOCKS is divisor when computer array index, if it is power of 2, compiler will optimize modulo operation as bitwise AND, or else modulo will lower performance. Suggested-by: Florian Westphal <fw@strlen.de> Signed-off-by: Li RongQing <lirongqing@baidu.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-03-01netfilter: nf_tables: check the result of dereferencing base_chain->statsLi RongQing1-6/+8
Check the result of dereferencing base_chain->stats, instead of result of this_cpu_ptr with NULL. base_chain->stats maybe be changed to NULL when a chain is updated and a new NULL counter can be attached. And we do not need to check returning of this_cpu_ptr since base_chain->stats is from percpu allocator if it is non-NULL, this_cpu_ptr returns a valid value. And fix two sparse error by replacing rcu_access_pointer and rcu_dereference with READ_ONCE under rcu_read_lock. Thanks for Eric's help to finish this patch. Fixes: 009240940e84c1 ("netfilter: nf_tables: don't assume chain stats are set when jumplabel is set") Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Zhang Yu <zhangyu31@baidu.com> Signed-off-by: Li RongQing <lirongqing@baidu.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-03-01netfilter: bridge: Don't sabotage nf_hook calls for an l3mdev slaveDavid Ahern1-1/+2
Followup to a173f066c7cf ("netfilter: bridge: Don't sabotage nf_hook calls from an l3mdev"). Some packets (e.g., ndisc) do not have the skb device flipped to the l3mdev (e.g., VRF) device. Update ip_sabotage_in to not drop packets for slave devices too. Currently, neighbor solicitation packets for 'dev -> bridge (addr) -> vrf' setups are getting dropped. This patch enables IPv6 communications for bridges with an address that are enslaved to a VRF. Fixes: 73e20b761acf ("net: vrf: Add support for PREROUTING rules on vrf device") Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-03-01ipvs: get sctphdr by sctphoff in sctp_csum_checkXin Long1-5/+2
sctp_csum_check() is called by sctp_s/dnat_handler() where it calls skb_make_writable() to ensure sctphdr to be linearized. So there's no need to get sctphdr by calling skb_header_pointer() in sctp_csum_check(). Signed-off-by: Xin Long <lucien.xin@gmail.com> Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Acked-by: Julian Anastasov <ja@ssi.bg> Acked-by: Simon Horman <horms@verge.net.au> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-03-01netfilter: convert the proto argument from u8 to u16Li RongQing3-7/+7
The proto in struct xt_match and struct xt_target is u16, when calling xt_check_target/match, their proto argument is u8, and will cause truncation, it is harmless to ip packet, since ip proto is u8 if a etable's match/target has proto that is u16, will cause the check failure. and convert be16 to short in bridge/netfilter/ebtables.c Signed-off-by: Zhang Yu <zhangyu31@baidu.com> Signed-off-by: Li RongQing <lirongqing@baidu.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-03-01netfilter: nft_tunnel: Add dst_cache supportwenxu1-0/+7
The metadata_dst does not initialize the dst_cache field, this causes problems to ip_md_tunnel_xmit() since it cannot use this cache, hence, Triggering a route lookup for every packet. Signed-off-by: wenxu <wenxu@ucloud.cn> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-03-01netfilter: conntrack: tcp: only close if RST matches exact sequenceFlorian Westphal1-10/+40
TCP resets cause instant transition from established to closed state provided the reset is in-window. Endpoints that implement RFC 5961 require resets to match the next expected sequence number. RST segments that are in-window (but that do not match RCV.NXT) are ignored, and a "challenge ACK" is sent back. Main problem for conntrack is that its a middlebox, i.e. whereas an end host might have ACK'd SEQ (and would thus accept an RST with this sequence number), conntrack might not have seen this ACK (yet). Therefore we can't simply flag RSTs with non-exact match as invalid. This updates RST processing as follows: 1. If the connection is in a state other than ESTABLISHED, nothing is changed, RST is subject to normal in-window check. 2. If the RSTs sequence number either matches exactly RCV.NXT, connection state moves to CLOSE. 3. The same applies if the RST sequence number aligns with a previous packet in the same direction. In all other cases, the connection remains in ESTABLISHED state. If the normal-in-window check passes, the timeout will be lowered to that of CLOSE. If the peer sends a challenge ack, connection timeout will be reset. If the challenge ACK triggers another RST (RST was valid after all), this 2nd RST will match expected sequence and conntrack state changes to CLOSE. If no challenge ACK is received, the connection will time out after CLOSE seconds (10 seconds by default), just like without this patch. Packetdrill test case: 0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3 0.000 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0 0.000 bind(3, ..., ...) = 0 0.000 listen(3, 1) = 0 0.100 < S 0:0(0) win 32792 <mss 1460,sackOK,nop,nop,nop,wscale 7> 0.100 > S. 0:0(0) ack 1 win 64240 <mss 1460,nop,nop,sackOK,nop,wscale 7> 0.200 < . 1:1(0) ack 1 win 257 0.200 accept(3, ..., ...) = 4 // Receive a segment. 0.210 < P. 1:1001(1000) ack 1 win 46 0.210 > . 1:1(0) ack 1001 // Application writes 1000 bytes. 0.250 write(4, ..., 1000) = 1000 0.250 > P. 1:1001(1000) ack 1001 // First reset, old sequence. Conntrack (correctly) considers this // invalid due to failed window validation (regardless of this patch). 0.260 < R 2:2(0) ack 1001 win 260 // 2nd reset, but too far ahead sequence. Same: correctly handled // as invalid. 0.270 < R 99990001:99990001(0) ack 1001 win 260 // in-window, but not exact sequence. // Current Linux kernels might reply with a challenge ack, and do not // remove connection. // Without this patch, conntrack state moves to CLOSE. // With patch, timeout is lowered like CLOSE, but connection stays // in ESTABLISHED state. 0.280 < R 1010:1010(0) ack 1001 win 260 // Expect challenge ACK 0.281 > . 1001:1001(0) ack 1001 win 501 // With or without this patch, RST will cause connection // to move to CLOSE (sequence number matches) // 0.282 < R 1001:1001(0) ack 1001 win 260 // ACK 0.300 < . 1001:1001(0) ack 1001 win 257 // more data could be exchanged here, connection // is still established // Client closes the connection. 0.610 < F. 1001:1001(0) ack 1001 win 260 0.650 > . 1001:1001(0) ack 1002 // Close the connection without reading outstanding data 0.700 close(4) = 0 // so one more reset. Will be deemed acceptable with patch as well: // connection is already closing. 0.701 > R. 1001:1001(0) ack 1002 win 501 // End packetdrill test case. With patch, this generates following conntrack events: [NEW] 120 SYN_SENT src=10.0.2.1 dst=10.0.0.1 sport=5437 dport=80 [UNREPLIED] [UPDATE] 60 SYN_RECV src=10.0.2.1 dst=10.0.0.1 sport=5437 dport=80 [UPDATE] 432000 ESTABLISHED src=10.0.2.1 dst=10.0.0.1 sport=5437 dport=80 [ASSURED] [UPDATE] 120 FIN_WAIT src=10.0.2.1 dst=10.0.0.1 sport=5437 dport=80 [ASSURED] [UPDATE] 60 CLOSE_WAIT src=10.0.2.1 dst=10.0.0.1 sport=5437 dport=80 [ASSURED] [UPDATE] 10 CLOSE src=10.0.2.1 dst=10.0.0.1 sport=5437 dport=80 [ASSURED] Without patch, first RST moves connection to close, whereas socket state does not change until FIN is received. [NEW] 120 SYN_SENT src=10.0.2.1 dst=10.0.0.1 sport=5141 dport=80 [UNREPLIED] [UPDATE] 60 SYN_RECV src=10.0.2.1 dst=10.0.0.1 sport=5141 dport=80 [UPDATE] 432000 ESTABLISHED src=10.0.2.1 dst=10.0.0.1 sport=5141 dport=80 [ASSURED] [UPDATE] 10 CLOSE src=10.0.2.1 dst=10.0.0.1 sport=5141 dport=80 [ASSURED] Cc: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-03-01ipvs: change some data types from int to boolAndrea Claudi5-18/+18
Change the data type of the following variables from int to bool across ipvs code: - found - loop - need_full_dest - need_full_svc - payload_csum Also change the following functions to use bool full_entry param instead of int: - ip_vs_genl_parse_dest() - ip_vs_genl_parse_service() This patch does not change any functionality but makes the source code slightly easier to read. Signed-off-by: Andrea Claudi <aclaudi@redhat.com> Acked-by: Julian Anastasov <ja@ssi.bg> Acked-by: Simon Horman <horms@verge.net.au> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-03-01Merge branch 'bpf-dedup-fixes'Daniel Borkmann4-23/+103
Andrii Nakryiko says: ==================== This patchset fixes a bug in btf_dedup() algorithm, which under specific hash collision causes infinite loop. It also exposes ability to tune BTF deduplication table size, with double purpose of allowing applications to adjust size according to the size of BTF data, as well as allowing a simple way to force hash collisions by setting table size to 1. - Patch #1 fixes bug in btf_dedup testing code that's checking strings - Patch #2 fixes pointer arg formatting in btf.h - Patch #3 adds option to specify custom dedup table size - Patch #4 fixes aforementioned bug in btf_dedup - Patch #5 adds test that validates the fix v1->v2: - remove "Fixes" from formatting change patch - extract roundup_pow2_max func for dedup table size - btf_equal_struct -> btf_shallow_equal_struct - explain in comment why we can't rely on just btf_dedup_is_equiv ==================== Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-03-01selftests/bpf: add btf_dedup test of FWD/STRUCT resolutionAndrii Nakryiko1-0/+45
This patch adds a btf_dedup test exercising logic of STRUCT<->FWD resolution and validating that STRUCT is not resolved to a FWD. It also forces hash collisions, forcing both FWD and STRUCT to be candidates for each other. Previously this condition caused infinite loop due to FWD pointing to STRUCT and STRUCT pointing to its FWD. Reported-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: Andrii Nakryiko <andriin@fb.com> Acked-by: Song Liu <songliubraving@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-03-01btf: fix bug with resolving STRUCT/UNION into corresponding FWDAndrii Nakryiko1-3/+17
When checking available canonical candidates for struct/union algorithm utilizes btf_dedup_is_equiv to determine if candidate is suitable. This check is not enough when candidate is corresponding FWD for that struct/union, because according to equivalence logic they are equivalent. When it so happens that FWD and STRUCT/UNION end in hashing to the same bucket, it's possible to create remapping loop from FWD to STRUCT and STRUCT to same FWD, which will cause btf_dedup() to loop forever. This patch fixes the issue by additionally checking that type and canonical candidate are strictly equal (utilizing btf_equal_struct). Fixes: d5caef5b5655 ("btf: add BTF types deduplication algorithm") Reported-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: Andrii Nakryiko <andriin@fb.com> Acked-by: Song Liu <songliubraving@fb.com> Acked-by: Yonghong Song <yhs@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>