summaryrefslogtreecommitdiff
path: root/include/net
AgeCommit message (Collapse)AuthorFilesLines
2021-07-24bpf: tcp: seq_file: Remove bpf_seq_afinfo from tcp_iter_stateMartin KaFai Lau1-1/+0
A following patch will create a separate struct to store extra bpf_iter state and it will embed the existing tcp_iter_state like this: struct bpf_tcp_iter_state { struct tcp_iter_state state; /* More bpf_iter specific states here ... */ } As a prep work, this patch removes the "struct tcp_seq_afinfo *bpf_seq_afinfo" where its purpose is to tell if it is iterating from bpf_iter instead of proc fs. Currently, if "*bpf_seq_afinfo" is not NULL, it is iterating from bpf_iter. The kernel should not filter by the addr family and leave this filtering decision to the bpf prog. Instead of adding a "*bpf_seq_afinfo" pointer, this patch uses the "seq->op == &bpf_iter_tcp_seq_ops" test to tell if it is iterating from the bpf iter. The bpf_iter_(init|fini)_tcp() is left here to prepare for the change of a following patch. Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Reviewed-by: Eric Dumazet <edumazet@google.com> Acked-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/bpf/20210701200554.1034982-1-kafai@fb.com
2021-07-23net: dsa: add support for bridge TX forwarding offloadVladimir Oltean1-0/+18
For a DSA switch, to offload the forwarding process of a bridge device means to send the packets coming from the software bridge as data plane packets. This is contrary to everything that DSA has done so far, because the current taggers only know to send control packets (ones that target a specific destination port), whereas data plane packets are supposed to be forwarded according to the FDB lookup, much like packets ingressing on any regular ingress port. If the FDB lookup process returns multiple destination ports (flooding, multicast), then replication is also handled by the switch hardware - the bridge only sends a single packet and avoids the skb_clone(). DSA keeps for each bridge port a zero-based index (the number of the bridge). Multiple ports performing TX forwarding offload to the same bridge have the same dp->bridge_num value, and ports not offloading the TX data plane of a bridge have dp->bridge_num = -1. The tagger can check if the packet that is being transmitted on has skb->offload_fwd_mark = true or not. If it does, it can be sure that the packet belongs to the data plane of a bridge, further information about which can be obtained based on dp->bridge_dev and dp->bridge_num. It can then compose a DSA tag for injecting a data plane packet into that bridge number. For the switch driver side, we offer two new dsa_switch_ops methods, called .port_bridge_fwd_offload_{add,del}, which are modeled after .port_bridge_{join,leave}. These methods are provided in case the driver needs to configure the hardware to treat packets coming from that bridge software interface as data plane packets. The switchdev <-> bridge interaction happens during the netdev_master_upper_dev_link() call, so to switch drivers, the effect is that the .port_bridge_fwd_offload_add() method is called immediately after .port_bridge_join(). If the bridge number exceeds the number of bridges for which the switch driver can offload the TX data plane (and this includes the case where the driver can offload none), DSA falls back to simply returning tx_fwd_offload = false in the switchdev_bridge_port_offload() call. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-23net: dsa: track the number of switches in a treeVladimir Oltean1-0/+3
In preparation of supporting data plane forwarding on behalf of a software bridge, some drivers might need to view bridges as virtual switches behind the CPU port in a cross-chip topology. Give them some help and let them know how many physical switches there are in the tree, so that they can count the virtual switches starting from that number on. Note that the first dsa_switch_ops method where this information is reliably available is .setup(). This is because of how DSA works: in a tree with 3 switches, each calling dsa_register_switch(), the first 2 will advance until dsa_tree_setup() -> dsa_tree_setup_routing_table() and exit with error code 0 because the topology is not complete. Since probing is parallel at this point, one switch does not know about the existence of the other. Then the third switch comes, and for it, dsa_tree_setup_routing_table() returns complete = true. This switch goes ahead and calls dsa_tree_setup_switches() for everybody else, calling their .setup() methods too. This acts as the synchronization point. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-23Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netDavid S. Miller1-1/+0
Conflicts are simple overlapping changes. Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-21net: ipv4: Consolidate ipv4_mtu and ip_dst_mtu_maybe_forwardVadim Fedorenko1-4/+18
Consolidate IPv4 MTU code the same way it is done in IPv6 to have code aligned in both address families Signed-off-by: Vadim Fedorenko <vfedorenko@novek.ru> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-21net: ipv6: introduce ip6_dst_mtu_maybe_forwardVadim Fedorenko1-2/+3
Replace ip6_dst_mtu_forward with ip6_dst_mtu_maybe_forward and reuse this code in ip6_mtu. Actually these two functions were almost duplicates, this change will simplify the maintaince of mtu calculation code. Signed-off-by: Vadim Fedorenko <vfedorenko@novek.ru> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-21ipv6: ioam: Support for IOAM injection with lwtunnelsJustin Iurman1-0/+3
Add support for the IOAM inline insertion (only for the host-to-host use case) which is per-route configured with lightweight tunnels. The target is iproute2 and the patch is ready. It will be posted as soon as this patchset is merged. Here is an overview: $ ip -6 ro ad fc00::1/128 encap ioam6 trace type 0x800000 ns 1 size 12 dev eth0 This example configures an IOAM Pre-allocated Trace option attached to the fc00::1/128 prefix. The IOAM namespace (ns) is 1, the size of the pre-allocated trace data block is 12 octets (size) and only the first IOAM data (bit 0: hop_limit + node id) is included in the trace (type) represented as a bitfield. The reason why the in-transit (IPv6-in-IPv6 encapsulation) use case is not implemented is explained on the patchset cover. Signed-off-by: Justin Iurman <justin.iurman@uliege.be> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-21ipv6: ioam: Data plane support for Pre-allocated TraceJustin Iurman2-0/+67
Implement support for processing the IOAM Pre-allocated Trace with IPv6, see [1] and [2]. Introduce a new IPv6 Hop-by-Hop TLV option, see IANA [3]. A new per-interface sysctl is introduced. The value is a boolean to accept (=1) or ignore (=0, by default) IPv6 IOAM options on ingress for an interface: - net.ipv6.conf.XXX.ioam6_enabled Two other sysctls are introduced to define IOAM IDs, represented by an integer. They are respectively per-namespace and per-interface: - net.ipv6.ioam6_id - net.ipv6.conf.XXX.ioam6_id The value of the first one represents the IOAM ID of the node itself (u32; max and default value = U32_MAX>>8, due to hop limit concatenation) while the other represents the IOAM ID of an interface (u16; max and default value = U16_MAX). Each "ioam6_id" sysctl has a "_wide" equivalent: - net.ipv6.ioam6_id_wide - net.ipv6.conf.XXX.ioam6_id_wide The value of the first one represents the wide IOAM ID of the node itself (u64; max and default value = U64_MAX>>8, due to hop limit concatenation) while the other represents the wide IOAM ID of an interface (u32; max and default value = U32_MAX). The use of short and wide equivalents is not exclusive, a deployment could choose to leverage both. For example, net.ipv6.conf.XXX.ioam6_id (short format) could be an identifier for a physical interface, whereas net.ipv6.conf.XXX.ioam6_id_wide (wide format) could be an identifier for a logical sub-interface. Documentation about new sysctls is provided at the end of this patchset. Two relativistic hash tables are used: one for IOAM namespaces, the other for IOAM schemas. A namespace can only have a single active schema and a schema can only be attached to a single namespace (1:1 relationship). [1] https://tools.ietf.org/html/draft-ietf-ippm-ioam-ipv6-options [2] https://tools.ietf.org/html/draft-ietf-ippm-ioam-data [3] https://www.iana.org/assignments/ipv6-parameters/ipv6-parameters.xhtml#ipv6-parameters-2 Signed-off-by: Justin Iurman <justin.iurman@uliege.be> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-21net: switchdev: remove stray semicolon in switchdev_handle_fdb_del_to_device ↵Vladimir Oltean1-1/+1
shim With the semicolon at the end, the compiler sees the shim function as a declaration and not as a definition, and warns: 'switchdev_handle_fdb_del_to_device' declared 'static' but never defined Reported-by: kernel test robot <lkp@intel.com> Fixes: 8ca07176ab00 ("net: switchdev: introduce a fanout helper for SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICE") Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Tested-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-21xfrm: Add possibility to set the default to block if we have no policySteffen Klassert2-6/+37
As the default we assume the traffic to pass, if we have no matching IPsec policy. With this patch, we have a possibility to change this default from allow to block. It can be configured via netlink. Each direction (input/output/forward) can be configured separately. With the default to block configuered, we need allow policies for all packet flows we accept. We do not use default policy lookup for the loopback device. v1->v2 - fix compiling when XFRM is disabled - Reported-by: kernel test robot <lkp@intel.com> Co-developed-by: Christian Langrock <christian.langrock@secunet.com> Signed-off-by: Christian Langrock <christian.langrock@secunet.com> Co-developed-by: Antony Antony <antony.antony@secunet.com> Signed-off-by: Antony Antony <antony.antony@secunet.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2021-07-20net: switchdev: introduce a fanout helper for SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICEVladimir Oltean1-0/+56
Currently DSA has an issue with FDB entries pointing towards the bridge in the presence of br_fdb_replay() being called at port join and leave time. In particular, each bridge port will ask for a replay for the FDB entries pointing towards the bridge when it joins, and for another replay when it leaves. This means that for example, a bridge with 4 switch ports will notify DSA 4 times of the bridge MAC address. But if the MAC address of the bridge changes during the normal runtime of the system, the bridge notifies switchdev [ once ] of the deletion of the old MAC address as a local FDB towards the bridge, and of the insertion [ again once ] of the new MAC address as a local FDB. This is a problem, because DSA keeps the old MAC address as a host FDB entry with refcount 4 (4 ports asked for it using br_fdb_replay). So the old MAC address will not be deleted. Additionally, the new MAC address will only be installed with refcount 1, and when the first switch port leaves the bridge (leaving 3 others as still members), it will delete with it the new MAC address of the bridge from the local FDB entries kept by DSA (because the br_fdb_replay call on deletion will bring the entry's refcount from 1 to 0). So the problem, really, is that the number of br_fdb_replay() calls is not matched with the refcount that a host FDB is offloaded to DSA during normal runtime. An elegant way to solve the problem would be to make the switchdev notification emitted by br_fdb_change_mac_address() result in a host FDB kept by DSA which has a refcount exactly equal to the number of ports under that bridge. Then, no matter how many DSA ports join or leave that bridge, the host FDB entry will always be deleted when there are exactly zero remaining DSA switch ports members of the bridge. To implement the proposed solution, we remember that the switchdev objects and port attributes have some helpers provided by switchdev, which can be optionally called by drivers: switchdev_handle_port_obj_{add,del} and switchdev_handle_port_attr_set. These helpers: - fan out a switchdev object/attribute emitted for the bridge towards all the lower interfaces that pass the check_cb(). - fan out a switchdev object/attribute emitted for a bridge port that is a LAG towards all the lower interfaces that pass the check_cb(). In other words, this is the model we need for the FDB events too: something that will keep an FDB entry emitted towards a physical port as it is, but translate an FDB entry emitted towards the bridge into N FDB entries, one per physical port. Of course, there are many differences between fanning out a switchdev object (VLAN) on 3 lower interfaces of a LAG and fanning out an FDB entry on 3 lower interfaces of a LAG. Intuitively, an FDB entry towards a LAG should be treated specially, because FDB entries are unicast, we can't just install the same address towards 3 destinations. It is imaginable that drivers might want to treat this case specifically, so create some methods for this case and do not recurse into the LAG lower ports, just the bridge ports. DSA also listens for FDB entries on "foreign" interfaces, aka interfaces bridged with us which are not part of our hardware domain: think an Ethernet switch bridged with a Wi-Fi AP. For those addresses, DSA installs host FDB entries. However, there we have the same problem (those host FDB entries are installed with a refcount of only 1) and an even bigger one which we did not have with FDB entries towards the bridge: br_fdb_replay() is currently not called for FDB entries on foreign interfaces, just for the physical port and for the bridge itself. So when DSA sniffs an address learned by the software bridge towards a foreign interface like an e1000 port, and then that e1000 leaves the bridge, DSA remains with the dangling host FDB address. That will be fixed separately by replaying all FDB entries and not just the ones towards the port and the bridge. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-20net: switchdev: introduce helper for checking dynamically learned FDB entriesVladimir Oltean1-0/+6
It is a bit difficult to understand what DSA checks when it tries to avoid installing dynamically learned addresses on foreign interfaces as local host addresses, so create a generic switchdev helper that can be reused and is generally more readable. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-20net: dsa: make tag_8021q operations part of the coreVladimir Oltean1-0/+7
Make tag_8021q a more central element of DSA and move the 2 driver specific operations outside of struct dsa_8021q_context (which is supposed to hold dynamic data and not really constant function pointers). Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-20net: dsa: let the core manage the tag_8021q contextVladimir Oltean1-0/+3
The basic problem description is as follows: Be there 3 switches in a daisy chain topology: | sw0p0 sw0p1 sw0p2 sw0p3 sw0p4 [ user ] [ user ] [ user ] [ dsa ] [ cpu ] | +---------+ | sw1p0 sw1p1 sw1p2 sw1p3 sw1p4 [ user ] [ user ] [ user ] [ dsa ] [ dsa ] | +---------+ | sw2p0 sw2p1 sw2p2 sw2p3 sw2p4 [ user ] [ user ] [ user ] [ user ] [ dsa ] The CPU will not be able to ping through the user ports of the bottom-most switch (like for example sw2p0), simply because tag_8021q was not coded up for this scenario - it has always assumed DSA switch trees with a single switch. To add support for the topology above, we must admit that the RX VLAN of sw2p0 must be added on some ports of switches 0 and 1 as well. This is in fact a textbook example of thing that can use the cross-chip notifier framework that DSA has set up in switch.c. There is only one problem: core DSA (switch.c) is not able right now to make the connection between a struct dsa_switch *ds and a struct dsa_8021q_context *ctx. Right now, it is drivers who call into tag_8021q.c and always provide a struct dsa_8021q_context *ctx pointer, and tag_8021q.c calls them back with the .tag_8021q_vlan_{add,del} methods. But with cross-chip notifiers, it is possible for tag_8021q to call drivers without drivers having ever asked for anything. A good example is right above: when sw2p0 wants to set itself up for tag_8021q, the .tag_8021q_vlan_add method needs to be called for switches 1 and 0, so that they transport sw2p0's VLANs towards the CPU without dropping them. So instead of letting drivers manage the tag_8021q context, add a tag_8021q_ctx pointer inside of struct dsa_switch, which will be populated when dsa_tag_8021q_register() returns success. The patch is fairly long-winded because we are partly reverting commit 5899ee367ab3 ("net: dsa: tag_8021q: add a context structure") which made the driver-facing tag_8021q API use "ctx" instead of "ds". Now that we can access "ctx" directly from "ds", this is no longer needed. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-20net/tcp_fastopen: remove tcp_fastopen_ctx_lockEric Dumazet1-1/+0
Remove the (per netns) spinlock in favor of xchg() atomic operations. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Neal Cardwell <ncardwell@google.com> Acked-by: Wei Wang <weiwan@google.com> Link: https://lore.kernel.org/r/20210719101107.3203943-1-eric.dumazet@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-07-20net/tcp_fastopen: remove obsolete externEric Dumazet1-1/+0
After cited commit, sysctl_tcp_fastopen_blackhole_timeout is no longer a global variable. Fixes: 3733be14a32b ("ipv4: Namespaceify tcp_fastopen_blackhole_timeout knob") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Haishuang Yan <yanhaishuang@cmss.chinamobile.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Neal Cardwell <ncardwell@google.com> Acked-by: Wei Wang <weiwan@google.com> Link: https://lore.kernel.org/r/20210719092028.3016745-1-eric.dumazet@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-07-16Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextDavid S. Miller2-0/+17
Alexei Starovoitov says: ==================== pull-request: bpf-next 2021-07-15 The following pull-request contains BPF updates for your *net-next* tree. We've added 45 non-merge commits during the last 15 day(s) which contain a total of 52 files changed, 3122 insertions(+), 384 deletions(-). The main changes are: 1) Introduce bpf timers, from Alexei. 2) Add sockmap support for unix datagram socket, from Cong. 3) Fix potential memleak and UAF in the verifier, from He. 4) Add bpf_get_func_ip helper, from Jiri. 5) Improvements to generic XDP mode, from Kumar. 6) Support for passing xdp_md to XDP programs in bpf_prog_run, from Zvi. =================== Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-16af_unix: Implement unix_dgram_bpf_recvmsg()Cong Wang1-0/+2
We have to implement unix_dgram_bpf_recvmsg() to replace the original ->recvmsg() to retrieve skmsg from ingress_msg. AF_UNIX is again special here because the lack of sk_prot->recvmsg(). I simply add a special case inside unix_dgram_recvmsg() to call sk->sk_prot->recvmsg() directly. Signed-off-by: Cong Wang <cong.wang@bytedance.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20210704190252.11866-8-xiyou.wangcong@gmail.com
2021-07-16af_unix: Implement ->psock_update_sk_prot()Cong Wang1-0/+10
Now we can implement unix_bpf_update_proto() to update sk_prot, especially prot->close(). Signed-off-by: Cong Wang <cong.wang@bytedance.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20210704190252.11866-7-xiyou.wangcong@gmail.com
2021-07-14Merge tag 'net-5.14-rc2' of ↵Linus Torvalds11-212/+28
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Jakub Kicinski. "Including fixes from bpf and netfilter. Current release - regressions: - sock: fix parameter order in sock_setsockopt() Current release - new code bugs: - netfilter: nft_last: - fix incorrect arithmetic when restoring last used - honor NFTA_LAST_SET on restoration Previous releases - regressions: - udp: properly flush normal packet at GRO time - sfc: ensure correct number of XDP queues; don't allow enabling the feature if there isn't sufficient resources to Tx from any CPU - dsa: sja1105: fix address learning getting disabled on the CPU port - mptcp: addresses a rmem accounting issue that could keep packets in subflow receive buffers longer than necessary, delaying MPTCP-level ACKs - ip_tunnel: fix mtu calculation for ETHER tunnel devices - do not reuse skbs allocated from skbuff_fclone_cache in the napi skb cache, we'd try to return them to the wrong slab cache - tcp: consistently disable header prediction for mptcp Previous releases - always broken: - bpf: fix subprog poke descriptor tracking use-after-free - ipv6: - allocate enough headroom in ip6_finish_output2() in case iptables TEE is used - tcp: drop silly ICMPv6 packet too big messages to avoid expensive and pointless lookups (which may serve as a DDOS vector) - make sure fwmark is copied in SYNACK packets - fix 'disable_policy' for forwarded packets (align with IPv4) - netfilter: conntrack: - do not renew entry stuck in tcp SYN_SENT state - do not mark RST in the reply direction coming after SYN packet for an out-of-sync entry - mptcp: cleanly handle error conditions with MP_JOIN and syncookies - mptcp: fix double free when rejecting a join due to port mismatch - validate lwtstate->data before returning from skb_tunnel_info() - tcp: call sk_wmem_schedule before sk_mem_charge in zerocopy path - mt76: mt7921: continue to probe driver when fw already downloaded - bonding: fix multiple issues with offloading IPsec to (thru?) bond - stmmac: ptp: fix issues around Qbv support and setting time back - bcmgenet: always clear wake-up based on energy detection Misc: - sctp: move 198 addresses from unusable to private scope - ptp: support virtual clocks and timestamping - openvswitch: optimize operation for key comparison" * tag 'net-5.14-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (158 commits) net: dsa: properly check for the bridge_leave methods in dsa_switch_bridge_leave() sfc: add logs explaining XDP_TX/REDIRECT is not available sfc: ensure correct number of XDP queues sfc: fix lack of XDP TX queues - error XDP TX failed (-22) net: fddi: fix UAF in fza_probe net: dsa: sja1105: fix address learning getting disabled on the CPU port net: ocelot: fix switchdev objects synced for wrong netdev with LAG offload net: Use nlmsg_unicast() instead of netlink_unicast() octeontx2-pf: Fix uninitialized boolean variable pps ipv6: allocate enough headroom in ip6_finish_output2() net: hdlc: rename 'mod_init' & 'mod_exit' functions to be module-specific net: bridge: multicast: fix MRD advertisement router port marking race net: bridge: multicast: fix PIM hello router port marking race net: phy: marvell10g: fix differentiation of 88X3310 from 88X3340 dsa: fix for_each_child.cocci warnings virtio_net: check virtqueue_add_sgs() return value mptcp: properly account bulk freed memory selftests: mptcp: fix case multiple subflows limited by server mptcp: avoid processing packet if a subflow reset mptcp: fix syncookie process if mptcp can not_accept new subflow ...
2021-07-10mptcp: avoid processing packet if a subflow resetJianguo Wu1-2/+3
If check_fully_established() causes a subflow reset, it should not continue to process the packet in tcp_data_queue(). Add a return value to mptcp_incoming_options(), and return false if a subflow has been reset, else return true. Then drop the packet in tcp_data_queue()/tcp_rcv_state_process() if mptcp_incoming_options() return false. Fixes: d582484726c4 ("mptcp: fix fallback for MP_JOIN subflows") Signed-off-by: Jianguo Wu <wujianguo@chinatelecom.cn> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-09net: validate lwtstate->data before returning from skb_tunnel_info()Taehee Yoo1-1/+3
skb_tunnel_info() returns pointer of lwtstate->data as ip_tunnel_info type without validation. lwtstate->data can have various types such as mpls_iptunnel_encap, etc and these are not compatible. So skb_tunnel_info() should validate before returning that pointer. Splat looks like: BUG: KASAN: slab-out-of-bounds in vxlan_get_route+0x418/0x4b0 [vxlan] Read of size 2 at addr ffff888106ec2698 by task ping/811 CPU: 1 PID: 811 Comm: ping Not tainted 5.13.0+ #1195 Call Trace: dump_stack_lvl+0x56/0x7b print_address_description.constprop.8.cold.13+0x13/0x2ee ? vxlan_get_route+0x418/0x4b0 [vxlan] ? vxlan_get_route+0x418/0x4b0 [vxlan] kasan_report.cold.14+0x83/0xdf ? vxlan_get_route+0x418/0x4b0 [vxlan] vxlan_get_route+0x418/0x4b0 [vxlan] [ ... ] vxlan_xmit_one+0x148b/0x32b0 [vxlan] [ ... ] vxlan_xmit+0x25c5/0x4780 [vxlan] [ ... ] dev_hard_start_xmit+0x1ae/0x6e0 __dev_queue_xmit+0x1f39/0x31a0 [ ... ] neigh_xmit+0x2f9/0x940 mpls_xmit+0x911/0x1600 [mpls_iptunnel] lwtunnel_xmit+0x18f/0x450 ip_finish_output2+0x867/0x2040 [ ... ] Fixes: 61adedf3e3f1 ("route: move lwtunnel state to dst_entry") Signed-off-by: Taehee Yoo <ap420073@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-08bpf: Add function for XDP meta data length checkZvi Effron1-0/+5
This commit prepares to use the XDP meta data length check in multiple places by making it into a static inline function instead of a literal. Co-developed-by: Cody Haas <chaas@riotgames.com> Co-developed-by: Lisa Watanabe <lwatanabe@riotgames.com> Signed-off-by: Cody Haas <chaas@riotgames.com> Signed-off-by: Lisa Watanabe <lwatanabe@riotgames.com> Signed-off-by: Zvi Effron <zeffron@riotgames.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/bpf/20210707221657.3985075-2-zeffron@riotgames.com
2021-07-08Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nfDavid S. Miller2-1/+1
Pablo Neira Ayuso says: ==================== Netfilter fixes for net The following patchset contains Netfilter fixes for net: 1) Do not refresh timeout in SYN_SENT for syn retransmissions. Add selftest for unreplied TCP connection, from Florian Westphal. 2) Fix null dereference from error path with hardware offload in nftables. 3) Remove useless nf_ct_gre_keymap_flush() from netns exit path, from Vasily Averin. 4) Missing rcu read-lock side in ctnetlink helper info dump, also from Vasily. 5) Do not mark RST in the reply direction coming after SYN packet for an out-of-sync entry, from Ali Abdallah and Florian Westphal. 6) Add tcp_ignore_invalid_rst sysctl to allow to disable out of segment RSTs, from Ali. 7) KCSAN fix for nf_conntrack_all_lock(), from Manfred Spraul. 8) Honor NFTA_LAST_SET in nft_last. 9) Fix incorrect arithmetics when restore last_jiffies in nft_last. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-06bonding: Add struct bond_ipesc to manage SATaehee Yoo1-1/+8
bonding has been supporting ipsec offload. When SA is added, bonding just passes SA to its own active real interface. But it doesn't manage SA. So, when events(add/del real interface, active real interface change, etc) occur, bonding can't handle that well because It doesn't manage SA. So some problems(panic, UAF, refcnt leak)occur. In order to make it stable, it should manage SA. That's the reason why struct bond_ipsec is added. When a new SA is added to bonding interface, it is stored in the bond_ipsec list. And the SA is passed to a current active real interface. If events occur, it uses bond_ipsec data to handle these events. bond->ipsec_list is protected by bond->ipsec_lock. If a current active real interface is changed, the following logic works. 1. delete all SAs from old active real interface 2. Add all SAs to the new active real interface. 3. If a new active real interface doesn't support ipsec offload or SA's option, it sets real_dev to NULL. Fixes: 18cb261afd7b ("bonding: support hardware encryption offload to slaves") Signed-off-by: Taehee Yoo <ap420073@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-06netfilter: conntrack: add new sysctl to disable RST checkAli Abdallah1-0/+1
This patch adds a new sysctl tcp_ignore_invalid_rst to disable marking out of segments RSTs as INVALID. Signed-off-by: Ali Abdallah <aabdallah@suse.de> Acked-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-07-03Merge branch 'work.iov_iter' of ↵Linus Torvalds1-6/+8
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull iov_iter updates from Al Viro: "iov_iter cleanups and fixes. There are followups, but this is what had sat in -next this cycle. IMO the macro forest in there became much thinner and easier to follow..." * 'work.iov_iter' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (37 commits) csum_and_copy_to_pipe_iter(): leave handling of csum_state to caller clean up copy_mc_pipe_to_iter() pipe_zero(): we don't need no stinkin' kmap_atomic()... iov_iter: clean csum_and_copy_...() primitives up a bit copy_page_from_iter(): don't need kmap_atomic() for kvec/bvec cases copy_page_to_iter(): don't bother with kmap_atomic() for bvec/kvec cases iterate_xarray(): only of the first iteration we might get offset != 0 pull handling of ->iov_offset into iterate_{iovec,bvec,xarray} iov_iter: make iterator callbacks use base and len instead of iovec iov_iter: make the amount already copied available to iterator callbacks iov_iter: get rid of separate bvec and xarray callbacks iov_iter: teach iterate_{bvec,xarray}() about possible short copies iterate_bvec(): expand bvec.h macro forest, massage a bit iov_iter: unify iterate_iovec and iterate_kvec iov_iter: massage iterate_iovec and iterate_kvec to logics similar to iterate_bvec iterate_and_advance(): get rid of magic in case when n is 0 csum_and_copy_to_iter(): massage into form closer to csum_and_copy_from_iter() iov_iter: replace iov_iter_copy_from_user_atomic() with iterator-advancing variant [xarray] iov_iter_npages(): just use DIV_ROUND_UP() iov_iter_npages(): don't bother with iterate_all_kinds() ...
2021-07-02net: ipv6: fix return value of ip6_skb_dst_mtuVadim Fedorenko1-1/+1
Commit 628a5c561890 ("[INET]: Add IP(V6)_PMTUDISC_RPOBE") introduced ip6_skb_dst_mtu with return value of signed int which is inconsistent with actually returned values. Also 2 users of this function actually assign its value to unsigned int variable and only __xfrm6_output assigns result of this function to signed variable but actually uses as unsigned in further comparisons and calls. Change this function to return unsigned int value. Fixes: 628a5c561890 ("[INET]: Add IP(V6)_PMTUDISC_RPOBE") Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: Vadim Fedorenko <vfedorenko@novek.ru> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-02xfrm: Fix RCU vs hash_resize_mutex lock inversionFrederic Weisbecker1-0/+1
xfrm_bydst_resize() calls synchronize_rcu() while holding hash_resize_mutex. But then on PREEMPT_RT configurations, xfrm_policy_lookup_bytype() may acquire that mutex while running in an RCU read side critical section. This results in a deadlock. In fact the scope of hash_resize_mutex is way beyond the purpose of xfrm_policy_lookup_bytype() to just fetch a coherent and stable policy for a given destination/direction, along with other details. The lower level net->xfrm.xfrm_policy_lock, which among other things protects per destination/direction references to policy entries, is enough to serialize and benefit from priority inheritance against the write side. As a bonus, it makes it officially a per network namespace synchronization business where a policy table resize on namespace A shouldn't block a policy lookup on namespace B. Fixes: 77cc278f7b20 (xfrm: policy: Use sequence counters with associated lock) Cc: stable@vger.kernel.org Cc: Ahmed S. Darwish <a.darwish@linutronix.de> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Varad Gautam <varad.gautam@suse.com> Cc: Steffen Klassert <steffen.klassert@secunet.com> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: David S. Miller <davem@davemloft.net> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2021-07-02netfilter: conntrack: nf_ct_gre_keymap_flush() removalVasily Averin1-1/+0
nf_ct_gre_keymap_flush() is useless. It is called from nf_conntrack_cleanup_net_list() only and tries to remove nf_ct_gre_keymap entries from pernet gre keymap list. Though: a) at this point the list should already be empty, all its entries were deleted during the conntracks cleanup, because nf_conntrack_cleanup_net_list() executes nf_ct_iterate_cleanup(kill_all) before nf_conntrack_proto_pernet_fini(): nf_conntrack_cleanup_net_list +- nf_ct_iterate_cleanup | nf_ct_put | nf_conntrack_put | nf_conntrack_destroy | destroy_conntrack | destroy_gre_conntrack | nf_ct_gre_keymap_destroy `- nf_conntrack_proto_pernet_fini nf_ct_gre_keymap_flush b) Let's say we find that the keymap list is not empty. This means netns still has a conntrack associated with gre, in which case we should not free its memory, because this will lead to a double free and related crashes. However I doubt it could have gone unnoticed for years, obviously this does not happen in real life. So I think we can remove both nf_ct_gre_keymap_flush() and nf_conntrack_proto_pernet_fini(). Signed-off-by: Vasily Averin <vvs@virtuozzo.com> Acked-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-07-01tcp: consistently disable header prediction for mptcpPaolo Abeni1-0/+4
The MPTCP receive path is hooked only into the TCP slow-path. The DSS presence allows plain MPTCP traffic to hit that consistently. Since commit e1ff9e82e2ea ("net: mptcp: improve fallback to TCP"), when an MPTCP socket falls back to TCP, it can hit the TCP receive fast-path, and delay or stop triggering the event notification. Address the issue explicitly disabling the header prediction for MPTCP sockets. Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/200 Fixes: e1ff9e82e2ea ("net: mptcp: improve fallback to TCP") Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-01net: remove the caif_hsi driverChristoph Hellwig1-200/+0
The caif_hsi driver relies on a cfhsi_get_ops symbol using symbol_get, but this symbol is not provided anywhere in the kernel tree. Remove this driver given that it is dead code. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-01net: sock: extend SO_TIMESTAMPING for PHC bindingYangbo Lu1-2/+6
Since PTP virtual clock support is added, there can be several PTP virtual clocks based on one PTP physical clock for timestamping. This patch is to extend SO_TIMESTAMPING API to support PHC (PTP Hardware Clock) binding by adding a new flag SOF_TIMESTAMPING_BIND_PHC. When PTP virtual clocks are in use, user space can configure to bind one for timestamping, but PTP physical clock is not supported and not needed to bind. This patch is preparation for timestamp conversion from raw timestamp to a specific PTP virtual clock time in core net. Signed-off-by: Yangbo Lu <yangbo.lu@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-01sctp: move 198 addresses from unusable to private scopeXin Long1-3/+1
The doc draft-stewart-tsvwg-sctp-ipv4-00 that restricts 198 addresses was never published. These addresses as private addresses should be allowed to use in SCTP. As Michael Tuexen suggested, this patch is to move 198 addresses from unusable to private scope. Reported-by: Sérgio <surkamp@gmail.com> Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-01net: annotate data race around sk_ll_usecEric Dumazet1-1/+1
sk_ll_usec is read locklessly from sk_can_busy_loop() while another thread can change its value in sock_setsockopt() This is correct but needs annotations. BUG: KCSAN: data-race in __skb_try_recv_datagram / sock_setsockopt write to 0xffff88814eb5f904 of 4 bytes by task 14011 on cpu 0: sock_setsockopt+0x1287/0x2090 net/core/sock.c:1175 __sys_setsockopt+0x14f/0x200 net/socket.c:2100 __do_sys_setsockopt net/socket.c:2115 [inline] __se_sys_setsockopt net/socket.c:2112 [inline] __x64_sys_setsockopt+0x62/0x70 net/socket.c:2112 do_syscall_64+0x4a/0x90 arch/x86/entry/common.c:47 entry_SYSCALL_64_after_hwframe+0x44/0xae read to 0xffff88814eb5f904 of 4 bytes by task 14001 on cpu 1: sk_can_busy_loop include/net/busy_poll.h:41 [inline] __skb_try_recv_datagram+0x14f/0x320 net/core/datagram.c:273 unix_dgram_recvmsg+0x14c/0x870 net/unix/af_unix.c:2101 unix_seqpacket_recvmsg+0x5a/0x70 net/unix/af_unix.c:2067 ____sys_recvmsg+0x15d/0x310 include/linux/uio.h:244 ___sys_recvmsg net/socket.c:2598 [inline] do_recvmmsg+0x35c/0x9f0 net/socket.c:2692 __sys_recvmmsg net/socket.c:2771 [inline] __do_sys_recvmmsg net/socket.c:2794 [inline] __se_sys_recvmmsg net/socket.c:2787 [inline] __x64_sys_recvmmsg+0xcf/0x150 net/socket.c:2787 do_syscall_64+0x4a/0x90 arch/x86/entry/common.c:47 entry_SYSCALL_64_after_hwframe+0x44/0xae value changed: 0x00000000 -> 0x00000101 Reported by Kernel Concurrency Sanitizer on: CPU: 1 PID: 14001 Comm: syz-executor.3 Not tainted 5.13.0-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-30Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski8-18/+49
Trivial conflict in net/netfilter/nf_tables_api.c. Duplicate fix in tools/testing/selftests/net/devlink_port_split.py - take the net-next version. skmsg, and L4 bpf - keep the bpf code but remove the flags and err params. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-06-29tcp: change ICSK_CA_PRIV_SIZE definitionEric Dumazet1-1/+1
Instead of a magic number (13 currently) and having to change it every other year, use sizeof_field() macro. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-29net: sock: introduce sk_error_reportAlexander Aring2-1/+3
This patch introduces a function wrapper to call the sk_error_report callback. That will prepare to add additional handling whenever sk_error_report is called, for example to trace socket errors. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-29net: dsa: reference count the FDB addresses at the cross-chip notifier levelVladimir Oltean1-0/+1
The same concerns expressed for host MDB entries are valid for host FDBs just as well: - in the case of multiple bridges spanning the same switch chip, deleting a host FDB entry that belongs to one bridge will result in breakage to the other bridge - not deleting FDB entries across DSA links means that the switch's hardware tables will eventually run out, given enough wear&tear So do the same thing and introduce reference counting for CPU ports and DSA links using the same data structures as we have for MDB entries. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-29net: dsa: reference count the MDB entries at the cross-chip notifier levelVladimir Oltean1-0/+12
Ever since the cross-chip notifiers were introduced, the design was meant to be simplistic and just get the job done without worrying too much about dangling resources left behind. For example, somebody installs an MDB entry on sw0p0 in this daisy chain topology. It gets installed using ds->ops->port_mdb_add() on sw0p0, sw1p4 and sw2p4. | sw0p0 sw0p1 sw0p2 sw0p3 sw0p4 [ user ] [ user ] [ user ] [ dsa ] [ cpu ] [ x ] [ ] [ ] [ ] [ ] | +---------+ | sw1p0 sw1p1 sw1p2 sw1p3 sw1p4 [ user ] [ user ] [ user ] [ dsa ] [ dsa ] [ ] [ ] [ ] [ ] [ x ] | +---------+ | sw2p0 sw2p1 sw2p2 sw2p3 sw2p4 [ user ] [ user ] [ user ] [ user ] [ dsa ] [ ] [ ] [ ] [ ] [ x ] Then the same person deletes that MDB entry. The cross-chip notifier for deletion only matches sw0p0: | sw0p0 sw0p1 sw0p2 sw0p3 sw0p4 [ user ] [ user ] [ user ] [ dsa ] [ cpu ] [ x ] [ ] [ ] [ ] [ ] | +---------+ | sw1p0 sw1p1 sw1p2 sw1p3 sw1p4 [ user ] [ user ] [ user ] [ dsa ] [ dsa ] [ ] [ ] [ ] [ ] [ ] | +---------+ | sw2p0 sw2p1 sw2p2 sw2p3 sw2p4 [ user ] [ user ] [ user ] [ user ] [ dsa ] [ ] [ ] [ ] [ ] [ ] Why? Because the DSA links are 'trunk' ports, if we just go ahead and delete the MDB from sw1p4 and sw2p4 directly, we might delete those multicast entries when they are still needed. Just consider the fact that somebody does: - add a multicast MAC address towards sw0p0 [ via the cross-chip notifiers it gets installed on the DSA links too ] - add the same multicast MAC address towards sw0p1 (another port of that same switch) - delete the same multicast MAC address from sw0p0. At this point, if we deleted the MAC address from the DSA links, it would be flooded, even though there is still an entry on switch 0 which needs it not to. So that is why deletions only match the targeted source port and nothing on DSA links. Of course, dangling resources means that the hardware tables will eventually run out given enough additions/removals, but hey, at least it's simple. But there is a bigger concern which needs to be addressed, and that is our support for SWITCHDEV_OBJ_ID_HOST_MDB. DSA simply translates such an object into a dsa_port_host_mdb_add() which ends up as ds->ops->port_mdb_add() on the upstream port, and a similar thing happens on deletion: dsa_port_host_mdb_del() will trigger ds->ops->port_mdb_del() on the upstream port. When there are 2 VLAN-unaware bridges spanning the same switch (which is a use case DSA proudly supports), each bridge will install its own SWITCHDEV_OBJ_ID_HOST_MDB entries. But upon deletion, DSA goes ahead and emits a DSA_NOTIFIER_MDB_DEL for dp->cpu_dp, which is shared between the user ports enslaved to br0 and the user ports enslaved to br1. Not good. The host-trapped multicast addresses installed by br1 will be deleted when any state changes in br0 (IGMP timers expire, or ports leave, etc). To avoid this, we could of course go the route of the zero-sum game and delete the DSA_NOTIFIER_MDB_DEL call for dp->cpu_dp. But the better design is to just admit that on shared ports like DSA links and CPU ports, we should be reference counting calls, even if this consumes some dynamic memory which DSA has traditionally avoided. On the flip side, the hardware tables of switches are limited in size, so it would be good if the OS managed them properly instead of having them eventually overflow. To address the memory usage concern, we only apply the refcounting of MDB entries on ports that are really shared (CPU ports and DSA links) and not on user ports. In a typical single-switch setup, this means only the CPU port (and the host MDB entries are not that many, really). The name of the newly introduced data structures (dsa_mac_addr) is chosen in such a way that will be reusable for host FDB entries (next patch). With this change, we can finally have the same matching logic for the MDB additions and deletions, as well as for their host-trapped variants. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-29net: dsa: introduce dsa_is_upstream_port and dsa_switch_is_upstream_ofVladimir Oltean1-0/+26
In preparation for the new cross-chip notifiers for host addresses, let's introduce some more topology helpers which we are going to use to discern switches that are in our path towards the dedicated CPU port from switches that aren't. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-29Merge tag 'for-net-next-2021-06-28' of ↵David S. Miller3-63/+68
git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next Luiz Augusto von Dentz says: ==================== bluetooth-next pull request for net-next: - Add support for QCA_ROME device (0cf3:e500) and RTL8822CE - Update management interface revision to 21 - Use of incluse language - Proper handling of HCI_LE_Advertising_Set_Terminated event - Recovery handing of HCI ncmd=0 - Various memory fixes ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-29sctp: validate from_addr_param returnMarcelo Ricardo Leitner1-1/+1
Ilja reported that, simply putting it, nothing was validating that from_addr_param functions were operating on initialized memory. That is, the parameter itself was being validated by sctp_walk_params, but it doesn't check for types and their specific sizes and it could be a 0-length one, causing from_addr_param to potentially work over the next parameter or even uninitialized memory. The fix here is to, in all calls to from_addr_param, check if enough space is there for the wanted IP address type. Reported-by: Ilja Van Sprundel <ivansprundel@ioactive.com> Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-29Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextDavid S. Miller1-1/+1
Daniel Borkmann says: ==================== pull-request: bpf-next 2021-06-28 The following pull-request contains BPF updates for your *net-next* tree. We've added 37 non-merge commits during the last 12 day(s) which contain a total of 56 files changed, 394 insertions(+), 380 deletions(-). The main changes are: 1) XDP driver RCU cleanups, from Toke Høiland-Jørgensen and Paul E. McKenney. 2) Fix bpf_skb_change_proto() IPv4/v6 GSO handling, from Maciej Żenczykowski. 3) Fix false positive kmemleak report for BPF ringbuf alloc, from Rustam Kovhaev. 4) Fix x86 JIT's extable offset calculation for PROBE_LDX NULL, from Ravi Bangoria. 5) Enable libbpf fallback probing with tracing under RHEL7, from Jonathan Edwards. 6) Clean up x86 JIT to remove unused cnt tracking from EMIT macro, from Jiri Olsa. 7) Netlink cleanups for libbpf to please Coverity, from Kumar Kartikeya Dwivedi. 8) Allow to retrieve ancestor cgroup id in tracing programs, from Namhyung Kim. 9) Fix lirc BPF program query to use user-provided prog_cnt, from Sean Young. 10) Add initial libbpf doc including generated kdoc for its API, from Grant Seltzer. 11) Make xdp_rxq_info_unreg_mem_model() more robust, from Jakub Kicinski. 12) Fix up bpfilter startup log-level to info level, from Gary Lin. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-29ipv6: ICMPV6: add response to ICMPV6 RFC 8335 PROBE messagesAndreas Roeseler1-0/+1
This patch builds off of commit 2b246b2569cd2ac6ff700d0dce56b8bae29b1842 and adds functionality to respond to ICMPV6 PROBE requests. Add icmp_build_probe function to construct PROBE requests for both ICMPV4 and ICMPV6. Modify icmpv6_rcv to detect ICMPV6 PROBE messages and call the icmpv6_echo_reply handler. Modify icmpv6_echo_reply to build a PROBE response message based on the queried interface. This patch has been tested using a branch of the iputils git repo which can be found here: https://github.com/Juniper-Clinic-2020/iputils/tree/probe-request Signed-off-by: Andreas Roeseler <andreas.a.roeseler@gmail.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-29flow_offload: action should not be NULL when it is referencedgushengxian1-5/+7
"action" should not be NULL when it is referenced. Signed-off-by: gushengxian <13145886936@163.com> Signed-off-by: gushengxian <gushengxian@yulong.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-29net: switchdev: add a context void pointer to struct switchdev_notifier_infoVladimir Oltean1-6/+7
In the case where the driver asks for a replay of a certain type of event (port object or attribute) for a bridge port that is a LAG, it may do so because this port has just joined the LAG. But there might already be other switchdev ports in that LAG, and it is preferable that those preexisting switchdev ports do not act upon the replayed event. The solution is to add a context to switchdev events, which is NULL most of the time (when the bridge layer initiates the call) but which can be set to a value controlled by the switchdev driver when a replay is requested. The driver can then check the context to figure out if all ports within the LAG should act upon the switchdev event, or just the ones that match the context. We have to modify all switchdev_handle_* helper functions as well as the prototypes in the drivers that use these helpers too, because these helpers hide the underlying struct switchdev_notifier_info from us and there is no way to retrieve the context otherwise. The context structure will be populated and used in later patches. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-28Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/gitDavid S. Miller2-22/+16
/klassert/ipsec-next Steffen Klassert says: ==================== pull request (net-next): ipsec-next 2021-06-28 1) Remove an unneeded error assignment in esp4_gro_receive(). From Yang Li. 2) Add a new byseq state hashtable to find acquire states faster. From Sabrina Dubroca. 3) Remove some unnecessary variables in pfkey_create(). From zuoqilin. 4) Remove the unused description from xfrm_type struct. From Florian Westphal. 5) Fix a spelling mistake in the comment of xfrm_state_ok(). From gushengxian. 6) Replace hdr_off indirections by a small helper function. From Florian Westphal. 7) Remove xfrm4_output_finish and xfrm6_output_finish declarations, they are not used anymore.From Antony Antony. 8) Remove xfrm replay indirections. From Florian Westphal. Please pull or let me know if there are problems. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-28Merge tag 'mac80211-next-for-net-next-2021-06-25' of ↵David S. Miller2-35/+86
git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next Johannes berg says: ==================== Lots of changes: * aggregation handling improvements for some drivers * hidden AP discovery on 6 GHz and other HE 6 GHz improvements * minstrel improvements for no-ack frames * deferred rate control for TXQs to improve reaction times * virtual time-based airtime scheduler * along with various little cleanups/fixups ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-28net: lwtunnel: handle MTU calculation in forwadingVadim Fedorenko2-8/+20
Commit 14972cbd34ff ("net: lwtunnel: Handle fragmentation") moved fragmentation logic away from lwtunnel by carry encap headroom and use it in output MTU calculation. But the forwarding part was not covered and created difference in MTU for output and forwarding and further to silent drops on ipv4 forwarding path. Fix it by taking into account lwtunnel encap headroom. The same commit also introduced difference in how to treat RTAX_MTU in IPv4 and IPv6 where latter explicitly removes lwtunnel encap headroom from route MTU. Make IPv4 version do the same. Fixes: 14972cbd34ff ("net: lwtunnel: Handle fragmentation") Suggested-by: David Ahern <dsahern@gmail.com> Signed-off-by: Vadim Fedorenko <vfedorenko@novek.ru> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>