summaryrefslogtreecommitdiff
AgeCommit message (Collapse)AuthorFilesLines
2021-10-27net/mlx5e: HW_GRO cqe handler implementationKhalid Manaa3-11/+242
this patch updates the SHAMPO CQE handler to support HW_GRO, changes in the SHAMPO CQE handler: - CQE match and flush fields are used to determine if to build new skb using the new received packet, or to add the received packet data to the existing RQ.hw_gro_skb, also this fields are used to determine when to flush the skb. - in the end of the function mlx5e_poll_rx_cq the RQ.hw_gro_skb is flushed. Signed-off-by: Khalid Manaa <khalidm@nvidia.com> Signed-off-by: Ben Ben-Ishay <benishay@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-10-27net/mlx5e: Add data path for SHAMPO featureBen Ben-Ishay2-0/+189
The header buffer is used to store the headers of the rx packets. The header buffer size deduced from WorkQueue size + restriction of max packets per WorkQueueElement. This commit adds the functionality for posting/updating memory for the header buffer during the posting/updating of WQEs. Signed-off-by: Ben Ben-Ishay <benishay@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-10-27net/mlx5e: Add handle SHAMPO cqe supportKhalid Manaa3-35/+194
This patch adds the new CQE SHAMPO fields: - flush: indicates that we must close the current session and pass the SKB to the network stack. - match: indicates that the current packet matches the oppened session, the packet will be merge into the current SKB. - header_size: the size of the packet headers that written into the headers buffer. - header_entry_index: the entry index in the headers buffer. - data_offset: packets data offset in the WQE. Also new cqe handler is added to handle SHAMPO packets: - The new handler uses CQE SHAMPO fields to build the SKB. CQE's Flush and match fields are not used in this patch, packets are not merged in this patch. Signed-off-by: Khalid Manaa <khalidm@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-10-27net/mlx5e: Add control path for SHAMPO featureBen Ben-Ishay6-30/+400
This commit introduces the control path infrastructure for SHAMPO feature. SHAMPO feature enables packet stitching by splitting packets to header and payload, the header is placed on a dedicated buffer and the payload on the RX ring, this allows stitching the data part of a flow together continuously in the receive buffer. SHAMPO feature is implemented as linked list striding RQ feature. To support packets splitting and payload stitching: - Enlarge the ICOSQ and the correspond CQ to support the header buffer memory regions. - Add support to create linked list striding RQ with SHAMPO feature set in the open_rq function. - Add deallocation function and corresponded calls for SHAMPO header buffer. - Add mlx5e_create_umr_klm_mkey to support KLM mkey for the header buffer. - Rename mlx5e_create_umr_mkey to mlx5e_create_umr_mtt_mkey. Signed-off-by: Ben Ben-Ishay <benishay@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-10-27net/mlx5e: Add support to klm_umr_wqeBen Ben-Ishay2-1/+24
This commit adds the needed definitions for using the klm_umr_wqe. UMR stands for user-mode memory registration, is a mechanism to alter address translation properties of MKEY by posting WorkQueueElement aka WQE on send queue. MKEY stands for memory key, MKEY are used to describe a region in memory that can be later used by HW. KLM stands for {Key, Length, MemVa}, KLM_MKEY is indirect MKEY that enables to map multiple memory spaces with different sizes in unified MKEY. klm_umr_wqe is a UMR that use to update a KLM_MKEY. SHAMPO feature uses KLM_MKEY for memory registration of his header buffer. Signed-off-by: Ben Ben-Ishay <benishay@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-10-27net/mlx5e: Rename TIR lro functions to TIR packet merge functionsKhalid Manaa15-92/+95
This series introduces new packet merge type, therefore rename lro functions to packet merge to support the new merge type: - Generalize + rename mlx5e_build_tir_ctx_lro to mlx5e_build_tir_ctx_packet_merge. - Rename mlx5e_modify_tirs_lro to mlx5e_modify_tirs_packet_merge. - Rename lro bit in mlx5_ifc_modify_tir_bitmask_bits to packet_merge. - Rename lro_en in mlx5e_params to packet_merge_type type and combine packet_merge params into one struct mlx5e_packet_merge_param. Signed-off-by: Khalid Manaa <khalidm@nvidia.com> Signed-off-by: Ben Ben-Ishay <benishay@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-10-27net/mlx5: Add SHAMPO caps, HW bits and enumerationsBen Ben-Ishay4-3/+64
This commit adds SHAMPO bit to hca_cap and SHAMPO capabilities structure, SHAMPO related HW spec hardware fields and enumerations. SHAMPO stands for: split headers and merge payload offload. SHAMPO new fields: WQ: - headers_mkey: mkey that represents the headers buffer, where the packets headers will be written by the HW. - shampo_enable: flag to verify if the WQ supports SHAMPO feature. - log_reservation_size: the log of the reservation size where the data of the packet will be written by the HW. - log_max_num_of_packets_per_reservation: log of the maximum number of packets that can be written to the same reservation. - log_headers_entry_size: log of the header entry size of the headers buffer. - log_headers_buffer_entry_num: log of the entries number of the headers buffer. RQ: - shampo_no_match_alignment_granularity: the HW alignment granularity in case the received packet doesn't match the current session. - shampo_match_criteria_type: the type of match criteria. - reservation_timeout: the maximum time that the HW will hold the reservation. mlx5_ifc_shampo_cap_bits, the capabilities of the SHAMPO feature: - shampo_log_max_reservation_size: the maximum allowed value of the field WQ.log_reservation_size. - log_reservation_size: the minimum allowed value of the field WQ.log_reservation_size. - shampo_min_mss_size: the minimum payload size of packet that can open a new session or be merged to a session. - shampo_max_log_headers_entry_size: the maximum allowed value of the field WQ.log_headers_entry_size Signed-off-by: Ben Ben-Ishay <benishay@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-10-27net/mlx5e: Rename lro_timeout to packet_merge_timeoutBen Ben-Ishay5-9/+9
TIR stands for transport interface receive, the TIR object is responsible for performing all transport related operations on the receive side like packet processing, demultiplexing the packets to different RQ's, etc. lro_timeout is a field in the TIR that is used to set the timeout for lro session, this series introduces new packet merge type, therefore rename lro_timeout to packet_merge_timeout for all packet merge types. Signed-off-by: Ben Ben-Ishay <benishay@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-10-27net: Prevent HW-GRO and LRO features operate togetherBen Ben-ishay1-0/+5
LRO and HW-GRO are mutually exclusive, this commit adds this restriction in netdev_fix_feature. HW-GRO is preferred, that means in case both HW-GRO and LRO features are requested, LRO is cleared. Signed-off-by: Ben Ben-ishay <benishay@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-10-27lib: bitmap: Introduce node-aware alloc APITariq Toukan2-0/+15
Expose new node-aware API for bitmap allocation: bitmap_alloc_node() / bitmap_zalloc_node(). Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Moshe Shemesh <moshe@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-10-27net: phy: fixed warning: Function parameter not describedLuo Jie1-0/+1
Fixed warning: Function parameter or member 'enable' not described in 'genphy_c45_fast_retrain' Signed-off-by: Luo Jie <luoj@codeaurora.org> Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Link: https://lore.kernel.org/r/20211026102957.17100-1-luoj@codeaurora.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-10-26net/mlx5: remove the recent devlink paramsJakub Kicinski10-207/+9
revert commit 46ae40b94d88 ("net/mlx5: Let user configure io_eq_size param") revert commit a6cb08daa3b4 ("net/mlx5: Let user configure event_eq_size param") revert commit 554604061979 ("net/mlx5: Let user configure max_macs param") The EQE parameters are applicable to more drivers, they should be configured via standard API, probably ethtool. Example of another driver needing something similar: https://lore.kernel.org/all/1633454136-14679-3-git-send-email-sbhatta@marvell.com/ The last param for "max_macs" is probably fine but the documentation is severely lacking. The meaning and implications for changing the param need to be stated. Link: https://lore.kernel.org/r/20211026152939.3125950-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-10-26Merge branch 'phy-supported-interfaces-bitmap'David S. Miller3-2/+81
Russell King says: ==================== Introduce supported interfaces bitmap This series introduces a new bitmap to allow us to indicate which phy_interface_t modes are supported. Currently, phylink will call ->validate with PHY_INTERFACE_MODE_NA to request all link mode capabilities from the MAC driver before choosing an interface to use. This leads in some cases to some rather hairly code. This can be simplified if phylink is aware of the interface modes that the MAC supports, and it can instead walk those modes, calling ->validate for each one, and combining the results. This series merely introduces the support; there is no change of behaviour until MAC drivers populate their supported_interfaces bitmap. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-26net: phylink: use supported_interfaces for phylink validationRussell King (Oracle)2-2/+46
If the network device supplies a supported interface bitmap, we can use that during phylink's validation to simplify MAC drivers in two ways by using the supported_interfaces bitmap to: 1. reject unsupported interfaces before calling into the MAC driver. 2. generate the set of all supported link modes across all supported interfaces (used mainly for SFP, but also some 10G PHYs.) Suggested-by: Sean Anderson <sean.anderson@seco.com> Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-26net: phylink: add MAC phy_interface_t bitmapRussell King1-0/+1
Add a phy_interface_t bitmap so the MAC driver can specifiy which PHY interface modes it supports. Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-26net: phy: add phy_interface_t bitmap supportRussell King (Oracle)1-0/+34
Add support for a bitmap for phy interface modes, which includes: - a macro to declare the interface bitmap - an inline helper to zero the interface bitmap - an inline helper to detect an empty interface bitmap - inline helpers to do a bitwise AND and OR operations on two interface bitmaps Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-26Merge branch 'dsa-isolation-prep'David S. Miller2-3/+2
Vladimir Oltean says: ==================== DSA preparations for FDB isolation between bridges This series makes 2 small changes to DSA's SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICE handler, which will make it possible to offer switch drivers a stable association between a FDB entry and a bridge device in a future series. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-26net: dsa: stop calling dev_hold in dsa_slave_fdb_eventVladimir Oltean1-3/+0
Now that we guarantee that SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICE events have finished executing by the time we leave our bridge upper interface, we've established a stronger boundary condition for how long the dsa_slave_switchdev_event_work() might run. As such, it is no longer possible for DSA slave interfaces to become unregistered, since they are still bridge ports. So delete the unnecessary dev_hold() and dev_put(). Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-26net: dsa: flush switchdev workqueue when leaving the bridgeVladimir Oltean1-0/+2
DSA is preparing to offer switch drivers an API through which they can associate each FDB entry with a struct net_device *bridge_dev. This can be used to perform FDB isolation (the FDB lookup performed on the ingress of a standalone, or bridged port, should not find an FDB entry that is present in the FDB of another bridge). In preparation of that work, DSA needs to ensure that by the time we call the switch .port_fdb_add and .port_fdb_del methods, the dp->bridge_dev pointer is still valid, i.e. the port is still a bridge port. This is not guaranteed because the SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICE API requires drivers that must have sleepable context to handle those events to schedule the deferred work themselves. DSA does this through the dsa_owq. It can happen that a port leaves a bridge, del_nbp() flushes the FDB on that port, SWITCHDEV_FDB_DEL_TO_DEVICE is notified in atomic context, DSA schedules its deferred work, but del_nbp() finishes unlinking the bridge as a master from the port before DSA's deferred work is run. Fundamentally, the port must not be unlinked from the bridge until all FDB deletion deferred work items have been flushed. The bridge must wait for the completion of these hardware accesses. An attempt has been made to address this issue centrally in switchdev by making SWITCHDEV_FDB_DEL_TO_DEVICE deferred (=> blocking) at the switchdev level, which would offer implicit synchronization with del_nbp: https://patchwork.kernel.org/project/netdevbpf/cover/20210820115746.3701811-1-vladimir.oltean@nxp.com/ but it seems that any attempt to modify switchdev's behavior and make the events blocking there would introduce undesirable side effects in other switchdev consumers. The most undesirable behavior seems to be that switchdev_deferred_process_work() takes the rtnl_mutex itself, which would be worse off than having the rtnl_mutex taken individually from drivers which is what we have now (except DSA which has removed that lock since commit 0faf890fc519 ("net: dsa: drop rtnl_lock from dsa_slave_switchdev_event_work")). So to offer the needed guarantee to DSA switch drivers, I have come up with a compromise solution that does not require switchdev rework: we already have a hook at the last moment in time when the bridge is still an upper of ours: the NETDEV_PRECHANGEUPPER handler. We can flush the dsa_owq manually from there, which makes all FDB deletions synchronous. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-26ifb: Depend on netfilter alternatively to tcLukas Wunner1-1/+1
IFB originally depended on NET_CLS_ACT for traffic redirection. But since v4.5, that may be achieved with NFT_FWD_NETDEV as well. Fixes: 39e6dea28adc ("netfilter: nf_tables: add forward expression to the netdev family") Signed-off-by: Lukas Wunner <lukas@wunner.de> Cc: <stable@vger.kernel.org> # v4.5+: bcfabee1afd9: netfilter: nft_fwd_netdev: allow to redirect to ifb via ingress Cc: <stable@vger.kernel.org> # v4.5+ Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-26mctp: Implement extended addressingJeremy Kerr5-39/+170
This change allows an extended address struct - struct sockaddr_mctp_ext - to be passed to sendmsg/recvmsg. This allows userspace to specify output ifindex and physical address information (for sendmsg) or receive the input ifindex/physaddr for incoming messages (for recvmsg). This is typically used by userspace for MCTP address discovery and assignment operations. The extended addressing facility is conditional on a new sockopt: MCTP_OPT_ADDR_EXT; userspace must explicitly enable addressing before the kernel will consume/populate the extended address data. Includes a fix for an uninitialised var: Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-26net: ax88796c: Remove pointless check in ax88796c_open()Nathan Chancellor1-5/+4
Clang warns: drivers/net/ethernet/asix/ax88796c_main.c:851:24: error: address of array 'ax_local->phydev->advertising' will always evaluate to 'true' [-Werror,-Wpointer-bool-conversion] if (ax_local->phydev->advertising && ~~~~~~~~~~~~~~~~~~^~~~~~~~~~~ ~~ advertising cannot be NULL here if ax_local is not NULL, which cannot happen due to the check in ax88796c_probe(). Remove the check. Link: https://github.com/ClangBuiltLinux/linux/issues/1492 Signed-off-by: Nathan Chancellor <nathan@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-26net: ax88796c: Fix clang -Wimplicit-fallthrough in ax88796c_set_mac()Nathan Chancellor1-0/+2
Clang warns: drivers/net/ethernet/asix/ax88796c_main.c:696:2: error: unannotated fall-through between switch labels [-Werror,-Wimplicit-fallthrough] case SPEED_10: ^ drivers/net/ethernet/asix/ax88796c_main.c:696:2: note: insert 'break;' to avoid fall-through case SPEED_10: ^ break; drivers/net/ethernet/asix/ax88796c_main.c:706:2: error: unannotated fall-through between switch labels [-Werror,-Wimplicit-fallthrough] case DUPLEX_HALF: ^ drivers/net/ethernet/asix/ax88796c_main.c:706:2: note: insert 'break;' to avoid fall-through case DUPLEX_HALF: ^ break; Clang is a little more pedantic than GCC, which permits implicit fallthroughs to cases that contain just break or return. Clang's version is more in line with the kernel's own stance in deprecated.rst, which states that all switch/case blocks must end in either break, fallthrough, continue, goto, or return. Add the missing breaks to fix the warning. Link: https://github.com/ClangBuiltLinux/linux/issues/1491 Signed-off-by: Nathan Chancellor <nathan@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-26net: mana: Allow setting the number of queues while the NIC is downHaiyang Zhang2-13/+9
The existing code doesn't allow setting the number of queues while the NIC is down. Update the ethtool handler functions to support setting the number of queues while the NIC is at down state. Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-26net: hsr: Add support for redbox supervision framesAndreas Oetken4-26/+113
added support for the redbox supervision frames as defined in the IEC-62439-3:2018. Signed-off-by: Andreas Oetken <andreas.oetken@siemens-energy.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-26Merge branch 'tcp_stream_alloc_skb'David S. Miller4-19/+15
Eric Dumazet says: ==================== tcp: tcp_stream_alloc_skb() changes sk_stream_alloc_skb() is only used by TCP. Rename it to tcp_stream_alloc_skb() and apply small optimizations. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-26tcp: remove unneeded code from tcp_stream_alloc_skb()Eric Dumazet1-3/+0
Aligning @size argument to 4 bytes is not needed. The header alignment has nothing to do with @size. It really depends on skb->head alignment and MAX_TCP_HEADER. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-26tcp: use MAX_TCP_HEADER in tcp_stream_alloc_skbEric Dumazet1-2/+2
Both IPv4 and IPv6 uses same reserve, no need risking cache line misses to fetch its value. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-26tcp: rename sk_stream_alloc_skbEric Dumazet4-14/+13
sk_stream_alloc_skb() is only used by TCP. Rename it to make this clear, and move its declaration to include/net/tcp.h Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-26net: annotate data-race in neigh_output()Eric Dumazet1-3/+8
neigh_output() reads n->nud_state and hh->hh_len locklessly. This is fine, but we need to add annotations and document this. We evaluate skip_cache first to avoid reading these fields if the cache has to by bypassed. syzbot report: BUG: KCSAN: data-race in __neigh_event_send / ip_finish_output2 write to 0xffff88810798a885 of 1 bytes by interrupt on cpu 1: __neigh_event_send+0x40d/0xac0 net/core/neighbour.c:1128 neigh_event_send include/net/neighbour.h:444 [inline] neigh_resolve_output+0x104/0x410 net/core/neighbour.c:1476 neigh_output include/net/neighbour.h:510 [inline] ip_finish_output2+0x80a/0xaa0 net/ipv4/ip_output.c:221 ip_finish_output+0x3b5/0x510 net/ipv4/ip_output.c:309 NF_HOOK_COND include/linux/netfilter.h:296 [inline] ip_output+0xf3/0x1a0 net/ipv4/ip_output.c:423 dst_output include/net/dst.h:450 [inline] ip_local_out+0x164/0x220 net/ipv4/ip_output.c:126 __ip_queue_xmit+0x9d3/0xa20 net/ipv4/ip_output.c:525 ip_queue_xmit+0x34/0x40 net/ipv4/ip_output.c:539 __tcp_transmit_skb+0x142a/0x1a00 net/ipv4/tcp_output.c:1405 tcp_transmit_skb net/ipv4/tcp_output.c:1423 [inline] tcp_xmit_probe_skb net/ipv4/tcp_output.c:4011 [inline] tcp_write_wakeup+0x4a9/0x810 net/ipv4/tcp_output.c:4064 tcp_send_probe0+0x2c/0x2b0 net/ipv4/tcp_output.c:4079 tcp_probe_timer net/ipv4/tcp_timer.c:398 [inline] tcp_write_timer_handler+0x394/0x520 net/ipv4/tcp_timer.c:626 tcp_write_timer+0xb9/0x180 net/ipv4/tcp_timer.c:642 call_timer_fn+0x2e/0x1d0 kernel/time/timer.c:1421 expire_timers+0x135/0x240 kernel/time/timer.c:1466 __run_timers+0x368/0x430 kernel/time/timer.c:1734 run_timer_softirq+0x19/0x30 kernel/time/timer.c:1747 __do_softirq+0x12c/0x26e kernel/softirq.c:558 invoke_softirq kernel/softirq.c:432 [inline] __irq_exit_rcu kernel/softirq.c:636 [inline] irq_exit_rcu+0x4e/0xa0 kernel/softirq.c:648 sysvec_apic_timer_interrupt+0x69/0x80 arch/x86/kernel/apic/apic.c:1097 asm_sysvec_apic_timer_interrupt+0x12/0x20 native_safe_halt arch/x86/include/asm/irqflags.h:51 [inline] arch_safe_halt arch/x86/include/asm/irqflags.h:89 [inline] acpi_safe_halt drivers/acpi/processor_idle.c:109 [inline] acpi_idle_do_entry drivers/acpi/processor_idle.c:553 [inline] acpi_idle_enter+0x258/0x2e0 drivers/acpi/processor_idle.c:688 cpuidle_enter_state+0x2b4/0x760 drivers/cpuidle/cpuidle.c:237 cpuidle_enter+0x3c/0x60 drivers/cpuidle/cpuidle.c:351 call_cpuidle kernel/sched/idle.c:158 [inline] cpuidle_idle_call kernel/sched/idle.c:239 [inline] do_idle+0x1a3/0x250 kernel/sched/idle.c:306 cpu_startup_entry+0x15/0x20 kernel/sched/idle.c:403 secondary_startup_64_no_verify+0xb1/0xbb read to 0xffff88810798a885 of 1 bytes by interrupt on cpu 0: neigh_output include/net/neighbour.h:507 [inline] ip_finish_output2+0x79a/0xaa0 net/ipv4/ip_output.c:221 ip_finish_output+0x3b5/0x510 net/ipv4/ip_output.c:309 NF_HOOK_COND include/linux/netfilter.h:296 [inline] ip_output+0xf3/0x1a0 net/ipv4/ip_output.c:423 dst_output include/net/dst.h:450 [inline] ip_local_out+0x164/0x220 net/ipv4/ip_output.c:126 __ip_queue_xmit+0x9d3/0xa20 net/ipv4/ip_output.c:525 ip_queue_xmit+0x34/0x40 net/ipv4/ip_output.c:539 __tcp_transmit_skb+0x142a/0x1a00 net/ipv4/tcp_output.c:1405 tcp_transmit_skb net/ipv4/tcp_output.c:1423 [inline] tcp_xmit_probe_skb net/ipv4/tcp_output.c:4011 [inline] tcp_write_wakeup+0x4a9/0x810 net/ipv4/tcp_output.c:4064 tcp_send_probe0+0x2c/0x2b0 net/ipv4/tcp_output.c:4079 tcp_probe_timer net/ipv4/tcp_timer.c:398 [inline] tcp_write_timer_handler+0x394/0x520 net/ipv4/tcp_timer.c:626 tcp_write_timer+0xb9/0x180 net/ipv4/tcp_timer.c:642 call_timer_fn+0x2e/0x1d0 kernel/time/timer.c:1421 expire_timers+0x135/0x240 kernel/time/timer.c:1466 __run_timers+0x368/0x430 kernel/time/timer.c:1734 run_timer_softirq+0x19/0x30 kernel/time/timer.c:1747 __do_softirq+0x12c/0x26e kernel/softirq.c:558 invoke_softirq kernel/softirq.c:432 [inline] __irq_exit_rcu kernel/softirq.c:636 [inline] irq_exit_rcu+0x4e/0xa0 kernel/softirq.c:648 sysvec_apic_timer_interrupt+0x69/0x80 arch/x86/kernel/apic/apic.c:1097 asm_sysvec_apic_timer_interrupt+0x12/0x20 native_safe_halt arch/x86/include/asm/irqflags.h:51 [inline] arch_safe_halt arch/x86/include/asm/irqflags.h:89 [inline] acpi_safe_halt drivers/acpi/processor_idle.c:109 [inline] acpi_idle_do_entry drivers/acpi/processor_idle.c:553 [inline] acpi_idle_enter+0x258/0x2e0 drivers/acpi/processor_idle.c:688 cpuidle_enter_state+0x2b4/0x760 drivers/cpuidle/cpuidle.c:237 cpuidle_enter+0x3c/0x60 drivers/cpuidle/cpuidle.c:351 call_cpuidle kernel/sched/idle.c:158 [inline] cpuidle_idle_call kernel/sched/idle.c:239 [inline] do_idle+0x1a3/0x250 kernel/sched/idle.c:306 cpu_startup_entry+0x15/0x20 kernel/sched/idle.c:403 rest_init+0xee/0x100 init/main.c:734 arch_call_rest_init+0xa/0xb start_kernel+0x5e4/0x669 init/main.c:1142 secondary_startup_64_no_verify+0xb1/0xbb value changed: 0x20 -> 0x01 Reported by Kernel Concurrency Sanitizer on: CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.15.0-rc6-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-26Merge branch 'mlxsw-rif-mac-prefixes'David S. Miller14-149/+778
Ido Schimmel says: ==================== mlxsw: Support multiple RIF MAC prefixes Currently, mlxsw enforces that all the netdevs used as router interfaces (RIFs) have the same MAC prefix (e.g., same 38 MSBs in Spectrum-1). Otherwise, an error is returned to user space with extack. This patchset relaxes the limitation through the use of RIF MAC profiles. A RIF MAC profile is a hardware entity that represents a particular MAC prefix which multiple RIFs can reference. Therefore, the number of possible MAC prefixes is no longer one, but the number of profiles supported by the device. The ability to change the MAC of a particular netdev is useful, for example, for users who use the netdev to connect to an upstream provider that performs MAC filtering. Currently, such users are either forced to negotiate with the provider or change the MAC address of all other netdevs so that they share the same prefix. Patchset overview: Patches #1-#3 are preparations. Patch #4 adds actual support for RIF MAC profiles. Patch #5 exposes RIF MAC profiles as a devlink resource, so that user space has visibility into the maximum number of profiles and current occupancy. Useful for debugging and testing (next 3 patches). Patches #6-#8 add both scale and functional tests. Patch #9 removes tests that validated the previous limitation. It is now covered by patch #6 for devices that support a single profile. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-26selftests: mlxsw: Remove deprecated test casesDanielle Ratson1-90/+0
After adding the previous patches, the constraint that all the router interface MAC addresses have the same prefix is no longer relevant. Remove the test cases that validated that this constraint is honored. Signed-off-by: Danielle Ratson <danieller@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-26selftests: Add an occupancy test for RIF MAC profilesDanielle Ratson1-0/+117
When all the RIF MAC profiles are in use, test that it is possible to change the MAC of a netdev (i.e., a RIF) when its MAC profile is not shared with other RIFs. Test that replacement fails when the MAC profile is shared. Signed-off-by: Danielle Ratson <danieller@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-26selftests: mlxsw: Add forwarding test for RIF MAC profilesDanielle Ratson1-0/+213
Verify that MAC profile changes are indeed applied and that packets are forwarded with the correct source MAC. Output example: $ ./rif_mac_profiles.sh TEST: h1->h2: new mac profile [ OK ] TEST: h2->h1: new mac profile [ OK ] TEST: h1->h2: edit mac profile [ OK ] TEST: h2->h1: edit mac profile [ OK ] Signed-off-by: Danielle Ratson <danieller@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-26selftests: mlxsw: Add a scale test for RIF MAC profilesDanielle Ratson5-2/+106
Query the maximum number of supported RIF MAC profiles using devlink-resource and verify that all available MAC profiles can be utilized and that an error is generated when user space tries to exceed this number. Output example in Spectrum-2: $ TESTS='rif_mac_profile' ./resource_scale.sh TEST: 'rif_mac_profile' 4 [ OK ] TEST: 'rif_mac_profile' overflow 5 [ OK ] Signed-off-by: Danielle Ratson <danieller@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-26mlxsw: spectrum_router: Expose RIF MAC profiles to devlink resourceDanielle Ratson3-2/+47
Expose via devlink-resource the maximum number of RIF MAC profiles and their current occupancy, so it can be used for debug and writing generic tests, like in the next patch. Example for Spectrum-2 output: $ devlink resource show pci/0000:06:00.0 ... name rif_mac_profiles size 4 occ 0 unit entry dpipe_tables none Signed-off-by: Danielle Ratson <danieller@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-26mlxsw: spectrum_router: Add RIF MAC profiles supportDanielle Ratson2-47/+272
Currently, mlxsw enforces that all the router interfaces (RIFs) have the same MAC prefix. Relax this limitation by using RIF MAC profiles. Each profile is associated with a particular MAC prefix and multiple RIFs can use the same profile. Therefore, the number of possible MAC prefixes is no longer one, but the number of profiles supported by the device. Store the profiles in an IDR and reference count them according to the number of RIFs using them. Associate a RIF with a profile when the RIF is created and remove the association when the RIF is deleted. Change the association following 'NETDEV_CHANGEADDR' events, except when only one RIF is using the profile. In which case, change the MAC prefix of the profile itself instead of associating the RIF with a new profile. Signed-off-by: Danielle Ratson <danieller@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-26mlxsw: spectrum_router: Propagate extack furtherDanielle Ratson1-8/+15
The next patch will set the MAC profile of a router interface (RIF) as part of its configure() callback. The operation can fail in case the maximum number of profiles was exceeded. Add extack to mlxsw_sp_rif_ops::configure() in order to communicate such failures to user space. In addition, the MAC profile of a RIF can change following a 'NETDEV_CHANGEADDR' notification. Propagate extack to mlxsw_sp_router_port_change_event() so that failures could be communicated in this path as well. No functional changes intended. Signed-off-by: Danielle Ratson <danieller@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-26mlxsw: resources: Add resource identifier for RIF MAC profilesDanielle Ratson1-0/+2
Add a resource identifier for maximum RIF MAC profiles so that it could be later used to query the information from firmware. Signed-off-by: Danielle Ratson <danieller@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-26mlxsw: reg: Add MAC profile ID field to RITR registerDanielle Ratson1-0/+6
Add MAC profile ID field to RITR register so that it could be used for associating a RIF with a MAC profile ID by a later patch. Signed-off-by: Danielle Ratson <danieller@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-26Merge branch 'netfilter-vrf-rework'David S. Miller3-5/+51
Florian Westphal says: ==================== vrf: rework interaction with netfilter/conntrack V2: - fix 'plain integer as null pointer' warning - reword commit message in patch 2 to clarify loss of 'ct set untracked' This patch series aims to solve the to-be-reverted change 09e856d54bda5f288e ("vrf: Reset skb conntrack connection on VRF rcv") in a different way. Rather than have skbs pass through conntrack and nat hooks twice, suppress conntrack invocation if the conntrack/nat hook is called from the vrf driver. First patch deals with 'incoming connection' case: 1. suppress NAT transformations 2. skip conntrack confirmation NAT and conntrack confirmation is done when ip/ipv6 stack calls the postrouting hook. Second patch deals with local packets: in vrf driver, mark the skbs as 'untracked', so conntrack output hook ignores them. This skips all nat hooks as well. Afterwards, remove the untracked state again so the second round will pick them up. One alternative to the chosen implementation would be to add a 'caller id' field to 'struct nf_hook_state' and then use that, these patches use the more straightforward check of VRF flag on the state->out device. The two patches apply to both net and net-next, i am targeting -next because I think that since snat did not work correctly for so long that we can take the longer route. If you disagree, apply to net at your discretion. The patches apply both with 09e856d54bda5f288e reverted or still in-place, but only with the revert in place ingress conntrack settings (zone, notrack etc) start working again. I've already submitted selftests for vrf+nfqueue and conntrack+vrf. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-26vrf: run conntrack only in context of lower/physdev for locally generated ↵Florian Westphal1-4/+24
packets The VRF driver invokes netfilter for output+postrouting hooks so that users can create rules that check for 'oif $vrf' rather than lower device name. This is a problem when NAT rules are configured. To avoid any conntrack involvement in round 1, tag skbs as 'untracked' to prevent conntrack from picking them up. This gets cleared before the packet gets handed to the ip stack so conntrack will be active on the second iteration. One remaining issue is that a rule like output ... oif $vrfname notrack won't propagate to the second round because we can't tell 'notrack set via ruleset' and 'notrack set by vrf driver' apart. However, this isn't a regression: the 'notrack' removal happens instead of unconditional nf_reset_ct(). I'd also like to avoid leaking more vrf specific conditionals into the netfilter infra. For ingress, conntrack has already been done before the packet makes it to the vrf driver, with this patch egress does connection tracking with lower/physical device as well. Signed-off-by: Florian Westphal <fw@strlen.de> Acked-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-26netfilter: conntrack: skip confirmation and nat hooks in postrouting for vrfFlorian Westphal2-1/+27
The VRF driver invokes netfilter for output+postrouting hooks so that users can create rules that check for 'oif $vrf' rather than lower device name. Afterwards, ip stack calls those hooks again. This is a problem when conntrack is used with IP masquerading. masquerading has an internal check that re-validates the output interface to account for route changes. This check will trigger in the vrf case. If the -j MASQUERADE rule matched on the first iteration, then round 2 finds state->out->ifindex != nat->masq_index: the latter is the vrf index, but out->ifindex is the lower device. The packet gets dropped and the conntrack entry is invalidated. This change makes conntrack postrouting skip the nat hooks. Also skip confirmation. This allows the second round (postrouting invocation from ipv4/ipv6) to create nat bindings. This also prevents the second round from seeing packets that had their source address changed by the nat hook. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-26Merge tag 'mlx5-updates-2021-10-25' of ↵David S. Miller27-93/+787
git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux Saeed Mahameed says: ==================== mlx5-updates-2021-10-25 Misc updates for mlx5 driver: 1) Misc updates and cleanups: - Don't write directly to netdev->dev_addr, From Jakub Kicinski - Remove unnecessary checks for slow path flag in tc module - Fix unused function warning of mlx5i_flow_type_mask - Bridge, support replacing existing FDB entry 2) Sub Functions, Reduction in memory usage: - Reduce flow counters bulk query buffer size - Implement max_macs devlink parameter - Add devlink vendor params to control Event Queue sizes - Added SF life cycle trace points by Parav/ 3) From Aya, Firmware health buffer reporting improvements - Print health buffer by log level and more missing information - Periodic update of host time to firmware ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-26tcp: don't free a FIN sk_buff in tcp_remove_empty_skb()Jon Maxwell1-1/+1
v1: Implement a more general statement as recommended by Eric Dumazet. The sequence number will be advanced, so this check will fix the FIN case and other cases. A customer reported sockets stuck in the CLOSING state. A Vmcore revealed that the write_queue was not empty as determined by tcp_write_queue_empty() but the sk_buff containing the FIN flag had been freed and the socket was zombied in that state. Corresponding pcaps show no FIN from the Linux kernel on the wire. Some instrumentation was added to the kernel and it was found that there is a timing window where tcp_sendmsg() can run after tcp_send_fin(). tcp_sendmsg() will hit an error, for example: 1269 ▹ if (sk->sk_err || (sk->sk_shutdown & SEND_SHUTDOWN))↩ 1270 ▹ ▹ goto do_error;↩ tcp_remove_empty_skb() will then free the FIN sk_buff as "skb->len == 0". The TCP socket is now wedged in the FIN-WAIT-1 state because the FIN is never sent. If the other side sends a FIN packet the socket will transition to CLOSING and remain that way until the system is rebooted. Fix this by checking for the FIN flag in the sk_buff and don't free it if that is the case. Testing confirmed that fixed the issue. Fixes: fdfc5c8594c2 ("tcp: remove empty skb from write queue in error cases") Signed-off-by: Jon Maxwell <jmaxwell37@gmail.com> Reported-by: Monir Zouaoui <Monir.Zouaoui@mail.schwarz> Reported-by: Simon Stier <simon.stier@mail.schwarz> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-26Merge branch 'small-fixes-for-true-expression-checks'Jakub Kicinski2-3/+3
Jean Sacren says: ==================== Small fixes for true expression checks This series fixes checks of true !rc expression. ==================== Link: https://lore.kernel.org/r/cover.1634974124.git.sakiwit@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-10-26net: qed_dev: fix check of true !rc expressionJean Sacren1-1/+1
Remove the check of !rc in (!rc && !resc_lock_params.b_granted) since it is always true. Signed-off-by: Jean Sacren <sakiwit@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-10-26net: qed_ptp: fix check of true !rc expressionJean Sacren1-2/+2
Remove the check of !rc in (!rc && !params.b_granted) since it is always true. We should also use constant 0 for return. Signed-off-by: Jean Sacren <sakiwit@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-10-26Merge branch 'tcp-receive-path-optimizations'Jakub Kicinski11-38/+79
Eric Dumazet says: ==================== tcp: receive path optimizations This series aims to reduce cache line misses in RX path. I am still working on better cache locality in tcp_sock but this will wait few more weeks. ==================== Link: https://lore.kernel.org/r/20211025164825.259415-1-eric.dumazet@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-10-26ipv6/tcp: small drop monitor changesEric Dumazet1-2/+2
Two kfree_skb() calls must be replaced by consume_skb() for skbs that are not technically dropped. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>