summaryrefslogtreecommitdiff
AgeCommit message (Collapse)AuthorFilesLines
2020-12-15Merge branch 'mlxsw-introduce-initial-xm-router-support'Jakub Kicinski11-8/+1518
Ido Schimmel says: ==================== mlxsw: Introduce initial XM router support This patch set implements initial eXtended Mezzanine (XM) router support. The XM is an external device connected to the Spectrum-{2,3} ASICs using dedicated Ethernet ports. Its purpose is to increase the number of routes that can be offloaded to hardware. This is achieved by having the ASIC act as a cache that refers cache misses to the XM where the FIB is stored and LPM lookup is performed. Future patch sets will add more sophisticated cache flushing and selftests that utilize cache counters on the ASIC, which we plan to expose via devlink-metric [1]. Patch set overview: Patches #1-#2 add registers to insert/remove routes to/from the XM and to enable/disable it. Patch #3 utilizes these registers in order to implement XM-specific router low-level operations. Patches #4-#5 query from firmware the availability of the XM and the local ports that are used to connect the ASIC to the XM, so that netdevs will not be created for them. Patches #6-#8 initialize the XM by configuring its cache parameters. Patch #9-#10 implement cache management, so that LPM lookup will be correctly cached in the ASIC. Patches #11-#13 implement cache flushing, so that routes insertions/removals to/from the XM will flush the affected entries in the cache. Patch #14 configures the ASIC to allocate half of its memory for the cache, so that room will be left for other entries (e.g., FDBs, neighbours). Patch #15 starts using the XM for IPv4 route offload, when available. [1] https://lore.kernel.org/netdev/20200817125059.193242-1-idosch@idosch.org/ ==================== Link: https://lore.kernel.org/r/20201214113041.2789043-1-idosch@idosch.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-15mlxsw: spectrum_router: Use eXtended mezzanine to offload IPv4 routerJiri Pirko3-1/+11
In case the eXtended mezzanine is present on the system, use it for IPv4 router offload. Signed-off-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-15mlxsw: spectrum: Set KVH XLT cache mode for Spectrum2/3Jiri Pirko4-1/+24
Set a profile option to instruct FW to use 1/2 of KVH for XLT cache, not the whole one. Signed-off-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-15mlxsw: spectrum_router_xm: Introduce basic XM cache flushingJiri Pirko1-0/+291
Upon route insertion and removal, it is needed to flush possibly cached entries from the XM cache. Extend XM op context to carry information needed for the flush. Implement the flush in delayed work since for HW design reasons there is a need to wait 50usec before the flush can be done. If during this time comes the same flush request, consolidate it to the first one. Implement this queued flushes by a hashtable. v2: * Fix GENMASK() high bit Signed-off-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-15mlxsw: reg: Add Router LPM Cache Enable RegisterJiri Pirko1-0/+35
The RLPMCE allows disabling the LPM cache. Can be changed on the fly. Signed-off-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-15mlxsw: reg: Add Router LPM Cache ML Delete RegisterJiri Pirko1-0/+111
The RLCMLD register is used to bulk delete the XLT-LPM cache ML entries. This can be used by SW when L is increased or decreased, thus need to remove entries with old ML values. Signed-off-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-15mlxsw: spectrum_router_xm: Implement L-value tracking for M-indexJiri Pirko3-0/+205
There is a table that assigns L-value per M-index. The L is always the biggest from the currently inserted prefixes. Setup a hashtable to track the M-index information and the prefixes that are related to it. Ensure the L-value is always correctly set. Signed-off-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-15mlxsw: reg: Add XM Router M Table RegisterJiri Pirko1-2/+31
The XRMT configures the M-Table for the XLT-LPM. Signed-off-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-15mlxsw: spectrum_router: Introduce per-ASIC XM initializationJiri Pirko3-0/+88
During the router init flow, call into XM code and initialize couple of items needed for XM functionality: 1) Query the capabilities and sizes. Check the XM device id. 2) Initialize the M-value. Note that currently the M-value is set fixed to 16 for IPv4. In future this may change to better cover the actual inserted routes. Signed-off-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-15mlxsw: reg: Add XM Lookup Table Query RegisterJiri Pirko1-3/+63
The XLTQ is used to query HW for XM-related info. Signed-off-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-15mlxsw: reg: Add Router XLT M select RegisterJiri Pirko1-0/+32
The RXLTM configures and selects the M for the XM lookups. Signed-off-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-15mlxsw: Ignore ports that are connected to eXtended mezanineJiri Pirko4-1/+18
Use the info stored in the bus_info struct about the eXtended mezanine connected ports and don't expose them. Signed-off-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-15mlxsw: pci: Obtain info about ports used by eXtended mezanineJiri Pirko3-2/+49
The output of boardinfo command was extended to contain information about XM. Indicates if is present and in case it is, tells which localports are used for the connection. So parse this info and store it in bus_info passed up to the driver. Signed-off-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-15mlxsw: spectrum_router: Introduce XM implementation of router low-level opsJiri Pirko4-0/+250
In order to offload entries to XM, implement a set of low-level functions to work with LPM trees in XM and also to pack and write FIB entries into XM. Signed-off-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-15mlxsw: reg: Add Router XLT Enable RegisterJiri Pirko1-0/+44
The RXLTE enables XLT (eXtended Lookup Table) LPM lookups if a capable XM is present on the system. Signed-off-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-15mlxsw: reg: Add XM Direct RegisterJiri Pirko1-3/+271
The XMDR allows direct access to the XM device via the switch. Signed-off-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-15Merge branch 'bnxt_en-improve-firmware-flashing'Jakub Kicinski1-75/+139
Michael Chan says: ==================== bnxt_en: Improve firmware flashing. This patchset improves firmware flashing in 2 ways: - If firmware returns NO_SPACE error during flashing, the driver will create the UPDATE directory with more staging area and retry. - Instead of allocating a big DMA buffer for the entire contents of the firmware package size, fallback to a smaller buffer to DMA the contents in multiple DMA operations. ==================== Link: https://lore.kernel.org/r/1607860306-17244-1-git-send-email-michael.chan@broadcom.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-15bnxt_en: Enable batch mode when using HWRM_NVM_MODIFY to flash packages.Michael Chan1-9/+40
The current scheme allocates a DMA buffer as big as the requested firmware package file and DMAs the contents to firmware in one operation. The buffer size can be several hundred kilo bytes and the driver may not be able to allocate the memory. This will cause firmware upgrade to fail. Improve the scheme by using smaller DMA blocks and calling firmware to DMA each block in a batch mode. Older firmware can cause excessive NVRAM erases if the block size is too small so we try to allocate a 256K buffer to begin with and size it down successively if we cannot allocate the memory. Reviewed-by: Edwin Peer <edwin.peer@broadcom.com> Signed-off-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-15bnxt_en: Retry installing FW package under NO_SPACE error condition.Pavan Chebbi1-5/+32
In bnxt_flash_package_from_fw_obj(), if firmware returns the NO_SPACE error, call __bnxt_flash_nvram() to create the UPDATE directory and then loop back and retry one more time. Since the first try may fail, we use the silent version to send the firmware commands. Reviewed-by: Vasundhara Volam <vasundhara-v.volam@broadcom.com> Reviewed-by: Edwin Peer <edwin.peer@broadcom.com> Signed-off-by: Pavan Chebbi <pavan.chebbi@broadcom.com> Signed-off-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-15bnxt_en: Restructure bnxt_flash_package_from_fw_obj() to execute in a loop.Pavan Chebbi1-32/+28
On NICs with a smaller NVRAM, FW installation may fail after multiple updates due to fragmentation. The driver can retry when FW returns a special error code. To faciliate the retry, we restructure the logic that performs the flashing in a loop. The actual retry logic will be added in the next patch. Signed-off-by: Pavan Chebbi <pavan.chebbi@broadcom.com> Signed-off-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-15bnxt_en: Rearrange the logic in bnxt_flash_package_from_fw_obj().Michael Chan1-33/+30
This function will be modified in the next patch to retry flashing the firmware in a loop. To facilate that, we rearrange the code so that the steps that only need to be done once before the loop will be moved to the top of the function. Signed-off-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-15bnxt_en: Refactor bnxt_flash_nvram.Pavan Chebbi1-19/+32
Refactor bnxt_flash_nvram() into __bnxt_flash_nvram() that takes an additional dir_item_len parameter. The new function will be used in subsequent patches with the dir_item_len parameter set to create the UPDATE directory during flashing. Signed-off-by: Pavan Chebbi <pavan.chebbi@broadcom.com> Signed-off-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-15MAINTAINERS: add mvpp2 driver entryMarcin Wojtas1-0/+8
Since its creation Marvell NIC driver for Armada 375/7k8k and CN913x SoC families mvpp2 has been lacking an entry in MAINTAINERS, which sometimes lead to unhandled bugs that persisted across several kernel releases. Signed-off-by: Marcin Wojtas <mw@semihalf.com> Acked-by: Andrew Lunn <andrew@lunn.ch> Link: https://lore.kernel.org/r/20201211165114.26290-1-mw@semihalf.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-15net: drop bogus skb with CHECKSUM_PARTIAL and offset beyond end of trimmed ↵Vasily Averin1-0/+6
packet syzbot reproduces BUG_ON in skb_checksum_help(): tun creates (bogus) skb with huge partial-checksummed area and small ip packet inside. Then ip_rcv trims the skb based on size of internal ip packet, after that csum offset points beyond of trimmed skb. Then checksum_tg() called via netfilter hook triggers BUG_ON: offset = skb_checksum_start_offset(skb); BUG_ON(offset >= skb_headlen(skb)); To work around the problem this patch forces pskb_trim_rcsum_slow() to return -EINVAL in described scenario. It allows its callers to drop such kind of packets. Link: https://syzkaller.appspot.com/bug?id=b419a5ca95062664fe1a60b764621eb4526e2cd0 Reported-by: syzbot+7010af67ced6105e5ab6@syzkaller.appspotmail.com Signed-off-by: Vasily Averin <vvs@virtuozzo.com> Acked-by: Willem de Bruijn <willemb@google.com> Link: https://lore.kernel.org/r/1b2494af-2c56-8ee2-7bc0-923fcad1cdf8@virtuozzo.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-15inet_ecn: Use csum16_add() helper for IP_ECN_set_* helpersToke Høiland-Jørgensen1-8/+6
Jakub pointed out that the IP_ECN_set* helpers basically open-code csum16_add(), so let's switch them over to using the helper instead. v2: - Use __be16 for check_add stack variable in IP_ECN_set_ce() (kbot) v3: - Turns out we need __force casts to do arithmetic on __be16 types Reported-by: Jakub Kicinski <kuba@kernel.org> Tested-by: Jonathan Morton <chromatix99@gmail.com> Tested-by: Pete Heist <pete@heistp.net> Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Link: https://lore.kernel.org/r/20201211142638.154780-1-toke@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-15net: bridge: Fix a warning when del bridge sysfsWang Hai1-1/+4
I got a warining report: br_sysfs_addbr: can't create group bridge4/bridge ------------[ cut here ]------------ sysfs group 'bridge' not found for kobject 'bridge4' WARNING: CPU: 2 PID: 9004 at fs/sysfs/group.c:279 sysfs_remove_group fs/sysfs/group.c:279 [inline] WARNING: CPU: 2 PID: 9004 at fs/sysfs/group.c:279 sysfs_remove_group+0x153/0x1b0 fs/sysfs/group.c:270 Modules linked in: iptable_nat ... Call Trace: br_dev_delete+0x112/0x190 net/bridge/br_if.c:384 br_dev_newlink net/bridge/br_netlink.c:1381 [inline] br_dev_newlink+0xdb/0x100 net/bridge/br_netlink.c:1362 __rtnl_newlink+0xe11/0x13f0 net/core/rtnetlink.c:3441 rtnl_newlink+0x64/0xa0 net/core/rtnetlink.c:3500 rtnetlink_rcv_msg+0x385/0x980 net/core/rtnetlink.c:5562 netlink_rcv_skb+0x134/0x3d0 net/netlink/af_netlink.c:2494 netlink_unicast_kernel net/netlink/af_netlink.c:1304 [inline] netlink_unicast+0x4a0/0x6a0 net/netlink/af_netlink.c:1330 netlink_sendmsg+0x793/0xc80 net/netlink/af_netlink.c:1919 sock_sendmsg_nosec net/socket.c:651 [inline] sock_sendmsg+0x139/0x170 net/socket.c:671 ____sys_sendmsg+0x658/0x7d0 net/socket.c:2353 ___sys_sendmsg+0xf8/0x170 net/socket.c:2407 __sys_sendmsg+0xd3/0x190 net/socket.c:2440 do_syscall_64+0x33/0x40 arch/x86/entry/common.c:46 entry_SYSCALL_64_after_hwframe+0x44/0xa9 In br_device_event(), if the bridge sysfs fails to be added, br_device_event() should return error. This can prevent warining when removing bridge sysfs that do not exist. Fixes: bb900b27a2f4 ("bridge: allow creating bridge devices with netlink") Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Wang Hai <wanghai38@huawei.com> Tested-by: Nikolay Aleksandrov <nikolay@nvidia.com> Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com> Link: https://lore.kernel.org/r/20201211122921.40386-1-wanghai38@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-15ice, xsk: Move Rx allocation out of while-loopBjörn Töpel1-6/+3
Instead doing the check for allocation in each loop, move it outside the while loop and do it every NAPI loop. This change boosts the xdpsock rxdrop scenario with 15% more packets-per-second. Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: Björn Töpel <bjorn.topel@intel.com> Link: https://lore.kernel.org/r/20201211085410.59350-1-bjorn.topel@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-15net: mtk_eth: simplify the mediatek code return expressionZheng Yongjun1-12/+4
Simplify the return expression at mtk_eth_path.c file, simplify this all. Signed-off-by: Zheng Yongjun <zhengyongjun3@huawei.com> Link: https://lore.kernel.org/r/20201211083801.1632-1-zhengyongjun3@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-15Merge branch 'add-devlink-and-devlink-health-reporters-to'Jakub Kicinski8-2/+912
George Cherian says: ==================== Add devlink and devlink health reporters to octeontx2 Add basic devlink and devlink health reporters. Devlink health reporters are added for NPA block. Address Jakub's comment to add devlink support for error reporting. https://www.spinics.net/lists/netdev/msg670712.html For now, I have dropped the NIX block health reporters. This series attempts to add health reporters only for the NPA block. As per Jakub's suggestion separate reporters per event is used and also got rid of the counters. ==================== Link: https://lore.kernel.org/r/20201211062526.2302643-1-george.cherian@marvell.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-15docs: octeontx2: Add Documentation for NPA health reportersGeorge Cherian1-0/+50
Add Documentation for devlink health reporters for NPA block. Signed-off-by: George Cherian <george.cherian@marvell.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-15octeontx2-af: Add devlink health reporters for NPAGeorge Cherian3-1/+765
Add health reporters for RVU NPA block. NPA Health reporters handle following HW event groups - GENERAL events - ERROR events - RAS events - RVU event Output: #devlink health pci/0002:01:00.0: reporter hw_npa_intr state healthy error 0 recover 0 grace_period 0 auto_recover true auto_dump true reporter hw_npa_gen state healthy error 0 recover 0 grace_period 0 auto_recover true auto_dump true reporter hw_npa_err state healthy error 0 recover 0 grace_period 0 auto_recover true auto_dump true reporter hw_npa_ras state healthy error 0 recover 0 grace_period 0 auto_recover true auto_dump true #devlink health dump show pci/0002:01:00.0 reporter hw_npa_err NPA_AF_ERR: NPA Error Interrupt Reg : 4096 AQ Doorbell Error #devlink health dump show pci/0002:01:00.0 reporter hw_npa_ras NPA_AF_RVU_RAS: NPA RAS Interrupt Reg : 0 Each reporter dump shows the Register value and the description of the cause. Signed-off-by: Sunil Kovvuri Goutham <sgoutham@marvell.com> Signed-off-by: Jerin Jacob <jerinj@marvell.com> Signed-off-by: George Cherian <george.cherian@marvell.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-15octeontx2-af: Add devlink suppoort to af driverGeorge Cherian6-2/+98
Add devlink support to AF driver. Basic devlink support is added. Currently info_get is the only supported devlink ops. devlink ouptput looks like this # devlink dev pci/0002:01:00.0 # devlink dev info pci/0002:01:00.0: driver octeontx2-af # Signed-off-by: Sunil Kovvuri Goutham <sgoutham@marvell.com> Signed-off-by: Jerin Jacob <jerinj@marvell.com> Signed-off-by: George Cherian <george.cherian@marvell.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-15selftests: test_vxlan_under_vrf: mute unnecessary error messagePo-Hsu Lin1-1/+1
The cleanup function in this script that tries to delete hv-1 / hv-2 vm-1 / vm-2 netns will generate some uncessary error messages: Cannot remove namespace file "/run/netns/hv-2": No such file or directory Cannot remove namespace file "/run/netns/vm-1": No such file or directory Cannot remove namespace file "/run/netns/vm-2": No such file or directory Redirect it to /dev/null like other commands in the cleanup function to reduce confusion. Signed-off-by: Po-Hsu Lin <po-hsu.lin@canonical.com> Link: https://lore.kernel.org/r/20201211042420.16411-1-po-hsu.lin@canonical.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-15net: Limit logical shift left of TCP probe0 timeoutCambda Zhu1-1/+3
For each TCP zero window probe, the icsk_backoff is increased by one and its max value is tcp_retries2. If tcp_retries2 is greater than 63, the probe0 timeout shift may exceed its max bits. On x86_64/ARMv8/MIPS, the shift count would be masked to range 0 to 63. And on ARMv7 the result is zero. If the shift count is masked, only several probes will be sent with timeout shorter than TCP_RTO_MAX. But if the timeout is zero, it needs tcp_retries2 times probes to end this false timeout. Besides, bitwise shift greater than or equal to the width is an undefined behavior. This patch adds a limit to the backoff. The max value of max_when is TCP_RTO_MAX and the min value of timeout base is TCP_RTO_MIN. The limit is the backoff from TCP_RTO_MIN to TCP_RTO_MAX. Signed-off-by: Cambda Zhu <cambda@linux.alibaba.com> Link: https://lore.kernel.org/r/20201208091910.37618-1-cambda@linux.alibaba.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-15Merge branch 'mptcp-another-set-of-miscellaneous-mptcp-fixes'Jakub Kicinski9-37/+155
Mat Martineau says: ==================== mptcp: Another set of miscellaneous MPTCP fixes This is another collection of MPTCP fixes and enhancements that we have tested in the MPTCP tree: Patch 1 cleans up cgroup attachment for in-kernel subflow sockets. Patches 2 and 3 make sure that deletion of advertised addresses by an MPTCP path manager when flushing all addresses behaves similarly to the remove-single-address operation, and adds related tests. Patches 4 and 8 do some minor cleanup. Patches 5-7 add MPTCP_FASTCLOSE functionality. Note that patch 6 adds MPTCP option parsing to tcp_reset(). Patch 9 optimizes skb size for outgoing MPTCP packets. ==================== Link: https://lore.kernel.org/r/20201210222506.222251-1-mathew.j.martineau@linux.intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-15mptcp: let MPTCP create max size skbsPaolo Abeni1-5/+9
Currently the xmit path of the MPTCP protocol creates smaller- than-max-size skbs, which is suboptimal for the performances. There are a few things to improve: - when coalescing to an existing skb, must clear the PUSH flag - tcp_build_frag() expect the available space as an argument. When coalescing is enable MPTCP already subtracted the to-be-coalesced skb len. We must increment said argument accordingly. Before: ./use_mptcp.sh netperf -H 127.0.0.1 -t TCP_STREAM [...] 131072 16384 16384 30.00 24414.86 After: ./use_mptcp.sh netperf -H 127.0.0.1 -t TCP_STREAM [...] 131072 16384 16384 30.05 28357.69 Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-15mptcp: pm: simplify select_local_address()Paolo Abeni1-4/+2
There is no need to unconditionally acquire the join list lock, we can simply splice the join list into the subflow list and traverse only the latter. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-15mptcp: parse and act on incoming FASTCLOSE optionFlorian Westphal3-0/+54
parse the MPTCP FASTCLOSE subtype. If provided key matches the local one, schedule the work queue to close (with tcp reset) all subflows. The MPTCP socket moves to closed state immediately. Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-15tcp: parse mptcp options contained in reset packetsFlorian Westphal3-7/+10
Because TCP-level resets only affect the subflow, there is a MPTCP option to indicate that the MPTCP-level connection should be closed immediately without a mptcp-level fin exchange. This is the 'MPTCP fast close option'. It can be carried on ack segments or TCP resets. In the latter case, its needed to parse mptcp options also for reset packets so that MPTCP can act accordingly. Next patch will add receive side fastclose support in MPTCP. Acked-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-15mptcp: hold mptcp socket before calling tcp_doneFlorian Westphal1-1/+6
When processing options from tcp reset path its possible that tcp_done(ssk) drops the last reference on the mptcp socket which results in use-after-free. Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-15mptcp: use MPTCPOPT_HMAC_LEN macroGeliang Tang1-1/+1
Use the macro MPTCPOPT_HMAC_LEN instead of a constant in struct mptcp_options_received. Signed-off-by: Geliang Tang <geliangtang@gmail.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-15selftests: mptcp: add the flush addrs testcaseGeliang Tang1-14/+36
This patch added the flush addrs testcase. In do_transfer, if the number of removing addresses is less than 8, use the del addr command to remove the addresses one by one. If the number is more than 8, use the flush addrs command to remove the addresses. Acked-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Geliang Tang <geliangtang@gmail.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-15mptcp: remove address when netlink flushes addrsGeliang Tang1-5/+10
When the PM netlink flushes the addresses, invoke the remove address function mptcp_nl_remove_subflow_and_signal_addr to remove the addresses and the subflows. Since this function should not be invoked under lock, move __flush_addrs out of the pernet->lock. Acked-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Geliang Tang <geliangtang@gmail.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-15mptcp: attach subflow socket to parent cgroupNicolas Rybowski1-0/+27
It has been observed that the kernel sockets created for the subflows (except the first one) are not in the same cgroup as their parents. That's because the additional subflows are created by kernel workers. This is a problem with eBPF programs attached to the parent's cgroup won't be executed for the children. But also with any other features of CGroup linked to a sk. This patch fixes this behaviour. As the subflow sockets are created by the kernel, we can't use 'mem_cgroup_sk_alloc' because of the current context being the one of the kworker. This is why we have to do low level memcg manipulation, if required. Suggested-by: Matthieu Baerts <matthieu.baerts@tessares.net> Suggested-by: Paolo Abeni <pabeni@redhat.com> Acked-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Nicolas Rybowski <nicolas.rybowski@tessares.net> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-15net: mhi: Fix unexpected queue wakeLoic Poulain1-1/+2
This patch checks that MHI queue is not full before waking up the net queue. This fix sporadic MHI queueing issues in xmit. Indeed xmit and its symmetric complete callback (ul_callback) can run concurently, it is then not safe to unconditionnaly waking the queue in the callback without checking queue fullness. Fixes: 3ffec6a14f24 ("net: Add mhi-net driver") Signed-off-by: Loic Poulain <loic.poulain@linaro.org> Link: https://lore.kernel.org/r/1607599507-5879-1-git-send-email-loic.poulain@linaro.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-15net: dsa: mv88e6xxx: don't set non-existing learn2all bit for 6220/6250Rasmus Villemoes1-3/+10
The 6220 and 6250 switches do not have a learn2all bit in global1, ATU control register; bit 3 is reserverd. On the switches that do have that bit, it is used to control whether learning frames are sent out the ports that have the message_port bit set. So rather than adding yet another chip method, use the existence of the ->port_setup_message_port method as a proxy for determining whether the learn2all bit exists (and should be set). Signed-off-by: Rasmus Villemoes <rasmus.villemoes@prevas.dk> Link: https://lore.kernel.org/r/20201210110645.27765-1-rasmus.villemoes@prevas.dk Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-15net: openvswitch: fix TTL decrement exception action executionEelco Chaudron1-9/+6
Currently, the exception actions are not processed correctly as the wrong dataset is passed. This change fixes this, including the misleading comment. In addition, a check was added to make sure we work on an IPv4 packet, and not just assume if it's not IPv6 it's IPv4. This was all tested using OVS with patch, https://patchwork.ozlabs.org/project/openvswitch/list/?series=21639, applied and sending packets with a TTL of 1 (and 0), both with IPv4 and IPv6. Fixes: 69929d4c49e1 ("net: openvswitch: fix TTL decrement action netlink message format") Signed-off-by: Eelco Chaudron <echaudro@redhat.com> Link: https://lore.kernel.org/r/160733569860.3007.12938188180387116741.stgit@wsfd-netdev64.ntdv.lab.eng.bos.redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-15Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-nextJakub Kicinski25-160/+534
Pablo Neira Ayuso says: ==================== Netfilter/IPVS updates for net-next 1) Missing dependencies in NFT_BRIDGE_REJECT, from Randy Dunlap. 2) Use atomic_inc_return() instead of atomic_add_return() in IPVS, from Yejune Deng. 3) Simplify check for overquota in xt_nfacct, from Kaixu Xia. 4) Move nfnl_acct_list away from struct net, from Miao Wang. 5) Pass actual sk in reject actions, from Jan Engelhardt. 6) Add timeout and protoinfo to ctnetlink destroy events, from Florian Westphal. 7) Four patches to generalize set infrastructure to support for multiple expressions per set element. * git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next: netfilter: nftables: netlink support for several set element expressions netfilter: nftables: generalize set extension to support for several expressions netfilter: nftables: move nft_expr before nft_set netfilter: nftables: generalize set expressions support netfilter: ctnetlink: add timeout and protoinfo to destroy events netfilter: use actual socket sk for REJECT action netfilter: nfnl_acct: remove data from struct net netfilter: Remove unnecessary conversion to bool ipvs: replace atomic_add_return() netfilter: nft_reject_bridge: fix build errors due to code movement ==================== Link: https://lore.kernel.org/r/20201212230513.3465-1-pablo@netfilter.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-15Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextJakub Kicinski40-114/+2063
Daniel Borkmann says: ==================== pull-request: bpf-next 2020-12-14 1) Expose bpf_sk_storage_*() helpers to iterator programs, from Florent Revest. 2) Add AF_XDP selftests based on veth devs to BPF selftests, from Weqaar Janjua. 3) Support for finding BTF based kernel attach targets through libbpf's bpf_program__set_attach_target() API, from Andrii Nakryiko. 4) Permit pointers on stack for helper calls in the verifier, from Yonghong Song. 5) Fix overflows in hash map elem size after rlimit removal, from Eric Dumazet. 6) Get rid of direct invocation of llc in BPF selftests, from Andrew Delgadillo. 7) Fix xsk_recvmsg() to reorder socket state check before access, from Björn Töpel. 8) Add new libbpf API helper to retrieve ring buffer epoll fd, from Brendan Jackman. 9) Batch of minor BPF selftest improvements all over the place, from Florian Lehner, KP Singh, Jiri Olsa and various others. * https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (31 commits) selftests/bpf: Add a test for ptr_to_map_value on stack for helper access bpf: Permits pointers on stack for helper calls libbpf: Expose libbpf ring_buffer epoll_fd selftests/bpf: Add set_attach_target() API selftest for module target libbpf: Support modules in bpf_program__set_attach_target() API selftests/bpf: Silence ima_setup.sh when not running in verbose mode. selftests/bpf: Drop the need for LLVM's llc selftests/bpf: fix bpf_testmod.ko recompilation logic samples/bpf: Fix possible hang in xdpsock with multiple threads selftests/bpf: Make selftest compilation work on clang 11 selftests/bpf: Xsk selftests - adding xdpxceiver to .gitignore selftests/bpf: Drop tcp-{client,server}.py from Makefile selftests/bpf: Xsk selftests - Bi-directional Sockets - SKB, DRV selftests/bpf: Xsk selftests - Socket Teardown - SKB, DRV selftests/bpf: Xsk selftests - DRV POLL, NOPOLL selftests/bpf: Xsk selftests - SKB POLL, NOPOLL selftests/bpf: Xsk selftests framework bpf: Only provide bpf_sock_from_file with CONFIG_NET bpf: Return -ENOTSUPP when attaching to non-kernel BTF xsk: Validate socket state in xsk_recvmsg, prior touching socket members ... ==================== Link: https://lore.kernel.org/r/20201214214316.20642-1-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-14selftests/bpf: Add a test for ptr_to_map_value on stack for helper accessYonghong Song2-3/+5
Change bpf_iter_task.c such that pointer to map_value may appear on the stack for bpf_seq_printf() to access. Without previous verifier patch, the bpf_iter test will fail. Signed-off-by: Yonghong Song <yhs@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Song Liu <songliubraving@fb.com> Link: https://lore.kernel.org/bpf/20201210013350.943985-1-yhs@fb.com