summaryrefslogtreecommitdiff
AgeCommit message (Collapse)AuthorFilesLines
2023-06-10net/mlx5: Light probe local SFsShay Drory7-23/+203
In case user wants to configure the SFs, for example: to use only vdpa functionality, he needs to fully probe a SF, configure what he wants, and afterward reload the SF. In order to save the time of the reload, local SFs will probe without any auxiliary sub-device, so that the SFs can be configured prior to its full probe. The defaults of the enable_* devlink params of these SFs are set to false. Usage example: Create SF: $ devlink port add pci/0000:08:00.0 flavour pcisf pfnum 0 sfnum 11 $ devlink port function set pci/0000:08:00.0/32768 \ hw_addr 00:00:00:00:00:11 state active Enable ETH auxiliary device: $ devlink dev param set auxiliary/mlx5_core.sf.1 \ name enable_eth value true cmode driverinit Now, in order to fully probe the SF, use devlink reload: $ devlink dev reload auxiliary/mlx5_core.sf.1 At this point the user have SF devlink instance with auxiliary device for the Ethernet functionality only. Signed-off-by: Shay Drory <shayd@nvidia.com> Reviewed-by: Moshe Shemesh <moshe@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-06-10net/mlx5: Move esw multiport devlink param to eswitch codeShay Drory2-36/+47
Move the param registration and handling code into the eswitch code as they are related to each other. No point in having the devlink param registration done in separate file. Signed-off-by: Shay Drory <shayd@nvidia.com> Reviewed-by: Moshe Shemesh <moshe@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-06-10net/mlx5: Split function_setup() to enable and open functionsShay Drory1-25/+58
mlx5_cmd_init_hca() is taking ~0.2 seconds. In case of a user who desire to disable some of the SF aux devices, and with large scale-1K SFs for example, this user will waste more than 3 minutes on mlx5_cmd_init_hca() which isn't needed at this stage. Downstream patch will change SFs which are probe over the E-switch, local SFs, to be probed without any aux dev. In order to support this, split function_setup() to avoid executing mlx5_cmd_init_hca(). Signed-off-by: Shay Drory <shayd@nvidia.com> Reviewed-by: Moshe Shemesh <moshe@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-06-10net/mlx5: Set max number of embedded CPU VFsDaniel Jurgens1-0/+1
Set the maximum number of embedded cpu VF functions available. Signed-off-by: Daniel Jurgens <danielj@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-06-10net/mlx5: Update SRIOV enable/disable to handle EC/VFsDaniel Jurgens3-9/+30
Previously on the embedded CPU platform SRIOV was never enabled/disabled via mlx5_core_sriov_configure. Host VF updates are provided by an event handler. Now in the disable flow it must be known if this is a disable due to driver unload or SRIOV detach, or if the user updated the number of VFs. If due to change in the number of VFs only wait for the pages of ECVFs. Signed-off-by: Daniel Jurgens <danielj@nvidia.com> Reviewed-by: William Tu <witu@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-06-10net/mlx5: Query correct caps for min msix vectorsDaniel Jurgens1-1/+15
The VFs on the host and the embedded CPU platform share function numbers. Set the ec_vf_function field to query the caps for the correct function. Signed-off-by: Daniel Jurgens <danielj@nvidia.com> Reviewed-by: William Tu <witu@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-06-10net/mlx5: Use correct vport when restoring GUIDsDaniel Jurgens1-3/+7
Prior to enabling EC VF functionality the vport number and function ID were always the same. That's not the case now. Use the correct vport number to modify the HCA vport context. Signed-off-by: Daniel Jurgens <danielj@nvidia.com> Reviewed-by: William Tu <witu@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-06-10net/mlx5: Add new page type for EC VF pagesDaniel Jurgens3-1/+12
When the embedded cpu supports SRIOV it can be enabled and disabled independently from the host SRIOV. Track the pages separately so we can properly wait for returned VF pages. Signed-off-by: Daniel Jurgens <danielj@nvidia.com> Reviewed-by: William Tu <witu@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-06-10net/mlx5: Add/remove peer miss rules for EC VFsDaniel Jurgens1-0/+32
Add and remove the peer miss rules for EC VFs. It's possible that there are different amounts of total VFs per function so only create rules for the minimum number of max VFs. Signed-off-by: Daniel Jurgens <danielj@nvidia.com> Reviewed-by: William Tu <witu@nvidia.com> Reviewed-by: Roi Dayan <roid@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-06-10net/mlx5: Add management of EC VF vportsDaniel Jurgens3-17/+143
Add init, load, unload, and cleanup of the EC VF vports. This includes changes in how eswitch SRIOV is managed. Previous on an embedded CPU platform the number of VFs provided when enabling the eswitch was always 0, host VFs vports are handled in the eswitch functions change event handler. Now track the number of EC VFs as well, so they can be handled properly in the enable/disable flows. There are only 3 marks available for use in xarrays, all 3 were already in use for this use case. EC VF vports are in a known range so we can access them by index instead of marks. Signed-off-by: Daniel Jurgens <danielj@nvidia.com> Reviewed-by: William Tu <witu@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-06-10net/mlx5: Update vport caps query/set for EC VFsDaniel Jurgens3-8/+19
These functions are for query/set by vport, there was an underlying assumption that vport was equal to function ID. That's not the case for EC VF functions. Set the ec_vf_function bit accordingly. Signed-off-by: Daniel Jurgens <danielj@nvidia.com> Reviewed-by: William Tu <witu@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-06-10net/mlx5: Enable devlink port for embedded cpu VF vportsDaniel Jurgens3-1/+33
Enable creation of a devlink port for EC VF vports. Signed-off-by: Daniel Jurgens <danielj@nvidia.com> Reviewed-by: William Tu <witu@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-06-10net/mlx5: mlx5_ifc updates for embedded CPU SRIOVDaniel Jurgens1-3/+8
Add ec_vf_vport_base to HCA Capabilities 2. This indicates the base vport of embedded CPU virtual functions that are connected to the eswitch. Add ec_vf_function to query/set_hca_caps. If set this indicates accessing a virtual function on the embedded CPU by function ID. This should only be used with other_function set to 1. Signed-off-by: Daniel Jurgens <danielj@nvidia.com> Reviewed-by: Bodong Wang <bodong@nvidia.com> Reviewed-by: William Tu <witu@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-06-10net/mlx5: Simplify unload all rep codeDaniel Jurgens1-47/+1
Instead of using type specific iterators which are only used in one place just traverse the xarray. It will provide suitable ordering based on the vport numbers. This will also eliminate the need for changes here when new types are added. Signed-off-by: Daniel Jurgens <danielj@nvidia.com> Reviewed-by: William Tu <witu@nvidia.com> Reviewed-by: Parav Pandit <parav@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-06-10Merge branch 'tools-ynl-gen-code-gen-improvements-before-ethtool'Jakub Kicinski8-329/+186
Jakub Kicinski says: ==================== tools: ynl-gen: code gen improvements before ethtool I was going to post ethtool but I couldn't stand the ugliness of the if conditions which were previously generated. So I cleaned that up and improved a number of other things ethtool will benefit from. ==================== Link: https://lore.kernel.org/r/20230608211200.1247213-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-06-10tools: ynl-gen: support / skip pads on the way to kernelJakub Kicinski1-0/+6
Kernel does not have padding requirements for 64b attrs. We can ignore pad attrs. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-06-10tools: ynl-gen: don't pass op_name to RenderInfoJakub Kicinski1-19/+18
The op_name argument is barely used and identical to op.name in all cases. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-06-10tools: ynl-gen: support code gen for eventsJakub Kicinski2-6/+13
Netlink specs support both events and notifications (former can define their own message contents). Plug in missing code to generate types, parsers and include events into notification tables. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-06-10tools: ynl-gen: sanitize notification trackingJakub Kicinski2-43/+27
Don't modify the raw dicts (as loaded from YAML) to pretend that the notify attributes also exist on the ops. This makes the code easier to follow. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-06-10tools: ynl: regen: stop generating common notification handlersJakub Kicinski4-98/+0
Remove unused notification handlers. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-06-10tools: ynl-gen: stop generating common notification handlersJakub Kicinski1-73/+0
Common notification handler was supposed to be a way for the user to parse the notifications from a socket synchronously. I don't think we'll end up using it, ynl_ntf_check() works for all known use cases. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-06-10tools: ynl: regen: regenerate the if laddersJakub Kicinski4-74/+67
Renegate the code to combine } and else and use tmp variable to store type. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-06-10tools: ynl-gen: get attr type outside of if()Jakub Kicinski1-1/+3
Reading attr type with mnl_attr_get_type() for each condition leads to most conditions being longer than 80 chars. Avoid this by reading the type to a variable on the stack. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-06-10tools: ynl-gen: combine else with closing bracketJakub Kicinski1-4/+19
Code gen currently prints: } else if (... This is really ugly. Fix it by delaying printing of closing brackets in anticipation of else coming along. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-06-10tools: ynl-gen: complete the C keyword listJakub Kicinski1-1/+34
C keywords need to be avoided when naming things. Complete the list (ethtool has at least one thing called "auto"). Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-06-10tools: ynl: regen: cleanup user space header includesJakub Kicinski4-12/+4
Remove unnecessary includes. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-06-10tools: ynl-gen: cleanup user space header includesJakub Kicinski1-4/+1
Bots started screaming that we're including stdlib.h twice. While at it move string.h into a common spot and drop stdio.h which we don't need. Reported-by: Abaci Robot <abaci@linux.alibaba.com> Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=5464 Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=5466 Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=5467 Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-06-09Revert "tools: ynl: Remove duplicated include in handshake-user.c"Jakub Kicinski1-0/+1
This reverts commit e7c5433c5aaab52ddd5448967a9a5db94a3939cc. Commit e7c5433c5aaa ("tools: ynl: Remove duplicated include in handshake-user.c") was applied too hastily. It changes an auto-generated file, and there's already a proper fix on the list. Link: https://lore.kernel.org/all/ZIMPLYi%2FxRih+DlC@nanopsycho/ Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-06-09tools: ynl: Remove duplicated include in handshake-user.cYang Li1-1/+0
./tools/net/ynl/generated/handshake-user.c: stdlib.h is included more than once. Reported-by: Abaci Robot <abaci@linux.alibaba.com> Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=5464 Signed-off-by: Yang Li <yang.lee@linux.alibaba.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-06-09Merge branch 'broadcom-phy-led-brightness'David S. Miller4-8/+56
Florian Fainelli says: ==================== LED brightness support for Broadcom PHYs This patch series adds support for controlling the LED brightness on Broadcom PHYs. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2023-06-09net: phy: broadcom: Add support for setting LED brightnessFlorian Fainelli4-0/+48
Broadcom PHYs have two LEDs selector registers which allow us to control the LED assignment, including how to turn them on/off. Signed-off-by: Florian Fainelli <florian.fainelli@broadcom.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-06-09net: phy: broadcom: Rename LED registersFlorian Fainelli2-8/+8
These registers are common to most PHYs and are not specific to the BCM5482, renamed the constants accordingly, no functional change. Signed-off-by: Florian Fainelli <florian.fainelli@broadcom.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-06-09Merge branch 'net-ncsi-refactoring-for-GMA-cmd'David S. Miller1-71/+22
Ivan Mikhaylov says: ==================== net/ncsi: refactoring for GMA command Make one GMA function for all manufacturers, change ndo_set_mac_address to dev_set_mac_address for notifiying net layer about MAC change which ndo_set_mac_address doesn't do. Changes from v1: 1. delete ftgmac100.txt changes about mac-address-increment 2. add convert to yaml from ftgmac100.txt 3. add mac-address-increment option for ethernet-controller.yaml Changes from v2: 1. remove DT changes from series, will be done in another one ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2023-06-09net/ncsi: change from ndo_set_mac_address to dev_set_mac_addressIvan Mikhaylov1-2/+3
Change ndo_set_mac_address to dev_set_mac_address because dev_set_mac_address provides a way to notify network layer about MAC change. In other case, services may not aware about MAC change and keep using old one which set from network adapter driver. As example, DHCP client from systemd do not update MAC address without notification from net subsystem which leads to the problem with acquiring the right address from DHCP server. Fixes: cb10c7c0dfd9e ("net/ncsi: Add NCSI Broadcom OEM command") Cc: stable@vger.kernel.org # v6.0+ 2f38e84 net/ncsi: make one oem_gma function for all mfr id Signed-off-by: Paul Fertser <fercerpav@gmail.com> Signed-off-by: Ivan Mikhaylov <fr0st61te@gmail.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-06-09net/ncsi: make one oem_gma function for all mfr idIvan Mikhaylov1-69/+19
Make the one Get Mac Address function for all manufacturers and change this call in handlers accordingly. Reviewed-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: Ivan Mikhaylov <fr0st61te@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-06-09usbnet: ipheth: update Kconfig descriptionFoster Snowhill1-6/+4
This module has for a long time not been limited to iPhone <= 3GS. Update description to match the actual state of the driver. Remove dead link from 2010, instead reference an existing userspace iOS device pairing implementation as part of libimobiledevice. Signed-off-by: Foster Snowhill <forst@pen.gy> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-06-09usbnet: ipheth: add CDC NCM supportFoster Snowhill1-25/+155
Recent iOS releases support CDC NCM encapsulation on RX. This mode is the default on macOS and Windows. In this mode, an iOS device may include one or more Ethernet frames inside a single URB. Freshly booted iOS devices start in legacy mode, but are put into NCM mode by the official Apple driver. When reconnecting such a device from a macOS/Windows machine to a Linux host, the device stays in NCM mode, making it unusable with the legacy ipheth driver code. To correctly support such a device, the driver has to either support the NCM mode too, or put the device back into legacy mode. To match the behaviour of the macOS/Windows driver, and since there is no documented control command to revert to legacy mode, implement NCM support. The device is attempted to be put into NCM mode by default, and falls back to legacy mode if the attempt fails. Signed-off-by: Foster Snowhill <forst@pen.gy> Tested-by: Georgi Valkov <gvalkov@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-06-09usbnet: ipheth: transmit URBs without trailing paddingFoster Snowhill1-3/+1
The behaviour of the official iOS tethering driver on macOS is to not transmit any trailing padding at the end of URBs. This is applicable to both NCM and legacy modes, including older devices. Adapt the driver to not include trailing padding in TX URBs, matching the behaviour of the official macOS driver. Signed-off-by: Foster Snowhill <forst@pen.gy> Tested-by: Georgi Valkov <gvalkov@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-06-09usbnet: ipheth: fix risk of NULL pointer deallocationGeorgi Valkov1-1/+1
The cleanup precedure in ipheth_probe will attempt to free a NULL pointer in dev->ctrl_buf if the memory allocation for this buffer is not successful. While kfree ignores NULL pointers, and the existing code is safe, it is a better design to rearrange the goto labels and avoid this. Signed-off-by: Georgi Valkov <gvalkov@gmail.com> Signed-off-by: Foster Snowhill <forst@pen.gy> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-06-09Merge branch ↵Jakub Kicinski25-238/+478
'splice-net-rewrite-splice-to-socket-fix-splice_f_more-and-handle-msg_splice_pages-in-af_tls' David Howells says: ==================== splice, net: Rewrite splice-to-socket, fix SPLICE_F_MORE and handle MSG_SPLICE_PAGES in AF_TLS Here are patches to do the following: (1) Block MSG_SENDPAGE_* flags from leaking into ->sendmsg() from userspace, whilst allowing splice_to_socket() to pass them in. (2) Allow MSG_SPLICE_PAGES to be passed into tls_*_sendmsg(). Until support is added, it will be ignored and a splice-driven sendmsg() will be treated like a normal sendmsg(). TCP, UDP, AF_UNIX and Chelsio-TLS already handle the flag in net-next. (3) Replace a chain of functions to splice-to-sendpage with a single function to splice via sendmsg() with MSG_SPLICE_PAGES. This allows a bunch of pages to be spliced from a pipe in a single call using a bio_vec[] and pushes the main processing loop down into the bowels of the protocol driver rather than repeatedly calling in with a page at a time. (4) Provide a ->splice_eof() op[2] that allows splice to signal to its output that the input observed a premature EOF and that the caller didn't flag SPLICE_F_MORE, thereby allowing a corked socket to be flushed. This attempts to maintain the current behaviour. It is also not called if we didn't manage to read any data and so didn't called the actor function. This needs routing though several layers to get it down to the network protocol. [!] Note that I chose not to pass in any flags - I'm not sure it's particularly useful to pass in the splice flags; I also elected not to return any error code - though we might actually want to do that. (5) Provide tls_{device,sw}_splice_eof() to flush a pending TLS record if there is one. (6) Provide splice_eof() for UDP, TCP, Chelsio-TLS and AF_KCM. AF_UNIX doesn't seem to pay attention to the MSG_MORE or MSG_SENDPAGE_NOTLAST flags. (7) Alter the behaviour of sendfile() and fix SPLICE_F_MORE/MSG_MORE signalling[1] such SPLICE_F_MORE is always signalled until we have read sufficient data to finish the request. If we get a zero-length before we've managed to splice sufficient data, we now leave the socket expecting more data and leave it to userspace to deal with it. (8) Make AF_TLS handle the MSG_SPLICE_PAGES internal sendmsg flag. MSG_SPLICE_PAGES is an internal hint that tells the protocol that it should splice the pages supplied if it can. Its sendpage implementations are then turned into wrappers around that. Link: https://lore.kernel.org/r/499791.1685485603@warthog.procyon.org.uk/ [1] Link: https://lore.kernel.org/r/CAHk-=wh=V579PDYvkpnTobCLGczbgxpMgGmmhqiTyE34Cpi5Gg@mail.gmail.com/ [2] Link: https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git/commit/?id=51c78a4d532efe9543a4df019ff405f05c6157f6 # part 1 Link: https://lore.kernel.org/r/20230524153311.3625329-1-dhowells@redhat.com/ # v1 ==================== Link: https://lore.kernel.org/r/20230607181920.2294972-1-dhowells@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-06-09tls/device: Convert tls_device_sendpage() to use MSG_SPLICE_PAGESDavid Howells1-69/+23
Convert tls_device_sendpage() to use sendmsg() with MSG_SPLICE_PAGES rather than directly splicing in the pages itself. With that, the tls_iter_offset union is no longer necessary and can be replaced with an iov_iter pointer and the zc_page argument to tls_push_data() can also be removed. This allows ->sendpage() to be replaced by something that can handle multiple multipage folios in a single transaction. Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Jakub Kicinski <kuba@kernel.org> cc: Chuck Lever <chuck.lever@oracle.com> cc: Boris Pismenny <borisp@nvidia.com> cc: John Fastabend <john.fastabend@gmail.com> cc: Jens Axboe <axboe@kernel.dk> cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-06-09tls/device: Support MSG_SPLICE_PAGESDavid Howells1-0/+26
Make TLS's device sendmsg() support MSG_SPLICE_PAGES. This causes pages to be spliced from the source iterator if possible. This allows ->sendpage() to be replaced by something that can handle multiple multipage folios in a single transaction. Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Jakub Kicinski <kuba@kernel.org> cc: Chuck Lever <chuck.lever@oracle.com> cc: Boris Pismenny <borisp@nvidia.com> cc: John Fastabend <john.fastabend@gmail.com> cc: Jens Axboe <axboe@kernel.dk> cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-06-09tls/sw: Convert tls_sw_sendpage() to use MSG_SPLICE_PAGESDavid Howells1-138/+35
Convert tls_sw_sendpage() and tls_sw_sendpage_locked() to use sendmsg() with MSG_SPLICE_PAGES rather than directly splicing in the pages itself. [!] Note that tls_sw_sendpage_locked() appears to have the wrong locking upstream. I think the caller will only hold the socket lock, but it should hold tls_ctx->tx_lock too. This allows ->sendpage() to be replaced by something that can handle multiple multipage folios in a single transaction. Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Jakub Kicinski <kuba@kernel.org> cc: Chuck Lever <chuck.lever@oracle.com> cc: Boris Pismenny <borisp@nvidia.com> cc: John Fastabend <john.fastabend@gmail.com> cc: Jens Axboe <axboe@kernel.dk> cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-06-09tls/sw: Support MSG_SPLICE_PAGESDavid Howells1-0/+41
Make TLS's sendmsg() support MSG_SPLICE_PAGES. This causes pages to be spliced from the source iterator if possible. This allows ->sendpage() to be replaced by something that can handle multiple multipage folios in a single transaction. Signed-off-by: David Howells <dhowells@redhat.com> cc: Chuck Lever <chuck.lever@oracle.com> cc: Boris Pismenny <borisp@nvidia.com> cc: John Fastabend <john.fastabend@gmail.com> cc: Jens Axboe <axboe@kernel.dk> cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-06-09splice, net: Fix SPLICE_F_MORE signalling in splice_direct_to_actor()David Howells1-8/+10
splice_direct_to_actor() doesn't manage SPLICE_F_MORE correctly[1] - and, as a result, it incorrectly signals/fails to signal MSG_MORE when splicing to a socket. The problem I'm seeing happens when a short splice occurs because we got a short read due to hitting the EOF on a file: as the length read (read_len) is less than the remaining size to be spliced (len), SPLICE_F_MORE (and thus MSG_MORE) is set. The issue is that, for the moment, we have no way to know *why* the short read occurred and so can't make a good decision on whether we *should* keep MSG_MORE set. MSG_SENDPAGE_NOTLAST was added to work around this, but that is also set incorrectly under some circumstances - for example if a short read fills a single pipe_buffer, but the next read would return more (seqfile can do this). This was observed with the multi_chunk_sendfile tests in the tls kselftest program. Some of those tests would hang and time out when the last chunk of file was less than the sendfile request size: build/kselftest/net/tls -r tls.12_aes_gcm.multi_chunk_sendfile This has been observed before[2] and worked around in AF_TLS[3]. Fix this by making splice_direct_to_actor() always signal SPLICE_F_MORE if we haven't yet hit the requested operation size. SPLICE_F_MORE remains signalled if the user passed it in to splice() but otherwise gets cleared when we've read sufficient data to fulfill the request. If, however, we get a premature EOF from ->splice_read(), have sent at least one byte and SPLICE_F_MORE was not set by the caller, ->splice_eof() will be invoked. Signed-off-by: David Howells <dhowells@redhat.com> cc: Linus Torvalds <torvalds@linux-foundation.org> cc: Jens Axboe <axboe@kernel.dk> cc: Christoph Hellwig <hch@lst.de> cc: Al Viro <viro@zeniv.linux.org.uk> cc: Matthew Wilcox <willy@infradead.org> cc: Jan Kara <jack@suse.cz> cc: Jeff Layton <jlayton@kernel.org> cc: David Hildenbrand <david@redhat.com> cc: Christian Brauner <brauner@kernel.org> cc: Chuck Lever <chuck.lever@oracle.com> cc: Boris Pismenny <borisp@nvidia.com> cc: John Fastabend <john.fastabend@gmail.com> cc: linux-mm@kvack.org Link: https://lore.kernel.org/r/499791.1685485603@warthog.procyon.org.uk/ [1] Link: https://lore.kernel.org/r/1591392508-14592-1-git-send-email-pooja.trivedi@stackpath.com/ [2] Link: https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git/commit/?id=d452d48b9f8b1a7f8152d33ef52cfd7fe1735b0a [3] Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-06-09kcm: Use splice_eof() to flushDavid Howells1-0/+15
Allow splice to undo the effects of MSG_MORE after prematurely ending a splice/sendfile due to getting an EOF condition (->splice_read() returned 0) after splice had called sendmsg() with MSG_MORE set when the user didn't set MSG_MORE. Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/r/CAHk-=wh=V579PDYvkpnTobCLGczbgxpMgGmmhqiTyE34Cpi5Gg@mail.gmail.com/ Signed-off-by: David Howells <dhowells@redhat.com> cc: Tom Herbert <tom@herbertland.com> cc: Tom Herbert <tom@quantonium.net> cc: Cong Wang <cong.wang@bytedance.com> cc: Jens Axboe <axboe@kernel.dk> cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-06-09chelsio/chtls: Use splice_eof() to flushDavid Howells3-0/+11
Allow splice to end a Chelsio TLS record after prematurely ending a splice/sendfile due to getting an EOF condition (->splice_read() returned 0) after splice had called sendmsg() with MSG_MORE set when the user didn't set MSG_MORE. Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/r/CAHk-=wh=V579PDYvkpnTobCLGczbgxpMgGmmhqiTyE34Cpi5Gg@mail.gmail.com/ Signed-off-by: David Howells <dhowells@redhat.com> cc: Ayush Sawal <ayush.sawal@chelsio.com> cc: Jens Axboe <axboe@kernel.dk> cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-06-09ipv4, ipv6: Use splice_eof() to flushDavid Howells10-0/+71
Allow splice to undo the effects of MSG_MORE after prematurely ending a splice/sendfile due to getting an EOF condition (->splice_read() returned 0) after splice had called sendmsg() with MSG_MORE set when the user didn't set MSG_MORE. For UDP, a pending packet will not be emitted if the socket is closed before it is flushed; with this change, it be flushed by ->splice_eof(). For TCP, it's not clear that MSG_MORE is actually effective. Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/r/CAHk-=wh=V579PDYvkpnTobCLGczbgxpMgGmmhqiTyE34Cpi5Gg@mail.gmail.com/ Signed-off-by: David Howells <dhowells@redhat.com> cc: Kuniyuki Iwashima <kuniyu@amazon.com> cc: Willem de Bruijn <willemdebruijn.kernel@gmail.com> cc: David Ahern <dsahern@kernel.org> cc: Jens Axboe <axboe@kernel.dk> cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-06-09tls/device: Use splice_eof() to flushDavid Howells3-0/+26
Allow splice to end a TLS record after prematurely ending a splice/sendfile due to getting an EOF condition (->splice_read() returned 0) after splice had called TLS with a sendmsg() with MSG_MORE set when the user didn't set MSG_MORE. Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/r/CAHk-=wh=V579PDYvkpnTobCLGczbgxpMgGmmhqiTyE34Cpi5Gg@mail.gmail.com/ Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Jakub Kicinski <kuba@kernel.org> cc: Chuck Lever <chuck.lever@oracle.com> cc: Boris Pismenny <borisp@nvidia.com> cc: John Fastabend <john.fastabend@gmail.com> cc: Jens Axboe <axboe@kernel.dk> cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-06-09tls/sw: Use splice_eof() to flushDavid Howells3-0/+77
Allow splice to end a TLS record after prematurely ending a splice/sendfile due to getting an EOF condition (->splice_read() returned 0) after splice had called TLS with a sendmsg() with MSG_MORE set when the user didn't set MSG_MORE. Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/r/CAHk-=wh=V579PDYvkpnTobCLGczbgxpMgGmmhqiTyE34Cpi5Gg@mail.gmail.com/ Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Jakub Kicinski <kuba@kernel.org> cc: Chuck Lever <chuck.lever@oracle.com> cc: Boris Pismenny <borisp@nvidia.com> cc: John Fastabend <john.fastabend@gmail.com> cc: Jens Axboe <axboe@kernel.dk> cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>