author | Linus Torvalds <torvalds@linux-foundation.org> | 2025-05-29 01:24:36 +0300 |
---|---|---|
committer | Linus Torvalds <torvalds@linux-foundation.org> | 2025-05-29 01:24:36 +0300 |
commit | 1b98f357dadd6ea613a435fbaef1a5dd7b35fd21 (patch) | |
tree | 32a7195aead30f4dcadf3c3f897df2b4611b88b8 /drivers/net/ovpn/tcp.c | |
parent | 47cf96fbe393839b125a9b694a8cfdd3f4216baa (diff) | |
parent | f6bd8faeb113c8ab783466bc5bc1a5442ae85176 (diff) | |
Merge tag 'net-next-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
Pull networking updates from Paolo Abeni:
"Core:
- Implement the Device Memory TCP transmit path, allowing zero-copy
data transmission on top of TCP from e.g. GPU memory to the wire.
- Move all the IPv6 routing tables management outside the RTNL scope,
under its own lock and RCU. The route control path is now 3x
faster.
- Convert queue-related netlink ops to the instance lock, further
reducing the scope of the RTNL lock. This improves control plane
scalability.
- Refactor the software crc32c implementation, removing unneeded
abstraction layers and significantly improving the related
micro-benchmarks.
- Optimize the GRO engine for UDP-tunneled traffic, for a 10%
performance improvement in related stream tests.
- Cover more per-CPU storage with local nested BH locking; this is
prep work for removing the current per-CPU lock in
local_bh_disable() on PREEMPT_RT.
- Introduce and use the nlmsg_payload helper, combining buffer bounds
verification with access to the payload carried by netlink messages
(see the sketch after the quoted message).
Netfilter:
- Rewrite the procfs conntrack table implementation, considerably
improving dump performance. A lot of user-space tools still use
this interface.
- Implement support for wildcard netdevice in netdev basechain and
flowtables.
- Integrate conntrack information into nft trace infrastructure.
- Export set count and backend name to userspace, for better
introspection.
BPF:
- BPF qdisc support: BPF-qdisc can be implemented with BPF struct_ops
programs and can be controlled in a similar way to traditional qdiscs
using the "tc qdisc" command.
- Refactor the UDP socket iterator, addressing long-standing issues
WRT duplicate hits or missed sockets.
Protocols:
- Improve TCP receive buffer auto-tuning and increase the default
upper bound for the receive buffer; overall this improves the
single-flow maximum throughput on a 200Gb/s link by over 60%.
- Add AFS GSSAPI security class to AF_RXRPC; it provides transport
security for connections to the AFS fileserver and VL server.
- Improve TCP multipath routing, so that the source address always
matches the nexthop device.
- Introduce SO_PASSRIGHTS for AF_UNIX, to allow disabling SCM_RIGHTS,
and thus preventing DoS caused by passing around problematic FDs
(see the sketch after the quoted message).
- Retire DCCP socket. DCCP only receives updates for bugs, and major
distros disable it by default. Its removal allows for better
organisation of TCP fields to reduce the number of cache lines hit
in the fast path.
- Extend TCP drop-reason support to cover PAWS checks.
Driver API:
- Reorganize PTP ioctl flag support to require an explicit opt-in
from drivers, avoiding the problem of drivers not rejecting new,
unsupported flags (see the sketch after the quoted message).
- Convert several device drivers to the timestamping APIs.
- Introduce per-PHY ethtool dump helpers, improving the support for
dump operations targeting PHYs.
Tests and tooling:
- Add support for classic netlink in user space C codegen, so that
ynl-c can now read, create and modify links, routes, addresses and
qdisc layer configuration.
- Add ynl sub-types for binary attributes, allowing ynl-c to output
known structs instead of raw binary data, clarifying the classic
netlink output.
- Extend MPTCP selftests to improve code coverage.
- Add tests for XDP tail adjustment in AF_XDP.
New hardware / drivers:
- OpenVPN virtual driver: offload OpenVPN data-channel processing to
kernel space, increasing data transfer throughput WRT the
user-space implementation.
- Renesas glue driver for the gigabit ethernet RZ/V2H(P) SoC.
- Broadcom asp-v3.0 ethernet driver.
- AMD Renoir ethernet device.
- RealTek MT9888 2.5G ethernet PHY driver.
- Aeonsemi 10G C45 PHYs driver.
Drivers:
- Ethernet high-speed NICs:
- nVidia/Mellanox (mlx5):
- refactor the steering table handling to significantly
reduce the amount of memory used
- add support for complex matches in H/W flow steering
- improve flow steering error handling
- convert to netdev instance locking
- Intel (100G, ice, igb, ixgbe, idpf):
- ice: add switchdev support for LLDP traffic over VF
- ixgbe: add firmware manipulation and regions devlink support
- igb: introduce support for frame transmission preemption
- igb: add persistent NAPI configuration
- idpf: introduce RDMA support
- idpf: add initial PTP support
- Meta (fbnic):
- extend hardware stats coverage
- add devlink dev flash support
- Broadcom (bnxt):
- add support for RX-side device memory TCP
- Wangxun (txgbe):
- implement support for UDP tunnel offload
- complete PTP and SR-IOV support for AML 25G/10G devices
- Ethernet NICs embedded and virtual:
- Google (gve):
- add device memory TCP TX support
- Amazon (ena):
- support persistent per-NAPI config
- Airoha:
- add H/W support for L2 traffic offload
- add per flow stats for flow offloading
- RealTek (rtl8211): add support for WoL magic packet
- Synopsys (stmmac):
- dwmac-socfpga 1000BASE-X support
- add Loongson-2K3000 support
- introduce support for hardware-accelerated VLAN stripping
- Broadcom (bcmgenet):
- expose more H/W stats
- Freescale (enetc, dpaa2-eth):
- enetc: add MAC filter, VLAN filter, RSS and loopback support
- dpaa2-eth: convert to H/W timestamping APIs
- vxlan: convert FDB table to rhashtable, for better scalability
- veth: apply qdisc backpressure on full ring to reduce TX drops
- Ethernet switches:
- Microchip (ksz88x3): add ETS scheduler support
- Ethernet PHYs:
- RealTek (rtl8211):
- add support for WoL magic packet
- add support for PHY LEDs
- CAN:
- Add RZ/G3E CANFD support to the rcar_canfd driver.
- Preparatory work for CAN-XL support.
- Add self-tests framework with support for CAN physical interfaces.
- WiFi:
- mac80211:
- scan improvements with multi-link operation (MLO)
- Qualcomm (ath12k):
- enable AHB support for IPQ5332
- add monitor interface support to QCN9274
- add multi-link operation support to WCN7850
- add 802.11d scan offload support to WCN7850
- monitor mode for WCN7850, better 6 GHz regulatory
- Qualcomm (ath11k):
- restore hibernation support
- MediaTek (mt76):
- WiFi-7 improvements
- implement support for mt7990
- Intel (iwlwifi):
- enhanced multi-link single-radio (EMLSR) support on 5 GHz links
- rework device configuration
- RealTek (rtw88):
- improve throughput for RTL8814AU
- RealTek (rtw89):
- add multi-link operation support
- STA/P2P concurrency improvements
- support different SAR configs by antenna
- Bluetooth:
- introduce HCI Driver protocol
- btintel_pcie: do not generate coredump for diagnostic events
- btusb: add HCI Drv commands for configuring altsetting
- btusb: add RTL8851BE device 0x0bda:0xb850
- btusb: add new VID/PID 13d3/3584 for MT7922
- btusb: add new VID/PID 13d3/3630 and 13d3/3613 for MT7925
- btnxpuart: implement host-wakeup feature"
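
The nlmsg_payload helper mentioned under "Core" above folds the usual
payload-length check and the nlmsg_data() access into one call. A minimal
kernel-side sketch of the before/after pattern (illustrative only; it
assumes the helper keeps the (nlh, expected payload size) signature used
in the series):

	static int example_parse(const struct nlmsghdr *nlh)
	{
		const struct ifinfomsg *ifm;

		/* before, open-coded:
		 *	if (nlmsg_len(nlh) < sizeof(*ifm))
		 *		return -EINVAL;
		 *	ifm = nlmsg_data(nlh);
		 */

		/* after: nlmsg_payload() returns NULL when the message
		 * payload is shorter than the requested size
		 */
		ifm = nlmsg_payload(nlh, sizeof(*ifm));
		if (!ifm)
			return -EINVAL;

		return ifm->ifi_index;
	}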
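
The SO_PASSRIGHTS option mentioned under "Protocols" above can be
exercised from user space as follows; a minimal sketch, assuming a
libc/uapi that already defines SO_PASSRIGHTS:

	#include <stdio.h>
	#include <sys/socket.h>

	/* Disable SCM_RIGHTS on an AF_UNIX socket: once cleared,
	 * attempts to pass file descriptors to this socket via
	 * sendmsg() are expected to fail.
	 */
	static int unix_disable_fd_passing(int fd)
	{
		int off = 0;

		if (setsockopt(fd, SOL_SOCKET, SO_PASSRIGHTS,
			       &off, sizeof(off))) {
			perror("setsockopt(SO_PASSRIGHTS)");
			return -1;
		}
		return 0;
	}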
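
For the PTP ioctl flag rework mentioned under "Driver API", a hedged
sketch of the driver-side opt-in, assuming it takes the form of a
supported_extts_flags field in struct ptp_clock_info (the field name is
an assumption of this note, not confirmed by the diff below):

	static const struct ptp_clock_info example_ptp_info = {
		.owner		= THIS_MODULE,
		.name		= "example-phc",
		.n_ext_ts	= 1,
		/* explicit opt-in (assumed field): only these EXTTS
		 * flags are accepted, the core rejects everything else
		 */
		.supported_extts_flags	= PTP_RISING_EDGE |
					  PTP_FALLING_EDGE |
					  PTP_STRICT_FLAGS,
		/* .enable, .gettimex64, ... as usual */
	};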
* tag 'net-next-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1611 commits)
selftests/bpf: Fix bpf selftest build warning
selftests: netfilter: Fix skip of wildcard interface test
net: phy: mscc: Stop clearing the the UDPv4 checksum for L2 frames
net: openvswitch: Fix the dead loop of MPLS parse
calipso: Don't call calipso functions for AF_INET sk.
selftests/tc-testing: Add a test for HFSC eltree double add with reentrant enqueue behaviour on netem
net_sched: hfsc: Address reentrant enqueue adding class to eltree twice
octeontx2-pf: QOS: Refactor TC_HTB_LEAF_DEL_LAST callback
octeontx2-pf: QOS: Perform cache sync on send queue teardown
net: mana: Add support for Multi Vports on Bare metal
net: devmem: ncdevmem: remove unused variable
net: devmem: ksft: upgrade rx test to send 1K data
net: devmem: ksft: add 5 tuple FS support
net: devmem: ksft: add exit_wait to make rx test pass
net: devmem: ksft: add ipv4 support
net: devmem: preserve sockc_err
page_pool: fix ugly page_pool formatting
net: devmem: move list_add to net_devmem_bind_dmabuf.
selftests: netfilter: nft_queue.sh: include file transfer duration in log message
net: phy: mscc: Fix memory leak when using one step timestamping
...
Diffstat (limited to 'drivers/net/ovpn/tcp.c')
-rw-r--r-- | drivers/net/ovpn/tcp.c | 598 |
1 file changed, 598 insertions, 0 deletions
diff --git a/drivers/net/ovpn/tcp.c b/drivers/net/ovpn/tcp.c
new file mode 100644
index 000000000000..7c42d84987ad
--- /dev/null
+++ b/drivers/net/ovpn/tcp.c
@@ -0,0 +1,598 @@
+// SPDX-License-Identifier: GPL-2.0
+/* OpenVPN data channel offload
+ *
+ * Copyright (C) 2019-2025 OpenVPN, Inc.
+ *
+ * Author: Antonio Quartulli <antonio@openvpn.net>
+ */
+
+#include <linux/skbuff.h>
+#include <net/hotdata.h>
+#include <net/inet_common.h>
+#include <net/ipv6.h>
+#include <net/tcp.h>
+#include <net/transp_v6.h>
+#include <net/route.h>
+#include <trace/events/sock.h>
+
+#include "ovpnpriv.h"
+#include "main.h"
+#include "io.h"
+#include "peer.h"
+#include "proto.h"
+#include "skb.h"
+#include "tcp.h"
+
+#define OVPN_TCP_DEPTH_NESTING	2
+#if OVPN_TCP_DEPTH_NESTING == SINGLE_DEPTH_NESTING
+#error "OVPN TCP requires its own lockdep subclass"
+#endif
+
+static struct proto ovpn_tcp_prot __ro_after_init;
+static struct proto_ops ovpn_tcp_ops __ro_after_init;
+static struct proto ovpn_tcp6_prot __ro_after_init;
+static struct proto_ops ovpn_tcp6_ops __ro_after_init;
+
+static int ovpn_tcp_parse(struct strparser *strp, struct sk_buff *skb)
+{
+	struct strp_msg *rxm = strp_msg(skb);
+	__be16 blen;
+	u16 len;
+	int err;
+
+	/* when packets are written to the TCP stream, they are prepended with
+	 * two bytes indicating the actual packet size.
+	 * Parse accordingly and return the actual size (including the size
+	 * header)
+	 */
+
+	if (skb->len < rxm->offset + 2)
+		return 0;
+
+	err = skb_copy_bits(skb, rxm->offset, &blen, sizeof(blen));
+	if (err < 0)
+		return err;
+
+	len = be16_to_cpu(blen);
+	if (len < 2)
+		return -EINVAL;
+
+	return len + 2;
+}
+
+/* queue skb for sending to userspace via recvmsg on the socket */
+static void ovpn_tcp_to_userspace(struct ovpn_peer *peer, struct sock *sk,
+				  struct sk_buff *skb)
+{
+	skb_set_owner_r(skb, sk);
+	memset(skb->cb, 0, sizeof(skb->cb));
+	skb_queue_tail(&peer->tcp.user_queue, skb);
+	peer->tcp.sk_cb.sk_data_ready(sk);
+}
+
+static void ovpn_tcp_rcv(struct strparser *strp, struct sk_buff *skb)
+{
+	struct ovpn_peer *peer = container_of(strp, struct ovpn_peer, tcp.strp);
+	struct strp_msg *msg = strp_msg(skb);
+	size_t pkt_len = msg->full_len - 2;
+	size_t off = msg->offset + 2;
+	u8 opcode;
+
+	/* ensure skb->data points to the beginning of the openvpn packet */
+	if (!pskb_pull(skb, off)) {
+		net_warn_ratelimited("%s: packet too small for peer %u\n",
+				     netdev_name(peer->ovpn->dev), peer->id);
+		goto err;
+	}
+
+	/* strparser does not trim the skb for us, therefore we do it now */
+	if (pskb_trim(skb, pkt_len) != 0) {
+		net_warn_ratelimited("%s: trimming skb failed for peer %u\n",
+				     netdev_name(peer->ovpn->dev), peer->id);
+		goto err;
+	}
+
+	/* we need the first 4 bytes of data to be accessible
+	 * to extract the opcode and the key ID later on
+	 */
+	if (!pskb_may_pull(skb, OVPN_OPCODE_SIZE)) {
+		net_warn_ratelimited("%s: packet too small to fetch opcode for peer %u\n",
+				     netdev_name(peer->ovpn->dev), peer->id);
+		goto err;
+	}
+
+	/* DATA_V2 packets are handled in kernel, the rest goes to user space */
+	opcode = ovpn_opcode_from_skb(skb, 0);
+	if (unlikely(opcode != OVPN_DATA_V2)) {
+		if (opcode == OVPN_DATA_V1) {
+			net_warn_ratelimited("%s: DATA_V1 detected on the TCP stream\n",
+					     netdev_name(peer->ovpn->dev));
+			goto err;
+		}
+
+		/* The packet size header must be there when sending the packet
+		 * to userspace, therefore we put it back
+		 */
+		skb_push(skb, 2);
+		ovpn_tcp_to_userspace(peer, strp->sk, skb);
+		return;
+	}
+
+	/* hold reference to peer as required by ovpn_recv().
+	 *
+	 * NOTE: in this context we should already be holding a reference to
+	 * this peer, therefore ovpn_peer_hold() is not expected to fail
+	 */
+	if (WARN_ON(!ovpn_peer_hold(peer)))
+		goto err;
+
+	ovpn_recv(peer, skb);
+	return;
+err:
+	dev_dstats_rx_dropped(peer->ovpn->dev);
+	kfree_skb(skb);
+	ovpn_peer_del(peer, OVPN_DEL_PEER_REASON_TRANSPORT_ERROR);
+}
+
+static int ovpn_tcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
+			    int flags, int *addr_len)
+{
+	int err = 0, off, copied = 0, ret;
+	struct ovpn_socket *sock;
+	struct ovpn_peer *peer;
+	struct sk_buff *skb;
+
+	rcu_read_lock();
+	sock = rcu_dereference_sk_user_data(sk);
+	if (unlikely(!sock || !sock->peer || !ovpn_peer_hold(sock->peer))) {
+		rcu_read_unlock();
+		return -EBADF;
+	}
+	peer = sock->peer;
+	rcu_read_unlock();
+
+	skb = __skb_recv_datagram(sk, &peer->tcp.user_queue, flags, &off, &err);
+	if (!skb) {
+		if (err == -EAGAIN && sk->sk_shutdown & RCV_SHUTDOWN) {
+			ret = 0;
+			goto out;
+		}
+		ret = err;
+		goto out;
+	}
+
+	copied = len;
+	if (copied > skb->len)
+		copied = skb->len;
+	else if (copied < skb->len)
+		msg->msg_flags |= MSG_TRUNC;
+
+	err = skb_copy_datagram_msg(skb, 0, msg, copied);
+	if (unlikely(err)) {
+		kfree_skb(skb);
+		ret = err;
+		goto out;
+	}
+
+	if (flags & MSG_TRUNC)
+		copied = skb->len;
+	kfree_skb(skb);
+	ret = copied;
+out:
+	ovpn_peer_put(peer);
+	return ret;
+}
+
+void ovpn_tcp_socket_detach(struct ovpn_socket *ovpn_sock)
+{
+	struct ovpn_peer *peer = ovpn_sock->peer;
+	struct socket *sock = ovpn_sock->sock;
+
+	strp_stop(&peer->tcp.strp);
+	skb_queue_purge(&peer->tcp.user_queue);
+
+	/* restore CBs that were saved in ovpn_sock_set_tcp_cb() */
+	sock->sk->sk_data_ready = peer->tcp.sk_cb.sk_data_ready;
+	sock->sk->sk_write_space = peer->tcp.sk_cb.sk_write_space;
+	sock->sk->sk_prot = peer->tcp.sk_cb.prot;
+	sock->sk->sk_socket->ops = peer->tcp.sk_cb.ops;
+
+	rcu_assign_sk_user_data(sock->sk, NULL);
+}
+
+void ovpn_tcp_socket_wait_finish(struct ovpn_socket *sock)
+{
+	struct ovpn_peer *peer = sock->peer;
+
+	/* NOTE: we don't wait for peer->tcp.defer_del_work to finish:
+	 * either the worker is not running or this function
+	 * was invoked by that worker.
+	 */
+
+	cancel_work_sync(&sock->tcp_tx_work);
+	strp_done(&peer->tcp.strp);
+
+	skb_queue_purge(&peer->tcp.out_queue);
+	kfree_skb(peer->tcp.out_msg.skb);
+	peer->tcp.out_msg.skb = NULL;
+}
+
+static void ovpn_tcp_send_sock(struct ovpn_peer *peer, struct sock *sk)
+{
+	struct sk_buff *skb = peer->tcp.out_msg.skb;
+	int ret, flags;
+
+	if (!skb)
+		return;
+
+	if (peer->tcp.tx_in_progress)
+		return;
+
+	peer->tcp.tx_in_progress = true;
+
+	do {
+		flags = ovpn_skb_cb(skb)->nosignal ? MSG_NOSIGNAL : 0;
+		ret = skb_send_sock_locked_with_flags(sk, skb,
+						      peer->tcp.out_msg.offset,
+						      peer->tcp.out_msg.len,
+						      flags);
+		if (unlikely(ret < 0)) {
+			if (ret == -EAGAIN)
+				goto out;
+
+			net_warn_ratelimited("%s: TCP error to peer %u: %d\n",
+					     netdev_name(peer->ovpn->dev),
+					     peer->id, ret);
+
+			/* in case of TCP error we can't recover the VPN
+			 * stream therefore we abort the connection
+			 */
+			ovpn_peer_hold(peer);
+			schedule_work(&peer->tcp.defer_del_work);
+
+			/* we bail out immediately and keep tx_in_progress set
+			 * to true. This way we prevent more TX attempts
+			 * which would lead to more invocations of
+			 * schedule_work()
+			 */
+			return;
+		}
+
+		peer->tcp.out_msg.len -= ret;
+		peer->tcp.out_msg.offset += ret;
+	} while (peer->tcp.out_msg.len > 0);
+
+	if (!peer->tcp.out_msg.len) {
+		preempt_disable();
+		dev_dstats_tx_add(peer->ovpn->dev, skb->len);
+		preempt_enable();
+	}
+
+	kfree_skb(peer->tcp.out_msg.skb);
+	peer->tcp.out_msg.skb = NULL;
+	peer->tcp.out_msg.len = 0;
+	peer->tcp.out_msg.offset = 0;
+
+out:
+	peer->tcp.tx_in_progress = false;
+}
+
+void ovpn_tcp_tx_work(struct work_struct *work)
+{
+	struct ovpn_socket *sock;
+
+	sock = container_of(work, struct ovpn_socket, tcp_tx_work);
+
+	lock_sock(sock->sock->sk);
+	if (sock->peer)
+		ovpn_tcp_send_sock(sock->peer, sock->sock->sk);
+	release_sock(sock->sock->sk);
+}
+
+static void ovpn_tcp_send_sock_skb(struct ovpn_peer *peer, struct sock *sk,
+				   struct sk_buff *skb)
+{
+	if (peer->tcp.out_msg.skb)
+		ovpn_tcp_send_sock(peer, sk);
+
+	if (peer->tcp.out_msg.skb) {
+		dev_dstats_tx_dropped(peer->ovpn->dev);
+		kfree_skb(skb);
+		return;
+	}
+
+	peer->tcp.out_msg.skb = skb;
+	peer->tcp.out_msg.len = skb->len;
+	peer->tcp.out_msg.offset = 0;
+	ovpn_tcp_send_sock(peer, sk);
+}
+
+void ovpn_tcp_send_skb(struct ovpn_peer *peer, struct socket *sock,
+		       struct sk_buff *skb)
+{
+	u16 len = skb->len;
+
+	*(__be16 *)__skb_push(skb, sizeof(u16)) = htons(len);
+
+	spin_lock_nested(&sock->sk->sk_lock.slock, OVPN_TCP_DEPTH_NESTING);
+	if (sock_owned_by_user(sock->sk)) {
+		if (skb_queue_len(&peer->tcp.out_queue) >=
+		    READ_ONCE(net_hotdata.max_backlog)) {
+			dev_dstats_tx_dropped(peer->ovpn->dev);
+			kfree_skb(skb);
+			goto unlock;
+		}
+		__skb_queue_tail(&peer->tcp.out_queue, skb);
+	} else {
+		ovpn_tcp_send_sock_skb(peer, sock->sk, skb);
+	}
+unlock:
+	spin_unlock(&sock->sk->sk_lock.slock);
+}
+
+static void ovpn_tcp_release(struct sock *sk)
+{
+	struct sk_buff_head queue;
+	struct ovpn_socket *sock;
+	struct ovpn_peer *peer;
+	struct sk_buff *skb;
+
+	rcu_read_lock();
+	sock = rcu_dereference_sk_user_data(sk);
+	if (!sock) {
+		rcu_read_unlock();
+		return;
+	}
+
+	peer = sock->peer;
+
+	/* during initialization this function is called before
+	 * assigning sock->peer
+	 */
+	if (unlikely(!peer || !ovpn_peer_hold(peer))) {
+		rcu_read_unlock();
+		return;
+	}
+	rcu_read_unlock();
+
+	__skb_queue_head_init(&queue);
+	skb_queue_splice_init(&peer->tcp.out_queue, &queue);
+
+	while ((skb = __skb_dequeue(&queue)))
+		ovpn_tcp_send_sock_skb(peer, sk, skb);
+
+	peer->tcp.sk_cb.prot->release_cb(sk);
+	ovpn_peer_put(peer);
+}
+
+static int ovpn_tcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t size)
+{
+	struct ovpn_socket *sock;
+	int ret, linear = PAGE_SIZE;
+	struct ovpn_peer *peer;
+	struct sk_buff *skb;
+
+	lock_sock(sk);
+	rcu_read_lock();
+	sock = rcu_dereference_sk_user_data(sk);
+	if (unlikely(!sock || !sock->peer || !ovpn_peer_hold(sock->peer))) {
+		rcu_read_unlock();
+		release_sock(sk);
+		return -EIO;
+	}
+	rcu_read_unlock();
+	peer = sock->peer;
+
+	if (msg->msg_flags & ~(MSG_DONTWAIT | MSG_NOSIGNAL)) {
+		ret = -EOPNOTSUPP;
+		goto peer_free;
+	}
+
+	if (peer->tcp.out_msg.skb) {
+		ret = -EAGAIN;
+		goto peer_free;
+	}
+
+	if (size < linear)
+		linear = size;
+
+	skb = sock_alloc_send_pskb(sk, linear, size - linear,
+				   msg->msg_flags & MSG_DONTWAIT, &ret, 0);
+	if (!skb) {
+		net_err_ratelimited("%s: skb alloc failed: %d\n",
+				    netdev_name(peer->ovpn->dev), ret);
+		goto peer_free;
+	}
+
+	skb_put(skb, linear);
+	skb->len = size;
+	skb->data_len = size - linear;
+
+	ret = skb_copy_datagram_from_iter(skb, 0, &msg->msg_iter, size);
+	if (ret) {
+		kfree_skb(skb);
+		net_err_ratelimited("%s: skb copy from iter failed: %d\n",
+				    netdev_name(peer->ovpn->dev), ret);
+		goto peer_free;
+	}
+
+	ovpn_skb_cb(skb)->nosignal = msg->msg_flags & MSG_NOSIGNAL;
+	ovpn_tcp_send_sock_skb(peer, sk, skb);
+	ret = size;
+peer_free:
+	release_sock(sk);
+	ovpn_peer_put(peer);
+	return ret;
+}
+
+static int ovpn_tcp_disconnect(struct sock *sk, int flags)
+{
+	return -EBUSY;
+}
+
+static void ovpn_tcp_data_ready(struct sock *sk)
+{
+	struct ovpn_socket *sock;
+
+	trace_sk_data_ready(sk);
+
+	rcu_read_lock();
+	sock = rcu_dereference_sk_user_data(sk);
+	if (likely(sock && sock->peer))
+		strp_data_ready(&sock->peer->tcp.strp);
+	rcu_read_unlock();
+}
+
+static void ovpn_tcp_write_space(struct sock *sk)
+{
+	struct ovpn_socket *sock;
+
+	rcu_read_lock();
+	sock = rcu_dereference_sk_user_data(sk);
+	if (likely(sock && sock->peer)) {
+		schedule_work(&sock->tcp_tx_work);
+		sock->peer->tcp.sk_cb.sk_write_space(sk);
+	}
+	rcu_read_unlock();
+}
+
+static void ovpn_tcp_build_protos(struct proto *new_prot,
+				  struct proto_ops *new_ops,
+				  const struct proto *orig_prot,
+				  const struct proto_ops *orig_ops);
+
+static void ovpn_tcp_peer_del_work(struct work_struct *work)
+{
+	struct ovpn_peer *peer = container_of(work, struct ovpn_peer,
+					      tcp.defer_del_work);
+
+	ovpn_peer_del(peer, OVPN_DEL_PEER_REASON_TRANSPORT_ERROR);
+	ovpn_peer_put(peer);
+}
+
+/* Set TCP encapsulation callbacks */
+int ovpn_tcp_socket_attach(struct ovpn_socket *ovpn_sock,
+			   struct ovpn_peer *peer)
+{
+	struct socket *sock = ovpn_sock->sock;
+	struct strp_callbacks cb = {
+		.rcv_msg = ovpn_tcp_rcv,
+		.parse_msg = ovpn_tcp_parse,
+	};
+	int ret;
+
+	/* make sure no pre-existing encapsulation handler exists */
+	if (sock->sk->sk_user_data)
+		return -EBUSY;
+
+	/* only a fully connected socket is expected. Connection should be
+	 * handled in userspace
+	 */
+	if (sock->sk->sk_state != TCP_ESTABLISHED) {
+		net_err_ratelimited("%s: provided TCP socket is not in ESTABLISHED state: %d\n",
+				    netdev_name(peer->ovpn->dev),
+				    sock->sk->sk_state);
+		return -EINVAL;
+	}
+
+	ret = strp_init(&peer->tcp.strp, sock->sk, &cb);
+	if (ret < 0) {
+		DEBUG_NET_WARN_ON_ONCE(1);
+		return ret;
+	}
+
+	INIT_WORK(&peer->tcp.defer_del_work, ovpn_tcp_peer_del_work);
+
+	__sk_dst_reset(sock->sk);
+	skb_queue_head_init(&peer->tcp.user_queue);
+	skb_queue_head_init(&peer->tcp.out_queue);
+
+	/* save current CBs so that they can be restored upon socket release */
+	peer->tcp.sk_cb.sk_data_ready = sock->sk->sk_data_ready;
+	peer->tcp.sk_cb.sk_write_space = sock->sk->sk_write_space;
+	peer->tcp.sk_cb.prot = sock->sk->sk_prot;
+	peer->tcp.sk_cb.ops = sock->sk->sk_socket->ops;
+
+	/* assign our static CBs and prot/ops */
+	sock->sk->sk_data_ready = ovpn_tcp_data_ready;
+	sock->sk->sk_write_space = ovpn_tcp_write_space;
+
+	if (sock->sk->sk_family == AF_INET) {
+		sock->sk->sk_prot = &ovpn_tcp_prot;
+		sock->sk->sk_socket->ops = &ovpn_tcp_ops;
+	} else {
+		sock->sk->sk_prot = &ovpn_tcp6_prot;
+		sock->sk->sk_socket->ops = &ovpn_tcp6_ops;
+	}
+
+	/* avoid using task_frag */
+	sock->sk->sk_allocation = GFP_ATOMIC;
+	sock->sk->sk_use_task_frag = false;
+
+	/* enqueue the RX worker */
+	strp_check_rcv(&peer->tcp.strp);
+
+	return 0;
+}
+
+static void ovpn_tcp_close(struct sock *sk, long timeout)
+{
+	struct ovpn_socket *sock;
+	struct ovpn_peer *peer;
+
+	rcu_read_lock();
+	sock = rcu_dereference_sk_user_data(sk);
+	if (!sock || !sock->peer || !ovpn_peer_hold(sock->peer)) {
+		rcu_read_unlock();
+		return;
+	}
+	peer = sock->peer;
+	rcu_read_unlock();
+
+	ovpn_peer_del(sock->peer, OVPN_DEL_PEER_REASON_TRANSPORT_DISCONNECT);
+	peer->tcp.sk_cb.prot->close(sk, timeout);
+	ovpn_peer_put(peer);
+}
+
+static __poll_t ovpn_tcp_poll(struct file *file, struct socket *sock,
+			      poll_table *wait)
+{
+	__poll_t mask = datagram_poll(file, sock, wait);
+	struct ovpn_socket *ovpn_sock;
+
+	rcu_read_lock();
+	ovpn_sock = rcu_dereference_sk_user_data(sock->sk);
+	if (ovpn_sock && ovpn_sock->peer &&
+	    !skb_queue_empty(&ovpn_sock->peer->tcp.user_queue))
+		mask |= EPOLLIN | EPOLLRDNORM;
+	rcu_read_unlock();
+
+	return mask;
+}
+
+static void ovpn_tcp_build_protos(struct proto *new_prot,
+				  struct proto_ops *new_ops,
+				  const struct proto *orig_prot,
+				  const struct proto_ops *orig_ops)
+{
+	memcpy(new_prot, orig_prot, sizeof(*new_prot));
+	memcpy(new_ops, orig_ops, sizeof(*new_ops));
+	new_prot->recvmsg = ovpn_tcp_recvmsg;
+	new_prot->sendmsg = ovpn_tcp_sendmsg;
+	new_prot->disconnect = ovpn_tcp_disconnect;
+	new_prot->close = ovpn_tcp_close;
+	new_prot->release_cb = ovpn_tcp_release;
+	new_ops->poll = ovpn_tcp_poll;
+}
+
+/* Initialize TCP static objects */
+void __init ovpn_tcp_init(void)
+{
+	ovpn_tcp_build_protos(&ovpn_tcp_prot, &ovpn_tcp_ops, &tcp_prot,
+			      &inet_stream_ops);
+
+#if IS_ENABLED(CONFIG_IPV6)
+	ovpn_tcp_build_protos(&ovpn_tcp6_prot, &ovpn_tcp6_ops, &tcpv6_prot,
+			      &inet6_stream_ops);
+#endif
+}
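
The framing implemented by ovpn_tcp_parse() and ovpn_tcp_send_skb() above
is the classic OpenVPN TCP encapsulation: every packet on the stream is
prefixed with its size as a 16-bit big-endian value. A minimal user-space
sketch of the sending side (ovpn_tcp_send_framed() is a hypothetical
helper, shown only to illustrate the wire format):

	#include <arpa/inet.h>
	#include <stdint.h>
	#include <string.h>
	#include <sys/socket.h>

	/* Prepend the 16-bit big-endian size header that the kernel side
	 * parses in ovpn_tcp_parse(), then push the framed packet onto
	 * the TCP stream.
	 */
	static ssize_t ovpn_tcp_send_framed(int fd, const void *pkt,
					    uint16_t len)
	{
		uint8_t buf[2 + UINT16_MAX];
		uint16_t blen = htons(len);

		memcpy(buf, &blen, sizeof(blen));
		memcpy(buf + 2, pkt, len);

		/* MSG_NOSIGNAL mirrors the kernel path, which honours a
		 * per-skb nosignal flag when writing to the socket
		 */
		return send(fd, buf, 2 + (size_t)len, MSG_NOSIGNAL);
	}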