kernel/linux.git/net/rds/recv.c, branch v7.1

net/rds: Trigger rds_send_ping() more than once

2026-02-05T04:46:39+00:00

Even though a peer may have already received a non-zero value for "RDS_EXTHDR_NPATHS" from a node in the past, the current peer may not. Therefore it is important to initiate another rds_send_ping() after a re-connect to any peer: It is unknown at that time if we're still talking to the same instance of RDS kernel modules on the other side. Otherwise, the peer may just operate on a single lane ("c_npaths == 0"), not knowing that more lanes are available. However, if "c_with_sport_idx" is supported, we also need to check that the connection we accepted on lane#0 meets the proper source port modulo requirement, as we fan out: Since the exchange of "RDS_EXTHDR_NPATHS" and "RDS_EXTHDR_SPORT_IDX" is asynchronous, initially we have no choice but to accept an incoming connection (via "accept") in the first slot ("cp_index == 0") for backwards compatibility. But that very connection may have come from a different lane with "cp_index != 0", since the peer thought that we already understood and handled "c_with_sport_idx" properly, as indicated by a previous exchange before a module was reloaded. In short: If a module gets reloaded, we recover from that, but do *not* allow a downgrade to support fewer lanes. Downgrades would require us to merge messages from separate lanes, which is rather tricky with the current RDS design. Each lane has its own sequence number space and all messages would need to be re-sequenced as we merge, all while handling "RDS_FLAG_RETRANSMITTED" and "cp_retrans" properly. Signed-off-by: Gerd Rausch Signed-off-by: Allison Henderson Link: https://patch.msgid.link/20260203055723.1085751-9-achender@kernel.org Signed-off-by: Jakub Kicinski

net/rds: Use the first lane until RDS_EXTHDR_NPATHS arrives

2026-02-05T04:46:39+00:00

Instead of just blocking the sender until "c_npaths" is known (it gets updated upon the receipt of a MPRDS PONG message), simply use the first lane (cp_index#0). But just using the first lane isn't enough. As soon as we enqueue messages on a different lane, we'd run the risk of out-of-order delivery of RDS messages. Earlier messages enqueued on "cp_index == 0" could be delivered later than more recent messages enqueued on "cp_index > 0", mostly because of possible head of line blocking issues causing the first lane to be slower. To avoid that, we simply take a snapshot of "cp_next_tx_seq" at the time we're about to fan-out to more lanes. Then we delay the transmission of messages enqueued on other lanes with "cp_index > 0" until cp_index#0 caught up with the delivery of new messages (from "cp_send_queue") as well as in-flight messages (from "cp_retrans") that haven't been acknowledged yet by the receiver. We also add a new counter "mprds_catchup_tx0_retries" to keep track of how many times "rds_send_xmit" had to suspend activities, because it was waiting for the first lane to catch up. Signed-off-by: Gerd Rausch Signed-off-by: Allison Henderson Link: https://patch.msgid.link/20260203055723.1085751-8-achender@kernel.org Signed-off-by: Jakub Kicinski

net/rds: Encode cp_index in TCP source port

2026-02-05T04:46:38+00:00

Upon "sendmsg", RDS/TCP selects a backend connection based on a hash calculated from the source-port ("RDS_MPATH_HASH"). However, "rds_tcp_accept_one" accepts connections in the order they arrive, which is non-deterministic. Therefore the mapping of the sender's "cp->cp_index" to that of the receiver changes if the backend connections are dropped and reconnected. However, connection state that's preserved across reconnects (e.g. "cp_next_rx_seq") relies on that sender<->receiver mapping to never change. So we make sure that client and server of the TCP connection have the exact same "cp->cp_index" across reconnects by encoding "cp->cp_index" in the lower three bits of the client's TCP source port. A new extension "RDS_EXTHDR_SPORT_IDX" is introduced, that allows the server to tell the difference between clients that do the "cp->cp_index" encoding, and legacy clients that pick source ports randomly. Signed-off-by: Gerd Rausch Signed-off-by: Allison Henderson Link: https://patch.msgid.link/20260203055723.1085751-3-achender@kernel.org Signed-off-by: Jakub Kicinski

net/rds: rds_tcp_accept_one ought to not discard messages

2026-01-23T19:51:31+00:00

RDS/TCP differs from RDS/RDMA in that message acknowledgment is done based on TCP sequence numbers: As soon as the last byte of a message has been acknowledged by the TCP stack of a peer, "rds_tcp_write_space()" goes on to discard prior messages from the send queue. Which is fine, for as long as the receiver never throws any messages away. Unfortunately, that is *not* the case since the introduction of MPRDS: commit 1a0e100fb2c96 "RDS: TCP: Enable multipath RDS for TCP" A new function "rds_tcp_accept_one_path" was introduced, which is entitled to return "NULL", if no connection path is currently available. Unfortunately, this happens after the "->accept()" call, and the new socket often already contains messages, since the peer already transitioned to "RDS_CONN_UP" on behalf of "TCP_ESTABLISHED". That's also the case after this [1]: commit 1a0e100fb2c96 "RDS: TCP: Force every connection to be initiated by numerically smaller IP address" which tried to address the situation of pending data by only transitioning connections from a smaller IP address to "RDS_CONN_UP". But even in those cases, and in particular if the "RDS_EXTHDR_NPATHS" handshake has not occurred yet, and therefore we're working with "c_npaths <= 1", "c_conn[0]" may be in a state distinct from "RDS_CONN_DOWN", and therefore all messages on the just accepted socket will be tossed away. This fix changes "rds_tcp_accept_one": * If connected from a peer with a larger IP address, the new socket will continue to get closed right away. With commit [1] above, there should not be any messages in the socket receive buffer, since the peer never transitioned to "RDS_CONN_UP". Therefore it should be okay to not make any efforts to dispatch the socket receive buffer. * If connected from a peer with a smaller IP address, we call "rds_tcp_accept_one_path" to find a free slot/"path". If found, business goes on as usual. If none was found, we save/stash the newly accepted socket into "rds_tcp_accepted_sock", in order to not lose any messages that may have arrived already. We then return from "rds_tcp_accept_one" with "-ENOBUFS". Later on, when a slot/"path" does become available again (e.g. state transitioned to "RDS_CONN_DOWN", or HS extension header was received with "c_npaths > 1") we call "rds_tcp_conn_slots_available" that simply re-issues a "rds_tcp_accept_one_path" worker-callback and picks up the new socket from "rds_tcp_accepted_sock", and thereby continuing where it left with "-ENOBUFS" last time. Since a new slot has become available, those messages won't be lost, since processing proceeds as if that slot had been available the first time around. Signed-off-by: Gerd Rausch Signed-off-by: Jack Vogel Signed-off-by: Allison Henderson Link: https://patch.msgid.link/20260122055213.83608-3-achender@kernel.org Signed-off-by: Jakub Kicinski

rds: Fix endianness annotations for RDS extension headers

2025-08-22T23:44:39+00:00

Per the RDS 3.1 spec [1], RDS extension headers EXTHDR_NPATHS and EXTHDR_GEN_NUM are be16 and be32 values respectively, exchanged during normal operations over-the-wire (RDS Ping/Pong). This contrasts their declarations as host endian unsigned ints. Fix the annotations across occurrences. Flagged by Sparse. [1] https://oss.oracle.com/projects/rds/dist/documentation/rds-3.1-spec.html Signed-off-by: Ujwal Kundur Reviewed-by: Allison Henderson Link: https://patch.msgid.link/20250820175550.498-5-ujwal.kundur@gmail.com Signed-off-by: Jakub Kicinski

net:rds: Fix possible deadlock in rds_message_put

2024-02-13T09:25:30+00:00

Functions rds_still_queued and rds_clear_recv_queue lock a given socket in order to safely iterate over the incoming rds messages. However calling rds_inc_put while under this lock creates a potential deadlock. rds_inc_put may eventually call rds_message_purge, which will lock m_rs_lock. This is the incorrect locking order since m_rs_lock is meant to be locked before the socket. To fix this, we move the message item to a local list or variable that wont need rs_recv_lock protection. Then we can safely call rds_inc_put on any item stored locally after rs_recv_lock is released. Fixes: bdbe6fbc6a2f ("RDS: recv.c") Reported-by: syzbot+f9db6ff27b9bfdcfeca0@syzkaller.appspotmail.com Reported-by: syzbot+dcd73ff9291e6d34b3ab@syzkaller.appspotmail.com Signed-off-by: Allison Henderson Link: https://lore.kernel.org/r/20240209022854.200292-1-allison.henderson@oracle.com Signed-off-by: Paolo Abeni

net: add missing includes of linux/sched/clock.h

2023-01-27T11:19:46+00:00

Number of files depend on linux/sched/clock.h getting included by linux/skbuff.h which soon will no longer be the case. Signed-off-by: Jakub Kicinski Signed-off-by: David S. Miller

net: rds: fix memory leak in rds_recvmsg

2021-06-08T23:32:17+00:00

Syzbot reported memory leak in rds. The problem was in unputted refcount in case of error. int rds_recvmsg(struct socket *sock, struct msghdr *msg, size_t size, int msg_flags) { ... if (!rds_next_incoming(rs, &inc)) { ... } After this "if" inc refcount incremented and if (rds_cmsg_recv(inc, msg, rs)) { ret = -EFAULT; goto out; } ... out: return ret; } in case of rds_cmsg_recv() fail the refcount won't be decremented. And it's easy to see from ftrace log, that rds_inc_addref() don't have rds_inc_put() pair in rds_recvmsg() after rds_cmsg_recv() 1) | rds_recvmsg() { 1) 3.721 us | rds_inc_addref(); 1) 3.853 us | rds_message_inc_copy_to_user(); 1) + 10.395 us | rds_cmsg_recv(); 1) + 34.260 us | } Fixes: bdbe6fbc6a2f ("RDS: recv.c") Reported-and-tested-by: syzbot+5134cdf021c4ed5aaa5f@syzkaller.appspotmail.com Signed-off-by: Pavel Skripkin Reviewed-by: Håkon Bugge Acked-by: Santosh Shilimkar Signed-off-by: David S. Miller

net/rds: Drop duplicate sin and sin6 assignments

2021-03-10T20:45:15+00:00

There is no need to assign the msg->msg_name to sin or sin6, because there is DECLARE_SOCKADDR statement. Signed-off-by: Yejune Deng Signed-off-by: David S. Miller

rds: Prevent kernel-infoleak in rds_notify_queue_get()

2020-07-31T23:52:48+00:00

rds_notify_queue_get() is potentially copying uninitialized kernel stack memory to userspace since the compiler may leave a 4-byte hole at the end of `cmsg`. In 2016 we tried to fix this issue by doing `= { 0 };` on `cmsg`, which unfortunately does not always initialize that 4-byte hole. Fix it by using memset() instead. Cc: stable@vger.kernel.org Fixes: f037590fff30 ("rds: fix a leak of kernel memory") Fixes: bdbe6fbc6a2f ("RDS: recv.c") Suggested-by: Dan Carpenter Signed-off-by: Peilin Ye Acked-by: Santosh Shilimkar Signed-off-by: David S. Miller