|
mctp_dump_addrinfo() is one of the last users of
net->dev_index_head[] in the control path.
Switch to for_each_netdev_dump() for better scalability.
Use C99 designated initializers for mctp_device_rtnl_msg_handlers[] to
prepare for a future removal of RTNL from mctp_dump_addrinfo()
(mdev->addrs is not yet RCU protected).
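For illustration, a hedged sketch of what a C99 designated-initializer
handler table of this kind looks like (field and handler names are my
reading of the rtnetlink and MCTP code, not copied from the patch):

    static const struct rtnl_msg_handler mctp_device_rtnl_msg_handlers[] = {
            /* designated initializers; unnamed fields default to zero */
            {.owner = THIS_MODULE, .protocol = PF_MCTP, .msgtype = RTM_NEWADDR,
             .doit = mctp_rtm_newaddr},
            {.owner = THIS_MODULE, .protocol = PF_MCTP, .msgtype = RTM_DELADDR,
             .doit = mctp_rtm_deladdr},
            {.owner = THIS_MODULE, .protocol = PF_MCTP, .msgtype = RTM_GETADDR,
             .dumpit = mctp_dump_addrinfo},
    };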
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Matt Johnston <matt@codeconstruct.com.au>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Acked-by: Jeremy Kerr <jk@codeconstruct.com.au>
Link: https://patch.msgid.link/20241206223811.1343076-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
The comparison function cmp_profile_data() violates the C standard's
requirements for qsort() comparison functions, which mandate symmetry
and transitivity:
* Symmetry: If x < y, then y > x.
* Transitivity: If x < y and y < z, then x < z.
When v1 and v2 are equal, the function incorrectly returns 1, breaking
symmetry and transitivity. This causes undefined behavior, which can
lead to memory corruption in certain versions of glibc [1].
Fix the issue by returning 0 when v1 and v2 are equal, ensuring
compliance with the C standard and preventing undefined behavior.
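A minimal sketch of a compliant descending-order comparator (not the
actual perf code; the struct and field names are illustrative):

    #include <stdlib.h>

    struct sample { double v; };

    static int cmp_sample(const void *a, const void *b)
    {
            double v1 = ((const struct sample *)a)->v;
            double v2 = ((const struct sample *)b)->v;

            if (v1 > v2)
                    return -1;
            if (v1 < v2)
                    return 1;
            return 0;       /* returning 1 here would break symmetry for equal keys */
    }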
Link: https://www.qualys.com/2024/01/30/qsort.txt [1]
Fixes: 0f223813edd0 ("perf ftrace: Add 'profile' command")
Fixes: 74ae366c37b7 ("perf ftrace profile: Add -s/--sort option")
Cc: stable@vger.kernel.org
Signed-off-by: Kuan-Wei Chiu <visitorckw@gmail.com>
Reviewed-by: Namhyung Kim <namhyung@kernel.org>
Reviewed-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: jserv@ccns.ncku.edu.tw
Cc: chuang@cs.nycu.edu.tw
Link: https://lore.kernel.org/r/20241209134226.1939163-1-visitorckw@gmail.com
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
|
|
David Howells says:
====================
rxrpc: Implement jumbo DATA transmission and RACK-TLP
Here's a series of patches to implement two main features:
(1) The transmission of jumbo data packets whereby several DATA packets of
a particular size can be glued together into a single UDP packet,
allowing us to make use of larger MTU sizes. The basic jumbo
subpacket capacity is 1412 bytes (RXRPC_JUMBO_DATALEN) and, say, an
MTU of 8192 allows five of them to be transmitted as one.
An alternative (and possibly more efficient) way would be to
expand/shrink the capacity of each DATA packet to match the MTU and
thus save on header and tail-gap overhead, but the Rx protocol does
not provide a mechanism for splitting the data - especially as the
transported data is encrypted per-packet - and so UDP fragmentation
would be the only way to handle this.
In fact, in the future, AF_RXRPC also needs to look at shrinking the
packet size where the MTU is smaller - for instance when carried by IPv6
over wifi, where there isn't capacity for a 1412-byte payload.
(2) RACK-TLP to manage packet loss and retransmission in conjunction with
the congestion control algorithm.
These allow for better data throughput and work towards being able to have
larger transmission windows.
To this end, the following changes are also made:
(1) Use a single large array of kvec structs for the I/O thread rather
than having one per transmission buffer. We need a much bigger
collection of kvecs for ping padding.
(2) Implement path-MTU probing by sending padded PING ACK packets and
monitoring for PING RESPONSE ACKs. The pmtud value determined is used
to configure the construction of jumbo DATA packets.
(3) The transmission queue is changed from a linked list of transmission
buffer structs to a linked list of transmission-queue structs, each of
which points to either 32 or 64 transmission buffers (depending on cpu
word size) and various bits of metadata are concentrated in the queue
structs rather than the buffers to make better use of the cpu cache.
(4) SACK data is stored in the transmission-queue structures in batches of
32 or 64 making it faster to process rather than being spread amongst
all the individual packet buffers.
(5) Don't change the DF flag on the UDP socket unless we need to - and
basically only enable it for path-MTU probing.
There are also some additional bits:
(1) Fix the handling of connection aborts to poke the aborted connections.
(2) Don't set the MORE-PACKETS Rx header flag on the wire. No one
actually checks it and it is, in any case, generated inconsistently
between implementations.
(3) Request an ACK when, during call transmission, there's a stall in the
app generating the data to be transmitted.
(4) Fix attention starvation in the I/O thread by making sure we go
through all outstanding events rather than returning to the beginning
of the check cycle every time we process an event.
(5) Don't use the skbuff timestamp in the calculation of timeouts and RTT
as we really should include local processing time in that too.
Further, getting receive skbuff timestamps may be expensive.
(6) Make RTT tracking per call with the saving of the value between calls,
even within the same connection channel. The initial call timeout
starts off large to allow the server time to set up its state before
the initial reply.
(7) Don't allocate txbuf structs for ACK packets, but rather use page
frags and MSG_SPLICE_PAGES.
(8) Use irq-disabling locks for interactions between app threads and I/O
threads so that the I/O thread doesn't get held up.
(9) Make rxrpc set the REQUEST-ACK flag on an outgoing packet when cwnd is
at RXRPC_MIN_CWND (currently 4), not at 2 which it can never reach.
(10) Add some tracing bits and pieces (including displaying the userStatus
field in an ACK header) and some more stats counters (including
different sizes of jumbo packets sent/received).
Link: https://lore.kernel.org/r/20240306000655.1100294-1-dhowells@redhat.com/ [1]
====================
Link: https://patch.msgid.link/20241204074710.990092-1-dhowells@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
When an rxrpc call is in its transmission phase and is sending a lot of
packets, stalls occasionally occur that cause severe performance
degradation (eg. increasing the transmission time for a 256MiB payload from
0.7s to 2.5s over a 10G link).
rxrpc already implements TCP-style congestion control [RFC5681] and this
helps mitigate the effects, but occasionally we're missing a time event
that deals with a missing ACK, leading to a stall until the RTO expires.
Fix this by implementing RACK/TLP in rxrpc.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
rxrpc_prepare_data_subpacket() sets the REQUEST-ACK flag on the outgoing
DATA packet under a number of circumstances, including, theoretically, when
the cwnd is at minimum (or less). However, the minimum in this function is
hard-coded as 2, whereas the actual minimum is RXRPC_MIN_CWND (which is
currently 4), and so this never occurs.
Without this, we will miss the request of some ACKs, potentially leading to
a transmission stall until a timeout occurs on one side or the other that
leads to an ACK being generated.
Fix the function to use RXRPC_MIN_CWND rather than a hard-coded number.
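Conceptually, the change is just the threshold in the comparison; a hedged
sketch, not the exact rxrpc hunk ('flags' stands for the outgoing header
flags being assembled):

    /* before: never true, since cwnd never drops below RXRPC_MIN_CWND (4) */
    if (call->cong_cwnd <= 2)
            flags |= RXRPC_REQUEST_ACK;

    /* after */
    if (call->cong_cwnd <= RXRPC_MIN_CWND)
            flags |= RXRPC_REQUEST_ACK;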
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Manage the determination of RTT on a per-call (ie. per-RPC op) basis rather
than on a per-peer basis, averaging across all calls going to that peer.
The problem is that the RTT measurements from the initial packets on a call
may be off because the server may do some setting up (such as getting a
lock on a file) before accepting the rest of the data in the RPC and,
further, the RTT may be affected by server-side file operations, for
instance if a large amount of data is being written or read.
Note: When handling the FS.StoreData-type RPCs, for example, the server
uses the userStatus field in the header of ACK packets as supplementary
flow control to aid in managing this. AF_RXRPC does not yet support this,
but it should be added.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Record the reason for the transmission of an ACK in the rxrpc_tx_ack
tracepoint, and not just in the rxrpc_propose_ack tracepoint.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Add an indicator to the rxrpc_tx_data tracepoint to indicate what triggered
the transmission of a particular packet. At this point, it's only normal
transmission and retransmission, plus the tracepoint is also used to record
loss injection, but in a future patch, TLP-induced (re-)transmission will
also be a thing.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Tidy up the ACK parsing in the following ways:
(1) Put the serial number of the ACK packet into the rxrpc_ack_summary
struct and access it from there whilst parsing an ACK.
(2) Be consistent about using "if (summary.acked_serial)" rather than "if
(summary.acked_serial != 0)".
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Where a spinlock is used by both the application thread and the I/O thread,
use irq-disabling locking so that an interrupt taken on the app thread
doesn't also slow down the I/O thread.
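For reference, the usual irq-disabling pattern this refers to (generic
kernel idiom; the lock name is illustrative, not a specific rxrpc hunk):

    unsigned long flags;

    spin_lock_irqsave(&call->tx_lock, flags);   /* app thread: IRQs off while held */
    /* ... touch state shared with the I/O thread ... */
    spin_unlock_irqrestore(&call->tx_lock, flags);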
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Don't allocate an rxrpc_txbuf struct for an ACK transmission. There's now
no need as the memory to hold the ACK content is allocated with a page frag
allocator. The allocation and freeing of a txbuf is just unnecessary
overhead.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Send jumbo DATA packets if the path-MTU probing using padded PING ACK
packets shows up sufficient capacity to do so. This allows larger chunks
of data to be sent without reducing the retryability as the subpackets in a
jumbo packet can also be retransmitted individually.
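As a rough worked example of the capacity this buys (the 1412-byte
subpacket figure is from the cover letter; the 28-byte rxrpc wire header,
4-byte jumbo glue header and 28-byte IPv4/UDP overhead are my assumptions):

    #include <stdio.h>

    int main(void)
    {
            const int mtu = 8192, ip_udp = 28, rxrpc_hdr = 28, glue = 4, data = 1412;
            int n = 1;

            /* add subpackets while the whole jumbo still fits in the MTU */
            while (ip_udp + rxrpc_hdr + (n + 1) * data + n * glue <= mtu)
                    n++;
            printf("%d subpackets per jumbo\n", n); /* 5 with these assumptions */
            return 0;
    }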
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
The constant for the initial resend timeout is in milliseconds, but the
variable it's assigned to is in microseconds. Fix the constant to be in
microseconds.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Make the following changes to the calculation and use of RTO:
(1) Fix rxrpc_resend() to use the backed-off RTO value obtained by calling
rxrpc_get_rto_backoff() rather than extracting the value itself.
Without this, it may retransmit packets too early.
(2) The RTO value being similar to the RTT causes a lot of extraneous
resends because the RTT doesn't end up taking account of clearing out
of the receive queue on the server. Worse, responses to PING-ACKs are
made as fast as possible and so are less than the DATA-requested-ACK
RTT and so skew the RTT down.
Fix this by putting a lower bound on the RTO by adding 100ms to it and
limiting the lower end to 200ms.
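A hedged sketch of the clamping described in (2), working in microseconds
(variable names assumed):

    /* add 100ms of slack and never let the RTO drop below 200ms */
    rto_us = umax(rto_us + 100 * USEC_PER_MSEC, 200 * USEC_PER_MSEC);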
Fixes: c410bf01933e ("rxrpc: Fix the excessive initial retransmission timeout")
Fixes: 37473e416234 ("rxrpc: Clean up the resend algorithm")
Signed-off-by: David Howells <dhowells@redhat.com>
Suggested-by: Simon Wilkinson <sxw@auristor.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Display the userStatus field from the Rx packet header in the rxrpc_rx_ack
trace line. This is used for flow control purposes by FS.StoreData-type
kafs RPC calls.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Adjust the rxrpc_rtt_rx tracepoint in the following ways:
(1) Display the collected RTT sample in the rxrpc_rtt_rx trace.
(2) Move the division of srtt by 8 to the TP_printk() rather than doing it
before invoking the trace point.
(3) Display the min_rtt value.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Generate rtt_min as this is required by RACK-TLP.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Link: https://patch.msgid.link/20241204074710.990092-27-dhowells@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Don't use received skbuff timestamps, but rather set a timestamp when an
ack is processed so that the time taken to get to rxrpc_input_ack() is
included in the RTT.
The timestamp of the latest ACK received is tracked in
call->acks_latest_ts.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Link: https://patch.msgid.link/20241204074710.990092-26-dhowells@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Store the serial number set on a DATA packet at the point of transmission
in the rxrpc_txqueue struct and when an ACK is received, match the
reference number in the ACK by trawling the txqueue rather than sharing an
RTT table with ACK RTT. This can be done as part of Tx queue rotation.
This means we have a lot more RTT samples available and is faster to search
with all the serial numbers packed together into a few cachelines rather
than being hung off different txbufs.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Link: https://patch.msgid.link/20241204074710.990092-25-dhowells@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
With the change in the structure of the transmission buffer to store
buffers in bunches of 32 or 64 (BITS_PER_LONG) we can place sets of
per-buffer flags into the rxrpc_tx_queue struct rather than storing them in
rxrpc_tx_buf, thereby vastly increasing efficiency when assessing the SACK
table in an ACK packet.
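A hedged illustration of the layout (field names assumed): with one bit
per buffer in each queue struct, a whole 32/64-packet stretch of the SACK
table can be assessed with a handful of word operations:

    struct rxrpc_txqueue_sketch {
            unsigned long   segment_acked;  /* one bit per txbuf in this node */
            unsigned long   segment_lost;
            /* ... */
    };

    /* mark subpacket ix in this queue node as soft-ACK'd */
    __set_bit(ix, &tq->segment_acked);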
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Link: https://patch.msgid.link/20241204074710.990092-24-dhowells@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Adjust some of the names of fields and constants to make them look a bit
more like the TCP congestion symbol names, such as flight_size -> in_flight
and congest_mode -> ca_state.
Move the persistent congestion-related fields from the rxrpc_ack_summary
struct into the rxrpc_call struct rather than copying them out and back in
again. The rxrpc_congest tracepoint can fetch them from the call struct.
Rename the counters for soft acks and nacks to have an 's' on the front to
reflect the softness, e.g. nr_acks -> nr_sacks.
Make fields counting numbers of packets or numbers of acks u16 rather than
u8 to allow for windows of up to 8192 DATA packets in flight in future.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Link: https://patch.msgid.link/20241204074710.990092-23-dhowells@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
In /proc/net/rxrpc/stats, display statistics about the numbers of different
sizes of jumbo packets transmitted and received, showing counts for 1
subpacket (ie. a non-jumbo packet), 2 subpackets, 3, ... to 8 and then 9+.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Link: https://patch.msgid.link/20241204074710.990092-22-dhowells@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Replace the call->acks_first_seq variable (which holds ack.firstPacket from
the latest ACK packet and indicates the sequence number of the first ack
slot in the SACK table) with call->acks_hard_ack which will hold the
highest sequence hard ACK'd. This is 1 less than call->acks_first_seq, but
it fits in the same schema as the other tracking variables which hold the
sequence of a packet, not one past it.
This will fix the rxrpc_congest tracepoint's calculation of SACK window
size which shows one fewer than it should - and will occasionally go to -1.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Link: https://patch.msgid.link/20241204074710.990092-21-dhowells@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Now that packets are removed from the Tx queue in the rotation function
rather than being cleaned up later, call->acks_hard_ack now advances in
step with call->tx_bottom, so remove it.
Some of the places call->acks_hard_ack is used in the rxrpc tracepoints are
replaced by call->acks_first_seq instead as that's the peer's reported idea
of the hard-ACK point.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Link: https://patch.msgid.link/20241204074710.990092-20-dhowells@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
We need to scan the buffers in the transmission queue occasionally when
processing ACKs, but the transmission queue is currently a linked list of
transmission buffers which, when we eventually expand the Tx window to 8192
packets, will be very slow to walk.
Instead, pull the fields we need to examine a lot (last sent time,
retransmitted flag) into a new struct rxrpc_txqueue and make each one hold
an array of 32 or 64 packets.
The transmission queue is then a list of these structs, each pointing to a
contiguous set of packets. Scanning is then a lot faster as the flags and
timestamps are concentrated in the CPU dcache.
The transmission timestamps are stored as a number of microseconds from a
base ktime to reduce memory requirements. This should be fine provided we
manage to transmit an entire buffer within an hour.
This will make implementing RACK-TLP [RFC8985] easier as it will be less
costly to scan the transmission buffers.
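A rough sketch of the shape of such a queue node (illustrative only; the
real struct rxrpc_txqueue has more fields and different names):

    #define TXQ_NR  BITS_PER_LONG                   /* 32 or 64 buffers per node */

    struct txqueue_sketch {
            struct txqueue_sketch   *next;          /* singly linked list of nodes */
            ktime_t                 xmit_ts_base;   /* base time for this node */
            unsigned long           segment_retransmitted;  /* one bit per buffer */
            unsigned int            segment_xmit_ts[TXQ_NR];/* microseconds from base */
            struct rxrpc_txbuf      *bufs[TXQ_NR];
    };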
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Link: https://patch.msgid.link/20241204074710.990092-19-dhowells@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
We don't need a barrier for the ->tx_bottom value (which indicates the
lowest sequence still in the transmission queue) and the ->acks_hard_ack
value (which tracks the DATA packets hard-ack'd by the latest ACK packet
received and thus indicates which DATA packets can now be discarded) as the
app thread doesn't use either value as a reference into memory.
Rather, the app thread merely uses these as a guide to how much space is
available in the transmission queue.
Change the code to use READ/WRITE_ONCE() instead.
Also, change rxrpc_check_tx_space() to use the same value for tx_bottom
throughout.
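A hedged sketch of the resulting pattern ("tx_top" and "tx_winsize" stand
in for the real fields; this is not the verbatim rxrpc_check_tx_space()):

    static bool check_tx_space_sketch(struct rxrpc_call *call)
    {
            rxrpc_seq_t tx_bottom = READ_ONCE(call->tx_bottom); /* read once, use throughout */

            return call->tx_top - tx_bottom < call->tx_winsize;
    }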
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Link: https://patch.msgid.link/20241204074710.990092-18-dhowells@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Move to setting the timestamp on DATA packets before transmitting them as
part of the preparation.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Link: https://patch.msgid.link/20241204074710.990092-17-dhowells@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Change how the DF flag is managed on DATA transmissions. Set it on initial
transmission and don't set it on retransmissions. Then remove the handling
for EMSGSIZE in rxrpc_send_data_packet() and just pretend it didn't happen,
leaving it to the retransmission path to retry.
The path-MTU discovery using PING ACKs is then used to probe for the
maximum DATA size - though notification by ICMP will be used if one is
received.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Link: https://patch.msgid.link/20241204074710.990092-16-dhowells@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Fix the code that injects packet loss for testing to make sure
call->tx_transmitted is updated.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Link: https://patch.msgid.link/20241204074710.990092-15-dhowells@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Starvation can happen in the rxrpc I/O thread because it goes back to the
top of the I/O loop after it does any one thing without trying to give any
other connection or call CPU time. Also, because it processes one call
packet at a time, it tries to do the retransmission loop after each ACK
without checking to see if there are other ACKs already in the queue that
can update the SACK state.
Fix this by:
(1) Add a received-packet queue on each call.
(2) Distribute packets from the master Rx queue to the individual call,
conn and error queues and 'poking' calls to add them to the attend
queue first thing in the I/O thread.
(3) Go through all the attention-seeking connections and calls before
going back to the top of the I/O thread. Each queue is extracted as a
whole and then gone through so that new additions insert themselves
into the queue.
(4) Make the call event handler go through all the packets currently on
the call's rx_queue before transmitting and retransmitting DATA
packets.
(5) Drop the skb argument from the call event handler as this is now
replaced with the rx_queue. Instead, keep track of whether we
received a packet or an ACK for the tests that used to rely on that.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Link: https://patch.msgid.link/20241204074710.990092-14-dhowells@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Add a tracepoint to be called right before packets are transmitted for the
first time that shows variable values that are pertinent to how many
subpackets will be added to a jumbo DATA packet.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Link: https://patch.msgid.link/20241204074710.990092-13-dhowells@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Prepare to be able to send jumbo DATA packets if we decide to, but
don't enable that yet. This will allow larger chunks of data to be sent
without reducing the retryability as the subpackets in a jumbo packet can
also be retransmitted individually.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Link: https://patch.msgid.link/20241204074710.990092-12-dhowells@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Separate the packet length from the data length (txb->len) stored in the
rxrpc_txbuf to make security calculations easier. Also store the
allocation size as that's an upper bound on the size of the security
wrapper and change a number of fields to unsigned short as the amount of
data can't exceed the capacity of a UDP packet.
Also, whilst we're at it, use kzalloc() for txbufs.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Link: https://patch.msgid.link/20241204074710.990092-11-dhowells@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Implement path-MTU probing (along the lines of RFC8899) by padding some of
the PING ACKs we send. PING ACKs get their own individual responses quite
apart from the acking of data (though, as ACKs, they fulfil that role
also).
The probing concentrates on packet sizes that correspond to how many
subpackets can be stuffed inside a jumbo packet, as jumbo DATA packets are
just aggregations of individual DATA packets and can be split easily for
retransmission purposes.
If we want to perform probing, we advertise this by setting the maximum
number of jumbo subpackets to 0 in the ack trailer when we send an ACK and
see if the peer is also advertising the service. This is interpreted by
non-supporting Rx stacks as an indication that jumbo packets aren't
supported.
The MTU sizes advertised in the ACK trailer AF_RXRPC transmits are pegged
at a maximum of 1444 unless pmtud is supported by both sides.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Link: https://patch.msgid.link/20241204074710.990092-10-dhowells@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Use a single large kvec[] in the rxrpc_local struct rather than one in
every rxrpc_txbuf struct to build large packets to save on memory.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Link: https://patch.msgid.link/20241204074710.990092-9-dhowells@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Set the REQUEST-ACK flag on the DATA packet we're about to send if we're
about to stall transmission because the app layer isn't keeping up
supplying us with data to transmit.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Link: https://patch.msgid.link/20241204074710.990092-8-dhowells@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
In /proc/net/rxrpc/stats, show the stats counter for received ACKs that
have the reason code set to 0 as some implementations do this.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Link: https://patch.msgid.link/20241204074710.990092-7-dhowells@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
The MORE-PACKETS rxrpc header flag hasn't actually been looked at by
anything since 1988 and not all implementations generate it.
Change rxrpc so that it doesn't set MORE-PACKETS at all rather than setting
it inconsistently.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Link: https://patch.msgid.link/20241204074710.990092-6-dhowells@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Clean up the generation of the header flags when building packet headers
for transmission:
(1) Assemble the flags in a local variable rather than in the txb->flags.
(2) Do the flags masking and JUMBO-PACKET setting in one bit of code for
both the main header and the jumbo headers.
(3) Generate the REQUEST-ACK flag afresh each time. There's a possibility
we might want to do jumbo retransmission packets in future.
(4) Pass the local flags variable to the rxrpc_tx_data tracepoint rather
than the combination of the txb flags and the wire header flags (the
latter belong only to the first subpacket).
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Link: https://patch.msgid.link/20241204074710.990092-5-dhowells@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Use umin() and umax() rather than min_t()/max_t() where the type specified
is an unsigned type.
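For illustration, the kind of substitution involved (variable and constant
names hypothetical):

    /* before */
    wsize = min_t(unsigned int, wsize, max_window);
    /* after */
    wsize = umin(wsize, max_window);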
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Link: https://patch.msgid.link/20241204074710.990092-4-dhowells@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Fix the handling of a connection abort that we've received. Though the
abort is at the connection level, it needs propagating to the calls on that
connection. Whilst the propagation part is performed, the calls aren't then
woken up to go and process their termination, and as no further input is
forthcoming, they just hang.
Also add some tracing for the logging of connection aborts.
Fixes: 248f219cb8bc ("rxrpc: Rewrite the data and ack handling code")
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Link: https://patch.msgid.link/20241204074710.990092-3-dhowells@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Add a us_to_ktime() helper to go with ms_to_ktime() and ns_to_ktime().
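A plausible definition, mirroring the existing ms_to_ktime() and
ns_to_ktime() helpers (the actual patch may differ slightly):

    static inline ktime_t us_to_ktime(u64 us)
    {
            return us * NSEC_PER_USEC;
    }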
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Thomas Gleixner <tglx@linutronix.de>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Link: https://patch.msgid.link/20241204074710.990092-2-dhowells@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
utf8s_to_utf16s() specifies pwcs as a wchar_t pointer (whether big endian
or little endian is passed in as an additional parm), so to remove a
distracting compile warning it needs to be cast as (wchar_t *) in
parse_reparse_wsl_symlink() as done by other callers.
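A hedged sketch of the kind of call site involved (variable names are
illustrative, not the actual fs/smb/client code):

    /* pwcs is declared as wchar_t *, so a __le16 buffer needs the cast */
    len = utf8s_to_utf16s(symname, symname_len, UTF16_LITTLE_ENDIAN,
                          (wchar_t *)target_utf16, target_max);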
Fixes: 06a7adf318a3 ("cifs: Add support for parsing WSL-style symlinks")
Reviewed-by: Pali Rohár <pali@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
In acpi_decode_space(), addr->info.mem.caching is checked at the top level
for any resource type, but addr->info.mem is part of a union and thus
valid only if the resource type is a memory range.
Move the check inside the preceding switch/case so that it is only executed
when the union is of the correct type.
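A hedged sketch of the resulting shape (based on the standard ACPI resource
types; not the verbatim hunk):

    switch (addr->resource_type) {
    case ACPI_MEMORY_RANGE:
            /* info.mem is only valid for memory ranges */
            if (addr->info.mem.caching == ACPI_PREFETCHABLE_MEMORY)
                    flags |= IORESOURCE_PREFETCH;
            break;
    case ACPI_IO_RANGE:
    case ACPI_BUS_NUMBER_RANGE:
            break;
    }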
Fixes: fcb29bbcd540 ("ACPI: Add prefetch decoding to the address space parser")
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Link: https://patch.msgid.link/20241202100614.20731-1-ilpo.jarvinen@linux.intel.com
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
hv_kvp_daemon uses popen(3) and system(3) as convenience helpers to
launch external commands. These helpers are invoked via a temporary
shell process. There is no need to keep this temporary process around
while the helper runs. Replace this temporary shell with the actual
helper process via 'exec'.
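A hedged example of the pattern (the helper path is assumed, not taken
verbatim from the daemon):

    /* "exec" makes the helper replace the transient shell rather than
     * run as its child */
    FILE *file = popen("exec /usr/libexec/hypervkvpd/hv_get_dns_info", "r");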
Signed-off-by: Olaf Hering <olaf@aepfle.de>
Link: https://lore.kernel.org/linux-hyperv/20241202123520.27812-1-olaf@aepfle.de/
Signed-off-by: Wei Liu <wei.liu@kernel.org>
|
|
Remove generated files from 'git status' output after 'make -C tools/hv'.
Signed-off-by: Olaf Hering <olaf@aepfle.de>
Link: https://lore.kernel.org/r/20241202124107.28650-1-olaf@aepfle.de
Signed-off-by: Wei Liu <wei.liu@kernel.org>
Message-ID: <20241202124107.28650-1-olaf@aepfle.de>
|
|
Remove the usage of cat and replace the shell process with awk via 'exec'.
Also use a generic shell because no bash-specific features are used.
Signed-off-by: Olaf Hering <olaf@aepfle.de>
Acked-by: Wei Liu <wei.liu@kernel.org>
Link: https://lore.kernel.org/r/20241202120432.21115-1-olaf@aepfle.de
Signed-off-by: Wei Liu <wei.liu@kernel.org>
Message-ID: <20241202120432.21115-1-olaf@aepfle.de>
|
|
The reference implementation of hv_get_dns_info in the tree uses
/etc/resolv.conf to get DNS servers and so does not need to know which
NIC is being queried. Distro-specific implementations, however, may want
to provide fine-grained, per-NIC information; e.g. NetworkManager keeps
track of DNS servers per connection.
Similar to hv_get_dhcp_info, pass NIC name as a parameter to
hv_get_dns_info script.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Link: https://lore.kernel.org/r/20241112150401.217094-1-vkuznets@redhat.com
Signed-off-by: Wei Liu <wei.liu@kernel.org>
Message-ID: <20241112150401.217094-1-vkuznets@redhat.com>
|
|
If the KVP (or VSS) daemon starts before the VMBus channel's ringbuffer is
fully initialized, we can hit the panic below:
hv_utils: Registering HyperV Utility Driver
hv_vmbus: registering driver hv_utils
...
BUG: kernel NULL pointer dereference, address: 0000000000000000
CPU: 44 UID: 0 PID: 2552 Comm: hv_kvp_daemon Tainted: G E 6.11.0-rc3+ #1
RIP: 0010:hv_pkt_iter_first+0x12/0xd0
Call Trace:
...
vmbus_recvpacket
hv_kvp_onchannelcallback
vmbus_on_event
tasklet_action_common
tasklet_action
handle_softirqs
irq_exit_rcu
sysvec_hyperv_stimer0
</IRQ>
<TASK>
asm_sysvec_hyperv_stimer0
...
kvp_register_done
hvt_op_read
vfs_read
ksys_read
__x64_sys_read
This can happen because the KVP/VSS channel callback can be invoked
even before the channel is fully opened:
1) as soon as hv_kvp_init() -> hvutil_transport_init() creates
/dev/vmbus/hv_kvp, the kvp daemon can open the device file immediately and
register itself to the driver by writing a message KVP_OP_REGISTER1 to the
file (which is handled by kvp_on_msg() -> kvp_handle_handshake()) and
reading the file for the driver's response, which is handled by
hvt_op_read(), which calls hvt->on_read(), i.e. kvp_register_done().
2) the problem with kvp_register_done() is that it can cause the
channel callback to be called even before the channel is fully opened,
and when the channel callback is starting to run, util_probe()->
vmbus_open() may have not initialized the ringbuffer yet, so the
callback can hit the panic of NULL pointer dereference.
To reproduce the panic consistently, we can add a "ssleep(10)" for KVP in
__vmbus_open(), just before the first hv_ringbuffer_init(), and then we
unload and reload the driver hv_utils, and run the daemon manually within
the 10 seconds.
Fix the panic by reordering the steps in util_probe() so the char dev
entry used by the KVP or VSS daemon is not created until after
vmbus_open() has completed. This reordering prevents the race condition
from happening.
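A hedged sketch of the reordered probe sequence (argument values and error
labels illustrative):

    /* open the channel, and hence its ring buffers, first ... */
    ret = vmbus_open(dev->channel, 4 * HV_HYP_PAGE_SIZE, 4 * HV_HYP_PAGE_SIZE,
                     NULL, 0, srv->util_cb, dev->channel);
    if (ret)
            goto error;

    /* ... and only then create /dev/vmbus/hv_kvp for the daemon to open */
    if (srv->util_init) {
            ret = srv->util_init(srv);
            if (ret)
                    goto error;
    }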
Reported-by: Dexuan Cui <decui@microsoft.com>
Fixes: e0fa3e5e7df6 ("Drivers: hv: utils: fix a race on userspace daemons registration")
Cc: stable@vger.kernel.org
Signed-off-by: Michael Kelley <mhklinux@outlook.com>
Acked-by: Wei Liu <wei.liu@kernel.org>
Link: https://lore.kernel.org/r/20241106154247.2271-3-mhklinux@outlook.com
Signed-off-by: Wei Liu <wei.liu@kernel.org>
Message-ID: <20241106154247.2271-3-mhklinux@outlook.com>
|
|
If the util_init function call in util_probe() returns an error code,
util_probe() always returns -ENODEV, and the error code from the util_init
function is lost. The error message output in the caller, vmbus_probe(),
doesn't show the real error code.
Fix this by just returning the error code from the util_init function.
There doesn't seem to be a reason to force ENODEV, as other errors
such as ENOMEM can already be returned from util_probe(). And the
code in call_driver_probe() implies that ENODEV should mean that a
matching driver wasn't found, which is not the case here.
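A hedged sketch of the change (not the verbatim hunk):

    if (srv->util_init) {
            ret = srv->util_init(srv);
            if (ret)
                    goto error;     /* was: { ret = -ENODEV; goto error; } */
    }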
Suggested-by: Dexuan Cui <decui@microsoft.com>
Signed-off-by: Michael Kelley <mhklinux@outlook.com>
Acked-by: Wei Liu <wei.liu@kernel.org>
Link: https://lore.kernel.org/r/20241106154247.2271-2-mhklinux@outlook.com
Signed-off-by: Wei Liu <wei.liu@kernel.org>
Message-ID: <20241106154247.2271-2-mhklinux@outlook.com>
|