diff options
Diffstat (limited to 'Documentation/networking/timestamping.txt')
-rw-r--r-- | Documentation/networking/timestamping.txt | 368 |
1 files changed, 286 insertions, 82 deletions
diff --git a/Documentation/networking/timestamping.txt b/Documentation/networking/timestamping.txt index 897f942b976b..412f45ca2d73 100644 --- a/Documentation/networking/timestamping.txt +++ b/Documentation/networking/timestamping.txt @@ -1,102 +1,307 @@ -The existing interfaces for getting network packages time stamped are: + +1. Control Interfaces + +The interfaces for receiving network packages timestamps are: * SO_TIMESTAMP - Generate time stamp for each incoming packet using the (not necessarily - monotonous!) system time. Result is returned via recv_msg() in a - control message as timeval (usec resolution). + Generates a timestamp for each incoming packet in (not necessarily + monotonic) system time. Reports the timestamp via recvmsg() in a + control message as struct timeval (usec resolution). * SO_TIMESTAMPNS - Same time stamping mechanism as SO_TIMESTAMP, but returns result as - timespec (nsec resolution). + Same timestamping mechanism as SO_TIMESTAMP, but reports the + timestamp as struct timespec (nsec resolution). * IP_MULTICAST_LOOP + SO_TIMESTAMP[NS] - Only for multicasts: approximate send time stamp by receiving the looped - packet and using its receive time stamp. + Only for multicast:approximate transmit timestamp obtained by + reading the looped packet receive timestamp. -The following interface complements the existing ones: receive time -stamps can be generated and returned for arbitrary packets and much -closer to the point where the packet is really sent. Time stamps can -be generated in software (as before) or in hardware (if the hardware -has such a feature). +* SO_TIMESTAMPING + Generates timestamps on reception, transmission or both. Supports + multiple timestamp sources, including hardware. Supports generating + timestamps for stream sockets. -SO_TIMESTAMPING: -Instructs the socket layer which kind of information should be collected -and/or reported. The parameter is an integer with some of the following -bits set. Setting other bits is an error and doesn't change the current -state. +1.1 SO_TIMESTAMP: -Four of the bits are requests to the stack to try to generate -timestamps. Any combination of them is valid. +This socket option enables timestamping of datagrams on the reception +path. Because the destination socket, if any, is not known early in +the network stack, the feature has to be enabled for all packets. The +same is true for all early receive timestamp options. -SOF_TIMESTAMPING_TX_HARDWARE: try to obtain send time stamps in hardware -SOF_TIMESTAMPING_TX_SOFTWARE: try to obtain send time stamps in software -SOF_TIMESTAMPING_RX_HARDWARE: try to obtain receive time stamps in hardware -SOF_TIMESTAMPING_RX_SOFTWARE: try to obtain receive time stamps in software +For interface details, see `man 7 socket`. + + +1.2 SO_TIMESTAMPNS: + +This option is identical to SO_TIMESTAMP except for the returned data type. +Its struct timespec allows for higher resolution (ns) timestamps than the +timeval of SO_TIMESTAMP (ms). + + +1.3 SO_TIMESTAMPING: + +Supports multiple types of timestamp requests. As a result, this +socket option takes a bitmap of flags, not a boolean. In + + err = setsockopt(fd, SOL_SOCKET, SO_TIMESTAMPING, (void *) val, &val); + +val is an integer with any of the following bits set. Setting other +bit returns EINVAL and does not change the current state. -The other three bits control which timestamps will be reported in a -generated control message. If none of these bits are set or if none of -the set bits correspond to data that is available, then the control -message will not be generated: -SOF_TIMESTAMPING_SOFTWARE: report systime if available -SOF_TIMESTAMPING_SYS_HARDWARE: report hwtimetrans if available (deprecated) -SOF_TIMESTAMPING_RAW_HARDWARE: report hwtimeraw if available +1.3.1 Timestamp Generation -It is worth noting that timestamps may be collected for reasons other -than being requested by a particular socket with -SOF_TIMESTAMPING_[TR]X_(HARD|SOFT)WARE. For example, most drivers that -can generate hardware receive timestamps ignore -SOF_TIMESTAMPING_RX_HARDWARE. It is still a good idea to set that flag -in case future drivers pay attention. +Some bits are requests to the stack to try to generate timestamps. Any +combination of them is valid. Changes to these bits apply to newly +created packets, not to packets already in the stack. As a result, it +is possible to selectively request timestamps for a subset of packets +(e.g., for sampling) by embedding an send() call within two setsockopt +calls, one to enable timestamp generation and one to disable it. +Timestamps may also be generated for reasons other than being +requested by a particular socket, such as when receive timestamping is +enabled system wide, as explained earlier. -If timestamps are reported, they will appear in a control message with -cmsg_level==SOL_SOCKET, cmsg_type==SO_TIMESTAMPING, and a payload like -this: +SOF_TIMESTAMPING_RX_HARDWARE: + Request rx timestamps generated by the network adapter. + +SOF_TIMESTAMPING_RX_SOFTWARE: + Request rx timestamps when data enters the kernel. These timestamps + are generated just after a device driver hands a packet to the + kernel receive stack. + +SOF_TIMESTAMPING_TX_HARDWARE: + Request tx timestamps generated by the network adapter. + +SOF_TIMESTAMPING_TX_SOFTWARE: + Request tx timestamps when data leaves the kernel. These timestamps + are generated in the device driver as close as possible, but always + prior to, passing the packet to the network interface. Hence, they + require driver support and may not be available for all devices. + +SOF_TIMESTAMPING_TX_SCHED: + Request tx timestamps prior to entering the packet scheduler. Kernel + transmit latency is, if long, often dominated by queuing delay. The + difference between this timestamp and one taken at + SOF_TIMESTAMPING_TX_SOFTWARE will expose this latency independent + of protocol processing. The latency incurred in protocol + processing, if any, can be computed by subtracting a userspace + timestamp taken immediately before send() from this timestamp. On + machines with virtual devices where a transmitted packet travels + through multiple devices and, hence, multiple packet schedulers, + a timestamp is generated at each layer. This allows for fine + grained measurement of queuing delay. + +SOF_TIMESTAMPING_TX_ACK: + Request tx timestamps when all data in the send buffer has been + acknowledged. This only makes sense for reliable protocols. It is + currently only implemented for TCP. For that protocol, it may + over-report measurement, because the timestamp is generated when all + data up to and including the buffer at send() was acknowledged: the + cumulative acknowledgment. The mechanism ignores SACK and FACK. + + +1.3.2 Timestamp Reporting + +The other three bits control which timestamps will be reported in a +generated control message. Changes to the bits take immediate +effect at the timestamp reporting locations in the stack. Timestamps +are only reported for packets that also have the relevant timestamp +generation request set. + +SOF_TIMESTAMPING_SOFTWARE: + Report any software timestamps when available. + +SOF_TIMESTAMPING_SYS_HARDWARE: + This option is deprecated and ignored. + +SOF_TIMESTAMPING_RAW_HARDWARE: + Report hardware timestamps as generated by + SOF_TIMESTAMPING_TX_HARDWARE when available. + + +1.3.3 Timestamp Options + +The interface supports one option + +SOF_TIMESTAMPING_OPT_ID: + + Generate a unique identifier along with each packet. A process can + have multiple concurrent timestamping requests outstanding. Packets + can be reordered in the transmit path, for instance in the packet + scheduler. In that case timestamps will be queued onto the error + queue out of order from the original send() calls. This option + embeds a counter that is incremented at send() time, to order + timestamps within a flow. + + This option is implemented only for transmit timestamps. There, the + timestamp is always looped along with a struct sock_extended_err. + The option modifies field ee_info to pass an id that is unique + among all possibly concurrently outstanding timestamp requests for + that socket. In practice, it is a monotonically increasing u32 + (that wraps). + + In datagram sockets, the counter increments on each send call. In + stream sockets, it increments with every byte. + + +1.4 Bytestream Timestamps + +The SO_TIMESTAMPING interface supports timestamping of bytes in a +bytestream. Each request is interpreted as a request for when the +entire contents of the buffer has passed a timestamping point. That +is, for streams option SOF_TIMESTAMPING_TX_SOFTWARE will record +when all bytes have reached the device driver, regardless of how +many packets the data has been converted into. + +In general, bytestreams have no natural delimiters and therefore +correlating a timestamp with data is non-trivial. A range of bytes +may be split across segments, any segments may be merged (possibly +coalescing sections of previously segmented buffers associated with +independent send() calls). Segments can be reordered and the same +byte range can coexist in multiple segments for protocols that +implement retransmissions. + +It is essential that all timestamps implement the same semantics, +regardless of these possible transformations, as otherwise they are +incomparable. Handling "rare" corner cases differently from the +simple case (a 1:1 mapping from buffer to skb) is insufficient +because performance debugging often needs to focus on such outliers. + +In practice, timestamps can be correlated with segments of a +bytestream consistently, if both semantics of the timestamp and the +timing of measurement are chosen correctly. This challenge is no +different from deciding on a strategy for IP fragmentation. There, the +definition is that only the first fragment is timestamped. For +bytestreams, we chose that a timestamp is generated only when all +bytes have passed a point. SOF_TIMESTAMPING_TX_ACK as defined is easy to +implement and reason about. An implementation that has to take into +account SACK would be more complex due to possible transmission holes +and out of order arrival. + +On the host, TCP can also break the simple 1:1 mapping from buffer to +skbuff as a result of Nagle, cork, autocork, segmentation and GSO. The +implementation ensures correctness in all cases by tracking the +individual last byte passed to send(), even if it is no longer the +last byte after an skbuff extend or merge operation. It stores the +relevant sequence number in skb_shinfo(skb)->tskey. Because an skbuff +has only one such field, only one timestamp can be generated. + +In rare cases, a timestamp request can be missed if two requests are +collapsed onto the same skb. A process can detect this situation by +enabling SOF_TIMESTAMPING_OPT_ID and comparing the byte offset at +send time with the value returned for each timestamp. It can prevent +the situation by always flushing the TCP stack in between requests, +for instance by enabling TCP_NODELAY and disabling TCP_CORK and +autocork. + +These precautions ensure that the timestamp is generated only when all +bytes have passed a timestamp point, assuming that the network stack +itself does not reorder the segments. The stack indeed tries to avoid +reordering. The one exception is under administrator control: it is +possible to construct a packet scheduler configuration that delays +segments from the same stream differently. Such a setup would be +unusual. + + +2 Data Interfaces + +Timestamps are read using the ancillary data feature of recvmsg(). +See `man 3 cmsg` for details of this interface. The socket manual +page (`man 7 socket`) describes how timestamps generated with +SO_TIMESTAMP and SO_TIMESTAMPNS records can be retrieved. + + +2.1 SCM_TIMESTAMPING records + +These timestamps are returned in a control message with cmsg_level +SOL_SOCKET, cmsg_type SCM_TIMESTAMPING, and payload of type struct scm_timestamping { - struct timespec systime; - struct timespec hwtimetrans; - struct timespec hwtimeraw; + struct timespec ts[3]; }; -recvmsg() can be used to get this control message for regular incoming -packets. For send time stamps the outgoing packet is looped back to -the socket's error queue with the send time stamp(s) attached. It can -be received with recvmsg(flags=MSG_ERRQUEUE). The call returns the -original outgoing packet data including all headers preprended down to -and including the link layer, the scm_timestamping control message and -a sock_extended_err control message with ee_errno==ENOMSG and -ee_origin==SO_EE_ORIGIN_TIMESTAMPING. A socket with such a pending -bounced packet is ready for reading as far as select() is concerned. -If the outgoing packet has to be fragmented, then only the first -fragment is time stamped and returned to the sending socket. - -All three values correspond to the same event in time, but were -generated in different ways. Each of these values may be empty (= all -zero), in which case no such value was available. If the application -is not interested in some of these values, they can be left blank to -avoid the potential overhead of calculating them. - -systime is the value of the system time at that moment. This -corresponds to the value also returned via SO_TIMESTAMP[NS]. If the -time stamp was generated by hardware, then this field is -empty. Otherwise it is filled in if SOF_TIMESTAMPING_SOFTWARE is -set. - -hwtimeraw is the original hardware time stamp. Filled in if -SOF_TIMESTAMPING_RAW_HARDWARE is set. No assumptions about its -relation to system time should be made. - -hwtimetrans is always zero. This field is deprecated. It used to hold -hw timestamps converted to system time. Instead, expose the hardware -clock device on the NIC directly as a HW PTP clock source, to allow -time conversion in userspace and optionally synchronize system time -with a userspace PTP stack such as linuxptp. For the PTP clock API, -see Documentation/ptp/ptp.txt. - - -SIOCSHWTSTAMP, SIOCGHWTSTAMP: +The structure can return up to three timestamps. This is a legacy +feature. Only one field is non-zero at any time. Most timestamps +are passed in ts[0]. Hardware timestamps are passed in ts[2]. + +ts[1] used to hold hardware timestamps converted to system time. +Instead, expose the hardware clock device on the NIC directly as +a HW PTP clock source, to allow time conversion in userspace and +optionally synchronize system time with a userspace PTP stack such +as linuxptp. For the PTP clock API, see Documentation/ptp/ptp.txt. + +2.1.1 Transmit timestamps with MSG_ERRQUEUE + +For transmit timestamps the outgoing packet is looped back to the +socket's error queue with the send timestamp(s) attached. A process +receives the timestamps by calling recvmsg() with flag MSG_ERRQUEUE +set and with a msg_control buffer sufficiently large to receive the +relevant metadata structures. The recvmsg call returns the original +outgoing data packet with two ancillary messages attached. + +A message of cm_level SOL_IP(V6) and cm_type IP(V6)_RECVERR +embeds a struct sock_extended_err. This defines the error type. For +timestamps, the ee_errno field is ENOMSG. The other ancillary message +will have cm_level SOL_SOCKET and cm_type SCM_TIMESTAMPING. This +embeds the struct scm_timestamping. + + +2.1.1.2 Timestamp types + +The semantics of the three struct timespec are defined by field +ee_info in the extended error structure. It contains a value of +type SCM_TSTAMP_* to define the actual timestamp passed in +scm_timestamping. + +The SCM_TSTAMP_* types are 1:1 matches to the SOF_TIMESTAMPING_* +control fields discussed previously, with one exception. For legacy +reasons, SCM_TSTAMP_SND is equal to zero and can be set for both +SOF_TIMESTAMPING_TX_HARDWARE and SOF_TIMESTAMPING_TX_SOFTWARE. It +is the first if ts[2] is non-zero, the second otherwise, in which +case the timestamp is stored in ts[0]. + + +2.1.1.3 Fragmentation + +Fragmentation of outgoing datagrams is rare, but is possible, e.g., by +explicitly disabling PMTU discovery. If an outgoing packet is fragmented, +then only the first fragment is timestamped and returned to the sending +socket. + + +2.1.1.4 Packet Payload + +The calling application is often not interested in receiving the whole +packet payload that it passed to the stack originally: the socket +error queue mechanism is just a method to piggyback the timestamp on. +In this case, the application can choose to read datagrams with a +smaller buffer, possibly even of length 0. The payload is truncated +accordingly. Until the process calls recvmsg() on the error queue, +however, the full packet is queued, taking up budget from SO_RCVBUF. + + +2.1.1.5 Blocking Read + +Reading from the error queue is always a non-blocking operation. To +block waiting on a timestamp, use poll or select. poll() will return +POLLERR in pollfd.revents if any data is ready on the error queue. +There is no need to pass this flag in pollfd.events. This flag is +ignored on request. See also `man 2 poll`. + + +2.1.2 Receive timestamps + +On reception, there is no reason to read from the socket error queue. +The SCM_TIMESTAMPING ancillary data is sent along with the packet data +on a normal recvmsg(). Since this is not a socket error, it is not +accompanied by a message SOL_IP(V6)/IP(V6)_RECVERROR. In this case, +the meaning of the three fields in struct scm_timestamping is +implicitly defined. ts[0] holds a software timestamp if set, ts[1] +is again deprecated and ts[2] holds a hardware timestamp if set. + + +3. Hardware Timestamping configuration: SIOCSHWTSTAMP and SIOCGHWTSTAMP Hardware time stamping must also be initialized for each device driver that is expected to do hardware time stamping. The parameter is defined in @@ -167,8 +372,7 @@ enum { */ }; - -DEVICE IMPLEMENTATION +3.1 Hardware Timestamping Implementation: Device Drivers A driver which supports hardware time stamping must support the SIOCSHWTSTAMP ioctl and update the supplied struct hwtstamp_config with |