Highly Accurate Timestamping for Ethernet-Based Clock Synchronization

Synchronizing the clocks of networked devices in order to coordinate data acquisition is important, and not only for test and measurement. In this context, the quest for high accuracy in Ethernet-based clock synchronization has been significantly supported by enhancements to the Network Time Protocol (NTP) and by the introduction of the Precision Time Protocol (PTP). The latter has even been applied to instrumentation and measurement applications through the introduction of LXI. These protocols are usually implemented in software; however, the synchronization accuracy can only be improved substantially by hardware that supports taking precise event timestamps. In particular, the quality of the timestamps for ingress and egress synchronization packets has a major influence on the achievable performance of a distributed measurement or control system. This paper analyzes the influence of the jitter sources that remain despite hardware support and proposes enhanced methods for hitherto unmatched timestamping accuracy in Ethernet-based synchronization protocols. The methods shown in this paper reach sub-nanosecond accuracy, which is proven in theory and practice.


Introduction
In instrumentation and measurement, the General Purpose Interface Bus (GPIB) was for a long time the system for data collection and for networking equipment. This bus system has dedicated wiring for triggering devices and for starting measurements simultaneously. The reason for the continued use of this relatively old technology is its excellent tool and driver support and the simplicity of the system. Despite these arguments, GPIB has several drawbacks in handling (connectors, cables) and in the generality of the approach. First of all, GPIB is limited in terms of cable length and number of bus devices. The parallel data transfer and strict arbitration scheme also limit the achievable data rate and make handling and configuration quite complicated for the user. Second, GPIB is also limited in terms of functionality and does not fit modern networked systems.
A solution for the test and measurement industry to overcome the drawbacks of GPIB is the LAN eXtensions for Instrumentation (LXI) [1] approach. This de facto standard uses the well-established Ethernet technology to network measurement devices. The clear advantage is that such a system can be embedded seamlessly into office and lab networks, with all the benefits of full network functionality. The application in test and measurement is, however, only feasible if it can be ensured that the devices are properly triggered. LXI uses synchronized clocks for this purpose: a device that detects a trigger condition sends out the time of the trigger condition, causing all other devices to save, a posteriori, the data recorded at that earlier instant. This requires that all devices keep a backlog of historic data. Clearly, the precision and usability of the data depend highly on the accuracy of the clocks in such systems and therefore on the synchronization technology. For synchronizing computer clocks, the most prominent approach is the Network Time Protocol (NTP) [2]. This protocol, which is widely used on the Internet, achieves accuracies of several milliseconds, and with certain extensions even microseconds [3]. As this accuracy is not sufficient for high-precision measurements, the need for a new protocol arose: the Precision Time Protocol (PTP) [4]. PTP periodically transmits synchronization messages to update the time on a master-slave basis. For this purpose, an algorithm for master election, management, and delay compensation is implemented in one of the upper layers of the communication stack. In the LXI case, this is the application layer, where PTP communicates over the User Datagram Protocol (UDP). The standard itself, however, is independent of the communication technology. Although the protocol can be implemented in software, high-precision timestamping has to be done in hardware in order to cancel out protocol stack jitter.

This paper outlines the possibilities, analyzes different jitter sources, and proposes new approaches for highly accurate timestamping. To this end, the state of the art in Ethernet-based clock synchronization is first summarized. The motivation then points out the reasons for seeking higher timestamp precision. The following sections show the influence of the different parameters and the required components of a timestamping network interface. Existing accurate methods are given in Section 5, together with their pros and cons in Table 1. Section 6 then presents our new method for highly accurate timestamping, which is then proven to work in practice by the measurements shown in Section 7. Finally, the paper is rounded off by a conclusion.

State of the Art
Figure 1 shows the typical software and hardware structure of an Ethernet-based clock synchronization node. The protocol stack, for example, NTP or PTP, is typically implemented in user space; see Figure 1(6). Thus, event detection (timestamping) at that level [5] suffers from the jitter induced by all operations at or below this layer that are required to detect an external asynchronous hardware event (the reception of a packet). The main sources are typically the scheduling behavior of the operating system, data- (length-) dependent processing, and variable execution times, for example, due to caching. Similar reasons apply to kernel space, Figure 1(5), which usually hosts the network layer, for example, the Internet Protocol (IP), and the transport layer, for example, UDP, of the protocol stack [3].
Due to the timing uncertainties in software (except in specially designed real-time systems), almost all high-accuracy synchronization systems rely on event detection close to the physical layer. Since in Ethernet even the data link layer, that is, Media Access Control (MAC), has a variable processing time on the transmit path due to the Carrier Sense Multiple Access with Collision Detection (CSMA/CD) mechanism and a possible backoff delay [6], accurate solutions rely on timestamping at the Media Independent Interface (MII). The necessary data scanner, that is, the Media-Independent Interface Scanner (MIIS) block, can be attached to the MII as a separate device [7] or integrated within the MAC.
The advantage of the MII is that all receive signals are already in the digital domain but are still source synchronous with respect to the analog data on the line. This enables the synchronization node to determine timestamps with high precision, as the interface is phase-locked to the remote transmitter. In contrast, interfaces synchronous to the local oscillator, for example, R(G)MII, introduce additional jitter, as indicated by (2) in Figure 1, because elastic buffering is required to compensate for clock frequency offsets.
All effects that can deteriorate the performance via the physical layer (see Section 4.3) are summarized by Figure 1(1). These mainly include the analog properties of the physical layer entity (PHY), the cable, and the transmission standard. Timestamping can also be done directly at the physical layer, as shown in [8].
Besides the event detection itself, Figure 1 also outlines the integration of the necessary synchronization functions in hardware in the Clock Synchronization Cell (CSC). This element is responsible for timekeeping and timestamping. It is driven by an oscillator, which is itself subject to instabilities, indicated by Figure 1(4). If a separate oscillator is used (e.g., for higher stability), an additional clock transition, Figure 1(3), is introduced, which again adds jitter. This issue can be solved by using a single oscillator for both media transmission and timekeeping.
To build a complete Ethernet network, one element is essential: the switch. In principle, Ethernet switches use the same PHY and MAC as ordinary nodes. Concerning clock synchronization, the residence time of a packet in a switch has to be measured to compensate for varying switching decision and queuing times. Since the principles and influences are similar to those of a node, the results of this paper can be applied accordingly.

Motivation
As outlined in the introduction, synchronization is mandatory for test and measurement applications, among other areas. Precise timestamping is required because accurate synchronization over packet-oriented networks depends on a common event that can be detected by the synchronizing nodes. Such events are special messages, regularly sent out by PTP together with the absolute time of the event. Moreover, PTP uses a master-slave approach, where one clock master synchronizes several slaves. The basic principle of timestamping is illustrated in Figure 2. For packet-oriented networks, the synchronization protocols define the occurrence of a packet (mostly, the start of the data frame) on the medium as the common event. As shown in the figure, the master node captures the transmission time and copies it into the synchronization packet. On the other end, the slave detects the reception, compares the event time with the information contained in the packet, and adjusts its clock. The remaining offset between the clocks, which is due to the transmission delay on the network, can be compensated by round-trip delay measurements. These use the same timestamping techniques by sending a packet from slave to master or vice versa. Assuming that the delay on a link is symmetric, that is, it takes the same time to transmit a message from node A to node B as from node B to node A, the line delay can be calculated by taking the time difference between the send and receive events, subtracting the residence time at the remote side, and halving the result.
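The round-trip delay computation described above can be illustrated with a minimal Python sketch. It assumes the four-timestamp naming commonly used for PTP's delay request-response exchange (t1 to t4); the `Exchange` class and the function names are hypothetical, all times are in nanoseconds, and the link delay is assumed to be symmetric.

```python
from dataclasses import dataclass


@dataclass
class Exchange:
    """Timestamps of one two-way exchange in nanoseconds (hypothetical container)."""
    t1: float  # master: Sync transmitted (master timebase)
    t2: float  # slave: Sync received (slave timebase)
    t3: float  # slave: delay request transmitted (slave timebase)
    t4: float  # master: delay request received (master timebase)


def link_delay(e: Exchange) -> float:
    """Round-trip time minus the residence time at the slave, halved.
    Only valid if the link delay is symmetric."""
    return ((e.t4 - e.t1) - (e.t3 - e.t2)) / 2.0


def slave_offset(e: Exchange) -> float:
    """Offset of the slave clock relative to the master (positive: slave ahead)."""
    return ((e.t2 - e.t1) - (e.t4 - e.t3)) / 2.0


# Synthetic exchange: true link delay 500 ns, slave clock 200 ns ahead of the master.
e = Exchange(t1=1000.0, t2=1700.0, t3=2700.0, t4=3000.0)
print(link_delay(e), slave_offset(e))  # 500.0 200.0
```

If the two directions differ, half of the difference between them appears directly as an offset error, which is why the symmetry assumption matters.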
The alternative to precise timestamping, namely long-term averaging to enhance the precision, cannot be applied due to the nonstationary jitter of several elements within the synchronization loop, in particular the oscillator. Thus, accurate timestamping is the key to any high-accuracy clock synchronization method. This paper focuses on the factors that affect the system precision despite the use of hardware timestamping in Ethernet, in order to develop methods that acquire precise timestamps. Further, tradeoffs are identified that allow different parameters to be tuned to reach a predefined accuracy bound efficiently.

Sources for Inaccuracy
Since timestamping at a certain network layer eliminates the timing influences of all layers above, hardware timestamping cancels out software dependencies. Even hardware implementations, however, are subject to limitations from remaining jitter sources that affect the achievable accuracy [9]. As the first two of the following aspects are influenced by a wide range of parameters, their impact is summarized in this section, while more detail is presented in the appendix.

4.1. Oscillator.
An ideal oscillator serving as the timebase for a node would require only a single synchronization at startup to compensate for the initial offset. Because every oscillator is subject to a number of physical phenomena, the progress of time is not constant; even worse, the accuracy depends on the considered holdover time, which is the interval between two synchronization events. Periodic resynchronization is therefore indispensable. Several short-term noise phenomena, for example, phase noise, additionally complicate precise timestamping.

Synchronization Interval.
Considering only the stability characteristics of a selected oscillator, an optimal synchronization interval can be chosen. Usually, the longest interval with the lowest absolute clock jitter is chosen to minimize the necessary network traffic between the nodes. However, as inaccuracies in timestamping cannot be distinguished from oscillator jitter, both have to be as low as possible. While the timestamp inaccuracy is independent of the synchronization interval, the timebase error caused by the oscillator instability increases with time. Hence, depending on the interval, either the oscillator or the timestamping can be identified as the limiting factor.
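The tradeoff can be illustrated with a deliberately simplified model, assumed here purely for illustration: the oscillator contributes a time error that grows linearly with the interval (a constant residual fractional frequency error `y`), while the timestamping error `sigma_ts` stays constant. Real oscillators follow more complex noise laws (for example, as characterized by the Allan deviation), and the numbers below are hypothetical.

```python
def oscillator_error(interval_s: float, y: float) -> float:
    """Time error (s) accumulated during one synchronization interval, assuming a
    constant residual fractional frequency error y (simplified model)."""
    return abs(y) * interval_s


def crossover_interval(y: float, sigma_ts_s: float) -> float:
    """Interval at which the oscillator error equals the timestamping error."""
    return sigma_ts_s / abs(y)


def limiting_factor(interval_s: float, y: float, sigma_ts_s: float) -> str:
    """Name the error source that dominates for the given interval."""
    return "oscillator" if oscillator_error(interval_s, y) > sigma_ts_s else "timestamping"


# Hypothetical values: 1 ns timestamping error, residual frequency error of 1e-8.
y, sigma_ts = 1e-8, 1e-9
print(crossover_interval(y, sigma_ts))  # approximately 0.1 s
for interval in (0.01, 1.0):
    print(interval, limiting_factor(interval, y, sigma_ts))
```

Below the crossover interval, improving the timestamps pays off; above it, a better oscillator or a shorter interval is needed.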

Physical Layer Properties.
Since it is not (cost) efficient to replace commercial off-the-shelf (COTS) PHYs with a proprietary solution supporting timestamping, the most reasonable way to add high-precision timestamping to a system is to use the interface between the PHY and the MAC. Thus, the delay behavior of the PHY still influences the achievable performance of the timestamping method.
The most important properties are the line coding, the translation to the MII, and the internal phase-locked loop (PLL). Since Ethernet is designed as an asynchronous network, the receive side of the physical layer has to recover the transmission clock in order to decode the data correctly. In the other direction, data transmission can be performed with the locally available clock. This does not introduce a clock transition and therefore does not increase the jitter (the reason why clock transitions add jitter is given in Section 4.4).
Besides dynamic link delay changes, the absolute delay of the PHY-to-PHY system can also vary each time the link is re-established. This is because, for example, in Fast Ethernet, the 125 MBaud on the line have to be translated to 4-bit symbols on the 25 MHz MII, which allows five different locking positions [10].
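The arithmetic behind the five locking positions can be sketched as follows. The assumption, made for illustration only, is that the possible positions are spaced by one code-bit period on the line; the actual delays of a particular PHY may differ.

```python
BAUD_RATE = 125e6   # 100 Base-T symbol rate on the line (4B/5B coded)
MII_CLOCK = 25e6    # MII clock: one 4-bit nibble per cycle

bit_time_ns = 1e9 / BAUD_RATE                    # duration of one code bit
bits_per_nibble = round(BAUD_RATE / MII_CLOCK)   # code bits per MII cycle (one 5-bit group)

# Assumption: a 5-bit code group can be aligned to the MII clock at any of its bit
# boundaries, so re-establishing the link may select any of these delay offsets.
offsets_ns = [k * bit_time_ns for k in range(bits_per_nibble)]
print(bits_per_nibble, offsets_ns)  # 5 [0.0, 8.0, 16.0, 24.0, 32.0]
```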
The delay behavior of the three most popular copper-based Ethernet transmission standards is illustrated in Figure 3, using two Marvell 88E1111 PHYs over a 3 m direct connection. Since 10 Base-T keeps the line idle when there is no data to be transmitted, the PLL of the receiver has to resynchronize to the transmission clock with each packet. If the packet rate is low, the behavior is similar to a link re-establishment, since the PLL can take any of the possible locking positions, that is, two different positions 100 ns apart in Figure 3. At high packet rates, the drift of the receive PLL between two packets is small, and thus the PHY can stay synchronized, which results only in the jitter of the clock recovery process.
Compared to the original Ethernet, Fast Ethernet introduced 4B/5B line encoding and the Idle code-group (clause 24.2.2.1.2 of [6]). The coding replaces groups of four bits with five-bit code groups, which are chosen so that long constant bit sequences are avoided to ease clock recovery. Additionally, it is possible to insert control codes, for example, to denote the start of a transmission (/J/, /K/). Hence, synchronization can be maintained continuously, independent of the data transfer, which results in a significantly lower standard deviation of the transmission delay.
In contrast to the improvement from the original 10 Base-T to 100 Base-T, Gigabit Ethernet does not give a performance boost for clock synchronization. Because the physical layer uses a single clock for both directions, there is a master PHY and a slave PHY. Thus, there is a clock transition on the receiver side to the local clock. Due to the 125 MHz clock rate, this results in a communication delay uniformly distributed over a window of 8 ns on the slave side, while the master only shows PLL and oscillator phase noise. Moreover, the assumption that the problem of asymmetric delays in Ethernet can be solved with Gigabit Ethernet, because all copper pairs are used bidirectionally, does not hold, as shown in Figure 3.
Summarizing the findings, apart from the initial locking with a specific absolute delay, the communication jitter of Fast Ethernet is by far the lowest (with a standard deviation σ = 0.286 ns, compared to 1.387 ns for 10 Base-T and a theoretical 2.309 ns for 1000 Base-T).
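The theoretical 1000 Base-T figure follows from the uniformly distributed delay over one 8 ns period of the 125 MHz clock, whose standard deviation is the window width divided by the square root of 12. The short Python check below computes this value and cross-checks it with a Monte Carlo simulation (the seed and sample count are arbitrary choices).

```python
import math
import random


def uniform_std(window_ns: float) -> float:
    """Standard deviation of a delay uniformly distributed over window_ns."""
    return window_ns / math.sqrt(12.0)


# 1000 Base-T slave: clock transition at 125 MHz -> delay uniform over one 8 ns period.
window = 1e9 / 125e6
print(round(uniform_std(window), 3))  # 2.309

# Monte Carlo cross-check with uniformly distributed arrival phases.
rng = random.Random(1)
samples = [rng.uniform(0.0, window) for _ in range(100_000)]
mean = sum(samples) / len(samples)
std = math.sqrt(sum((s - mean) ** 2 for s in samples) / (len(samples) - 1))
print(round(std, 2))  # close to 2.31
```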

Timestamp Resolution.
The resolution of the timestamp mainly influences the quality of the timebase comparison in the control loop of the synchronization system. While it is generally no problem to obtain enough resolution in hardware, the difficulty arises from the fact that the timestamp has to be transferred through various network layers to the application [11]. Linux, for example, has supported timestamps with nanosecond resolution since version 2.6.22. The structures necessary to transfer hardware timestamps to applications were added in version 2.6.30. Currently, the resolution is limited to one nanosecond, which makes it difficult to reliably achieve synchronization accuracies below one nanosecond.
Figure 4 illustrates the main problem of highly accurate timestamping. Since Ethernet is an asynchronous network, ingress packets are asynchronous to the local clock of the timebase. Thus, the issue boils down to detecting the occurrence of a packet, that is, the frame-active signal, with high precision. In synchronous digital designs, an event asserted asynchronously between two clock edges can be detected at the earliest at the next edge. Highly accurate solutions therefore have to measure or estimate the exact occurrence of the event with respect to the local clock [12].

Existing Single-Shot Methods
Single-shot methods are one solution to the problem of accurate timestamping; they measure $\delta_{TS}$ directly by determining how long after the rising edge of the local clock the timestamp signal $S_{TS}$ has been asserted. For that purpose, a clock cycle $T_l$ is divided into $n \in \mathbb{N}^+$ equally spaced fractions, which reduces the timestamping variance (the variance of a uniform distribution) to $\sigma_{TS}^2 = T_l^2/(12n^2)$. The main advantage is that the timestamp signal is not required to occur at regular intervals. Events can be detected without any further reference, even on their first occurrence.
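The variance formula above can be checked numerically. The following Python sketch assumes that the event phase is uniformly distributed over one local clock period and that each event is reconstructed as the centre of the fraction it falls into; the function name and trial counts are illustrative choices.

```python
import math
import random


def quantization_std(T_l: float, n: int, trials: int = 100_000, seed: int = 0) -> float:
    """Monte Carlo estimate of the timestamp error standard deviation when the local
    clock period T_l is divided into n equal fractions. The event phase is drawn
    uniformly from [0, T_l) and reconstructed as the centre of its fraction."""
    rng = random.Random(seed)
    step = T_l / n
    errors = []
    for _ in range(trials):
        delta = rng.uniform(0.0, T_l)          # true offset after the local clock edge
        k = min(int(delta / step), n - 1)      # index of the fraction containing it
        errors.append((k + 0.5) * step - delta)
    mean = sum(errors) / trials
    return math.sqrt(sum((e - mean) ** 2 for e in errors) / (trials - 1))


T_l = 10.0  # ns, e.g., a 100 MHz local clock
for n in (1, 2, 8):
    simulated = quantization_std(T_l, n)
    predicted = T_l / (n * math.sqrt(12))      # square root of T_l^2 / (12 n^2)
    print(n, round(simulated, 3), round(predicted, 3))
```

The simulated and predicted columns should agree closely, and both shrink by the factor $n$.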

High-Speed Counter (HSC).
This approach divides $T_l$ by means of a narrow high-frequency counter with period $T_h = T_l/n$. The counter is reset at every rising edge of the local clock, and its value is frozen when $S_{TS}$ becomes active. The sampled counter value divided by $n$ then describes the relative phase offset $\delta_{TS}$. Since the period of the local clock is exactly a multiple of $T_h$, the clock transition between these two clocks can be designed without any additional jitter, and consequently the timestamping variance is reduced by a factor of $n^2$. The result is similar to a design completely clocked at $1/T_h$, with the advantage that only a few logic elements have to run at a high frequency.
With very little additional effort, $n$ can be doubled by using registers that sample on the opposite clock edge. Such register pairs using both edges, called Double Data Rate (DDR) registers, are available in many devices for use in communication links. The additional improvement by a factor of two is achieved without changing the clock rate and is a special case of the next option, requiring only an inverter.
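The combination of the coarse local timebase and the frozen high-speed counter can be sketched in Python. This is an idealized model, not a hardware description: `hsc_capture` and `hsc_timestamp` are hypothetical names, the event time is given directly, and the DDR variant is modeled simply as doubling $n$.

```python
def hsc_capture(t_event: float, T_l: float, n: int) -> tuple[int, int]:
    """Idealized hardware model: the local timebase counts cycles of period T_l,
    and a high-speed counter with period T_h = T_l / n is reset at every rising
    edge of the local clock and frozen when S_TS is asserted at t_event."""
    coarse = int(t_event // T_l)
    fine = int((t_event - coarse * T_l) // (T_l / n))
    return coarse, min(fine, n - 1)


def hsc_timestamp(coarse: int, fine: int, T_l: float, n: int) -> float:
    """Timestamp = local cycle count plus the relative phase offset delta_TS = fine / n."""
    if not 0 <= fine < n:
        raise ValueError("fine must lie in [0, n)")
    return (coarse + fine / n) * T_l


T_l = 10.0  # ns (100 MHz local clock)
# n = 4: counter sampled on rising edges only; n = 8: the same counter clock with
# DDR registers (both edges), modeled here simply as doubling n.
for n in (4, 8):
    coarse, fine = hsc_capture(1234.56, T_l, n)
    print(n, coarse, fine, hsc_timestamp(coarse, fine, T_l, n))
```

For an event at 1234.56 ns, this yields 1232.5 ns with $n = 4$ and 1233.75 ns with $n = 8$, illustrating how the DDR trick halves the quantization step without a faster counter clock.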

Phase-Shifted Clocks (PSCs).
The use of phase-shifted clocks is another method to partition $T_l$. For this technique, $n - 1$ additional clocks are generated; together with $C_l$, they form $n$ clocks phase shifted by $2\pi k/n$, $k = 0, 1, \ldots, n-1$, with respect to $C_l$. The timestamp signal is registered into $n$ registers, each clocked with a different one of these clocks. If $S_{TS}$ becomes active, the first $i$ registers still sample the old state. This generates a so-called thermometer code, which is converted to a binary code and used in the same manner as in the HSC. Since $S_{TS}$ drives $n$ registers, special care has to be taken that the clocks at the registers have the designed phase shifts (i.e., that the registers are equally spaced in time), since otherwise the thermometer code becomes nonlinear. This effect and the number of output clocks per PLL limit $n$ to about 10 in state-of-the-art Field Programmable Gate Array (FPGA) devices (e.g., the Altera Stratix III family [13]).
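The thermometer code and its conversion can be sketched with an idealized Python model. It assumes that register $k$ samples at relative phase $k/n$ of the local clock cycle and that the old state of $S_{TS}$ is 0; the function names are hypothetical. Counting the zeros, rather than searching for the 0/1 boundary, is one common way to obtain the binary value and also tolerates a single "bubble".

```python
def sample_phase_shifted(delta: float, n: int) -> list[int]:
    """Idealized model of n registers clocked with phase shifts 2*pi*k/n
    (k = 0..n-1). S_TS rises at relative phase delta in [0, 1) of the local
    clock cycle; registers that sample before that instant still see the old
    state (0), the others capture the new state (1)."""
    return [1 if k / n >= delta else 0 for k in range(n)]


def thermometer_to_phase(code: list[int]) -> float:
    """Convert the thermometer code to a relative phase: i registers still hold the
    old state, so delta_TS lies in ((i - 1)/n, i/n]. Counting the zeros instead of
    locating the 0/1 boundary also tolerates a single bubble from clock skew."""
    return code.count(0) / len(code)


code = sample_phase_shifted(0.37, 8)
print(code, thermometer_to_phase(code))                # [0, 0, 0, 1, 1, 1, 1, 1] 0.375
print(thermometer_to_phase([0, 0, 1, 0, 1, 1, 1, 1]))  # bubble still decodes to 0.375
```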

Tapped Delay Lines (TDLs).
TDLs are a common approach for digitizing time intervals with sub-nanosecond accuracy. The basic configuration of a TDL consists of a serial chain of $n$ latches with delay $\tau_L$, a second chain of noninverting buffers with delay $\tau_B < \tau_L$, and output logic, as described in [14]. The signal to be timestamped is fed through all latches, which freeze its current state at the rising edge of a second signal (the clock). The resulting thermometer code can then be evaluated after the next clock edge.
Nevertheless, it has to be considered that such a design uses asynchronous logic, and therefore the delays $\tau_L$ and $\tau_B$ depend not only on placement but also on temperature. The linearity of a TDL may be compromised by these effects, and special calibration logic may be required. The achievable precision depends on the intrinsic switching speed of the latches and is typically in the range of 100 ps [15].
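To illustrate how two chains with $\tau_B < \tau_L$ can resolve intervals finer than a single gate delay, the Python sketch below models a classic Vernier delay line: the event travels through the slower latch chain while the clock follows through the faster buffer chain and gains $\tau_L - \tau_B$ per stage. This arrangement is our assumption for illustration; the design in [14] may differ in detail, and the delays (in picoseconds) are hypothetical.

```python
def vernier_tdl(dt_ps: int, tau_l_ps: int, tau_b_ps: int, n: int) -> list[int]:
    """Idealized Vernier delay line with n stages. The event enters the latch chain
    (delay tau_l per stage) dt_ps before the clock enters the buffer chain (delay
    tau_b < tau_l per stage). Stage k latches 1 if the event has already arrived
    when the clock reaches it, i.e., if k * tau_l <= dt + k * tau_b."""
    if tau_b_ps >= tau_l_ps:
        raise ValueError("requires tau_b < tau_l")
    return [1 if k * tau_l_ps <= dt_ps + k * tau_b_ps else 0 for k in range(1, n + 1)]


def decode_ps(code: list[int], tau_l_ps: int, tau_b_ps: int) -> int:
    """The clock gains tau_l - tau_b on the event per stage, so the number of ones
    quantizes dt with that resolution (valid while dt < n * (tau_l - tau_b))."""
    return sum(code) * (tau_l_ps - tau_b_ps)


code = vernier_tdl(dt_ps=350, tau_l_ps=500, tau_b_ps=400, n=16)
print(code[:5], decode_ps(code, 500, 400))  # [1, 1, 1, 0, 0] 300
```

The model also shows why temperature matters: any drift of $\tau_L$ or $\tau_B$ changes the per-stage resolution directly, which is what the calibration logic has to correct.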

Proposed Phase-Estimating Solution.
Phase-estimating methods do not measure $\delta_{TS}$ directly, but rather estimate it, using the fact that $S_{TS}$ is asserted synchronously to the communication clock. A relatively new approach for highly accurate time interval measurements is described in [16] for active clock skew compensation in VLSI designs using analog components. The authors of [17] present a similar method using a 10 MHz atomic clock source sampled by an analog-to-digital converter (ADC) driven by the communication clock. The resulting waveform is used to perform a phase estimation by a 1024-point Fast Fourier Transform (FFT). While the results show a very low timestamping standard deviation of about 10 ps, this approach requires one ADC per timestamper and an atomic clock to achieve this performance.

Approach.
As with analog single-shot methods, additional components (especially analog ones) are rather impractical for the targeted application area, timestamping in Ethernet. Consequently, we presented a technique to implement the scheme using purely digital function blocks in [12].
In this approach, the frequencies of the involved clocks have to fulfill several requirements in order to benefit from phase estimation. First of all, the rising edges of the local clock $C_l$ should be uniformly distributed over the period $T_c$ of the communication clock $C_c$ when averaged over a given time span (i.e., the rising edge of the local clock should cover all phases of the communication clock with equal probability). This implies that $T_c$ must not equal $T_l$ or a multiple thereof, because in that case the rising edge would always coincide with a certain phase of $C_c$ and the necessary averaging time span would be infinite. It is known from estimation and detection theory that applying a randomization function may help [18]. However, such a randomization function (e.g., a clock with very high jitter) is an impractical solution, as the clock jitter reduces the maximum attainable clock frequency in the same way.
In order to sample $C_c$ (or signals synchronous to it) at a number of different phases, the local clock frequency has to be a nonmultiple of the communication clock frequency. To express this criterion, the nominal local clock period $T_l$ is given by

$$T_l = \frac{(1-\beta)\,T_c}{n}, \tag{1}$$

with the design parameter $\beta$ as the relative frequency offset factor and $n$ as the nominal oversampling factor. A cycle slip (i.e., when the rising edges of the two clocks pass each other) occurs if the difference in the periods sums up to $T_l$, as given by

$$\frac{1}{\varepsilon}\left(\frac{T_c}{n} - T_l\right) = T_l, \tag{2}$$

$$\varepsilon = \frac{T_c/n - T_l}{T_l} = \frac{\beta}{1-\beta}. \tag{3}$$

Using (3), cycle slips happen every $1/\varepsilon$ local clock periods. This implies that $C_c$ is sampled at $1/\varepsilon + 1$ different phase points over one cycle slip period and that the maximal attainable precision is nominally $\beta T_l$. Selecting a very small $\beta$ results in a small frequency offset and fine resolution, but this can cause $T_c \le nT_l$ due to instabilities of the two involved oscillators during the long theoretical cycle slip period. Consequently, cycle slips might not happen at all or, even worse, reverse cycle slips can occur, possibly resulting in data loss.
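The relations between $T_l$, $\varepsilon$, and the cycle slip period can be checked numerically with the parameters quoted for Figure 5 ($n = 2$, $\beta = 0.13$); the 25 MHz MII clock ($T_c = 40$ ns) is our assumption. The helper `slips_forward` and the worst-case ±50 ppm deviations are illustrative assumptions used to show why a very small $\beta$ is fragile.

```python
def local_period(T_c: float, beta: float, n: int) -> float:
    """Nominal local clock period T_l = (1 - beta) * T_c / n."""
    return (1.0 - beta) * T_c / n


def epsilon(T_c: float, T_l: float, n: int) -> float:
    """Relative period difference per local clock cycle, equal to beta / (1 - beta)."""
    return (T_c / n - T_l) / T_l


def slips_forward(T_c: float, T_l: float, n: int, ppm_c: float, ppm_l: float) -> bool:
    """True if n * T_l stays below T_c when the actual periods deviate from their
    nominal values by ppm_c and ppm_l, i.e., cycle slips keep their nominal direction."""
    return n * T_l * (1 + ppm_l * 1e-6) < T_c * (1 + ppm_c * 1e-6)


T_c, n = 40.0, 2                          # 25 MHz MII clock, oversampling factor 2
T_l = local_period(T_c, 0.13, n)
eps = epsilon(T_c, T_l, n)
print(round(T_l, 3), round(1 / eps, 2))   # 17.4 6.69  (cycle slip every ~6.7 cycles)
print(round(0.13 * T_l, 3))               # nominal resolution beta * T_l = 2.262 ns

# A very small beta is fragile: with worst-case +/-50 ppm oscillators the
# order of n * T_l and T_c flips and cycle slips reverse direction.
T_l_tiny = local_period(T_c, 50e-6, n)
print(slips_forward(T_c, T_l_tiny, n, ppm_c=-50, ppm_l=+50))  # False
print(slips_forward(T_c, T_l, n, ppm_c=-50, ppm_l=+50))       # True
```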
Furthermore, small frequency differences are unusable, since the rising edge is only uniformly distributed within $T_c$ over a long averaging window, and for short averaging windows the leakage effect [19] becomes dominant.

Timestamp Averaging (TSA).
Timestamping a frame $m$ times by means of the communication clock is one option to estimate the phase at the timestamping event, based on the assumption that $\delta_{TS}$ averages to 0.5. Given that the timestamps are centered around the assertion of $S_{TS}$ (i.e., $(m-1)/2$ timestamps before and after the event) and that the clock rate is not changed during the timestamping period, the final timestamp $TS$ is calculated as a weighted average of the timestamps $TS_i$ following

$$TS = \sum_{i=1}^{m} \alpha_i\, TS_i, \qquad \sum_{i=1}^{m} \alpha_i = 1.$$

This Finite Impulse Response (FIR) filter can be simplified to a window integrator with $\alpha_i = 1/m$, with some limitations, namely, leakage effects. The timestamping window should cover one cycle slip period or a multiple thereof to get the timestamps uniformly distributed over one period $T_c$. Since the cycle slip period depends on the current frequency offset, it varies with the oscillator drift between the communication and the local clock. One solution to this problem is to adjust $m$ so that it always covers a multiple of the cycle slip period; another is to capture a large number of such periods and use a windowing function to minimize leakage effects. The leakage can also be reduced by selecting a rather large $\varepsilon$, with the drawback of reduced resolution.
In the optimal case, the resulting timestamping variance reduces to $\sigma_{TS}^2 = T_l^2/(12m)$. For example, a typical IEEE 1588 frame with a length of about 80 bytes may create $m = 160$ timestamps on the 4-bit-wide MII. Given that $T_l$ equals 10 ns and all timestamps can be considered uncorrelated, the standard deviation becomes 228 ps.
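The 228 ps figure can be reproduced in a few lines of Python. The sketch uses the window integrator ($\alpha_i = 1/m$) and, as in the text, assumes that the quantization errors of the $m$ timestamps are uncorrelated and uniform over one local clock period; it deliberately does not model the leakage that appears when this assumption fails.

```python
import math
import random


def tsa_std(T_l: float, m: int) -> float:
    """Theoretical standard deviation after averaging m uncorrelated timestamps."""
    return T_l / math.sqrt(12 * m)


def tsa_std_monte_carlo(T_l: float, m: int, trials: int = 4_000, seed: int = 0) -> float:
    """Simulate the window integrator (alpha_i = 1/m): each timestamp carries an
    independent quantization error uniform in [-T_l/2, T_l/2)."""
    rng = random.Random(seed)
    means = [sum(rng.uniform(-T_l / 2, T_l / 2) for _ in range(m)) / m
             for _ in range(trials)]
    mu = sum(means) / trials
    return math.sqrt(sum((x - mu) ** 2 for x in means) / (trials - 1))


m = 80 * 8 // 4   # 80-byte frame on the 4-bit-wide MII -> 160 nibbles
T_l = 10.0        # ns
print(m, round(tsa_std(T_l, m) * 1000))           # 160 228  (ps)
print(round(tsa_std_monte_carlo(T_l, m) * 1000))  # close to 228
```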

Digital Phase Estimation (DPE).
The phase of the communication clock can also be estimated directly by phase detectors. Such a detector can, for instance, be based on a mixer, which shifts the spectrum of the clock to a low frequency, similar to common superheterodyne receivers. The output of the mixer is low-pass filtered to remove the components at multiples of the input frequency and is fed into a phase estimator. Given that the duty cycle of the clock input signal is constant, the low-frequency part of the downmixed signal is then a measure of the phase difference at the inputs. Nevertheless, real filters with a low bandwidth introduce significant group delay, which has to be taken into account when calculating the timestamp. To allow for a digital implementation, the mixer can be replaced by an XOR gate, and the output can be filtered in the same way. A further solution would be to use an external analog antialiasing filter in combination with an ADC and to perform only the second filtering step digitally, similar to Zhu's method [17].
A purely digital implementation without external parts can be achieved, but such a processing scheme involves undersampling. As the XORed signal is not band-limited, sampling results in alias frequencies that can dominate the signal (e.g., if the 1-bit sampler is driven by a clock correlated with $C_l$). One feasible solution is to sample the mixed signal with a clock that is uncorrelated to both inputs and to apply the sampled signal to a low-pass filter with a very low relative bandwidth. Such filters are typically of the Infinite Impulse Response (IIR) type, since comparable FIR filters would need a large number of taps. IIR filters, on the other hand, have a frequency-dependent group delay, which means that the frequency offset between $C_l$ and $C_c$ must be estimated in order to compensate for the filter's group delay.
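The XOR-based detector with an uncorrelated sampling clock and an IIR low-pass filter can be sketched in Python. Several simplifications are assumptions of this sketch: both inputs are ideal 50 % duty-cycle clocks of equal frequency (so the filter's group delay does not matter here), the "uncorrelated" sampling clock is modeled by a period with an irrational ratio ($\sqrt{2}$) to the input period, and the filter is a first-order IIR stage. For two such clocks offset by a fraction $\varphi \in [0, 0.5]$ of a period, the XOR output is high for a fraction $2\varphi$ of the time; the detector cannot tell whether the phase leads or lags.

```python
import math


def square(t: float, period: float, phase: float = 0.0) -> int:
    """50 % duty-cycle clock that is high in the first half of each period;
    `phase` delays it by a fraction of the period."""
    return 1 if ((t / period - phase) % 1.0) < 0.5 else 0


def xor_phase_estimate(phase: float, period: float = 40.0,
                       samples: int = 50_000, alpha: float = 1e-3) -> float:
    """Estimate the phase offset (fraction of a period, valid for 0..0.5) between two
    clocks of equal frequency: XOR them, sample the 1-bit result with a clock whose
    period is incommensurate with theirs, and low-pass filter with a first-order IIR."""
    t_s = period * math.sqrt(2.0)   # sampling period with an irrational ratio (assumption)
    y = 0.0
    for k in range(samples):
        x = square(k * t_s, period) ^ square(k * t_s, period, phase)
        y += alpha * (x - y)        # y[k] = (1 - alpha) * y[k-1] + alpha * x[k]
    return y / 2.0                  # XOR duty cycle equals 2 * phase for phase in [0, 0.5]


for true_phase in (0.1, 0.2, 0.35):
    print(true_phase, round(xor_phase_estimate(true_phase), 3))
```

Replacing `t_s` with an integer multiple of `period` makes every sample hit the same phase, which is the aliasing problem described above.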

Combined Phase/Frequency Estimation (PFE).
Rather than estimating $\delta_{TS}$ directly, it is also possible to estimate the phase via its derivative, the frequency offset, together with a reference point. In the following, the hat notation (e.g., $\hat{x}$ for an estimate of $x$) is used to differentiate between the true value and its estimate. Our phase-estimation-by-frequency-estimation approach can be implemented as follows. Whenever the communication clock is phase aligned to the local clock (i.e., when a cycle slip occurs), the phase estimate $\hat{\delta}_{TS}$ is set to zero. In every subsequent clock cycle, $\hat{\delta}_{TS}$ is incremented by the estimated inverse cycle slip period $\hat{\varepsilon}$. In other words, $\hat{\delta}_{TS}$ is the sampled integral of $\hat{\varepsilon}$ over one cycle slip period. Given that the frequency is stable within the averaging interval, $\hat{\delta}_{TS}$ would ideally reach 1.0 at the next cycle slip, as depicted in Figure 5 for $n = 2$. For better visibility, a relatively high value of 0.13 was chosen for $\beta$, resulting in a (rather short) cycle slip period of 6.7 local clock cycles.
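The reset-and-integrate rule can be sketched as a toy Python model. The assumptions are: the true phase advances by exactly $\varepsilon = \beta/(1-\beta)$ per local clock cycle, a cycle slip is registered in the first cycle after that phase wraps past 1, and the frequency estimate is perfect ($\hat{\varepsilon} = \varepsilon$). The function name is hypothetical.

```python
import math


def pfe_phase_estimates(eps_hat: float, slip_cycles: set[int], n_cycles: int) -> list[float]:
    """Phase estimate delta_hat per local clock cycle: reset to zero in a cycle in
    which a cycle slip is detected, otherwise incremented by eps_hat."""
    delta_hat, out = 0.0, []
    for k in range(n_cycles):
        delta_hat = 0.0 if k in slip_cycles else delta_hat + eps_hat
        out.append(delta_hat)
    return out


# Toy model with beta = 0.13: the true phase advances by eps = beta / (1 - beta)
# per local cycle, and a cycle slip is registered whenever it wraps past 1.
beta = 0.13
eps = beta / (1 - beta)                      # inverse cycle slip period, about 1 / 6.7
true_phase = [(k * eps) % 1.0 for k in range(60)]
slips = {k for k in range(60) if k == 0 or math.floor(k * eps) > math.floor((k - 1) * eps)}

estimates = pfe_phase_estimates(eps, slips, 60)   # assumes a perfect estimate eps_hat = eps
max_error = max(abs(e - p) for e, p in zip(estimates, true_phase))
print(sorted(slips)[:4], max_error < eps)         # [0, 7, 14, 21] True
```

Because the cycle slip period is not an integer, each reset discards a residual phase smaller than one step, so in this toy model the estimation error stays below $\varepsilon$.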
Accurate detection of the cycle slip instant is critical for starting the integration of the frequency offset that yields $\hat{\delta}_{TS}$. To make the method independent of the duty cycle of the communication clock, a derived clock $C_d$ with period $2T_c$ is generated digitally. $C_d$ is fed into a shift register with $5 + n$ taps clocked by the local clock, which is used for cycle slip checking and edge detection. While the first two taps are used for buffering, the middle taps ($2 + n$ down to 2) are used for cycle slip detection: if all $n + 1$ middle taps contain the same binary value, a cycle slip must have happened. The last two taps (1 down to 0) are used to detect rising and falling edges of $C_d$, at which the buffered timestamp signal $S_{TS}$ is checked for a high level.
The dots on the derived communication clock $C_d$ mark the sampling points of this signal by the local clock $C_l$. Each time the signal is sampled three times with the same value, a cycle slip must have occurred, and the phase is reset.
Because the cycle slip period is in general not an integer number of local clock cycles, the phase estimate accumulates an error with each reset. Thus, higher-order periodicities of the cycle slip period exist. Figure 5 shows that every third time, the cycle slip is detected one period (of $C_c$) later, compensating for the early detection at the previous two reset events. The higher order can be calculated by taking the remainder, which is $0.7\,T_l$, and summing it up to one period of $C_c$: $3 \times 0.7 \times T_l \approx T_c$. Obviously, there is again a remainder, $0.1\,T_l$, which causes a further higher-order periodicity. Note that these higher orders can only be used to refine the timestamp if the remainder of the cycle slip period does not wrap over. Even for selected frequencies, this is only applicable within a very narrow frequency range. Considering the oscillator tolerance for Ethernet (50 ppm), the method is not feasible. However, even if the higher-order periodicities are neglected, TSA can be used to improve the accuracy.

The relative frequency offset $\beta$ can be calculated by monitoring the number of rising edges of $C_c$ with respect to $C_l$. On average, a cycle slip occurs every $1/\varepsilon$ local clock cycles, which results in a missing rising edge with respect to the local clock. To obtain a continuous value of $\hat{\varepsilon}$, a low-pass filter is used. The bandwidth of the filter must be wide enough to track frequency changes of the oscillator, yet narrow enough to remove the cycle slip frequency. In general, IIR filters with poles close to one are ideally suited for this application. Alternatively, the frequency offset can be calculated from the measured cycle slip period $1/\hat{\varepsilon}$, but this requires a division block, which consumes a significant amount of logic resources. In any case, the final timestamp is calculated as the last value of the rate-controlled timebase plus $\hat{\delta}_{TS}$ times the clock rate.

5.9. Summary.
The selection of the cycle slip rate not only has to consider the accuracy requirements of the application but also the behavior of the physical layer. As mentioned, 10 Base-T, for example, does not provide a continuous clock supply that can be measured by the phase/frequency estimation. Therefore, the factor $\beta$ has to be chosen such that the measurement can be done within the reception/transmission time of a single packet, which is rather stringent. Alternatively, the precision can be enhanced by keeping the carrier active via the transmission of several (arbitrary) packets to allow the phase/frequency estimation to settle. The last packet of such a burst can then be timestamped with relatively high precision. Obviously, the provision of a continuous link implies a much higher potential for accurate timestamping due to the permanent tracking of phase and frequency.

Implementation and Evaluation
One major advantage of the presented phase-estimating method over single-shot techniques is that the measured signal passes through only one digital processing path. Unlike in the latter techniques, only one sampling register is directly connected to $S_{TS}$ or the respective synchronous clock signal. Therefore, linearity problems due to unequal signal propagation delays to registers or between other logic elements, for example, buffers in TDLs, cannot occur. This also makes the method robust against temperature and other effects that directly influence signal propagation within an integrated circuit. Nevertheless, there might be a small (asymmetric) delay between the receive and the transmit path, since two separate instances of the method are required. This delay has to be calibrated once per chip to compensate for internal placement and routing effects.
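Why the one-time calibration matters can be illustrated with the standard two-way PTP exchange. In this sketch, the per-chip constants in `PathCalibration` are hypothetical values obtained during calibration; RX timestamps are assumed to be captured late by `rx_delay_ps` and TX timestamps early by `tx_delay_ps`. The offset and mean path delay formulas are the usual PTP delay request-response equations; the class and helper names are illustrative.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class PathCalibration:
    """Hypothetical per-chip constants in ps, measured once during calibration."""
    rx_delay_ps: float  # wire event -> RX timestamp capture
    tx_delay_ps: float  # TX timestamp capture -> wire event


def correct_rx(raw_ps: float, cal: PathCalibration) -> float:
    # An RX timestamp is captured late, so the internal delay is subtracted.
    return raw_ps - cal.rx_delay_ps


def correct_tx(raw_ps: float, cal: PathCalibration) -> float:
    # A TX timestamp is captured early, so the internal delay is added.
    return raw_ps + cal.tx_delay_ps


def ptp_offset_and_delay(t1: float, t2: float, t3: float, t4: float) -> Tuple[float, float]:
    """Two-way exchange: t1 master TX, t2 slave RX, t3 slave TX, t4 master RX."""
    offset = ((t2 - t1) - (t4 - t3)) / 2
    mean_path_delay = ((t2 - t1) + (t4 - t3)) / 2
    return offset, mean_path_delay


if __name__ == "__main__":
    # Hypothetical slave: true offset 500 ps, wire delay 1000 ps each way.
    cal = PathCalibration(rx_delay_ps=300.0, tx_delay_ps=100.0)
    print("uncorrected:", ptp_offset_and_delay(0.0, 1800.0, 5000.0, 5600.0))
    print("corrected:  ", ptp_offset_and_delay(
        0.0, correct_rx(1800.0, cal), correct_tx(5000.0, cal), 5600.0))
```

In this example an uncompensated RX/TX asymmetry of 200 ps shows up as a 100 ps offset error, i.e. half the asymmetry, because the two-way exchange assumes symmetric paths.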
6.1. Precision. Mathematically, the maximal attainable precision is given by $\beta T_l = T_c(\beta - \beta^2)/n$. Using (1), the enhancement over the period of the communication clock results in a factor of $n/(\beta - \beta^2)$. Figure 6 shows the shape of this function and the possible improvement due to oversampling. Unfortunately, the results for $n > 1$ are only theoretical values.
Because the phase estimation is reset at every cycle slip, cycle slips and the interpolated phase value are correlated. Oversampling increases the number of interpolation steps for $\delta$, yet the number of reachable values within one cycle slip period stays the same. It can be shown that, with an oversampling ratio $n$, only every $n$th value can be reached by the rising edge of $C_c$. Hence, the effective precision is only $T_c(\beta - \beta^2)$, independent of $n$, and scales only with the duration of the cycle slip period.
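The gap between the theoretical and the effective enhancement plotted in Figure 6 can be evaluated directly from the two expressions above. The sketch assumes the local clock period implied by $\beta T_l = T_c(\beta - \beta^2)/n$, i.e. $T_l = T_c(1-\beta)/n$; the function names are illustrative.

```python
def local_period(T_c: float, n: int, beta: float) -> float:
    """Local clock period implied by beta*T_l = T_c*(beta - beta**2)/n (assumption)."""
    return T_c * (1.0 - beta) / n


def theoretical_precision(T_c: float, n: int, beta: float) -> float:
    """beta*T_l = T_c*(beta - beta**2)/n, shrinks with the oversampling ratio n."""
    return beta * local_period(T_c, n, beta)


def effective_precision(T_c: float, beta: float) -> float:
    """T_c*(beta - beta**2): only every nth value is reachable, so n drops out."""
    return T_c * (beta - beta**2)


def theoretical_enhancement(n: int, beta: float) -> float:
    return n / (beta - beta**2)


def effective_enhancement(beta: float) -> float:
    return 1.0 / (beta - beta**2)


if __name__ == "__main__":
    T_c = 40_000.0  # 25 MHz communication clock, in ps
    for beta in (0.001, 0.002, 0.005):
        print(beta, theoretical_precision(T_c, 2, beta), effective_precision(T_c, beta))
```

For $n = 2$ the theoretical figure is exactly half the effective one, which is the factor that oversampling promises but cannot deliver because of the correlation described above.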
Still, if $n$ is set to any value below $1/\beta$, the combined phase-frequency estimation method offers improved precision. Figure 7 depicts an example for a 25 MHz communication clock with three frequency offset factors $\beta$. For simple oversampling with a high-speed clock, the timestamp precision is exactly the period of that clock. If PFE is used, the precision improves to $\beta T_l$ until the frequency (or, respectively, the oversampling factor $n$) is increased substantially. For $n\beta > 1$, the PFE method degrades to the standard oversampling method. Frequencies in the range of multiple GHz cannot be reached with simple circuits, yet the PFE method offers the same precision with design frequencies of only about 50 MHz.
6.2. Physical Limits. One critical parameter for the PFE method, or in general for any form of highly precise timestamping, is the phase noise of the clock. If the phase noise is low with respect to the calculated precision, the considerations made up to this point apply directly. However, measurements on PHYs revealed a phase noise in the range of 100-250 ps RMS, with 124 ps for the setup in [9]. While the phase noise does not alter the frequency estimation owing to its long-term averaging, it causes the phase estimation circuit to detect cycle slips too early or too late, resulting in a timestamp jitter similar to the clock jitter. The clock can be cleaned by means of clock conditioners or the PLL inside the FPGA. The latter is readily available in FPGA-based solutions but itself has a relatively high self-noise (around 200 ps RMS jitter) [13].
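The precision comparison of Figure 7 and the phase-noise limit can be put into numbers. This sketch assumes that the PFE precision is $T_c(\beta - \beta^2)$ until the oversampling period $T_l = T_c(1-\beta)/n$ becomes smaller (which happens exactly when $n\beta > 1$), and that quantization error (uniform, width equal to the precision) and clock jitter are independent and add in quadrature. All function names are illustrative.

```python
import math


def oversampling_precision_ps(f_sample_hz: float) -> float:
    """Simple oversampling: precision equals one sampling-clock period."""
    return 1e12 / f_sample_hz


def pfe_precision_ps(T_c_ps: float, n: int, beta: float) -> float:
    """PFE precision T_c*(beta - beta**2) until n*beta > 1; beyond that the
    oversampling period T_l = T_c*(1 - beta)/n takes over (assumed model)."""
    T_l = T_c_ps * (1.0 - beta) / n
    return min(T_c_ps * (beta - beta**2), T_l)


def equivalent_oversampling_hz(T_c_ps: float, beta: float) -> float:
    """Sampling rate that simple oversampling would need to match PFE."""
    return 1e12 / (T_c_ps * (beta - beta**2))


def total_rms_ps(step_ps: float, clock_jitter_rms_ps: float) -> float:
    """Uniform quantization (step/sqrt(12)) and independent clock jitter in quadrature."""
    return math.sqrt(step_ps**2 / 12.0 + clock_jitter_rms_ps**2)


if __name__ == "__main__":
    T_c = 40_000.0  # 25 MHz communication clock, in ps
    step = pfe_precision_ps(T_c, 2, 0.002)
    print("PFE precision:", step, "ps")
    print("equivalent oversampling:", equivalent_oversampling_hz(T_c, 0.002) / 1e9, "GHz")
    print("with 124 ps RMS PHY jitter:", total_rms_ps(step, 124.0), "ps RMS")
```

With $\beta = 0.002$, simple oversampling would need roughly 12.5 GHz to match an 80 ps PFE step, while a PHY jitter of 124 ps RMS already dominates the combined error, which is the physical limit discussed above.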
As discussed in Section 6, a randomization function is beneficial to remove clock correlation and leakage. Considering that the local clock frequency $1/T_l$ is typically around 50-300 MHz, the randomization provided by the clock jitter is far too small, since an effective randomization would have to cover one full clock cycle with a uniform distribution. For PFE with a high value of $\beta$ (and therefore low precision), however, the randomization caused by the clock jitter can reduce the correlation between the timestamps. Hence, if multiple timestamps are averaged, the jitter may actually improve the precision of a packet's timestamp.
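The effect of jitter as a randomization (dither) source can be demonstrated with a generic quantizer. This is a sketch under simplifying assumptions: a rounding quantizer with step `step`, a constant true value `x`, and Gaussian jitter standing in for clock jitter; it is not a model of the actual PFE circuit. The function name is illustrative.

```python
import random
from statistics import fmean


def mean_of_quantized(x: float, step: float, jitter_rms: float, n: int, seed: int = 0) -> float:
    """Average n quantized readings of the constant x.

    Each reading adds Gaussian jitter before rounding to the grid `step`
    (a simple rounding quantizer, assumed for illustration).
    """
    rng = random.Random(seed)
    return fmean(step * round((x + rng.gauss(0.0, jitter_rms)) / step) for _ in range(n))


if __name__ == "__main__":
    for sigma in (0.0, 0.05, 0.5):  # jitter relative to a quantization step of 1.0
        print(sigma, mean_of_quantized(0.3, 1.0, sigma, 20_000))
```

Without jitter, or with jitter far below one step, every reading is identical and averaging cannot recover the true value 0.3; once the jitter is comparable to the step, the readings decorrelate and their mean converges to the true value.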
As already outlined, timestamping a frame multiple times can increase the precision if the timestamps are not fully correlated. Since every Ethernet packet is at least 64 bytes long [6] and the MII transfers one 4-bit nibble per clock cycle, a timestamper attached to the MII can draw 128 timestamps starting at the beginning of the frame. These timestamps can then be fed into a minimum mean square error (MMSE) estimator that calculates one timestamp for the frame using the least-squares method. This combination of the two methods, PFE and TSA, can guarantee at least the The measured graph shows that it is actually about 120 ps wide. With MMSE averaging, the timestamp precision improves to a width of 40 ps, or a standard deviation of 12.036 ps (Figure 8). The results can be interpreted as follows: for precisions in the low picosecond range, physical effects (e.g., clock jitter, thermal noise) dominate the timestamping. For larger values of $\beta$, for instance, $\beta = 0.0045$ as shown in [12], the calculated precision can be achieved even with the FPGA's PLLs.
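Combining the 128 per-nibble timestamps of a frame into one frame timestamp can be sketched as an ordinary least-squares line fit $t_k \approx a + b\,k$, where $a$ estimates the timestamp of the first nibble and $b$ the nibble period. This is a minimal linear least-squares sketch, not the authors' exact estimator; the demo values (80 ps step, 124 ps RMS jitter, 25 MHz MII clock) and all names are hypothetical.

```python
import random
from typing import Sequence, Tuple


def fit_line(ts: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least-squares fit ts[k] ~ a + b*k for k = 0..N-1; returns (a, b)."""
    n = len(ts)
    if n < 2:
        raise ValueError("need at least two timestamps")
    k_mean = (n - 1) / 2.0
    t_mean = sum(ts) / n
    sxx = sum((k - k_mean) ** 2 for k in range(n))
    sxy = sum((k - k_mean) * (t - t_mean) for k, t in enumerate(ts))
    b = sxy / sxx             # estimated nibble period
    a = t_mean - b * k_mean   # estimated timestamp of nibble 0 (frame start)
    return a, b


def frame_timestamp(ts: Sequence[float]) -> float:
    """Combine the per-nibble timestamps of one frame into a single frame timestamp."""
    return fit_line(ts)[0]


if __name__ == "__main__":
    # Hypothetical demo: 128 nibble timestamps (64-byte frame, 25 MHz MII clock),
    # quantized to an 80 ps grid after adding 124 ps RMS clock jitter.
    rng = random.Random(1)
    t0, T_c, step = 12_345.6, 40_000.0, 80.0
    raw = [step * round((t0 + k * T_c + rng.gauss(0.0, 124.0)) / step) for k in range(128)]
    print("single-shot error:", raw[0] - t0, "ps")
    print("least-squares error:", frame_timestamp(raw) - t0, "ps")
```

The fitted line is most precise at the mean sample index, so referencing the frame timestamp to the middle of the frame (and correcting by a known constant) would reduce its variance further than using the intercept at nibble 0.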

Conclusion
A comparison of the methods analyzed in this paper is given in Table 1. Single shot is the method of choice if the event is not aligned to a clock. For events that occur synchronously to a clock, as is the case for Ethernet with MII, phase estimation methods offer similar or even superior performance compared with, for example, a TDL, while requiring only low-complexity, slow logic. The advantage of the purely digital design of the PFE outweighs its restriction to a low frequency range, in particular for the intended timestamping application.
In conclusion, highly accurate clock synchronization is of utmost importance for test and measurement applications in which a per se non-real-time network is used to sample and collect data, as in the LXI approach. LXI uses PTP to achieve this functionality by distributing absolute time information to the attached network nodes. The messages used have to be supplied with the actual ingress or egress time at the synchronizing nodes. Providing a precise value for that event was shown to be, once again, a matter of timestamping.
It was further shown that there are several different possibilities for timestamping and that various jitter sources exist within an Ethernet IP/UDP node. As a rule of thumb, the closer the timestamp is drawn to the physical layer, the lower the jitter and therefore its influence on the precision of a timestamp.
As a further conclusion of this paper, timestamping is a crucial issue for highly accurate clock synchronization: methods such as HSCs, PSC, TDLs, or phase-estimating methods such as DPE are limited in their reachable accuracy. It was shown, however, that this bound lies far below the accuracy of previously published results. Finally, timestamping using an MMSE estimator together with PFE can reduce the timestamp error to a uniform distribution with a standard deviation of only 12 ps, while the network load remains within a reasonable range of about five packets per second.

Figure 1: Synchronization node overview and possible internal sources of jitter.

Figure 2: Basic method for clock synchronization in packet-based networks and the necessity of event detection (timestamping).

Figure 3: Comparison of various Ethernet physical layer transmission standards.

Figure 6: Dependency of the theoretical and the effective precision enhancement on the relative frequency offset.

Figure 7: Timestamp precision with a 25 MHz communication clock for the PFE approach and simple oversampling.

Figure 8: Timestamp error for a combined phase/frequency estimation (β = 0.002; n = 2) with a linear MMSE estimator using 128 samples. The histogram shows a total of 50k timestamps in 75 bins, giving a standard deviation of 12.036 ps.

Table 1: Comparison of the different approaches.