Approximations of Time Series

A method is proposed to approximate the main features or patterns including interventions that may occur in a time series. Collision data from the Ontario Ministry of Transportation illustrate the approach using monthly collision counts from police reports over a 10-year period from 1990 to 1999. The domain of the time series is partitioned into nonoverlapping subdomains. The major condition on the approximation requires that the series and the approximation have the same average value over each subdomain. To obtain a smooth approximation, based on the second difference of the series, a few iterations are necessary since an iteration over one subdomain is affected by the previous iteration over the adjacent subdomains.


Introduction
A graduated licensing system (GLS) gradually exposes young novice drivers to the driving environment, allowing them to obtain initial experience with driving under supervision, followed by more independent driving under higher-risk circumstances [1]. This model was widely incorporated into driver licensing programs across the US and Canada, as well as other countries, over the 1990s. Most of these programs have incorporated similar restrictions into their initial phases [2]. These include driving with supervision, restricted driving at night, limited teenage passengers, and zero blood alcohol level while driving. This method has had limited long-term evaluation in North America, but long-term follow-up in New Zealand suggested a smaller but persistent reduction in young driver collisions as a result of its implementation [3]. The collision data for Ontario drivers around the time of the introduction of the GLS [2, p. 126] illustrate the variety of approximations of time series that are possible.
There are many practical techniques for smoothing a time series [4]. The smoothed value at a point is a weighted average of the elements in the series that lie within a local window about the point. One way to generate the weights in a moving average involves a local polynomial approximation of order 3 or 5, where the window includes 5 or 7 points, and so on. The weights are then defined by regression. Another approach defines the weights in terms of an appropriate kernel, and this method applies more generally to bivariate data [5]. The advantages of local estimates compared with global estimates are discussed in [6].
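As an illustration of the weighted moving average described above, the following Python sketch applies the standard 5-point weights obtained from a local least-squares cubic fit, (-3, 12, 17, 12, -3)/35. The function name `smooth_cubic5` and the choice to leave the two points at each end unchanged are assumptions made for this example only; they are not taken from the references.

```python
def smooth_cubic5(z):
    """Smooth z with a 5-point moving average whose weights come from a
    local least-squares cubic fit. Only interior points are smoothed."""
    weights = [-3, 12, 17, 12, -3]  # to be divided by 35
    out = [float(v) for v in z]     # assumption: end points are left unchanged
    for i in range(2, len(z) - 2):
        window = z[i - 2:i + 3]
        out[i] = sum(w * v for w, v in zip(weights, window)) / 35.0
    return out


# A cubic polynomial is reproduced exactly by these weights.
cubic = [float(t ** 3 - 2 * t + 1) for t in range(8)]
smoothed = smooth_cubic5(cubic)
```

Because the weights reproduce any cubic exactly, the filter removes only the part of the series that a local cubic cannot follow, which is also why it may flatten sharp features that are narrower than the window.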
The first step in the computational process involves the partition of the domain of the time series into subdomains. The subdomains are then labelled as odd-numbered or even-numbered. Iterations are then performed over the odd-numbered subdomains, followed by the iterations over the even-numbered subdomains. This process is numerically efficient since the iterations over one set of subdomains update the boundary conditions for the iterations over the remaining subdomains [7]. To determine a smooth and accurate approximation, this set of iterations is repeated a few times.
This paper is organized as follows. Section 2 describes the form of the approximations, the partition of the time series, and the minimization over the subdomains of the partition. In Section 3, the equations for the approximation are derived, and some computational details are given in Section 4. Under certain assumptions, approximations of time series with variable spacing are possible. The time series and their approximations are presented in Section 5, and an example outlines the approach for step level changes, missing data, and outliers in Section 6. In Section 7, the approximation over a subdomain is determined by a fourth-order polynomial and a straight line. Finally, guidelines for the application of the proposed approach to a time series are outlined in the Concluding Remarks.

The General Model
The general model for the approximation of the time series $\{Z(t) \mid t = 1, \dots, N\}$ is given by equation (2.1), where $Q_0(t) = T(t) + M(t)$ and $R_n(t) = A_n(t) W_n(t)$. In these equations, $O_k(t)$ is the term for outliers (if present); $Q_k(t)$ is the $k$th approximation; $T(t)$ is a trend and includes the level changes; $M(t)$ is a nonperiodic oscillatory function; $R_n(t)$ is the remainder; and $A_n(t)$ is a measure of the variation of the remainder. A restriction on the approximations requires that the root-mean-square (RMS) value of the remainder is a decreasing sequence with increasing $n$. The form of (2.1) is similar to the asymptotic expansion of a function that contains a small positive parameter [8, pp. 1-4].
The partition of the domain of the time series $\{Z(t)\}$, henceforth denoted by $Z(t)$, is chosen in order to accurately approximate possible patterns in the time series. Let $P = \{E_k \mid k = 1, \dots, M\}$ be a partition of $[1, N]$ where the nonoverlapping subdomains are $E_k = \{t_k - n_k + 1, \dots, t_k\}$, $n_k \ge 1$, and $n_k$ is the number of elements in the $k$th subdomain $E_k$. The overlapping subdomains, over which the iterations are computed, are defined as $E_k^o = \{t_k - n_k, \dots, t_k + 1\}$.

An approximation $Q(t)$ of a time series $Z(t)$, where $Z(t) = Q(t) + R(t)$, is determined by a few iterations $I_n(t)$ starting with $I_0(t) = Z(t)$. Once the desired accuracy is obtained, the last iteration is defined as $Q(t)$. All iterations and the approximation $Q(t)$, along with the remainder $R(t)$, satisfy the following properties.

(1) Over each subdomain $E_k$, the iteration has the same sum as the series,

$$\sum_{t \in E_k} I_n(t) = \sum_{t \in E_k} Z(t),$$

and, hence, the average value of the remainder $R(t)$ is zero over this subdomain. If $Z(t)$ is a measure of "energy" in the process, then $Q(t)$ conserves energy over each subdomain in the partition. In the particular case $n_k = 1$, $E_k = \{t_k\}$ and $t = t_k$ is a fixed point for the approximation so that $Q(t) := Z(t)$ at $t = t_k$.

(2) The measure of smoothness of the iterations at time $t$ for the $n$th iteration is defined by the second difference

$$\delta_n(t) = I_n(t+1) - 2 I_n(t) + I_n(t-1),$$

and the measure of smoothness over the subdomain $E_k$ is

$$\Delta_n^k = \sum_{t \in E_k} \delta_n^2(t). \tag{2.3}$$

(3) Provided that the number of elements in $E_k$ is greater than 1, the condition that $\Delta_n^k$ has a minimum value is imposed. $I_n(t)$ is required for $t = t_k - n_k$ and $t = t_k + 1$ in $E_k^o$ to determine $\delta_n(t)$ at the endpoints of $E_k$. These two values of $I_n(t)$ are the boundary conditions for the minimization on $E_k$.

(4) In some cases there are two or more approximations over one or more subdomains, and a criterion is required to choose the best approximation.
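The smoothness measure in item 2 can be illustrated with a short Python sketch. It assumes that $\delta_n(t)$ is the centred second difference and that the measure over a subdomain is the sum of its squares; the function names and the 0-based list indexing are choices made for this example.

```python
def second_difference(values, t):
    """Centred second difference at 0-based index t."""
    return values[t + 1] - 2.0 * values[t] + values[t - 1]


def smoothness(values, start, stop):
    """Sum of squared second differences for indices start..stop-1.
    The neighbours at start-1 and stop act as the boundary values."""
    return sum(second_difference(values, t) ** 2 for t in range(start, stop))


line = [0, 1, 2, 3, 4]        # a straight line is perfectly smooth
parabola = [0, 1, 4, 9, 16]   # constant second difference of 2
measure_line = smoothness(line, 1, 4)
measure_parabola = smoothness(parabola, 1, 4)
```

A straight line gives a measure of zero, so the minimization in item 3 pulls each subdomain toward locally linear behaviour while the sum constraint in item 1 keeps the subdomain average fixed.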
For the example in Section 5, the simplest case occurs when $P_k$ is a refinement of $P_{k-1}$; that is, a subset of the subdomains of $P_k$ covers a subdomain $E$ in $P_{k-1}$. Then one approximation over $E$ is $Q_k(t) = 0$ and the other is defined by $P_k$. Let $S_{k+1}$ denote the RMS value of the remainder $R_{k+1}(t)$ over $P_k$, and let $S_k$ denote the RMS value of $R_k(t)$ over $E$. The approximation defined by $P_k$ is a significant improvement if the ratio $S_{k+1}/S_k$ does not exceed a chosen threshold. As shown in Section 7, an upper bound for this threshold takes on values between 0.75 and 0.9. For the example in Section 6 involving an outlier, there are two approximations for $Q_0(t)$.

In matrix form, the second differences over $E_k$ are $X_n^k = A\,Y_n^k + B_n^k$, where $A$ is an $n_k \times n_k$ symmetric tridiagonal matrix with $-2$ on the main diagonal and $1$ on the adjacent diagonals. The equations for the iterations are obtained by replacing $B_n^k$ with $B_{n-1}^k$. Since the sum of $I_n(t)$ for $t \in E_k$ is a constant for all $n$, the sum of the elements of $Y_n^k$ is fixed [...] (Section 7). The solution for $X_n^k$ is chosen such that $(X_n^k)^T X_n^k$ has a minimum [...]. The iteration is $I_n(t) = Y_n^k(i)$, where $i = 1, \dots, n_k$ and $t = t_k - n_k + 1, \dots, t_k$ in $E_k$, respectively. For a given $n_k$, $E$ and then $V_3$ are uniquely determined, and $V_3$ is computed in advance for all of the subdomains that occur in a time series.
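The minimization over a subdomain and the alternating sweeps over odd-numbered and even-numbered subdomains can be sketched in Python as follows. This is an illustration under stated assumptions rather than the paper's implementation: the matrix $A$ is taken to have $-2$ on the diagonal and $1$ on the adjacent diagonals, the constrained minimum is found with a Lagrange multiplier, the external boundary condition is zero slope (the value outside the series equals the current end value), and a fixed number of sweeps replaces a convergence test. All function names are hypothetical.

```python
def solve_second_difference(rhs):
    """Solve A y = rhs by the Thomas algorithm, where A is tridiagonal
    with -2 on the diagonal and 1 on the adjacent diagonals."""
    n = len(rhs)
    c = [0.0] * n  # modified super-diagonal
    d = [0.0] * n  # modified right-hand side
    c[0] = 1.0 / -2.0
    d[0] = rhs[0] / -2.0
    for i in range(1, n):
        denom = -2.0 - c[i - 1]
        c[i] = 1.0 / denom
        d[i] = (rhs[i] - d[i - 1]) / denom
    y = [0.0] * n
    y[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        y[i] = d[i] - c[i] * y[i + 1]
    return y


def minimize_subdomain(left, right, total, n):
    """Values y_1..y_n that minimize the sum of squared second differences,
    given boundary values left and right and the constraint sum(y) == total."""
    # v solves A v = (1, ..., 1); its closed form is -i (n + 1 - i) / 2.
    v = [-(i * (n + 1 - i)) / 2.0 for i in range(1, n + 1)]
    b = [0.0] * n
    b[0] += left
    b[-1] += right
    # Stationarity gives A y + b = mu v; the sum constraint then fixes mu.
    mu = (total + sum(vi * bi for vi, bi in zip(v, b))) / sum(vi * vi for vi in v)
    return solve_second_difference([mu * vi - bi for vi, bi in zip(v, b)])


def approximate(z, sizes, sweeps=6):
    """Alternate over odd-numbered and even-numbered subdomains."""
    cur = [float(x) for x in z]
    starts = [sum(sizes[:k]) for k in range(len(sizes))]
    totals = [sum(z[s:s + n]) for s, n in zip(starts, sizes)]
    for _ in range(sweeps):
        for parity in (0, 1):  # 0: 1st, 3rd, ... subdomains; 1: 2nd, 4th, ...
            for k in range(parity, len(sizes), 2):
                s, n = starts[k], sizes[k]
                # Assumption: zero-slope external boundary condition.
                left = cur[s - 1] if s > 0 else cur[0]
                right = cur[s + n] if s + n < len(cur) else cur[-1]
                cur[s:s + n] = minimize_subdomain(left, right, totals[k], n)
    return cur


z = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 9]
q = approximate(z, [4, 4, 4])
```

In this sketch the second differences of the minimizer are proportional to a quadratic in the position within the subdomain, so the minimizer itself is a quartic in that position, which appears consistent with the quartic polynomial discussed in Section 7. Each subdomain update depends on the series only through the subdomain sum and the two neighbouring boundary values.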

External Boundary Conditions
If $t = 1$ is not a fixed point, then the subdomain $E_1^o$ requires a boundary condition. Three external boundary conditions can be imposed at $t = 0$ to reflect the possible behavior of the time series near the endpoint. [...] (3) An iterative process is used to obtain $Z(0)$ such that the RMS value of the remainder over the adjacent subdomain(s) has a minimum value.
Similarly, if $t = N$ is not a fixed point, then the external boundary conditions are obtained by replacing $I_{n-1}$ [...]

Computational Aspects
For a time series $Z(t)$ and a partition $P$, there is a related averaged series defined by $\bar{Z}(t) = a_k$ for $t \in E_k$, where $a_k$ is the average value of $Z(t)$ for $t \in E_k$. The following property holds for all of the approximations in this paper: the approximations for a time series $Z(t)$ and the averaged time series $\bar{Z}(t)$ are the same to the desired accuracy, provided that the same partition and the same external boundary conditions (if any) are applied.
Consequently, any time series with variable spacing can be approximated provided that the estimates of the average values of the time series over the subdomains are adequate.
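The averaged series can be built directly from the subdomain sizes, as in the following minimal Python sketch. The function name `averaged_series` and the representation of the partition as a flat list of sizes are assumptions made for this example.

```python
def averaged_series(z, sizes):
    """Replace each value by the average over its subdomain.
    sizes lists the number of elements n_k in each subdomain, in order."""
    out = []
    start = 0
    for n in sizes:
        block = z[start:start + n]
        out.extend([sum(block) / n] * n)
        start += n
    return out


z = [1, 3, 2, 4, 6]
z_bar = averaged_series(z, [2, 3])
```

Since only the subdomain averages enter the averaged series, an irregularly sampled series can be handled in the same way once an adequate estimate of each average is available.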
The approximation for the averaged series is employed especially for larger subdomains ($n_k \approx 12$ or more). The efficiency of the computations is increased if the boundary conditions in the first set of iterations (even and odd) are the average of the four values of the series that straddle the subdomains $E_{k-1}$ and $E_k$. The averaged series was used in all computations, although the approximation obtained from $Z(t)$ may be more efficient in special cases.
It is convenient to introduce another notation to represent a partition: $P = \{n_1, n_2, \dots; \dots; \dots, n_M\}$, where the numbers of elements in the subdomains of the first block are $\{n_1, n_2, \dots\}$ and those of the last block are $\{\dots, n_M\}$. These blocks are a convenient way to separate the seasons or a set of months. Also, the approximation $Q_k(t)$ obtained by iterating the time series $n$ times, using the partition $P_k$, is denoted by $Q_k(t) = P_k^n\{R_k(t)\}$, where $R_0(t) := Z(t)$. The number of iterations $n$ is determined from the difference $D_n(t)$ by imposing the condition that $\max |D_n(t)| < L$, where $L = 1$ in Figures 1 and 2, $L = 0.04$ in Figures 3 and 4, and $L = 0.01$ in Figure 5. All calculations in this paper were performed using Maple software [9].
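The block notation for a partition maps directly onto index ranges. The Python sketch below parses a partition written in that notation and lists the first and last time index of each subdomain; the function names `parse_partition` and `subdomain_ranges` are hypothetical helpers introduced for this example.

```python
def parse_partition(text):
    """Turn '{12, 12; 6, 6}' into [[12, 12], [6, 6]] (one list per block)."""
    inner = text.strip().strip("{}")
    return [[int(n) for n in block.split(",")] for block in inner.split(";")]


def subdomain_ranges(blocks):
    """1-based inclusive (first, last) time indices of each subdomain."""
    ranges = []
    t = 0
    for block in blocks:
        for n in block:
            ranges.append((t + 1, t + n))
            t += n
    return ranges


# The trend partition used for the young-driver series.
blocks = parse_partition("{12, 12, 12, 12; 6, 6, 6, 6; 12, 12, 12, 12}")
ranges = subdomain_ranges(blocks)
```

For this partition the subdomains cover months 1 to 120, and the first 6-element subdomain spans months 49 to 54, which contains the intervention at month 52.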

Applications
Two time series, provided by the Ontario Ministry of Transportation [2, p. 126], illustrate the approximations. The graph of the time series for the monthly accidents for young novice drivers is given in Figure 1, where the main feature is the intervention that occurs at 52 months owing to the introduction of the GLS on April 1, 1994. The corresponding graph for all drivers is shown in Figure 3, where the sharp drop from the maximum in December/January to April, except for the last 2 years, is a strong feature of the series.

In the case of subdomains with 6 elements, the RMS value of the remainder is 30. A more accurate approximation is obtained if the subdomains have two elements; however, this approximation has an angular appearance since it more closely approximates the time series.

In Figure 2, the approximation $Q_0(t)$ of Figure 1 is expressed as a sum of a trend and an oscillatory series. The partition for the trend $T(t) = P_T^6\{Q_0(t)\}$ is $P_T = \{12, 12, 12, 12; 6, 6, 6, 6; 12, 12, 12, 12\}$. The external boundary condition implies that the tangent is horizontal at the endpoints of the series. The trend in this example is defined as a seasonal approximation of the time series where the subdomains contain 6 elements over the domain of the intervention. The remainder is the oscillatory series $M(t) = Q_0(t) - T(t)$.

In Figure 3, the points for January or December, plus one November, and April are fixed points for the approximation $Q_0(t)$. The partition is $P_0 = \{1, 2, 1, 7, 1; 3, 1, 7, 1; 3, 1, 7, 1; 3, 1, 8; 1, 2, 1, 7, 1; 3, 1, 6, 2; 1, 2, 1, 8; 1, 2, 1, 6, 1; 2, 2, 1, 8; 1, 2, 1, 7, 1\}$. The second approximation $Q_1(t)$ captures the increase in the number of accidents that occur in the summer months by approximating the remainder $R_1(t)$. The partition $P_1$ is a refinement of $P_0$ where 7 is replaced with 3, 4; 6 with 3, 3; and 8 with 3, 5. Consequently, the approximations $Q_0(t)$ and $Q_0(t) + Q_1(t)$ have the same average value over the subdomains of $P_0$.
The subdomains that are the same in the two partitions $P_0$ and $P_1$ are indicated by the intervals over which the approximation is zero in Figure 3. For the remaining intervals, the ratio of the RMS value of the remainder $R_2(t)$ in Figure 4 to the RMS value of $R_1(t)$ is equal to 0.49, so that this second approximation is significant. Furthermore, each of the ten segments of $Q_1(t)$, excluding the segments in which the approximation is identically equal to zero, has a

Level Changes, Missing Data, and Outliers
For a step level change between $t = \tau$ and $t = \tau + 1$, a single approximation may not be adequate for the time series in the subdomains on both sides of the step. For $t > \tau$, an external boundary condition is applied at $t = \tau$ such that the remainder of the approximation has a minimum RMS value. The same approach is applied to the series for $t < \tau + 1$. These ideas are illustrated in Figure 5, where the partition for the approximation $Q(t) = P^6\{T(t)\}$ over $[1, 120]$ is $P = \{1, 11; 12; \dots; 12; 11, 1\}$. The approximation exhibits a phenomenon that is similar to a Fourier series near a discontinuity in that the approximation overshoots on the right and undershoots to the left of the jump. The maximum and minimum of $Q(t)$ are 1.176 and $-0.173$. Moving away from the jump in either direction, the oscillations of $Q(t) - T(t)$ occur with rapidly decreasing amplitude. The details in item 4 of Section 2 are applied over the subdomains adjacent to $t = 60$ to choose between the two approximations of the trend. Interventions and level shifts, from an autoregressive moving average point of view, are presented in [10] and [11].
A simple example indicates the approach for a series that has a missing value or a possible outlier at $t = 6$. The series is $\{Z(t)\} = \{0.6, -0.3, -0.5, -0.2, 0.4, Z(6), 1, 0.8, 0.9, 1.1, 1.0, 1.2\}$, where $Z(6)$ is not defined in the case of a missing value. For both cases, the approximation $Q_0(t)$ is determined for the partition $P = \{1; 4; 1; 5; 1\}$, where the value of the series is $X$ at $t = 6$, such that the RMS value of the remainder is a minimum. A good initial estimate for $X$ is the average value of the time series in a window about $t = 6$ [5]. Then an iterative process is started to obtain $X \approx 1.25$, as shown in Figure 6, and the RMS value of the remainder is 0.104. The smoothest approximation over $P$ occurs for $X \approx 0.12$, where $\sum_{t=2}^{11} \delta^2(t)$ has a minimum value. For the case of a possible outlier, $O_0(t) = 0$ for $t \ne 6$ in (2.1) and $O_0(6) = Z(6) - 1.25$, provided that the ratio of the RMS value of the approximation with $X = 1.25$ to the RMS value of the approximation under the assumption that $Z(6)$ is not an outlier satisfies the condition in item 4 of Section 2.
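The initial estimate for the missing value can be sketched in Python as a plain window average. The window half-width of 2 and the function name `window_average` are assumptions made for this example; the text does not specify the window that was used.

```python
def window_average(z, t, half_width=2):
    """Average of the defined values within half_width of the 1-based
    index t, excluding t itself. Missing values are given as None."""
    lo = max(1, t - half_width)
    hi = min(len(z), t + half_width)
    vals = [z[i - 1] for i in range(lo, hi + 1)
            if i != t and z[i - 1] is not None]
    return sum(vals) / len(vals)


# The example series with the value at t = 6 treated as missing.
series = [0.6, -0.3, -0.5, -0.2, 0.4, None, 1.0, 0.8, 0.9, 1.1, 1.0, 1.2]
x0 = window_average(series, 6)
```

With this window the starting value is the mean of the four neighbouring values at t = 4, 5, 7, and 8, which is 0.5; it serves only as the starting point for the iterative process that minimizes the RMS value of the remainder.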

Approximations of Random Samples
The point of this exercise is to determine the threshold in item 4 of Section 2 such that the only reasonable approximation for a series of random samples is $Q_0(t)$ equal to the mean of the series. A total of 12,000 random samples from the normal distribution with a mean of 0 and a standard deviation of 1 were generated using Maple to form 100 time series with 120 elements in each series. For each series, five approximations were determined where the subdomains of the uniform partition contained 3, 4, 6, 12, and 24 elements. The external boundary condition for the approximation is the condition of zero slope of the tangent. For each series, the ratio of the RMS value of the remainder for the approximation to the RMS value of the series was calculated, and the results are given in Table 1. The approximations corresponding to 24 and 12 elements are smooth and appear to reflect an underlying pattern in the series, whereas, for the cases 3 and 4, the approximations are contorted. An upper bound for the threshold is less than the minimum values in the range.

Quartic Polynomial
The terms in the equation 3.5 for the approximation over

Concluding Remarks
The major input for the approximation of a time series involves the partition of the domain. Initially a uniform partition is chosen and, if seasonal behavior is present in the series, a subset of the partition covers the domain for the seasons. In general, as the length of the subintervals decreases, the approximation is less smooth and the accuracy of the approximation increases. The best approximation occurs at the point at which the approximation is acceptably smooth. The subintervals can be enlarged to determine a much smoother approximation that can be labelled as a trend while still respecting the seasonal aspects of the series; however, if an intervention is present, then some adjustment of the partition may be necessary in the region of the intervention. For time series with a well-defined local maximum or minimum, the approximation can be assigned the same value as the series by taking the subdomain to be a single point of the domain. For series with jumps and other complexities, examples are provided to suggest how to proceed in these cases. An approach in the literature, as indicated in the introduction, defines the approximation at a point as a weighted average of the values of the time series in a window about the point. This approach may smooth out interesting features in the time series and, if applied over smaller intervals, the approximation will not be smooth. Since the proposed model is not based on regression, a comparison of the two approaches has not been considered.