Fuzzy Prediction for Traffic Flow Based on Delta Test

This paper presents a novel approach to one-step-forward prediction of traffic flow based on fuzzy reasoning. The successful construction of a competent fuzzy inference system of Sugeno type largely relies on proper choice of input dimension and accurate estimation of structure parameters and rules. The first issue is addressed with a proposed method, based on δ-test, which can simultaneously determine input dimension and reduce noise level. In response to the second issue, two clustering techniques, based on nearest-neighbor clustering and Gaussian mixture models, are successively employed to determine the antecedent parameters and rules, and the estimation for the consequent parameters is achieved by the least square estimation technique. A number of experiments have been performed on the one-week data of traffic flow to evaluate the proposed approach in terms of denosing, prediction performances, overfitting, and so forth. The experimental results have demonstrated that the proposed prediction approach is effective in removing noise and constructing a competent and compact fuzzy inference system without significant overfitting.


Introduction
Since intelligent transportation systems (ITS) emerged, research efforts have been continuously devoted to traffic flow prediction, as live and accurate prediction on traffic flow is a premise for efficient traffic management and control.While a wide spectrum of prediction techniques, such as Kalman filter [1] and its extension [2], support vector machine [3], Bayesian networks [4], and hybrid approach [5][6][7][8], has been reported in literature, the methodology behind these techniques is based either on influencing factors or on historical data of traffic flow.The influencing factors (e.g., weather) are used as inputs for the underlying traffic system which outputs traffic flows to reflect traffic states.The correlation between the inputs and outputs, established using either analytical model or data-derived approach, is used as the primary reasoning for future traffic flow.On the other hand, the methodology behind the historical data-based approaches is different from those based on influencing factors and the fundamental idea is that the future depends on its past.For a sequence of traffic flow observations with equal sampling interval, called traffic flow time series, the dependent relation between the traffic flow to be predicted at  and its past  −1 ,  −2 , . . .,  − can be formulated as follows: where the term   , called residual in this paper, is the part that cannot be accounted by the model , due to either a lack of functional determination or real noise.Therefore, the term residual in this paper is defined to include modeling error and real noise.The prediction performance by the methods based on influencing factors is considerably dependent on the quality of data collected for different factors which may not be guaranteed always.Therefore, in our ongoing work, the methodology based on historical data is adopted.Also, as only one type of data (i.e., traffic flow) is required, there is no need to manipulate units for different types of data.This paper reports our work on the prediction approach for one step ahead (i.e., short-term).
To model the dependent behavior, a number of techniques have been proposed, ranging from Kalman filter [1,2], artificial neural network [9], and nonparametric regression Mathematical Problems in Engineering [10] to hybrid approach [11].Our research has been focused on the modeling approach based on fuzzy inference system (FIS), in that it is tolerant to noise, resistant to uncertainty, and easy to incorporate expert and field knowledge [12].However, research activity in this area is relatively silent, reflected by a few notable contributions reported in literature over the last decade.Zhang and Ye proposed a prediction methodology by using fuzzy logic system to fuse the outputs of two methods out of autoregressive integrated moving average, backpropagation neural networks, exponential smoothing method, and Kalman filter, resulting in four different combinations [6].Similar idea has been adopted in [7], but the two methods mixed by a fuzzy logical model are history mean and artificial neural network models.Paper [8] describes a hybrid methodology that two fuzzy rulebased systems are constructed, one providing the next flow estimation based on the current flow only and the other predicting the one-step-ahead flow based on the current flow at the current location and the upstream location.A genetic algorithm is used to tune the parameters for the fuzzy rule-based systems by minimizing the mean absolute relative error between the estimated and the observed values.Paper [13] presents a prediction approach for short-term traffic flow prediction primarily based on a Sugeno fuzzy system (also known as TSK fuzzy system).The initial structure is formed by partitioning the input vector space by the mean shift clustering algorithm and subsequently optimized to eliminate redundant structure by the mean firing technique, and finally the other parameters are determined by particle swarm optimization with the aim of minimizing root mean squared error (RMSE).
A multiple-inputs (or single input vector) and singleoutput (MISO) FIS captures the conditional dependence between inputs and output, (  |  −1 ,  −2 , . . .,  − ), using the following rules with fuzzified inputs: Whether the consequent part in fuzzy rules (as described above) is based on fuzzy sets or linear functions of input vector yields two common types of FISs, namely, Mamdani and Sugeno FISs.Although the prediction method proposed in this paper adopts the Sugeno approach, it is interesting to investigate the predictability of Mamdani FIS as well.
To create a competent Sugeno FIS, careful decisions should be made on the following steps: (1) the input dimension: although all traffic flow recordings in the past should in principle be taken into account when predicting the future flow, it is not practical to consider the whole set of historical data as the computational complexity grows exponentially with the number of input dimension and curse of dimensionality [14]; (2) parameters and rules: a large number of membership functions can reduce model error, but the model suffers from the computational complexity, again.To compromise the complexity and accuracy, a proper choice should be made on the number of membership functions and the parameters associated should be determined in an optimal or near-optimal fashion.As a Sugeno FIS is adopted as the basis for the prediction, the set of parameters for the consequent part should be optimized to improve modeling performance.
This paper proposes a novel approach to one-step-ahead prediction (e.g., but multistep-ahead prediction can be realized simply by recursive prediction) for traffic flow based on a Sugeno FIS, with a particular emphasis on the abovementioned issues.Firstly, a nonparametric residual variance estimation method, called -test [15], is used to measure the noise level reduced using a wavelet-based denoising technique and the input dimension is determined when the noise reaches a reasonable level.To address the second issue, a clustering is performed using the nearest-neighbor clustering (NNC) method [16] to obtain the number of membership functions and a Gaussian mixture model (GMM) [17] is subsequently applied to determine the parameters associated with membership functions.Finally, the consequent parameters are obtained using the least square estimation (LSE) technique [18].

Prediction Approach
2.1.Approach Outline.The prediction approach presented in this paper uses a first-order Sugeno FIS as a basis of the predictor.Such choice has been made due to the fact that Sugeno FIS has been proved to be a universal approximator [19][20][21].Additionally, FIS has been reported to be tolerant to noise and resistant to uncertainty [12].To achieve a concrete implementation for the FIS, we follow the commonly adopted model structure: T-norm for conjunction operations, Gaussian membership functions, linear function for consequent part, and product inference of rules.
Figure 1 schematically shows the approach framework used to build the FIS based on the historical observations of traffic flow.Note that it is assumed that an appropriate preprocess has been taken to exclude any outliers and make up all missing recordings for the traffic flow time series.The algorithm firstly generates a set of input vectors by incrementing the dimension which has been bounded to the maximum dimension specified by users.For the set of input vectors generated, the -test is used to estimate the residual variance for each input vector without explicitly building a model.Based on the estimated residual variances, the algorithm subsequently evaluates whether any input vectors satisfy the requirement set by users.If the evaluation indicates none of them can meet the requirement, which implies the noise level still remains high, the time series is processed by a wavelet-based technique to reduce the noise level.Such process is iteratively performed until the residual error can be reduced to meet the requirement.During the second step, two clustering algorithms (NNC and GMM) are successively employed to determine the number of membership functions and the associated parameters, respectively.Also, the clustering results obtained from the GMM algorithm help determine the rules for the FIS to be constructed.In our approach, the LSE technique is used to determine a set of parameters for consequent part that minimizes the mean square error (MSE) between the model output and training samples.Those steps and how they are specifically implemented are discussed in detail in the following subsections.

Input Selection and Noise Reduction.
A proper selection of input dimension can alleviate the curse of dimensionality problem, while all the mostly related historical states can still be utilized for the prediction.Our approach employs the -test to determine the input dimension from a set of candidate input vectors that have been generated using recent recordings and the maximum number of input dimensions is predefined.
The -test was first introduced in [22], improved in [23], simplified in [24], and used in [15,25].The idea behind all versions of the -test is simple and intuitive.As expressed in (1), if the underlying relationship of the system can be completely determined by a smooth function ,  should be real noise which is assumed to be independent to input vector x (i.e., ( −1 ,  −2 , . . .,  − )) and have zero mean and bounded variance.For a given input vector x and its nearest neighbor x  , the continuity of regression function implies the outputs (x) and (x  ) should be close enough in the output space (note that it is not necessary to be the closest pair for the outputs).In other words, if a deterministic (time-invariant) system is fed with the same values at different time instants, the same values should be outputted from the system, apart from the noise.Alternatively, the corresponding outputs are different due to the presence of noise.
This work uses the simplified version of the -test, which first searches for the nearest neighbor (its index is denoted as ) from  points for a given point in the input space according to a distance metric (Euclidean distance is used in this context): Then, the variance of , var[], can be estimated by The result obtained from the -test provides a criterion for input selection.The algorithm starts by including only the current observation as the initial dimension of the candidate input vector and the residual variance for such origination is calculated.Then, the dimension increments by considering the next recent recording to form a new candidate input vector which is subsequently examined by the -test.Such iterative process terminates when the input dimension exceeds the maximum number predefined.
In contrast to the model-based approaches, it is not necessary to build a model for input dimension determination by using the -test to select a competent input vector within the maximum dimension which may require domain knowledge or assistance of a model.However, the result from the test cannot implicitly tell us to what extent an optimal model can fit a noisy time series, as the estimated residual variance encompasses modeling error and real noise.
The data-driven approach is often used to construct a FIS based on a set of training data, whenever a prior knowledge about the system is unavailable or incomplete.The goodness of the model is usually measured by the MSE between the training data and the outputs from the model.If the training data has been corrupted by some noise, there is a risk of overfitting to the noisy training data if the objective is to minimize the MSE and the model will consequently perform poorly on the previously unseen inputs.This is the primary reason that we designed the denoising substage, as the consequent parameters estimated are subject to the minimization of the MSE (described in the next section).
To this end, the well-known denoising technique based on wavelet transform [26] is employed and one-dimensional denoising process is performed directly on the collected time series of traffic flow.In general, the denoising procedure follows the three steps: (1) decomposition: compute the wavelet decomposition of the noisy data by applying a wavelet ("db3" in our case) to it at the specified level; (2) detail coefficients thresholding: select appropriate threshold for each level and apply thresholding method (soft thresholding in our implementation) to remove the noises; (3) reconstruction: inverse wavelet transform of the thresholded wavelet coefficients to generate clean data.Further information on wavelet-based denoising technique can be obtained from the papers [26][27][28].
The decomposition level is one of the critical factors, which has significant impact on the denoising performance (the discussions on the others, such as wavelet base, are beyond the scope of this paper) [29].The simplest method is to choose a fixed decomposition level of wavelet transform based on experience before denoising, but it will be problematic when the noise level embedded is not what is expected.Therefore, a number of adaptive approaches have been proposed [30,31].Those approaches determine the decomposition level by iteratively examining the denoising performance for the incremented level until the performance meets some stop criteria which are set normally based on either signal-to-noise ratio (SNR) [32] or singular spectrum [30].However, our method is different from those reported Mathematical Problems in Engineering in literature in that it is based on the residual variance estimated by the -test and the input vector can be determined simultaneously.
As shown in Figure 1, a number of candidate input vectors are generated by incrementing the dimension in turn and analyzed via the -test.If the minimum residual variance obtained from the set of candidate input vectors does not satisfy the following criterion (called variance ratio  in this paper), where  is the original time series or the reconstructed time series (if denoising is performed),  is a set of residual variance values corresponding to the set of candidate input vectors, and  is a constant which needs to be predetermined, then the wavelet-based denoising technique is applied to the time series to reduce noise level and the decomposition starts from one level.The denoised time series generates another set of candidate input vectors which are subsequently examined again by the -test.If the noise level is still high, the decomposition level increments and the denoising process is performed again.Such iterative process terminates until the stop criterion is met and the input vector that has produced the minimum residual variance will be selected as the input for the FIS.

Parameters and Rules Determination.
The NNC [16] is a simple but effective clustering technique that can be used to determine the number of membership functions and provide initial cluster centers for the subsequent tuning process.The clustering process by the NNC algorithm is as follows: (1) choose the first data from the set of training data as the first cluster center; (2) group the data into the first cluster if its distance to the cluster center is less than the predefined threshold value (0.8 for all experiments presented in this paper); (3) otherwise, set the data as the second cluster center; (4) examine the next data by calculating its distances to all existing cluster centers; (5) group the next data into the cluster whose center is closest to it and their distance is less than the threshold; (6) otherwise, generate a new cluster by setting it as the center.Such process terminates if all data have been checked and the final number of clusters obtained is used as the number of membership functions.
During the tuning process, a GMM is used to group the training data into a set of clusters whose number and initial centers have been determined by the preceding process.
A mixture of  component Gaussians can be expressed as follows: where (x; v  , Σ  ) is the probability density function of th component Gaussian with mean vector v  and covariance matrix Σ  , and its weight   satisfies To determine the parameter vector  (i.e.,   , v  , and Σ  for each component) for a Gaussian mixture model, the Expectation Maximization (EM) algorithm is applied to the set of training data.An unobserved latent variable y is added and this latent variable determines the component from which the observed data x originates.Thus, the log-likelihood function can be formulated as follows: The EM algorithm [33] fits a mixture model to a set of training data by alternately performing between expectation step and maximization step.Suppose the current step is at  step.Expectation Step.Determine the expectation of the log-likelihood function, with respect to the conditional distribution of y given x under the current estimate of the parameters.Maximization Step.Update the parameters that maximize the expectation of the log-likelihood function obtained in expectation step.The process terminates either if a predefined number of iterations is exceeded or if the parameters improvement between two consecutive iterations is less than the minimum amount of improvement predefined.
As Gaussian membership functions are adopted for the premises of the FIS, the result from the clustering by GMM can be directly used to determine the parameters for the membership functions.Meanwhile, the rules can be extracted from the clustering results and the number of rules for the FIS is the number of clusters.
The consequent part is a linear function of the input components, and, therefore, we use a well-known method, least square estimation [18], to determine the parameters for the linear function.

Experiments
The proposed approach has been implemented in MATLAB and evaluated through a series of experiments on the traffic flow data which was collected and aggregated at intervals of 2 minutes' duration for a road in Beijing for one week, from November 20 (Monday) to November 26 (Sunday), 2006.
Before examining the proposed prediction approach, the performance of the -test should be evaluated as it has been employed to determine the decomposition level and input dimension.The evaluations on the -test from various aspects have been reported in [22][23][24][25], apart from its sensitiveness on the choice of the starting point.Therefore, this work is concentrated to a sensitive analysis to evaluate the dependence of the -test on the starting point.The oneweek traffic flow data with 100 different starting points have been used by the -test to estimate the residual variance and the coefficients of variation have been calculated and listed in When determining the input dimension, the maximum number of input dimensions has to be specified by users.To this end, users may take advantage of their experience and knowledge on the traffic flow field.Here, we provide a simple method to determine the maximum number of input dimensions based on the calculation of the partial correlation coefficients [34] for the traffic flow data.Figure 2 shows the partial correlation coefficients obtained for the traffic flow data of each day with the maximum lag of 100 steps.This experiment implies the traffic flow with lag greater than 10 generally has much less significant association with the current flow, as most coefficients fall into 95% confidence intervals, as shown in Figure 2. Therefore, the maximum number of input dimensions has been selected to be 10 for the experiments presented in this paper.
The next experiment has been designed to evaluate the denoising and prediction performance of the proposed approach on the traffic flow data of one week.To evaluate the denoising performance for the proposed approach on the traffic flow, Kolmogorov-Smirnov test (K-S test) has been employed for each test to determine whether the denoised data and noise data estimated by the wavelet denoising technique follow the same distribution and the correlation coefficient between the denoised data and the estimated noise has been also calculated for each day to measure their dependency.On the other hand, Root Mean Square Error (RMSE) [13] was used as an index to evaluate the prediction performance of the proposed approach.Although it is not appropriate to use RMSE as an objective to build a model as the minimization of RMSE is likely to result in overfitting, it is suitable for performance evaluation particularly when comparison is carried out.For all the following experiments, the first 500 datasets of traffic flow was used to train a FIS and the last 200 datasets to test the prediction performance.
The obtained results are shown in Table 2 and Figures 3 and 4. The decomposition level and input dimension for the traffic flow on each day have been obtained when the variance ratio meets the required criterion.The number of rules (the same as the number of membership functions) has been determined by performing the NNC algorithm on the training data with the cluster radii set as 0.8.
Figure 3 shows the histograms for the denoised data and the noise data estimated by the wavelet denosing method with the decomposition levels determined by the proposed approach and listed in Table 2.It is obvious that the denoised data and the noise data do not follow the same distribution which is also proved by the  values obtained with 95% confidence level through the T-S test.From Figure 3, it can be seen that the estimated noise data generally follow a Gaussian distribution.Additionally, the correlation coefficients calculated for the denoised flow and the noise data also indicate there is a very weak correlation between them.Furthermore, the original traffic flow data and the denoised flow are also illustrated in Figure 4, with grey solid lines and black dotted lines, respectively.From Figure 4, it can be seen that the denoised flow data is able to retain the general flow trend without losing the mostly significant peaks and valleys, while a smooth traffic flow pattern is obtained.
Regarding the training and prediction performance, RMSEs have been calculated and listed in Table 2, together with the fitting data and the predicted data depicted against the original data and the denoised data in Figure 4. From Table 2, although the RMSEs for training are lower than those for prediction, the prediction in general performed well.The larger RMSE values obtained during the prediction stage may be caused by inefficient samples presented during the training stage.Furthermore, the discrepancies between the fitting flow or the predicted flow and the denoised data are hardly seen from Figure 4 for all days, reflecting the proposed approach can accurately predict the flow trend.
To evaluate the proposed algorithm for input dimension determination, the input dimension has been varied from 1 to 10 while the decomposition level and the number of rules is kept the same as those listed in Table 2.When varying the input dimension, it is difficult to generate the same number of clusters (i.e., the number of rules) by the NNC algorithm and thus it was set to be ineffective for this experiment.Consequently, the initial centers for the GMM under construction were set randomly.To eliminate any bias that might result from the random initialization, 100 independent runs have been performed and the averaged results are presented in Table 3.While the RMSEs for training (denoted as ) and prediction (denoted as ) are presented, the ratios of the RMSEs for prediction over those for training are listed as well in an attempt to indicate the degree of overfitting.
In general, the RMSEs for each day of the week decrease with the growth of the input dimension, indicating a trend that the modeling performance is gradually improved.However, the situation for prediction is quite complicated, as the pattern is not very evident, though it appears that a concave pattern generally holds for the most of days.The RMSE ratios (indicated by /), in the cells with bold font in Table 3, correspond to the input dimensions shown in Table 2 and the smallest ratio for each day of the week is also highlighted by italic font.For the Monday, although the RMSE ratio corresponding to the input dimension of 2 is the third smallest RMSE ratio, the overfitting problem is not significant.However, the RMSE ratio for Tuesday is quite high, but the RMSE values obtained during the training  and predicting phase are both smaller than those indicated with italic font, which gives the smallest RMSE ratio.A similar pattern is also held for Wednesday.For Thursday, the overfitting effect has been reduced to the minimum when the input dimension corresponds to that listed in Table 2.
Although the RMSE ratios for Friday and Sunday are not the smallest, they are bounded within 2, implying overfittings are not considerable.For Saturday, the RMSE ratio is ranked as the second in terms of the smallest value, but the RMSE values for the training and prediction phase are all smaller than those with the smallest ratio (denoted as italic font).After all, the experimental results indicate that the overfitting problem can be generally restrained with the input dimension determined by the proposed approach which is not designed to explicitly deal with the overfitting problem.
In order to evaluate the NNC algorithm for the determination of the number of rules (clusters), we have performed another experiment by varying the number of clusters from 1 to 10 and the RMSE values and the RMSE ratios (i.e., /) obtained from the FISs created by the GMMs with random initialization are presented in Table 4.Note that the RMSEs and the RMSE ratios presented have been averaged over 100 independent runs as the initial center for each Gaussian component in a mixture model was initialized randomly.
The RMSEs listed in the column corresponding to 1 cluster are considerably large, as compared to those in other columns, and this is due to the fact that all training data have been grouped in one cluster.Apart from the first column, the performances are quite constant for the whole week, reflected by a general decreasing and increasing patterns possessed for modeling and prediction, respectively.In Table 4, the RMSE ratios in the second column have been highlighted with bold font to indicate they have been obtained when the number of rules (clusters) was determined as 2, as shown in Table 2.The RMSE ratios obtained with the choice of 2 rules for the days of the week except for Friday are all closer to 1 than those with the number of rules greater than 2. For Friday, the RMSE ratio (highlighted with the italic font in Table 2) mostly approaches 1 among those with the number of rules greater than 1, but the number of rules is just one more than that determined by the NNC algorithm.In summary, the choice of 2 rules (clusters) is proper as the overfitting problem can be considerably alleviated in general and consequently results in a compact structure for the FIS.

Conclusions
The approach proposed for one-step-ahead prediction in this paper consists of a number of novel elements that have been developed to deal with the issues relating to the input dimension determination and parameters and rules estimation.A series of experiments have been performed based on the one-week traffic flow to evaluate the proposed prediction approach.The first experiment demonstrates a method introduced to determine the maximum number of input dimensions, a parameter required for the determination of the input dimension, by calculating partial correlation coefficient.By analyzing the distribution discrepancy between the denoised flow and the estimated noise data, the denoising performance has been evaluated and the experimental results indicate the denoising procedure is effective in the noise removal.The prediction performance has been evaluated by calculating the RMSE values for the training phase and the predicting phase.
Further experiments have been performed to evaluate the input dimension and the number of rules determined by the proposed approach in terms of overfitting.Although the prediction approach has not been designed with the particular emphasis on the overfitting issue, the experimental results have shown that no significant overfitting is caused from the proposed prediction approach.Based on the experimental analyses, it can be concluded that the proposed approach is compact and competent, but further improvements are possible, by refining the stopping criterion for denosing and input dimension determination process, for example.

Figure 1 :
Figure 1: Framework of the prediction approach proposed.

Figure 4 :
Figure 4: Evaluation of the proposed prediction approach for the training and prediction phases on the traffic flow data in the week, November 20 to 26, 2006 (the original flow data and the denoised flow data are represented by the grey solid lines and the black dotted lines, resp., while the black solid lines indicate the fitting and predicted traffic flow).

Table 1 :
Coefficients of variation for the residual variances estimated for the traffic flow with 100 different initial points.

Table 2 :
Evaluation of the proposed approach on the one-week traffic flow.

Table 3 :
Evaluation of the impact of the input dimension on the prediction by varying the dimension from 1 to 10 for the one-week traffic flow (RMSEs averaged over 100 independent runs are presented,  and  denote the RMSE values obtained from training and prediction, resp., and / presents the RMSEs for prediction over those for training).

Table 4 :
Evaluation of the impact of the number of clusters on the prediction by varying it from 1 to 10 for the one-week traffic flow (RMSEs averaged over 100 independent runs are presented, and  and  denote the RMSE values obtained from training and prediction, resp.).