Short-Term Traffic Flow Local Prediction Based on Combined Kernel Function Relevance Vector Machine Model

Short-term traffic flow prediction is one of the most important issues in the field of adaptive traffic control systems and dynamic traffic guidance systems. In order to improve the accuracy of short-term traffic flow prediction, a short-term traffic flow local prediction method based on a combined kernel function relevance vector machine (CKF-RVM) model is put forward. The C-C method is used to calculate the delay time and embedding dimension. The number of neighboring points is determined by use of the Hannan-Quinn criterion, and the CKF-RVM model is built based on a genetic algorithm. Finally, case validation is carried out using inductive loop data measured from the north-south viaduct in Shanghai. The experimental results demonstrate that the CKF-RVM model is 31.1% and 52.7% better than the GKF-RVM model and the GKF-SVM model in terms of MAPE. Moreover, it is also superior to the other two models in terms of EC.


Introduction
Short-term traffic flow prediction is an important basis for intelligent transportation systems (ITS). Real-time and accurate prediction information can be directly applied in the advanced traffic management system (ATMS) and the advanced traffic information service system (ATIS). Because of its importance, short-term traffic flow prediction has generated great interest in the scientific community, and a large number of relevant methods exist in the literature. These include the spectral analysis model [1,2], time series model [3,4], regression model [5,6], Kalman filtering model [7,8], neural network model [9,10], support vector machine model [11,12], and wavelet network model [13]. Readers interested in the details of models applied in the traffic flow prediction field can refer to review papers such as [14][15][16]. With the development of chaos theory, recent studies such as [17][18][19] have found that short-term traffic flow time series data exhibit nonlinear chaotic phenomena. Therefore, short-term traffic flow chaotic prediction has gained special attention. The prediction of chaotic time series can generally be classified into two categories: global prediction and local prediction. Global prediction methods use all phase points to describe the evolution law and then predict the future value. A number of researchers have applied global prediction methods to chaotic time series. Karunasinghe and Liong [20] investigated the performance of an artificial neural network as a global model in chaotic time series prediction compared with local prediction models. Dong et al.
[21] adapted the Elman neural network to realize short-term traffic flow prediction based on chaos analysis. Baydaroglu and Kocak [22] used a support vector regression model to predict evaporation amounts, with phase space reconstruction used to prepare the input data for the SVR. Local prediction methods select neighboring points to fit the short-term evolution trend of phase points and then obtain the predicted value. Local prediction methods mainly include the local average prediction method [23], the weighted first-order local prediction method [24], the Lyapunov index prediction method [25], and the support vector machine model [26]. Because fewer phase points are fitted, the local prediction method has the advantages of low computational complexity and a high degree of fit. Farmer and Sidorowich [27] proved that local prediction methods perform better than global prediction methods under the same embedding dimension. Therefore, local prediction is adopted to achieve short-term traffic flow prediction in this paper.
In order to obtain accurate prediction results, we need to find the nonlinear prediction function. However, it is hard to obtain an accurate function due to the interference of internal and external excitations. Determining a linear function, by contrast, is not hard: detecting linear relations has been the focus of much research in statistics and machine learning for decades, and the resulting algorithms are well understood, well developed, and efficient. If we could combine both, the problem would be solved. Instead of trying to fit a nonlinear model, we can map the problem from the input space to a feature space by performing a nonlinear transformation using suitably chosen basis functions and then use a linear model in the feature space. The basis function is called the kernel function. The linear model in the feature space corresponds to a nonlinear model in the input space. This is the main idea of the relevance vector machine (RVM) model. Due to its theoretical advantages, the RVM has gained special attention in recent years, for example in [28][29][30]. This paper builds a short-term traffic flow forecasting model based on the RVM because of its ability to deal with dynamic, nonlinear, and complex traffic flow time series; consequently, it is very suitable for short-term traffic flow prediction.
For these reasons, and with the goal of improving the accuracy of short-term traffic flow prediction, we put forward a short-term traffic flow local prediction method based on a combined kernel function relevance vector machine model. The remainder of this paper is structured as follows: Section 2 presents the phase space reconstruction theory. Section 3 gives the process of building the combined kernel function relevance vector machine model. Section 4 describes the experiment setup and case study. Section 5 draws some conclusions.

Phase Space Reconstruction Theory
Phase space reconstruction theory, proposed by Packard et al. [31], is a powerful tool in the study of complicated systems. According to the theory of chaotic dynamics, a time series contains all the useful information and reflects the process of system evolution over the long term. Complex characteristics found in a time series may be the result of temporal evolution on a chaotic attractor, an object of fractal dimension created by the stretching and folding of space. If we could capture chaotic behavior from the time series signal of traffic flow, we could enhance our knowledge about the inherent properties of the traffic flow system. Phase space reconstruction theory is used to create attractors topologically equivalent to the original dynamical system using the information from a scalar time series only [32].
Phase space can be reconstructed using the delay coordinate method. The basic idea of the delay coordinate method is that the evolution of any single variable of a system is determined by the other variables with which it interacts; information about the relevant variables is thus implicitly contained in the history of any single variable. For a time series {x(i), i = 1, 2, ..., N}, the phase space can be reconstructed according to

X(i) = (x(i), x(i + τ), ..., x(i + (m − 1)τ)), i = 1, 2, ..., N − (m − 1)τ,

where τ is the delay time and m is the embedding dimension. The embedding dimension and delay time are the key parameters for phase space reconstruction. At present, there are two views on the selection of these two parameters. One view is that the two parameters are independent and can be determined separately. Methods for calculating the delay time include the Average Displacement method [33], the Mutual Information method [34], and the Autocorrelation Function method [35]. Methods for calculating the embedding dimension include the False Nearest Neighbors method [36], the Cao method [37], and the G-P method [38]. The other view is that the two parameters are interrelated and should be determined simultaneously, as in the C-C method [39]. The C-C method obtains the embedding dimension and delay time simultaneously and, compared with other methods, has the advantages of a small amount of calculation and strong anti-interference. Therefore, the C-C method is employed to determine the delay time τ and the embedding window width τ_w, and the embedding dimension m is then calculated according to τ_w = (m − 1)τ. The principle of the C-C method is as follows.
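A minimal sketch of this delay coordinate reconstruction in NumPy (the function name `delay_embed` is ours, not the paper's):

```python
import numpy as np

def delay_embed(x, m, tau):
    """Delay coordinate reconstruction:
    X(i) = (x[i], x[i + tau], ..., x[i + (m - 1) * tau])."""
    n = len(x) - (m - 1) * tau          # number of phase points M
    if n <= 0:
        raise ValueError("series too short for the given m and tau")
    return np.column_stack([x[j * tau : j * tau + n] for j in range(m)])

# e.g. a 200-sample series embedded with m = 3, tau = 5 gives 190 phase points
x = np.sin(0.1 * np.arange(200))
X = delay_embed(x, m=3, tau=5)
```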
Let {x(i), i = 1, 2, ..., N} denote the time series; a new set of vectors X(i) can be obtained through phase space reconstruction. The correlation integral for the embedded time series is the following function:

C(m, N, r, τ) = (2 / (M(M − 1))) Σ_{1 ≤ i < j ≤ M} Θ(r − ‖X(i) − X(j)‖_∞), r > 0,

where r is the neighborhood radius, X(i) is a phase point in phase space, τ is the delay time, m is the embedding dimension, M = N − (m − 1)τ is the number of embedded points in phase space, N is the length of the time series, ‖·‖_∞ denotes the sup-norm, and Θ(a) = 0 if a < 0 and Θ(a) = 1 if a ≥ 0. The correlation integral is a cumulative distribution function and denotes the probability that the distance between any two points is less than r. We define the test statistic

S(m, N, r, τ) = C(m, N, r, τ) − C^m(1, N, r, τ).

The time series x(i), i = 1, 2, ..., N, can be divided into τ disjoint time series:

{x(s), x(s + τ), x(s + 2τ), ...}, s = 1, 2, ..., τ.

The test statistic is then computed by averaging over the subsequences:

S(m, r, τ) = (1/τ) Σ_{s=1}^{τ} [C_s(m, N/τ, r, τ) − C_s^m(1, N/τ, r, τ)],

where C_s denotes the correlation integral of the sth subsequence.
As N → ∞, we can write

S(m, r, τ) = (1/τ) Σ_{s=1}^{τ} [C_s(m, r, τ) − C_s^m(1, r, τ)].

For a fixed embedding dimension m and delay time τ, S(m, r, τ) will be identically equal to 0 for all r if the time series data are independent and identically distributed. However, actual time series data are finite and correlated, so S(m, r, τ) is generally not equal to 0. Thus, the locally optimal times may be either the zero crossings of S(m, r, τ) or the times at which S(m, r, τ) shows the least variation with r, because this indicates that the points are nearly uniformly distributed. Hence, we select the maximum and minimum radii to define the quantity

ΔS(m, τ) = max_j{S(m, r_j, τ)} − min_j{S(m, r_j, τ)}.

ΔS(m, τ) measures the maximum deviation of S(m, r, τ) with respect to r. Therefore, the optimal delay time τ is the first zero crossing of S(m, r, τ) or the first local minimum point of ΔS(m, τ).
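The C-C statistics can be sketched as follows (a direct, unoptimized NumPy implementation; function names are ours, and each disjoint subseries of delay τ is embedded with internal delay 1, which is equivalent):

```python
import numpy as np

def corr_integral(x, m, tau, r):
    """C(m, N, r, tau): fraction of phase-point pairs within sup-norm distance r."""
    n = len(x) - (m - 1) * tau
    X = np.column_stack([x[j * tau : j * tau + n] for j in range(m)])
    d = np.max(np.abs(X[:, None, :] - X[None, :, :]), axis=2)   # pairwise sup-norm distances
    iu = np.triu_indices(n, k=1)                                # each pair counted once
    return float(np.mean(d[iu] <= r))

def s_stat(x, m, tau, r):
    """S(m, r, tau): average over the tau disjoint subseries of C_s(m,...) - C_s(1,...)^m."""
    subs = [x[s::tau] for s in range(tau)]                      # tau disjoint subseries
    return float(np.mean([corr_integral(s, m, 1, r) - corr_integral(s, 1, 1, r) ** m
                          for s in subs]))

def delta_s(x, m, tau, radii):
    """Delta-S(m, tau): spread of S(m, r, tau) over the candidate radii r_j."""
    vals = [s_stat(x, m, tau, r) for r in radii]
    return max(vals) - min(vals)
```

In practice one averages these statistics over several m and radii r_j and scans τ for the first local minimum, as described above.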
Because there are many parameters in the model, maximum likelihood estimates of w and σ² will lead to severe overfitting. Therefore, the sparse Bayesian framework is adopted, and a zero-mean Gaussian prior distribution is placed over w:

p(w | α) = Π_{i=0}^{N} N(w_i | 0, α_i^{−1}),

where α = (α_0, α_1, ..., α_N)^T is a vector of N + 1 hyperparameters. Each weight is individually associated with a hyperparameter α_i, which controls the influence of the prior distribution on the associated weight.
Having defined the prior distribution and the likelihood, the posterior distribution over the weights follows from Bayes' theorem:

p(w | t, α, σ²) = N(w | μ, Σ).

The posterior covariance matrix and mean are, respectively,

Σ = (σ^{−2} Φ^T Φ + A)^{−1},
μ = σ^{−2} Σ Φ^T t,

where A = diag(α_0, α_1, ..., α_N).
Maximizing the marginal likelihood with respect to the hyperparameters, the values of α and σ² can be obtained through an iterative algorithm. Consider

α_i^{new} = γ_i / μ_i²,

where μ_i is the ith posterior mean weight and γ_i = 1 − α_i Σ_ii, with Σ_ii the ith diagonal element of the posterior covariance matrix computed with the current α and σ².
The noise variance σ² can be obtained through the iterative update

(σ²)^{new} = ‖t − Φμ‖² / (N − Σ_i γ_i).

Given a new sample x_*, let t_* be the corresponding prediction value. The probability distribution of the prediction value follows a normal distribution N(y_*, σ_*²) with mean y_* and variance σ_*². Consider

y_* = μ^T φ(x_*),
σ_*² = σ² + φ(x_*)^T Σ φ(x_*),

where y_* is the predictive mean at x_* and σ_*² is the predictive variance.
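The update rules above can be sketched as follows (a simplified training loop without the basis pruning of a full RVM implementation; the function names are ours):

```python
import numpy as np

def rvm_train(Phi, t, n_iter=100):
    """Iterate the RVM hyperparameter updates:
    alpha_i <- gamma_i / mu_i^2, with gamma_i = 1 - alpha_i * Sigma_ii,
    sigma2  <- ||t - Phi mu||^2 / (N - sum_i gamma_i)."""
    N, M = Phi.shape
    alpha = np.ones(M)                    # one precision hyperparameter per weight
    sigma2 = max(float(np.var(t)) * 0.1, 1e-6)
    for _ in range(n_iter):
        A = np.diag(alpha)
        Sigma = np.linalg.inv(Phi.T @ Phi / sigma2 + A)   # posterior covariance
        mu = Sigma @ Phi.T @ t / sigma2                   # posterior mean
        gamma = 1.0 - alpha * np.diag(Sigma)
        alpha = gamma / (mu ** 2 + 1e-12)
        sigma2 = float(np.sum((t - Phi @ mu) ** 2) / max(N - gamma.sum(), 1e-9))
    return mu, Sigma, sigma2

def rvm_predict(phi_star, mu, Sigma, sigma2):
    """Predictive mean y* and variance sigma*^2 at a new basis vector phi(x*)."""
    return phi_star @ mu, sigma2 + phi_star @ Sigma @ phi_star
```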

The Construction of Combined Kernel Function.
The traditional relevance vector machine model mostly adopts a single kernel function to complete the feature space mapping, which has achieved good performance in many practical applications. But a single kernel function has great limitations when the sample data contain heterogeneous information. Therefore, this paper integrates the Gaussian kernel function and the polynomial kernel function to construct a new combined kernel function of the form

K(x, x_i) = λ K_G(x, x_i) + (1 − λ) K_P(x, x_i),

where λ is the weight coefficient, 0 ≤ λ ≤ 1, σ is the kernel width of the Gaussian kernel function K_G, and d is the order of the polynomial kernel function K_P. Different kernel functions have different advantages; if the weight coefficient of the combined kernel function is inappropriate, its performance may be lower than that of a single kernel function. Therefore, a proper weight coefficient is of great importance for the combined kernel function.
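A sketch of the combined kernel; the exact parameterizations of K_G and K_P are not spelled out in the extracted text, so the common forms K_G(x, x_i) = exp(−‖x − x_i‖² / (2σ²)) and K_P(x, x_i) = (x · x_i + 1)^d are assumed here:

```python
import numpy as np

def combined_kernel(x, xi, lam, sigma, d):
    """K = lam * K_G + (1 - lam) * K_P, with 0 <= lam <= 1."""
    x, xi = np.asarray(x, float), np.asarray(xi, float)
    k_gauss = np.exp(-np.sum((x - xi) ** 2) / (2.0 * sigma ** 2))  # Gaussian kernel
    k_poly = (np.dot(x, xi) + 1.0) ** d                            # polynomial kernel
    return lam * k_gauss + (1.0 - lam) * k_poly
```

With λ = 1 this reduces to the pure Gaussian kernel, and with λ = 0 to the pure polynomial kernel.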

Parameter Optimization Based on Genetic Algorithm.
There are three parameters that need to be optimized in the combined kernel function. Commonly used parameter optimization methods include the cross validation method [41] and the grid search method [42], but these methods involve a large amount of calculation and are often trapped in local optima. The genetic algorithm (GA) [43] is a heuristic method based on Darwinian evolution, which has been widely applied to high-dimensional parameter optimization problems in engineering and science. Genetic algorithms differ from traditional search and optimization methods in four significant respects: (1) genetic algorithms search in parallel from a population of points and therefore can avoid being trapped in a local optimum, unlike traditional methods that search from a single point; (2) genetic algorithms use probabilistic selection rules, not deterministic ones; (3) genetic algorithms work on the chromosome, an encoded version of the potential solution's parameters, rather than on the parameters themselves; (4) genetic algorithms use a fitness score obtained from the objective function, without other derivative or auxiliary information.
Therefore, genetic algorithm is used to obtain the optimal parameters of combination kernel function.The specific steps are as follows.
Step 1 (initialize the parameters). The population size is 20, and the maximal generation count is 100.
Step 2 (representation). The parameters to be optimized, λ, σ, and d, are coded in binary to generate the chromosomes.
Step 3 (fitness function definition). Cross validation is used to prevent overfitting and underfitting. In k-fold cross validation, the training data set is randomly divided into k subsets; the RVM model is built using k − 1 subsets as the training set, and the performance of the parameters is checked on the remaining subset. In this paper, fivefold cross validation is used. The fitness function is defined as the mean absolute percentage error of the fivefold validation on the training data set.
Step 4 (creating a new population). Selection, crossover, and mutation are carried out to generate the new population. Chromosomes with better fitness values are selected using the roulette wheel method. The crossover probability is set to 0.8, and the mutation probability is set to 0.05.
Step 5 (stopping criterion). If the generation count reaches its maximum value, the iteration is stopped. Otherwise, Steps 3 and 4 are repeated.
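The loop of Steps 1-5 can be sketched as follows (a real-coded simplification for brevity, whereas the paper encodes chromosomes in binary; the function and the toy fitness are ours):

```python
import numpy as np

def ga_optimize(fitness, bounds, pop_size=20, generations=100, pc=0.8, pm=0.05, seed=0):
    """Minimise `fitness` over the box `bounds` with roulette-wheel selection,
    crossover probability pc = 0.8, and mutation probability pm = 0.05."""
    rng = np.random.default_rng(seed)
    lo = np.array([b[0] for b in bounds], float)
    hi = np.array([b[1] for b in bounds], float)
    pop = rng.uniform(lo, hi, size=(pop_size, len(bounds)))        # Step 1: initial population
    for _ in range(generations):
        f = np.array([fitness(ind) for ind in pop])                # Step 3: fitness evaluation
        w = (f.max() - f) + 1e-12                                  # lower error -> larger weight
        pop = pop[rng.choice(pop_size, pop_size, p=w / w.sum())]   # Step 4: roulette selection
        for i in range(0, pop_size - 1, 2):                        # arithmetic crossover
            if rng.random() < pc:
                a = rng.random()
                pop[i], pop[i + 1] = (a * pop[i] + (1 - a) * pop[i + 1],
                                      a * pop[i + 1] + (1 - a) * pop[i])
        mask = rng.random(pop.shape) < pm                          # mutation: resample gene
        pop[mask] = rng.uniform(lo, hi, size=pop.shape)[mask]
    f = np.array([fitness(ind) for ind in pop])
    return pop[f.argmin()]                                         # Step 5: best individual

# toy check: minimise a smooth bowl with optimum at (0.3, 0.7)
best = ga_optimize(lambda p: (p[0] - 0.3) ** 2 + (p[1] - 0.7) ** 2, [(0, 1), (0, 1)])
```

In the paper's setting, `fitness` would be the fivefold cross-validation MAPE of the CKF-RVM as a function of (λ, σ, d).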

Phase Space Reconstruction.
Phase space reconstruction is the basis of chaotic time series analysis, and it affects the prediction performance directly. This paper selects the C-C method to complete phase space reconstruction. Figure 2 gives the curve of the averaged statistic ΔS̄(τ) against τ, and Figure 3 gives the curve of S_cor(τ) against τ. From Figure 2, it can be seen that ΔS̄(τ) reaches its first local minimum at τ = 18; therefore, the delay time τ is determined to be 18. From Figure 3, S_cor(τ) reaches its global minimum at τ = 113; therefore, the embedding window width τ_w is 113, and the embedding dimension m is determined to be 7 according to τ_w = (m − 1)τ.
Figure 4 displays the 2D attractor of the reconstructed phase space for traffic flow time series.
From Figure 4, we can clearly see that the 2D attractor of the traffic flow time series is well structured, which indicates that the C-C method can implement phase space reconstruction of the traffic flow time series excellently.

Identification of Chaos.
Among the wide variety of methods available for chaos identification, the most popular is the largest Lyapunov exponent method. The main methods of calculating the largest Lyapunov exponent include the Wolf method [44], the Jacobian method [45], and the small-data-sets method [46]. Due to its small amount of calculation and clear principle, the small-data-sets method is employed to calculate the largest Lyapunov exponent of the traffic flow time series. Figure 5 displays the result of the small-data-sets method. The linear range is from 57 to 98, and the largest Lyapunov exponent is obtained as the slope of a least-squares fit over this linear range. The largest Lyapunov exponent is found to be 0.0014; this positive value implies an exponential divergence of the trajectories and hence a strong signature of chaos.
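The final least-squares step of the small-data-sets method can be illustrated as follows (only the slope fit over the linear range is shown; computing the mean log-divergence curve itself is omitted, and the function name is ours):

```python
import numpy as np

def lyapunov_slope(y, dt, i_lo, i_hi):
    """Least-squares slope of the mean log-divergence curve y over [i_lo, i_hi];
    the slope estimates the largest Lyapunov exponent."""
    i = np.arange(i_lo, i_hi + 1)
    slope, _intercept = np.polyfit(i * dt, y[i_lo : i_hi + 1], 1)
    return float(slope)
```

Applied to a divergence curve that is linear over indices 57-98, the fitted slope is the estimate of the largest Lyapunov exponent.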

The Number of Neighboring Points.
The number of neighboring points is one of the most important parameters, affecting both the prediction accuracy and the amount of calculation. If the number of neighboring points is too small, the nonlinear fitting advantage of the relevance vector machine model will not be reflected. However, if the number of neighboring points is too large, the amount of calculation increases greatly and overfitting appears. Therefore, the Hannan-Quinn criterion [47] is used to determine the number of neighboring points. Figure 6 shows the results of the Hannan-Quinn criterion.
According to the Hannan-Quinn criterion, the number of neighboring points we need is the one at which the criterion reaches its minimum value. From Figure 6, we can see that the number of neighboring points is 26.
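As an illustration, one common form of the Hannan-Quinn criterion is sketched below; the paper does not spell out its exact formulation for neighbour selection, so this per-candidate form is an assumption (`rss` would be the residual sum of squares of the local fit built from k neighboring points):

```python
import numpy as np

def hannan_quinn(rss, n, k):
    """HQ(k) = n * ln(RSS/n) + 2k * ln(ln n); choose the k that minimises it."""
    return n * np.log(rss / n) + 2 * k * np.log(np.log(n))
```

Sweeping k over candidate neighbour counts and taking the argmin balances fit quality against model size.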

Parameter Optimization.
The genetic algorithm is used to optimize λ, σ, and d. The specific parameters of the genetic algorithm are as follows: the population size is 20, the maximal generation count is 100, the crossover probability is 0.8, and the mutation probability is 0.05. Figure 7 gives the fitness curve.
From Figure 7, we can see that the optimal parameters of the combined kernel function are λ = 0.57, σ = 0.25, and d = 3.

Performance Evaluation Index. In order to evaluate the performance of the proposed method, two different types of evaluation indexes are adopted: the mean absolute percentage error (MAPE) and the equal coefficient (EC).

As shown in Figure 8, the prediction results are quite close to the actual data, and the MAPE values are mostly within 10%. However, the MAPE from 0:00 to 4:00 is high, because the actual traffic flow during that period is small. Overall, the CKF-RVM model achieves good prediction performance, which can meet the needs of short-term traffic flow prediction.
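For reference, the MAPE and the EC can be computed as follows (the EC formula below is the equal coefficient commonly used in traffic forecasting; its definition was lost from the extracted text, so treat it as an assumption):

```python
import numpy as np

def mape(y, yhat):
    """Mean absolute percentage error, in percent."""
    y, yhat = np.asarray(y, float), np.asarray(yhat, float)
    return 100.0 * float(np.mean(np.abs((y - yhat) / y)))

def ec(y, yhat):
    """Equal coefficient: 1 - ||y - yhat|| / (||y|| + ||yhat||); closer to 1 is better."""
    y, yhat = np.asarray(y, float), np.asarray(yhat, float)
    return 1.0 - float(np.linalg.norm(y - yhat) / (np.linalg.norm(y) + np.linalg.norm(yhat)))
```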
To describe the superiority of the proposed method in detail, a comparative analysis is carried out. This paper selects the Gaussian kernel function relevance vector machine (GKF-RVM) model and the Gaussian kernel function support vector machine (GKF-SVM) model as comparative approaches. For comparison and analysis in both macroscopic and microscopic terms, Figure 9 gives the microscopic comparative results of the different methods. Figure 9(a) shows the prediction results for the east mainline detector denoted by NBDX11(1), and Figure 9(b) shows the prediction results for the west mainline detector denoted by NBXX15(2). Table 1 gives the macroscopic comparative results of the different methods.
As shown in Figure 9, the prediction results of the CKF-RVM model clearly fit best compared with the GKF-RVM model and the GKF-SVM model. Therefore, the CKF-RVM model can further improve the accuracy of short-term traffic flow prediction.
From Table 1, we can see that the overall improvement of the CKF-RVM model is obvious compared with the GKF-RVM model and the GKF-SVM model. More precisely, the CKF-RVM model achieves a 31.1% improvement over the GKF-RVM model and a 52.7% improvement over the GKF-SVM model in terms of MAPE. Meanwhile, the CKF-RVM model is also superior to the other two models in terms of EC. Furthermore, the experimental results demonstrate that the CKF-RVM model achieves good prediction performance for both the east mainline data and the west mainline data, which shows that the CKF-RVM model has strong generalization ability. Overall, the CKF-RVM model is an effective and accurate method for short-term traffic flow prediction, which can provide satisfactory prediction results.

Discussion and Conclusions
This paper proposes a new short-term traffic flow local prediction method based on a combined kernel function relevance vector machine model. The proposed method is more in line with the short-term traffic flow characteristics, which are nonlinear, chaotic, and nonstationary. The main contribution of this paper is not the specific techniques but rather the demonstration that a forecasting model should take the dynamic characteristics of short-term traffic flow into consideration. The most important contribution is that this paper provides a new idea and methodology for the relevance vector machine model: how to construct the combined kernel function for the short-term traffic flow forecasting model and how to optimize and identify the model structure parameters efficiently and effectively.
Traffic flow data collected from an expressway are employed to evaluate the prediction performance of the proposed method, and the results are encouraging. The theoretical advantages and the better performance in our studies indicate that the CKF-RVM model has good potential and is feasible for short-term traffic flow prediction. In order to reach more general and robust conclusions, traffic data from different roadways require further exploration. Future studies should also apply the model to other traffic variables (such as traffic speed, travel time, and average occupancy; this study uses traffic flow as the demonstration). Moreover, it will be interesting to test traffic data at different time intervals in the model.

Data Source.
The experimental traffic flow data come from loop detectors located on the north-south viaduct expressway in Shanghai, China. This segment includes 24 mainline detecting sections and 30 ramp detecting sections, equipped with 88 mainline loop detectors and 60 ramp loop detectors, respectively. The experimental data were collected on five consecutive Mondays from September 1, 2008, to September 29, 2008. The original time interval of the collected data is 5 min. Figure 1 gives the traffic flow time series data from the five consecutive Mondays.

Figure 1: The traffic flow time series data from five consecutive Mondays.
Figure 2: The curve of ΔS̄(τ) against τ.

Figure 8: The prediction performance based on the proposed method.

Figure 9: The microscopic comparative results of different methods.
The Principle of RVM Model. The relevance vector machine (RVM) model proposed by Tipping [40] is a sparse probabilistic model based on the Bayesian principle. Compared with other intelligent algorithms, RVM has better properties: for example, the kernel function of the RVM model need not be restricted by Mercer's condition. Moreover, it introduces a prior distribution over the weights and thereby greatly reduces the complexity of calculation. The principle of the RVM model is as follows. Consider a data set {x_n, t_n}, n = 1, ..., N, where x_n ∈ R^d and t_n ∈ R. The relationship between x_n and t_n is

t_n = y(x_n; w) + ε_n = Σ_{i=1}^{N} w_i φ_i(x_n) + w_0 + ε_n,
where ε_n ~ N(0, σ²), φ_i(x) = K(x, x_i) is the nonlinear basis function, and K(·, ·) is the kernel function. Therefore, p(t_n | x) = N(t_n | y(x_n), σ²) denotes the normal distribution of t_n with mean y(x_n) and variance σ². Assuming the t_n are independent of each other, the likelihood of the complete data set can be written as

p(t | w, σ²) = (2πσ²)^{−N/2} exp(−‖t − Φw‖² / (2σ²)).

Table 1: Prediction performance comparison of different methods.

In the evaluation indexes, y_i denotes the actual value for the ith time interval, ŷ_i denotes the predicted value for the ith time interval, and n is the total number of time intervals.

Model Performance and Analysis. Data collected from September 1 to September 22 are used as training samples, and data collected on September 29 are used as test samples to evaluate the performance of the prediction model. In order to illustrate the predictive performance of the proposed method intuitively, Figure 8 presents the prediction results based on the proposed method. The black line stands for the actual traffic flow data, and the red line stands for the prediction results of the CKF-RVM model. Figures 8(a) and 8(b), respectively, show the prediction results and the MAPE for the east mainline detector denoted by NBDX16(2). Figures 8(c) and 8(d), respectively, show the prediction results and the MAPE for the west mainline detector denoted by NBXX10(1).