Short-Term Speed Prediction Using Remote Microwave Sensor Data: Machine Learning versus Statistical Model

1Department of Automation, Tsinghua National Laboratory for Information Science and Technology (TNlist), Tsinghua University, Beijing 100084, China 2Key Laboratory of Road and Traffic Engineering of Ministry of Education, Tongji University, Shanghai 201804, China 3School of Transportation Science and Engineering, Harbin Institute of Technology, Harbin 150001, China 4Department of Civil & Environmental Engineering, University of Washington, P.O. Box 352700, Seattle, WA 98195, USA


Introduction
Collecting high-quality traffic information is key to the performance of an Intelligent Transportation System (ITS). Accurate prediction of future traffic flow patterns is becoming more important in Advanced Traffic Management Systems (ATMS) and Advanced Traveler Information Systems (ATIS). Using forecasted information, such as traffic volume, travel time, and traffic conditions, travelers can replan their routes to save time and cost. Furthermore, transportation agencies can improve the efficiency of traffic system management based on forecasted information. Travel speed is an important indicator for estimating traffic conditions in road networks. Alongside common collection approaches such as loop detectors and GPS equipment, the Remote Traffic Microwave Sensor (RTMS) is another important nonintrusive device that directly detects the instantaneous travel speed of vehicles. An RTMS is installed at the side of the road, and it can detect moving or stationary objects without interrupting traffic flow. It can detect traffic volume, occupancy, and speed for multiple lanes simultaneously, at times even in severe environments. Because of its high measurement accuracy [1] compared with a single loop detector, travel speed data collected from RTMS are used as the data source for constructing the prediction models in this paper.
In order to improve forecasting performance, the neural network model was used to aggregate speed information and acceleration information from the current forecasting segment and adjacent segments. Van Lint et al. [13] proposed a state-space neural network model that uses upstream and downstream traffic as model input to predict travel time in the presence of missing or corrupt input data. Ma et al. [14] developed a Long Short-Term Memory Neural Network (LSTM) to predict travel speed based on RTMS detection data in Beijing City; the proposed model can capture the long-term temporal dependency of time series and also automatically determine the optimal time window. For the support vector machine, Wu et al. [17] applied support vector regression (SVR) to travel time prediction and compared the proposed model with some traditional travel time prediction methods in a highway network. Zhang and Liu [18] combined the state-space approach and least squares support vector machines (LS-SVMs) to forecast the travel time index. Asif et al. [20] first analyzed spatiotemporal trends for individual links at the network level and then constructed an SVR model to predict travel speed in a large interconnected road network. For the Kalman filter theory, Chen and Grant-Muller [22] proposed a Kalman filter type network to predict traffic flow and also discussed the effect of the starting network parameters on prediction performance. Chien and Kuchipudi [23] used a Kalman filtering algorithm to predict travel time because it continuously updates the state variable as new observations become available. Their empirical results indicated that the prediction performance based on historic path-based data is better than that based on link-based data during peak hours. Wang et al. [26] proposed a new extended Kalman filter (EKF) based online-learning approach to predict highway travel time.
In order to effectively improve prediction performance, many scholars have proposed hybrid models that combine the advantages of different kinds of methods. Dimitriou et al. [27] used a genetic algorithm to structure an adaptive hybrid fuzzy rule-based system for forecasting traffic flow in urban arterial networks. Zheng et al. [28] introduced a neural network model combined with the theory of conditional probability and Bayes' rule, and the combined model was shown to outperform the singular predictors in an experimental test on Singapore's Ayer Rajah Expressway. Dong et al. [32] proposed a hybrid support vector machine that combines both statistical and heuristic models to consider the spatial-temporal patterns in traffic flow.
For the statistical model, Cetin and Comert [5] proposed a new statistical change-point detection algorithm to predict short-term traffic flow, in which the expectation maximization and the CUSUM (cumulative sum) algorithms are implemented to detect shifts. Chandra and Al-Deek [7] incorporated the effect of upstream and downstream locations on the traffic at a specific location into a traditional Autoregressive Integrated Moving Average (ARIMA) model. Williams and Hoel [8] modeled a seasonal ARIMA process for traffic flow forecasting. For the neural networks, Ye et al. [11] used a neural network model to forecast traffic flow time series based on GPS data recorded at irregular time intervals.
According to the reviewed literature, most traffic prediction models are based on statistical methods or machine learning techniques. These two types of models have different characteristics. Statistical models provide good theoretical interpretability with a clear calculation structure. Machine learning models, in contrast, use a "black box" approach to predict traffic conditions and often lack a good interpretation of the model. However, compared with statistical models, machine learning methods are more flexible, with few or no prior assumptions about input variables. In addition, these approaches are more capable of processing outliers and missing and noisy data [33]. In this study, we compare the prediction performance of statistical models and machine learning models. Among statistical models, we select ARIMA, Vector Autoregression (VAR), and Space-Time (ST). Among machine learning models, we choose Artificial Neural Network (ANN), SVM, and Multilinear Regression (MLR) as candidates. The travel speed data come from RTMS detectors on the fourth ring freeway in Beijing City. The contributions of this paper are the following: a comprehensive comparison of the speed prediction performance of different machine learning and statistical models; an analysis of prediction accuracy for different numbers of forecasting steps ahead; and an evaluation of the models' performance under different scenarios.
The remainder of the paper is organized as follows. Section 2 briefly introduces the models used in this study. The data source and analysis are provided in Section 3. Section 4 discusses the results and compares the prediction accuracies of different models. Section 5 provides the conclusion of the paper.

Statistical Models.
In this section, we briefly introduce three statistical methods (i.e., ARIMA, VAR, and ST) considered in this study.

ARIMA.
The Autoregressive Integrated Moving Average model, ARIMA$(p, d, q)$, contains the following parameters: $p$ is the number of autoregressive terms, $d$ is the number of nonseasonal differences, and $q$ is the number of lagged forecast errors. An ARIMA model can be regarded as a generalization of the autoregressive moving average (ARMA) model. The mathematical formulation of an ARMA$(p, q)$ process is defined as follows:
\[ X_t - \phi_1 X_{t-1} - \cdots - \phi_p X_{t-p} = \varepsilon_t + \theta_1 \varepsilon_{t-1} + \cdots + \theta_q \varepsilon_{t-q}, \tag{1} \]
where $\{X_t\}$ is stationary, $\varepsilon_t$ is a normal white noise series with mean zero and variance $\sigma_\varepsilon^2$, and $\{\phi_1, \ldots, \phi_p\}$ and $\{\theta_1, \ldots, \theta_q\}$ are parameters for the autoregressive and the moving average terms. Using the backshift operator $B$ (with $B X_t = X_{t-1}$) and the polynomials $\phi(B) = 1 - \phi_1 B - \cdots - \phi_p B^p$ and $\theta(B) = 1 + \theta_1 B + \cdots + \theta_q B^q$, the ARMA$(p, q)$ model can be written as follows:
\[ \phi(B) X_t = \theta(B) \varepsilon_t. \tag{2} \]
The ARMA model requires that the data series are stationary.
When time series data are nonstationary, the ARIMA model is used for data that an ARMA model cannot describe adequately. In the ARIMA model, the integrated part with order $d$, denoted as $I(d)$, means the $d$th difference of the original data, which can transform the original data to a stationary series. The mathematical equation of an ARIMA$(p, d, q)$ model is
\[ \phi(B) (1 - B)^d X_t = \theta(B) \varepsilon_t. \tag{3} \]
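As a minimal sketch of the integrated part, the $d$th difference can be obtained by applying the first difference $d$ times. The function name `difference` and the sample speeds below are illustrative assumptions and are not taken from the paper.

```python
def difference(series, d=1):
    """Apply the first-difference operator (1 - B) to `series` d times."""
    values = list(series)
    for _ in range(d):
        # Each pass replaces the series by consecutive changes.
        values = [curr - prev for prev, curr in zip(values, values[1:])]
    return values


# Hypothetical 2-minute average speeds (km/h) with a steepening downward trend.
speeds = [62.0, 60.0, 57.0, 53.0, 48.0]
first_diff = difference(speeds, d=1)   # still trending
second_diff = difference(speeds, d=2)  # constant, so the trend is removed
```

In this toy series the first difference still drifts, while the second difference is constant, which is the situation where a larger $d$ would be considered.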

VAR.
The Vector Autoregression (VAR) model can capture the linear interdependencies among multiple time series and thus can consider the effect of the neighboring stations in predicting the future speed. Here, a 3-equation VAR($p$) model is used and its formulation is defined as follows:
\[ \mathbf{X}_{t+1} = A_0 + A_1 \mathbf{X}_t + \cdots + A_p \mathbf{X}_{t-p+1} + \mathbf{u}_{t+1}, \tag{4} \]
where $\mathbf{X}_{t+1} = (x_{t+1}, y_{t+1}, z_{t+1})^{T}$ is the 3 × 1 vector of variables, $A_0$ is the 3 × 1 constant term, $A_1$ through $A_p$ are 3 × 3 coefficient matrices, and $\mathbf{u}_{t+1}$ is the corresponding 3 × 1 independently and identically distributed random vector with $E(\mathbf{u}_{t+1}) = 0$ and time-invariant positive definite covariance matrix $E(\mathbf{u}_{t+1} \mathbf{u}_{t+1}^{T}) = \Sigma_u$. Before applying the VAR($p$) model, the characteristic polynomial is evaluated to ensure stability:
\[ \det(I_3 - A_1 z - \cdots - A_p z^p) = 0, \tag{5} \]
where $I_3$ is the 3 × 3 identity matrix. The necessary and sufficient condition for stability is that all characteristic roots lie outside the unit circle.
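For intuition, the stability condition can be worked out in the special case of a single lag. This is an illustration under assumed notation (coefficient matrix $A_1$, 3 × 3 identity $I_3$, complex variable $z$, eigenvalue $\lambda$) and is not a result reported in the paper.

```latex
\det(I_3 - A_1 z) = 0
\quad\Longleftrightarrow\quad
z = \frac{1}{\lambda} \ \text{for some nonzero eigenvalue } \lambda \text{ of } A_1,
\qquad\text{since}\qquad
\det\!\left(I_3 - \tfrac{1}{\lambda} A_1\right)
  = \lambda^{-3} \det(\lambda I_3 - A_1) = 0 .
```

Hence, in the one-lag case, all characteristic roots lie outside the unit circle exactly when every eigenvalue of $A_1$ has modulus smaller than 1.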

Space-Time Model.
The Space-Time (ST) model is a probabilistic modeling approach that can provide the point prediction of future observations [10]. In probabilistic speed prediction, the commonly used normal distribution is adopted for speed data. Thus, this study assumes that the speed at time $t + k$ at the target station follows a normal distribution, $Y_{t+k} \sim N(\mu_{t+k}, \sigma_{t+k}^2)$. The point prediction of $Y_{t+k}$ is the mean, $\mu_{t+k}$, of the normal distribution. Then, $\mu_{t+k}$ is fitted by a linear combination of the present and past values of the speed series at all stations. For example, for station B, when $k = 1$ (i.e., 2-minute-ahead prediction), the mean is an intercept $a_0$ plus a weighted sum, with coefficients $a_1, \ldots, a_5$, of selected present and past speeds, where $y_{A,t}$, $y_{B,t}$, and $y_{C,t}$ are the 2-minute average speeds at stations A, B, and C at time $t$; stations A and C are the upstream and downstream stations of station B, and $a_0, a_1, \ldots, a_5$ are model coefficients. Predictor variables for $\mu_{t+k}$ are selected based on an analysis of the speed data from the first week of the dataset using a stepwise forward search (refer to [10] for details about the predictor variable selection algorithm).

Machine Learning Models.
In this section, we select three models, Artificial Neural Network, support vector machine, and Multilinear Regression, to predict travel speed; the following subsections briefly describe these three models.

Artificial Neural Network.
Artificial Neural Network (ANN) is a popular tool for traffic flow prediction because of its capability of handling multidimensional data, flexible model structure, strong generalization and learning ability, and adaptability [33]. Different from the statistical methods, ANN does not require underlying assumptions regarding data and is also robust to missing and noisy inputs [33]. An ANN model is generally constructed as a multilayer system, and it is typically defined by three types of parameters: the interconnection pattern between different layers; the learning process for updating the weights of the layers; and the activation function that converts input to output activation. An ANN system can be represented as a nested composition of weighted sums and transfer functions, where $n$ and $m$, respectively, represent the number of neurons in the input layer and hidden layer and $g$ and $h$ are the transfer functions for the input layer and hidden layer. The weight matrices $\mathbf{h}$ and $\mathbf{u}$, respectively, refer to the weight values for neurons in the input layer and hidden layer. To minimize the sum of estimated errors from ANN, a number of optimization algorithms have been developed, including Back Propagation Neural Networks, the Levenberg-Marquardt method, and the genetic algorithm. Detailed information about ANN is given in [11][12][13][14][15][33]. As an important member of the ANN family, the nonlinear autoregressive model with exogenous inputs neural network (NARXNN) allows a delay line on the inputs, and the outputs feed back to the input through another delay line. This is a further extension of the time delay neural network, since the NARXNN not only considers its own previous outputs but also incorporates the exogenous inputs [14].
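The layered structure described above can be illustrated with a forward pass through a network with one hidden layer. This is a generic sketch under stated assumptions (no bias terms, `tanh` hidden activation, identity output, made-up weights); it is not the BPNN or NARXNN configuration used in the paper, and the name `forward` is hypothetical.

```python
import math


def forward(x, w_hidden, w_out, hidden_fn=math.tanh, out_fn=lambda s: s):
    """Forward pass of a network with one hidden layer and one output.

    w_hidden[j][i] weights input i into hidden neuron j;
    w_out[j] weights hidden neuron j into the single output.
    """
    hidden = [hidden_fn(sum(w * xi for w, xi in zip(row, x))) for row in w_hidden]
    return out_fn(sum(w * h for w, h in zip(w_out, hidden)))


# Hypothetical weights: 2 inputs, 2 hidden neurons, 1 output.
y = forward([1.0, 0.0], w_hidden=[[0.0, 1.0], [0.0, -1.0]], w_out=[1.0, 1.0])
```

Training algorithms such as back propagation adjust `w_hidden` and `w_out` to reduce the prediction error; that step is omitted here.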

Support Vector Machine.
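The main idea of support vector regression is to map data into a high-dimensional feature space through a nonlinear relationship and then construct a linear regression in this space. Given a set of data points $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$ for regression, $n$ is the number of training samples. The SVM regression function is formulated as follows:
\[ f(x) = \mathbf{w} \cdot \Phi(x) + b, \tag{8} \]
where $\mathbf{w}$ is a vector in a feature space $F$ and $\Phi(x)$ is called the feature, which maps the input $x$ to a vector in $F$. Assume an $\varepsilon$-insensitive loss function:
\[ L_\varepsilon(y, f(x)) = \begin{cases} 0, & |y - f(x)| \le \varepsilon, \\ |y - f(x)| - \varepsilon, & \text{otherwise}. \end{cases} \tag{9} \]
Then, $\mathbf{w}$ and $b$ are estimated by solving the following optimization problem:
\[ \min_{\mathbf{w}, b} \; \frac{1}{2} \|\mathbf{w}\|^2 + C \sum_{i=1}^{n} L_\varepsilon(y_i, f(x_i)), \tag{10} \]
where $\varepsilon$ is the maximum deviation allowed; $C$ represents the associated penalty for excess deviation during the training process, which evaluates the trade-off between the empirical risk and the smoothness of the model. The positive slack variables $\xi$ and $\xi^*$ are incorporated, which represent the size of positive and negative excess deviation, respectively. Thus, (10) is transformed to the following constrained formulation:
\[ \min_{\mathbf{w}, b, \xi, \xi^*} \; \frac{1}{2} \|\mathbf{w}\|^2 + C \sum_{i=1}^{n} (\xi_i + \xi_i^*) \tag{11} \]
subject to:
\[ y_i - \mathbf{w} \cdot \Phi(x_i) - b \le \varepsilon + \xi_i, \qquad \mathbf{w} \cdot \Phi(x_i) + b - y_i \le \varepsilon + \xi_i^*, \qquad \xi_i, \xi_i^* \ge 0, \quad i = 1, \ldots, n. \]
The first term of (11), $\|\mathbf{w}\|^2$, is the regularization term. Thus, it controls the function capacity. The second term, $\sum_{i=1}^{n} (\xi_i + \xi_i^*)$, is the empirical error measured by the $\varepsilon$-insensitive loss function. By applying the appropriate Karush-Kuhn-Tucker (KKT) conditions to (11), we have the following dual form of the optimization problem:
\[ \max_{\alpha, \alpha^*} \; -\frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} (\alpha_i - \alpha_i^*)(\alpha_j - \alpha_j^*) K(x_i, x_j) - \varepsilon \sum_{i=1}^{n} (\alpha_i + \alpha_i^*) + \sum_{i=1}^{n} y_i (\alpha_i - \alpha_i^*) \tag{12} \]
subject to:
\[ \sum_{i=1}^{n} (\alpha_i - \alpha_i^*) = 0, \qquad 0 \le \alpha_i, \alpha_i^* \le C, \quad i = 1, \ldots, n. \]
Therefore, the SVM equation for nonlinear predictions becomes
\[ f(x) = \sum_{i=1}^{n} (\alpha_i - \alpha_i^*) K(x, x_i) + b, \tag{13} \]
where $K(x, x_i)$ is called the kernel function, and $\alpha_i$ and $\alpha_i^*$ are the solution to the dual problem. There are four conventional kernel functions: linear, radial basis function (RBF), polynomial, and sigmoid. In this study, we select two common functions, linear and RBF, to construct the SVM model. The first reason is that these two functions are widely used in prediction and classification. The second reason is that the linear function has the advantages of simple construction and low computational time, while the RBF function uses a nonlinear structure and produces reliable prediction performance given optimal parameters. For the linear kernel function,
\[ K(x, x_i) = x \cdot x_i. \tag{14} \]
For the RBF kernel function,
\[ K(x, x_i) = \exp\left(-\gamma \|x - x_i\|^2\right), \tag{15} \]
where $\gamma$ is the parameter in the kernel function.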
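The building blocks of the SVM regression described above can be sketched in a few lines. The function names are hypothetical, the RBF kernel is assumed to take the common form exp(-gamma * squared distance), and `dual_coefs[i]` stands for the difference of the two dual variables of training sample `i`; solving the dual problem itself is not shown.

```python
import math


def eps_insensitive_loss(y_true, y_pred, eps):
    """Zero inside the eps-tube around the target, linear outside it."""
    return max(0.0, abs(y_true - y_pred) - eps)


def linear_kernel(x, z):
    """Inner product of two feature vectors."""
    return sum(a * b for a, b in zip(x, z))


def rbf_kernel(x, z, gamma):
    """Assumed RBF form: exp(-gamma * squared Euclidean distance)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)


def svr_predict(x, support_vectors, dual_coefs, bias, kernel):
    """Kernel expansion: weighted sum of kernel values plus the bias."""
    return sum(c * kernel(x, sv) for c, sv in zip(dual_coefs, support_vectors)) + bias
```

For example, `eps_insensitive_loss(50.0, 51.0, 2.0)` is zero because the 1 km/h error stays inside the 2 km/h tube, whereas an error of 5 km/h would be charged 3.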

Multilinear Regression.
Compared with the above two supervised algorithms, the construction of multiple linear regression is simpler, and it belongs to the regression learning category. In MLR, the prediction values can be calculated by the following equation:
\[ y_{\mathrm{mlr}}(t) = \beta_0 + \sum_{i=1}^{k} \beta_i x_i(t - i), \tag{16} \]
where $y_{\mathrm{mlr}}(t)$ represents the prediction value at the $t$th period. The independent variable $x_i(t - i)$ is the speed at the previous $(t - i)$th period, $k$ is the number of historical data points considered in MLR, and $\beta_0$ and $\beta_i$ are the regression parameters, which can be optimized using training samples. The prediction values in the testing dataset are estimated from (16).
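A minimal sketch of the MLR prediction step follows, assuming the regression parameters have already been estimated from training samples. The function name `mlr_predict`, the coefficient values, and the sample speeds are hypothetical.

```python
def mlr_predict(history, beta0, betas):
    """Predict the next value from the last len(betas) observations.

    betas[0] multiplies the most recent observation (lag 1),
    betas[1] the one before it (lag 2), and so on.
    """
    k = len(betas)
    if k > len(history):
        raise ValueError("need at least k past observations")
    lags = history[::-1][:k]  # most recent observation first
    return beta0 + sum(b * x for b, x in zip(betas, lags))


# Hypothetical 2-minute average speeds (km/h) and made-up coefficients.
speeds = [58.0, 60.0, 61.0]
pred = mlr_predict(speeds, beta0=1.0, betas=[0.5, 0.3, 0.2])
```

With these made-up numbers the prediction is roughly 61.1 km/h, a weighted blend of the last three observations plus the intercept.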

Data Description
The travel speed data used in the study were collected on the 4th ring road in Beijing. The selected segment stretches from Dongfengbei Bridge to Zhaoyang Bridge, and its total length is approximately 2.74 km. This segment experiences significant traffic congestion during peak hours. The speed data were collected from three adjacent stations, which are shown in Figure 1. The speed values are lower in evening peak hours than at other locations, because traffic here is under high pressure and volumes are much higher in evening peak hours. The limitations of the data include erroneous samples and missing data. Inaccurate samples, for example, speed values higher than the speed limit or negative speed values, are removed from the original dataset. Furthermore, missing data can be attributed to many natural and man-made causes, for example, communication failures, malfunctioning devices, incorrect observations, or data transfer problems. To address these data collection shortcomings, a historical-average-based method has been implemented to impute missing and removed data, which ensures that the selected speed samples are appropriate for model validation and evaluation in this study.
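The cleaning and imputation steps can be sketched as follows. The paper does not spell out its historical-average method, so this is only one plausible reading: each invalid sample is replaced by the mean of the valid samples in the same time slot on other days. The function name, the `None` marker for missing values, and the speed limit value are assumptions.

```python
def clean_and_impute(days, speed_limit):
    """days: list of equal-length daily series; None marks a missing sample.

    Samples that are negative or above `speed_limit` are treated as erroneous.
    Each invalid sample is replaced by the mean of the valid samples observed
    in the same time slot on the other days (left as None if there are none).
    """
    def valid(v):
        return v is not None and speed_limit >= v >= 0.0

    n_slots = len(days[0])
    slot_means = []
    for s in range(n_slots):
        good = [day[s] for day in days if valid(day[s])]
        slot_means.append(sum(good) / len(good) if good else None)

    return [[v if valid(v) else slot_means[s] for s, v in enumerate(day)]
            for day in days]


# Hypothetical data: 3 days x 3 time slots, with an assumed 80 km/h limit.
raw = [[60.0, None, 55.0], [62.0, 40.0, -5.0], [58.0, 44.0, 130.0]]
cleaned = clean_and_impute(raw, speed_limit=80.0)
```

Here the missing value in the second slot becomes 42.0, and both erroneous values in the third slot become 55.0, the only valid observation for that slot.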

Models Comparison and Results Discussion
For ANN, Back Propagation Neural Network (BPNN) and NARXNN are selected as the candidate models in the comparison, and both have one hidden layer with 50 neurons. We use the neural network tool in MATLAB to optimize the parameters.
For SVM, we use the RBF and linear functions as kernel functions. Reference [31] provides a detailed introduction to the parameter optimization. The parameters of the ARIMA and VAR models are estimated using the maximum likelihood estimation available in the forecast and vars packages in R. The coefficients of the ST model are estimated using the minimum continuous ranked probability score (CRPS) estimation [34]. When forecasting future speed values, the best order of the ARIMA model is determined by the AIC values using the most recent 21 days of speed data. The VAR model is implemented using a maximal order of 10, and the best order of the VAR model is also selected based on the AIC values using the differenced speed data. For all the prediction algorithms, the data from the first 21 days are used for training the models and the data from the last 10 days are used for validating the models. Two performance measures, the mean absolute error (MAE) and the mean absolute percentage error (MAPE), are used as indicators to evaluate the multistep prediction performance of the different models. The unit of the MAE is km per hour. The equations for calculating MAE and MAPE are as follows:
\[ \mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| v_i - \hat{v}_i \right|, \qquad \mathrm{MAPE} = \frac{1}{n} \sum_{i=1}^{n} \frac{\left| v_i - \hat{v}_i \right|}{v_i} \times 100\%, \]
where $n$ is the number of observations, $v_i$ is the actual speed at time $i$ at the station, and $\hat{v}_i$ is the predicted speed. Furthermore, in order to further evaluate the performance of all models, both one-step and multistep ahead predictions (i.e., 3-step (6 minutes), 5-step (10 minutes), and 10-step (20 minutes)) are considered. Tables 1, 2, and 3 provide the MAE and MAPE values of the different models for different forecasting horizons. Note that, in Table 1, bold values indicate the smallest MAE and MAPE values. Figure 3 shows the prediction results of the models for one forecasting step compared with observed speed data over five days. The left column represents the prediction results of the machine learning methods and the right column represents the prediction results of the statistical models.
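The two performance measures can be computed as in the following sketch. The function names and the sample speeds are illustrative assumptions; MAPE is returned in percent and assumes that every observed speed is nonzero.

```python
def mae(actual, predicted):
    """Mean absolute error, in the same unit as the inputs (km/h here)."""
    if len(actual) != len(predicted):
        raise ValueError("series must have the same length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mape(actual, predicted):
    """Mean absolute percentage error in percent; actual values must be nonzero."""
    if len(actual) != len(predicted):
        raise ValueError("series must have the same length")
    ratios = [abs(a - p) / a for a, p in zip(actual, predicted)]
    return 100.0 * sum(ratios) / len(ratios)


# Hypothetical observed and predicted speeds (km/h).
observed = [50.0, 40.0]
forecast = [45.0, 44.0]
mae_value = mae(observed, forecast)
mape_value = mape(observed, forecast)
```

For these two samples the absolute errors are 5 and 4 km/h, so the MAE is 4.5 km/h, and both relative errors are 10%, so the MAPE is about 10%.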
For machine learning, we only show the comparison among BPNN, SVM-RBF, and MLR. For the statistical models, only the prediction performances of ARIMA and ST are compared. Figure 4 displays the correlation between observed values and predicted values from the five models (three machine learning models and two statistical models) over ten days, and $R^2$ represents the correlation coefficient used to evaluate the relevance between observed and predicted values. Figure 5 shows the frequency distribution of the prediction errors of the five models over ten days. The $x$-axis is the error (error = predicted value − observed value), and the $y$-axis indicates the frequency in different error ranges. The figure also reports the percentage of errors that fell within the range of ±10%, which is used to estimate the prediction strength of all models. Figure 6 represents the frequency distribution of the relative errors of the five models over ten days. The $x$-axis is the relative error (relative error = (predicted value − observed value)/observed value), and the $y$-axis indicates the frequency in different error ranges. Similarly, Figure 6 reports the percentage of relative errors that fell within the range of ±10% to estimate the performance of the models.
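The share of relative errors within ±10% can be computed as in the sketch below. The function name `share_within` and the sample values are assumptions made for illustration; observed speeds are assumed to be nonzero.

```python
def share_within(actual, predicted, tol=0.10):
    """Percentage of predictions whose relative error lies within +/- tol."""
    rel = [(p - a) / a for a, p in zip(actual, predicted)]
    hits = sum(1 for r in rel if tol >= abs(r))
    return 100.0 * hits / len(rel)


# Hypothetical observed and predicted speeds (km/h).
observed = [50.0, 40.0, 60.0, 80.0]
forecast = [54.0, 48.0, 60.0, 70.0]
rate = share_within(observed, forecast)
```

In this toy case the relative errors are 8%, 20%, 0%, and −12.5%, so two of the four predictions fall inside the band and the rate is 50%.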
From these four figures, we can draw several conclusions: BPNN produces better prediction results than the other models, with a higher $R^2$ and a larger share of errors within ±10%, and SVM-RBF outperforms ARIMA, because machine learning models have a complex structure and strong learning ability; by considering the correlation between spatial and temporal characteristics, ST also produces high prediction accuracy, similar to that of SVM-RBF; and MLR is inferior to ARIMA and SVM-RBF, with the lowest $R^2$ and the smallest share of errors within ±10%, because of its simple calculation structure.
Based on the values reported in the tables and the corresponding figures, several interesting findings can be obtained: (1) As expected, the prediction accuracy of speed deteriorates for all models as the number of prediction time steps increases. The results in Tables 1, 2, and 3 show that the MAE and MAPE values for 10-step ahead forecasting are significantly larger than those for 1-step ahead forecasting. The figures show similar results: as the prediction step increases, the predicted values of all models fluctuate more, and the accuracy and stability of multistep ahead prediction decrease compared with 1-step ahead prediction.
(2) When comparing the results of the machine learning and statistical models, BPNN, NARXNN, and SVM-RBF clearly outperform two traditional statistical models: ARIMA and VAR. The reason is that these three machine learning models have a complex structure and strong learning ability. By considering the correlation between spatial and temporal characteristics, ST also produces high prediction accuracy. SVM-LIN produces prediction results similar to those of ARIMA, and MLR is inferior to ARIMA and VAR. Although the prediction results above show that complex machine learning models achieve higher prediction accuracy than statistical methods, selecting a proper model in actual applications remains challenging. As Karlaftis and Vlahogianni [33] suggested, prediction accuracy is a very important indicator, but model simplicity and suitability should also be considered. Kirby et al. [35] stated that accuracy should not be the sole determinant for selecting the proper methodology for prediction; other issues should be considered in selecting the appropriate approach, such as the time and effort required for model development, the skills and expertise required, the transferability of the results, and the adaptability to changing behaviors [33,35]. Between classical statistical models and machine learning algorithms, researchers frequently prefer higher prediction accuracy over the explanatory power of the model. Karlaftis and Vlahogianni [33] summarized some criteria for model selection between NN and statistical models. Thus, we applied a case study to compare the prediction performance of machine learning and statistical models, and the conclusion of this study is that complex machine learning models have the advantages of nonlinear fitting ability and robustness to missing data but low explanatory power, whereas statistical methods can reach high prediction performance through their inherent explanatory structure.

Conclusions
This paper evaluated the multistep prediction performance of machine learning and statistical models using the speed data collected from three RTMS located on the 4th ring road in Beijing City. The data were collected from December 1, 2014, to December 31, 2014, at 2-minute intervals. In the performance comparison, we chose five machine learning methods (BPNN, NARXNN, SVM-RBF, SVM-LIN, and MLR) and three conventional statistical models (ARIMA, VAR, and ST). We first provide a brief introduction to each model. In the applications, we then optimize the model parameters using data collected in the first 21 days. Finally, we compare prediction accuracies of different models based

Figure 1: Three data collection stations in major ring road, Beijing.

Figure 2: Travel speed distribution of three collection locations in 7 days.

Figure 5: Frequency distribution of predicted errors for five steps ahead in three stations.

Table 1: Prediction accuracy of models for different forecasting steps ahead in station A.

Table 2: Prediction accuracy of models for different forecasting steps ahead in station B.

Table 3: Prediction accuracy of models for different forecasting steps ahead in station C.