Ensemble Learning for Short-Term Traffic Prediction Based on Gradient Boosting Machine

Short-term traffic prediction is vital for intelligent traffic systems and is influenced by neighboring traffic conditions. Gradient boosting decision trees (GBDT), an ensemble learning method, is proposed to make short-term traffic predictions based on the traffic volume data collected by loop detectors on the freeway. Each new simple decision tree is sequentially added and trained on the error of the whole preceding ensemble at each iteration. The relative importance of variables can be quantified during the training process of GBDT, indicating the interaction between the input variables and the response. The influence of neighboring traffic conditions on prediction performance is identified by combining the traffic volume data collected by different upstream and downstream detectors as the input, which can also improve prediction performance. The relative importance of the input variables differs across the 15 GBDT models, and the impact of the upstream traffic condition is not balanced with that of the downstream. The prediction accuracy of GBDT is generally higher than that of SVM and BPNN for different steps ahead, and the accuracy of multi-step-ahead models is lower than that of 1-step-ahead models. For 1-step-ahead models, the prediction errors of GBDT are smaller than those of SVM and BPNN for both peak and nonpeak hours.


Introduction
Massive traffic data have been constantly collected from a variety of sensors, such as inductive loop detectors, GPS-equipped vehicles, and mobile phones [1], promoting the development of data-driven intelligent transportation systems (ITS) [2]. Short-term traffic prediction is one of the most dynamic and typical research areas in ITS, aiming to estimate the traffic state in the near future (within a few minutes) based on historical traffic data [3, 4]. The predicted traffic information is especially useful for travelers to make better travel plans in the pretrip stage or to reschedule en route [5]. Accurate short-term traffic prediction is the first important step for real-time route guidance [6] and is quite critical in advanced travelers' information systems (ATIS) and advanced traffic management systems (ATMS) [7].
Traditional statistical approaches for short-term traffic prediction, such as ARIMA [8] and the Kalman filtering technique [9], take advantage of the significant temporal dependencies in the historical univariate time series of traffic variables. These methods usually assume a model structure beforehand and then estimate the model parameters from the historical data, offering good interpretability. However, their prediction accuracy is easily affected by unstable traffic conditions, such as those at peak hours [10].
Nonstationarity and nonlinearity are basic characteristics of traffic variables [11]. A variety of data-driven approaches have been applied to short-term traffic prediction, capturing the nonlinear relationships among the variables. Higher prediction accuracy can be achieved by these nonparametric machine learning (ML) methods, including the Back Propagation Neural Network (BPNN) [12, 13], the Support Vector Machine (SVM) [14, 15], and the k-nearest neighbor algorithm (KNN) [16]. These are supervised learning methods, so the target variables need to be prepared in the dataset beforehand, and they focus on learning the relationship between the response and the predictors [17]. The underlying information in massive traffic data can be efficiently captured by these ML methods, achieving good prediction performance but lacking interpretability [18].
Since freeway traffic conditions are independent of signalization, most short-term traffic prediction algorithms have been developed and verified on freeway traffic data [3]. In the past decades, most research focused on predicting traffic variables at one specific site of interest, considering only the effect of its own previous traffic information. In reality, the traffic prediction performance for a given site is considerably influenced by the neighboring traffic conditions. Spatial and temporal correlations have been taken into account when performing short-term traffic prediction [6, 19, 20]. The traffic condition at a specific site is closely related to the upstream and downstream traffic conditions. A multivariate traffic flow prediction model was constructed that improved prediction performance by incorporating the upstream traffic flow series as the transfer function input of ARIMA [21]. The influence of upstream and downstream traffic on the traffic condition of a given site is not symmetric [22]. The relationship between the current traffic speed at a given location and the past traffic speeds at upstream and downstream locations has been explored through cross-correlation analysis [10].
The information provided by the traffic variables of neighboring sites can be used to improve the traffic prediction performance for a given site [10]. In this study, based on freeway traffic data collected by loop detectors, the historical upstream and downstream traffic volumes are included among the input variables of the prediction models. Indeed, the traffic state variations at adjacent detectors are correlated. For many ML models, the effects of the input variables on the model output are difficult to interpret, and when redundant or irrelevant variables are added, the prediction performance may deteriorate.
In order to capture the complex nonlinearity of traffic variation and identify the importance of variables, the gradient boosting decision trees (GBDT) method, a tree-based ensemble learning method, is proposed to make short-term traffic predictions in this study. GBDT is a relatively new, robust, and accurate method in the machine learning field; it can handle different types of variables and identify the effects of upstream or downstream traffic on the traffic prediction for a given site, achieving excellent performance compared with classical methods. The main goal of this study is to identify the relative importance of the input variables and enhance the accuracy of short-term traffic prediction.
Ensemble learning is one of the most popular and promising machine learning approaches; it improves prediction performance by combining large numbers of weak base models [23]. The most commonly used ensemble techniques are boosting, bagging, and stacking. Unlike other ML methods, ensemble learning allows the interaction between the input variables and the prediction model to be interpreted and the relative importance of critical factors to be identified [24]. Tree-based ensemble methods, which combine multiple simple decision trees, have been applied to prediction and classification problems in the transportation field, including random forests, gradient boosting machines, and boosted regression trees. The prediction or classification output of such a model is a weighted sum or vote over the predictions of the base trees. A random forest embedded in the AdaBoost algorithm has been applied to estimate and predict traffic flow and congestion [25]. Stochastic gradient boosting has been used to identify crashes with superior classification performance [26]. The nonlinear relationships in traffic accident data and the main effects of crucial variables have been investigated with boosted regression trees [27].
Additionally, the tree-based models of the random forest algorithm, in the bagging framework, are trained independently by uniformly and randomly sampling with replacement from the original dataset, which strengthens robustness and allows training by parallel computing. At each splitting node of the base trees, features are randomly selected [28]. In marked contrast to the random forest, the base models of GBDT are trained sequentially, and each base model is added to correct the error produced by its preceding tree models. At each step, the samples mispredicted by the previous models are more likely to be selected as training data, producing more accurate predictions. Compared with a simple single tree model, GBDT is more stable, with better prediction performance and interpretability, because it combines the outputs of the base trees [24].
The main contribution of this study is the construction of short-term traffic flow prediction models based on the gradient boosting machine that account for the influence of the upstream and downstream traffic conditions simultaneously and achieve higher prediction accuracy than conventional machine learning methods. The GBDT algorithm provides a flexible framework that can adopt different combinations of the upstream and downstream historical traffic volumes as input variables; it can capture the complex nonlinearity of traffic, uncover hidden traffic patterns, and identify the relative importance of variables, while remaining interpretable. In addition, GBDT is resistant to outliers and performs well with partly erroneous data without cleaning [26].

Methodology
A single decision tree is a fast but unstable algorithm, easily affected by small perturbations in the training data [18], but its performance can be significantly improved by ensemble techniques [26]. The gradient boosting decision trees (GBDT) algorithm can be viewed as combining the strengths of boosting algorithms and decision trees. Friedman [29] proposed gradient boosting machines (GBM), based on a gradient descent formulation of boosting, suitable for both regression and classification problems. The boosting framework is essentially a constructive strategy of ensemble formation: at each iteration, a new weak base model is added and trained with respect to the error of the whole preceding ensemble, where each base learner needs to produce only a slightly lower error rate than random guessing [30].
The approximation accuracy and execution speed of gradient boosting can generally be improved by randomly subsampling the training data used to fit the base learner at each iteration, a variant called stochastic gradient boosting [31], which is employed for short-term traffic volume prediction in this study while simultaneously considering the influence of the upstream and downstream traffic. The output of the short-term traffic prediction model is the traffic volume at the given site at a future time step, and the input is the historical volume at the past 1, 2, or 3 time steps at the given site and its adjacent sites. Like other supervised learning methods, GBDT must be trained on a dataset with target labels, denoted $(\mathbf{x}, y)$, where $\mathbf{x} = (x_1, \ldots, x_n)$ are the input variables and $y$ is the corresponding response. To find the optimal combination of trees, the GBDT algorithm adopts a forward stagewise technique and minimizes the loss function by sequentially adding a new base learner (a single tree) to the expansion at each iteration, without adjusting the parameters of the trees that have already been added [23]. The loss function $\Psi(y, f(\mathbf{x}))$ incurred by using the estimated function $f(\mathbf{x})$ to predict $y$ on the training data defines the target of estimation:

$$f^{*} = \arg\min_{f} \, E_{y,\mathbf{x}}\, \Psi\bigl(y, f(\mathbf{x})\bigr).$$

For continuous response variables, the classical squared-error $L_2$ loss is employed in this prediction model, which leads to consecutive error-fitting during training:

$$\Psi\bigl(y, f(\mathbf{x})\bigr) = \frac{1}{2}\bigl(y - f(\mathbf{x})\bigr)^{2}.$$

In the boosting framework, after the algorithm has been repeated for $M$ iterations, the overall ensemble estimate $f_M(\mathbf{x})$ takes the additive form

$$f_M(\mathbf{x}) = f_0(\mathbf{x}) + \sum_{m=1}^{M} \rho_m h_m(\mathbf{x}),$$

where $f_0(\mathbf{x})$ is the initial guess and the terms $\rho_m h_m(\mathbf{x})$ ($m = 1, 2, \ldots, M$) are the function increments. Each new base learner is constructed to be maximally correlated with the negative gradient of the loss function [30]. At the $m$th iteration, the negative gradient is

$$g_m(\mathbf{x}_i) = -\left[\frac{\partial \Psi\bigl(y_i, f(\mathbf{x}_i)\bigr)}{\partial f(\mathbf{x}_i)}\right]_{f(\mathbf{x}) = f_{m-1}(\mathbf{x})},$$

which is the local direction in which $\Psi$ decreases most rapidly at $f(\mathbf{x}) = f_{m-1}(\mathbf{x})$. With $h_m(\mathbf{x})$ denoting the base learner model, the gradient descent step length $\rho_m$ is computed as

$$\rho_m = \arg\min_{\rho} \sum_{i=1}^{n} \Psi\bigl(y_i, f_{m-1}(\mathbf{x}_i) + \rho\, h_m(\mathbf{x}_i)\bigr).$$

At each step, adding a new base tree corrects the mistakes made by the previous base learners [18]. Thus, the current model is updated as

$$f_m(\mathbf{x}) = f_{m-1}(\mathbf{x}) + \rho_m h_m(\mathbf{x}).$$

In summary, the generic gradient boosting decision trees algorithm for regression is shown in Algorithm 1 ($f_0(\mathbf{x})$ is simply a single-terminal-node decision tree). In the process of gradient boosting, weighted resampling is carried out to put emphasis on observations that are more difficult to predict accurately. The weight of each observation is re-estimated each time a new regression tree is added: observations with lower prediction accuracy are assigned a higher weight. The sampling weights are updated at the end of each iteration, so observations predicted less accurately are sampled with higher probability at the next iteration [26].
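The stagewise procedure above can be sketched for the squared-error case. This is a minimal illustration, not the authors' implementation: it uses scikit-learn decision trees as base learners, a constant initial guess, and uniform subsampling without replacement in place of the weighted resampling described in the text.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gbdt_fit(X, y, n_trees=50, learning_rate=0.1, max_depth=3,
             subsample=0.5, seed=0):
    """Squared-error gradient boosting: each new tree h_m fits the
    negative gradient of the loss, which for L2 loss is the residual."""
    rng = np.random.default_rng(seed)
    f0 = y.mean()                          # initial guess f_0: a constant model
    pred = np.full(len(y), f0)
    trees = []
    for _ in range(n_trees):
        residual = y - pred                # negative gradient of (1/2)(y - f)^2
        idx = rng.choice(len(y), size=int(subsample * len(y)), replace=False)
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X[idx], residual[idx])
        pred += learning_rate * tree.predict(X)   # f_m = f_{m-1} + R * h_m
        trees.append(tree)
    return f0, trees

def gbdt_predict(f0, trees, X, learning_rate=0.1):
    pred = np.full(len(X), f0)
    for tree in trees:
        pred += learning_rate * tree.predict(X)
    return pred

# toy check: fit y = x^2 on [0, 1]
X = np.linspace(0, 1, 200).reshape(-1, 1)
y = X.ravel() ** 2
f0, trees = gbdt_fit(X, y)
train_mae = np.abs(gbdt_predict(f0, trees, X) - y).mean()
```

Because each tree is shrunk by the learning rate, many small corrections accumulate into an accurate additive model while no single tree dominates.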
The input variables are seldom equally relevant to prediction performance; usually only some of them have a substantial influence on the model output [32]. Breiman et al. [33] proposed a measure of relative variable importance for single decision tree models. The importance of variable $x_j$ in a single tree $T$, denoted $I_j^2(T)$, is based on the number of times the variable is selected for splitting in the tree, weighted by the squared improvement to the model resulting from each split [32]. For a tree-based ensemble method such as GBDT, the importance of variable $x_j$ is simply averaged over all $M$ trees:

$$I_j^2 = \frac{1}{M} \sum_{m=1}^{M} I_j^2(T_m).$$

The importances of all the input variables are further standardized so that they sum to 100%, which can also be used in feature selection procedures [30].
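In scikit-learn's gradient boosting implementation, for instance, these standardized importances are exposed directly. The sketch below uses synthetic data, not the study's dataset, and `feature_importances_` is normalized to sum to 1 rather than 100%.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))
# response depends strongly on x0, weakly on x1, and not at all on x2
y = 3.0 * X[:, 0] + 0.3 * X[:, 1] + rng.normal(scale=0.1, size=500)

model = GradientBoostingRegressor(n_estimators=100, max_depth=3,
                                  learning_rate=0.05, subsample=0.5,
                                  random_state=0).fit(X, y)
imp = model.feature_importances_   # normalized relative importances
```

The importances recover the constructed ordering: the strongly driving variable dominates, the weak one receives a small share, and the irrelevant one receives almost none.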

Data Description
The data used in this study were downloaded from the open-access traffic flow database of the Caltrans Performance Measurement System (PeMS) (http://pems.dot.ca.gov/). We collected the traffic volume data of 9 loop detectors located on State Route 22, Garden Grove, USA, from April 4 to June 5, 2016, spanning 9 weeks. The detailed locations of the selected road segment and the 9 detectors are shown in Figure 1. The traffic volume of the four lanes is aggregated into one time series, recorded every 5 minutes. The traffic volume data of the first eight weeks are used to train the GBDT-based traffic prediction models, while the last week serves as the testing set to assess the prediction accuracy of the models. Detector D (1202724) is the target detector for traffic prediction; Detectors U1 (1202738), U2 (1214972), U3 (1214987), and U4 (1215002) are the upstream detectors of D, while D1 (1202701), D2 (1214954), D3 (1214939), and D4 (1202676) are its downstream detectors. The length of the selected segment is 4552 m, with three exits and two entrances. The distance between adjacent detectors is shown in Figure 1(b). The traffic volume variation at the given site is closely related to the upstream and downstream traffic conditions. The traffic volume profiles of the 9 detectors on Wednesday, June 1, 2016, are shown in Figure 2. The basic statistics of the collected data for each detector are shown in Table 1; the traffic volume values of the 9 detectors are similar, and the small differences in the 7 statistical indicators are mainly generated by the traffic flow at the exits and entrances. The "25th," "50th," and "75th" columns are the 25th, 50th, and 75th percentiles of the observations when the traffic volume data are ranked in ascending order for each detector.
The response of the short-term traffic volume prediction models is the traffic volume of Detector D at time step t, denoted V_t, which is related to the previous historical traffic volumes of Detectors D, U1, U2, U3, U4, D1, D2, D3, and D4. All the possible input variables are as follows: V_{t-1}, V_{t-2}, and V_{t-3} are the traffic volumes of Detector D at time steps t-1, t-2, and t-3; U1_{t-1}, ..., U4_{t-3} are the traffic volumes of the 4 upstream detectors at time steps t-1, t-2, and t-3; and D1_{t-1}, ..., D4_{t-3} are the traffic volumes of the 4 downstream detectors at time steps t-1, t-2, and t-3. Lastly, considering that the traffic volume varies greatly across different periods of one day, the time of day is also included as an input variable, represented by the time step index Time. Each time step is 5 min, so there are 288 time steps in one day.
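A lagged input matrix of this kind can be assembled from the raw 5-minute series along the following lines. This is a sketch: the detector column names and toy values are illustrative, not the PeMS field names.

```python
import numpy as np
import pandas as pd

def make_lagged_features(df, target="D", lags=(1, 2, 3)):
    """Build predictors (volume at t-1, t-2, t-3) for every detector
    column in df, plus a time-of-day index, with the target detector's
    volume at time t as the response column y."""
    out = pd.DataFrame(index=df.index)
    for col in df.columns:
        for k in lags:
            out[f"{col}_t-{k}"] = df[col].shift(k)
    out["Time"] = np.arange(len(df)) % 288   # 288 five-minute steps per day
    out["y"] = df[target]
    return out.dropna()                      # drop rows without full lag history

# toy series for two detectors over two days
idx = pd.date_range("2016-04-04", periods=576, freq="5min")
raw = pd.DataFrame({"D": np.arange(576.0), "U1": np.arange(576.0) + 1},
                   index=idx)
feat = make_lagged_features(raw)
```

The first three rows are discarded because their three-step lag history is incomplete; every remaining row pairs the predictors at t-1, t-2, t-3 with the response at t.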

Experiments and Discussion
In this section, the experimental results of the short-term traffic prediction models based on GBDT are discussed in detail. The subsampling fraction is set to 0.5, meaning that 50% of the training observations are randomly selected to fit the next tree in the expansion at each iteration. Because of this randomness, similar but not identical fits are obtained when the same model is rerun; the prediction accuracy of each model is therefore reported as the average of 20 experimental runs, which fluctuate only slightly within a small range. The minimum number of observations in the tree terminal nodes is set to 10.
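These settings map directly onto, for example, scikit-learn's gradient boosting regressor. This is a sketch of one possible configuration; the paper does not state which implementation it used, and the synthetic fit below is only there to show the interface.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

model = GradientBoostingRegressor(
    subsample=0.5,          # stochastic boosting: 50% of rows per tree
    min_samples_leaf=10,    # minimum observations in a terminal node
    n_estimators=200,       # M, number of boosting iterations (trees)
    max_depth=5,            # J, depth of variable interactions
    learning_rate=0.05,     # R, shrinkage applied to each tree
    random_state=0,         # fix the subsampling for reproducibility
)

# tiny synthetic fit just to exercise the configuration
X = np.random.RandomState(0).rand(200, 4)
y = 2 * X[:, 0] + X[:, 1]
model.fit(X, y)
n_trees_fitted = model.estimators_.shape[0]
```

Fixing `random_state` removes the run-to-run fluctuation; averaging over 20 differently seeded runs, as in the paper, instead characterizes it.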

Parameter Optimization.
The performance of the GBDT algorithm varies with the parameter settings, including the number of trees M, the maximum depth of variable interactions J, and the learning rate R. In order to obtain the optimal prediction model, the effect of different parameter settings on the prediction performance is studied in this section. To isolate the influence of the parameter settings on the prediction performance, the input variables and data are kept the same across these experiments.
The prediction accuracy is measured by the mean absolute percentage error (MAPE) and the mean absolute error (MAE):

$$\mathrm{MAPE} = \frac{1}{n} \sum_{t=1}^{n} \frac{\bigl| V_t - \hat{V}_t \bigr|}{V_t}, \qquad \mathrm{MAE} = \frac{1}{n} \sum_{t=1}^{n} \bigl| V_t - \hat{V}_t \bigr|,$$

where $V_t$ and $\hat{V}_t$ are the real and predicted traffic volumes at time $t$ at the given site, respectively. The maximum depth of variable interactions J refers to the number of nodes in a tree, representing tree complexity; more complex variable interactions hidden in the data can be captured with a larger J. The number of trees M equals the number of iterations, that is, the number of base models in the additive expansion. With the other parameters fixed, the larger M is, the more complex the model is and the more computational time is required, which makes overfitting more likely and can produce poor performance on observations not included in the training dataset [18]. To prevent overfitting, the number of gradient boosting iterations needs to be controlled. In this study, 5-fold cross-validation is applied to check the prediction performance and to determine the optimal number of iterations. For example, with the parameter setting J = 3 and R = 0.05, the variation of MAPE and MAE with increasing M is shown in Figure 3; when M > 100, the errors fluctuate only slightly.
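The two error measures, and the selection of the iteration number M by monitoring held-out error after each boosting stage, can be sketched as follows. Synthetic positive-valued data stands in for the traffic volumes, and a single holdout split is used here for brevity where the study itself uses 5-fold cross-validation.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

def mape(y_true, y_pred):
    return float(np.mean(np.abs(y_true - y_pred) / y_true))

def mae(y_true, y_pred):
    return float(np.mean(np.abs(y_true - y_pred)))

# synthetic stand-in for (strictly positive) traffic volumes
rng = np.random.RandomState(0)
X = rng.rand(1000, 4)
y = 100 + 50 * X[:, 0] + 20 * X[:, 1] + rng.normal(0, 5, 1000)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

gbm = GradientBoostingRegressor(max_depth=3, learning_rate=0.05,
                                n_estimators=400, subsample=0.5,
                                random_state=0).fit(X_tr, y_tr)
# held-out error after each boosting iteration, to pick the optimal M
val_mae = [mae(y_te, p) for p in gbm.staged_predict(X_te)]
best_m = int(np.argmin(val_mae)) + 1
```

`staged_predict` yields the ensemble's prediction after each added tree, so the full error-versus-M curve of Figure 3 costs only one training run.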
In order to achieve better prediction performance, the ranges of J and R were set to 3 ≤ J ≤ 6 and 0.001 ≤ R ≤ 0.5 based on preliminary experiments. Figure 4 shows the influence of the variable interaction depth J and the learning rate R on the optimal number of iterations and the prediction errors. The complexity of the base trees is represented by J: for a given learning rate R, the higher J is, the more complex the model is and the fewer trees need to be added. Conversely, a larger number of iterations is preferable when a smaller J is set, in order to attain high prediction accuracy.
The contribution of each base model is scaled by the learning rate R. When R is set to a high value, the prediction errors drop to their lowest with fewer iterations, but these errors are significantly higher than those obtained with a smaller R. For example, when R = 0.5, the optimal number of iterations is less than 200, but MAPE is higher than 0.08 and MAE is higher than 18. More trees need to be added for smaller R settings, requiring more computational time. Overall, weighing computation against accuracy, R = 0.01 or 0.05 is more suitable for these traffic prediction models, producing better prediction performance with fewer iterations. Overfitting can be mitigated by setting a smaller R to restrict the contribution of each base tree. In addition, the optimal settings of the parameters M, J, and R vary with the training dataset, so the GBDT-based prediction models need to be retrained for other road segments.
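A sweep over J and R of the kind described above can be run with standard cross-validation tooling. This is a sketch on synthetic data; the grid values echo the ranges used in this section but are deliberately coarse to keep it quick.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.RandomState(0)
X = rng.rand(600, 4)
y = 100 + 40 * X[:, 0] * X[:, 1] + rng.normal(0, 3, 600)

grid = GridSearchCV(
    GradientBoostingRegressor(n_estimators=100, subsample=0.5, random_state=0),
    param_grid={"max_depth": [3, 6],                    # J
                "learning_rate": [0.001, 0.05, 0.5]},   # R
    scoring="neg_mean_absolute_error",
    cv=5,                     # 5-fold CV, as in the paper
).fit(X, y)
best = grid.best_params_      # best (J, R) pair under cross-validated MAE
```

Each (J, R) pair is scored by 5-fold cross-validated MAE with the tree count held fixed, mirroring the trade-off discussed above.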

Prediction Performance. GBDT provides a flexible framework that can adopt various combinations of different types of attributes as input variables for the prediction models.
First, 5 min (1-step) ahead short-term prediction models based on the GBDT algorithm are built to uncover the effects of the upstream and downstream traffic conditions on the prediction accuracy. The detailed information of the 15 models is shown in Table 2. To compare the prediction performance of the different models while balancing computation and accuracy, the parameter setting for all 15 models is J = 5 and R = 0.05. The input variables are different combinations of the historical traffic volumes of Detectors D, U1, U2, U3, U4, D1, D2, D3, and D4 at time steps t-1, t-2, and t-3, and the response is the traffic volume of Detector D at the next time step t. By comparing the prediction accuracy of the different models, the optimal variable combination for the freeway short-term traffic prediction model can be obtained.
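A comparison across input combinations of this kind can be emulated on synthetic series. Everything below is invented for illustration: a target series driven by its upstream neighbor with a one-step delay, a downstream series that lags the upstream one, and three hypothetical variable combinations standing in for the models of Table 2.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.RandomState(0)
n = 1500
up = 100 + 30 * np.sin(np.arange(n) / 10) + rng.normal(0, 2, n)
down = np.empty(n)                       # downstream lags upstream by 2 steps
down[2:] = up[:-2]
down[:2] = up[:2]
target = np.empty(n)                     # target follows upstream by 1 step
target[1:] = up[:-1] + rng.normal(0, 1, n - 1)
target[0] = up[0]

y = target[1:]                           # predict volume at t from lag-1 inputs
combos = {
    "own history only": target[:-1].reshape(-1, 1),
    "own + upstream":   np.column_stack([target[:-1], up[:-1]]),
    "own + downstream": np.column_stack([target[:-1], down[:-1]]),
}
split = 1200                             # chronological train/test split
maes = {}
for name, X in combos.items():
    m = GradientBoostingRegressor(max_depth=5, learning_rate=0.05,
                                  n_estimators=150, subsample=0.5,
                                  random_state=0).fit(X[:split], y[:split])
    maes[name] = float(np.mean(np.abs(y[split:] - m.predict(X[split:]))))
```

Because the synthetic target is constructed to follow its upstream neighbor, the combination that includes the upstream lag beats the purely temporal one, the same qualitative pattern the paper reports for Model 10.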
The prediction accuracy of the GBDT models is ranked in Table 2. The top three models are Model 10, Model 15, and Model 4, indicating that the upstream traffic condition has the more positive impact on the prediction accuracy of the GBDT models. In particular, MAPE and MAE reach their minimum with Model 10, which considers only the influence of the upstream historical traffic volume. Interestingly, the prediction accuracy of Models 11 and 14, which take only the downstream traffic volume as the input of GBDT, is the lowest. Generally, the prediction accuracy of the GBDT models is lower when more downstream traffic variables are used as input.
Furthermore, the prediction accuracy of short-term prediction for a given site is influenced by the upstream and downstream traffic conditions on the freeway. The GBDT models considering the neighboring traffic conditions tend to outperform the traditional, purely temporal prediction model (Model 1), and the prediction performance can be enhanced by adding the neighboring traffic information to the model input.

Relative Importance of Variables.
In the training process of the GBDT models, the number of times a variable is selected for splitting in the trees is summarized by its relative importance. The relative importance of each variable in Models 1-15 based on GBDT can be conveniently computed, identifying the effects of the input variables on the model output and prediction accuracy, as shown in Figure 5.
The contribution of the same variable to the performance of different models is diverse. For example, the relative importance of V_{t-1} in Model 1 is 76.7%, while in Model 9 it is 55.0%. The ranking of variable importance also varies greatly among the models; for example, the importance of D2_{t-1} ranks fourth in Model 8 and second in Model 9.
The immediately preceding traffic volume V_{t-1} of Detector D is the most important variable for all 15 GBDT models; that is, V_{t-1} is the variable most frequently selected to split the terminal nodes of the decision trees when training the GBDT models. This accords with the actual situation that the traffic state in the near future tends to be influenced by the traffic just observed [18]. The variable V_{t-2} of Detector D is the second most important input variable for Models 1, 2, 3, and 11, while U2_{t-1} is for Models 4, 7, 10, and 13, U1_{t-1} for Models 6, 8, and 12, D2_{t-1} for Models 5 and 9, and D4_{t-1} for Models 14 and 15. Moreover, when more variables of the upstream or downstream detectors are considered, the models rely less on their own historical temporal variables; for example, the importance of V_{t-1} in Model 15 is about 45%, much lower than in the other models.
As more neighboring traffic information is included, the importance of the upstream and downstream traffic variables in the GBDT models rises, and the prediction performance is enhanced at the same time. The importance of the upstream traffic condition for the traffic prediction accuracy at the given site is not equal to that of the downstream traffic. Considering both the prediction accuracy and the importance ranking of the variables, the historical traffic variables of the adjacent detectors should be added to short-term traffic prediction models.
From the temporal perspective, the importance of the traffic volumes of the 9 detectors at time steps t-2 and t-3 is lower than at time step t-1 in the GBDT models. The importance of the variable Time is significant for all 15 models, because the traffic volume of each detector varies greatly across the different periods of the day and the short-term fluctuations are irregular and complex. Therefore, the prediction models for peak and nonpeak hours are discussed in the following.

Multi-Step-Ahead Traffic Prediction Models.
The Support Vector Machine (SVM) and the Back Propagation Neural Network (BPNN) have been widely used in short-term traffic prediction on freeways; here they are trained for each combination of input variables in Table 2. The accuracy of the 5 min (1-step) ahead GBDT prediction models is compared with that of SVM and BPNN based on 20 repeated experiments for each model, as shown in Figure 6. The prediction errors of GBDT are significantly smaller than those of SVM and BPNN.
To assess the performance of the GBDT, SVM, and BPNN approaches over different prediction horizons, 10 min (2-step) and 15 min (3-step) ahead traffic prediction models are built and compared with the 5 min (1-step) ahead models. The accuracy of the 10 min and 15 min ahead prediction models is shown in Figures 7 and 8, respectively. The prediction accuracy clearly tends to be lower for the multi-step-ahead models than for the 1-step-ahead models; overall, the prediction errors of the 5 min ahead models are smaller than those of the 10 min and 15 min ahead models. From the perspective of prediction accuracy, the GBDT models perform better than the SVM and BPNN models in short-term traffic prediction at all three horizons.
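Building the multi-step-ahead training pairs only changes which future value is used as the label; a sketch of the construction (the helper name is ours, not the paper's):

```python
import numpy as np

def supervised_pairs(series, n_lags=3, horizon=1):
    """Turn a univariate series into (X, y): each row of X holds the
    n_lags most recent values, and y is the value `horizon` steps ahead
    (1 = 5 min, 2 = 10 min, 3 = 15 min at 5-minute aggregation)."""
    X, y = [], []
    for t in range(n_lags, len(series) - horizon + 1):
        X.append(series[t - n_lags:t])       # lags t-3, t-2, t-1
        y.append(series[t + horizon - 1])    # target at t + horizon - 1
    return np.array(X), np.array(y)

series = np.arange(20.0)                     # toy 5-minute series
X1, y1 = supervised_pairs(series, horizon=1)   # 5 min ahead
X3, y3 = supervised_pairs(series, horizon=3)   # 15 min ahead
```

The input side is identical for every horizon; only the label moves further into the future, which is why the longer horizons are harder to predict.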

The computational time of the 5 min, 10 min, and 15 min ahead traffic prediction models based on GBDT, SVM, and BPNN is shown in Figure 9. The GBDT algorithm costs more time than SVM because it needs to train a large number of decision trees. For the BPNN models, the computational time varies greatly with the input variables. Among all the models, the prediction errors reach their minimum with Model 10 (5 min ahead) based on GBDT, which takes the historical traffic volumes of D, U1, U2, and U3 as the input. The traffic volume of the 9th week at Detector D is then estimated with Model 10 (5 min ahead) for the GBDT, SVM, and BPNN methods, respectively. The predicted traffic volume is compared with the real observations over 288 × 7 = 2016 time steps, as shown in Figure 10.
The prediction accuracy of the GBDT, SVM, and BPNN (5 min ahead) models for peak hours (7:00-9:00, 17:00-19:00) and nonpeak hours (4:00-6:00, 21:00-23:00) is compared in Figures 11 and 12. Generally, the prediction errors of the GBDT models are lower than those of SVM and BPNN at both peak and nonpeak hours. Moreover, the MAPE of the prediction models at peak hours is lower than at nonpeak hours for the GBDT, SVM, and BPNN models, while the MAE shows the opposite pattern.
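Splitting the evaluation by time of day comes down to boolean masks over the timestamped error series. A sketch with placeholder errors (the hour windows are the ones stated above):

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2016-05-30", periods=288, freq="5min")   # one full day
err = pd.Series(np.ones(288), index=idx)    # placeholder absolute errors

hours = idx.hour
peak = ((hours >= 7) & (hours < 9)) | ((hours >= 17) & (hours < 19))
nonpeak = ((hours >= 4) & (hours < 6)) | ((hours >= 21) & (hours < 23))

peak_mae = err[peak].mean()                 # MAE restricted to peak hours
n_peak = int(peak.sum())                    # 4 peak hours -> 48 five-min steps
```

The same masks select the peak and nonpeak training subsets when separate models are fitted for the two regimes.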
The computational time of the GBDT, SVM, and BPNN models (5 min ahead) for peak and nonpeak hours is compared in Figure 13. On the whole, the GBDT algorithm costs more time than SVM and less than BPNN for both peak and nonpeak hours. The three prediction models generally cost less computational time for nonpeak hours than for peak hours. In addition, compared with traffic prediction models trained for the whole day, the prediction performance is significantly improved by training separate models for different periods of the day, such as peak and nonpeak hours.

Conclusions
This study indicates that the gradient boosting machine is suitable for short-term freeway traffic prediction, providing a flexible framework that can adopt different combinations of variables describing the neighboring traffic information. The performance of GBDT is influenced by the parameter settings; weighing computation against accuracy, the three main parameters M, J, and R are optimized to produce better prediction performance with fewer iterations while avoiding overfitting.
The GBDT models outperform the classical SVM and BPNN models in short-term traffic prediction. The prediction accuracy is affected by adding the upstream or downstream traffic information to the prediction models, and the highest accuracy is produced by Model 10 for the GBDT algorithm, which considers only the influence of the upstream traffic condition. The relative importance of the variables varies considerably across GBDT models with different variable combinations. The previous traffic volume at the same site, V_{t-1}, is the most important variable for the GBDT models, and the importance of the upstream traffic condition for the traffic prediction at the current site is not equal to that of the downstream traffic condition. From the temporal perspective, the importance of the traffic condition at time steps t-2 and t-3 is lower than at time step t-1.
Overall, GBDT performs better than the SVM and BPNN algorithms for the 5 min, 10 min, and 15 min ahead prediction models, and the prediction errors of the 5 min ahead models are smaller than those of the 10 min and 15 min ahead models. The prediction errors of the GBDT models are lower than those of SVM and BPNN at both peak and nonpeak hours.
In summary, superior prediction performance and model interpretability can be achieved by GBDT for short-term traffic prediction that simultaneously considers the neighboring traffic conditions. Short-term traffic prediction is of crucial importance for traffic management and route guidance at the road network level. Given the high efficiency and robustness of the GBDT algorithm, more spatial and temporal traffic information could be taken into account for accurate traffic prediction in a larger-scale road network in future work.

Figure 2: Traffic volume profile of 9 detectors for one day.